Running serial CASA in HTCondor seems to work without issue, but running mpicasa in HTCondor produces several problems, described below.
RuntimeError("Failed to create %s/.matplotlib
The error below is an example of what mpicasa produces when running in HTCondor and ${HOME}/.matplotlib isn't already populated. This happens, for example, when we set HOME=${HOME:=$TMPDIR} because we are running on a remote site or simulating one. We have to set HOME to something because mpicasa expects HOME to be set and HTCondor doesn't set it by default.
Attempting to get matplotlib to generate its own .matplotlib produced inconsistent results (the "File exists" OSError below suggests that the MPI ranks may race to create the directory, but this is not confirmed). So the solution was to add a populated .matplotlib directory to transfer_input_files for each DAG or job.
    471 raise RuntimeError("Failed to create %s/.matplotlib; consider setting MPLCONFIGDIR to a writable directory for matplotlib configuration data"%h)
    472
--> 473 os.mkdir(p)
    474
    475 return p
OSError: [Errno 17] File exists: '/lustre/aoc/admin/tmp/condor/nmpost071/executedir_130008/.matplotlib'
In [2]: Do you really want to exit ([y]/n)?
--------------------------------------------------------------------------
casa has exited due to process rank 1 with PID 130184 on
node nmpost071 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by casa (as reported here).
You can avoid this message by specifying -quiet on the casa command line.
--------------------------------------------------------------------------
SOLUTION: Transfer a populated .matplotlib to your job like so:
transfer_input_files = /users/krowe/.matplotlib
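On the job side, a wrapper script could apply the HOME fallback and confirm that the transferred directory actually arrived before starting mpicasa. The following is only a sketch. The function name prepare_home is hypothetical, and it assumes that TMPDIR points at the job's scratch directory, which is where files listed in transfer_input_files land.

```shell
#!/bin/sh
# Hypothetical wrapper fragment; prepare_home is an illustrative name,
# not part of CASA or HTCondor.
prepare_home() {
    # HTCondor does not set HOME by default, so fall back to TMPDIR
    # (assumed here to be the job's scratch directory).
    export HOME="${HOME:=$TMPDIR}"

    # The populated .matplotlib from transfer_input_files should already
    # be present; complain early instead of letting mpicasa fail later.
    if [ ! -d "${HOME}/.matplotlib" ]; then
        echo "warning: ${HOME}/.matplotlib is missing" >&2
        return 1
    fi
    return 0
}
```

A wrapper would call prepare_home before launching mpicasa and stop if it returns non-zero.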
Warning about /tmp on a networked filesystem
The warning about /tmp only appears when /tmp is a network filesystem (e.g. Lustre or NFS). Open MPI produces this warning because HTCondor bind mounts /tmp and /var/tmp to its scratch area and we currently use Lustre for the scratch areas.
--------------------------------------------------------------------------
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.
You may want to check what the typical temporary directory is on your
node. Possible sources of the location of this temporary directory
include the $TEMPDIR, $TEMP, and $TMP environment variables.
Note, too, that system administrators can set a list of filesystems
where Open MPI is disallowed from creating temporary files by setting
the MCA parameter "orte_no_session_dir".
Local host: nmpost071
Filename: /lustre/aoc/admin/tmp/condor/nmpost071/execute/dir_2906/openmpi-sessions-krowe@nmpost071_0/41160/1/shared_mem_pool.nmpost071
You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
disable this message.
--------------------------------------------------------------------------
SOLUTION: One way to remove this warning, and hopefully also prevent Open MPI from using a network filesystem for its shared memory backing file, is to do something like the following:
mkdir -p /dev/shm/mpicasa
export OMPI_MCA_orte_tmpdir_base=/dev/shm/mpicasa
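If several jobs can land on the same node, a fixed /dev/shm/mpicasa directory would be shared between them and never cleaned up. A possible variation, sketched below, gives each job its own directory and removes it afterwards. The helper names setup_ompi_tmpdir and cleanup_ompi_tmpdir are hypothetical, and this has not been tested with mpicasa.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Create a private directory under a local, memory-backed base
# (default /dev/shm) and point Open MPI's session directory at it.
setup_ompi_tmpdir() {
    base="${1:-/dev/shm}"
    dir=$(mktemp -d "${base}/mpicasa.XXXXXX") || return 1
    export OMPI_MCA_orte_tmpdir_base="$dir"
}

# Remove the directory again (assumption: nothing else still needs it).
cleanup_ompi_tmpdir() {
    if [ -n "${OMPI_MCA_orte_tmpdir_base:-}" ]; then
        rm -rf "$OMPI_MCA_orte_tmpdir_base"
    fi
}
```

In a wrapper script this might be used as setup_ompi_tmpdir && trap cleanup_ompi_tmpdir EXIT before mpicasa is started, so that the directory is removed when the wrapper exits.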
CFCache Creation
If you set cfcache='' then mpicasa will run for days or longer (I never let it finish naturally). Perhaps it is trying to create a cfcache but is deadlocked with itself.
Overall slowness
Questions
- Why doesn't the serial job fail with matplotlib errors because of a missing .matplotlib like the parallel case does? Does serial CASA not start matplotlib, or does it perhaps use a different version of matplotlib?
- Can I run my parallel DAGs at CHTC?