
The following is an example of the error mpicasa produces when running in HTCondor and ${HOME}/.matplotlib isn't already populated.  This happens, for example, when we set HOME=${HOME:=$TMPDIR} because we are running on a remote site or simulating one.  We have to set HOME to something because mpicasa expects HOME to be set and HTCondor doesn't set it by default.

Attempting to get matplotlib to generate its own .matplotlib produced inconsistent results.  So the solution was to add a populated .matplotlib directory to transfer_input_files for each DAG or job.

    471             raise RuntimeError("Failed to create %s/.matplotlib; consider setting MPLCONFIGDIR to a writable directory for matplotlib configuration data"%h)
    472
--> 473         os.mkdir(p)
    474
    475     return p
OSError: [Errno 17] File exists: '/lustre/aoc/admin/tmp/condor/nmpost071/executedir_130008/.matplotlib'
In [2]: Do you really want to exit ([y]/n)?
--------------------------------------------------------------------------
casa has exited due to process rank 1 with PID 130184 on
node nmpost071 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by casa (as reported here).

You can avoid this message by specifying -quiet on the casa command line.

--------------------------------------------------------------------------

SOLUTION: Transfer a populated .matplotlib directory to your job with a line like this in the submit description file:


transfer_input_files = /users/krowe/.matplotlib
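
Since the job only works when the transferred directory actually lands in ${HOME}, it may help to check for it in a wrapper script before starting mpicasa. The following is a minimal sketch, not part of the original setup: the helper name have_mpl_config is hypothetical, and it assumes HOME falls back to TMPDIR in the same way as described earlier.

```shell
#!/bin/sh
# Sketch of a pre-flight check for a job wrapper script.
# have_mpl_config is a hypothetical helper name, not part of CASA or HTCondor.

# Return 0 if <home>/.matplotlib exists and contains at least one entry.
have_mpl_config() {
    dir="$1/.matplotlib"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# HTCondor does not set HOME by default, so fall back to TMPDIR.
HOME=${HOME:=$TMPDIR}
export HOME

if have_mpl_config "$HOME"; then
    echo "found populated ${HOME}/.matplotlib"
else
    echo "warning: ${HOME}/.matplotlib is missing or empty;" \
         "add a populated copy to transfer_input_files" >&2
fi
```

The check only reports the problem. Whether the wrapper should abort or continue when the directory is missing is left open here.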







Warning about /tmp on a networked filesystem

The warning about /tmp only appears when /tmp is on a network filesystem (e.g. Lustre or NFS).  Open MPI produces this warning, and it happens because HTCondor bind mounts /tmp and /var/tmp for its jobs to its scratch area and we currently use Lustre for the scratch areas.

--------------------------------------------------------------------------
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.

You may want to check what the typical temporary directory is on your
node.  Possible sources of the location of this temporary directory
include the $TEMPDIR, $TEMP, and $TMP environment variables.

Note, too, that system administrators can set a list of filesystems
where Open MPI is disallowed from creating temporary files by setting
the MCA parameter "orte_no_session_dir".

  Local host: nmpost071
  Filename:   /lustre/aoc/admin/tmp/condor/nmpost071/execute/dir_2906/openmpi-sessions-krowe@nmpost071_0/41160/1/shared_mem_pool.nmpost071

You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
disable this message.
--------------------------------------------------------------------------


SOLUTION: One way to remove this warning, and hopefully also prevent Open MPI from using a network filesystem for its shared memory backup file, is to do something like the following:

mkdir -p /dev/shm/mpicasa

export OMPI_MCA_orte_tmpdir_base=/dev/shm/mpicasa
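
A slightly more defensive variant of the two commands above is sketched below. It only sets OMPI_MCA_orte_tmpdir_base when the directory could actually be created, so a node without a writable /dev/shm falls back to Open MPI's default behaviour. The helper name make_session_dir is hypothetical, and this is an untested-on-cluster sketch rather than the setup actually in use.

```shell
#!/bin/sh
# Sketch: point Open MPI's session directory at local shared memory,
# but only if that location is usable.
# make_session_dir is a hypothetical helper name.

# Create <base>/<name> and print its path; fail if <base> is not a writable directory.
make_session_dir() {
    base="$1"
    name="$2"
    [ -d "$base" ] && [ -w "$base" ] || return 1
    mkdir -p "$base/$name" || return 1
    printf '%s\n' "$base/$name"
}

if dir=$(make_session_dir /dev/shm mpicasa); then
    export OMPI_MCA_orte_tmpdir_base="$dir"
else
    echo "warning: /dev/shm is not writable; leaving Open MPI's tmpdir unchanged" >&2
fi

# Alternative that only silences the warning without moving the file,
# using the MCA parameter named in the warning text itself:
# export OMPI_MCA_shmem_mmap_enable_nfs_warning=0
```

Moving the session directory addresses the cause, while the commented-out alternative only hides the message, so the first is probably preferable where /dev/shm is available.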

PSF Creation

2020-06-30 15:44:46     SEVERE  tclean::task_tclean::@nmpost071:MPIClient       Exception from task_tclean : No images named VIP_iter0 found on disk. No partial images found either.Multi-term SumWt does not exist. Please create PSFs or Residuals.

...



CFCache Creation

There is a known CASA bug where setting cfcache='' causes one part of CASA to create a cfcache with a name like imagename_base.cf, and another part of CASA to look for the cfcache as cfcache.cf, or something along those lines.  Either way, never set cfcache=''.  Either set it to an existing cfcache or to some directory that doesn't exist yet, like cfcache="cachedir".

I don't know why this only seems to be a problem with mpicasa and not serial CASA.  Perhaps it causes some race condition.




Questions

  • Why doesn't the serial job fail with matplotlib errors because of a missing .matplotlib like the parallel case does?  Does CASA not start matplotlib, or does it perhaps start a different version of matplotlib?
  • Can I run my parallel DAGs at CHTC?