
Running CASA in HTCondor seems to work without issue, but running mpicasa in HTCondor produces several problems.


RuntimeError("Failed to create %s/.matplotlib

The following is an example of the error mpicasa produces when running in HTCondor and ${HOME}/.matplotlib isn't already populated. This happens when, for example, we set HOME=${HOME:=$TMPDIR} because we are running on a remote site or simulating one. We have to set HOME to something because mpicasa expects it and HTCondor doesn't set HOME by default. The "File exists" OSError on os.mkdir suggests that more than one MPI rank tries to create the directory at the same time.

    471             raise RuntimeError("Failed to create %s/.matplotlib; consider setting MPLCONFIGDIR to a writable directory for matplotlib configuration data"%h)
    472
--> 473         os.mkdir(p)
    474
    475     return p

OSError: [Errno 17] File exists: '/lustre/aoc/admin/tmp/condor/nmpost071/executedir_130008/.matplotlib'

In [2]: Do you really want to exit ([y]/n)?
--------------------------------------------------------------------------
casa has exited due to process rank 1 with PID 130184 on
node nmpost071 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by casa (as reported here).

You can avoid this message by specifying -quiet on the casa command line.

--------------------------------------------------------------------------
[nmpost071:130145] 3 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
[nmpost071:130145] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
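One possible workaround, sketched here and not verified against mpicasa, is to have the job's wrapper script create the matplotlib configuration directory once, before any MPI rank starts, so that no rank has to call os.mkdir itself. MPLCONFIGDIR is the variable named in the error message above. The function name prepare_mpl_env and the /tmp fallback are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set HOME and a matplotlib config directory
# before mpicasa starts, so the MPI ranks find the directory already there.
prepare_mpl_env() {
    # Same fallback as in the text: use TMPDIR when HOME is unset or empty.
    # Falling back to /tmp when TMPDIR is also unset is an assumption.
    : "${HOME:=${TMPDIR:-/tmp}}"
    export HOME

    # MPLCONFIGDIR is the variable suggested by the matplotlib error message.
    MPLCONFIGDIR="${HOME}/.matplotlib"
    export MPLCONFIGDIR

    # mkdir -p succeeds whether or not the directory already exists.
    mkdir -p "$MPLCONFIGDIR"
}

prepare_mpl_env
# ... launch mpicasa here, exactly as the job does today ...
```

Because the directory is created in a single process before the parallel launch, the ranks should only ever see an existing directory. Whether this removes the error in practice still needs to be confirmed on a real job.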









Warning about /tmp on a networked filesystem

Open MPI produces this warning only when /tmp is on a network filesystem (Lustre or NFS).

WARNING: Open MPI will create a shared memory backing file in a directory that appears to be mounted on a network filesystem.
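To see in advance whether a node is likely to trigger this warning, the job can check the filesystem type of its temporary directory. The following is a minimal sketch that assumes GNU stat (Linux). The helper name is_network_fs_type and its list of type names are assumptions, and Open MPI's own detection may differ.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the filesystem type name looks like a
# network filesystem. The list of names is an assumption, not Open MPI's.
is_network_fs_type() {
    case "$1" in
        nfs*|lustre) return 0 ;;
        *) return 1 ;;
    esac
}

# GNU stat prints the filesystem type name of a path with -f -c %T.
dir="${TMPDIR:-/tmp}"
fstype=$(stat -f -c %T "$dir" 2>/dev/null)
if is_network_fs_type "$fstype"; then
    echo "$dir is on $fstype; Open MPI may print the warning"
else
    echo "$dir is on ${fstype:-an unknown filesystem}"
fi
```

The sketch checks TMPDIR before /tmp because in the example above the job's execute directory is on Lustre.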






Overall slowness


