...

  • Transfer mechanism: The documentation implies that only files with an mtime newer than when the transfer_input_files transfer finished will be transferred back to the submit host.  While running a DAG, the files in my working directory (which is listed in both transfer_input_files and transfer_output_files) always seem to have an mtime around the most recent step in the DAG, suggesting that the entire working directory is copied from the execution host to the submit host at the end of each DAG step.  Perhaps the transfer mechanism only looks at the mtime of the files/directories named in transfer_output_files and does not descend into directories.
    • Subdirectories are treated differently
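    • A minimal submit-file sketch of the pattern described above (file and directory names are hypothetical); our working assumption is that a directory named in transfer_output_files is transferred back whole, without a per-file mtime check:

          # hypothetical submit description
          executable              = run_step.sh
          should_transfer_files   = YES
          when_to_transfer_output = ON_EXIT
          # our understanding: "working_dir" (no trailing slash) transfers the directory itself,
          # while "working_dir/" would transfer only its contents
          transfer_input_files    = working_dir
          transfer_output_files   = working_dir
          queue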

Answered Questions:

  • JOB ID question from Daniel
    • When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:

       - Are job IDs guaranteed to be unique in HTCondor?
       - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?

    • A Job ID (ClusterID.ProcID) is unique only within a single schedd.
    • A globally unique identifier can be formed from the DNS name of the schedd plus the ctime of the job_queue.log file.
    • We should talk with Daniel about this.  They should craft their own ID; it could be seeded with a Job ID but should not depend on that alone.
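    • As one illustration of crafting such an ID, the job's GlobalJobId attribute (which we understand to combine the schedd name, the ClusterId.ProcId, and the queue time) can be printed with condor_q; treat the attribute and its format as an assumption to verify, not something CHTC prescribed:

          # hypothetical query; -af (autoformat) prints the named attributes
          condor_q 1234.0 -af GlobalJobId ClusterId ProcId
          # expected output shaped like: submit.example.org#1234.0#1580000000 1234 0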
  • Upgrading HTCondor without killing jobs?
    • The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
    • Currently, restarting the execute-side services will kill jobs.  CHTC is working on improving this.
    • The negotiator and collector can be restarted without killing jobs.
    • CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and 8.9.x with 8.9.y.
  • Leaving data on execution host between jobs (data reuse)
    • Todd is working on this now.
  • Ask about installation of CASA locally and ancillary data (cfcache)
    • CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
    • There is another software filesystem, mostly used for administrative purposes, where CASA could live; it might be available to us.
    • We could download the tarball each time over HTTP.  CHTC uses a proxy server so it would often be cached.
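    • If we go the HTTP route, URLs can be listed directly in transfer_input_files and fetched on the execute node by the stock curl transfer plugin (URL and file names below are hypothetical), with CHTC's proxy providing the caching:

          # hypothetical submit-file fragment
          transfer_input_files  = http://data.example.org/casa/casa-6.tar.gz, run_casa.sh
          should_transfer_files = YES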
  • Environment: Is there a way to have HTCondor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
    • A good analogy: Torque does a su - _username_ while HTCondor just does a su _username_.
    • WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node, e.g. if the cluster node runs a different OS or architecture.
    • ANSWER: HTCondor doesn't execute things with a shell.  You could set your executable to /bin/bash and pass the executable you used to have as the arguments.  I just changed our stuff to statically set $HOME, and I think that is good enough.
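    • A sketch of that workaround, combining getenv, a statically set HOME, and a bash wrapper (paths and script names are hypothetical):

          # hypothetical submit-file fragment
          executable  = /bin/bash
          arguments   = "run_pipeline.sh"
          getenv      = True
          environment = "HOME=/users/vlapipe"
          queue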

...

  • Flocking: When we flock to CHTC, what is the data path for transfer_input_files?  Is it between our submit host and CHTC's execution host, or is CHTC's submit host involved?
    • Dataflow is from our schedd (submit host) to their execute host, but CCB will reverse the connection.  Their execution hosts are publicly addressable, but that may not be necessary.
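    • For reference, flocking from our side is configured roughly like this on our schedd host (host name is a placeholder); CCB is what lets their execute hosts connect back to our schedd:

          # hypothetical condor_config fragment on our submit host
          FLOCK_TO = cm.chtc.wisc.edu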

  • How can we control the data path for transfer_input_files to our clients given multiple networks?  Currently we assume it will use the 1Gb link, but we have IB links.  Is there a way for HTCondor to use the IB link just for transferring files?  Is that hostname-based?  Other ideas?
    • CHTC doesn't have a good solution for this.
    • We could upgrade from 1Gb to 10Gb
    • We could use the IB names for everything (problematic for submit hosts that don't have IB); see the sketch after this list
    • We could skip the transfer mechanism and use something else, like scp
    • We could use a custom transfer plugin
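    • One way to realize the "IB names for everything" option is HTCondor's NETWORK_INTERFACE setting, which pins a host's daemons to one address; note this affects all daemon traffic, not just file transfer (the address below is a placeholder):

          # hypothetical condor_config fragment on an execute host with IB
          NETWORK_INTERFACE = 10.10.1.5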

  • Are there known issues with distributed scratch via NFS or Lustre, w.r.t. tmpdir or otherwise, e.g. OpenMPI complaining about tmpdir being on a network filesystem?
    • Some problems with log files on the submit host, but they are rare.
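    • A common mitigation (our assumption, not something CHTC stated) is to point TMPDIR at the node-local HTCondor scratch directory in the job wrapper before launching MPI (program name is hypothetical):

          # hypothetical fragment of a job wrapper script
          export TMPDIR="$_CONDOR_SCRATCH_DIR"   # node-local scratch provided by HTCondor
          mpirun -np 8 ./my_mpi_program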
  • Any general best practices to support MPI, in terms of ClassAds or otherwise?
    • Use the shared memory transport for security
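    • For single-node Open MPI runs, one way to do this is restricting the transports to loopback and shared memory (BTL component names vary by Open MPI version; program name is hypothetical):

          mpirun --mca btl self,vader -np 8 ./my_mpi_program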

  • Is there a way DAGMan can be told to ignore errors?  In some cases we want a DAG to mindlessly continue rather than retry.
    • A node is considered successful based on the exit status of its POST script.  If there isn't a POST script, success is based on the exit status of the job.
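    • Following that logic, a POST script that always exits 0 makes DAGMan treat the node as successful and continue, while RETRY is the knob to use when retries are wanted (file names are hypothetical):

          # hypothetical fragment of a .dag file
          JOB    stepA stepA.sub
          SCRIPT POST stepA /bin/true   # node succeeds regardless of the job's exit code
          # RETRY stepA 3               # use RETRY instead when retrying is preferred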