...

  • What limits are there to transfer_input_files?  I would sometimes get "Failed to transfer files" when the number of files was around 10,000.
  • Is there a way to generate the dag.dot file without having to submit the job?
    • The -no_submit option doesn't create the .dot file
    • Is adding NOOP to all the JOB commands the right thing to do?  The DAG still gets submitted but then quickly ends.
  • Is there a way to start a DAG at a given point?  E.g. if there are 5 steps in the DAG, can you start the job at step 3?
    • Is the answer again to add NOOP to the JOB commands you don't want to run?  (See the DAG sketch after this list.)
  • I see that at CHTC, jobs now require a request_disk setting.  How does one launch interactive jobs?
  • For our initial tests, we want to flock jobs to CHTC that transfer about 60GB input and output.  Eventually we will reduce this significantly but for now what can we do?
    • Globus or rsync to move data?  If Globus, how to do so in an automated way (E.g. no password)?
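
A minimal DAG-file sketch of the DOT and NOOP ideas above.  The node and file names (step1.sub, etc.) are hypothetical; the point is only to show where the keywords go.

    # example.dag (hypothetical)
    DOT dag.dot                   # ask DAGMan to write a Graphviz description of the DAG to dag.dot
    JOB step1 step1.sub NOOP      # NOOP: the node succeeds immediately without running step1.sub
    JOB step2 step2.sub NOOP
    JOB step3 step3.sub           # leave NOOP off the steps you actually want to run
    PARENT step1 CHILD step2
    PARENT step2 CHILD step3

The DAG still has to be submitted with condor_submit_dag, but with NOOP on every JOB line it should finish almost immediately.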



Answered Questions:

  • JOB ID question from Daniel
    • When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:

       - Are job IDs guaranteed to be unique in HTCondor?
       - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?

    • A Job ID (ClusterID.ProcID) is unique only within a single schedd.
    • To tell jobs apart globally, also use the DNS name of the schedd and the ctime of the job_queued.log file.
    • We should talk with Daniel about this.  They should craft their own ID.  It could be seeded with a Job ID but should not depend on just it.
  • Upgrading HTCondor without killing jobs?
    • The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
    • Currently, restarting the execute services will kill jobs.  CHTC is working on improving this.
    • The negotiator and collector can be restarted without killing jobs.
    • CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
  • Leaving data on execution host between jobs (data reuse)
    • Todd is working on this now.
  • Ask about installation of CASA locally and ancillary data (cfcache)
    • CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
    • There is another software filesystem, used mostly for admin purposes, where CASA could live; it might be available to us.
    • We could download the tarball each time over HTTP.  CHTC uses a proxy server so it would often be cached.
  • Environment:  Is there a way to have HTCondor do a "login" when a job starts, thus sourcing /etc/profile and the user's rc files?  Currently, not even $HOME is set.
    • A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
    • WORKAROUND: setting getenv = True which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node. Perhaps the cluster node is a different OS or architecture.
    • ANSWER: HTCondor doesn't execute things with a shell.  You could set your executable to /bin/bash and then have the arguments be the executable you used to have (see the sketch below).  I just changed our stuff to statically set $HOME and I think that is good enough.
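
A minimal submit-file sketch of the /bin/bash wrapper idea above, assuming the real program is a script called run_job.sh and that HOME should point at /home/username (both names are hypothetical):

    executable  = /bin/bash
    arguments   = "run_job.sh"            # the program you previously listed as the executable
    transfer_input_files = run_job.sh
    getenv      = True                    # inherit the submit-side environment (like qsub -V)
    environment = "HOME=/home/username"   # one way to set $HOME statically
    queue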

...

  • Does the transfer mechanism accept any sort of regular expression?  E.g. transfer_input_files=*.txt
    • No

  • Can the transfer mechanism accept manifest files?  E.g. a file that is a list of files?
    • Use include : <some file> in the submit description file, where <some file> contains the full transfer_input_files line.
    • Use queue FILES from manifest, which defines the submit variable $(FILES); that can then be used as transfer_input_files = $(FILES) (see the sketch below).
    • Perhaps a plugin
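
A sketch of the two submit-file approaches above; the file names transfer_list.inc and manifest are hypothetical:

    # Option 1: keep the (possibly very long) transfer_input_files line in a separate file
    include : transfer_list.inc          # transfer_list.inc holds the full "transfer_input_files = ..." line

    # Option 2: queue one job per line of a manifest file
    transfer_input_files = $(FILES)      # $(FILES) expands to the current line of the manifest
    queue FILES from manifest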

  • What options are there other than holding a job?  I find myself not noticing, sometimes for hours, that a job is on hold.  Is there a way to make jobs fail instead of getting held?  I assume others will make this mistake like me.
    • I see I can set periodic_remove = (JobStatus == 5), but HTCondor doesn't seem to think that is an error, so if I have notification = Error I don't get any email (see the sketch below).
    • Greg will look into adding a Hold option to notification
    • The HTCondor idea behind held jobs is that you have submitted a large DAG of jobs, one step is missing a file, and you would like to put that file in place and continue the job instead of having the whole DAG fail and be resubmitted.  This makes sense, but it would be nice to be notified when a job gets held.
    • Greg wrote "notification = error in the submit file is supposed to send email when the job is held by the system, but there's a bug now where it doesn't.  I'll fix this."
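
A sketch of the submit-file lines discussed in this item:

    notification    = Error              # intended to mail when the system holds a job (see the bug Greg mentions above)
    periodic_remove = (JobStatus == 5)   # JobStatus 5 means Held; remove such jobs instead of leaving them on hold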