...

  • Can we get an increase in quota for /software/nu_jrobnett?  The quota appears to be 4 GB, which is not enough for two versions of our software package (it's close).
    • CHTC will increase our quota
  • How can we have the .dag.* files written to a different directory?  -usedagdir doesn't help.
    • There isn't a way to tell condor_submit_dag where to put the logs
  • How can I set a variable in a DAG file that I can then use in the submit file in a conditional?  None of the following seem to work:
    • DAG:
      • VARS step01 CHTC=""
      • VARS step05 CHTC="True"
    • Submit:
      • if defined $(CHTC)
        • requirements = PoolName == "CHTC"
      • endif
    • or
    • DAG:
      • #VARS step01 CHTC="True"
      • VARS step05 CHTC="True"
    • Submit:
      • if defined $(CHTC)
        • requirements = PoolName == "CHTC"
      • endif
    • or
    • DAG:
      • VARS step01 CHTC="False"
      • VARS step05 CHTC="True"
    • Submit:
      • chtc_var = $(CHTC)
      • if $(chtc_var)
        • requirements = PoolName == "CHTC"
      • endif
    • even though when I pass $(chtc_var) as arguments to the shell script, the shell script sees it as True.
    • or
    • DAG:
      • VARS node1 file="chtc.htc"
      • VARS node2 file="aws.htc"
    • Submit:
      • include : $(file)
10/20/20 08:54:36 From submit: ERROR: on Line 9 of submit file:
10/20/20 08:54:36 From submit: Submit:-1:Error "", Line 0, Include Depth 1: can't open file
10/20/20 08:54:36 From submit:
10/20/20 08:54:36 From submit: ERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 failed while reading from pipe.
10/20/20 08:54:36 Read so far: Submitting job(s)ERROR: on Line 9 of submit file: Submit:-1:Error "", Line 0, Include Depth 1: can't open fileERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 ERROR: submit attempt failed
    • Yet I can use a variable defined in a DAG for things like arguments and request_memory.
    • I can also use file = $CHOICE(myindex, chtc.htc, aws.htc), where myindex is defined in the DAG; it sets $(file) to the file I want to include, but again if I use include : $(file) I get an error:
10/20/20 11:58:58 From submit: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 failed while reading from pipe.
10/20/20 11:58:58 Read so far: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 ERROR: submit attempt failed

    • Perhaps use requirements instead.  Greg will send an example; a sketch of the idea follows below.
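    • In the meantime, one possible sketch (untested; WantCHTC is a made-up DAG variable name): skip the submit-file conditional entirely and let the DAG variable expand inside the requirements expression itself.
    • DAG:
      • VARS step01 WantCHTC="False"
      • VARS step05 WantCHTC="True"
    • Submit:
      • requirements = ($(WantCHTC) == False) || (PoolName == "CHTC")
    • When WantCHTC is False the first clause is always true, so the job can match anywhere; when it is True the job only matches machines advertising PoolName == "CHTC".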

...

  • Is there a recommended way to start annexes from a DAG?  We have been using PRE scripts, but sometimes that seems to fail.
    • CHTC is working on a BEGIN syntax (provision) that will block a DAG node from starting until the annex is ready.
    • We could have the PRE script not return until the annex is ready (see the sketch after this list).
    • We could also have the job require a specific name that create_annex creates.
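    • A rough DAG sketch of the blocking-PRE-script idea (step01.sub and start_annex.sh are hypothetical; the script would run condor_annex and not exit until the annex reports ready):
      • # DAGMan will not submit step01 until the PRE script exits successfully
      • JOB step01 step01.sub
      • SCRIPT PRE step01 start_annex.sh casa-annex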
  • If HTCondor 8.9.9 requires Globus from EPEL, then it may be hard to install on a Globus endpoint, because the EPEL version of Globus conflicts with the Globus.org version.

Answered Questions:

  • JOB ID question from Daniel
      • When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:

         - Are job IDs guaranteed to be unique in HTCondor?
         - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?

      • A Job ID (ClusterID.ProcID) is only unique within a single schedd.
      • A globally unique identifier can be built by combining it with the DNS name of the schedd and the ctime of the job_queue.log file.
      • We should talk with Daniel about this.  They should craft their own ID; it could be seeded with a Job ID but should not depend on the Job ID alone.
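      • For illustration only (not discussed at the meeting): HTCondor's GlobalJobId attribute already combines the schedd name, the ClusterID.ProcID, and a timestamp, which is close to what such a crafted ID needs.  The hostname and numbers below are made up:
        • condor_q -af GlobalJobId
        • submit.example.org#6785.0#1603203411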
    • Upgrading HTCondor without killing jobs?
      • The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout (see the note below).
      • Currently, restarting the execute-side services will kill jobs.  CHTC is working on improving this.
      • The negotiator and collector can be restarted without killing jobs.
      • CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
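      • If the timeout in question is the job lease (an assumption worth confirming with CHTC), it can be lengthened per job in the submit file, e.g. to two hours:
        • job_lease_duration = 7200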
    • Leaving data on execution host between jobs (data reuse)
      • Todd is working on this now.
    • Ask about installation of CASA locally and ancillary data (cfcache)
      • CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
      • There is another software filesystem, used mostly for administrative purposes, where CASA could live; it might be available to us.
      • We could download the tarball each time over HTTP.  CHTC uses a proxy server so it would often be cached.
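      • For the HTTP option, the submit-file side is just a URL in the input list (the URL below is a placeholder); HTCondor fetches URL inputs on the execute host, which is what lets the proxy cache them:
        • transfer_input_files = http://example.org/casa-6.1.tar.gz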
    • Environment:  Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files?  Currently, not even $HOME is set.
      • A good analogy: Torque does a su - username, while HTCondor just does a su username.
      • WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help.  It doesn't source rc files but does inherit your current environment.  This may be a problem if your current environment is not what you want on the cluster node, for example if the cluster node runs a different OS or architecture.
      • ANSWER: condor doesn't execute things with a shell.  You could set your executable as /bin/bash and then have the arguments be the executable you used to have.  I just changed our stuff to statically set $HOME and I think that is good enough.  (A submit-file sketch is below.)
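      • A sketch of that workaround as a submit file (myjob.sh and the HOME path are placeholders; the -l flag makes bash a login shell, so it also sources /etc/profile and the rc files, which is what the original question asked about):
        • executable = /bin/bash
        • arguments = "-l ./myjob.sh"
        • transfer_input_files = myjob.sh
        • environment = "HOME=/users/someuser"
        • queue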

    ...

    • Condor_annex bug: edit /usr/libexec/condor/condor-annex-ec2 and comment out the line chkconfig condor || exit 1; this line is a hold-over from older versions that put condor in init.d, and now that condor is in systemd the line causes condor to exit.
      • SOLUTION: Greg submitted ticket on this.
    • Debugging held jobs.  I had thought that setting when_to_transfer_output = ON_EXIT_OR_EVICT  would copy the scratch area back to the submit machine so files there could be inspected.  But that doesn't seem to happen for me.
    • Memory issue: Greg did find a bug deep in the code that may cause jobs to be killed because of memory issues. HTCondor occasionally gets a short read when looking at the process table via /proc, and then something like 2/3 of the processes are missing.
      • SOLUTION: CHTC will work towards a solution.


    • How can we have the .dag.* files written to a different directory?  -usedagdir doesn't help.
      • ANSWER: There isn't a way to tell condor_submit_dag where to put the logs
    • Is there a for-loop structure available to DAG scripts or a range mechanic?
      • No
    • If HTCondor 8.9.9 requires Globus from EPEL, then it may be hard to install on a Globus endpoint, because the EPEL version of Globus conflicts with the Globus.org version.
      • I told them about it. I have not tried installing HTCondor-8.9.9 so I am only guessing it will be a problem.