  • Given a DAG where some steps must always run at AOC and others must always run at CHTC, how can we dictate this? Right now local jobs flock to CHTC if local resources are full. Can we make local jobs idle instead of flocking?
  • We are starting to run tens of jobs at CHTC requiring 40GB of memory as part of a local DAG. Are there any options we can set to improve their chances of executing? What memory footprint (32, 20, 16, or 8GB) would significantly improve their chances?
  • How can I find out what hosts are available for given requirements (LongJobs, memory, staging)?
    • condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
  • It looks to me like most hosts at CHTC are set up to run LongJobs. The following shows a small list of about 20 hosts, so I assume all others can run long jobs. Is that correct?
    • condor_status -compact -constraint "CanRunLongJobs is Undefined"
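A minimal sketch of how the requirement queries above could be assembled programmatically. It only builds the command line; it does not contact a pool. The helper name `build_condor_status_cmd` is hypothetical, and combining the requirements into a single ClassAd expression with `&&` is an assumption made here for clarity, not something taken from the commands above, which pass several separate -constraint options.

```python
import shlex


def build_condor_status_cmd(constraints):
    """Return an argv list for condor_status with all constraints ANDed.

    Hypothetical helper: each constraint is parenthesized and joined with
    '&&' so the result is one ClassAd expression passed to -constraint.
    """
    if not constraints:
        raise ValueError("at least one constraint is required")
    expression = " && ".join(f"({c})" for c in constraints)
    return ["condor_status", "-compact", "-constraint", expression]


# The three requirements taken from the query in the list above.
cmd = build_condor_status_cmd([
    "HasChtcStaging == true",
    "DetectedMemory > 500000",
    "CanRunLongJobs isnt Undefined",
])

# Print a shell-quoted command line that can be pasted into a terminal
# on a submit host where condor_status is available.
print(shlex.join(cmd))
```

The argv list could also be handed to `subprocess.run(cmd)` on a host where condor_status is on the PATH. Changing the memory threshold in the list is a simple way to compare how many hosts match at different footprints.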


  • We get the following when running condor_annex -aws-region us-west-2 -setup ~/.condor/publicKeyFile, even after removing all the CloudFormation stacks whose names begin with HTCondorAnnex:

    Checking security configuration... OK.
    Checking for configuration bucket... missing.
    Missing configuration bucket. Please log into your AWS account and delete each
    CloudFormation stack whose name starts with 'HTCondorAnnex-', then re-run
    'condor_annex -setup'.


  • Do you have any examples of how to launch instances in the spot market with condor_annex? I have read the docs and am still lost.
  • How can we set AWS Tags with condor_annex?  We'd like this to track jobs and set billing tags.
  • Is port 9618 needed for flocking or just for condor_annex?  We would like to close that port on our flocking host (testpost-serv-1) if possible.
  • Are there bugs in the condor.log output of a DAG node? For example, I have a condor.log file that clearly shows the job taking about three hours to run, yet at the bottom it lists a user time of 13 hours and a system time of 1 hour.  https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
    • As for the CPU usage report, there could very well be a bug, but first: is your job multi-threaded or multi-process? If so, the CPU usage will be the aggregate across all CPU cores.

    • Yes, they are all parallel jobs to some extent, so I accept your answer for that job. But I have another job that took 21 hours of wallclock time, and yet the condor.log shows 55 minutes of user time and 5:34 hours of system time.  https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
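The arithmetic behind the two cases above can be sketched as follows. The calculation divides total CPU time (user plus system) by wallclock time to get the average number of cores kept busy. The function names are hypothetical, and the inputs are the rounded figures quoted in this thread, not values parsed from the log files.

```python
def hours(h=0, m=0):
    """Convert hours and minutes to fractional hours."""
    return h + m / 60


def average_busy_cores(user_h, system_h, wall_h):
    """Average number of CPU cores kept busy over the job's wallclock time."""
    if wall_h <= 0:
        raise ValueError("wallclock time must be positive")
    return (user_h + system_h) / wall_h


# step07: about 3 h wallclock, 13 h user, 1 h system (figures quoted above).
step07 = average_busy_cores(hours(13), hours(1), hours(3))

# step05: 21 h wallclock, 55 min user, 5:34 h system (figures quoted above).
step05 = average_busy_cores(hours(0, 55), hours(5, 34), hours(21))

print(f"step07: {step07:.2f} cores busy on average")
print(f"step05: {step05:.2f} cores busy on average")
```

A ratio well above 1, as for step07, is what aggregated usage across several cores would look like. A ratio well below 1, as for step05, means the reported CPU time covers only a fraction of the wallclock time. The arithmetic alone cannot tell whether that reflects a job that was mostly waiting or a reporting problem.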
