...

  • How can we prevent jobs from flocking?  Right now, jobs flock if local resources are full.  Can I make jobs sit idle instead of flocking?
  • Is port 9618 needed for flocking or just for condor_annex?  We would like to close that port on our flocking host (testpost-serv-1) if possible.

  • We are starting to run tens of jobs in CHTC requiring 40GB.  What options/knobs do we need to set?
  • How can I find out which hosts are available for given requirements (LongJobs, memory, staging)?
    • condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
  • It looks to me like most hosts at CHTC are set up to run LongJobs.  The following query, which selects hosts where CanRunLongJobs is undefined, shows a small list of about 20 hosts.  Is that correct?
    • condor_status -compact -constraint "CanRunLongJobs is Undefined"
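
The three-constraint host query can also be written as one ClassAd expression joined with `&&`, so the result does not depend on how condor_status combines repeated -constraint flags. The following is a minimal sketch that only builds the command line; the helper name `build_condor_status_cmd` is hypothetical, and the attribute names are taken from the queries in these notes.

```python
import shlex

def build_condor_status_cmd(constraints):
    """Join ClassAd sub-expressions with && and return a condor_status argv list.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Parenthesize each sub-expression so operator precedence cannot change its meaning.
    expr = " && ".join(f"({c})" for c in constraints)
    return ["condor_status", "-compact", "-constraint", expr]

cmd = build_condor_status_cmd([
    "HasChtcStaging==true",
    "DetectedMemory>500000",
    "CanRunLongJobs isnt Undefined",
])

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd)` on a host where condor_status is installed.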


  • I get the following output when running condor_annex -aws-region us-west-2 -setup ~/.condor/publicKeyFile, even after removing all the CloudFormation stacks whose names begin with HTCondorAnnex:

    Checking security configuration... OK.
    Checking for configuration bucket... missing.
    Missing configuration bucket. Please log into your AWS account and delete each
    CloudFormation stack whose name starts with 'HTCondorAnnex-', then re-run
    'condor_annex -setup'.


  • Are there bugs in the condor.log output of a DAG node?  For example, I have a condor.log file that clearly shows the job taking about three hours to run yet at the bottom lists user time of 13 hours and system time of 1 hour.  https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
    • As for the CPU usage report, there could very well be a bug, but first: is your job multi-threaded or multi-process?  If so, the CPU usage will be the aggregate across all CPU cores.

    • Yes, they are all parallel jobs to some extent, so I accept your answer for that job.  But I have another job that took 21 hours of wallclock time, yet its condor.log shows 55 minutes of user time and 5:34 hours of system time.  https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
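
To make the comparison concrete, the ratio of total CPU time (user plus system) to wallclock time gives the average number of cores the job kept busy. The sketch below uses the approximate figures quoted in the two questions above, not values parsed from the log files; the function name `avg_busy_cores` is hypothetical.

```python
def avg_busy_cores(user_s, sys_s, wall_s):
    """Return (user + system CPU seconds) / wallclock seconds."""
    if wall_s <= 0:
        raise ValueError("wallclock time must be positive")
    return (user_s + sys_s) / wall_s

HOUR = 3600
MINUTE = 60

# step07: about 3 h wallclock, 13 h user, 1 h system (figures quoted in the notes).
step07 = avg_busy_cores(13 * HOUR, 1 * HOUR, 3 * HOUR)

# step05: 21 h wallclock, 55 min user, 5:34 h system (figures quoted in the notes).
step05 = avg_busy_cores(55 * MINUTE, 5 * HOUR + 34 * MINUTE, 21 * HOUR)

print(f"step07: {step07:.2f} cores busy on average")
print(f"step05: {step05:.2f} cores busy on average")
```

A ratio above 1, as for step07, is consistent with CPU time being summed across several cores. A ratio well below 1, as for step05, would mean the job spent most of its wallclock time off the CPU (for example waiting on I/O), which may or may not point to a reporting bug.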


...