...
- How can we prevent jobs from flocking? Right now, jobs flock when local resources are full. Can I make jobs idle instead of flocking?
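One hedged, admin-side sketch, assuming you can edit the submit host's configuration: the schedd only flocks to pools listed in FLOCK_TO, so clearing that setting keeps unmatched jobs idle in the local queue. A per-job opt-out (for example, a submit-file attribute such as +WantFlocking) only helps if the pool's flocking setup actually checks such an attribute, so treat that part as an assumption to verify with the pool admins.

    # condor_config.local on the submit host: with no flock targets, jobs that
    # cannot match locally simply stay idle instead of flocking elsewhere
    FLOCK_TO =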
I have a machine with an externally accessible, non-NATed address (146.88.10.48) and an internal, non-routable address (10.64.1.226). I want to install condor_annex on this machine so that I can submit jobs to AWS from it. I don't necessarily need to submit jobs to local execute hosts from this machine. Should I make this machine a central manager, a submit host, both, or does it matter?
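For comparison, a hedged sketch of what the configuration might look like if this machine plays both roles (central manager plus submit host); DAEMON_LIST and NETWORK_INTERFACE are standard knobs, but advertising the external 146.88.10.48 address is an assumption based on the annex instances in AWS needing to reach the daemons from outside.

    # condor_config.local -- run both the central-manager and submit-side daemons here
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
    # bind/advertise the externally reachable address so AWS instances can connect back
    NETWORK_INTERFACE = 146.88.10.48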
The last time I configured condor_annex I was using an older version of HTCondor (8.8.3, I think) and used a pool password for security. Now I am using 8.9.7. Is there a newer/better security method I should use?
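If it is useful, a hedged sketch of the token (IDTOKENS) mechanism that the 8.9 series introduced as the successor to pool passwords; whether condor_annex in 8.9.7 accepts tokens end-to-end is worth confirming in the annex documentation, and the identity and file path below are illustrative only.

    # condor_config.local -- accept token authentication alongside the existing methods
    SEC_DEFAULT_AUTHENTICATION_METHODS = FS, IDTOKENS
    # on the central manager, mint a token and install it where the submit side can read it
    condor_token_create -identity condor@pool > /etc/condor/tokens.d/condor@pool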
- How can I find out what hosts are available for given requirements (LongJobs, memory, staging)?
- condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
- It looks to me like most hosts at CHTC are set up to run long jobs; the query below returns only a small list of about 20 hosts where CanRunLongJobs is undefined. Is that correct? (A quick way to count the matches is sketched after the command.)
- condor_status -compact -constraint "CanRunLongJobs is Undefined"
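To double-check the "about 20 hosts" impression, one option is to have condor_status print only totals, or to count machine names directly; both variants below reuse the same constraint and rely only on standard condor_status options.

    condor_status -constraint "CanRunLongJobs is Undefined" -total
    condor_status -constraint "CanRunLongJobs is Undefined" -af Machine | sort -u | wc -l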
- Are there bugs in the condor.log output of a DAG node? For example, I have a condor.log file that clearly shows the job taking about three hours to run, yet at the bottom it lists a user time of 13 hours and a system time of 1 hour. https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
And as for the CPU usage report, there could very well be a bug, but first: is your job multi-threaded or multi-process? If so, the reported CPU usage will be the aggregate across all CPU cores.
- Yes, they are all parallel jobs to some extent, so I accept that answer for that job. But I have another job that took 21 hours of wall-clock time, and yet the condor.log shows 55 minutes of user time and 5:34 hours of system time. https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
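One way to cross-check what the userlog reports, assuming the job has already left the queue, is to pull the accounting attributes straight from condor_history; RemoteUserCpu and RemoteSysCpu are in seconds and are summed over all cores the job used, so dividing by RequestCpus gives a rough per-core figure. The cluster.proc below is a placeholder for the actual job id.

    # replace 1234.0 with the cluster.proc of the step05.py job
    condor_history 1234.0 -af RemoteWallClockTime RemoteUserCpu RemoteSysCpu RequestCpus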
...