...
- Given a DAG where some steps must always run at the AOC and others must always run at CHTC, how can we dictate this? Right now local jobs flock to CHTC when local resources are full. Can we make local jobs idle instead of flocking?
- ANSWER: Use PoolNames. I need to make a testpost PoolName.
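- A minimal submit-file sketch of that approach (the PoolName value "TESTPOST" is made up here; check what the local pool actually advertises with condor_status -af PoolName):
```
# Pin this DAG node to the local pool. Machines that don't advertise a
# matching PoolName (e.g. CHTC's) can never match, so when local slots
# are full the job sits Idle instead of flocking.
universe     = vanilla
executable   = step.sh
requirements = (PoolName == "TESTPOST")
queue
```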
- We are starting to run tens of jobs at CHTC requiring 40GB of memory as part of a local DAG. Are there any options we can set to improve their chances of execution? Would a smaller memory footprint (32, 20, 16, or 8GB) significantly improve their chances? Ask Lauren.
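- One way to gauge the odds empirically (Memory is the slot's currently unclaimed memory in MB, so 40 GB ≈ 40960):
```
# Count partitionable slots that could fit each request size right now.
condor_status -constraint 'PartitionableSlot && Memory >= 40960' -af Machine | wc -l
condor_status -constraint 'PartitionableSlot && Memory >= 32768' -af Machine | wc -l
condor_status -constraint 'PartitionableSlot && Memory >= 16384' -af Machine | wc -l
```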
- How can I find out what hosts are available for given requirements (LongJobs, memory, staging)?
- condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
- ANSWER: Yes, this is correct, but it doesn't show what other jobs are waiting on the same resources, which is fine.
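- If we ever do want a rough view of the competition, something like this sketch should work against any schedd we can query (JobStatus == 1 means Idle; RequestMemory is in MB):
```
# Idle jobs requesting >= 40 GB, i.e. jobs likely waiting on the same slots.
condor_q -allusers -constraint 'JobStatus == 1 && RequestMemory >= 40960' \
    -af Owner ClusterId RequestMemory
```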
- It looks to me like most hosts at CHTC are set up to run LongJobs. The following shows a small list of about 20 hosts, so I assume all others can run long jobs. Is that correct?
- condor_status -compact -constraint "CanRunLongJobs is Undefined"
- LongJobs is for jobs running something like 72 hours or more, so it might be best not to set it unless we really need it, e.g. for step23.
- ANSWER: Yes, this is correct, but it doesn't show what other jobs are waiting on the same resources, which is fine.
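- For a step like step23 that might exceed 72 hours, a hedged submit-file sketch (the +LongJob attribute follows CHTC's convention as I understand it; confirm the exact name with CHTC):
```
# Only for nodes that really need more than the default runtime limit.
+LongJob = true
requirements = (CanRunLongJobs =?= true)
```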
- Do you have any examples of how to launch instances on the spot market with condor_annex? I have read the docs and am still lost.
- James knows how to make this JSON blob.
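- As far as I can tell from the condor_annex manual, Spot Fleet mode takes a Spot Fleet request config file, which is that JSON blob; a sketch with made-up names:
```
# Ask for 10 slots on the spot market; the JSON can be generated with the
# AWS console's Spot Fleet wizard and trimmed down.
condor_annex -annex-name nrao-spot-test \
    -slots 10 \
    -aws-spot-fleet-config-file ./spot-fleet-config.json
```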
- How can we set AWS tags with condor_annex? We'd like to use them to track jobs and for billing.
- Possibly with Launch Templates.
- Possibly by passing aws-user-data to condor_annex.
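- A hypothetical sketch of the user-data idea: have each instance tag itself at boot, then hand the script to condor_annex via its aws-user-data options. Assumes the annex instance profile is allowed to call ec2:CreateTags; tag keys and values are placeholders:
```
#!/bin/bash
# Tag this instance so jobs and billing can be tracked per annex.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
# Region is the availability zone minus its trailing letter.
aws ec2 create-tags --region "${AZ%?}" --resources "$INSTANCE_ID" \
    --tags Key=Project,Value=local-dag Key=BilledTo,Value=our-group
```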
- Is port 9618 needed for flocking or just for condor_annex?
- ANSWER: Greg thinks yes: 9618 is needed for both flocking and condor_annex.
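- Quick sanity check from the machine that needs to connect (cm.example.org stands in for the actual central manager host):
```
nc -vz cm.example.org 9618
```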
- Are there bugs in the condor.log output of a DAG node? For example, I have a condor.log file that clearly shows the job taking about three hours to run, yet at the bottom it lists a user time of 13 hours and a system time of 1 hour. https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
- ANSWER: As for the CPU usage report, there could very well be a bug, but first: is your job multi-threaded or multi-process? If so, the CPU usage will be the aggregate across all CPU cores (e.g., a job that keeps four or more cores busy for about three hours of wallclock can legitimately report 13 hours of user time).
- Yes, they are all parallel jobs to some extent, so I accept your answer for that job. But I have another job that took 21 hours of wallclock time, yet its condor.log shows 55 minutes of user time and 5:34 hours of system time. https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
- ANSWER: If you look again, the user time is actually 6 days and 55 minutes; I missed the 6 in there.
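- For reference, those usage lines in the termination event are formatted as `Usr <days> <hh:mm:ss>, Sys <days> <hh:mm:ss>`, so this job's line reads as 6 days plus 55 minutes of user time (seconds here are illustrative):
```
	Usr 6 00:55:00, Sys 0 05:34:00  -  Total Remote Usage
```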
...