...
- What are the clever solutions for submitting N different DAG jobs, each with a different set of parameters? For example, given a directory layout like:
- T10t34
- J220200-003000
- bin, working, data
- J220600-003000
- bin, working, data
- ...
- J220200-003000
- T10t35
- J170743-393000
- bin, working, data
- J171241-383000
- bin, working, data
- ...
- J170743-393000
- ANSWERS:
- INCLUDE syntax for DAGs
- include syntax for submit files
- make a template of files
- use a PRE script that populates things
- usedagdir (the -usedagdir option to condor_submit_dag); see the sketch below
- T10t34
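A minimal sketch of the template-plus-VARS approach, assuming one shared submit file (here called template.sub) and per-job directories laid out as above; node names, macro names, and paths are illustrative:

```
# One DAG node per field; DIR points each node at its own directory.
JOB  J220200-003000  template.sub  DIR T10t34/J220200-003000
VARS J220200-003000  field="J220200-003000"

JOB  J220600-003000  template.sub  DIR T10t34/J220600-003000
VARS J220600-003000  field="J220600-003000"

# template.sub refers to the per-node value as $(field), e.g.:
#   arguments = --field $(field)
```

If each T10t3x directory instead carries its own DAG file, condor_submit_dag -usedagdir runs each DAG from the directory its file lives in, so relative paths inside the DAGs keep working.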
- It seems that when using DAGs the recommended method is to define variables in the DAG script instead of in the submit scripts. This makes sense, since it means only one file, the DAG script, needs to be edited to make changes. But is there a way to get variables into the DAG script from the command line, the environment, an include file, or something similar?
- ANSWER: There is an INCLUDE syntax, but there is no command-line or environment-variable way to get variables into a DAG.
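A minimal sketch of the two include mechanisms mentioned above (file names are illustrative):

```
# In the DAG file: the named file is parsed as if its lines appeared here.
INCLUDE settings.dag

# In a submit file: pull shared commands in from another file.
include : settings.sub
```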
- We are starting to run tens of jobs in CHTC requiring 40 GB of memory as part of a local DAG. Are there any options we can set to improve their chances of executing? What memory footprint (32, 20, 16, or 8 GB) would significantly improve their chances?
- ANSWER: only use +LongJobs if the job needs more than 72 hours, which is the default "walltime".
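A minimal submit-file sketch, assuming the +LongJobs attribute name given in the answer above is what CHTC expects:

```
# Ask only for the memory the job actually needs so it can match more slots.
request_memory = 40GB

# Only set this if the job really needs more than the default 72-hour walltime.
+LongJobs = true
```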
- How can we set AWS tags with condor_annex? We'd like to use them to track jobs and set billing tags.
- Launch Templates didn't work.
- Use aws-user-data options to condor_annex?
- I have tried all sorts of user-data and default-user-data-file options. On-demand apparently no longer works, and I was never able to get anything working with spot fleet. I think all things user-data are non-functional.
- I tried setting a tag in the role defined in config.json (aws-ec2-spot-fleet-tagging-role) but that tag didn't translate to the instance.
- What about self-tagging? We could give the job credentials to run the AWS command line, as sketched below.
- This should work, but for some reason things like 'wget -qO- http://instance-data/latest/meta-data/instance-id' return nothing:
- returns nothing when logged in as nobody (condor_ssh_to_job)
- returns nothing when logged in as centos (ssh -i ~/.ssh/...)
- returns the instance ID when logged in as root (ssh as centos, then sudo su)
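A minimal self-tagging sketch, assuming the job (or an instance boot script) has working AWS credentials and that the metadata queries actually return data, which per the notes above currently only seems to work as root; the tag key and value are placeholders:

```
# Discover this instance's ID and region from the EC2 metadata service.
INSTANCE_ID=$(wget -qO- http://instance-data/latest/meta-data/instance-id)
REGION=$(wget -qO- http://instance-data/latest/meta-data/placement/availability-zone | sed 's/.$//')

# Tag the instance so billing reports can group on it.
aws ec2 create-tags --region "$REGION" \
    --resources "$INSTANCE_ID" \
    --tags Key=Project,Value=vla-pipeline
```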
Answered Questions:
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they: are they _globally_ unique, or just unique within a particular namespace (such as our cluster or the submit node)?
- ANSWER: A Job ID (ClusterID.ProcID) is only unique within a single schedd.
- To make it globally unique, combine it with the DNS name of the schedd and the ctime of the job_queue.log file.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on just it.
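One existing handle that already combines the schedd name, the ClusterID.ProcID, and a submit timestamp is the job's GlobalJobId attribute; a quick way to look at it (host name and numbers below are made up):

```
condor_q -af GlobalJobId
# submit.example.org#1234.0#1581702316
```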
- Upgrading HTCondor without killing jobs?
- schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- currently, restarting the execute-side services will kill jobs. CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available on many of their execution hosts (notably the larger ones).
- There is another software filesystem where CASA could live; it is mostly used for admin purposes but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
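A minimal sketch of the HTTP route, assuming HTCondor's file transfer fetches the tarball by URL on the execute host (so CHTC's proxy can cache it); the URL and file name are illustrative:

```
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = http://www.example.org/software/casa-release.tar.gz
```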
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node, for example if the cluster node runs a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then pass the executable you used to have as an argument. I just changed our stuff to statically set $HOME and I think that is good enough.
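Hedged submit-file sketches of the two workarounds above (paths and script names are illustrative):

```
# Workaround 1: inherit the submit-side environment (like qsub -V).
getenv = True

# Workaround 2: run the real payload under bash and set $HOME explicitly
# ("/bin/bash -l script.sh" would additionally source the login files).
executable  = /bin/bash
arguments   = "run_casa.sh"
environment = "HOME=/lustre/home/someuser"
```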
...