...
Exit Code 108 = cannot connect to the condor_startd, or request refused
...
In progress
Docs wrong for evaluating ClassAds?
...
ANSWER: The first asterisk shouldn't be there. This is a regex, not globbing. Greg will look into updating this document.
HTCondor and Slurm
NRAO has effectively two use cases: 1) Operations-triggered jobs. These are well-formulated pipeline jobs; they're still fairly monolithic and long running (many hours to a few days). 2) User-triggered jobs, which are of course not well formulated. We will be moving the operations jobs to HTCondor. We plan to move the user-triggered jobs from Torque to Slurm. There's enough noise in the two job loads that we don't want strict host carve-outs for type 1 and type 2 jobs. What we anticipate doing is having a set of nodes known only to HTCondor for the bulk of operations and a set of hosts controlled by Slurm for the user-facing jobs. Periodically, when there is a large set of operations jobs, we'd like them to burst into the Slurm-controlled nodes. We neither anticipate nor want the Slurm jobs to burst into the HTCondor set of nodes.
Say we have two clusters (HTCondor and Slurm) and both can be submitted to from the same host. We want the HTCondor jobs to use the Slurm cluster resources when the HTCondor cluster resources are full, but we probably don't want to support preemption. How could we have HTCondor submit jobs to a Slurm cluster? (HTCondor-C, flocking, overlapping, batch-grid-type, HTCondor-CE, etc)
ANSWER: write our own 'factory' that watches HTCondor and, when it is full, submits Pilot jobs to Slurm that launch startd daemons, thus allowing the Payload jobs waiting in HTCondor to run. We will want to set the startd to exit after being idle for a little while, run the Pilot job as root, and figure out how to do cgroups properly.
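A rough sketch of what such a Pilot job could look like, assuming a pre-staged HTCondor install visible to the Slurm nodes and using the STARTD_NOCLAIM_SHUTDOWN knob so the glidein exits after sitting idle. All paths, hostnames, and timeouts below are hypothetical.
#!/bin/sh
#SBATCH --job-name=condor-pilot
#SBATCH --time=2-00:00:00
#SBATCH --exclusive
# Point at a pilot-specific config that sets, for example:
#   CONDOR_HOST = testpost-serv-1.aoc.nrao.edu
#   DAEMON_LIST = MASTER, STARTD
#   STARTD_NOCLAIM_SHUTDOWN = 1200   # exit after 20 minutes without a claim
export CONDOR_CONFIG=/lustre/aoc/admin/condor-glidein/etc/condor_config.pilot
# Run in the foreground so Slurm tracks the pilot's lifetime.
# (The master may need extra knobs to exit once the startd shuts down.)
exec /lustre/aoc/admin/condor-glidein/sbin/condor_master -f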
Glidein
The only documentation I can find on glidein (https://htcondor.readthedocs.io/en/latest/grid-computing/introduction-grid-computing.html?highlight=glidein#introduction) seems to imply that glidein only works with Globus: "HTCondor permits the temporary addition of a Globus-controlled resource to a local pool. This is called glidein." Is this correct? Is there better documentation? Is glidein even a technology or software package, or is it just a generic term?
ANSWER: Greg will look at rewriting this.
request_virtualmemory
If I set request_virtualmemory = 2G, condor_submit accepts it as a valid knob but the job stays idle and never runs.
request_memory = 1G
request_virtualmemory = 2G
If I set request_virtualmemory = 2000000, which should be the same as 2G, the job runs but doesn't set memory.memsw.limit_in_bytes in the cgroup.
ANSWER: krowe sent mail to Greg about it
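One thing worth capturing for that mail thread: whether any memsw limit shows up in the job's cgroup at all. Assuming cgroup v1 and the usual htcondor cgroup layout (the path is a guess; locate it first with find):
find /sys/fs/cgroup/memory -path '*htcondor*' -name memory.memsw.limit_in_bytes
cat /sys/fs/cgroup/memory/htcondor/*/memory.memsw.limit_in_bytes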
Memory usage report
The memory usage report at the end of the condor log seems incorrect. I can watch the memory.max_usage_in_bytes in the cgroup get over 8,400MB yet the report in the condor log reads 6,464MB. Does the log only report the memory usage of the parent process and not include all the children? Is it an average memory usage over time?
ANSWER: It is a report of a sum of certain fields in memory.stat in the cgroup. Get Greg an example. Try it on two machines in case this is a problem of re-using the same cgroup. Or reboot and try again.
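A quick way to grab that example for Greg, run on the execute host while the job is up (cgroup v1 layout assumed; the glob path is a guess): dump the rss/cache/swap lines of memory.stat next to the peak value the kernel recorded, so the two numbers can be compared directly.
grep -E '^(total_)?(rss|cache|swap)' /sys/fs/cgroup/memory/htcondor/*/memory.stat
cat /sys/fs/cgroup/memory/htcondor/*/memory.max_usage_in_bytes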
Answered Questions:
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they? Are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A Job ID (ClusterID.ProcID) is only unique to a schedd.
- A schedd is identified by the DNS name of the schedd and the ctime of its job_queue.log file.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on it alone (see the sketch below).
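- For reference, HTCondor already exposes a schedd-scoped identifier, GlobalJobId (schedd name plus ClusterId.ProcId plus a submit-time timestamp), which could be one ingredient in an ID we craft ourselves:
condor_q -af GlobalJobId ClusterId ProcId
(prints something like testpost-serv-1.aoc.nrao.edu#150.0#1592233545 150 0; that output is made up)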
- Upgrading HTCondor without killing jobs?
- schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- currently restarting execute services will kill jobs. CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure 8.8.x is compatible with 8.8.y or 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones)
- There is another software filesystem where CASA could live that is more used for admin usage but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy: Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node; perhaps the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then pass the executable you used to have as the argument. I just changed our stuff to statically set $HOME and I think that is good enough (see the sketch below).
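- EXAMPLE: a minimal submit-file sketch of the /bin/bash trick plus a statically set HOME (script name and paths are placeholders):
executable = /bin/bash
arguments = ./run_casa.sh
transfer_input_files = run_casa.sh
getenv = True
environment = "HOME=/lustre/aoc/users/krowe"
queue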
- Flocking: Suppose I have two hosts in the same pool. testpost-master is a submit-host and testpost-serv-1 is both a submit-host and the central-manager. testpost-serv-1 is configured to flock to CHTC but testpost-master is not. Is it possible to submit a job on testpost-master that will flock to CHTC by somehow leveraging testpost-serv-1? In other words, do I have to set up flocking and an external IP on every submit host?
- ANSWER: there isn't a good way to do this. So eventually we will need to make testpost-master flock to CHTC and possibly remove the ability of testpost-serv-1 to flock.
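- EXAMPLE: if/when we do enable flocking on testpost-master, the submit-host side is roughly a FLOCK_TO knob pointing at CHTC's central manager (hostname below is a placeholder), plus the external IP / firewall work; CHTC's side would need a matching FLOCK_FROM entry:
FLOCK_TO = cm.chtc.wisc.edu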
- It seems the transfer mechanism won't transfer symlinks to directories (e.g. data/vlass.ms → /lustre/aoc/...) Is there a way around this?
- ANSWER: there is no flag to chase symlinks at the moment. The top-level dir (e.g. data) could be a symlink if transfer_input_files=data/ but it will then transfer the contents of data instead of data itself (a possible workaround is sketched after the examples below).
- If symlink → data and transfer_input_files=symlink I get the error Transfer of symlinks to directories is not supported.
- If symlink → ../data and transfer_input_files=symlink/ it transfers the contents, not the directory. In other words, I don't have a data directory in scratch; I have a VLASS... directory.
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/VLASS/
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/ I get the error Transfer of symlinks to directories is not supported.
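- A possible workaround (untested): point transfer_input_files at the symlink's real target, accepting that the directory then lands in scratch under its own name rather than under data/. The path below is illustrative; no trailing slash, so the directory itself is transferred rather than just its contents.
transfer_input_files = /lustre/aoc/projects/vlass/vlass.ms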
- DAG log time stamps: is there a way to differentiate data import/export time from process run time?
- Look in the job log file not the dag log file
- 040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
Transferring to host: <10.64.10.172:9618?addrs=10.64.10.172-9618&alias=nmpost072.aoc.nrao.edu&noUDP&sock=slot1_1_72656_7984_60>
...
040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files
- Rank and Preemption: Can we use Rank to set "preferences" without requiring job preemption?
- ANSWER: There are two kinds of rank (job rank and machine rank). Job rank (RANK = ... in a submit file) is purely a preference; it does not preempt. Machine rank (RANK in the startd's config) will preempt. Negotiator pre-job rank is a third type of rank that works at the pool level and is often used to pack jobs efficiently. Examples of the first two are sketched below.
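- EXAMPLE: the owner name below is a placeholder. Job rank, in the submit file, only expresses a preference (here, for bigger-memory machines):
rank = Memory
Machine rank, in the startd's configuration, will preempt running jobs in favor of preferred ones:
RANK = (TARGET.Owner == "vlapipe")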
- Update on software store for CASA either on shared Ceph storage or admin software storage
- Staging area for datasets 100MB - TBs. This is where we could try keeping the cfcache assuming doing so doesn't overwhelm the filesystem.
- /staging/nu_jrobnett
Requirements = (Target.HasCHTCStaging == true)
- Quota: 100GB, 100K files
- Squid area for 100MB - 1GB input or shared software. This is where we could keep casa.tgz and then have the execution host retrieve it via HTTP.
- /squid/nu_jrobnett
- only accessible via this path on the submit hosts. Execution hosts will need to access it via HTTP.
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/nu_jrobnett/casa.tgz
- Software area: we can use this for run-time applications. Think of it like /usr/local.
/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7
- export PATH=/opt/local/bin:/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7/bin:${PATH}
- Quota: 5GB, 100K files
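- EXAMPLE: pulling these pieces together, a submit-file fragment for a CASA job at CHTC might look roughly like this (only the URL and requirement above are real; the rest is illustrative). The job script would then copy the cfcache out of /staging/nu_jrobnett and put the /software CASA bin directory on PATH:
requirements = (Target.HasCHTCStaging == true)
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/nu_jrobnett/casa.tgz
executable = run_casa.sh
queue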
- Public_input_files: How is this different from transfer_input_files, and when would one want to use it instead of files or URLs with transfer_input_files?
- https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html#public-input-files
- This is still a work in progress. It may allow for caching on a squid server, fetchable by others someday.
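- If I'm reading that page right, the submit command mirrors transfer_input_files; treat the syntax below as an assumption until we test it:
public_input_files = casa.tgz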
- Flocking: When we flock to CHTC, what is the data path for transfer_input_files? Is it our submit host and CHTC's execution host, or is CHTC's submit host involved?
- Dataflow is from our schedd (submit host) to their execute host but CCB will reverse the connection. Their execution hosts are publicly addressable but that may not be necessary.
- How can we choose the data path for transfer_input_files to our clients, given multiple networks? Currently we assume it will use the 1Gb link, but we have IB links. Is there a way for condor to use the IB link just for transferring files? Is that hostname-based? Other ideas?
- CHTC doesn't have a good solution for this.
- We could upgrade from 1Gb to 10Gb
- We could use the IB names for everything (problematic for submit hosts that don't have IB)
- We could skip the transfer mechanism and instead use something else like scp
- We could use a custom transfer plugin
- Are there known issues with distributed scratch via NFS or Lustre w.r.t. tmpdir or other things, e.g. OpenMPI complaining about tmpdir being on a network FS?
- Some problems with log files on the submit host but rare.
- Any general best practices to support MPI in terms of class ads or other settings?
- Use the shared memory transport for security
- Is there a way DAGMan can be told to ignore errors? In some cases we want a DAG to mindlessly continue rather than retry.
- The job is considered successful based on the return of the post script. If there isn't a post script, the success is based on the return of the job.
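- EXAMPLE: so one way to make DAGMan continue regardless is to give each node a POST script that always succeeds (node and file names are placeholders):
JOB stepA stepA.sub
SCRIPT POST stepA /bin/true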
- Transfer mechanism: Documentation implies that only files with an mtime newer than when the transfer_input_files finished will be transferred back to the submit host. While running a dag, the files in my working directory (which is in both transfer_input_files and transfer_output_files) seem to always have an mtime around the most recent step in the DAG suggesting that the entire working directory is copied from the execution host to the submit host at the end of each DAG step. Perhaps this means the transfer mechanism only looks at the mtime of the files/dirs specified in transfer_output_files and doesn't descend into the directories.
- Subdirectories are treated differently
- SOLUTION: I think casa just touches every file and therefore condor is forced to copy everything in the working directory. I have been unable to reproduce the problem outside of casa.
- SOLUTION: If you specify a directory, HTCondor will transfer the entire directory not just files with new mtime.
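- EXAMPLE: given that, listing specific outputs instead of the whole working directory avoids re-shipping everything at each DAG step (filenames illustrative):
transfer_output_files = working/casa.log, working/products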
...
testpost-cm-vml krowe >condor_status
Name                             OpSys  Arch   State     Activity LoadAv Mem
slot1@testpost001.aoc.nrao.edu   LINUX  X86_64 Unclaimed Idle     0.000  193
slot1_1@testpost001.aoc.nrao.edu LINUX  X86_64 Claimed   Busy     0.000
slot1@testpost002.aoc.nrao.edu   LINUX  X86_64 Unclaimed Idle     0.000  144
slot1_1@testpost002.aoc.nrao.edu LINUX  X86_64 Claimed   Busy     0.810   49
slot1@testpost003.aoc.nrao.edu   LINUX  X86_64 Unclaimed Idle     0.000  193
I see that with dynamic slots, the parent slot (slot1) always seems to be Unclaimed and Idle while the child slots (slot1_1) are Claimed and Busy. So I tried checking the ChildState attribute, which looks like a list but doesn't behave like one. For example, none of these match any slots:
condor_status -const 'ChildState == { "Claimed" }'
condor_status -const 'sum(ChildState) == 0'
Even though this produces true
classad_eval 'a = { }' 'sum(a) == 0'
ANSWER: Try this
condor_status -const 'size(ChildState) == 0'
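Since the claimed dynamic slots are advertised as their own ads, querying them directly should also work:
condor_status -const 'SlotType == "Dynamic" && State == "Claimed"'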