Open questions:
Bug: All on one core
- Bug where James's jobs are all put on the same core. Here is top -u krowe showing the Last Used Cpu (SMP) column (P) after I submitted five sleep jobs to the same host.
- Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
- Is this a failure of the Linux kernel to schedule things on separate cores?
- Is this because cpu.shares is set to 100 instead of 1024?
- Check if CPU affinity is set in /proc/self/status
- Is sleep cpu-intensive enough to properly test this? Perhaps submit a while(1) loop instead? (See the sketch after the top output below.)
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22
...
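To make the core placement easier to see, a CPU-bound job works better than sleep. A minimal sketch, assuming a hypothetical burn.sh wrapper (the submit file mirrors the five-job test above; the jobs spin forever, so remove them with condor_rm when done):

burn.sh:
#!/bin/sh
# Print any CPU affinity mask set for this job, then spin on one core.
grep -i cpus_allowed /proc/self/status
while true; do :; done

burn.htc:
executable = burn.sh
output = burn.$(Process).out
error = burn.$(Process).err
log = burn.log
request_cpus = 1
queue 5

With five of these running, the P column in top (or the Cpus_allowed_list lines in the output files) shows whether the kernel actually spreads them across cores.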
- Perhaps use requirements. Greg will send an example
- SOLUTION:
- DAG:
- JOB step05 step05.htc
- #VARS step05 SITE="chtc"
- #VARS step05 SITE="aws"
- Submit:
- +NRAOAttr = "$(SITE)"
- Requirements = My.NRAOAttr == "chtc" ? PoolName == "CHTC" : PoolName =!= "CHTC"
- Requirements = My.NRAOAttr == "chtc" ? (Target.HasCHTCStaging == true) : (Target.HasCHTCStaging =!= true)
- myannex = "krowe-annex"
- +MayUseAWS = True
- Requirements = My.NRAOAttr == "aws" ? AnnexName == $(myannex) : AnnexName =!= $(myannex)
- I would set myannex in the DAG but when I do that it tries to find an AnnexName of "krowe - annex" (note spaces)
- ANSWER: My conclusion is that there are limitations on what one can do with variables in the submit file that were defined in the DAG file.
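- For reference, a minimal end-to-end sketch of the chtc case above as a DAG file plus submit file (untested; step05.sh is a placeholder, PoolName and HasCHTCStaging are assumed to be advertised by the machines as in the example, and each node gets exactly one VARS SITE line):
  step05.dag:
    JOB step05 step05.htc
    VARS step05 SITE="chtc"
  step05.htc:
    executable = step05.sh
    +NRAOAttr = "$(SITE)"
    Requirements = My.NRAOAttr == "chtc" ? (PoolName == "CHTC") : (PoolName =!= "CHTC")
    queue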
- Is there a config option that will cause condor to not start? We have diskless nodes and it is easier to modify the config file than to change systemd.
- SOLUTION: Either set START_MASTER = False or START_DAEMONS = False depending on desired outcome.
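- For example, a drop-in config fragment on the diskless image (sketch; the file name is arbitrary and /etc/condor/config.d is the usual local config directory):
  # /etc/condor/config.d/99-nostart.conf
  # Keep condor_master running but start no other daemons:
  START_DAEMONS = False
  # Or make condor_master itself exit immediately on startup:
  # START_MASTER = False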
- Torque has a command called pbsnodes that can not only offline/drain a node but also keep a note about it that everyone can see in one place. I know I can use condor_off to drain a node but is there a central place to keep notes so I can remember a month later why I set a certain node to drain?
- ANSWER: there is no place to keep such notes but Greg likes the idea and may look into it.
- May want to use condor_drain instead of condor_off. condor_off will kill the startd once all jobs finish, and the node then no longer shows up in condor_status; condor_drain leaves the node in condor_status.
- condor_drain doesn't work for me because it immediately sets jobs idle instead of letting them run to completion. This is why I use condor_off -startd -peaceful instead.
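- As commands (the hostname is a placeholder):
  condor_off -peaceful -startd nmpost042   # let running jobs finish, then stop the startd; node drops out of condor_status
  condor_drain nmpost042                   # node stays in condor_status, but as noted above running jobs went back to idle
  condor_status nmpost042                  # check the node's slots afterwards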
- How can you tell which job an email is associated with, given that the email message doesn't include a working dir or the assigned batch_name?
- CHTC will look into adding such information to the email condor sends.
- Bug in condor_annex: Underscores in the AnnexName prevent the annex from moving into the pool.
- Also when I try to terminate an annex with underscores (e.g. krowe_annex_casa5) with the command condor_off -annex krowe_annex_casa5 I get the following error
- Found no ClassAds when querying pool (local)
- Can't find addresses for master's for constraint 'AnnexName =?= "krowe_annex_casa5"'
Perhaps you need to query another pool.
- Greg has noted this bug
- Bug in condor_annex: The following will wait for an annex named krowe - annex - casa5 (note the spaces). If I pass $(myannex) as an argument to a shell script, the spaces are not there.
- include.htc
- myannex = krowe-annex-casa5
- submit.htc
- include : include.htc
- executable = /bin/sleep
- arguments = 127
- +MayUseAWS = True
- requirements = AnnexName == $(myannex)
- queue
- Actually, I think this isn't a bug but a limitation on using macros. The AnnexName needs to be quoted, but how can I quote a macro? Note that I have the same problem with AnnexNames that don't have hyphens (e.g. krowetest).
- No: requirements = AnnexName == "$(myannex)"
- No: myannex = "krowe-annex-casa5"
- No: myannex = \"krowe-annex-casa5\"
- No: myannex = "\"krowe-annex-casa5\""
- Idea: +annex = "krowe-annex-casa5"
- requirements = AnnexName == my.annex
- Greg has noted this bug
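- A complete submit file version of the +annex idea above (untested sketch; whether it sidesteps the macro-quoting problem would need to be confirmed):
  executable = /bin/sleep
  arguments = 127
  +MayUseAWS = True
  +annex = "krowe-annex-casa5"
  requirements = (AnnexName == My.annex)
  queue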
Nodesfree
How can one see nodes that are entirely unclaimed?
SOLUTION: condor_status -const 'PartitionableSlot && Cpus == TotalCpus'
HERA queue
I want a proper subset of machines to be dedicated to the HERA project: these machines will only run HERA jobs, and HERA jobs will only run on these machines. The following seems to work, but is there a better way?
machine config:
HERA = True
STARTD_ATTRS = $(STARTD_ATTRS) HERA
START = ($(START)) && (TARGET.partition =?= "HERA")

submit file:
requirements = (HERA == True)
+partition = "HERA"
SOLUTION: yes, this is good. Submit Transforms could also be set on herapost-master (Submit Host)
https://htcondor.readthedocs.io/en/latest/misc-concepts/transforms.html?highlight=submit%20transform
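A sketch of what a Submit Transform on herapost-master might look like (the transform name is arbitrary; check the exact syntax against the documentation linked above). It stamps every job submitted through that schedd with the partition attribute so users don't have to add it themselves:

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) HERAPartition
JOB_TRANSFORM_HERAPartition @=end
   SET partition "HERA"
@end

The requirements = (HERA == True) line would still come from the submit file, or could be managed by a similar transform rule.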