Current Questions
Resubmitting Jobs
I have an example in
/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433
of a job that failed on nmpost106 and was then resubmitted by HTCondor on nmpost105. The problem is that the job actually did finish; it only got an error transferring the files back. So when the job was resubmitted, it copied over an almost-complete run of CASA, which makes a mess of things. I would rather HTCondor just fail and not resubmit the job. How can I do that?
022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>
ANSWER: Maybe
on_exit_hold = some_expression
periodic_hold = NumShadowStarts > 5
periodic_hold = NumJobStarts > 5
or a startd cron job that checks for IdM and offlines the node if needed
https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron
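A minimal submit-description sketch of that idea (the executable name and thresholds are placeholders, not tested values): hold the job after any restart instead of letting it re-run elsewhere, so someone can inspect the partially transferred output before releasing or removing it.
executable = run_casa.sh
periodic_hold = (NumJobStarts > 1) || (NumShadowStarts > 5)
periodic_hold_reason = "Job restarted or reconnected too many times; check output transfer before releasing"
queue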
RedHat8 Only
Say we have a few RedHat8 nodes and we only want jobs to run on those nodes that request RedHat8 with
requirements = (OpSysAndVer == "RedHat8")
I know I could set up a partition like we have done with VLASS but since HTCondor already has an OS knob, can I use that?
Setting RedHat8 in the job requirements guarantees the job will run on a RedHat8 node, but how do I make that node not run jobs that don't specify the OS they want?
The following didn't do what I wanted.
START = ($(START)) && (TARGET.OpSysAndVer =?= "RedHat8")
Then I thought I needed to specify jobs where OpSysAndVer is not Undefined but that didn't work either. Either of the following do prevent jobs that don't specify an OS from running on the node but they also prevent jobs that DO specify an OS via either OpSysAndVer or OpSysMajorVer respectively.
START = ($(START)) && (TARGET.OpSysAndVer isnt UNDEFINED)
START = ($(START)) && (TARGET.OpSysMajorVer isnt UNDEFINED)
A better long-term solution is probably for our jobs (VLASS, VLA calibration, ingestion, etc.) to request the OS they want if they care. Then they can test new OSes when they want, and we can upgrade OSes on our schedule (to a certain point). I think asking them to start requesting an OS now is not going to happen, but maybe by the time RedHat9 is an option both they and we will be ready for this.
ANSWER: unparse takes a ClassAd expression and turns it into a string; then use a regex on the string looking for OpSysAndVer.
Is this the right syntax? Probably not as it doesn't work
START = ($(START)) && (regexp(".*RedHat8.*", unparse(TARGET.Requirements)))
Greg thinks this should work. We will poke at it.
The following DOES WORK in the sense that it matches anything.
START = ($(START)) && (regexp(".", unparse(TARGET.Requirements)))
None of these work
START = ($(START)) && (regexp(".*RedHat8.*", unparse(Requirements)))
START = ($(START)) && (regexp(".*a.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(TARGET.Requirements)))
START = ($(START)) && (regexp("\(\(OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("(.*)RedHat8(.*)", unparse(Requirements)))
START = ($(START)) && (regexp("RedHat8", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "m"))
START = ($(START)) && (regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements)))
START = $(START) && regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements))#START = $(START) && debug(regexp(".*RedHat8.*", unparse(TARGET.Requirements)))
This should also work
in the config file
START = $(START) && target.WantTORunOnRedHat8only
Submit file
My.WantToRunonRedHat8Only = true
But I would rather not have to add yet more attributes to the EPs. I would like to use the existing OS attribute that HTCondor provides.
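If we do end up with a custom attribute, a slightly fuller sketch (the attribute name is just an example) that stays well defined for jobs that never set it:
On the RedHat8 EPs:
START = ($(START)) && (TARGET.WantToRunOnRedHat8Only =?= True)
In the submit description file:
My.WantToRunOnRedHat8Only = True
requirements = (OpSysAndVer == "RedHat8")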
Wasn't there a change from PCRE to PCRE2 or something like that? Could that be causing the problem? 2023-11-13 Greg doesn't think so.
2024-01-03 krowe: Can we use a container like this? How does PATh do this?
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest"
output_destination and stdout/stderr
It used to be that once you set output_destination = someplugin:// then that plugin was responsible for transferring all files even stdout and stderr. That no longer seems to be the case as of version 23. My nraorsync transfer plugin has code in it looking for _condor_stdout and _condor_stderr as arguments but never sees them with version 23. The stdout and stderr files are copied back to the submit directory instead of letting my plugin transfer them.
This is a change. I am not sure if it affects us adversely or not but can we unchange this?
ANSWER: from Greg "After some archeology, it turns out that the change so that a file transfer plugin requesting to transfer the whole sandbox no longer sees stdout/stderr is intentional, and was asked for by several users. The current workaround is to explicitly list the plugin in the stdout/stderr lines of the submit file, e.g."
output = nraorsync://some_location/stdout
error = nraorsync://some_location/stderr
This seems like it should work but my plugin produces errors. Probably my fault.
tokens and collector.locate
It seems that if the submit host is HTC23, you need a user token in order for the API (nraorsync specifically) to locate the schedd.
import os
import classad
import htcondor

def upload_file():
    try:
        ads = classad.parseAds(open(os.environ['_CONDOR_JOB_AD'], 'r'))
        for ad in ads:
            try:
                globaljobid = str(ad['GlobalJobId'])
                print("DEBUG: globaljobid is", globaljobid)
            except Exception:
                return -1
    except Exception:
        return -1
    print("DEBUG: upload_file(): step 1\n")
    # GlobalJobId looks like submithost#cluster.proc#timestamp
    submithost = globaljobid.split('#')[0]
    print("DEBUG: submithost is", submithost)
    collector = htcondor.Collector()
    print("DEBUG: collector is", collector)
    schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, submithost)
    print("DEBUG: schedd_ad is ", schedd_ad)

upload_file()
This code works if both the AP and EP are version 10. But if the AP is version 23, it fails whether the EP is version 10 or version 23. It works with version 23 only if I have a ~/.condor/tokens.d/nmpost token. Why do I need a user token to run collector.locate against a schedd?
I was going to test this on CHTC but I can't seem to get an interactive job on CHTC anymore.
DONE: send Greg error output and security config
transfer_output_files change in version 23
My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files. Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy. But with version 23 this no longer works. I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr are no longer transferable via transfer_output_files. Is that right? If so, I think I can work around it. Just wanted to know.
ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before. Greg will see if there is a better, first-class way to trigger transfers.
DONE: We will use condor_transfer since it needs to be there anyway.
Installing version 23
I am looking at upgrading from version 10 to 23 LTS. I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only. It doesn't provide repos to install the LTS.
https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html
ANSWER: Greg will find it and get back to me.
DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
Virtual memory vs RSS
Looks like condor is reporting RSS but that may actually be virtual memory. At least according to Felipe's tests.
ANSWER: Access to the cgroup information on the nmpost cluster is good because condor is running as root, so condor reports the RSS accurately. But on systems using glidein, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may be different than on the nmpost cluster. On glide-in jobs condor reports the virtual memory across all the processes in the job.
CPU usage
Felipe has had jobs put on hold for too much cpu usage.
runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.
runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.
GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.
2024-09-16: Felipe asked about this again.
Missing batch_name
A DAG job, submitted along with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239. It is just the one job; all the others submitted from the same template do show batch names.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270
nmpost-master krowe >condor_q -dag -name mcilroy -g -all
...
vlapipe vlass_ql.dag+370186 7/16 10:30 1 1 _ _ 3 370193.0
vlapipe vlass_ql.dag+370191 7/16 10:31 1 1 _ _ 3 370194.0
vlapipe DAG: 371239 7/16 10:56 1 1 _ _ 3 371536.0
...
GREG: Probably a condor bug. Try submitting it again to see if the name is missing again.
WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'
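A hedged alternative, if this keeps happening, is to set the batch name explicitly at submit time rather than editing it afterwards, assuming our condor_submit_dag accepts -batch-name (the name shown is made up):
condor_submit_dag -batch-name VLASS3.2_quicklook vlass_ql.dag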
DAG failed to submit
Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job. condor.log is empty.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742
ANSWERs: Perhaps the schedd was too busy to respond. Need more resources in the workflow container?
Need to handle error codes from condor_submit_dag. 0 good. 1 bad. (chausman)
Setup /usr/bin/mail on mcilroy so that it works. Condor will use this to send mail to root when it encounters an error. Need to submit jira ticket to SSA. (krowe)
Constant processing
Our workflows have a process called "ingestion" that puts data into our archive. There are almost always ingestion processes running or needing to run, and we don't want them to get stalled because of other jobs. Both ingestion and the other jobs run as the same user, "vlapipe". I thought about setting a high priority in the ingestion submit description file, but that won't guarantee that ingestion always runs, especially since we don't do preemption. So my current thinking is to have a dedicated node for ingestion. Can you think of a better solution?
...
2024-02-01 krowe: Talked to chausman today. She thinks SSA will need this and that the host will need access to /lustre/evla like aocngas-master and the nmngas nodes do. That might also mean a variable like HASEVLALUSTRE as well or instead of HIGHPRIORITY.
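A sketch of the dedicated-node idea using the attribute names floated above; nothing here is settled config, and RequireHighPriority is a made-up job attribute, just one way it could look.
On the ingestion node:
HIGHPRIORITY = True
HASEVLALUSTRE = True
STARTD_ATTRS = $(STARTD_ATTRS) HIGHPRIORITY HASEVLALUSTRE
START = ($(START)) && (TARGET.RequireHighPriority =?= True)
In the ingestion submit description file:
My.RequireHighPriority = True
requirements = (HIGHPRIORITY =?= True)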
In progress
condor_remote_cluster
CHTC
000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
Failed to start GAHP: Missing remote command\n
Code 0 Subcode 0
...
testpost-master krowe >cat condor.902.log
000 (902.000.000) 2023-04-14 16:40:37 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (902.000.000) 2023-04-14 16:40:41 Job was held.
Failed to start GAHP: Agent pid 3145812\nPermission denied (gssapi-with-mic,keyboard-interactive).\nAgent pid 3145812 killed\n
Code 0 Subcode 0
...
PATh
000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
Failed to start GAHP: Missing remote command\n
Code 0 Subcode 0
...
Checking _condor_stdout of running jobs
We have some jobs that seem to hang, possibly because of a race condition or whatnot. I'm pretty sure it is our fault. But the only way I know to tell is to log in to the node and look at _condor_stdout in the scratch area. That gets pretty tedious when I want to check hundreds of jobs to see which ones are hung. Does condor have a way to check the _condor_stdout of a job from the submit host so I can do this programmatically?
I thought condor_tail would be the solution but it doesn't display anything.
ANSWER: condor_ssh_to_job might be able to be used non-interactively. I will try that.
ANSWER: use the FULL jobid with condor_tail. E.g. condor_tail 12345.0 Greg has submitted a patch so you don't have to specify the ProcId (.0).
Bug: condor_off -peaceful
testpost-cm-vml root >condor_off -peaceful -name testpost002
Sent "Set-Peaceful-Shutdown" command to startd testpost002.aoc.nrao.edu
Can't find address for schedd testpost002.aoc.nrao.edu
Can't find address for testpost002.aoc.nrao.edu
Perhaps you need to query another pool.
Yet it works without the -peaceful option
testpost-cm-vml root >condor_off -name testpost002
Sent "Kill-All-Daemons" command to master testpost002.aoc.nrao.edu
ANSWER: Add the -startd option. E.g. condor_off -peaceful -startd -name <hostname> Greg thinks it might be a regression (another bug). This still happens even after I set all the CONDOR_HOST knobs to testpost-cm-vml.aoc.nrao.edu. So it is still a bug and not because of some silly config I had at NRAO.
File Transfer Plugins and HTCondor-C
Is there a way I can use our nraorsync plugin on radial001? Or something similar?
SOLUTION: ssh tunnels
Condor Week (aka Throughput Week)
July 10-14, 2023. Being co-run with the OSG all hands meeting. At the moment, it is not hybrid but entirely in-person. https://path-cc.io/htc23
PROVISIONER node
When I define a PROVISIONER node, that is the only node that runs. The others never run. Also, the PROVISIONER job always returns 1 "exited normally with status 1" even though it is just running /bin/sleep.
JOB node01 node01.htc
JOB node02 node02.htc
JOB node03 node03.htc
PARENT node01 CHILD node02 node03
PROVISIONER prov01 provisioner.htc
ANSWER: my prov01 job needs to indicate when it is ready with something like the following, but that means the provisioner job has to run in either the local or scheduler universe because our execute nodes can't run condor_qedit.
condor_qedit myJobId ProvisionerState 2
But execute hosts can't run condor_qedit so this really only works if you set universe to local or scheduler.
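For reference, a hedged sketch of a provisioner submit description that can run condor_qedit from the AP; provision.sh and what it does internally are assumptions.
universe = local
executable = provision.sh
arguments = $(Cluster)
queue
Here provision.sh would do its setup and then mark itself ready with condor_qedit on its own job id, setting ProvisionerState 2.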
Does CHTC have resources available for VLASS?
Our Single Epoch jobs
- Are parallelizable with OpenMPI
- 64GB of memory
- ~150GB of storage
- can take week(s) to run
- no checkpointing
- Looking at using GPUs but not there yet
Brian was not scared by this and gave us a form to fill out
https://chtc.cs.wisc.edu/uw-research-computing/form.html
ANSWER: Yes. We and Mark Lacy have started the process with CHTC for VLASS.
Annex to PATh
https://htcondor.org/experimental/ospool/byoc/path-facility
ANSWER: Greg doesn't know but he can connect me with someone who does.
Tod Miller is the person to ask about this.
Hold jobs that exceed disk request
ANSWER: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs
condor_userprio
We want a user (vlapipe) to always have higher priority than other users. I see we can set this with condor_userprio but is that change permanent?
ANSWER: There is no config file for this. Set the priority_factor of vlapipe to 1. That is saved on disk and should persist through reboots and upgrades.
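Presumably something like the following, where the accounting name and domain are whatever condor_userprio currently shows for vlapipe:
condor_userprio -setfactor vlapipe@aoc.nrao.edu 1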
Submitting jobs as other users
At some point in the future we will probably want the ability for a web process to launch condor jobs as different users. The web process will probably not be running as root. Does condor have a method for this or should we make our own setuid-root thingy? Tell dlyons the answer.
ANSWER: HTCondor doesn't have anything for this. So it is up to us to do some suid-fu.
SSH keys with Duo
I tried following the link below to setup ssh such that I don't have to enter both my password and Duo every time I login to CHTC. It doesn't create anything in ~/.ssh/connections after I login. Thoughts?
https://chtc.cs.wisc.edu/uw-research-computing/configure-ssh
ANSWER: Greg doesn't know what to do here. We should ask Christina.
HTCondor-C and requirements
Submitting jobs from the container on shipman as vlapipe to the NRAO RADIAL prototype cluster seems to ignore requirements like the following. Is this expected?
requirements = (machine == "radial001.nrao.radial.local")
and
requirements = (VLASS == True)
+partition = "VLASS"
It also seems to ignore things like
request_memory = 100G
ANSWER:
But I am still having problems.
This forces the job to run on radial001
+remote_Requirements = ((machine == "radial001.nrao.radial.local"))
This runs on radialhead even though it only has 64G
+remote_RequestMemory = 102400
This runs on radialhead even though it doesn't have a GPU
request_gpus = 1
+remote_RequestGPUs = 1
ANSWER: This works
+remote_Requirements = ((machine == "radial001.nrao.radial.local") && memory > 102400)
as does this
+remote_Requirements = (memory > 102400)
but Greg will look into why +remote_RequestMemory doesn't work. It should.
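For context, a minimal HTCondor-C submit sketch of the kind being discussed here; the remote schedd and collector names are assumptions:
universe = grid
grid_resource = condor radialhead.nrao.radial.local radialhead.nrao.radial.local
executable = my_job.sh
+remote_Requirements = (machine == "radial001.nrao.radial.local")
queue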
Select files to transfer dynamically according to job-slot match
We currently have separate builds of our GPU software for CUDA Capability 7.0 and 8.0, and our jobs specify that both builds should be transferred to the EP, so that the job executable selects the appropriate build to run based on the CUDA Capability of the assigned GPU. Is there a way to do this selection when the job is matched to a slot, so that only the necessary build is transferred according to the slot's CUDA Capability?
ANSWER: $() means expand this locally from the jobad. $$() means expand at job start time.
executable = my_binary.$$(GPUs_capability)
executable = my_binary.$$([int(GPUs_Capability)]) # Felipe said this actually works
executable = my_binary.$$([ classad_express(GPUS_capability) ]) # Hopefully you don't need this
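Building on that, a hedged sketch that ships only the matching build; the tarball names are made up, and it assumes $$() expansion works in transfer_input_files and that our version supports the require_gpus submit command.
request_gpus = 1
require_gpus = (Capability == 7.0) || (Capability == 8.0)
transfer_input_files = gpu_build_cc$$([int(GPUs_Capability)]).tar.gz
executable = run_gpu_job.sh
queue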
CPU/GPU Balancing
We have 30 nodes in a rack at NMT with a power limit of 17 kW and we are able to hit that limit when all 720 cores (24 cores * 30 nodes) are busy. We want to add two GPUs to each node but that would almost certainly put us way over the power limit if each node had 22 cores and 2 GPUs busy. So is there a way to tell HTCondor to reserve X cores for each GPU? That way we could balance the power load.
JOB TRANSFORMS work per schedd so that wouldn't work on the startd side which is what we want.
IDEA: NUM_CPUS = 4 or some other small number greater than the number of GPUs but limiting enough to keep the power draw low.
ANSWER: There isn't a knob for this in HTCondor but Greg is interested in this and will look into this.
WORKAROUND: MODIFY_REQUEST_EXPR_REQUESTCPUS may help by making each GPU job get 8 cores, or something like that.
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, isUndefined(RequestGpus) ? {1} : {8, 16, 24, 32, 40})
That is, when a job comes into the startd, if it doesn't request any GPUs, allocate exactly as many cpu cores as it requests. Otherwise, allocate 8 times as many cpus as it requests.
This seems to work. If I ask for 0 GPUs and 4 CPUs, I am given 0 GPUs and 4 CPUs. If I ask for 1 GPU and don't ask for CPUs, I am given 1 GPU and 8 CPUs.
But if I ask for 2 GPUs and don't ask for CPUs, I still am only given 8 CPUs. I was expecting to be given 16 CPUs. This is probably fine as we are not planning on more than 1 GPU per job.
But if I ask for 1 GPU and 4 CPUs, I am given 1 GPU and 8 CPUs. That is probably acceptable.
2024-01-24 krowe: Assuming a node can draw up to 550 Watts when all 24 cores are busy and that node only draws 150 Watts when idle, and that we have 17,300 Watts available to us in an NMT rack,
- we should only need to reserve 3 cores per GPU in order to offset the 72 Watts of an Nvidia L4 GPU.
- This would waste 60 cores.
- Or at least I suggest starting with that and seeing what happens. Another alternative is we just turn off three nodes if we put one L4 in each node.
- This would waste 72 cores.
Upgrading
CHTC just upgrades to the latest version when it becomes available, right? Do you ever run into problems because of this? We are still using version 9 because I can't seem to schedule a time with our VLASS group to test version 10. Much less version 23.
ANSWER: yes. The idea is that CHTC's users are helping them test the latest versions.
Flocking to CHTC?
We may want to run VLASS jobs at CHTC. What is the best way to submit locally and run globally?
ANSWER: Greg thinks flocking is the best idea.
This will require 9618 open to nmpost-master and probably a static NAT and external DNS name.
External users vs staff
We are thinking about making a DMZ ( I don't like that term ) for observers. Does CHTC staff use the same cluster resources that CHTC observers (customers) use?
ANSWER: There is no airgap at CHTC; everyone uses the same cluster. Sometimes users use a different AP, but more for load balancing than security. Everyone does go through 2FA.
Does PATh Cache thingy(tm) (a.k.a. Stash) work outside of PATh?
I see HTCondor-10.x comes with a stash plugin. Does this mean we could read/write to stash from NRAO using HTCondor-10.x?
ANSWER: Greg thinks you can use stash remotely, like at our installation of HTCondor.
Curl_plugin doesn't do FTP
None of the following work. They either hang or produce errors. They work on the shell command line, except at CHTC where the squid server doesn't seem to grok FTP.
transfer_input_files = ftp://demo:password@test.rebex.net:/readme.txt
transfer_input_files = ftp://ftp:@ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp:@ftp.slackware.com:/welcome.msg
transfer_input_files = ftp://ftp.slackware.com:/welcome.msg
2024-02-05: Greg thinks this should work and will look into it.
ANSWER: 2024-02-06 Greg wrote "Just found the problem with ftp file transfer plugin. I'm afraid there's no easy workaround, but I've pushed a fix that will go into the next stable release. "
File Transfer Plugins and HTCondor-C
I see that when a job starts, the execution point (radial001) uses our nraorsync plugin to download the files. This is fine and good. When the job is finished, the execution point (radial001) uses our nraorsync plugin to upload the files, also fine and good. But then the RADIAL schedd (radialhead) also runs our nraorsync plugin to upload files. This causes problems because radialhead doesn't have the _CONDOR_JOB_AD environment variable and the plugin dies. Why is the remote schedd running the plugin and is there a way to prevent it from doing so?
Greg understands this and will ask the HTCondor-c folks about it.
Greg thinks it is a bug and will talk to our HTCondor-C people.
2023-08-07: Greg said the HTCondor-C people agree this is a bug and will work on it.
2023-09-25 krowe: send Greg my exact procedure to reproduce this.
2023-10-02 krowe: Sent Greg an example that fails. Turns out it is intermittent.
2024-01-22 krowe: will send email to the condor list
ANSWER: It was K. Scott all along. I now have HTCondor-C working from the nmpost and testpost clusters to the radial cluster, using my nraorsync plugin to transfer both input and output files. The reason the remote AP (radialhead) was running the nraorsync plugin was because I defined it in the condor config like so.
FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py
I probably did this early in my HTCondor-C testing not knowing what I was doing. I commented this out, restarted condor, and now everything seems to be working properly.
Quotes in DAG VARS
I was helping SSA with a syntax problem between HTCondor-9 and HTCondor-10 and I was wondering if you had any thoughts on it. They have a dag with lines like this
JOB SqDeg2/J232156-603000 split.condor
VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000"
Then they set that split_dir VAR to a variable in the submit description file like this
SPLIT_DIR = "$(split_dir)"
The problem seems to be the quotes around $(split_dir). It works fine in HTCondor-9 but with HTCondor-10 they get an error like this in their pims_split.dag.dagman.out file
02/28/24 16:26:02 submit error: Submit:-1:Unexpected characters following doublequote. Did you forget to escape the double-quote by repeating it? Here is the quote and trailing characters: "SqDeg2/J232156+603000""
Looking at the documentation https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-10-0.html#version-10-0-0 it's clear they shouldn't be putting quotes around $(split_dir). So clearly something changed with version 10: either a change to the syntax or, my guess, just a stricter parser.
Any thoughts on this?
ANSWER: Greg doesn't know why this changed but thinks we are now doing the right thing.
OSDF Cache
Is there a way to prefer a job to run on a machine where the data is cached?
ANSWER: There is no knob in HTCondor for this, but CHTC would like to add one. They would like to glide in OSDF caches like they glide in nodes. But these are all long-term ideas.
GPU names
HTCondor seems to have short names for GPUs which are the first part of the UUID. Is there a way to use/get the full UUID? This would make it consistent with nvidia-smi.
ANSWER: Greg thinks you can use the full UUID with HTCondor.
But cuda_visible_devices only provides the short UUID name. Is there a way to get the long UUID name from cuda_visible_devices?
ANSWER: You can't use id 0 because 0 will always be the first GPU that HTCondor chose for you. Some new release of HTCondor supports NVIDIA_VISIBLE_DEVICES which should be the full UUID.
Big Data
Are we alone in needing to copy in and out many GBs per job? Do other institutions have this problem as well? Does CHTC have any suggestions to help? Sanja will ask this of Bockleman as well.
ANSWER: Greg thinks our transfer times are not uncommon but our processing time is shorter than many. Other jobs have similar data sizes. Some other jobs have similar transfer times but process for many hours. Maybe we can constrain our jobs to only run on sites that seem to transfer quickly. Greg is also interested in why some sites seem slower than others. Is that actually site specific or is it time specific or...
Felipe does have a long list of excluded sites in his run just for this reason. Greg would like a more declarative solution like "please run on fast transfer hosts", especially if this is dynamic.
GPUs_Capability
We have a host (testpost001) with both a Tesla T4 (Capability=7.5) and a Tesla L4 (Capability=8.9) and when I run condor_gpu_discovery -prop I see something like the following
DetectedGPUs="GPU-ddc998f9, GPU-40331b00"
Common=[ DriverVersion=12.20; ECCEnabled=true; MaxSupportedVersion=12020; ]
GPU_40331b00=[ id="GPU-40331b00"; Capability=7.5; DeviceName="Tesla T4"; DevicePciBusId="0000:3B:00.0"; DeviceUuid="40331b00-c3b6-fa9a-b8fd-33bec2fcd29c"; GlobalMemoryMb=14931; ]
GPU_ddc998f9=[ id="GPU-ddc998f9"; Capability=8.9; DeviceName="NVIDIA L4"; DevicePciBusId="0000:5E:00.0"; DeviceUuid="ddc998f9-99e2-d9c1-04e3-7cc023a2aa5f"; GlobalMemoryMb=22491; ]
The problem is `condor_status -compact -constraint 'GPUs_Capability >= 7.0'` doesn't show testpost001. It does show testpost001 when I physically remove the T4.
Requesting a specific GPU with `RequireGPUs = (Capability >= 8.0)` or `RequireGPUs = (Capability <= 8.0)` does work however so maybe this is just a condor_status issue.
We then replaced the L4 with a second T4 and then GPUs_Capability functioned as expected.
Can condor handle two different capabilities on the same node?
ANSWER: Greg will look into it. They only recently added support for different GPUs on the same node. So this is going to take some time to get support in condor_status. Yes this is just a condor_status issue.
Priority for Glidein Nodes
We have a factory.sh script that glides in Slurm nodes to HTCondor as needed. The problem is that HTCondor then seems to prefer these nodes to the regular HTCondor nodes, such that after a while there are several free regular HTCondor nodes and three glide-in nodes. Is there a way to set a lower priority on glide-in nodes so that HTCondor only chooses them if the regular HTCondor nodes are all busy? I am going to offline the glide-in nodes to see if that works, but that is a manual solution, not an automated one.
I would think NEGOTIATOR_PRE_JOB_RANK would be the trick but we already set that on the CMs to the following so that RANK expressions in submit description files are honored and negotiation will prefer NMT nodes over DSOC nodes if possible.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory
ANSWER: NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory + 100000 * (site == "not-slurm")
I don't like setting not-slurm in the dedicated HTCondor nodes. I would rather set something like "glidein=true" or "glidein=1000" in the default 99-nrao config file and then remove it from the 99-nrao config in snapshots for dedicated HTCondor nodes. But that assumes that the base 99-nrao is for NM. Since we are sharing an image with CV we can't assume that. Therefore every node, whether dedicated HTCondor or not, will need a 99-nrao in its snapshot area.
SOLUTION
This seems to work. If I set NRAOGLIDEIN = True on a node, then that node will be chosen last. You may ask why not just add 10000000 * (NRAOGLIDEIN == True). If I did that I would have to also set it to False on all the other nodes, otherwise the negotiator would fail to parse NEGOTIATOR_PRE_JOB_RANK into a float. So I check that it isn't undefined, then check that it is true. This way you could set NRAOGLIDEIN to False if you wanted.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory - 10000000 * ((NRAOGLIDEIN =!= UNDEFINED) && (NRAOGLIDEIN == True))
I configured our pilot.sh script to add the NRAOGLIDEIN = True key/value pair to a node when it glides in to HTCondor. That is the simplest and best place to set this I think.
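For completeness, the glidein-side half is just the custom attribute advertised by the pilot config (this matches the STARTD_ATTRS discussion further down):
NRAOGLIDEIN = True
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN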
K8s kubernetes
2024-04-15 krowe: There is a lot of talk around NRAO about k8s these days. Can you explain if/how HTCondor works with k8s? I'm not suggesting we run HTCondor on top of k8s but I would like to know the options.
Condor and k8s have different goals: Condor runs an effectively unbounded number of jobs, each for a finite time, while k8s runs a finite number of services, each for an unbounded time.
There is some support in k8s for running batch jobs but it isn't well formed yet. Running the condor services, like the CM, in k8s can make some sense.
The new hotness is using eBPF to change routing tables.
See retired nodes
2024-04-15 krowe: Say I set a few nodes to offline with a command like condor_off -startd -peaceful -name nmpost120. How can I later check to see which nodes are offline?
- condor_status -offline returns nothing
- condor_status -long nmpost120 returns nothing about being offline
- The following shows nodes where startd has actually stopped but it doesn't show nodes that are set offline but still running jobs (e.g. Retiring)
- condor_status -master -constraint 'STARTD_StartTime == 0'
- This shows nodes that are set offline but still running jobs (a.k.a. Retiring)
- condor_status |grep Retiring
ANSWER: 2022-06-27
condor_status -const 'Activity == "Retiring"'
Offline ads are a way for HTCondor to update the status of a node after the startd has exited.
condor_drain -peaceful # CHTC is working on this. I think this might be the best solution.
Try this: condor_status -constraint 'PartitionableSlot && Cpus && DetectedCpus && State == "Retiring"'
or this: condor_status -const 'PartitionableSlot && State == "Retiring"' -af Name DetectedCpus Cpus
or: condor_status -const 'PartitionableSlot && Activity == "Retiring"' -af Name Cpus DetectedCpus
or: condor_status -const 'partitionableSlot && Activity == "Retiring" && cpus == DetectedCpus'
None of these actually show nodes that have finished draining, i.e. nodes that were in state Retiring and are now done running jobs.
ANSWER: This seems to work fairly well. Not sure if it is perfect or not condor_status -master -constraint 'STARTD_StartTime == 0'
Condor_reboot?
Is there such a thing? Slurm has a nice one `scontrol reboot HOSTNAME`. I know it might not be the condor way, but thought I would ask.
ANSWER: https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#MASTER_SHUTDOWN_%3CName%3E and https://htcondor.readthedocs.io/en/latest/man-pages/condor_set_shutdown.html maybe do the latter and then the former and possibly combined with condor_off -peaceful. I'll need to play with it when I feel better.
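A hedged sketch of that recipe (the program name REBOOT is arbitrary and the exact condor_off flags are worth verifying against the man pages):
In the EP config:
MASTER_SHUTDOWN_REBOOT = /sbin/reboot
Then, from an admin host:
condor_set_shutdown -exec REBOOT -name nmpost120
condor_off -master -name nmpost120
possibly with -peaceful so running jobs finish first.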
Felipe's code
Felipe to share his job visualization software with Greg and maybe present at Throughput 2024.
https://github.com/ARDG-NRAO/LibRA/tree/main/frameworks/htclean/read_htclean_logs
Versions and falling behind
We are still using HTCondor-10.0.2. How far can/should we fall behind before catching up again?
ANSWER: Version 24 is coming out around condor week in 2024. It is suggested to drift no more than one major version, e.g. don't be older than 23 once 24 is available.
Sam's question
A DAG of three nodes: fetch -> envoy -> deliver. Submit host and cluster are far apart, and we need to propagate large quantities of data from one node to the next. How do we make this transfer quickly (i.e. without going through the submit host) without knowing the data's location at submit time?
krowe: Why do this as a DAG? Why not make it one job instead? Collapsing the DAG into one job has the advantage that it can use the local condor scratch area and can easily restart if the job fails, without needing to clean anything up. And of course making it one job means all the steps know where the data is.
Greg: condor_chirp set_job_attr attributeName 'Value'. You could do something like
condor_chirp set_job_attr DataLocation '"/path/to/something"'
or
condor_chirp put_file local remote
Each DAG has a prescript that runs before the dag nodes.
Another idea is to define the directory before submitting the job (e.g. /lustre/naasc/.../jobid)
Condor history for crashed node
We have nodes crashing sometimes. 1. Should HTCondor recover from a crashed node? Will the jobs be restarted somewhere else? 2. How can I see what jobs were running on a node when it crashed?
How about this
condor_history -name mcilroy -const "stringListMember(\"alias=nmpost091.aoc.nrao.edu\", StarterIpAddr, \"&\") == true"
ANSWER: There is a global event log, but it has to be enabled and isn't in our case: EVENT_LOG = $(LOG)/EventLog
ANSWER: show jobs that have restarted condor_q -name mcilroy -allusers -const 'NumShadowStarts > 1'
STARTD_ATTRS in glidein nodes
We add the following line to /etc/condor/condor_config on all our Slurm nodes so that if they get called as a glidein node, they can set some special glidein settings.
LOCAL_CONFIG_FILE = /var/run/condor/condor_config.local
Our /etc/condor/config.d/99-nrao file effectively sets the following
STARTD_ATTRS = PoolName NRAO_TRANSFER_HOST HASLUSTRE BATCH
Our /var/run/condor/condor_config.local, which is run by glidein nodes, sets the following
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN
The problem is glidein nodes don't get all the STARTD_ATTRS set by 99-nrao. They just get NRAOGLIDEIN. It is as if condor_master reads 99-nrao to set its STARTD_ATTRS, then reads condor_config.local to set its STARTD_ATTRS again but without expanding $(STARTD_ATTRS).
ANSWER: The last line in /var/run/condor/condor_config.local is re-writing STARTD_ATTRS as
STARTD_ATTRS = NRAOGLIDEIN
It should include $(STARTD_ATTRS), i.e. STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN
Output to two places
Some of our pipeline jobs don't set should_transfer_files = YES because they need to transfer some output to an area for the Analysts to look at and some other output (maybe a subset) to a different area for the user to look at. Is there a condor way to do this? transfer_output_remaps?
ANSWER: Greg doesn't think there is a Condor way to do this. We could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
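transfer_output_remaps can at least redirect individual output files to different destinations; a hedged sketch with invented file names and paths:
should_transfer_files = YES
transfer_output_files = weblog.tgz, products.tar
transfer_output_remaps = "weblog.tgz = /lustre/aoc/analysts/weblog.tgz; products.tar = /lustre/aoc/users/krowe/products.tar"
Both destinations still have to be reachable from the AP, so this redirects output rather than truly writing one file to two places.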
Pelican?
Felipe is playing with it and we will probably want it at NRAO.
ANSWER: Greg will ask around.
RHEL8 Crashing
We have had many NMT VLASS nodes crash since we upgraded to RHEL8. I think the nodes were busy when they crashed. So I changed our SLOT_TYPE_1 from 100% to 95%. Is this a good idea?
ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95% and put SLOT_TYPE_1=100% again.
getenv
Did it change since 10.0? Can we still use getenv in DAGs or regular jobs?
#krowe Nov 5 2024: getenv no longer includes your entire environment as of version 10.7 or so. Instead it only includes the environment variables you list with the "ENV GET" syntax in the .dag file.
https://git.ligo.org/groups/computing/-/epics/30
ANSWER: Yes this is true. CHTC would like users to stop using getenv=true. There may be a knob to restore the old behavior.
DONE: check out docs and remove getenv=true
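A hedged example of the newer style (the variable names are placeholders, and the list form of getenv assumes a recent enough version):
In the .dag file:
ENV GET PATH HOME
In a submit description file, list only what is needed:
getenv = PATH, HOME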
...