Throughput Week 2024 July 8-12
...
Current Questions
Resubmitting Jobs
I have an example in
/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433
of a job that failed on nmpost106 and was then resubmitted by HTCondor on nmpost105. The problem is that the job actually did finish; it just hit an error transferring the files back. So when the job was resubmitted, it copied over an almost complete run of CASA, which makes a mess of things. I would rather HTCondor just fail and not resubmit the job. How can I do that?
022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>
ANSWER: Maybe one of the following:
on_exit_hold = some_expression
periodic_hold = NumShadowStarts > 5
periodic_hold = NumJobStarts > 5
Or use a startd cron job that checks for IdM and offlines the node if needed:
https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron
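A minimal submit-file sketch of the hold-instead-of-restart idea, assuming a single execution attempt is all we want (NumJobStarts is a standard job attribute; the threshold and the reason string are our choices):
# Hold the job instead of letting it run again after a disconnect/failed transfer
periodic_hold = NumJobStarts > 1
periodic_hold_reason = "Job restarted after a failed output transfer; holding for manual inspection"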
RedHat8 Only
Say we have a few RedHat8 nodes and we only want jobs to run on those nodes that request RedHat8 with
requirements = (OpSysAndVer == "RedHat8")
I know I could set up a partition like we have done with VLASS but since HTCondor already has an OS knob, can I use that?
Setting RedHat8 in the job requirements guarantees the job will run on a RedHat8 node, but how do I make that node not run jobs that don't specify the OS they want?
The following didn't do what I wanted.
START = ($(START)) && (TARGET.OpSysAndVer =?= "RedHat8")
Then I thought I needed to specify jobs where OpSysAndVer is not Undefined but that didn't work either. Either of the following do prevent jobs that don't specify an OS from running on the node but they also prevent jobs that DO specify an OS via either OpSysAndVer or OpSysMajorVer respectively.
START = ($(START)) && (TARGET.OpSysAndVer isnt UNDEFINED)
START = ($(START)) && (TARGET.OpSysMajorVer isnt UNDEFINED)
A better long-term solution is probably for our jobs (VLASS, VLA calibration, ingestion, etc.) to ask for the OS they want if they care. Then they can test new OSes when they want and we can upgrade OSes on our own schedule (to a certain point). I think asking them to start requesting the OS they want now is not going to happen, but maybe by the time RedHat9 is an option they and we will be ready for this.
ANSWER: unparse takes a ClassAd expression and turns it into a string; then use a regex on it looking for OpSysAndVer.
Is this the right syntax? Probably not, as it doesn't work:
START = ($(START)) && (regexp(".*RedHat8.*", unparse(TARGET.Requirements)))
Greg thinks this should work. We will poke at it.
The following DOES WORK in the sense that it matches anything.
START = ($(START)) && (regexp(".", unparse(TARGET.Requirements)))
None of these work
START = ($(START)) && (regexp(".*RedHat8.*", unparse(Requirements)))
START = ($(START)) && (regexp(".*a.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(TARGET.Requirements)))
START = ($(START)) && (regexp("\(\(OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("(.*)RedHat8(.*)", unparse(Requirements)))
START = ($(START)) && (regexp("RedHat8", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "m"))
START = ($(START)) && (regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements)))
START = $(START) && regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements))
#START = $(START) && debug(regexp(".*RedHat8.*", unparse(TARGET.Requirements)))
This should also work
in the config file
START = $(START) && target.WantTORunOnRedHat8only
Submit file
My.WantToRunonRedHat8Only = true
But I would rather not have to add yet more attributes to the EPs. I would like to use the existing OS attribute that HTCondor provides.
Wasn't there a change from PCRE to PCRE2 or something like that? Could that be causing the problem? 2023-11-13 Greg doesn't think so.
2024-01-03 krowe: Can we use a container like this? How does PATh do this?
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest"
output_destination and stdout/stderr
It used to be that once you set output_destination = someplugin:// then that plugin was responsible for transferring all files even stdout and stderr. That no longer seems to be the case as of version 23. My nraorsync transfer plugin has code in it looking for _condor_stdout and _condor_stderr as arguments but never sees them with version 23. The stdout and stderr files are copied back to the submit directory instead of letting my plugin transfer them.
This is a change. I am not sure if it affects us adversely or not but can we unchange this?
ANSWER: from Greg "After some archeology, it turns out that the change so that a file transfer plugin requesting to transfer the whole sandbox no longer sees stdout/stderr is intentional, and was asked for by several users. The current workaround is to explicitly list the plugin in the stdout/stderr lines of the submit file, e.g."
output = nraorsync://some_location/stdout
error = nraorsync://some_location/stderr
This seems like it should work but my plugin produces errors. Probably my fault.
tokens and collector.locate
It seems that if the submit host is HTC23, you need a user token in order for the API (nraorsync specifically) to locate the schedd.
import os
import classad
import htcondor

def upload_file():
    # Read the job ad that the starter hands to transfer plugins via _CONDOR_JOB_AD
    try:
        ads = classad.parseAds(open(os.environ['_CONDOR_JOB_AD'], 'r'))
        for ad in ads:
            try:
                globaljobid = str(ad['GlobalJobId'])
                print("DEBUG: globaljobid is", globaljobid)
            except Exception:
                return -1
    except Exception:
        return -1

    print("DEBUG: upload_file(): step 1\n")
    # GlobalJobId looks like "submithost#cluster.proc#timestamp"
    submithost = globaljobid.split('#')[0]
    print("DEBUG: submithost is", submithost)
    collector = htcondor.Collector()
    print("DEBUG: collector is", collector)
    schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, submithost)
    print("DEBUG: schedd_ad is ", schedd_ad)

upload_file()
This code works if both the AP and EP are version 10. But if the AP is version 23 then it fails whether the EP is version 10 or version 23. It works with version 23 only if I have a ~/.condor/tokens.d/nmpost token. Why do I need a user token to run collector.locate against a schedd?
I was going to test this on CHTC but I can't seem to get an interactive job on CHTC anymore.
DONE: send Greg error output and security config
transfer_output_files change in version 23
My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files. Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy. But with version 23, this no longer works. I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files. Is that right? If so, I think I can work around it. Just wanted to know.
ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed earlier or later than before. Greg will see if there is a better, first-class way to trigger transfers.
DONE: We will use condor_transfer since it needs to be there anyway.
Installing version 23
I am looking at upgrading from version 10 to 23 LTS. I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only. It doesn't provide repos to install the LTS.
https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html
ANSWER: Greg will find it and get back to me.
DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
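For reference, a hedged sketch of a dnf .repo entry pointing at that LTS channel (the repo id and name are made up; add gpgcheck/gpgkey settings per the HTCondor install docs):
[htcondor-23.0-lts]
name = HTCondor 23.0 LTS (EL8)
baseurl = https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
enabled = 1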
Virtual memory vs RSS
Looks like condor is reporting RSS but that may actually be virtual memory. At least according to Felipe's tests.
ANSWER: Access to the cgroup information on the nmpost cluster is good because condor is running as root, and condor reports the RSS accurately. But on systems using glidein, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may be different than memory reporting on the nmpost cluster. On glide-in jobs condor reports the virtual memory across all the processes in the job.
CPU usage
Felipe has had jobs put on hold for too much cpu usage.
runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.
runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.
GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.
2024-09-16: Felipe asked about this again.
Missing batch_name
A DAG job, submitted with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239. It is just the one job; all the others submitted from the same template do show batch names.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270
nmpost-master krowe >condor_q -dag -name mcilroy -g -all
...
vlapipe vlass_ql.dag+370186 7/16 10:30 1 1 _ _ 3 370193.0
vlapipe vlass_ql.dag+370191 7/16 10:31 1 1 _ _ 3 370194.0
vlapipe DAG: 371239 7/16 10:56 1 1 _ _ 3 371536.0
...
GREG: Probably a condor bug. Try submitting it again to see if the name is missing again.
WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'
DAG failed to submit
Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job. condor.log is empty.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742
ANSWERS: Perhaps the schedd was too busy to respond. Need more resources in the workflow container?
Need to handle error codes from condor_submit_dag: 0 is good, 1 is bad. (chausman) See the sketch below.
Set up /usr/bin/mail on mcilroy so that it works. Condor will use this to send mail to root when it encounters an error. Need to submit a jira ticket to SSA. (krowe)
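A minimal sketch of that exit-code check, assuming the workflow wrapper is a shell script (the DAG filename and mail recipient are illustrative):
#!/bin/sh
# condor_submit_dag exits 0 on success and non-zero on failure
if ! condor_submit_dag vlass_ql.dag; then
    echo "condor_submit_dag failed in $(pwd)" | mail -s "DAG submit failure" root
    exit 1
fi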
Constant processing
Our workflows have a process called "ingestion" that puts data into our archive. There are almost always ingestion processes running or needing to run, and we don't want them to get stalled because of other jobs. Both ingestion and the other jobs run as the same user, "vlapipe". I thought about setting a high priority in the ingestion submit description file but that won't guarantee that ingestion always runs, especially since we don't do preemption. So my current thinking is to have a dedicated node for ingestion. Can you think of a better solution?
...
ANSWER: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs
condor_userprio
We want a user (vlapipe) to always have higher priority than other users. I see we can set this with condor_userprio but is that change permanent?
ANSWER: There is no config file for this. Set the priority_factor of vlapipe to 1. That is saved on disk and should persist through reboots and upgrades.
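A hedged sketch of the command (the user name may need the accounting domain appended, e.g. vlapipe@nrao.edu, depending on how users show up in condor_userprio):
# Priority factors persist in the negotiator's accounting state
condor_userprio -setfactor vlapipe@nrao.edu 1.0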
...
I see that when a job starts, the execution point (radial001) uses our nraorsync plugin to download the files. This is fine and good. When the job is finished, the execution point (radial001) uses our nraorsync plugin to upload the files, also fine and good. But then the RADIAL schedd (radialhead) also runs our nraorsync plugin to upload files. This causes problems because radialhead doesn't have the _CONDOR_JOB_AD environment variable and the plugin dies. Why is the remote schedd running the plugin and is there a way to prevent it from doing so?
Greg understands this and will ask the HTCondor-C folks about it.
Greg thinks it is a bug and will talk to our HTCondor-C people.
2023-08-07: Greg said the HTCondor-C people agree this is a bug and will work on it.
2023-09-25 krowe: send Greg my exact procedure to reproduce this.
2023-10-02 krowe: Sent Greg an example that fails. Turns out it is intermittent.
2024-01-22 krowe: will send email to the condor list
ANSWER: It was K. Scott all along. I now have HTCondor-C working from the nmpost and testpost clusters to the radial cluster using my nraorsync plugin to transfer both input and output files. The reason the remote AP (radialhead) was running the nraorsync plugin was because I defined it in the condor config like so.
FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py
I probably did this early in my HTCondor-C testing not knowing what I was doing. I commented this out, restarted condor, and now everything seems to be working properly.
Quotes in DAG VARS
I was helping SSA with a syntax problem between HTCondor-9 and HTCondor-10 and I was wondering if you had any thoughts on it. They have a dag with lines like this
JOB SqDeg2/J232156-603000 split.condor
VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000"
Then they set that split_dir VAR to a variable in the submit description file like this
SPLIT_DIR = "$(split_dir)"
The problem seems to be the quotes around $(split_dir). It works fine in HTCondor-9 but with HTCondor-10 they get an error like this in their pims_split.dag.dagman.out file
02/28/24 16:26:02 submit error: Submit:-1:Unexpected characters following doublequote. Did you forget to escape the double-quote by repeating it? Here is the quote and trailing characters: "SqDeg2/J232156+603000""
Looking at the documentation https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-10-0.html#version-10-0-0 it's clear they shouldn't be putting quotes around $(split_dir). So clearly something changed with version 10. Either a change to the syntax or, my guess, just a stricter parser.
Any thoughts on this?
ANSWER: Greg doesn't know why this changed but thinks we are now doing the right thing.
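For reference, the fix in the submit description file is simply to drop the quotes so the DAG VARS value is substituted as-is:
SPLIT_DIR = $(split_dir)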
OSDF Cache
Is there a way to prefer a job to run on a machine where the data is cached?
ANSWER: There is no knob in HTCondor for this but CHTC would like to add one. They would like to glide in OSDF caches like they glide in nodes. But these are all long-term ideas.
GPU names
HTCondor seems to have short names for GPUs which are the first part of the UUID. Is there a way to use/get the full UUID? This would make it consistent with nvidia-smi.
ANSWER: Greg thinks you can use the full UUID with HTCondor.
But CUDA_VISIBLE_DEVICES only provides the short UUID name. Is there a way to get the long UUID name from CUDA_VISIBLE_DEVICES?
ANSWER: You can't use id 0 because 0 will always be the first GPU that HTCondor chose for you. Some new release of HTCondor supports NVIDIA_VISIBLE_DEVICES which should be the full UUID.
Big Data
Are we alone in needing to copy in and out many GBs per job? Do other institutions have this problem as well? Does CHTC have any suggestions to help? Sanja will ask this of Bockleman as well.
ANSWER: Greg thinks our transfer times are not uncommon but our processing time is shorter than many. Other jobs have similar data sizes. Some other jobs have similar transfer times but process for many hours. Maybe we can constrain our jobs to only run on sites that seem to transfer quickly. Greg is also interested in why some sites seem slower than others. Is that actually site specific or is it time specific or...
Felipe does have a long list of excluded sites in his run just for this reason. Greg would like a more declarative solution like "please run on fast transfer hosts", especially if this is dynamic.
GPUs_Capability
We have a host (testpost001) with both a Tesla T4 (Capability=7.5) and a Tesla L4 (Capability=8.9) and when I run condor_gpu_discovery -prop I see something like the following
DetectedGPUs="GPU-ddc998f9, GPU-40331b00"
Common=[ DriverVersion=12.20; ECCEnabled=true; MaxSupportedVersion=12020; ]
GPU_40331b00=[ id="GPU-40331b00"; Capability=7.5; DeviceName="Tesla T4"; DevicePciBusId="0000:3B:00.0"; DeviceUuid="40331b00-c3b6-fa9a-b8fd-33bec2fcd29c"; GlobalMemoryMb=14931; ]
GPU_ddc998f9=[ id="GPU-ddc998f9"; Capability=8.9; DeviceName="NVIDIA L4"; DevicePciBusId="0000:5E:00.0"; DeviceUuid="ddc998f9-99e2-d9c1-04e3-7cc023a2aa5f"; GlobalMemoryMb=22491; ]
The problem is `condor_status -compact -constraint 'GPUs_Capability >= 7.0'` doesn't show testpost001. It does show testpost001 when I physically remove the T4.
Requesting a specific GPU with `RequireGPUs = (Capability >= 8.0)` or `RequireGPUs = (Capability <= 8.0)` does work however so maybe this is just a condor_status issue.
We then replaced the L4 with a second T4 and then GPUs_Capability functioned as expected.
Can condor handle two different capabilities on the same node?
ANSWER: Greg will look into it. They only recently added support for different GPUs on the same node. So this is going to take some time to get support in condor_status. Yes this is just a condor_status issue.
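A hedged submit-file sketch of the per-job constraint that did work, using the capability threshold from the example above:
request_GPUs = 1
RequireGPUs = (Capability >= 8.0)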
Priority for Glidein Nodes
We have a factory.sh script that glides in Slurm nodes to HTCondor as needed. The problem is that HTCondor then seems to prefer these nodes to the regular HTCondor nodes, such that after a while there are several free regular HTCondor nodes and three glide-in nodes. Is there a way to set a lower priority on glide-in nodes so that HTCondor only chooses them if the regular HTCondor nodes are all busy? I am going to offline the glide-in nodes to see if that works but that is a manual solution, not an automated one.
I would think NEGOTIATOR_PRE_JOB_RANK would be the trick but we already set that on the CMs to the following so that RANK expressions in submit description files are honored and negotiation will prefer NMT nodes over DSOC nodes if possible.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory
ANSWER: NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory + 100000 * (site == "not-slurm")
I don't like setting not-slurm on the dedicated HTCondor nodes. I would rather set something like "glidein=true" or "glidein=1000" in the default 99-nrao config file and then remove it from the 99-nrao config in snapshots for dedicated HTCondor nodes. But that assumes that the base 99-nrao is for NM. Since we are sharing an image with CV we can't assume that. Therefore every node, whether dedicated HTCondor or not, will need a 99-nrao in its snapshot area.
SOLUTION
This seems to work. If I set NRAOGLIDEIN = True on a node, then that node will be chosen last. You may ask why not just add 10000000 * (NRAOGLIDEIN == True). If I did that I would have to also set it to false on all the other nodes, otherwise the negotiator would fail to parse NEGOTIATOR_PRE_JOB_RANK into a float. So I check if it isn't undefined and then check if it is true. This way you could set NRAOGLIDEIN to False if you wanted.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory - 10000000 * ((NRAOGLIDEIN =!= UNDEFINED) && (NRAOGLIDEIN == True))
I configured our pilot.sh script to add the NRAOGLIDEIN = True key/value pair to a node when it glides in to HTCondor. That is the simplest and best place to set this I think.
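A sketch of the local config a glide-in node ends up with, combining the pilot.sh change above with the STARTD_ATTRS line described in the glidein section below:
# /var/run/condor/condor_config.local written for a glide-in node
NRAOGLIDEIN = True
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN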
K8s kubernetes
2024-04-15 krowe: There is a lot of talk around NRAO about k8s these days. Can you explain if/how HTCondor works with k8s? I'm not suggesting we run HTCondor on top of k8s but I would like to know the options.
Condor and k8s have different goals. Condor runs an effectively infinite number of jobs, each for a finite time; k8s runs a finite number of services for an infinite time.
There is some support in k8s to run batch jobs but it isn't well formed yet. Running the condor services, like the CM, in k8s can make some sense.
The new hotness is using eBPF to change routing tables.
See retired nodes
2024-04-15 krowe: Say I set a few nodes to offline with a command like condor_off -startd -peaceful -name nmpost120. How can I later check to see which nodes are offline?
- condor_status -offline returns nothing
- condor_status -long nmpost120 returns nothing about being offline
- The following shows nodes where startd has actually stopped but it doesn't show nodes that are set offline but still running jobs (e.g. Retiring)
- condor_status -master -constraint 'STARTD_StartTime == 0'
- This shows nodes that are set offline but still running jobs (a.k.a. Retiring)
- condor_status |grep Retiring
ANSWER: 2022-06-27
condor_status -const 'Activity == "Retiring"'
Offline ads are a way for HTCondor to update the status of a node after the startd has exited.
condor_drain -peaceful # CHTC is working on this. I think this might be the best solution.
Try this: condor_status -constraint 'PartitionableSlot && Cpus && DetectedCpus && State == "Retiring"'
or this: condor_status -const 'PartitionableSlot && State == "Retiring"' -af Name DetectedCpus Cpus
or: condor_status -const 'PartitionableSlot && Activity == "Retiring"' -af Name Cpus DetectedCpus
or: condor_status -const 'partitionableSlot && Activity == "Retiring" && cpus == DetectedCpus'
None of these actually show nodes that have drained, i.e. nodes that were in state Retiring and are now done running jobs.
ANSWER: This seems to work fairly well, though not sure if it is perfect: condor_status -master -constraint 'STARTD_StartTime == 0'
Condor_reboot?
Is there such a thing? Slurm has a nice one `scontrol reboot HOSTNAME`. I know it might not be the condor way, but thought I would ask.
ANSWER: https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#MASTER_SHUTDOWN_%3CName%3E and https://htcondor.readthedocs.io/en/latest/man-pages/condor_set_shutdown.html maybe do the latter and then the former and possibly combined with condor_off -peaceful. I'll need to play with it when I feel better.
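A hedged sketch of how those two knobs fit together, assuming a root-owned reboot script exists on the node (the script path and the shutdown name "reboot" are illustrative):
# EP config: program the condor_master runs when shut down with this named shutdown
MASTER_SHUTDOWN_REBOOT = /usr/local/sbin/condor_reboot_node.sh
# Then, from an admin host:
#   condor_set_shutdown -exec reboot -name nmpost120
#   condor_off -master -peaceful -name nmpost120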
Felipe's code
Felipe to share his job visualization software with Greg and maybe present at Throughput 2024.
https://github.com/ARDG-NRAO/LibRA/tree/main/frameworks/htclean/read_htclean_logs
Versions and falling behind
We are still using HTCondor-10.0.2. How far can/should we fall behind before catching up again?
ANSWER: Version 24 is coming out around condor week in 2024. It is suggested to drift no more than one major version, e.g. don't be older than 23 once 24 is available.
Sam's question
A DAG of three nodes: fetch -> envoy -> deliver. Submit host and cluster are far apart, and we need to propagate large quantities of data from one node to the next. How do we make this transfer quickly (i.e. without going through the submit host) without knowing the data's location at submit time?
krowe: Why do this as a DAG? Why not make it one job instead of a DAG? Collapsing the DAG into just one job has the advantage that it can use the local condor scratch area and can easily restart if the job fails without needing to clean anything up. And of course making it one job means all the steps know where the data is.
Greg: use condor_chirp, e.g. condor_chirp set_job_attr attributeName 'Value'. You could do something like:
condor_chirp set_job_attr DataLocation '"/path/to/something"'
or
condor_chirp put_file local remote
Each DAG has a prescript that runs before the dag nodes.
Another idea is to define the directory before submitting the job (e.g. /lustre/naasc/.../jobid)
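One hedged way to wire the chirp idea together (the attribute name DataLocation and the use of condor_history to read it back afterwards are assumptions, not an established recipe):
# Inside the fetch job, record where the data landed
condor_chirp set_job_attr DataLocation '"/path/to/something"'
# Later, e.g. from a PRE script on the AP, read it back out of the fetch job's ad
condor_history -limit 1 -af DataLocation <fetch-job-cluster.proc>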
Condor history for crashed node
We have nodes crashing sometimes. 1. Should HTCondor recover from a crashed node? Will the jobs be restarted somewhere else? 2. How can I see what jobs were running on a node when it crashed?
How about this
condor_history -name mcilroy -const "stringListMember(\"alias=nmpost091.aoc.nrao.edu\", StarterIpAddr, \"&\") == true"
ANSWER: There is a global event log but it has to be enabled, and it isn't in our case: EVENT_LOG = $(LOG)/EventLog
ANSWER: To show jobs that have restarted: condor_q -name mcilroy -allusers -const 'NumShadowStarts > 1'
STARTD_ATTRS in glidein nodes
We add the following line to /etc/condor/condor_config on all our Slurm nodes so that if they get called as a glidein node, they can set some special glidein settings.
LOCAL_CONFIG_FILE = /var/run/condor/condor_config.local
Our /etc/condor/config.d/99-nrao file effectively sets the following
STARTD_ATTRS = PoolName NRAO_TRANSFER_HOST HASLUSTRE BATCH
Our /var/run/condor/condor_config.local, which is run by glidein nodes, sets the following
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN
The problem is glidein nodes don't get all the STARTD_ATTRS set by 99-nrao. They just get NRAOGLIDEIN. It is like condor_master reads 99-nrao to set its STARTD_ATTRS, then reads condor_config.local to set its STARTD_ATTRS again but without accessing $(STARTD_ATTRS).
ANSWER: The last line in /var/run/condor/condor_config.local is re-writing STARTD_ATTRS. It should have $(STARTD_ATTRS) appended
STARTD_ATTRS = NRAOGLIDEIN
Output to two places
Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (maybe a subset) to a different area for the User to look at. Is there a condor way to do this? transfer_output_remaps?
ANSWER: Greg doesn't think there is a Condor way to do this. Could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
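A hedged sketch of that copy-plus-remap hack (file names and destination paths are purely illustrative):
# The job copies the subset to weblog_copy.tgz before it exits, then:
should_transfer_files = YES
transfer_output_files = products.tgz, weblog_copy.tgz
transfer_output_remaps = "products.tgz = /lustre/aoc/users/someuser/products.tgz; weblog_copy.tgz = /lustre/aoc/analysts/weblog_copy.tgz"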
Pelican?
Felipe is playing with it and we will probably want it at NRAO.
ANSWER: Greg will ask around.
RHEL8 Crashing
We have had many NMT VLASS nodes crash since we upgraded to RHEL8. I think the nodes were busy when they crashed. So I changed our SLOT_TYPE_1 from 100% to 95%. Is this a good idea?
ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95% and put SLOT_TYPE_1=100% again.
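A hedged sketch of that suggestion in the EP config (4096 MB is the value from the answer; RESERVED_MEMORY is in megabytes):
# Reserve memory for the OS instead of shrinking the slot
RESERVED_MEMORY = 4096
SLOT_TYPE_1 = 100%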
getenv
Did it change since 10.0? Can we still use getenv in DAGs or regular jobs?
krowe Nov 5 2024: getenv no longer includes your entire environment as of version 10.7 or so. Instead it only includes the environment variables you list, e.g. with the "ENV GET" syntax in the .dag file.
https://git.ligo.org/groups/computing/-/epics/30
ANSWER: Yes this is true. CHTC would like users to stop using getenv=true. There may be a knob to restore the old behavior.
DONE: check out docs and remove getenv=true
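A hedged sketch of the replacement for getenv = true (the variable names are just examples of what a job might actually need):
# Submit description file: list only the variables the job needs
getenv = HOME, PATH
# .dag file: pass specific variables from the submission environment through DAGMan
ENV GET HOME PATH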
...