...
Felipe does have a long list of excluded sites in his run just for this reason. Greg would like a more declarative solution like "please run on fast transfer hosts", especially if this is dynamic.
File Transfer Plugins and HTCondor-C
I see that when a job starts, the execution point (radial001) uses our nraorsync plugin to download the files. This is fine and good. When the job is finished, the execution point (radial001) uses our nraorsync plugin to upload the files, also fine and good. But then the RADIAL schedd (radialhead) also runs our nraorsync plugin to upload files. This causes problems because radialhead doesn't have the _CONDOR_JOB_AD environment variable and the plugin dies. Why is the remote schedd running the plugin and is there a way to prevent it from doing so?
Greg understands this and will ask the HTCondor-C folks about it.
Greg thinks it is a bug and will talk to the HTCondor-C people.
2023-08-07: Greg said the HTCondor-C people agree this is a bug and will work on it.
2023-09-25 krowe: send Greg my exact procedure to reproduce this.
2023-10-02 krowe: Sent Greg an example that fails. Turns out it is intermittent.
2024-01-22 krowe: will send email to the condor list
ANSWER: It was K. Scott all along. I now have HTCondor-C working from the nmpost and testpost clusters to the radial cluster using my nraorsync plugin to transfer both input and output files. The reason the remote AP (radialhead) was running the nraorsync plugin was because I defined it in the condor config like so.
FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py
I probably did this early in my HTCondor-C testing not knowing what I was doing. I commented this out, restarted condor, and now everything seems to be working properly.
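For reference, a quick way to check which file transfer plugins a given host will advertise is to query its config directly (standard condor_config_val usage, run on the host in question):
condor_config_val FILETRANSFER_PLUGINS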
GPUs_Capability
We have a host (testpost001) with both a Tesla T4 (Capability=7.5) and a Tesla L4 (Capability=8.9) and when I run condor_gpu_discovery -prop I see something like the following
DetectedGPUs="GPU-ddc998f9, GPU-40331b00"
Common=[ DriverVersion=12.20; ECCEnabled=true; MaxSupportedVersion=12020; ]
GPU_40331b00=[ id="GPU-40331b00"; Capability=7.5; DeviceName="Tesla T4"; DevicePciBusId="0000:3B:00.0"; DeviceUuid="40331b00-c3b6-fa9a-b8fd-33bec2fcd29c"; GlobalMemoryMb=14931; ]
GPU_ddc998f9=[ id="GPU-ddc998f9"; Capability=8.9; DeviceName="NVIDIA L4"; DevicePciBusId="0000:5E:00.0"; DeviceUuid="ddc998f9-99e2-d9c1-04e3-7cc023a2aa5f"; GlobalMemoryMb=22491; ]
The problem is `condor_status -compact -constraint 'GPUs_Capability >= 7.0'` doesn't show testpost001. It does show testpost001 when I physically remove the T4.
Requesting a specific GPU with `RequireGPUs = (Capability >= 8.0)` or `RequireGPUs = (Capability <= 8.0)` does work, however, so maybe this is just a condor_status issue.
We then replaced the L4 with a second T4 and then GPUs_Capability functioned as expected.
Can condor handle two different capabilities on the same node?
ANSWER: Greg will look into it. They only recently added support for different GPUs on the same node. So this is going to take some time to get support in condor_status.
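For completeness, a minimal submit-file sketch of the per-job form that does work (mirroring the syntax above; the Capability threshold is just an example):
request_GPUs = 1
RequireGPUs = (Capability >= 8.0)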
Resubmitting Jobs
I have an example in
/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433
of a job that failed on nmpost106 and was then resubmitted by HTCondor on nmpost105. The problem is that the job actually did finish; it just got an error transferring all the files back. So when the job was resubmitted, it copied over an almost-complete run of CASA, which sort of makes a mess of things. I would rather HTCondor just fail and not resubmit the job. How can I do that?
022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>
...
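One possible workaround (a hedged sketch only, not something Greg confirmed): hold the job if it has already executed once and is back in the queue, so a failed output transfer can't silently trigger a second run.
# sketch: NumJobStarts counts how many times the job has started; JobStatus == 1 means Idle
# note that periodic expressions are only evaluated on an interval, so this is not airtight
periodic_hold = (NumJobStarts >= 1) && (JobStatus == 1)
periodic_hold_reason = "Job already ran once; refusing to rerun automatically"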
testpost001 krowe >/usr/bin/time -f "real %e\nuser %U\nkernel %S\nwaits %w" tar xf bzip2.tgz
real 311.63
user 310.06
kernel 16.43
waits 1128218
ANSWER: Greg is as surprised as we are.
Federation
Does this look correct?
https://staff.nrao.edu/wiki/bin/view/NM/HTCondor-federations
ANSWER: yes
Reservations
Reservations from the Double Tree were for Sunday, Jul. 9 through Thursday, Jul. 13 (4 nights). But I need at least until Friday, Jul. 14, right?
ANSWER: Greg will look into it.
Scatter/Gather problem
At some point we will have a scatter/gather problem. For example, we will launch 10^5 jobs, each of which will produce some output that will need to be summed with the output of all the other jobs. Launching 10^5 jobs is not hard. Doing the summation is not hard. Moving all the output around is the hard part.
One idea is to have a dedicated process running to which each job uploads its output. This process could sum output as it arrives; it doesn't need to wait until all the output is done. It would be nice if this process also ran in the same HTCondor environment (PATh, CHTC, etc) because that would keep all the data "close" and presumably keep transfer times short.
ANSWER: DAGs and sub-DAGs, of course. The provisioner node was created to submit jobs into the cloud. It exists as long as the DAG is running.
Nodes talking to each other becomes difficult in federated clusters.
https://ccl.cse.nd.edu/software/taskvine/
makeflow
Astra suggests something like Apache Beam for ngVLA data which is more of a data approach than a compute approach.
What is annex?
Yet another way to federate condor clusters. Annexes are useful when you have an allocation on a system (e.g. AWS) where you have the ability to start jobs. You give annex your credentials and how many workers you want. Annex will launch the startds and create a local central manager. It then configures flocking from your local pool to the remote pool. So in a sense an annex is an ephemeral flocking relationship for just the one person setting it up.
Condor and Kubernetes (k8s)
Condor supports Docker, Singularity, and Apptainer. In OSG the Central Managers are in k8s and most of the Schedds are also in k8s. In the OSPool some worker nodes are in k8s and they are allowed to run unprivileged Apptainer but not Docker (because of the privileges it requires). PATh has worker nodes in k8s. They are backfilled on demand.
NSF Archive Storage
Are you aware of any archive storage funded by NSF we could use? We are looking for off-site backup of our archive (NGAS).
ANSWER: Greg doesn't know of one.
Hung Jobs and viewing stdout
We have some jobs that seem to hang, possibly because of a race condition or whatnot. I'm pretty sure it is our fault. But the only way I know to tell is to log in to the node and look at _condor_stdout in the scratch area. That gets pretty tedious when I want to check hundreds of jobs to see which ones are hung. Does condor have a way to check the _condor_stdout of a job from the submit host so I can do this programmatically?
I thought condor_tail would be the solution but it doesn't display anything.
ANSWER: condor_ssh_to_job might be able to be used non-interactively. I will try that.
ANSWER: use the FULL jobid with condor_tail, e.g. condor_tail 12345.0. Greg has submitted a patch so you don't have to specify the ProcId (.0).
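A rough shell sketch for checking many jobs at once from the AP, using the full ClusterId.ProcId form noted above:
# print the tail of stdout for every running job in my queue
for id in $(condor_q -run -format '%d.' ClusterId -format '%d\n' ProcId); do
    echo "=== $id ==="
    condor_tail $id
done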
Bug: condor_off -peaceful
testpost-cm-vml root >condor_off -peaceful -name testpost002
Sent "Set-Peaceful-Shutdown" command to startd testpost002.aoc.nrao.edu
Can't find address for schedd testpost002.aoc.nrao.edu
Can't find address for testpost002.aoc.nrao.edu
Perhaps you need to query another pool.
Yet it works without the -peaceful option
testpost-cm-vml root >condor_off -name testpost002
Sent "Kill-All-Daemons" command to master testpost002.aoc.nrao.edu
ANSWER: Add the -startd option, e.g. condor_off -peaceful -startd -name <hostname>. Greg thinks it might be a regression (another bug). This still happens even after I set all the CONDOR_HOST knobs to testpost-cm-vml.aoc.nrao.edu. So it is still a bug and not because of some silly config I had at NRAO.
File Transfer Plugins and HTCondor-C
Is there a way I can use our nraorsync plugin on radial001? Or something similar?
SOLUTION: ssh tunnels
Condor Week (aka Throughput Week)
July 10-14, 2023. Being co-run with the OSG all hands meeting. At the moment, it is not hybrid but entirely in-person. https://path-cc.io/htc23
PROVISIONER node
When I define a PROVISIONER node, that is the only node that runs. The others never run. Also, the PROVISIONER job always returns 1 "exited normally with status 1" even though it is just running /bin/sleep.
JOB node01 node01.htc
JOB node02 node02.htc
JOB node03 node03.htc
PARENT node01 CHILD node02 node03
PROVISIONER prov01 provisioner.htc
ANSWER: my prov01 job needs to indicate when it is ready with something like the following.
condor_qedit myJobId ProvisionerState 2
But execute hosts can't run condor_qedit so this really only works if you set universe to local or scheduler.
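A minimal sketch of what provisioner.htc could look like given the above (provision.sh is a hypothetical script that does the provisioning work and then marks itself ready):
# provisioner.htc (sketch) -- local universe so condor_qedit can reach the schedd
universe   = local
executable = provision.sh
arguments  = $(Cluster).$(Proc)
# provision.sh should end with something like: condor_qedit $1 ProvisionerState 2
queue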
Does CHTC have resources available for VLASS?
Our Single Epoch jobs:
- are parallelizable with OpenMPI
- need 64 GB of memory
- need ~150 GB of storage
- can take week(s) to run
- have no checkpointing
- may eventually use GPUs, but we are not there yet
Brian was not scared by this and gave us a form to fill out
https://chtc.cs.wisc.edu/uw-research-computing/form.html
ANSWER: Yes. We and Mark Lacy have started the process with CHTC for VLASS.
Annex to PATh
https://htcondor.org/experimental/ospool/byoc/path-facility
ANSWER: Greg doesn't know but he can connect me with someone who does.
Tod Miller is the person to ask about this.
Hold jobs that exceed disk request
ANSWER: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs
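That recipe works on the startd side. A simpler submit-side alternative (a sketch only, if per-job enforcement is enough) is a periodic hold on the job's own DiskUsage:
# both DiskUsage and RequestDisk are in KiB
periodic_hold = DiskUsage > RequestDisk
periodic_hold_reason = "Disk usage exceeded request_disk"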
condor_userprio
We want a user (vlapipe) to always have higher priority than other users. I see we can set this with condor_userprio but is that change permanent?
ANSWER: There is no config file for this. Set the priority_factor of vlapipe to 1. That is saved on disk and should persist through reboots and upgrades.
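For the record, the command form is roughly the following (the accounting domain here is an assumption; check condor_userprio output for the exact user name):
condor_userprio -setfactor vlapipe@aoc.nrao.edu 1.0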
Submitting jobs as other users
At some point in the future we will probably want the ability for a web process to launch condor jobs as different users. The web process will probably not be running as root. Does condor have a method for this or should we make our own setuid root thingy? Tell dlyons the answer.
ANSWER: HTCondor doesn't have anything for this. So it is up to us to do some suid-fu.
SSH keys with Duo
I tried following the link below to set up ssh such that I don't have to enter both my password and Duo every time I log in to CHTC. It doesn't create anything in ~/.ssh/connections after I log in. Thoughts?
https://chtc.cs.wisc.edu/uw-research-computing/configure-ssh
ANSWER: Greg doesn't know what to do here. We should ask Christina.
HTCondor-C and requirements
Submitting jobs from the container on shipman as vlapipe to the NRAO RADIAL prototype cluster seems to ignore requirements like the following. Is this expected?
requirements = (machine == "radial001.nrao.radial.local")
and
requirements = (VLASS == True)
+partition = "VLASS"
It also seems to ignore things like
request_memory = 100G
ANSWER: Prefix the attributes with remote_ (e.g. +remote_Requirements, +remote_RequestMemory) so they are applied at the remote schedd.
But I am still having problems.
This forces job to run on radial001
+remote_Requirements = ((machine == "radial001.nrao.radial.local"))
This runs on radialhead even though it only has 64G
+remote_RequestMemory = 102400
This runs on radialhead even though it doesn't have a GPU
request_gpus = 1
+remote_RequestGPUs = 1
ANSWER: This works
+remote_Requirements = ((machine == "radial001.nrao.radial.local") && memory > 102400)
as does this
+remote_Requirements = (memory > 102400)
but Greg will look into why +remote_RequestMemory doesn't work. It should.
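For context, a consolidated sketch of the HTCondor-C submit description being discussed (the grid_resource line is the standard HTCondor-C form; using the radialhead FQDN as both the remote schedd and collector name is an assumption):
universe      = grid
grid_resource = condor radialhead.nrao.radial.local radialhead.nrao.radial.local
+remote_Requirements = (machine == "radial001.nrao.radial.local") && (Memory > 102400)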
Select files to transfer dynamically according to job-slot match
We currently have separate builds of our GPU software for CUDA Capability 7.0 and 8.0, and our jobs specify that both builds should be transferred to the EP, so that the job executable selects the appropriate build to run based on the CUDA Capability of the assigned GPU. Is there a way to do this selection when the job is matched to a slot, so that only the necessary build is transferred according to the slot's CUDA Capability?
ANSWER: $() means expand this locally from the jobad. $$() means expand at job start time.
executable = my_binary.$$(GPUs_capability)
executable = my_binary.$$([int(GPUs_capability)]) # Felipe said this actually works
executable = my_binary.$$([ classad_express(GPUS_capability) ]) # Hopefully you don't need this
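Putting it together, a hedged submit sketch of the form Felipe reported working (the my_binary.7 / my_binary.8 naming is a placeholder for the two builds):
request_GPUs        = 1
# expands at match time, e.g. my_binary.7 on the T4 slot, my_binary.8 on the L4 slot
executable          = my_binary.$$([int(GPUs_Capability)])
transfer_executable = true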
CPU/GPU Balancing
We have 30 nodes in a rack at NMT with a power limit of 17 kW and we are able to hit that limit when all 720 cores (24 cores * 30 nodes) are busy. We want to add two GPUs to each node but that would almost certainly put us way over the power limit if each node had 22 cores and 2 GPUs busy. So is there a way to tell HTCondor to reserve X cores for each GPU? That way we could balance the power load.
JOB TRANSFORMS work per schedd so that wouldn't work on the startd side which is what we want.
IDEA: NUM_CPUS = 4 or some other small number greater than the number of GPUs but limiting enough to keep the power draw low.
ANSWER: There isn't a knob for this in HTCondor but Greg is interested in this and will look into this.
WORKAROUND: MODIFY_REQUEST_EXPR_REQUESTCPUS may help by ensuring each GPU job gets 8 cores or something like that.
MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, isUndefined(RequestGpus) ? {1} : {8, 16, 24, 32, 40})
That is, when a job comes into the startd, if it doesn't request any GPUs, allocate exactly as many cpu cores as it requests. Otherwise, allocate 8 times as many cpus as it requests.
This seems to work. If I ask for 0 GPUs and 4 CPUs, I am given 0 GPUs and 4 CPUs. If I ask for 1 GPU and don't ask for CPUs, I am given 1 GPU and 8 CPUs.
But if I ask for 2 GPUs and don't ask for CPUs, I still am only given 8 CPUs. I was expecting to be given 16 CPUs. This is probably fine as we are not planning on more than 1 GPU per job.
But if I ask for 1 GPU and 4 CPUs, I am given 1 GPU and 8 CPUs. That is probably acceptable.
2024-01-24 krowe: Assuming a node can draw up to 550 Watts when all 24 cores are busy and that node only draws 150 Watts when idle, and that we have 17,300 Watts available to us in an NMT rack,
- we should only need to reserve 3 cores per GPU in order to offset the 72 Watts of an Nvidia L4 GPU.
- This would waste 60 cores.
- Or at least I suggest starting with that and seeing what happens. Another alternative is we just turn off three nodes if we put one L4 in each node.
- This would waste 72 cores.
Upgrading
CHTC just upgrades to the latest version when it becomes available, right? Do you ever run into problems because of this? We are still using version 9 because I can't seem to schedule a time with our VLASS group to test version 10. Much less version 23.
ANSWER: yes. The idea is that CHTC's users are helping them test the latest versions.
Flocking to CHTC?
We may want to run VLASS jobs at CHTC. What is the best way to submit locally and run globally?
ANSWER: Greg thinks flocking is the best idea.
This will require port 9618 open to nmpost-master and probably a static NAT and external DNS name.
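A hedged sketch of the AP-side config this would involve (the CHTC central manager hostname is a placeholder; CHTC would also need to add our AP to their FLOCK_FROM list):
# on nmpost-master (sketch)
FLOCK_TO = $(FLOCK_TO), cm.chtc.wisc.edu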
External users vs staff
We are thinking about making a DMZ (I don't like that term) for observers. Does CHTC staff use the same cluster resources that CHTC observers (customers) use?
ANSWER: There is no airgap at CHTC; everyone uses the same cluster. Sometimes users use a different AP, but more for load balancing than security. Everyone does go through 2FA.
Does PATh Cache thingy(tm) (a.k.a. Stash) work outside of PATh?
I see HTCondor-10.x comes with a stash plugin. Does this mean we could read/write to stash from NRAO using HTCondor-10.x?
ANSWER: Greg thinks you can use stash remotely, like at our installation of HTCondor.
Curl_plugin doesn't do FTP
None of the following work. They either hang or produce errors. They work on the shell command line, except at CHTC where the squid server doesn't seem to grok FTP.
transfer_input_files = ftp://demo:password@test.rebex.net:/readme.txt
transfer_input_files = ftp://ftp:@ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp:@ftp.slackware.com:/welcome.msg
transfer_input_files = ftp://ftp.slackware.com:/welcome.msg
2024-02-05: Greg thinks this should work and will look into it.
ANSWER: 2024-02-06 Greg wrote "Just found the problem with ftp file transfer plugin. I'm afraid there's no easy workaround, but I've pushed a fix that will go into the next stable release."
...