Table of Contents
...
Current Questions
Pelican recursive
You need to add "Listing" to your Capabilities. It is singular, not plural as the documentation reads. Then
transfer_input_files = osdf:///nrao-ardg/krowe/dir?recursive
pools
This is pretty amazing. I can check the status and queue of OSG and PATh from nmpost-master
condor_status -pool cm-1.ospool.osg-htc.org
or
condor_status -pool htcondor-cm-path.osg.chtc.io
As long as I have a token for them in ~/.condor/tokens.d
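For reference, a quick sketch of what that looks like end to end (the schedd name in the condor_q line is a placeholder, not a real OSPool AP name):
# IDTOKENs for the remote pools live in the per-user token directory
ls ~/.condor/tokens.d/
# list the schedds known to the remote collector
condor_status -pool cm-1.ospool.osg-htc.org -schedd
# query a specific remote schedd's queue (AP name below is a placeholder)
condor_q -pool cm-1.ospool.osg-htc.org -name some-ap.example.org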
Health Check
Is there a health check mechanism for condor? For example, if the Lustre filesystem is unavailable on a node, can condor be told to down that node?
...
ANSWER: the output of a startd_cron job can change the machine ad. So we need a ClassAd attribute like 'Working = False', or change HASLUSTRE or something. Then have another system, perhaps a cron job on nmpost-master, that checks for nodes with 'Working = False' and drains them.
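Not something Greg spelled out, just a minimal sketch of the two pieces, assuming a check script at /usr/local/libexec/check_lustre.sh (hypothetical path) and a /lustre mount point:
# condor config on the execute nodes: run the check every 5 minutes
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) LUSTRE
STARTD_CRON_LUSTRE_EXECUTABLE = /usr/local/libexec/check_lustre.sh
STARTD_CRON_LUSTRE_PERIOD = 5m
# only start jobs when the attribute the script advertises is True
START = $(START) && (HasLustre =?= True)

#!/bin/sh
# check_lustre.sh: advertise HasLustre based on whether /lustre is mounted
if mountpoint -q /lustre; then
    echo 'HasLustre = True'
else
    echo 'HasLustre = False'
fi

# cron job on nmpost-master (sketch): drain anything not advertising HasLustre
condor_status -constraint 'HasLustre =!= True' -af Machine | sort -u | xargs -r -n1 condor_drain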
Using Pelican to replace nraorsync?
nraorsync does three things: uses rsync to write back only what has changed; uses the faster network (IB, 10g, etc.), since our AP, nmpost-master, doesn't have an external IP; and uses our "data move" host gibson.
Greg thinks Pelican can write back only what has changed.
ANSWER: I found a ?recursive option to the URL but I don't know if that does any deduplication like rsync does or not.
Writing to NRAO Origin
If we want to write to our Origin do we need to enable authentication?
What is involved with doing that?
Greg doesn't know. I will look at the docs.
output_destination and stdout/stderr
It used to be that once you set output_destination = someplugin:// then that plugin was responsible for transferring all files, even stdout and stderr. That no longer seems to be the case as of version 23. My nraorsync transfer plugin has code in it looking for _condor_stdout and _condor_stderr as arguments but never sees them with version 23. The stdout and stderr files are copied back to the submit directory instead of letting my plugin transfer them.
...
DONE: send Greg error output and security config
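Not from Greg, just a throwaway debugging sketch I could use to confirm exactly which arguments the plugin receives in version 23 (the paths are examples, not our real install locations):
#!/bin/sh
# wrapper around the real transfer plugin: log every invocation, then pass through
echo "$(date) args: $*" >> /tmp/nraorsync_args.log
exec /usr/libexec/condor/nraorsync_plugin.real "$@"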
transfer_output_files change in version 23
My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files. Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy. But with version 23, this no longer works. I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files. Is that right? If so, I think I can work around it. Just wanted to know.
ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before. Greg will see if there is a better, first-class way to trigger transfers.
DONE: We will use condor_transfer since it needs to be there anyway.
Installing version 23
I am looking at upgrading from version 10 to 23 LTS. I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only. It doesn't provide repos to install the LTS.
https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html
ANSWER: Greg will find it and get back to me.
DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
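A sketch of the .repo file I expect to drop into /etc/yum.repos.d/ for that channel (the repo name and gpgcheck setting are my guesses, not from the HTCondor docs):
[htcondor-23.0-lts]
name=HTCondor 23.0 LTS (el8 x86_64)
baseurl=https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
enabled=1
gpgcheck=0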
Virtual memory vs RSS
Looks like condor is reporting RSS but that may actually be virtual memory. At least according to Felipe's tests.
ANSWER: On the nmpost cluster condor runs as root and has access to the cgroup information, so it reports RSS accurately. But on systems using glideins, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may differ from memory reporting on the nmpost cluster. For glidein jobs condor reports the virtual memory summed across all the processes in the job.
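A quick way to compare what condor recorded for a job against what Felipe measured, using standard job ad attributes (the job id is a placeholder):
# MemoryUsage is what condor enforces against request_memory;
# ResidentSetSize and ImageSize are the raw RSS and virtual-size samples (KiB)
condor_history <jobid> -af MemoryUsage ResidentSetSize ImageSize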
CPU usage
Felipe has had jobs put on hold for too much cpu usage.
runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.
runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.
GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.
2024-09-16: Felipe asked about this again.
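Something to check the next time it happens, just a sketch run on whichever AP the job was submitted from: compare the requested CPUs against the CPU time condor actually accounted (job id taken from the hold message above).
# RemoteUserCpu/RemoteSysCpu are seconds of CPU time; RemoteWallClockTime is seconds of wall time
condor_history 269680 -af RequestCpus RemoteUserCpu RemoteSysCpu RemoteWallClockTime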
Missing batch_name
A DAG job, submitted along with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239. It's just the one job; all the others submitted from the same template do show batch names.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270
nmpost-master krowe >condor_q -dag -name mcilroy -g -all
...
vlapipe vlass_ql.dag+370186 7/16 10:30 1 1 _ _ 3 370193.0
vlapipe vlass_ql.dag+370191 7/16 10:31 1 1 _ _ 3 370194.0
vlapipe DAG: 371239 7/16 10:56 1 1 _ _ 3 371536.0
...
GREG: Probably a condor bug. Try submitting it again to see if the name is missing again.
WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'
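For future submissions the name can also be forced at submit time rather than patched afterwards (the batch name string here is arbitrary; the DAG file name is the one from this workflow):
condor_submit_dag -batch-name VLASS3.2_T17t27 vlass_ql.dag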
DAG failed to submit
Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job. condor.log is empty.
/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742
ANSWERs: Perhaps the schedd was too busy to respond. Need more resources in the workflow container?
Need to handle error codes from condor_submit_dag. 0 good. 1 bad. (chausman)
Setup /usr/bin/mail on mcilroy so that it works. Condor will use this to send mail to root when it encounters an error. Need to submit jira ticket to SSA. (krowe)
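A sketch of the exit-code handling for the workflow side (script name, retry count, and mail recipient are placeholders; condor_submit_dag exits 0 on success and non-zero on failure):
#!/bin/sh
# submit_dag_checked.sh (hypothetical): retry the submit and complain loudly if it never works
DAG="$1"
for try in 1 2 3; do
    condor_submit_dag "$DAG" && exit 0
    sleep 30
done
echo "condor_submit_dag failed for $DAG after 3 attempts" | mail -s "DAG submit failure" root
exit 1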
Resubmitting Jobs
I have an example in
/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433
of a job that failed on nmpost106 but then HTCondor resubmitted the job on nmpost105. The problem is the job actually did finish, just got an error transferring back all the files, so when the job was resubmitted, it copied over an almost complete run of CASA which sort of makes a mess of things. I would rather HTCondor just fail and not re-submit the job. How can I do that?
022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>
...
nmpost-master krowe >condor_userlog condor.log
Job        Host          Start Time   Evict Time   Wall Time  Good Time  CPU Usage
7315.0     10.7.7.168    2/11 19:42   2/11 23:35     0+03:52    0+03:52    0+08:31
7316.0     10.7.7.168    2/11 23:35   2/12 05:03     0+05:27    0+05:27    0+05:01
7317.0     10.7.7.168    2/12 05:03   2/12 06:13     0+01:09    0+01:09    0+00:33

Host/Job      Wall Time  Good Time  CPU Usage  Avg Alloc  Avg Lost  Goodput  Util.
10.7.7.168      0+10:29    0+10:29    0+14:06    0+03:29   0+00:00   100.0%  134.4%
7315.0          0+03:52    0+03:52    0+08:31    0+03:52   0+00:00   100.0%  219.9%
7316.0          0+05:27    0+05:27    0+05:01    0+05:27   0+00:00   100.0%   92.1%
7317.0          0+01:09    0+01:09    0+00:33    0+01:09   0+00:00   100.0%   47.7%

Total           0+10:29    0+10:29    0+14:06    0+03:29   0+00:00   100.0%  134.4%
ANSWER: Greg is not aware of any such bugs or reasons this would happen.
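Not an answer from Greg, just a knob I could experiment with: have the job held rather than rerun after a disconnect or shadow exception, keyed on standard job attributes (JobStatus == 1 is Idle, NumJobStarts counts executions); the exact expression is my guess.
# sketch for the submit description file: put the job on hold instead of rerunning it
periodic_hold = (JobStatus == 1) && (NumJobStarts > 0)
periodic_hold_reason = "Job already ran once and was requeued; holding for manual inspection"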
Seeing hostnames in condor_q output
What is the condor way to see hostnames in condor_q output? Say a user wants to see what jobs are running on host nmpost037.
The reason I want to know is so that when I am helping a user with our HTCondor install I can show them how to see the hostnames without telling them to use my script.
ANSWER: condor_q -run -all -g
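Two related forms I also find handy, using autoformat and a constraint (the hostname is the one from the question):
# show the execute host for every running job
condor_q -all -g -run -af:h ClusterId ProcId Owner RemoteHost
# only the jobs currently running on nmpost037
condor_q -all -g -constraint 'regexp("nmpost037", RemoteHost)' -af ClusterId Owner RemoteHost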
campuschampions
Have you heard of this email list? https://campuschampions.cyberinfrastructure.org/
...