...

ANSWER: the output of a startd cron job can change the machine ad.  So we need a ClassAd attribute like 'working = false' or change HASLUSTRE or something.  Then have another system, could be a cron job on nmpost-master, that checks for nodes with 'working = false' and drains the node.
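A sketch of what the EP side could look like (the cron job name, script path, and attribute name are all placeholders, not anything we have deployed):

```
# EP config sketch: a startd cron job that publishes a 'Working' attribute.
# 'HEALTH' and the script path are hypothetical.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/libexec/condor/check_health.sh
STARTD_CRON_HEALTH_PERIOD = 300
STARTD_CRON_HEALTH_MODE = periodic

# check_health.sh would print something like
#   Working = false
# when its test fails.  The sweep on nmpost-master could then be a cron job
# along the lines of:
#   condor_status -constraint 'Working =?= false' -af Machine | xargs -r -n1 condor_drain
```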


Writing to NRAO Origin

If we want to write to our Origin do we need to enable authentication?

What is involved with doing that?

Greg doesn't know.  I will look at the docs.


output_destination and stdout/stderr

...

DONE: send Greg error output and security config

transfer_output_files change in version 23

My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files.  Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy.  But with version 23, this no longer works.  I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files.  Is that right?  If so, I think I can work around it.  Just wanted to know.

ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before.  Greg will see if there is a better, first-class way to trigger transfers.

DONE: We will use condor_transfer since it needs to be there anyway.

Installing version 23

I am looking at upgrading from version 10 to 23 LTS.  I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only.  It doesn't provide repos to install the LTS.

https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html

ANSWER: Greg will find it and get back to me.

DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
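For reference, a .repo file pointing at that directory might look like this (the repo id and name are my own invention, and I have not verified the gpgcheck details):

```
[htcondor-23.0-lts]
name=HTCondor 23.0 LTS (EL8)
baseurl=https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
enabled=1
```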

Virtual memory vs RSS

Looks like condor is reporting RSS but that may actually be virtual memory.  At least according to Felipe's tests.

ANSWER: Access to the cgroup information on the nmpost cluster is good because condor is running as root, so condor reports the RSS accurately.  But on systems using glidein, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may be different than memory reporting on the nmpost cluster.  On glide-in jobs condor reports the virtual memory across all the processes in the job.
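The distinction shows up directly in /proc on Linux: VmRSS is resident memory while VmSize is the whole virtual address space.  A minimal sketch of pulling both out of /proc/&lt;pid&gt;/status-style text (sample values inlined, since live numbers vary by host):

```python
# Parse VmRSS vs VmSize from /proc/<pid>/status-style text.
# Sample text is inlined; on a real node you would read /proc/<pid>/status.
SAMPLE_STATUS = """\
Name:\tcasa
VmSize:\t 8388608 kB
VmRSS:\t  524288 kB
"""

def mem_kb(status_text, field):
    """Return the value in kB for a field like 'VmRSS' or 'VmSize'."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)

rss = mem_kb(SAMPLE_STATUS, "VmRSS")    # resident set size
vsz = mem_kb(SAMPLE_STATUS, "VmSize")   # virtual address space
print(rss, vsz)  # 524288 8388608
```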

CPU usage

Felipe has had jobs put on hold for too much cpu usage.

runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.

runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.

GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.

2024-09-16: Felipe asked about this again.
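One thing we can do on our side is make sure the job's thread count actually matches request_cpus.  A submit-file sketch (these are the common OpenMP/BLAS knobs; whether a given CASA task honors them is a separate question):

```
request_cpus = 8
# Cap common threading libraries at the requested core count (sketch)
environment = "OMP_NUM_THREADS=8 OPENBLAS_NUM_THREADS=8 MKL_NUM_THREADS=8"
```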


Missing batch_name

A DAG job, submitted with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239.  Just this one job; all the others submitted from the same template do show batch names.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270

nmpost-master krowe >condor_q -dag -name mcilroy -g -all

...

vlapipe  vlass_ql.dag+370186            7/16 10:30      1      1      _      _      3 370193.0

vlapipe  vlass_ql.dag+370191            7/16 10:31      1      1      _      _      3 370194.0

vlapipe  DAG: 371239                    7/16 10:56      1      1      _      _      3 371536.0

...


GREG: Probably a condor bug.  Try submitting it again to see if the name is missing again.

WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'


DAG failed to submit

Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job.  condor.log is empty.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742

ANSWERS: Perhaps the schedd was too busy to respond.  Need more resources in the workflow container?

Need to handle error codes from condor_submit_dag.  0 good.  1 bad. (chausman)

Set up /usr/bin/mail on mcilroy so that it works.  Condor will use this to send mail to root when it encounters an error.  Need to submit a jira ticket to SSA. (krowe)


Resubmitting Jobs

I have an example in

/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433

of a job that failed on nmpost106 but then HTCondor resubmitted the job on nmpost105.  The problem is the job actually did finish, just got an error transferring back all the files, so when the job was resubmitted, it copied over an almost complete run of CASA which sort of makes a mess of things.  I would rather HTCondor just fail and not re-submit the job.  How can I do that?

022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
    startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
    starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
        Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
        Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:57 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>


ANSWER: Maybe 

on_exit_hold = some_expression

periodic_hold = NumShadowStarts > 5

periodic_hold = NumJobStarts > 5

or a startd cron job that checks for IdM and offlines the node if needed

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#custom-and-system-slot-attributes

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron



Constant processing

Our workflows have a process called "ingestion" that puts data into our archive.  There are almost always ingestion processes running or needing to run and we don't want them to get stalled because of other jobs.  Both ingestion and other jobs run as the same user "vlapipe".  I thought about setting a high priority in the ingestion submit description file but that won't guarantee that ingestion always runs, especially since we don't do preemption.  So my current thinking is to have a dedicated node for ingestion.  Can you think of a better solution?

  • What about using the local universe so it runs on the Access Point?  The AP is a docker container with only limited Lustre access, so this would be a bad option.
  • ANSWER: A dedicated node is a good solution given no preemption.

So on the node I would need to set something like the following

# High priority only jobs

HIGHPRIORITY = True

STARTD_ATTRS = $(STARTD_ATTRS) HIGHPRIORITY

START = ($(START)) && (TARGET.priority =?= "HIGHPRIORITY")

Nov. 13, 2023 krowe: I need to implement this.  Make a node a HIGHPRIORITY node and have SSA put HIGHPRIORITY in the ingestion jobs.

2024-02-01 krowe: Talked to chausman today.  She thinks SSA will need this and that the host will need access to /lustre/evla like aocngas-master and the nmngas nodes do.  That might also mean a variable like HASEVLALUSTRE as well as or instead of HIGHPRIORITY.
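The job side of the pairing would then need to define the attribute that START expression tests; a sketch for the ingestion submit description (just the one line that matters):

```
# Put a custom attribute into the job ClassAd so the dedicated node's
# START expression (TARGET.priority =?= "HIGHPRIORITY") will match.
+priority = "HIGHPRIORITY"
```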





...



In progress

condor_remote_cluster

CHTC

testpost-master krowe >cat condor.902.log 
000 (902.000.000) 2023-04-14 16:40:37 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (902.000.000) 2023-04-14 16:40:41 Job was held.
        Failed to start GAHP: Agent pid 3145812\nPermission denied (gssapi-with-mic,keyboard-interactive).\nAgent pid 3145812 killed\n
        Code 0 Subcode 0
...

PATh

000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
        Failed to start GAHP: Missing remote command\n
        Code 0 Subcode 0
...

Radial

It works but seems to leave a job on the radial cluster for like 30 minutes.


[root@radialhead htcondor-10.0.3-1]# ~krowe/bin/condor_qstat 
JobId     Owner    JobBatchName       CPUs JS Mem(MB)  ElapTime    SubmitHost       Slot     RemoteHost(s)       
--------- -------- ------------------ ---- -- -------- ----------- ---------------- -------- -----------------------------------
       99 nrao                        1    C      1024 0+0:13:22   radialhead.nrao.                              

...

Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (maybe a subset) to a different area for the User to look at.  Is there a condor way to do this?  transfer_output_remaps?

ANSWER: Greg doesn't think there is a Condor way to do this.  Could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
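For reference, the remap mechanism itself looks like this in a submit description (the filenames and destination paths here are invented placeholders, not our real areas):

```
# Sketch: send different outputs to different destinations via remaps.
should_transfer_files = YES
transfer_output_files = weblog.tgz, products.tgz
transfer_output_remaps = "weblog.tgz = /lustre/aoc/somewhere/analysts/weblog.tgz; products.tgz = /lustre/aoc/somewhere/users/products.tgz"
```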

Pelican?

Felipe is playing with it and we will probably want it at NRAO.

ANSWER: Greg will ask around.

RHEL8 Crashing

We have had many NMT VLASS nodes crash since we upgraded to RHEL8.  I think the nodes were busy when they crashed.  So I changed our SLOT_TYPE_1 from 100% to 95%.  Is this a good idea?

ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95% and put SLOT_TYPE_1=100% again.

getenv

Did it change since 10.0?  Can we still use getenv in DAGs or regular jobs?

#krowe Nov  5 2024: getenv no longer includes your entire environment as of version 10.7 or so.  Instead it only includes the environment variables you list with the "ENV GET" syntax in the .dag file.

https://git.ligo.org/groups/computing/-/epics/30

ANSWER: Yes this is true.  CHTC would like users to stop using getenv=true.  There may be a knob to restore the old behavior.

DONE: check out docs and remove getenv=true
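As I understand the per-variable forms (the variable names here are just examples):

```
# In a submit description: import only the named variables
getenv = HOME, PATH, CASAPATH

# In a .dag file: the DAGMan equivalent
ENV GET HOME PATH CASAPATH
```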

condor_userlog

condor_userlog /users/krowe/htcondor/condor_userlog/tmprn04xnqo/condor.log shows over 100% CPU Utilization.  How does that happen?  Hyperthreading is disabled.

nmpost-master krowe >condor_userlog condor.log 

Job      Host            Start Time  Evict Time  Wall Time Good Time CPU Usage
7315.0   10.7.7.168       2/11 19:42  2/11 23:35   0+03:52   0+03:52   0+08:31
7316.0   10.7.7.168       2/11 23:35  2/12 05:03   0+05:27   0+05:27   0+05:01
7317.0   10.7.7.168       2/12 05:03  2/12 06:13   0+01:09   0+01:09   0+00:33

Host/Job        Wall Time Good Time CPU Usage Avg Alloc  Avg Lost Goodput  Util.

10.7.7.168        0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

7315.0            0+03:52   0+03:52   0+08:31   0+03:52   0+00:00  100.0% 219.9%
7316.0            0+05:27   0+05:27   0+05:01   0+05:27   0+00:00  100.0%  92.1%
7317.0            0+01:09   0+01:09   0+00:33   0+01:09   0+00:00  100.0%  47.7%

Total             0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

ANSWER: Greg is not aware of any such bugs or reasons this would happen.


Seeing hostnames in condor_q output

What is the condor way to see the hostnames in condor_q output?  Say a user wants to see what jobs are running on host nmpost037.

The reason I want to know is so when I am helping some user with our HTCondor install I can show them how to see the hostnames without telling them to use my script.

ANSWER: condor_q -run -all -g


Using Pelican to replace nraorsync?

nraorsync does three things: uses rsync to only write back what has changed; uses the faster network (IB, 10g, etc.) since our AP, nmpost-master, doesn't have an external IP; and uses our "data move" host gibson.

Greg thinks Pelican can write back only what has changed.

ANSWER: I found a ?recursive option to the URL but I don't know if that does any deduplication like rsync does or not.






campuschampions

Have you heard of this email list?  https://campuschampions.cyberinfrastructure.org/


pools

This is pretty amazing.  I can check the status and queue of OSG and PATh from nmpost-master:

condor_status -pool cm-1.ospool.osg-htc.org

or

condor_status -pool htcondor-cm-path.osg.chtc.io

As long as I have a token for them in ~/.condor/tokens.d



...