...

ANSWER: the output of a startd cron job can change the machine ad.  So we need a ClassAd attribute like 'working = false' or change HASLUSTRE or something.  Then have another system, could be a cron job on nmpost-master, that checks for nodes with 'working = false' and drains the node.
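A sketch of what the EP side could look like (the cron job name, script path, and attribute name are all placeholders, not anything we have deployed):

```
# EP config sketch: a startd cron job that publishes a 'Working' attribute.
# 'HEALTH' and the script path are hypothetical.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTH
STARTD_CRON_HEALTH_EXECUTABLE = /usr/local/libexec/condor/check_health.sh
STARTD_CRON_HEALTH_PERIOD = 300
STARTD_CRON_HEALTH_MODE = periodic

# check_health.sh would print something like
#   Working = false
# when its test fails.  The sweep on nmpost-master could then be a cron job
# along the lines of:
#   condor_status -constraint 'Working =?= false' -af Machine | xargs -r -n1 condor_drain
```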


Writing to NRAO Origin

If we want to write to our Origin do we need to enable authentication?

What is involved with doing that?

Greg doesn't know.  I will look at the docs.


output_destination and stdout/stderr

...

DONE: send Greg error output and security config

transfer_output_files change in version 23

My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files.  Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy.  But with version 23, this no longer works.  I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files.  Is that right?  If so, I think I can work around it.  Just wanted to know.

ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before.  Greg will see if there is a better, first-class way to trigger transfers.

DONE: We will use condor_transfer since it needs to be there anyway.

Installing version 23

I am looking at upgrading from version 10 to 23 LTS.  I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only.  It doesn't provide repos to install the LTS.

https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html

ANSWER: Greg will find it and get back to me.

DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
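For reference, a .repo file pointing at that directory might look like this (the repo id and name are my own invention, and I have not verified the gpgcheck details):

```
[htcondor-23.0-lts]
name=HTCondor 23.0 LTS (EL8)
baseurl=https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
enabled=1
```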

Virtual memory vs RSS

Looks like condor is reporting RSS but that may actually be virtual memory.  At least according to Felipe's tests.

ANSWER: Access to the cgroup information on the nmpost cluster is good because condor is running as root, so condor reports the RSS accurately.  But on systems using glidein, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may be different than memory reporting on the nmpost cluster.  On glide-in jobs condor reports the virtual memory across all the processes in the job.
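The distinction shows up directly in /proc on Linux: VmRSS is resident memory while VmSize is the whole virtual address space.  A minimal sketch of pulling both out of /proc/&lt;pid&gt;/status-style text (sample values inlined, since live numbers vary by host):

```python
# Parse VmRSS vs VmSize from /proc/<pid>/status-style text.
# Sample text is inlined; on a real node you would read /proc/<pid>/status.
SAMPLE_STATUS = """\
Name:\tcasa
VmSize:\t 8388608 kB
VmRSS:\t  524288 kB
"""

def mem_kb(status_text, field):
    """Return the value in kB for a field like 'VmRSS' or 'VmSize'."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    raise KeyError(field)

rss = mem_kb(SAMPLE_STATUS, "VmRSS")    # resident set size
vsz = mem_kb(SAMPLE_STATUS, "VmSize")   # virtual address space
print(rss, vsz)  # 524288 8388608
```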

CPU usage

Felipe has had jobs put on hold for too much cpu usage.

runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.

runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.

GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.

2024-09-16: Felipe asked about this again.
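One thing we can do on our side is make sure the job's thread count actually matches request_cpus.  A submit-file sketch (these are the common OpenMP/BLAS knobs; whether a given CASA task honors them is a separate question):

```
request_cpus = 8
# Cap common threading libraries at the requested core count (sketch)
environment = "OMP_NUM_THREADS=8 OPENBLAS_NUM_THREADS=8 MKL_NUM_THREADS=8"
```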


Missing batch_name

A DAG job, submitted with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239.  Just this one job; all the others submitted from the same template do show batch names.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270

nmpost-master krowe >condor_q -dag -name mcilroy -g -all

...

vlapipe  vlass_ql.dag+370186            7/16 10:30      1      1      _      _      3 370193.0

vlapipe  vlass_ql.dag+370191            7/16 10:31      1      1      _      _      3 370194.0

vlapipe  DAG: 371239                    7/16 10:56      1      1      _      _      3 371536.0

...


GREG: Probably a condor bug.  Try submitting it again to see if the name is missing again.

WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'


DAG failed to submit

Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job.  condor.log is empty.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742

ANSWERS: Perhaps the schedd was too busy to respond.  Need more resources in the workflow container?

Need to handle error codes from condor_submit_dag.  0 good.  1 bad. (chausman)

Set up /usr/bin/mail on mcilroy so that it works.  Condor will use this to send mail to root when it encounters an error.  Need to submit a jira ticket to SSA. (krowe)


Resubmitting Jobs

I have an example in

/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433

of a job that failed on nmpost106 but then HTCondor resubmitted the job on nmpost105.  The problem is the job actually did finish, just got an error transferring back all the files, so when the job was resubmitted, it copied over an almost complete run of CASA which sort of makes a mess of things.  I would rather HTCondor just fail and not re-submit the job.  How can I do that?

022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
    startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
    starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
        Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
        Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:57 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>


ANSWER: Maybe 

on_exit_hold = some_expression

periodic_hold = NumShadowStarts > 5

periodic_hold = NumJobStarts > 5

or a startd cron job that checks for IdM and offlines the node if needed

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#custom-and-system-slot-attributes

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron



Constant processing

Our workflows have a process called "ingestion" that puts data into our archive.  There are almost always ingestion processes running or needing to run and we don't want them to get stalled because of other jobs.  Both ingestion and other jobs run as the same user "vlapipe".  I thought about setting a high priority in the ingestion submit description file but that won't guarantee that ingestion always runs, especially since we don't do preemption.  So my current thinking is to have a dedicated node for ingestion.  Can you think of a better solution?

  • What about using the local universe so it runs on the Access Point?  The AP is a docker container with only limited Lustre access, so this would be a bad option.
  • ANSWER: A dedicated node is a good solution given no preemption.

So on the node I would need to set something like the following

# High priority only jobs

HIGHPRIORITY = True

STARTD_ATTRS = $(STARTD_ATTRS) HIGHPRIORITY

START = ($(START)) && (TARGET.priority =?= "HIGHPRIORITY")

Nov. 13, 2023 krowe: I need to implement this.  Make a node a HIGHPRIORITY node and have SSA put HIGHPRIORITY in the ingestion jobs.

2024-02-01 krowe: Talked to chausman today.  She thinks SSA will need this and that the host will need access to /lustre/evla like aocngas-master and the nmngas nodes do.  That might also mean a variable like HASEVLALUSTRE as well as or instead of HIGHPRIORITY.
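The job side of the pairing would then need to define the attribute that START expression tests; a sketch for the ingestion submit description (just the one line that matters):

```
# Put a custom attribute into the job ClassAd so the dedicated node's
# START expression (TARGET.priority =?= "HIGHPRIORITY") will match.
+priority = "HIGHPRIORITY"
```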





...



In progress

condor_remote_cluster

CHTC

testpost-master krowe >cat condor.902.log 
000 (902.000.000) 2023-04-14 16:40:37 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (902.000.000) 2023-04-14 16:40:41 Job was held.
        Failed to start GAHP: Agent pid 3145812\nPermission denied (gssapi-with-mic,keyboard-interactive).\nAgent pid 3145812 killed\n
        Code 0 Subcode 0
...

PATh

000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
        Failed to start GAHP: Missing remote command\n
        Code 0 Subcode 0
...

Radial

It works but seems to leave a job on the radial cluster for like 30 minutes.


[root@radialhead htcondor-10.0.3-1]# ~krowe/bin/condor_qstat 
JobId     Owner    JobBatchName       CPUs JS Mem(MB)  ElapTime    SubmitHost       Slot     RemoteHost(s)       
--------- -------- ------------------ ---- -- -------- ----------- ---------------- -------- -----------------------------------
       99 nrao                        1    C      1024 0+0:13:22   radialhead.nrao.                              

...

Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (maybe a subset) to a different area for the User to look at.  Is there a condor way to do this?  transfer_output_remaps?

ANSWER: Greg doesn't think there is a Condor way to do this.  Could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
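For reference, the remap mechanism itself looks like this in a submit description (the filenames and destination paths here are invented placeholders, not our real areas):

```
# Sketch: send different outputs to different destinations via remaps.
should_transfer_files = YES
transfer_output_files = weblog.tgz, products.tgz
transfer_output_remaps = "weblog.tgz = /lustre/aoc/somewhere/analysts/weblog.tgz; products.tgz = /lustre/aoc/somewhere/users/products.tgz"
```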

Pelican?

Felipe is playing with it and we will probably want it at NRAO.

ANSWER: Greg will ask around.

RHEL8 Crashing

We have had many NMT VLASS nodes crash since we upgraded to RHEL8.  I think the nodes were busy when they crashed.  So I changed our SLOT_TYPE_1 from 100% to 95%.  Is this a good idea?

ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95% and put SLOT_TYPE_1=100% again.

getenv

Did it change since 10.0?  Can we still use getenv in DAGs or regular jobs?

#krowe Nov  5 2024: getenv no longer includes your entire environment as of version 10.7 or so.  Instead it only includes the environment variables you list with the "ENV GET" syntax in the .dag file.

https://git.ligo.org/groups/computing/-/epics/30

ANSWER: Yes this is true.  CHTC would like users to stop using getenv=true.  There may be a knob to restore the old behavior.

DONE: check out docs and remove getenv=true
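As I understand the per-variable forms (the variable names here are just examples):

```
# In a submit description: import only the named variables
getenv = HOME, PATH, CASAPATH

# In a .dag file: the DAGMan equivalent
ENV GET HOME PATH CASAPATH
```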

condor_userlog

condor_userlog /users/krowe/htcondor/condor_userlog/tmprn04xnqo/condor.log shows over 100% CPU Utilization.  How does that happen?  Hyperthreading is disabled.

nmpost-master krowe >condor_userlog condor.log 

Job      Host            Start Time  Evict Time  Wall Time Good Time CPU Usage
7315.0   10.7.7.168       2/11 19:42  2/11 23:35   0+03:52   0+03:52   0+08:31
7316.0   10.7.7.168       2/11 23:35  2/12 05:03   0+05:27   0+05:27   0+05:01
7317.0   10.7.7.168       2/12 05:03  2/12 06:13   0+01:09   0+01:09   0+00:33

Host/Job        Wall Time Good Time CPU Usage Avg Alloc  Avg Lost Goodput  Util.

10.7.7.168        0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

7315.0            0+03:52   0+03:52   0+08:31   0+03:52   0+00:00  100.0% 219.9%
7316.0            0+05:27   0+05:27   0+05:01   0+05:27   0+00:00  100.0%  92.1%
7317.0            0+01:09   0+01:09   0+00:33   0+01:09   0+00:00  100.0%  47.7%

Total             0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

ANSWER: Greg is not aware of any such bugs or reasons this would happen.


Seeing hostnames in condor_q output

What is the condor way to see the hostnames in condor_q output?  Say a user wants to see what jobs are running on host nmpost037.

The reason I want to know is so when I am helping some user with our HTCondor install I can show them how to see the hostnames without telling them to use my script.

ANSWER: condor_q -run -all -g


Using Pelican to replace nraorsync?

nraorsync does three things: uses rsync to only write back what has changed; uses the faster network (IB, 10g, etc.) since our AP, nmpost-master, doesn't have an external IP; and uses our "data move" host gibson.

Greg thinks Pelican can write back only what has changed.

ANSWER: I found a ?recursive option to the URL but I don't know if that does any deduplication like rsync does or not.






campuschampions

Have you heard of this email list?  https://campuschampions.cyberinfrastructure.org/


pools

This is pretty amazing.  I can check the status and queue of OSG and PATh from nmpost-master:

condor_status -pool cm-1.ospool.osg-htc.org

or

condor_status -pool htcondor-cm-path.osg.chtc.io

As long as I have a token for them in ~/.condor/tokens.d



...