...

But cuda_visible_devices only provides the short UUID name.  Is there a way to get the long UUID name from cuda_visible_devices?

Quotes in DAG VARS

I was helping SSA with a syntax problem between HTCondor-9 and HTCondor-10 and I was wondering if you had any thoughts on it.  They have a DAG with lines like this

  JOB SqDeg2/J232156-603000 split.condor
  VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000" 

Then they set that split_dir VAR to a variable in the submit description file like this

  SPLIT_DIR = "$(split_dir)"

The problem seems to be the quotes around $(split_dir).  It works fine in HTCondor-9 but with HTCondor-10 they get an error like this in their pims_split.dag.dagman.out file

  02/28/24 16:26:02 submit error: Submit:-1:Unexpected characters following doublequote.  Did you forget to escape the double-quote by repeating it?  Here is the quote and trailing characters: "SqDeg2/J232156+603000""

Looking at the documentation https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-10-0.html#version-10-0-0 it's clear they shouldn't be putting quotes around $(split_dir).  So clearly something changed with version 10: either a change to the syntax or, my guess, just a stricter parser.

Any thoughts on this?

ANSWER: Greg doesn't know why this changed but thinks we are now doing the right thing.
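A minimal sketch of the working pattern under HTCondor-10 (names taken from the example above): keep the quotes in the DAG VARS line, drop them in the submit description file, since the macro expands inside the submit language rather than a shell.

```
# pims_split.dag (unchanged): VARS values stay quoted here
#   VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000"

# split.condor: use the macro unquoted; HTCondor-10's stricter submit
# parser rejects a double-quoted expansion of an already-quoted value.
SPLIT_DIR = $(split_dir)
```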

Felipe's code

Felipe to share his job visualization software with Greg and maybe present at Throughput 2024.

Big Data

Are we alone in needing to copy in and out many GBs per job?  Do other institutions have this problem as well?  Does CHTC have any suggestions to help?  Sanja will ask this of Bockleman as well.

ANSWER: Greg thinks our transfer times are not uncommon but our processing time is shorter than many.  Other jobs have similar data sizes.  Some other jobs have similar transfer times but process for many hours.  Maybe we can constrain our jobs to only run on sites that seem to transfer quickly.  Greg is also interested in why some sites seem slower than others.  Is that actually site specific or is it time specific or...

Felipe does have a long list of excluded sites in his run just for this reason.  Greg would like a more declarative solution like "please run on fast transfer hosts", especially if this is dynamic.

GPUs_Capability

We have a host (testpost001) with both a Tesla T4 (Capability=7.5) and a Tesla L4 (Capability=8.9) and when I run condor_gpu_discovery -prop I see something like the following

DetectedGPUs="GPU-ddc998f9, GPU-40331b00"

Common=[ DriverVersion=12.20; ECCEnabled=true; MaxSupportedVersion=12020; ]

GPU_40331b00=[ id="GPU-40331b00"; Capability=7.5; DeviceName="Tesla T4"; DevicePciBusId="0000:3B:00.0"; DeviceUuid="40331b00-c3b6-fa9a-b8fd-33bec2fcd29c"; GlobalMemoryMb=14931; ]

GPU_ddc998f9=[ id="GPU-ddc998f9"; Capability=8.9; DeviceName="NVIDIA L4"; DevicePciBusId="0000:5E:00.0"; DeviceUuid="ddc998f9-99e2-d9c1-04e3-7cc023a2aa5f"; GlobalMemoryMb=22491; ]

The problem is `condor_status -compact -constraint 'GPUs_Capability >= 7.0'` doesn't show testpost001.  It does show testpost001 when I physically remove the T4.

Requesting a specific GPU with `RequireGPUs = (Capability >= 8.0)` or `RequireGPUs = (Capability <= 8.0)` does work however so maybe this is just a condor_status issue.

We then replaced the L4 with a second T4 and then GPUs_Capability functioned as expected.

Can condor handle two different capabilities on the same node?

ANSWER: Greg will look into it.  They only recently added support for different GPUs on the same node, so it is going to take some time to get support into condor_status.
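In the meantime, per-job matching worked even on the mixed node, so a submit-file sketch like the following can target the L4 specifically (the require_gpus submit command exists in recent HTCondor releases; worth checking against the local version).

```
# Request one GPU but only match the Capability 8.9 card (the L4).
# This matched correctly even while condor_status -compact was
# mis-reporting the mixed-capability host.
request_GPUs = 1
require_gpus = (Capability >= 8.0)
```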



Resubmitting Jobs

I have an example in 

/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433

of a job that failed on nmpost106 but then HTCondor resubmitted the job on nmpost105.  The problem is the job actually did finish, just got an error transferring back all the files, so when the job was resubmitted, it copied over an almost complete run of CASA which sort of makes a mess of things.  I would rather HTCondor just fail and not re-submit the job.  How can I do that?

022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
    startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
    starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
        Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
        Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>

ANSWER: Maybe one of the following:

on_exit_hold = some_expression

periodic_hold = NumShadowStarts > 5

periodic_hold = NumJobStarts > 5

or a startd cron job that checks for IdM and offlines the node if needed

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#custom-and-system-slot-attributes

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron
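Putting the suggested knobs together, a hedged submit-file sketch (the threshold of 5 is arbitrary; periodic_hold and periodic_hold_reason are standard submit commands):

```
# Hold the job instead of letting HTCondor restart it elsewhere once
# the shadow has restarted repeatedly (e.g. repeated output-transfer
# failures), so a human can inspect the sandbox before releasing it.
periodic_hold        = (NumShadowStarts > 5) || (NumJobStarts > 5)
periodic_hold_reason = "Too many (re)starts; suspected transfer failure"
```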


RedHat8 Only

Say we have a few RedHat8 nodes and we only want jobs to run on those nodes that request RedHat8 with

requirements = (OpSysAndVer == "RedHat8")

I know I could set up a partition like we have done with VLASS but since HTCondor already has an OS knob, can I use that?

Setting RedHat8 in the job requirements guarantees the job will run on a RedHat8 node, but how do I make that node not run jobs that don't specify the OS they want?

The following didn't do what I wanted.

START = ($(START)) && (TARGET.OpSysAndVer =?= "RedHat8")

Then I thought I needed to specify jobs where OpSysAndVer is not Undefined, but that didn't work either.  Either of the following does prevent jobs that don't specify an OS from running on the node, but they also prevent jobs that DO specify an OS via OpSysAndVer or OpSysMajorVer respectively.

START = ($(START)) && (TARGET.OpSysAndVer isnt UNDEFINED)

START = ($(START)) && (TARGET.OpSysMajorVer isnt UNDEFINED)


A better long-term solution is probably for our jobs (VLASS, VLA calibration, ingestion, etc.) to ask for the OS they want if they care.  Then they can test new OSes when they want and we can upgrade OSes on our schedule (to a certain point).  I think asking them to start requesting the OS they want now is not going to happen, but maybe by the time RedHat9 is an option they and we will be ready for this.

ANSWER: unparse() takes a ClassAd expression and turns it into a string; then use a regex on that string looking for OpSysAndVer.

Is this the right syntax?  Probably not as it doesn't work

START = ($(START)) && (regexp(".*RedHat8.*", unparse(TARGET.Requirements)))

Greg thinks this should work.  We will poke at it.

The following DOES WORK in the sense that it matches anything

START = ($(START)) && (regexp(".", unparse(TARGET.Requirements)))
START = ($(START)) && (regexp(".*a.*", unparse(TARGET.Requirements)))

None of these work

START = ($(START)) && (regexp(".*RedHat8.*", unparse(Requirements)))
START = ($(START)) && (regexp(".*a.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(TARGET.Requirements)))
START = ($(START)) && (regexp("\(\(OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("(.*)RedHat8(.*)", unparse(Requirements)))
START = ($(START)) && (regexp("RedHat8", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "m"))
START = ($(START)) && (regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements)))
START = $(START) && regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements))

(Note: these last two attempts misspell OpSysAndVer as OpSysAdnVer, which by itself would keep them from matching.)

#START = $(START) && debug(regexp(".*RedHat8.*", unparse(TARGET.Requirements)))

...

executable = my_binary.$$([ classad_expression(GPUs_Capability) ]) # Hopefully you don't need this

CPU/GPU Balancing

We have 30 nodes in a rack at NMT with a power limit of 17 kW and we are able to hit that limit when all 720 cores (24 cores * 30 nodes) are busy.  We want to add two GPUs to each node but that would almost certainly put us way over the power limit if each node had 22 cores and 2 GPUs busy.  So is there a way to tell HTCondor to reserve X cores for each GPU?  That way we could balance the power load.

JOB TRANSFORMS work per schedd so that wouldn't work on the startd side which is what we want.

IDEA: NUM_CPUS = 4 or some other small number greater than the number of GPUs but limiting enough to keep the power draw low.

ANSWER: There isn't a knob for this in HTCondor but Greg is interested in this and will look into this. 

WORKAROUND: MODIFY_REQUEST_EXPR_REQUESTCPUS may help by making each GPU job request 8 cores or something like that.

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, isUndefined(RequestGpus) ? {1} : {8, 16, 24, 32, 40})

That is, when a job comes into the startd, if it doesn't request any GPUs, allocate exactly as many cpu cores as it requests. Otherwise, allocate 8 times as many cpus as it requests.

This seems to work. If I ask for 0 GPUs and 4 CPUs, I am given 0 GPUs and 4 CPUs.  If I ask for 1 GPU and don't ask for CPUs, I am given 1 GPU and 8 CPUs.

But if I ask for 2 GPUs and don't ask for CPUs, I still am only given 8 CPUs.  I was expecting to be given 16 CPUs.  This is probably fine as we are not planning on more than 1 GPU per job.

But if I ask for 1 GPU and 4 CPUs, I am given 1 GPU and 8 CPUs.  That is probably acceptable.
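For reference, quantize(x, {list}) returns the first list element >= x, which explains both observations above: the CPU request is rounded up into the list, but RequestGpus is only tested for being defined and never multiplied in.

```
# MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, isUndefined(RequestGpus) ? {1} : {8, 16, 24, 32, 40})
#
#   no GPUs, RequestCpus=4 -> quantize(4, {1})             -> 4 cpus
#   1 GPU,   RequestCpus=1 -> quantize(1, {8,16,24,32,40}) -> 8 cpus
#   2 GPUs,  RequestCpus=1 -> quantize(1, {8,16,24,32,40}) -> 8 cpus (not 16)
```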

2024-01-24 krowe: Assuming a node can draw up to 550 Watts when all 24 cores are busy and that node only draws 150 Watts when idle, and that we have 17,300 Watts available to us in an NMT rack,

  • we should only need to reserve 3 cores per GPU in order to offset the 72 Watts of an Nvidia L4 GPU.
    • This would waste 60 cores.
  • Or at least I suggest starting with that and seeing what happens.  Another alternative is we just turn off three nodes if we put one L4 in each node.
    • This would waste 72 cores.

Upgrading

CHTC just upgrades to the latest version when it becomes available, right?  Do you ever run into problems because of this?  We are still using version 9 because I can't seem to schedule a time with our VLASS group to test version 10.  Much less version 23.

ANSWER: yes.  The idea is that CHTC's users are helping them test the latest versions.

Flocking to CHTC?

We may want to run VLASS jobs at CHTC.  What is the best way to submit locally and run globally?

ANSWER: Greg thinks flocking is the best idea.

This will require port 9618 open to nmpost-master and probably a static NAT and an external DNS name.
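The submit-side configuration for flocking is small; a sketch (the remote central manager hostname here is a placeholder, not the real CHTC name):

```
# condor_config on the NRAO AP (nmpost-master): jobs that cannot match
# locally may flock to the remote pool's central manager.
FLOCK_TO = $(FLOCK_TO), cm.example-chtc-pool.wisc.edu
```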

External users vs staff

We are thinking about making a DMZ (I don't like that term) for observers.  Does CHTC staff use the same cluster resources that CHTC observers (customers) use?

ANSWER: There is no airgap at CHTC; everyone uses the same cluster.  Sometimes users use a different AP, but more for load balancing than security.  Everyone does go through 2FA.

Does PATh Cache thingy(tm) (a.k.a. Stash) work outside of PATh?

I see HTCondor-10.x comes with a stash plugin.  Does this mean we could read/write to stash from NRAO using HTCondor-10.x?

ANSWER: Greg thinks you can use stash remotely, like at our installation of HTCondor.

Curl_plugin doesn't do FTP

None of the following work.  They either hang or produce errors.  They work on the shell command line, except at CHTC where the squid server doesn't seem to grok FTP.

transfer_input_files = ftp://demo:password@test.rebex.net:/readme.txt
transfer_input_files = ftp://ftp:@ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp:@ftp.slackware.com:/welcome.msg
transfer_input_files = ftp://ftp.slackware.com:/welcome.msg

2024-02-05: Greg thinks this should work and will look into it.

ANSWER: 2024-02-06 Greg wrote "Just found the problem with ftp file transfer plugin.  I'm afraid there's no easy workaround, but I've pushed a fix that will go into the next stable release. "


File Transfer Plugins and HTCondor-C

I see that when a job starts, the execution point (radial001) uses our nraorsync plugin to download the files.  This is fine and good.  When the job is finished, the execution point (radial001) uses our nraorsync plugin to upload the files, also fine and good.  But then the RADIAL schedd (radialhead) also runs our nraorsync plugin to upload files.  This causes problems because radialhead doesn't have the _CONDOR_JOB_AD environment variable and the plugin dies.  Why is the remote schedd running the plugin and is there a way to prevent it from doing so?

Greg understands this and will ask the HTCondor-C folks about it.

Greg thinks it is a bug and will talk to our HTCondor-C people.

2023-08-07: Greg said the HTCondor-C people agree this is a bug and will work on it.

2023-09-25 krowe: send Greg my exact procedure to reproduce this.

2023-10-02 krowe: Sent Greg an example that fails.  Turns out it is intermittent.

2024-01-22 krowe: will send email to the condor list

ANSWER: It was K. Scott all along.  I now have HTCondor-C working from the nmpost and testpost clusters to the radial cluster, using my nraorsync plugin to transfer both input and output files.  The reason the remote AP (radialhead) was running the nraorsync plugin was because I defined it in the condor config like so.

FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py

I probably did this early in my HTCondor-C testing not knowing what I was doing.  I commented this out, restarted condor, and now everything seems to be working properly.
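For the record, the shape of the fix on the remote AP (radialhead): leave the plugin registered only on the EPs.

```
# radialhead condor_config: plugin registration commented out so the
# remote schedd no longer invokes nraorsync on output transfer.
#FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py
```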


