Current Questions

Health Check

Is there a health check mechanism for condor?  For example, if the Lustre filesystem is unavailable on a node, can condor be told to down that node?

https://github.com/mej/nhc

I see how I could test for this, but how do I drain a node when the lustre_check.sh returns a failure?

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) lustre_check

STARTD_CRON_lustre_check_MODE = Periodic

STARTD_CRON_lustre_check_PERIOD = 1m

STARTD_CRON_lustre_check_EXECUTABLE = /opt/services/lustre_check.sh

ANSWER: the output of a startd cron job can change the machine ad.  So we need a ClassAd attribute like 'working = false' (or change HASLUSTRE, or something similar).  Then have another system, perhaps a cron job on nmpost-master, that checks for nodes with 'working = false' and drains them.
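A startd cron hook only needs to print ClassAd attributes on stdout for the startd to merge them into the machine ad.  A minimal sketch of a lustre_check in Python (the attribute name LustreOK and the mount point are assumptions, not our actual config):

```python
#!/usr/bin/env python3
# Hypothetical startd cron hook: whatever this prints on stdout is merged
# into the machine ad.  The attribute name LustreOK is an assumption.
import os

def lustre_ok(mount="/lustre"):
    """Report whether the Lustre mount point is still a live mount."""
    try:
        return os.path.ismount(mount)
    except OSError:
        # A hung filesystem can make even stat() fail.
        return False

if __name__ == "__main__":
    print("LustreOK =", "True" if lustre_ok() else "False")
```

A cron job on nmpost-master could then find and drain the unhealthy nodes with something along the lines of condor_status -const 'LustreOK =!= True' -af Machine, feeding each name to condor_drain.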


output_destination and stdout/stderr

It used to be that once you set output_destination = someplugin:// then that plugin was responsible for transferring all files even stdout and stderr.  That no longer seems to be the case as of version 23.  My nraorsync transfer plugin has code in it looking for _condor_stdout and _condor_stderr as arguments but never sees them with version 23.  The stdout and stderr files are copied back to the submit directory instead of letting my plugin transfer them.

This is a change.  I am not sure whether it affects us adversely, but can this behavior be reverted?

ANSWER: from Greg "After some archeology, it turns out that the change so that a file transfer plugin requesting to transfer the whole sandbox no longer sees stdout/stderr is intentional, and was asked for by several users.  The current workaround is to explicitly list the plugin in the stdout/stderr lines of the submit file, e.g."

output = nraorsync://some_location/stdout
error    = nraorsync://some_location/stderr

This seems like it should work but my plugin produces errors.  Probably my fault.


tokens and collector.locate

It seems that if the submit host is HTC23, you need a user token in order for the API (nraorsync specifically) to locate the schedd.

import os
import classad
import htcondor

def upload_file():
    # Read the job ad that HTCondor leaves in the sandbox.
    try:
        with open(os.environ['_CONDOR_JOB_AD'], 'r') as jobad_file:
            ads = classad.parseAds(jobad_file)
            globaljobid = None
            for ad in ads:
                try:
                    globaljobid = str(ad['GlobalJobId'])
                    print("DEBUG: globaljobid is", globaljobid)
                except KeyError:
                    return -1
    except (OSError, KeyError):
        return -1
    if globaljobid is None:
        return -1

    print("DEBUG: upload_file(): step 1\n")
    # GlobalJobId is of the form "submithost#cluster.proc#timestamp".
    submithost = globaljobid.split('#')[0]
    print("DEBUG: submithost is", submithost)
    collector = htcondor.Collector()
    print("DEBUG: collector is", collector)
    # This is the call that fails without a user token when the AP is version 23.
    schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, submithost)
    print("DEBUG: schedd_ad is", schedd_ad)

upload_file()


This code works if both the AP and EP are version 10.  But if the AP is version 23 then it fails whether the EP is version 10 or version 23.  It works with version 23 only if I have a ~/.condor/tokens.d/nmpost token.  Why do I need a user token to run collector.locate against a schedd?

I was going to test this on CHTC but I can't seem to get an interactive job on CHTC anymore.

DONE: send greg error output and security config


Virtual memory vs RSS

Looks like condor is reporting RSS but that may actually be virtual memory.  At least according to Felipe's tests.

ANSWER: On the nmpost cluster condor runs as root, so it has full access to the cgroup information and reports RSS accurately.  But on systems using glidein, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may differ from memory reporting on the nmpost cluster.  For glide-in jobs condor reports the virtual memory summed across all the processes in the job.


CPU usage

Felipe has had jobs put on hold for too much cpu usage.

runResidualCycle_n4.imcycle8.condor.log:012 (269680.000.000) 2024-07-18 17:17:03 Job was held.

runResidualCycle_n4.imcycle8.condor.log- Excessive CPU usage. Please verify that the code is configured to use a limited number of cpus/threads, and matches request_cpus.

GREG: Perhaps only some machines in the OSPool have checks for this and may be doing something wrong or strange.

2024-09-16: Felipe asked about this again.


Missing batch_name

A DAG job, submitted with hundreds of others, doesn't show a batch name in condor_q, just DAG: 371239.  It is just the one job; all the others submitted from the same template do show batch names.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T17t27.J201445+263000_P172318v1_2024_07_12T16_40_09.270

nmpost-master krowe >condor_q -dag -name mcilroy -g -all

...

vlapipe  vlass_ql.dag+370186            7/16 10:30      1      1      _      _      3 370193.0

vlapipe  vlass_ql.dag+370191            7/16 10:31      1      1      _      _      3 370194.0

vlapipe  DAG: 371239                    7/16 10:56      1      1      _      _      3 371536.0

...


GREG: Probably a condor bug.  Try submitting it again to see if the name is missing again.

WORKAROUND: condor_qedit job.id JobBatchName '"asdfasdf"'


DAG failed to submit

Another DAG job that was submitted along with hundreds of others looks to have created vlass_ql.dag.condor.sub but never actually submitted the job.  condor.log is empty.

/lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS3.2_T18t13.J093830+283000_P175122v1_2024_07_06T16_33_34.742

ANSWERS: Perhaps the schedd was too busy to respond.  Does the workflow container need more resources?

Need to handle error codes from condor_submit_dag.  0 good.  1 bad. (chausman)

Setup /usr/bin/mail on mcilroy so that it works.  Condor will use this to send mail to root when it encounters an error.  Need to submit jira ticket to SSA. (krowe)


Resubmitting Jobs

I have an example in 

/lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1_T10t30.J194602-033000_P161384v1_2020_08_15T01_21_14.433

of a job that failed on nmpost106 and was then resubmitted by HTCondor on nmpost105.  The problem is that the job actually did finish; it only got an error transferring back its files.  So when the job was resubmitted, it copied over an almost-complete run of CASA, which makes a mess of things.  I would rather HTCondor just fail and not resubmit the job.  How can I do that?

022 (167287.000.000) 2023-12-24 02:43:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_1@nmpost106.aoc.nrao.edu <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
...
023 (167287.000.000) 2023-12-24 02:43:57 Job reconnected to slot1_1@nmpost106.aoc.nrao.edu
    startd address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=startd_5795_776a>
    starter address: <10.64.2.180:9618?addrs=10.64.2.180-9618&alias=nmpost106.aoc.nrao.edu&noUDP&sock=slot1_1_39813_9c2c_400>
...
007 (167287.000.000) 2023-12-24 02:43:57 Shadow exception!
        Error from slot1_1@nmpost106.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
...
040 (167287.000.000) 2023-12-24 02:45:09 Started transferring input files
        Transferring to host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=slot1_13_163338_25ab_452>
...
040 (167287.000.000) 2023-12-24 03:09:22 Finished transferring input files
...
001 (167287.000.000) 2023-12-24 03:09:22 Job executing on host: <10.64.2.178:9618?addrs=10.64.2.178-9618&alias=nmpost105.aoc.nrao.edu&noUDP&sock=startd_5724_c431>

ANSWER: Maybe 

on_exit_hold = some_expression

periodic_hold = NumShadowStarts > 5

periodic_hold = NumJobStarts > 5

or a startd cron job that checks for IdM and offlines the node if needed

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#custom-and-system-slot-attributes

https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#startd-cron
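Putting those knobs together, a hedged submit-file sketch (the limits of 3 and 5 are arbitrary, and periodic_hold_reason is optional) that holds the job instead of letting it be rerun indefinitely:

```
# Hold the job rather than rerun it after repeated starts or shadow restarts
periodic_hold = (NumJobStarts > 3) || (NumShadowStarts > 5)
periodic_hold_reason = "Too many restarts; check output transfer before releasing"
```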



Constant processing

Our workflows have a process called "ingestion" that puts data into our archive.  There are almost always ingestion processes running or needing to run and we don't want them to get stalled because of other jobs.  Both ingestion and other jobs are the same user "vlapipe".  I thought about setting a high priority in the ingestion submit description file but that won't guarantee that ingestion always runs, especially since we don't do preemption.  So my current thinking is to have a dedicated node for ingestion.  Can you think of a better solution?

So on the node I would need to set something like the following

# High priority only jobs

HIGHPRIORITY = True

STARTD_ATTRS = $(STARTD_ATTRS) HIGHPRIORITY

START = ($(START)) && (TARGET.priority =?= "HIGHPRIORITY")

Nov. 13, 2023 krowe: I need to implement this.  Make a node a HIGHPRIORITY node and have SSA put HIGHPRIORITY in the ingestion jobs.

2024-02-01 krowe: Talked to chausman today.  She thinks SSA will need this and that the host will need access to /lustre/evla like aocngas-master and the nmngas nodes do.  That might also mean a variable like HASEVLALUSTRE as well or instead of HIGHPRIORITY.
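For reference, the submit-description side that would match the START expression above might look like the following sketch (the attribute spelling has to match the machine config exactly):

```
# Ingestion job: advertise the custom attribute the dedicated node matches on
+priority = "HIGHPRIORITY"
# And require one of those nodes so the job doesn't land elsewhere
requirements = (HIGHPRIORITY =?= True)
```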






In progress

condor_remote_cluster

CHTC

000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
        Failed to start GAHP: Missing remote command\n
        Code 0 Subcode 0
...
testpost-master krowe >cat condor.902.log 
000 (902.000.000) 2023-04-14 16:40:37 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (902.000.000) 2023-04-14 16:40:41 Job was held.
        Failed to start GAHP: Agent pid 3145812\nPermission denied (gssapi-with-mic,keyboard-interactive).\nAgent pid 3145812 killed\n
        Code 0 Subcode 0
...

PATh

000 (901.000.000) 2023-04-14 16:31:38 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2269692_816e>
...
012 (901.000.000) 2023-04-14 16:31:41 Job was held.
        Failed to start GAHP: Missing remote command\n
        Code 0 Subcode 0
...

Radial

It works but seems to leave a job on the radial cluster for about 30 minutes.

[root@radialhead htcondor-10.0.3-1]# ~krowe/bin/condor_qstat 
JobId     Owner    JobBatchName       CPUs JS Mem(MB)  ElapTime    SubmitHost       Slot     RemoteHost(s)       
--------- -------- ------------------ ---- -- -------- ----------- ---------------- -------- --------------------
       99 nrao                        1    C      1024 0+0:13:22   radialhead.nrao.                              

ANSWER: Greg will look into it.

condor_on

Once an HTCondor node is set to offline, if you then set it online, it will remain in the Retiring state until all its jobs finish.  For example:


testpost-master root >condor_off -startd -peaceful -name testpost001

testpost-master root >condor_status 
Name                             OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@testpost001.aoc.nrao.edu   LINUX      X86_64 Unclaimed Idle      0.000 515351 13+00:25:08
slot1_1@testpost001.aoc.nrao.edu LINUX      X86_64 Claimed   Retiring  0.010    128  0+00:00:00


testpost-master root >condor_on -startd -name testpost001

testpost-master root >condor_status 
Name                             OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@testpost001.aoc.nrao.edu   LINUX      X86_64 Unclaimed Idle      0.000 515351 13+00:26:15
slot1_1@testpost001.aoc.nrao.edu LINUX      X86_64 Claimed   Retiring  0.000    128  0+00:01:07

This is not critical or a problem, just an observation.

ANSWER: Greg will look into this it may be a bug.

ANSWER: Greg thought there was a -peaceful option to condor_drain but hasn't found it yet.  That would be the better solution for us.


Miscount of Idle DAGs

condor_q doesn't seem to be showing idle DAGs (condor_dagman) as idle jobs in the totals at the bottom of the output.

If I set MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER = 1 and submit two DAGs, one will have status Running and the other will have status Idle according to condor_q -g -all -nobatch -dag.  But the totals at the bottom of the condor_q command will show 0 idle jobs.  Is this correct?  Is this because DAGs are in a different universe or not seen as "real" jobs?

testpost-master krowe >condor_q -g -all -nobatch -dag


-- Schedd: testpost-master.aoc.nrao.edu : <10.64.1.178:9618?... @ 02/27/23 15:01:40
 ID      OWNER/NODENAME      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 671.0   krowe              2/27 15:01   0+00:00:39 R  0    0.3 condor_dagman -p 0 -f -l . -Lockfile small.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag small.dag -Suppress_notification -CsdVersion $CondorVersion:
 672.0    |-node01          2/27 15:01   0+00:00:20 R  0    0.0 sleep 127
 673.0   krowe              2/27 15:01   0+00:00:00 I  0    0.3 condor_dagman -p 0 -f -l . -Lockfile small.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag small.dag -Suppress_notification -CsdVersion $CondorVersion:
 674.0   krowe              2/27 15:01   0+00:00:00 I  0    0.3 condor_dagman -p 0 -f -l . -Lockfile small.dag.lock -AutoRescue 1 -DoRescueFrom 0 -Dag small.dag -Suppress_notification -CsdVersion $CondorVersion:

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended 
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

ANSWER: Greg thinks this is a bug and will report it.


condor_q -all -nobatch -const "JobUniverse == 7" 

When I run condor_q -all -nobatch -const "JobUniverse == 7" at CHTC I see

29 jobs; 0 completed, 0 removed, 0 idle, 28 running, 1 held, 0 suspended

When I run it at NRAO (when I am running one DAG) I see

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 

Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

When I run it at PATh (when I am running one DAG) I see

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 

Total for all users: 4 jobs; 0 completed, 0 removed, 2 idle, 0 running, 2 held, 0 suspended

ANSWER: Greg thinks this is a bug and will report it.


Nvidia GPUDirect Storage?

https://developer.nvidia.com/blog/gpudirect-storage/ (gds-tools) doesn't seem available on PATh GPU nodes.  Should it be?  Would it help us with our performance issues?

ANSWER: No but Greg will suggest it be installed.


Singularity at PATh

My singularity jobs run but get the following error output

INFO  Discarding path '/hadoop'. File does not exist

INFO  Discarding path '/ceph'. File does not exist

INFO  Discarding path '/hdfs'. File does not exist

INFO  Discarding path '/lizard'. File does not exist

INFO  Discarding path '/mnt/hadoop'. File does not exist

INFO  Discarding path '/mnt/hdfs'. File does not exist

WARNING: Environment variable HAS_SINGULARITY already has value [True], will not forward new value [1] from parent process environment

WARNING: Environment variable REQUIRED_OS already has value [default], will not forward new value [] from parent process environment

/srv/.gwms-user-job-wrapper.sh: line 882: /usr/bin/singularity: No such file or directory

WORKAROUND:

universe = vanilla

+SingularityImage = "/path/to/myimage"

or

+SingularityImage = "docker://debian"

2023-02-06 krowe: I sent mail to Christina about this and she also suggested the workaround.  She said they are working on a proper fix.

Blocking on upload

Don't have condor block on the transfer plugin's upload; it already doesn't block on download.  When it blocks on a large upload, the job may get killed unless NOT_RESPONDING_TIMEOUT is set to something larger than the 3600-second default.

stdout and stderr with plugins

When using a transfer plugin to transfer output files, stdout and stderr are copied back as _condor_stdout and _condor_stderr.  It doesn't rename them to what output and error are set to in the submit description file.  If I use a transfer plugin for just input files and not output files, then stdout and stderr are copied back as requested in the submit description file.

This seems like a bug to me.  Since my plugin isn't transferring these files, HTCondor is, so HTCondor should honor what is set in the submit description file whether I am using a transfer plugin or not.

Perhaps have rsync transfer these instead?  Or use another custom classad instead of output_destination?  Or what if I put _condor_stdout in +nrao_output_files?

Actually, it doesn't seem to be triggered by just output_destination.  If I set output_destination = $ENV(PWD) and don't use a plugin for output files, I get stdout.jobid.log like I requested.

From coatsworth@cs.wisc.edu Mon Nov 29 12:26:12 2021

I've looked into this in the file transfer code. On the execution side, we always write stdout and stderr to the _condor_stdout and _condor_stderr files, then we remap them back to user-provided names after a job completes. When you have output_destination set, our File Transfer mechanism does not send files back to the submit machine by default. However since your plugin is explicitly rsync-ing files back there, they get moved without going through the remapping.

I think your File Transfer Mechanism does send files back to the submit machine by default.  My transfer plugin is not transferring _condor_stderr nor _condor_stdout.

Apr. 22, 2022 krowe: Actually I think that since I set *output_destination = nraorsync://...* in the submit description file, the FTM *is* using the plugin.  It has to because output_destination requires that everything use the plugin.  So the FTM calls the plugin with _condor_stdout and _condor_stderr which activates the upload_file() function in the plugin and the files are copied.  This is why they aren't remapped.  I configured the plugin to also copy these files and rename them.  Perhaps instead I could watch for them in upload_file() and remap them there?
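That remap could be done inside upload_file() with a small helper.  A sketch, assuming the job ad's Out and Err attributes carry the names from the submit description file (the function name is made up):

```python
import os

def remap_sandbox_name(local_name, job_ad):
    """Map the sandbox names _condor_stdout/_condor_stderr back to the
    names the submit description file asked for (the Out/Err job-ad
    attributes), i.e. the remapping the shadow would normally have done."""
    mapping = {"_condor_stdout": job_ad.get("Out"),
               "_condor_stderr": job_ad.get("Err")}
    requested = mapping.get(os.path.basename(local_name))
    return requested or local_name
```

The plugin would call this on each LocalFileName before rsync-ing, so the files land at the destination under their requested names.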

ANSWER: Greg is going to tell Mark to put all this plugin work on the back burner or maybe stop altogether.  Our plugin works with its work-arounds so this stuff is not critical.

nraorsync_plugin.py

Since HTCondor walks the directories in transfer_output_files and passes the files to the plugin one at a time, which doesn't work well with rsync, we decided to work around the problem.

# Trick HTCondor into launching the plugin to handle output files

transfer_output_files = .job.ad

# custom job ad of files/dirs using nraorsync_plugin.py

+nrao_output_files = "software data"

Call our upload_rsync() function before calling upload_file() in main()

with open(args['outfile'], 'w') as outfile:

    # krowe Oct 28 2021:
    if args['upload']:
        if upload_rsync() != 0:
            raise RuntimeError('upload_rsync() failed')

    for ad in infile_ads:

ANSWER: I explained this to CHTC.  They think it is at least an elegant hack. :^)

Transfer Plugin Upload

Working with Mark Coatsworth on this.

I have added my nraorsync_plugin.py to /usr/libexec/condor on the execution host and added the following configuration to the execution host:

FILETRANSFER_PLUGINS = $(LIBEXEC)/nraorsync_plugin.py, $(FILETRANSFER_PLUGINS)

I have the following job:

#!/bin/sh

mkdir newdir

date > newdir/date

/bin/sleep ${1}

and the following submit file:

executable = smaller.sh
arguments = "27"
output = stdout.$(ClusterId).log
error = stderr.$(ClusterId).log
log = condor.$(ClusterId).log

should_transfer_files = YES
transfer_input_files = /users/krowe/.ssh/condor_transfer
transfer_output_files = newdir
output_destination = nraorsync://$ENV(PWD)
+WantIOProxy = True

queue

The resulting input file that is fed to my plugin when the plugin is called with the -upload argument (.nraorsync_plugin.in) contains this:

[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stderr"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stderr" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stdout"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stdout" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/newdir/date"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/newdir/date" ]

I am surprised to see that it sets LocalFileName and Url to the file inside newdir instead of newdir itself.  Needless to say, this makes rsync unhappy as newdir doesn't exist on the destination yet.
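For debugging, the .in file can be picked apart without the classad module.  A minimal stand-in parser, assuming each ad carries only the LocalFileName and Url attributes in that order:

```python
import re

# Matches ads of the form [ LocalFileName = "..."; Url = "..." ]
AD_RE = re.compile(r'\[\s*LocalFileName\s*=\s*"([^"]*)";\s*Url\s*=\s*"([^"]*)"\s*\]')

def parse_transfer_ads(text):
    """Return a list of {LocalFileName, Url} dicts from a plugin .in file."""
    return [{"LocalFileName": local, "Url": url}
            for local, url in AD_RE.findall(text)]
```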

If I create 'newdir' in the destination directory before submitting the job, the plugin will correctly copy the 'date' file back to the 'newdir' directory but the condor log file shows the following:

022 (4149.000.000) 08/05 09:22:04 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_4@testpost003.aoc.nrao.edu <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
...
023 (4149.000.000) 08/05 09:22:04 Job reconnected to slot1_4@testpost003.aoc.nrao.edu
startd address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
starter address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_4_28601_bde4_612>
...

condor re-runs the upload portion of the plugin four more times before finally giving up with this error

007 (4149.000.000) 08/05 09:22:31 Shadow exception!
Error from slot1_4@testpost003.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
1007 - Run Bytes Received By Job

If I create a file like 'outputfile' instead of 'newdir' and transfer that, everything works fine.

I have an example in /home/nu_kscott/htcondor/plugin_small

ANSWER: Greg will look into this.  K. Scott is working with Mark Coatsworth on this.


condor_off vs condor_drain

I would like to be able to issue a command to an execute host telling it to stop accepting new jobs and let the current jobs finish.  I would also like that host to stay in the condor_status output with a message indicating what I have done (i.e. draining, offline, etc)  I think I want something that does some of condor_off and some of condor_drain.  Is there such a beast?

For example, a -peaceful option to condor_drain might be perfect.

condor_off

condor_drain

ANSWER: condor_status -master and use condor_off

ANSWER: Greg thinks condor_drain should have a -peaceful option.  (bug)

2022-10-05 krowe: HTCondor 9.12.0 "Added -drain option to condor_off and condor_restart".  I think this might be the solution I wanted they just went in a different direction.  Instead of 'condor_drain -peaceful' there is now a 'condor_off -drain'.  The feature isn't in the LTS release yet.  Perhaps it will be in 9.0.18.  Then I will test it.

2023-07-20 krowe: condor_off now (as of some version before 10.0.2) has a -reason option.  You can see the -reason and -drain options with --help but they aren't in the man page.

2023-07-20 krowe: I tried condor_off with HTCondor-10.0.2.  It now has the -drain option which seems to kill jobs.  Not what I was expecting.  Also, the -reason option can only be used with the -drain option.  When I use condor_on to put the node back into the cluster, it restarted my job.  Again, I think I really want a -peaceful argument to condor_drain.  That would be the best solution for me.


Show offline nodes

Say I set a few nodes to offline with a command like condor_off -startd -peaceful -name nmpost120.  How can I later check which nodes are offline?

ANSWER: 2022-06-27

condor_status -const 'Activity == "Retiring"'

There are also offline ads, which are a way for HTCondor to update the status of a node after the startd has exited.

condor_drain -peaceful # CHTC is working on this.  I think this might be the best solution.



Glidein

The only documentation I can find on glidein (https://htcondor.readthedocs.io/en/latest/grid-computing/introduction-grid-computing.html?highlight=glidein#introduction) seems to imply that glidein only works with Globus: "HTCondor permits the temporary addition of a Globus-controlled resource to a local pool. This is called glidein."  Is this correct?  Is there better documentation?  Is glidein even a technology or software package, or is it just a generic term?

ANSWER: Greg will look at rewriting this.


request_virtualmemory

If I set request_virtualmemory = 2G, condor_submit accepts it as a valid knob but the job stays idle and never runs.

request_memory = 1G
request_virtualmemory = 2G

If I set request_virtualmemory = 2000000, which should be the same as 2G, the job runs but doesn't set memory.memsw.limit_in_bytes in the cgroup.

Oct. 11, 2021 krowe: Checked with HTCondor-9.0.6.  Problem still exists unchanged.

ANSWER: krowe sent mail to Greg about it




Answered Questions

10/20/20 08:54:36 From submit: ERROR: on Line 9 of submit file:
10/20/20 08:54:36 From submit: Submit:-1:Error "", Line 0, Include Depth 1: can't open file
10/20/20 08:54:36 From submit:
10/20/20 08:54:36 From submit: ERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 failed while reading from pipe.
10/20/20 08:54:36 Read so far: Submitting job(s)ERROR: on Line 9 of submit file: Submit:-1:Error "", Line 0, Include Depth 1: can't open fileERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 ERROR: submit attempt failed
10/20/20 11:58:58 From submit: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 failed while reading from pipe.
10/20/20 11:58:58 Read so far: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 ERROR: submit attempt failed


Nodesfree

How can one see nodes that are entirely unclaimed?

SOLUTION: condor_status -const 'PartitionableSlot && Cpus == TotalCpus'


HERA queue

I want a proper subset of machines dedicated to the HERA project. These machines will only run HERA jobs, and HERA jobs will only run on these machines.  This seems to work, but is there a better way?

machine config:

HERA = True

STARTD_ATTRS = $(STARTD_ATTRS) HERA

START = ($(START)) && (TARGET.partition =?= "HERA")

submit file:

requirements = (HERA == True)

+partition = "HERA"

SOLUTION: yes, this is good.  Submit Transforms could also be set on herapost-master (Submit Host)

https://htcondor.readthedocs.io/en/latest/misc-concepts/transforms.html?highlight=submit%20transform


Reservations

What if you know certain nodes will be unavailable for a window of time say the second week of next month.  Is there a way to schedule that in advance in HTCondor?  For example in Slurm

scontrol create reservation starttime=2021-02-8T08:00:00 duration=7-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022

ANSWER: HTCondor doesn't have a feature like this.


Bug: All on one core

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22

From jrobnett@nrao.edu Tue Nov 10 16:38:18 2020

As (bad) luck would have it I had some jobs running where I forgot to set the number of cores, so they triggered the behavior.

Sshing into the node I see three processes sharing the same core and the following for the 3 python processes:

bash-4.2$ cat /proc/113531/status | grep Cpus
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list:      0

If I look at another node with 3 processes where they aren't sharing the same core I see:

bash-4.2$ cat /proc/248668/status | grep Cpu
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00555555
Cpus_allowed_list:      0,2,4,6,8,10,12,14,16,18,20,22

Dec. 8, 2020 krowe: I launched five sqrt(rand()) jobs and each one landed on its own CPU. 

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0

and the masks aren't restricting them to specific cpus.  So I am yet unable to reproduce James's problem.

st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447

We can reproduce this without HTCondor.  So this is either being caused by our mpicasa program or the openmpi libraries it uses.  Even better, I can reproduce this with a simple shell script executed from two shells at the same time on the same host.  Another MPI implementation (mvapich2) didn't show this problem.

#!/bin/sh
export PATH=/usr/lib64/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:${LD_LIBRARY_PATH}
mpirun -np 2 /users/krowe/work/doc/all/openmpi/busy/busy


Array Jobs

Does HTCondor support array jobs like Slurm? For example in Slurm #SBATCH --array=0-3%2 or is one supposed to use queue options and DAGMan throttling?

ANSWER: HTCondor does reduce the priority of a user the more jobs they run so there may be less need of a maxjob or modulus option.  But here are some other things to look into.

https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#throttling-nodes-by-category

queue from seq 10 5 30 |

queue item in 1, 2, 3


combined cluster (Slurm and HTCondor)

Slurm starts and stops condor.  CHTC does this because their HTCondor can preempt jobs.  So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.


Node Priority

Is there a way to set an order to which nodes are picked first or a weight system?  We want certain nodes to be chosen first because they are faster, or have less memory or other such criteria.

NEGOTIATOR_PRE_JOB_RANK on the negotiator


HPC Cluster

Could I have access to the HPC cluster?  To learn Slurm.

ANSWER: https://chtc.cs.wisc.edu/hpc-overview  I need to login to submit2 first but that's fine.

How does CHTC keep shared directories (/tmp, /var/tmp, /dev/shm) clean with Slurm?

ANSWER: CHTC doesn't do any cleaning of shared directories, but they suggested looking at https://derekweitzel.com/2016/03/22/fedora-copr-slurm-per-job-tmp/  I don't know if that plugin will clean files created by an interactive ssh, but I suspect it won't, because it is a Slurm plugin and ssh'ing to the host is outside Slurm's control except for pam_slurm_adopt adding you to the cgroup.  So I may still need a reaper script to keep these directories clean.
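Such a reaper could be little more than a cron job that finds stale files.  A rough sketch (real use would need exclusions for sockets, X11 files, active users, and so on):

```python
#!/usr/bin/env python3
# Rough sketch of a /tmp reaper: find regular files older than max_age_days.
import os
import time

def find_stale_files(top, max_age_days=7, now=None):
    """Return paths under 'top' whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                pass  # file vanished while we were walking
    return stale
```

A nightly cron job would call this on /tmp, /var/tmp, and /dev/shm and unlink (or just report) what it finds.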


vmem exceeded in Torque

We have seen a problem in Torque recently that reminds us of the memory fix you recently implemented in HTCondor.  What that fix related to any recent changes in the Linux kernel or was it a pure HTCondor bug?  What was it that you guys did to fix it?

ANSWER: There are two problems here.  The first is the short read, whose root cause we are still trying to understand.  We've worked around it in the short term by re-polling when the number of processes we see drops by 10% or more.  The other problem is that when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat.  Memory is categorized into a number of different types in this file, and we were omitting some types when summing up the total.

cpuset issues

ANSWER: git bisect could be useful.  Maybe we could ask Ville.

Distant execute nodes

Are there any problems having compute nodes at a distant site?

ANSWER: no intrinsic issues.  Be sure to set requirements.


Memory bug fix?

What version of condor has this fix?

ANSWER: 8.9.9

When is it planned for 8.8 or 9.x inclusion?

ANSWER: 9.0 in Apr. 2021

Globus

You mentioned that the globus RPMs are going away.  Yes?

ANSWER: They expect to drop globus support in 9.1 around May 2021.

VNC

Do you have any experience using VNC with HTCondor?

ANSWER: no, they don't have experience with this.  But mount_under_scratch= will use the real /tmp


Which hosts do the flocking?

Lustre is going to be a problem.  Our new virtual CMs can't see lustre.  Can just a submit host see lustre and not the CM in order to flock?

ANSWER: Only submit machines need to be configured to flock.  It goes from a local submit host to a remote CM.  So we could keep gibson as a flocking submit host.  This means the new CMs don't need the firewall rules.


Transfer Mechanism Plugin


Containers


Remote

condor_submit -remote  what does it do?  The manpage makes me think it submits your job using a different submit host but when I run it I get lots of authentication errors. Can it not use host-based authentication (e.g. ALLOW_WRITE = *.aoc.nrao.edu)?

Here is an example of me running condor_submit on one of our Submit Hosts (testpost-master) trying to remote to our Central Manager (testpost-cm) which is also a submit host.

condor_submit -remote testpost-cm tiny.htc
Submitting job(s)
ERROR: Failed to connect to queue manager testpost-cm-vml.aoc.nrao.edu
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There
is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

ANSWER:

condor_submit -remote does indeed tell the condor_submit tool to submit to a remote schedd (it also implies -spool).

Because the schedd can run the job as the submitting owner, and runs the shadow as the submitting owner, the remote schedd needs to not just authorize the remote user to submit jobs, but must authenticate the remote user as some allowed user.

Condor's IP host-based authentication is really just authorization: it can say "everyone coming from this IP address is allowed to do X", but it doesn't know who that entity is.

So, for remote submit to work, we need some kind of authentication method as well, like IDTOKENS, munge.


Authentication


HTcondor+Slurm


Transfer Plugin Order

HTCondor guarantees that the condor file transfer happens before the plugin transfer, but only when using the "multi-file" plugin style,
like we have in our curl plugin.  If you used the curl plugin as the model for rsync, you should be good.


AMQP

The AMQP gateway that we had developed was called Qpid, and worked by tailing the user job log and turning it into qpid events.  I suspect there are also ways to have condor plugins directly send AMQP events as well.


CPU Shares

Torque uses cpusets, which is pretty straightforward, but HTCondor uses cpu.shares, which confuses me a bit.  For example, a job with request_cpus = 8 executing on a 24-core machine gets cpu.shares = 800.  If there are no other jobs on the node, does this job essentially get more CPU time than its 800/1024 share?

ANSWER: yes, it is opportunistic.  If there are no other jobs running on a node you essentially get the whole node.


Nodescheduler

We found a way to implement our nodescheduler script in Slurm using the --exclude option.  Is there a way to exclude certain hosts from a job?  Or perhaps a constraint that prevents a job from running on a node that is already running a job of that user?  Is there a better way than this?

requirements = Machine != "nmpost097.aoc.nrao.edu" && Machine != "nmpost119.aoc.nrao.edu"

badmachines=one+two+three

requirements not in $(badmachines)

I didn't get the actual syntax from Greg and I am apparently not able to look it up.  The long syntax I suggested should work; I just don't know what Greg's more efficient syntax is.
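
One plausible compact form uses the standard ClassAd function stringListMember (this is my guess at what Greg meant, not his actual syntax):

```
# Submit description file: exclude a comma-separated list of machines
requirements = !stringListMember(Machine, "nmpost097.aoc.nrao.edu,nmpost119.aoc.nrao.edu")
```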


condor_ssh_to_job

Is there a way to use condor_ssh_to_job to connect to a job submitted from a different submit host (schedd) or do you have to run it from the submit host used to submit the job?  I have tried using the -name option to condor_ssh_to_job but I always get Failed to send GET_JOB_CONNECT_INFO to schedd

ANSWER: idtokens.  Host-based and poolpassword are not sufficient to identify users and allow for this (and probably condor_submit -remote).


HTCondor Workshop vs Condor Week

ANSWER: Essentially it is "Condor Week Europe".  Mostly the same talks but different customer presentations.  Could be interesting for the different customer presentations.


Shutdown

STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600

MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0

But I was running a job when it shut down.

07/19/21 11:45:01 The DaemonShutdown expression "State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600" evaluated to TRUE: starting graceful shutdown

Could this be because we use dynamic slots?

testpost-cm-vml krowe >condor_status
Name                             OpSys Arch   State     Activity LoadAv Mem

slot1@testpost001.aoc.nrao.edu   LINUX X86_64 Unclaimed Idle     0.000  193
slot1_1@testpost001.aoc.nrao.edu LINUX X86_64 Claimed   Busy     0.000
slot1@testpost002.aoc.nrao.edu   LINUX X86_64 Unclaimed Idle     0.000  144
slot1_1@testpost002.aoc.nrao.edu LINUX X86_64 Claimed   Busy     0.810   49
slot1@testpost003.aoc.nrao.edu   LINUX X86_64 Unclaimed Idle     0.000  193

I see that with dynamic slots, the parent slot (slot1) seems always Unclaimed and Idle while the child slots (slot1_1) are Claimed and Busy.  So I tried checking the ChildState attribute, which looks to be a list but doesn't behave like one.  For example, none of these show any slots

condor_status -const 'ChildState == { "Claimed" }'

condor_status -const 'sum(ChildState) == 0'

Even though this produces true

classad_eval 'a = { }' 'sum(a) == 0'

ANSWER: Try this

condor_status -const 'size(ChildState) == 0'
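
Folding that into the shutdown expression might look like this (untested sketch; size(ChildState) == 0 should mean the partitionable slot has no dynamic children):

```
STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && \
    size(ChildState) == 0 && (MyCurrentTime - EnteredCurrentActivity) > 600
```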

HTCondor and Slurm

NRAO has effectively two use cases:  1) Operations triggered jobs.  These are well formulated pipeline jobs, they're still fairly monolithic and long running (many hours to a few days).   2) User triggered jobs, which are of course not well formulated.  We will be moving the operations jobs to htcondor.   We plan to move the user triggered jobs to SLURM from Torque.   There's enough noise in the two job loads that we don't want to have strict host carve-outs for type 1 and type 2 jobs.  What we anticipate doing is having a set of nodes known only to htcondor for the bulk of operations and a set of hosts controlled by SLURM for the user facing jobs.   Periodically, when they have a large set of operations jobs, we'd like for them to burst into the SLURM controlled nodes.  We neither anticipate nor want the slurm jobs to burst into the htcondor set of nodes.

Say we have two clusters (HTCondor and Slurm) and both can be submitted to from the same host.  We want the HTCondor jobs to use the Slurm cluster resources when the HTCondor cluster resources are full, but we probably don't want to support preemption.  How could we have HTCondor submit jobs to a Slurm cluster?  (HTCondor-C, flocking, overlapping, batch-grid-type, HTCondor-CE, etc)

ANSWER: write our own 'factory' that watches HTCondor and, when it is full, submits Pilot jobs to Slurm that launch startd daemons, thus allowing the Payload jobs waiting in HTCondor to run.  We will want to set the startd to exit after being idle for a little while, run the Pilot job as root, and figure out how to do cgroups properly.
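
For the "exit after being idle for a little while" part, the startd inside the pilot could be configured with something like the following (the knobs are standard, the timeout value is a guess):

```
# Pilot startd config sketch: exit if unclaimed for 20 minutes,
# and have the master shut down once the startd is gone.
STARTD_NOCLAIM_SHUTDOWN = 1200
MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0
```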


Shadow jobs and Lustre

We had some jobs get restarted because they lost contact with their shadow jobs.  I assume this is because the shadow jobs keep the condor.log file open and if that file is on Lustre and Lustre goes down then the shadow job fails to communicate with the job and the job gets killed.   Does that seem accurate to you?

nmpost-master root >ps auxww|grep shadow|grep krowe

krowe 1631810 0.0 0.0 38708 3676 ? S 09:29 0:00 condor_shadow -f 486.0 --schedd=<10.64.10.100:9618?addrs=10.64.10.100-9618&noUDP&sock=5837_96cc_3> --xfer-queue=limit=upload,download;addr=<10.64.10.100:14115> <10.64.10.100:14115> -

nmpost-master root >ls -la /proc/1631810/fd

total 0

dr-x------ 2 root root 0 Jul 27 09:29 ./

dr-xr-xr-x 8 krowe nmstaff 0 Jul 27 09:29 ../

lr-x------ 1 root root 64 Jul 27 09:29 0 -> pipe:[16358528]

lr-x------ 1 root root 64 Jul 27 09:29 1 -> pipe:[16358540]

lrwx------ 1 root root 64 Jul 27 09:29 18 -> socket:[16358529]

l-wx------ 1 root root 64 Jul 27 09:29 2 -> pipe:[16358540]

l-wx------ 1 root root 64 Jul 27 09:29 3 -> /lustre/aoc/sciops/krowe/condor.486.log

lrwx------ 1 root root 64 Jul 27 09:29 4 -> socket:[16358542]

Here are some logs of a failed job

07/26/21 14:38:38 (479.0) (1188418): Job 479.0 is being evicted from slot1_1@nmpost114.aoc.nrao.edu
07/26/21 14:38:38 (479.0) (1188418): logEvictEvent with unknown reason (108), not logging.
07/26/21 14:38:38 (479.0) (1188418): **** condor_shadow (condor_SHADOW) pid 1188418 EXITING WITH STATUS 108

Exit Code 108 = can not connect to the condor_startd or request refused

2021-07-26 14:16:39 (pid:91673) Lost connection to shadow, waiting 2400 secs for reconnect


ANSWER: Greg thinks this is an accurate description of the problem.  Greg thinks this 2400-second timeout may be adjustable, but do we want to?  How long is long enough?  Two choices: 1) decide we don't care, or 2) write log files to something other than Lustre.
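
If we went with choice 2, it would just be a matter of pointing the job's log at a non-Lustre filesystem in the submit description file (the path here is illustrative):

```
# Keep the shadow's open log file off Lustre
log = /var/tmp/condor/condor.$(Cluster).log
```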


Rebooting Submit Host

What happens to running jobs if the submit host reboots?  Shadow processes?  What if the submit host is replaced with a new server?  I think we have shown there is a 2400-second (40-minute) timeout.

ANSWER: state files are in $(condor_config_val SPOOL) and you only have 40 minutes by default and that timeout is set at job submission time.


Chirp in upload_file

While I seem to be able to use chirp in the download_file() function of a plugin, I cannot seem to use it in the upload_file() portion.  Something like the following will produce a line in the condor log file from download_file(), but not when executed from the upload_file() function.  This I have tested at CHTC.

import subprocess

message = 'in upload_file()'
subprocess.call(['/usr/libexec/condor/condor_chirp', 'ulog', message])

I have an example in /home/nu_kscott/htcondor/plugin_small

The plugin hangs during the output transfer and the processes running on the execution host look like this

krowe 36107 0.0 0.0 58420 8128 ? Ss 13:44 0:00 condor_starter -f -local-name slot_type_1 -a slot1_1 testpost-master.aoc.nrao.edu
krowe 36571 0.8 0.0 182728 15004 ? S 13:46 0:00 /usr/bin/python3 /usr/libexec/condor/nraorsync_plugin.py -infile /lustre/aoc/admin/tmp/condor/testpost002/execute/dir_36107/.nraorsync_plugin.py.in -outfile /lustre/aoc/admin/tmp/condor/testpost002/execute/dir_36107/.nraorsync_plugin.py.out -upload
krowe 36572 0.0 0.0 17084 1288 ? S 13:46 0:00 /usr/libexec/condor/condor_chirp ulog in upload_file()

If I kill the condor_chirp process (36572), the plugin moves on to the next file to upload, at which point it runs condor_chirp again and hangs again.  If I keep killing the condor_chirp processes, eventually the job finishes properly.


ANSWER: Greg looked into this and said there is no good workaround.  "This is simply a deadlock between chirp and the file transfer plugin.  When transfering the output sandbox back to the submit machine, the HTCondor starter runs the file transfer code synchronously wrt the starter (it forks to do this while transfering the input sandbox...), and the starter also handles chirp calls."


Timeout

What is the timeout setting called and can we increase it?  Is it JobLeaseDuration?  Can it be altered on a running job?

ANSWER: yes, it is JobLeaseDuration, and it can be changed on the execution host
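
For a job that is already running, condor_qedit can change the attribute in the job ad (whether the already-running shadow and starter honor the new value is a separate question):

```
# Raise the lease on job 486.0 to two hours
condor_qedit 486.0 JobLeaseDuration 7200
```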



condor_gpu_discovery

I can't find the condor_gpu_discovery on my cluster (HTCondor-9.0.4) or CHTC (9.1.4) even on a GPU host.

ANSWER: /usr/libexec/condor/condor_gpu_discovery


idtokens with RPMs

It seems that installing HTCondor-9.0.4 via RPMs doesn't automatically create a signing key in /etc/condor/passwords.d/POOL like the documentation says https://htcondor.readthedocs.io/en/latest/admin-manual/security.html?highlight=idtokens#quick-configuration-of-security

Also with the RPM install, ALLOW_WRITE = *, which seems insecure.  Does this even matter when using security:recommended_v9_0?

ANSWER: this can probably just be ignored.  Greg didn't think fresh installs actually created signing keys so this may be an error in documentation.


idtokens

We are using HTCondor-9.0.4 and switched from using host_based security to idtoken security with the following procedure.

On just the Central Manager named testpost-cm (which is the collector and schedd)

openssl rand -base64 32 | condor_store_cred add -c -f /etc/condor/passwords.d/POOL
condor_token_create -identity condor@testpost-cm.aoc.nrao.edu > /etc/condor/tokens.d/condor@testpost-cm.aoc.nrao.edu
echo 'SEC_TOKEN_POOL_SIGNING_KEY_FILE = /etc/condor/passwords.d/POOL' >> /etc/condor/config.d/99-nrao

 then switch to use security:recommended_v9_0 in 00-htcondor-9.0.config

On the worker nodes (startd's)

scp testpost-cm:/etc/condor/passwords.d/POOL /etc/condor/passwords.d
scp testpost-cm:/etc/condor/tokens.d/condor\@testpost-cm.aoc.nrao.edu /etc/condor/tokens.d
echo 'SEC_TOKEN_POOL_SIGNING_KEY_FILE = /etc/condor/passwords.d/POOL' >> /etc/condor/config.d/99-nrao

  then switch to use security:recommended_v9_0 in 00-htcondor-9.0.config

But then things like condor_off don't work

testpost-cm-vml root >condor_off -name testpost002
ERROR
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using SSL
AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
AUTHENTICATE:1004:Failed to authenticate using FS
Can't send Kill-All-Daemons command to master testpost002.aoc.nrao.edu

ANSWER: The CONDOR_HOST on the startd was not fully qualified. Also, both the startd and the collector/schedd were using the cname (testpost-cm) instead of the hostname (testpost-cm-vml). I changed them both to the following and now I can use both condor_off and condor_drain without error.

CONDOR_HOST = testpost-cm-vml.aoc.nrao.edu


Docs wrong for evaluating ClassAds?

This web page https://htcondor.readthedocs.io/en/latest/man-pages/classads.html?highlight=evaluate#testing-classad-expressions suggests that the following will produce false but for me it produces error

condor_status -limit 1 -af 'regexp( "*tr*", "string" )'

ANSWER: The first asterisk shouldn't be there.  This is a regex not globbing.  Greg will look into updating this document.

Oct. 11, 2021 krowe: The documentation looks to have been corrected.


Memory usage report

The memory usage report at the end of the condor log seems incorrect.  I can watch the memory.max_usage_in_bytes in the cgroup get over 8,400MB yet the report in the condor log reads 6,464MB.  Does the log only report the memory usage of the parent process and not include all the children?  Is it an average memory usage over time?

ANSWER: It is a report of a sum of certain fields in memory.stat in the cgroup.  Get Greg an example.  Try it on two machines in case this is a problem of re-using the same cgroup.  Or reboot and try again.

Oct. 11, 2021 krowe: With HTCondor-9.0.6, it looks like my tests are now reporting consistent values between memory.max_usage_in_bytes in the cgroup and Memory in the condor log.  Except that memory.max_usage_in_bytes is in base-10 while the condor log is in base-2.


Tracking jobs through various log files

What is the preferred method of tracking jobs through various log files like condor.log, StarterLog.slot1_2, etc?

The condor.log uses a jobid but the StarterLogs use pid

ANSWER: condor.log, then StartLog on the execute host, then StarterLog.slot* on the execute host; search for "Job <jobid>"

ANSWER: condor_history <jobid> -af LastRemoteHost will give the slot id


Flocking with idtokens

Does the following seem correct?

I setup rastan-vml as a standalone Central Manager, Schedd, and Startd (I'm starting to talk like an HTCondor admin now).  This is what I had in the config on rastan-vml 

UID_DOMAIN = aoc.nrao.edu
JOB_DEFAULT_NOTIFICATION = ERROR
CONDOR_ADMIN = krowe@nrao.edu
CONDOR_HOST = rastan-vml.aoc.nrao.edu
PoolName = "rastan"
FLOCK_TO = testpost-cm-vml.aoc.nrao.edu

I then created a token for me in ~/.condor/tokens.d but this did not allow jobs to flock from rastan-vml to testpost-cm.

I then copied the token from testpost-cm:/etc/condor/tokens.d to rastan-vml:/etc/condor/tokens.d and that was enough to get the job flocking.

ANSWER: Yes.


gdrive example

I tried to use the gdrive plugin but couldn't find any documentation and failed to figure it out on my own.

ANSWER: ask coatsworth

I swear this wasn't in the docs last week.

https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html?highlight=Transferring%20files%20to%20and%20from%20Google%20Cloud%20Storage#file-transfer-using-a-url

But CHTC doesn't have a Google credential, so I can't use the gdrive plugin at CHTC.

Submitting job(s)

OAuth error: Failed to securely read client secret for service gdrive; Tell your admin that gdrive_CLIENT_SECRET_FILE is not configured


condor_watch_q

nmpost-master krowe >condor_watch_q

ERROR: Unhandled error: [Errno 2] No such file or directory: '/proc/sys/user/max_inotify_instances'. Re-run with -debug for a full stack trace.

ANSWER: it is in beta.  Send email to htcondor-admin about it.

https://htcondor.readthedocs.io/en/latest/overview/support-downloads-bug-reports.html

Lauren suggest I ask for a ticket account.

Nevermind.  It is slated to be fixed in 9.0.7.


Launch numbers

Are there knobs to control how many jobs get launched at the same time and/or delay between launches?  We are wondering because we hit our MaxStartups limit of 10:30:60 in sshd.

ANSWER: JOB_START_COUNT looks like the right thing.
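
A sketch for the schedd config (both knobs are standard; the values are guesses aimed at staying under sshd's MaxStartups 10:30:60):

```
# Start at most 8 jobs per burst, 2 seconds apart
JOB_START_COUNT = 8
JOB_START_DELAY = 2
```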


plugin_small

Can one of you please try the instructions in

/home/nu_kscott/plugin_small/small.htc

ANSWER: CHTC admins are required to use two-factor authentication via PAM.  This means they can't use a passwordless ssh key in a job.


Transferring back .ad files

I can add the following to my job and it will not cause an error but it also won't transfer the file

transfer_output_files = .job.ad, .machine.ad, .chirp.config

This isn't really important; I just thought it could be useful to diagnose jobs if I had a copy of the .job.ad and thought this would be a convenient way to get it.  I am surprised that it neither causes an error, which it would if the file didn't actually exist, nor copies it.  So I am guessing either the file is removed after it is checked for existence or HTCondor knows about its internal files and refuses to copy them.

ANSWER: Greg is not surprised by this.


FILESYSTEM_DOMAIN as requirement

I want to submit jobs that require a different filesystem but none of the following seem to work

requirements = (FILESYSTEM_DOMAIN == "aoc.nrao.edu")

FILESYSTEM_DOMAIN = "aoc.nrao.edu"

+FILESYSTEM_DOMAIN = "aoc.nrao.edu"

Looks like the answer is

requirements = (FileSystemDomain == "aoc.nrao.edu")

<sarcasm>because that's perfectly obvious</sarcasm>

But let's say we have two clusters (aoc.nrao.edu and cv.nrao.edu) with different filesystems.  I want jobs submitted in aoc.nrao.edu with a requirement of cv.nrao.edu to glidein to the cv.nrao.edu cluster.  How can a factory script at cv.nrao.edu look for such jobs?  I can't seem to use condor_q -constraint to look for such jobs.  The following doesn't work.

condor_q -pool nmpost-cm-vml.aoc.nrao.edu -global -allusers -constraint 'Requirements = ((FileSystemDomain == "cv.nrao.edu"))'

ANSWER: I think the answer is not to use FileSystemDomain but to create our own custom classad like we do with the VLASS partition.  Greg says it is possible to query for this requirement but the syntax is pretty gnarly.  I think making a partition is a better solution.
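
The custom-attribute approach would look roughly like this (the attribute name HASLUSTRE is our own invention):

```
# On each execute host that mounts the filesystem in question:
HASLUSTRE = True
STARTD_ATTRS = $(STARTD_ATTRS) HASLUSTRE

# In the submit description file:
# requirements = (HASLUSTRE == True)
```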


Removing tokens

Let's say I have a schedd that authenticates with an idtoken in /etc/condor/tokens.d.  If I remove that token, I am still able to submit jobs from that host until condor is restarted.  It has to be a restart, as condor_reconfig seems insufficient.  This indicates to me that HTCondor is caching the token.  Although it is strange that condor_token_list returns nothing immediately after removing the token, yet HTCondor can still submit jobs.  This is not really a problem but I was surprised by it and wanted to point it out in case it was unexpected.  There doesn't seem to be a timeout either.

ANSWER: Greg knows about this.  HTCondor establishes a relationship once authenticated and continues to use that relationship.  It may timeout after 24 hours, not sure.


Signing key

Given two separate clusters (testpost and nmpost), what should the signing keys and tokens look like?

Now that we use idtokens, I thought that to get a VM to be able to submit jobs I only needed to add our cluster's token to /etc/condor/tokens.d.  But apparently I also need to add our cluster's signing key to /etc/condor/passwords.d.  I since learned that this is probably because I created the signing key and token on our testpost cluster and then copied them to our nmpost cluster.

ANSWER: yes.  create signing keys for each cluster.


Jobs with a little swap

Say we had jobs that need 40GB of memory but occasionally, very briefly, spike to 60GB.  With Torque this is not a problem because it will just let the job swap.  It is not a big performance hit because the amount of time that memory is needed is very short compared to the runtime of the job.  How could we do this in HTCondor?  We really don't want to set a memory requirement of 60GB because we want to run multiple jobs on a node, and doing so would significantly reduce the number of jobs we could put on a node.

Does the new DISABLE_SWAP_FOR_JOB=false knob, introduced in 8.9.9, mean that HTCondor now swaps if needed by default?

ANSWER: try setting memory.swappiness for the condor cgroup.

ANSWER:  The VLASS nodes don't have a swap partition.  Make a swapfile on the vlass node (nmpost110) and see if that works.


Allocated in the log file

If I submit a job at CHTC with request_disk = 1 G the log output looks like

Partitionable Resources :    Usage  Request Allocated
   Cpus                 :                1         1
   Disk (KB)            :       49  1048576   1485064
   IoHeavy              :                0
   Memory (MB)          :        1     1024      1024

But if I submit a job at CHTC with a request_disk = 2 G the log output looks like

Partitionable Resources :    Usage  Request Allocated
   Cpus                 :                1         1
   Disk (KB)            :       49  2097152   7258993
   IoHeavy              :                0
   Memory (MB)          :        0     1024      1024

What does the "Allocated" disk space mean in these examples?

ANSWER: with partitionable slots HTCondor allocates more disk space than you ask because then that slot might be used by a follow up job.  This is because destroying and creating partitionable slots takes a full negotiation cycle which is measured in minutes.

MODIFY_REQUEST_EXPR_REQUESTDISK = RequestDisk can alter this behavior.  Check the docs; it goes on the startd (execute host).
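
A sketch of overriding it on the execute host (if I'm reading the docs right, the default rounds the request up; setting it to plain RequestDisk hands out exactly what was asked for):

```
# Execute host config sketch: allocate exactly what the job requests
MODIFY_REQUEST_EXPR_REQUESTDISK = RequestDisk
```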


Rebooting Execute Hosts

When an Execute Host unexpectedly reboots, what happens to the job?  What are the options?  Currently it looks like the job just "hangs": condor_q indicates that it is still running but it isn't.  It eventually times out after the magic 40 minutes.

ANSWER: Correct


condor_off -reason

You added a -reason to condor_drain, could the same be added to condor_off?

ANSWER: Greg likes this idea and will look into it.  Only recently did they implement offline ads that would allow this sort of thing.



Security email

On Mar. 3, 2022 James Robnett received the Security Release email.  Is there an email list for these? It looks like he was just BCC'd.  Could we change it from James's address to a non-human address?

ANSWER: Greg updated their security list with nrao-scg@nrao.edu



condor_rm -addr

No matter what IP I use, condor_rm -addr (e.g. condor_rm -addr 10.64.1.178:9618 361) always responds with something like this

condor_rm: "10.64.1.178:9618" is not a valid address

Should be of the form <ip.address.here:port>

For example: <123.456.789.123:6789>

Yet this works condor_rm -pool 146.88.10.46:9618 361

ANSWER: condor_rm -addr "<10.64.1.178:9618>" 361

It actually needs the angle brackets.  Weird.



condor_startd blocking on plugin

I modified our nraorsync plugin on the testpost cluster to sleep for 3600 seconds before calling upload_rsync() and then started my small, test job that uses the plugin. Here is what I see in the condor.log

000 (401.000.000) 2022-03-21 11:18:01 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2991_27db>
...
040 (401.000.000) 2022-03-21 11:18:01 Started transferring input files
Transferring to host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_3_5133_69aa_48>
...
040 (401.000.000) 2022-03-21 11:18:04 Finished transferring input files
...
001 (401.000.000) 2022-03-21 11:18:04 Job executing on host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>
...
006 (401.000.000) 2022-03-21 11:18:12 Image size of job updated: 58356
1 - MemoryUsage of job (MB)
312 - ResidentSetSize of job (KB)
...
040 (401.000.000) 2022-03-21 11:18:32 Started transferring output files
...
022 (401.000.000) 2022-03-21 12:17:09 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@testpost003.aoc.nrao.edu <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>

...
024 (401.000.000) 2022-03-21 12:17:09 Job reconnection failed
Job disconnected too long: JobLeaseDuration (2400 seconds) expired
Can not reconnect to slot1_3@testpost003.aoc.nrao.edu, rescheduling job
...
040 (401.000.000) 2022-03-21 12:19:04 Started transferring input files
Transferring to host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_1_5133_69aa_49>


Here is the StartLog on testpost003 for the time the job was disconnected

03/21/22 11:29:43 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
03/21/22 12:17:09 ERROR: Child pid 39127 appears hung! Killing it hard.
03/21/22 12:17:09 Starter pid 39127 died on signal 9 (signal 9 (Killed))
03/21/22 12:17:09 slot1_3: State change: starter exited
03/21/22 12:17:09 slot1_3: Changing activity: Busy -> Idle
03/21/22 12:17:09 slot1_3: State change: idle claim shutting down due to CLAIM_WORKLIFE
03/21/22 12:17:09 slot1_3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/21/22 12:17:09 slot1_3: State change: No preempting claim, returning to owner
03/21/22 12:17:09 slot1_3: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/21/22 12:17:09 slot1_3: State change: IS_OWNER is false
03/21/22 12:17:09 slot1_3: Changing state: Owner -> Unclaimed
03/21/22 12:17:09 slot1_3: Changing state: Unclaimed -> Delete
03/21/22 12:17:09 slot1_3: Resource no longer needed, deleting
03/21/22 12:17:09 Error: can't find resource with ClaimId (<10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>#1645569601#178#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
03/21/22 12:17:09 condor_write(): Socket closed when trying to write 45 bytes to <10.64.1.178:18477>, fd is 11
03/21/22 12:17:09 Buf::write(): condor_write() failed
03/21/22 12:19:04 slot1_1: New machine resource of type -1 allocated

Here is the ShadowLog on testpost-master (the submit host) for the time the job was disconnected

03/21/22 11:18:04 (401.0) (3260086): File transfer completed successfully.

03/21/22 12:17:09 (401.0) (3260086): Can no longer talk to condor_starter <10.64.1.173:9618>

03/21/22 12:17:09 (401.0) (3260086): Trying to reconnect to disconnected job

03/21/22 12:17:09 (401.0) (3260086): LastJobLeaseRenewal: 1647883111 Mon Mar 21 11:18:31 2022

03/21/22 12:17:09 (401.0) (3260086): JobLeaseDuration: 2400 seconds

03/21/22 12:17:09 (401.0) (3260086): JobLeaseDuration remaining: EXPIRED!

03/21/22 12:17:09 (401.0) (3260086): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (2400 seconds) expired

03/21/22 12:17:09 (401.0) (3260086): Exiting with JOB_SHOULD_REQUEUE

03/21/22 12:17:09 (401.0) (3260086): **** condor_shadow (condor_SHADOW) pid 3260086 EXITING WITH STATUS 107

03/21/22 12:19:04 ******************************************************


Does the condor_starter block waiting for the plugin to finish and therefore not respond to queries from the condor_startd?  Will setting JobLeaseDuration to something longer than 2400 seconds help with this?

ANSWER: Yes, the condor_starter blocks on the output transfer (but not on the input transfer).  Greg thinks adding JobLeaseDuration to the submit description file should fix the problem.

Greg agrees that not blocking on upload would be preferable.

ANSWER: turns out it was a different knob than JobLeaseDuration.  I set the following on our execution hosts to solve the problem.

NOT_RESPONDING_TIMEOUT = 86400



Removing a job from a container schedd

Let's say I submit a condor job from a schedd running inside a container (hamilton).  How can I remove that job from outside the container (nmpost-master)?

I can see the job using condor_q

nmpost-master krowe >condor_q -name hamilton 4.0


-- Schedd: hamilton.aoc.nrao.edu : <146.88.1.44:9618?... @ 03/04/22 11:33:38
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
condor tiny1 3/4 11:31 _ _ _ 1 1 4.0

but when I try to remove it I get an auth error

nmpost-master krowe >condor_rm -name hamilton 4.0

AUTHENTICATE:1003:Failed to authenticate with any method

AUTHENTICATE:1004:Failed to authenticate using SCITOKENS

AUTHENTICATE:1004:Failed to authenticate using GSI

GSI:5003:Failed to authenticate. Globus is reporting error (851968:101). There is probably a problem with your credentials. (Did you run grid-proxy-init?)

AUTHENTICATE:1004:Failed to authenticate using KERBEROS

AUTHENTICATE:1004:Failed to authenticate using IDTOKENS

AUTHENTICATE:1004:Failed to authenticate using FS

No result found for job 4.0

Could this be because the container doesn't have a Condor Signing Key but only has a Condor Token?

But I get the same problem when trying to kill a job on nmpost-master submitted from nmpost-cm and they both have the same passwords and tokens.  Do I need a token signing key and a token in ~/.condor/tokens.d?

ANSWER: yes, you need both the token signing key in /etc/condor/passwords.d and the token in ~/.condor/tokens.d



JobLeaseDuration

So setting JobLeaseDuration works if I don't choose a 'bad' value.  So far, values of 7200 and 14400 cause the job to be disconnected after about 60 minutes, while values of 4000, 4800, and 8000 let the job finish normally.  Why?

Why does it seem to take 60 minutes for the job to disconnect instead of 40 minutes (2400 seconds)?

Is there a way I can set JobLeaseDuration at a system level instead of in the submit description file?

Why is it that if I set JOB_DEFAULT_LEASE_DURATION = 4000 in the submit host config, the job.ad gets JobLeaseDuration = 4000 and yet the job still disconnects?

I compared the jobads (condor_q -l) of a job where *JobLeaseDuration = 4000* is set in the submit description file and *JOB_DEFAULT_LEASE_DURATION = 4000* set in the submithost config. The only differences I see in the jobads are times, jobids, logfiles and diskprovisioned. So I don't understand why altering the submit host config doesn't work.

ANSWER: daemons have a keep-alive message.  The startd expects keep-alives from the starter; if they are not received, the job gets killed.  This is outside JobLeaseDuration.  This dates from the old days of Condor when it was scavenging cycles and didn't want to get in the user's way.  Look into NOT_RESPONDING_TIMEOUT in the config file on the worker node.  The default is 3600 seconds.  Try setting it to something LARGE.
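A minimal sketch of that knob (the value here is only an example):

```
# condor config on the worker nodes; the default is 3600 seconds
NOT_RESPONDING_TIMEOUT = 86400
```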


Comments

This has probably already been mentioned but would it be possible to put comments after a condor command like so

batch_name = "test script" # dont show this

without the batch_name being set to test script # dont show this

ANSWER: Not likely to be changed as doing so may break other things.


Removing jobs with tokens

You can use tokens to remove jobs as other users but strangely not on the same host. For example: krowe and krowe2 have the same token (~/.condor/tokens.d/testpost). If I submit a job as krowe on testpost-master I cannot remove that job as krowe2 on testpost-master.

testpost-master$ condor_q -g -all -af clusterid owner jobstatus globaljobid
452 krowe 2 testpost-master.aoc.nrao.edu#452.0#1648820298

testpost-master$ condor_rm 452

Couldn't find/remove all jobs in cluster 452

testpost-master$ condor_rm -name testpost-master 452

Couldn't find/remove all jobs in cluster 452

However, if I submit a job as krowe on testpost-cm I *can* remove that job from testpost-master (condor_rm -name testpost-cm 123).  Is this a bug?  Is it because when you are on the same host, HTCondor is trying UID authentication instead of token authentication?  If so, is there a way to force token authentication?

ANSWER: Greg thinks this is because they choose the authentication type first and then stick with that type.

WORKAROUND: I *think*
_condor_SEC_DEFAULT_AUTHENTICATION_METHODS=IDTOKENS condor_rm will use idtokens but Greg thinks this may not work so be warned.


Condor Week



RADIAL CHTC support


Flocking and networking

Say we have a pool named cvpost at some remote site and we want to flock jobs to it from our pool named nmpost.  What kind of networking is necessary?  Do the execute hosts need a routable IP (NAT or real) for download and/or upload?  What about the submit host and central manager?

ANSWER: These paths need to be open


/tmp

executable = /bin/bash

arguments = "-c '/bin/date > /tmp/date'"

should_transfer_files = yes

transfer_output_files = /tmp/date

#transfer_output_files = tmp/date

queue

If I write to /tmp/date and set transfer_output_files = /tmp/date I get errors like

Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from /tmp/date: (errno 2) No such
file or directory; SHADOW failed to receive file(s) from <10.64.10.140:35386>

It works if I set transfer_output_files = tmp/date


/dev/shm

executable = /bin/bash

arguments = "-c '/bin/date > /dev/shm/date'"

should_transfer_files = yes

transfer_output_files = /dev/shm/date

#transfer_output_files = dev/shm/date

queue

If I write to /dev/shm/date I get errors setting transfer_output_files = /dev/shm/date

Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from /dev/shm/date: (errno 2) No
such file or directory; SHADOW failed to receive file(s) from
<10.64.10.140:41516>

If I write to /dev/shm/date I get errors setting transfer_output_files = dev/shm/date

Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from
/lustre/aoc/admin/tmp/condor/nmpost040/execute/dir_30401/dev/shm/date: (errno 2)
No such file or directory; SHADOW failed to receive file(s) from
<10.64.10.140:40380>

ANSWER: these are known issues and not surprising.  It's debatable whether they are bugs or not.  The issue is that the job is "done" by the time transfer_output_files is used, and since the job is done, the bind mounts for /tmp and /dev/shm (which is a little different) are gone.


pro-active glideins

Need to investigate gliding in based on lack of free slots rather than idle jobs.  Can one query HTCondor for a CARTA-shaped slot (core, mem, disk)?

ANSWER: Greg thinks this is a good idea and might be useful as a condor-week talk.
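One way to ask the collector for such a slot might be a constraint query like this (the CARTA shape numbers are made up; Disk is in KiB):

```shell
# list partitionable slots that could still fit a 4-core, 16GB, 10GB job
condor_status -constraint 'PartitionableSlot && Cpus >= 4 && Memory >= 16384 && Disk >= 10485760' \
              -af Name Cpus Memory Disk
```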

condor_off vs condor_drain

a -peaceful option to condor_drain might be perfect.  Low priority for NRAO.

ANSWER: Yes condor_drain is being worked on and this is one of the things.

Transfer Plugins

Don't have condor block on the transfer plugin uploading.  Low priority for NRAO.

ANSWER: This requires some serious work.  Greg will ask Todd about it.



More plugin woes

So let's say you have a plugin to transfer output files and this plugin fails because a destination directory, like nosuchdir, doesn't exist.  All the plugin can do is indicate success or failure, so it indicates failure.  But that seems to cause HTCondor to disconnect/reconnect four times, then fail, then set the job to idle so it can try again later, which then disconnects/reconnects four times and ...  Is there anything else the plugin can do to tell HTCondor to hold the job instead of restart?


executable = /bin/sleep
arguments = "27"
output = nosuchdir/condor_out.log
error = nosuchdir/condor_err.log
log = condor.log
should_transfer_files = YES
transfer_output_files = _condor_stdout
# output_destination = nraorsync://$ENV(PWD)
+WantIOProxy = True
queue


If you set either output or error to a directory that doesn't exist like output = nosuchdir/condor_out.log, then when the job ends, HTCondor will put the job on hold with a message like the following in the condor.log 

040 (5062.000.000) 2022-09-30 08:28:58 Finished transferring output files
...
007 (5062.000.000) 2022-09-30 08:28:58 Shadow exception!
        Error from slot1_2@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send file(s) to <10.64.10.100:9618>; SHADOW at 10.64.10.100 failed to write to file /users/krowe/htcondor/nraorsync/dir/stdout.5062.log: (errno 2) No such file or directory
        13  -  Run Bytes Sent By Job
        354  -  Run Bytes Received By Job
...
012 (5062.000.000) 2022-09-30 08:28:58 Job was held.
        Error from slot1_2@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send file(s) to <10.64.10.100:9618>; SHADOW at 10.64.10.100 failed to write to file /users/krowe/htcondor/nraorsync/dir/stdout.5062.log: (errno 2) No such file or directory
        Code 12 Subcode 2


But if you have set output_destination to use the nraorsync plugin like so output_destination = nraorsync://$ENV(PWD) then you get four disconnect/reconnect events followed by a shadow exception (see below). Then HTCondor sets the job to idle so it can try again instead of putting it on hold. I assume this is because it doesn't know why the job failed because there isn't really a mechanism for the plugin to tell it why.

040 (5061.000.000) 2022-09-30 08:23:20 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:20 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:20 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
    startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
    starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:20 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:20 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:20 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:20 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
    startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
    starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:20 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:21 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:21 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:21 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
    startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
    starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:21 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:21 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:21 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:21 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
    startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
    starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:21 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:22 Finished transferring output files
...
007 (5061.000.000) 2022-09-30 08:23:22 Shadow exception!
        Error from slot1_3@nmpost040.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        354  -  Run Bytes Received By Job

ANSWER: This is a bug.  CHTC would like to implement better error handling here.

Workaround could be to set the following in the config file on the submit host.  But this may be problematic on SSA's container submit host.  It should cause condor to fail the job if nosuchdir doesn't exist.  I think it's just best to note this as a bug and wait for CHTC to implement better error handling that allows the plugin to tell HTCondor how to fail.  Or something like that.


DAG hosts

Is there a way to guarantee all the nodes of a DAG run on the same hostname without specifying the specific hostname?

An example would be the first node copies some data to local host storage, then all the other nodes read that data.

ANSWER: have the DAG post script figure out what hostname the node ran on and then modify or create the submit file for the next node.
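Sketched as a DAG (the script names are hypothetical: save_host.sh would record the hostname from the userlog or condor_q, and pin_host.sh would append a matching requirements line to the next node's submit file):

```
JOB node01 node01.htc
SCRIPT POST node01 save_host.sh
JOB node02 node02.htc
SCRIPT PRE node02 pin_host.sh
PARENT node01 CHILD node02
```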

glidein memory requirements

Twice now I have doubled the memory for the pilot job. On Jul. 14, 2022 from 1GB to 2GB and just now (Aug. 25, 2022) from 2GB to 4GB. This is because a condor daemon like condor_starter exceeded the memory and was OOM killed.  This second time was during nraorsync uploading files. Is there a suggested amount of memory for Slurm for glidein jobs?

ANSWER: The startd assumes it has control of all physical memory and doesn't check whether it is in a cgroup or not.  If I run into this again, try and track down what is actually happening.  Greg would like to know.  He is surprised because the HTCondor daemons should only need MBs not GBs.


no match found

When a job stays idle for a long time and its LastRejMatchReason = "no match found " what are some good places to look to see why it isn't finding a match?  For example, if you make a typo and set the following (note the misspelling of nraorsync)

transfer_input_files = $ENV(HOME)/.ssh/condor_transfer, nraorysnc://$ENV(PWD)/testdir

ANSWER: condor_q -better

It doesn't know about quotas or fairshare or per-machine resource limits like +IOHeavy or other such ads.

Accounts

Can Felipe get an account?  Also, you might want to ask James if he still needs his account now that he no longer works for the NRAO.

ANSWER: done.


Start glidein node if there isn't one free

Previously, our factory.sh script would start a glidein job in Slurm if there was a job idle.  But now that we want jobs to start as quickly as possible, our factory.sh script starts a glidein job if there aren't enough free resources available.  To do this we had to define what "free resources" meant, so we went with MIN_SLOTS=8 and MIN_MEMORY=16384 since we use dynamic slots.  We also had to set a "default" machine ad on all the nodes that we wanted available to this glidein job.  This is so that the factory.sh script doesn't check nodes that are in the VLASS or CVPOST or other such groups.  I could have explicitly excluded those groups but that wouldn't scale well if we ever created more groups.

ANSWER: Greg thinks this is a perfectly cromulent way to do this.
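The gist of the check, sketched in shell (the Default machine ad and the thresholds are our local conventions, not standard HTCondor names):

```shell
# sum free CPUs and memory on partitionable slots that carry our "default" ad
free_cpus=$(condor_status -constraint 'PartitionableSlot && Default' -af Cpus | paste -sd+ - | bc)
free_mem=$(condor_status -constraint 'PartitionableSlot && Default' -af Memory | paste -sd+ - | bc)

# MIN_SLOTS=8, MIN_MEMORY=16384: start a pilot in Slurm if either is short
if [ "${free_cpus:-0}" -lt 8 ] || [ "${free_mem:-0}" -lt 16384 ]; then
    sbatch glidein.sh
fi
```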


OS

What RHEL8-like OS is CHTC going to use or is using?  CentOS8/stream, Alma, Rocky, etc?  Looks like CentOS8 Stream.  Any thoughts?

ANSWER: yes they are using CentOS8/stream.  So far so good.



PATh getting data to execute hosts

What are the preferred methods?  http? nraorsync? other?

ANSWER: http, s3 or other plugins.  Or OSDF.  

https://osg-htc.org/services/osdf.html

OSDF (Open Science Data Federation)  There are "data origins" which are basically webservers with cache.  Long term we might be able to have our own data origin that authenticates to NRAO and shares data from our Lustre filesystem to their Ceph system in some way.  This is like an object store so you can't really update files but you can make new ones.

/mnt/stash/ospool/PROTECTED/ Copy data from NRAO to this path and then you can access it in the job via

transfer_input_files = stash:///ospool/PROTECTED/user/file

transfer_output_files = stash:///ospool/PROTECTED/user/file

https://portal.osg-htc.org/documentation/htc_workloads/managing_data/stashcache/

I think this is cooler than using the nraorsync plugin.


PATh GPUs

We only see four GPUs on PATh right now.   What is the timeline to get more?   Does PATh flock to other sites with GPUs?

ANSWER: hosts may be dynamic and only come on-line as needed with some k8s magic.  Christina is checking.  Greg is pretty sure there should be way more than just 4 GPUs in PATh.  PATh is made up of six different sites https://path-cc.io/facility/index.html each of these sites provides hardware.


Disk Space

Since neither HTCondor nor cgroups control scratch space, how can we keep jobs from using up all the scratch space on a node and causing other jobs to fail as well?

ANSWER: specify a periodic hold in the startd.  Every 6 seconds the startd can check and put the job on hold.  Greg can look up the syntax.  Someday, condor starter will make an ephemeral filesystem (loopback) on the scratch area with the requested disk space.  This is coming soon.

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs
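A hedged sketch of the periodic-hold approach via the schedd (the wiki page above has the authoritative recipe; DiskUsage and RequestDisk are both in KiB):

```
# config on the submit host: hold running jobs that exceed their requested disk
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (DiskUsage > RequestDisk)
SYSTEM_PERIODIC_HOLD_REASON = "Job exceeded its requested scratch disk"
```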

Squid on PATh?

Does PATh use a squid server so that we can wget something once and have it cached for a while?

ANSWER: Greg thinks PATh does this as well.


RADIAL workload balance

If there are only two users submitting jobs to a cluster, will HTCondor try to balance the workload between the two users?  For example will it prioritize user2 jobs if user1 jobs are using the majority of resources?  I think I read this about HTCondor's fair-share algorithm but I am not sure.

ANSWER: yes condor does this.  There are knobs to adjust user priorities (user1 is twice the priority of user2, etc).  You can also specify the length of the half life.  There are many other ways to do something like this.
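For reference, the knobs Greg mentioned, sketched (values are examples; a larger priority factor means a worse effective priority):

```shell
# central manager config: half life of accumulated usage (seconds)
#   PRIORITY_HALFLIFE = 86400

# give user1 twice the weight of user2 by factor
condor_userprio -setfactor user1@aoc.nrao.edu 1.0
condor_userprio -setfactor user2@aoc.nrao.edu 2.0
```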


Cluster domain names

This is not an HTCondor question but perhaps Greg has some insight.  Let's say I am setting up a turn-key cluster.  We deliver a rack of compute nodes with one head node.  That one head node will need Internet access for SSH, DNS, etc.  But the compute nodes don't need any Internet access.  You can only get to them by first sshing to the head node, and the only name server they use is the head node.  So, my question is: what TLD should all these compute nodes be in?  I would like to use a local, non-routable TLD analogous to non-routable IP ranges like 10.0.0.0/8 or 192.168.0.0/16.  But there doesn't seem to be such a thing defined by ICANN.

Have you heard of any clusters using any "private" TLDs?  Can we just use IPs and not use names?

ANSWER: Greg looked at their nodes and saw both .local and .internal in use


2FA

I don't seem to be prompted for two-factor authentication when I login to CHTC.  Should I?

ANSWER: Greg will ask.  I'm not going to worry about it unless I can no longer login.


Using containers at PATh

Anything special to do?  Singularity/Apptainer?

ANSWER: Greg doesn't think so but Singularity would be the first to test.


Docker at PATh

Doesn't seem to work.  My docker universe job just stayed idle for three days.

ANSWER: since PATh uses docker containers in kubernetes, they probably can't support docker containers in a docker container.  So Singularity is the answer.


PATh libraries

It seems that not all the PATh execution hosts have the same set of libraries.  For example my program, https://github.com/axboe/fio.git requires libmvec.so.1.  It fails on all Expanse, SYRA GPU, UNL GPU nodes.  It works on all WISC, FIU, SYRA non-GPU nodes.


ANSWER: use singularity


Flocking to PATh

Is flocking to PATh expected to be a thing either now or in the future?

Apparently there may be a way to setup PATh as an annex https://htcondor.org/experimental/ospool/byoc/path-facility if we knew our PROJECT_ID

ANSWER: annex https://htcondor.org/experimental/ospool/byoc/path-facility


gpu_burn

When I run https://github.com/wilicc/gpu-burn on a PATh node, the nvidia-smi output only shows a couple of extra watts, about 400MiB of memory used and no GPU-Utilization.  Normally, this program pegs GPUs at 100% utilization.

ANSWER: Greg agrees with krowe this might be because the GPUs are A100s


Transfer Mechanism granularity?

Say you are transferring two files A and B.  Is there a way to tell how long HTCondor took to transfer A and how long it took to transfer B?  Are they even transferred serially?  This may help Felipe tell how long it takes to transfer an MS vs the CFCache.

ANSWER: the files are transferred serially.  And there is no way to tell file A vs file B.


Excluding nodes

How can we tell HTCondor to not run on nodes like FIU* or UNL*?

requirements = GLIDEIN_ResourceName != "FIU-PATH-EP"

ANSWER: this is probably the best solution.  CHTC does have a taxonomy for naming things but it may not always get used.


Job Shape Plots

Felipe has some plots showing what each subjob is doing in a large job (queue 32).  Does HTCondor have or suggest any tools for plotting this sort of thing?

ANSWER: not really.  Greg really likes Felipe's "Gantt charts".  Matplotlib is in common use. Admins use Grafana.  There are python modules for reading user logs.


Nvidia MIG support?

Dividing a GPU into multiple (Max 8) GPU slices.  This started with the Nvidia Ampere architecture.  Does HTCondor support this?

ANSWER: Condor supports MIGs but not dynamically.  The admin has to split the GPUs statically and then condor can use them.


Upgrade to 10.x

What are the current recommendations for upgrading to 10.x?  Order of upgrades?  Mixed mode?

ANSWER: mixed mode should be fine.



condor_ssh_to_job

If I am vlapipe@nmpost-master and want to connect to a job submitted from vlapipe@hamilton (actually a container running on hamilton) it doesn't work.  Should it?  I am guessing it doesn't because in this case hamilton is a container and not the actual host hamilton.

 vlapipe@nmpost-master$ condor_ssh_to_job -name hamilton 3691
Failed to send GET_JOB_CONNECT_INFO to schedd

ANSWER: This is less of an issue as Amy can use the wf_inspector tool from SSA to connect to jobs.



Preference


We want some jobs to run on a set of nodes (e.g. NMT VLASS), but if those aren't available, then run on the default set of nodes (e.g. DSOC VLASS).

I should be able to use the rank expression to do this, right?  E.g.

Rank = (machine == "nmpost039.aoc.nrao.edu")

But when I run a job with this, it runs on nmpost038 instead of nmpost039.  There is nothing wrong with nmpost039 and both nodes have the same config file.  I can require the job to run on nmpost039 with the following

requirements = (machine == "nmpost039.aoc.nrao.edu")

HTCondor should select the most restrictive first, right?  Is that NMT VLASS or DSOC VLASS?

Rank seems to work at PATh but not NRAO nor CHTC.

ANSWER: the negotiator finds all slots that match.  It sorts them by NEGOTIATOR_PRE_JOB_RANK, then by JOB_RANK, then by NEGOTIATOR_POST_JOB_RANK.  It works at PATh because the available nodes at PATh are all empty so NEGOTIATOR_PRE_JOB_RANK is the same value for all of them so then it goes to the JOB_RANK.  We could set NEGOTIATOR_POST_JOB_RANK == JOB_RANK in the config file on the Central Manager, but that would not pack jobs the way we like.

By default NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory

Since we disable preemption, I think My.Rank, which is the rank in the machine classad, is always 0.0.  What if we replaced it with Target.Rank which would be the rank in the job classad? By default rank in the job classad is also 0.0 unless the user sets rank in the submit description file.

So change it to NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory in the config file on the Central Manager.


condor_q output totals

The Total lines at the end of condor_q are the same format at NRAO and PATh. Why and how?

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 

Total for krowe: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 

Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

But at CHTC it is only one line.

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

ANSWER: Don't know.  But this is pretty minor and not really a problem just a curiosity.


ypbind.service in condor.service

Pretty minor but I noticed that the systemd unit file for condor indicates it needs to start after ypbind.  I can see this on CHTC.  Even NRAO is getting rid of ypbind.  Y'all can probably remove this.

submit2 nu_kscott >systemctl cat condor | grep After
After=network-online.target nslcd.service ypbind.service time-sync.target nfs-client.target autofs.service

K. Scott will look into a proper solution.  Perhaps systemd has a directory service definition to use.

ANSWER: ypbind has largely been replaced by sssd which can use various directory services (LDAP, Active Directory, IdM/IPA, etc)


Time Zones

We have multiple submit hosts and some of them are in different time zones.  Is there a condor command that could tell me the timezone of a particular submit host?  Something like condor_status -l -schedd hamilton | grep -i zone  Logging into some of these submit hosts is hard because they are containers.  I need to know the timezone so that I can cross reference the logs correctly.

ANSWER: No.  You can advertise whatever value you want in the classad.  We could create a timezone variable.  Greg also thinks there may be a _hacky_ way to do this.
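A sketch of the advertise-it-yourself idea (the attribute name TIME_ZONE is our invention, not a standard ad):

```
# config on each submit host
TIME_ZONE = "America/New_York"
SCHEDD_ATTRS = $(SCHEDD_ATTRS) TIME_ZONE
```

Then something like condor_status -schedd hamilton -af TIME_ZONE should print it.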

Time zone logging?

Is there a way to know what time zone a set of logs are in?  Or can you change the time zone that your condor job uses so that they are always in UTC?

ANSWER: Not at the user level.  You can set DEFAULT_USERLOG_FORMAT_OPTIONS = UTC in the condor config file (not submit file).


PATh credits

Is there a way to check your credit status at PATh?

ANSWER: Greg doesn't know.  Ask Christina.


ARM Mac support (dlyons)

A problem we’re about to have is, how do we submit jobs from an ARM Mac container?

My sense is that the only way to do that is to run the container image as if it were a submit host. And that there are going to be two problems with this:

1. There is no ARM build of any of the Condor docker images
2. There may be architecture mismatch issues

I am willing to try to build Condor images for them but I have found their documentation for building containers somewhat obtuse. If they have a cheat sheet I can steal from that would be useful.

Anyway, I don’t really know if there is another approach, but if there is, that might be better.

If you want me at the meeting to ask about this, I can come.

ANSWER: 

https://www-auth.cs.wisc.edu/lists/htcondor-users/2023-March/msg00130.shtml

CHTC uses QEMU in docker to build their non-x86_64 binaries.  It's slow but surprisingly reliable.


Flocking to radial

I can flock from testpost cluster to nmpost cluster by just setting FLOCK_TO = nmpost-cm-vml.aoc.nrao.edu and copying the idtoken from nmpost.  But when I try flocking from testpost cluster to radial cluster instead of nmpost using the same procedure, it doesn't work.  I see the jobs in condor_q on radialhead but the jobs just stay idle.  I don't see any errors in any logs or any indication why it isn't working.

Testpost and nmpost are both HTCondor-9 with the same user space, FILESYSTEM_DOMAIN, and UID_DOMAIN.  Also, the radial cluster is on an isolated network (192.168.0.0/24) and the central manager is multi-homed.

What is the data path when a job flocks?  Does the flocked_to startd have to contact the original schedd or the flocked_to schedd?

ANSWER: The schedd advertises the demand to the remote pool.  Greg will need to think about this.

2023-03-29 krowe: I upgraded testpost to HTCondor-10 and tried flocking to radial again.  Now I see the following in the SchedLog on testpost-master.  So it looks like the original schedd needs to contact the eventual startd.  That's going to be a problem because the startd is on an isolated network.

03/29/23 08:30:28 (pid:119154) attempt to connect to <192.168.0.32:9618> failed: timed out after 45 seconds.
03/29/23 08:30:28 (pid:119154) Failed to send REQUEST_CLAIM to startd slot1@radial001.nrao.radial.local <192.168.0.32:9618?addrs=192.168.0.32-9618&alias=radial001.nrao.radial.local&noUDP&sock=startd_1460_045c> for krowe: SECMAN:2003:TCP connection to startd slot1@radial001.nrao.radial.local <192.168.0.32:9618?addrs=192.168.0.32-9618&alias=radial001.nrao.radial.local&noUDP&sock=startd_1460_045c> for krowe failed.
03/29/23 08:30:28 (pid:119154) Match record (slot1@radial001.nrao.radial.local <192.168.0.32:9618?addrs=192.168.0.32-9618&alias=radial001.nrao.radial.local&noUDP&sock=startd_1460_045c> for krowe, 773.0) deleted

To test this I made radialhead (which is dual-homed) a startd and I can flock from testpost-master to radialhead.  So how do I run a job remotely on a startd on an isolated network?

ANSWER: HTCondor really wants the execute host to at least have outbound networking.

ANSWER: HTCondor-c is source level routing from one schedd to another schedd.  schedd-a is in NM and scedd-b is radial.  With condor-c say submit this job but don't run it local run it on schedd-b.  If you have multiple sites you have to select which site to use when you set the job.
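A hedged HTCondor-C submit sketch for that model (hostnames are placeholders; grid_resource names the remote schedd and its collector):

```
universe      = grid
grid_resource = condor radialhead.nrao.edu radial-cm.nrao.edu
executable    = myjob.sh
# ask the remote schedd to run it as a vanilla-universe job
+remote_jobuniverse = 5
queue
```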


Object Store

What do people use to put files in stash?  Tar? HDF5? zip? other?

ANSWER: HTCondor sets OMP_NUM_THREADS=1 which may affect the speed of uncompressing.  Should test.

ANSWER: gzip can handle directories?  But Greg thinks it may not matter.  Tar is probably fine

QUESTION: what is system vs user time?  I/O time?

KROWE: I unset OMP_NUM_THREADS and did manual tests outside of HTCondor using the NVMe on testpost001.

Uncompressed tar file

/usr/bin/time -f "real %e\nuser %U\nkernel %S\nwaits %w" tar xf hudf_n1.ms.tar 
real 6.53
user 0.52
kernel 5.98
waits 85

Compressed tar file with gzip

testpost001 krowe >/usr/bin/time -f "real %e\nuser %U\nkernel %S\nwaits %w" tar xf gzip-6.tgz 
real 36.97
user 35.58
kernel 6.65
waits 172421

Compressed tar file with bzip2

testpost001 krowe >/usr/bin/time -f "real %e\nuser %U\nkernel %S\nwaits %w" tar xf bzip2.tgz 
real 311.63
user 310.06
kernel 16.43
waits 1128218


ANSWER: Greg is as surprised as we are.


Federation

Does this look correct?

https://staff.nrao.edu/wiki/bin/view/NM/HTCondor-federations

ANSWER: yes


Reservations

Reservations from the Double Tree were for Sunday Jul.9 through Thursday Jul. 13 (4 nights).  But I need at least until Friday Jul. 14 right?

ANSWER: Greg will look into it.


Scatter/Gather problem

At some point we will have a scatter/gather problem. For example we will launch 10^5 jobs, each of which will produce some output that will need to be summed with the output of all the other jobs.  Launching 10^5 jobs is not hard.  Doing the summation is not hard.  Moving all the output around is the hard part.

One idea is to have a dedicated process running to which each job uploads its output.  This process could sum output as it arrives; it doesn't need to wait until all the output is done.  It would be nice if this process also ran in the same HTCondor environment (PATh, CHTC, etc) because that would keep all the data "close" and presumably keep transfer times short.

ANSWER: DAGs and sub-DAGs of course.  Provisioner node was created to submit jobs into the cloud.  It exists as long as the DAG is working.

Nodes talking to each other becomes difficult in federated clusters.

https://ccl.cse.nd.edu/software/taskvine/

makeflow

Astra suggests something like Apache Beam for ngVLA data which is more of a data approach than a compute approach.


What is annex?

Yet another way to federate condor clusters.  Annexes are useful when you have an allocation on a system (e.g. AWS) and the ability to start jobs on it.  You give annex your credentials and how many workers you want. Annex will launch the startds and create a local central manager.  It then configures flocking from your local pool to the remote pool.  So in a sense annex is an ephemeral flocking relationship for just the one person setting up the annex.


Condor and Kubernetes (k8s)

Condor supports Docker, Singularity, and Apptainer.  In OSG the Central Managers are in K8s and most of the Schedds are also in k8s.  In the OSPool some worker nodes are in k8s and they are allowed to run unprivileged Apptainer but not Docker (because of privileges).  PATh has worker nodes in k8s.  They are backfilled on demand.


NSF Archive Storage

Are you aware of any archive storage funded by NSF we could use?  We are looking for off-site backup of our archive (NGAS).

ANSWER: Greg doesn't know of one.


Hung Jobs and viewing stdout

We have some jobs that seem to hang possibly because of a race condition or whatnot.  I'm pretty sure it is our fault.  But, the only way I know to tell is to login to the node and look at _condor_stdout in the scratch area.  That gets pretty tedious when I want to check hundreds of jobs to see which ones are hung.  Does condor have a way to check the _condor_stdout of a job from the submit host so I can do this programmatically?

I thought condor_tail would be the solution but it doesn't display anything.

ANSWER: condor_ssh_to_job might be able to be used non-interactively. I will try that.

ANSWER: use the FULL jobid with condor_tail, e.g. condor_tail 12345.0.  Greg has submitted a patch so you don't have to specify the ProcId (.0).


Bug: condor_off -peaceful

testpost-cm-vml root >condor_off -peaceful -name testpost002
Sent "Set-Peaceful-Shutdown" command to startd testpost002.aoc.nrao.edu
Can't find address for schedd testpost002.aoc.nrao.edu
Can't find address for testpost002.aoc.nrao.edu
Perhaps you need to query another pool.

Yet it works without the -peaceful option

testpost-cm-vml root >condor_off -name testpost002
Sent "Kill-All-Daemons" command to master testpost002.aoc.nrao.edu

ANSWER: Add the -startd option, e.g. condor_off -peaceful -startd -name <hostname>.  Greg thinks it might be a regression (another bug).  This still happens even after I set all the CONDOR_HOST knobs to testpost-cm-vml.aoc.nrao.edu, so it is still a bug and not because of some silly config I had at NRAO.


File Transfer Plugins and HTCondor-C

Is there a way I can use our nraorsync plugin on radial001?  Or something similar?

SOLUTION: ssh tunnels


Condor Week (aka Throughput Week)

July 10-14, 2023.  Being co-run with the OSG all hands meeting.  At the moment, it is not hybrid but entirely in-person.  https://path-cc.io/htc23


PROVISIONER node

When I define a PROVISIONER node, that is the only node that runs.  The others never run.  Also, the PROVISIONER job always returns 1 "exited normally with status 1" even though it is just running /bin/sleep.

JOB node01 node01.htc
JOB node02 node02.htc
JOB node03 node03.htc

PARENT node01 CHILD node02 node03
PROVISIONER prov01 provisioner.htc

ANSWER: my prov01 job needs to indicate when it is ready with something like the following.  But execute hosts can't run condor_qedit, so this only works if the provisioner job runs in the local or scheduler universe.

condor_qedit myJobId ProvisionerState 2
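For reference, a sketch of what provisioner.htc might look like under that constraint (the executable name and contents are hypothetical):

```
# provisioner.htc -- run on the AP so the job can call condor_qedit
universe   = scheduler
executable = provision.sh
log        = provisioner.log
queue
```

provision.sh would do its provisioning work and then run condor_qedit on its own job id to set ProvisionerState = 2, which tells DAGMan the other nodes may start.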



Does CHTC have resources available for VLASS?

Our Single Epoch jobs

Brian was not scared by this and gave us a form to fill out

https://chtc.cs.wisc.edu/uw-research-computing/form.html

ANSWER: Yes.  We and Mark Lacy have started the process with CHTC for VLASS.


Annex to PATh

https://htcondor.org/experimental/ospool/byoc/path-facility

ANSWER: Greg doesn't know but he can connect me with someone who does.

Tod Miller is the person to ask about this.


Hold jobs that exceed disk request

ANSWER: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs


condor_userprio

We want a user (vlapipe) to always have higher priority than other users.  I see we can set this with condor_userprio but is that change permanent?

ANSWER: There is no config file for this.  Set the priority_factor of vlapipe to 1.  That is saved on disk and should persist through reboots and upgrades.
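Presumably that is a one-time command on the central manager, something like the following (the user@domain form depends on our UID_DOMAIN; a lower factor means better priority, and the default factor is 1000):

```
condor_userprio -setfactor vlapipe@aoc.nrao.edu 1.0
```

condor_userprio with no options can then be used to confirm the factor stuck.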



Submitting jobs as other users

At some point in the future we will probably want the ability for a web process to launch condor jobs as different users.  The web process will probably not be running as root.  Does condor have a method for this or should we make our own setuid root thingy?  Tell dlyons the answer.

ANSWER: HTCondor doesn't have anything for this.  So it is up to us to do some suid-fu.
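One possible shape for the suid-fu, assuming we go through sudo rather than a custom setuid binary (the account names here are made up):

```
# /etc/sudoers.d/condor-web  (hypothetical)
# Let the web service account submit jobs as specific pipeline users,
# without a password and without full root.
webapp ALL=(vlapipe,almapipe) NOPASSWD: /usr/bin/condor_submit
```

The web process would then run something like `sudo -u vlapipe condor_submit job.sub`.  A setuid wrapper would work too, but sudo keeps the privilege boundary narrow and auditable.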



SSH keys with Duo

I tried following the link below to setup ssh such that I don't have to enter both my password and Duo every time I login to CHTC.  It doesn't create anything in ~/.ssh/connections after I login.  Thoughts?

https://chtc.cs.wisc.edu/uw-research-computing/configure-ssh

ANSWER: Greg doesn't know what to do here.  We should ask Christina.



HTCondor-C and requirements

submitting jobs from the container on shipman as vlapipe to the NRAO RADIAL prototype cluster seems to ignore requirements like the following.  Is this expected?

requirements = (machine == "radial001.nrao.radial.local")

and

requirements = (VLASS == True)
+partition = "VLASS"

It also seems to ignore things like

request_memory = 100G

ANSWER: 

https://htcondor.readthedocs.io/en/latest/grid-computing/grid-universe.html?highlight=remote_requirements#htcondor-c-job-submission

But I am still having problems.


This forces job to run on radial001

+remote_Requirements = ((machine == "radial001.nrao.radial.local"))

This runs on radialhead even though it only has 64G

+remote_RequestMemory = 102400

This runs on radialhead even though it doesn't have a GPU

request_gpus = 1

+remote_RequestGPUs = 1

ANSWER: This works 

+remote_Requirements = ((machine == "radial001.nrao.radial.local") && memory > 102400)

as does this

+remote_Requirements = (memory > 102400)

but Greg will look into why +remote_RequestMemory doesn't work.  It should.



Select files to transfer dynamically according to job-slot match

We currently have separate builds of our GPU software for CUDA Capability 7.0 and 8.0, and our jobs specify that both builds should be transferred to the EP, so that the job executable selects the appropriate build to run based on the CUDA Capability of the assigned GPU. Is there a way to do this selection when the job is matched to a slot, so that only the necessary build is transferred according to the slot's CUDA Capability?

ANSWER: $() means expand this locally from the jobad.  $$() means expand at job start time.

executable = my_binary.$$(GPUs_Capability)

executable = my_binary.$$([int(GPUs_Capability)]) # Felipe said this actually works

executable = my_binary.$$([ classad_expression(GPUs_Capability) ]) # Hopefully you don't need this



CPU/GPU Balancing

We have 30 nodes in a rack at NMT with a power limit of 17 kW and we are able to hit that limit when all 720 cores (24 cores * 30 nodes) are busy.  We want to add two GPUs to each node but that would almost certainly put us way over the power limit if each node had 22 cores and 2 GPUs busy.  So is there a way to tell HTCondor to reserve X cores for each GPU?  That way we could balance the power load.

JOB TRANSFORMS work per schedd so that wouldn't work on the startd side which is what we want.

IDEA: NUM_CPUS = 4 or some other small number greater than the number of GPUs but limiting enough to keep the power draw low.

ANSWER: There isn't a knob for this in HTCondor but Greg is interested in this and will look into this. 

WORKAROUND: MODIFY_REQUEST_EXPR_REQUESTCPUS may help by rounding each GPU job's CPU request up to 8 cores or the like.

MODIFY_REQUEST_EXPR_REQUESTCPUS = quantize(RequestCpus, isUndefined(RequestGpus) ? {1} : {8, 16, 24, 32, 40})

That is, when a job comes into the startd, if it doesn't request any GPUs, allocate exactly as many cpu cores as it requests. Otherwise, round its cpu request up to the next value in the list (8, 16, 24, ...).

This seems to work. If I ask for 0 GPUs and 4 CPUs, I am given 0 GPUs and 4 CPUs.  If I ask for 1 GPU and don't ask for CPUs, I am given 1 GPU and 8 CPUs.

But if I ask for 2 GPUs and don't ask for CPUs, I still am only given 8 CPUs.  I was expecting to be given 16 CPUs.  This is probably fine as we are not planning on more than 1 GPU per job.

But if I ask for 1 GPU and 4 CPUs, I am given 1 GPU and 8 CPUs.  That is probably acceptable.
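The 2-GPU result makes sense if quantize() behaves the way I read the ClassAd docs: it returns the first list item that is at least the request (and multiples of the last item beyond the list), and it operates on RequestCpus, not RequestGpus.  A rough Python model of that rule (my reading of the semantics, not HTCondor's code):

```python
def quantize(request, values):
    """Model of the ClassAd quantize() function: return the first
    item in values that is >= request; past the end of the list,
    round up to the next multiple of the last item."""
    for v in values:
        if request <= v:
            return v
    last = values[-1]
    # round request up to the next multiple of the last item
    return ((request + last - 1) // last) * last

# A GPU job that doesn't ask for CPUs has RequestCpus = 1,
# so it gets 8 cores whether it asks for 1 GPU or 2.
print(quantize(1, [8, 16, 24, 32, 40]))  # 8
print(quantize(4, [1]))                  # 4: non-GPU jobs keep their request
```

So the number of GPUs never enters the expression; only a job that explicitly requested more than 8 CPUs would be bumped to the next step.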

2024-01-24 krowe: Assuming a node can draw up to 550 Watts when all 24 cores are busy and that node only draws 150 Watts when idle, and that we have 17,300 Watts available to us in an NMT rack,

Upgrading

CHTC just upgrades to the latest version when it becomes available, right?  Do you ever run into problems because of this?  We are still using version 9 because I can't seem to schedule a time with our VLASS group to test version 10.  Much less version 23.

ANSWER: yes.  The idea is that CHTC's users are helping them test the latest versions.



Flocking to CHTC?

We may want to run VLASS jobs at CHTC.  What is the best way to submit locally and run globally?

ANSWER: Greg thinks flocking is the best idea.

This will require 9618 open to nmpost-master and probably a static NAT and external DNS name.



External users vs staff

We are thinking about making a DMZ ( I don't like that term ) for observers.  Does CHTC staff use the same cluster resources that CHTC observers (customers) use?

ANSWER: There is no airgap at CHTC; everyone uses the same cluster.  Sometimes users use a different AP, but more for load balancing than security.  Everyone does go through 2FA.



Does PATh Cache thingy(tm) (a.k.a. Stash) work outside of PATh?

I see HTCondor-10.x comes with a stash plugin.  Does this mean we could read/write to stash from NRAO using HTCondor-10.x?

ANSWER: Greg thinks you can use stash remotely, like at our installation of HTCondor.



Curl_plugin doesn't do FTP

None of the following work.  They either hang or produce errors.  They work on the shell command line, except at CHTC where the squid server doesn't seem to grok FTP.

transfer_input_files = ftp://demo:password@test.rebex.net:/readme.txt
transfer_input_files = ftp://ftp:@ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp.gnu.org:/welcome.msg
transfer_input_files = ftp://ftp:@ftp.slackware.com:/welcome.msg
transfer_input_files = ftp://ftp.slackware.com:/welcome.msg

2024-02-05: Greg thinks this should work and will look into it.

ANSWER: 2024-02-06 Greg wrote "Just found the problem with ftp file transfer plugin.  I'm afraid there's no easy workaround, but I've pushed a fix that will go into the next stable release. "


File Transfer Plugins and HTCondor-C

I see that when a job starts, the execution point (radial001) uses our nraorsync plugin to download the files.  This is fine and good.  When the job is finished, the execution point (radial001) uses our nraorsync plugin to upload the files, also fine and good.  But then the RADIAL schedd (radialhead) also runs our nraorsync plugin to upload files.  This causes problems because radialhead doesn't have the _CONDOR_JOB_AD environment variable and the plugin dies.  Why is the remote schedd running the plugin and is there a way to prevent it from doing so?

Greg understands this and will ask the HTCondor-c folks about it.

Greg thinks it is a bug and will talk to our HTCondor-C people.

2023-08-07: Greg said the HTCondor-C people agree this is a bug and will work on it.

2023-09-25 krowe: send Greg my exact procedure to reproduce this.

2023-10-02 krowe: Sent Greg an example that fails.  Turns out it is intermittent.

2024-01-22 krowe: will send email to the condor list

ANSWER: It was K. Scott all along.  I now have HTCondor-C working from the nmpost and testpost clusters to the radial cluster using my nraorsync plugin to transfer both input and output files.  The reason the remote AP (radialhead) was running the nraorsync plugin was because I defined it in the condor config like so.

FILETRANSFER_PLUGINS = $(FILETRANSFER_PLUGINS), /usr/libexec/condor/nraorsync_plugin.py

I probably did this early in my HTCondor-C testing not knowing what I was doing.  I commented this out, restarted condor, and now everything seems to be working properly.



Quotes in DAG VARS

I was helping SSA with a syntax problem between HTCondor-9 and HTCondor-10 and I was wondering if you had any thoughts on it.  They have a dag with lines like this

  JOB SqDeg2/J232156-603000 split.condor
  VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000" 

Then they set that split_dir VAR to a variable in the submit description file like this

  SPLIT_DIR = "$(split_dir)"

The problem seems to be the quotes around $(split_dir).  It works fine in HTCondor-9 but with HTCondor-10 they get an error like this in their pims_split.dag.dagman.out file

  02/28/24 16:26:02 submit error: Submit:-1:Unexpected characters following doublequote.  Did you forget to escape the double-quote by repeating it?  Here is the quote and trailing characters: "SqDeg2/J232156+603000""

Looking at the documentation https://htcondor.readthedocs.io/en/latest/version-history/lts-versions-10-0.html#version-10-0-0 it's clear they shouldn't be putting quotes around $(split_dir). So clearly something changed with version 10.  Either a change to the syntax or, my guess, just a stricter parser.

Any thoughts on this?

ANSWER: Greg doesn't know why this changed but thinks we are now doing the right thing.
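If the stricter parser is the issue, the fix is presumably just to drop the quotes in the submit description file and let DAGMan handle the quoting of the VARS value:

```
# pims_split.dag: unchanged
VARS SqDeg2/J232156-603000 jobname="$(JOB)" split_dir="SqDeg2/J232156+603000"

# split.condor: no quotes around the VAR reference
SPLIT_DIR = $(split_dir)
```

This is my reading of the 10.0 version-history note, not something we have confirmed with SSA yet.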


OSDF Cache

Is there a way to prefer a job to run on a machine where the data is cached?

ANSWER: There is no knob in HTCondor for this but CHTC would like to add one.  They would like to glide in OSDF caches like they glide in nodes.  But this is all long-term ideas.


GPU names

HTCondor seems to have short names for GPUs which are the first part of the UUID.  Is there a way to use/get the full UUID?  This would make it consistent with nvidia-smi.

ANSWER: Greg thinks you can use the full UUID with HTCondor.

But cuda_visible_devices only provides the short UUID name.  Is there a way to get the long UUID name from cuda_visible_devices?

ANSWER: You can't use id 0 because 0 will always be the first GPU that HTCondor chose for you.  Some new release of HTCondor supports NVIDIA_VISIBLE_DEVICES which should be the full UUID.


Big Data

Are we alone in needing to copy in and out many GBs per job?  Do other institutions have this problem as well?  Does CHTC have any suggestions to help?  Sanja will ask this of Bockleman as well.

ANSWER: Greg thinks our transfer times are not uncommon but our processing time is shorter than many.  Other jobs have similar data sizes.  Some other jobs have similar transfer times but process for many hours.  Maybe we can constrain our jobs to only run on sites that seem to transfer quickly.  Greg is also interested in why some sites seem slower than others.  Is that actually site specific or is it time specific or...

Felipe does have a long list of excluded sites in his run just for this reason.  Greg would like a more declarative solution like "please run on fast transfer hosts", especially if this is dynamic.



GPUs_Capability

We have a host (testpost001) with both a Tesla T4 (Capability=7.5) and a Tesla L4 (Capability=8.9) and when I run condor_gpu_discovery -prop I see something like the following

DetectedGPUs="GPU-ddc998f9, GPU-40331b00"

Common=[ DriverVersion=12.20; ECCEnabled=true; MaxSupportedVersion=12020; ]

GPU_40331b00=[ id="GPU-40331b00"; Capability=7.5; DeviceName="Tesla T4"; DevicePciBusId="0000:3B:00.0"; DeviceUuid="40331b00-c3b6-fa9a-b8fd-33bec2fcd29c"; GlobalMemoryMb=14931; ]

GPU_ddc998f9=[ id="GPU-ddc998f9"; Capability=8.9; DeviceName="NVIDIA L4"; DevicePciBusId="0000:5E:00.0"; DeviceUuid="ddc998f9-99e2-d9c1-04e3-7cc023a2aa5f"; GlobalMemoryMb=22491; ]

The problem is `condor_status -compact -constraint 'GPUs_Capability >= 7.0'` doesn't show testpost001.  It does show testpost001 when I physically remove the T4.

Requesting a specific GPU with `RequireGPUs = (Capability >= 8.0)` or `RequireGPUs = (Capability <= 8.0)` does work however so maybe this is just a condor_status issue.

We then replaced the L4 with a second T4 and then GPUs_Capability functioned as expected.

Can condor handle two different capabilities on the same node?

ANSWER: Greg will look into it.  They only recently added support for different GPUs on the same node.  So this is going to take some time to get support in condor_status.  Yes this is just a condor_status issue.


Priority for Glidein Nodes

We have a factory.sh script that glides in Slurm nodes to HTCondor as needed.  The problem is that HTCondor then seems to prefer these nodes to the regular HTCondor nodes such that after a while there are several free regular HTCondor nodes, and three glide-in nodes.  Is there a way to set a lower priority on glide-in nodes so that HTCondor only chooses them if the regular HTCondor nodes are all busy?  I am going to offline the glide-in nodes to see if that works but that is a manual solution, not an automated one.

I would think NEGOTIATOR_PRE_JOB_RANK would be the trick but we already set that on the CMs to the following so that RANK expressions in submit description files are honored and negotiation will prefer NMT nodes over DSOC nodes if possible.

NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory

ANSWER: NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory + 100000 * (site == "not-slurm")

I don't like setting not-slurm in the dedicated HTCondor nodes.  I would rather set something like "glidein=true" or "glidein=1000" in the default 99-nrao config file and then remove it from the 99-nrao config in snapshots for dedicated HTCondor nodes.  But that assumes the base 99-nrao is for NM.  Since we are sharing an image with CV we can't assume that.  Therefore every node, whether dedicated HTCondor or not, will need a 99-nrao in its snapshot area.

SOLUTION

This seems to work.  If I set NRAOGLIDEIN = True on a node, then that node will be chosen last.  You may ask why not just subtract 10000000 * (NRAOGLIDEIN == True).  If I did that I would have to also set it to False on all the other nodes, otherwise the negotiator would fail to evaluate NEGOTIATOR_PRE_JOB_RANK to a float.  So I check that it isn't undefined and then check that it is true.  This way you could set NRAOGLIDEIN to False if you wanted.

NEGOTIATOR_PRE_JOB_RANK = (10000000 * Target.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory - 10000000 * ((NRAOGLIDEIN =!= UNDEFINED) && (NRAOGLIDEIN == True))

I configured our pilot.sh script to add the NRAOGLIDEIN = True key/value pair to a node when it glides in to HTCondor.  That is the simplest and best place to set this I think.



K8s kubernetes

2024-04-15 krowe: There is a lot of talk around NRAO about k8s these days. Can you explain if/how HTCondor works with k8s?  I'm not suggesting we run HTCondor on top of k8s but I would like to know the options.

Condor and k8s have different goals.  Condor runs an effectively unbounded number of jobs, each for a finite time; k8s runs a finite number of services, each for an unbounded time.

There is some support in k8s to run batch jobs but it isn't well formed yet.  Running the condor services like the CM in k8s can make some sense.

The new hotness is using eBPF to change routing tables.



RedHat8 Only

Say we have a few RedHat8 nodes and we only want jobs to run on those nodes that request RedHat8 with

requirements = (OpSysAndVer == "RedHat8")

I know I could set up a partition like we have done with VLASS but since HTCondor already has an OS knob, can I use that?

Setting RedHat8 in the job requirements guarantees the job will run on a RedHat8 node, but how do I make that node not run jobs that don't specify the OS they want?

The following didn't do what I wanted.

START = ($(START)) && (TARGET.OpSysAndVer =?= "RedHat8")

Then I thought I needed to specify jobs where OpSysAndVer is not Undefined but that didn't work either.  Either of the following do prevent jobs that don't specify an OS from running on the node but they also prevent jobs that DO specify an OS via either OpSysAndVer or OpSysMajorVer respectively.

START = ($(START)) && (TARGET.OpSysAndVer isnt UNDEFINED)

START = ($(START)) && (TARGET.OpSysMajorVer isnt UNDEFINED)


A better long-term solution is probably for our jobs (VLASS, VLA calibration, ingestion, etc) to ask for the OS that they want if they care.  Then they can test new OSes when they want and we can upgrade OSes at our schedule (to a certain point).  I think asking them to start requesting the OS they want now is not going to happen but maybe by the time RedHat9 is an option they and we will be ready for this.

ANSWER: unparse takes a classad expression and turns it into a string; then use a regex on that string looking for OpSysAndVer.

Is this the right syntax?  Probably not as it doesn't work

START = ($(START)) && (regexp(".*RedHat8.*", unparse(TARGET.Requirements)))

Greg thinks this should work.  We will poke at it.

The following DOES WORK in the sense that it matches anything.

START = ($(START)) && (regexp(".", unparse(TARGET.Requirements)))

None of these work

START = ($(START)) && (regexp(".*RedHat8.*", unparse(Requirements)))
START = ($(START)) && (regexp(".*a.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("((OpSysAndVer.*", unparse(TARGET.Requirements)))
START = ($(START)) && (regexp("\(\(OpSysAndVer.*", unparse(Requirements)))
START = ($(START)) && (regexp("(.*)RedHat8(.*)", unparse(Requirements)))
START = ($(START)) && (regexp("RedHat8", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "i"))
START = ($(START)) && (regexp("^.*RedHat8.*$", unparse(Requirements), "m"))
START = ($(START)) && (regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements)))
START = $(START) && regexp("OpSysAdnVer\\s*==\\s*\"RedHat8\"", unparse(Requirements))

#START = $(START) && debug(regexp(".*RedHat8.*", unparse(TARGET.Requirements)))


This should also work

in the config file

START = $(START) && target.WantToRunOnRedHat8Only

Submit file

My.WantToRunOnRedHat8Only = true

But I would rather not have to add yet more attributes to the EPs.  I would like to use the existing OS attribute that HTCondor provides.


Wasn't there a change to PCRE to PCRE2 or something like that?  Could that be causing the problem?  2023-11-13 Greg doesn't think so.


2024-01-03 krowe: Can we use a container like this?  How does PATh do this?

+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest"



See retired nodes

2024-04-15 krowe: Say I set a few nodes to offline with a command like condor_off -startd -peaceful -name nmpost120  How can I later check to see which nodes are offline?

ANSWER: 2022-06-27

condor_status -const 'Activity == "Retiring"'

Another option is offline ads, which are a way for HTCondor to update the status of a node after the startd has exited.

condor_drain -peaceful # CHTC is working on this.  I think this might be the best solution.

Try this: condor_status -constraint 'PartitionableSlot && Cpus && DetectedCpus && State == "Retiring"'

or this: condor_status -const 'PartitionableSlot && State == "Retiring"' -af Name DetectedCpus Cpus

or: condor_status -const 'PartitionableSlot && Activity == "Retiring"' -af Name Cpus DetectedCpus 

or: condor_status -const 'partitionableSlot && Activity == "Retiring" && cpus == DetectedCpus'

None of which actually show nodes that have drained.  I.e. were in state Retiring and are now done running jobs.

ANSWER: This seems to work fairly well, though I'm not sure it is perfect: condor_status -master -constraint 'STARTD_StartTime == 0'


Condor_reboot?

Is there such a thing?  Slurm has a nice one `scontrol reboot HOSTNAME`.  I know it might not be the condor way, but thought I would ask.

ANSWER: https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#MASTER_SHUTDOWN_%3CName%3E and https://htcondor.readthedocs.io/en/latest/man-pages/condor_set_shutdown.html  maybe do the latter and then the former and possibly combined with condor_off -peaceful.  I'll need to play with it when I feel better.
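A sketch of how those two knobs might combine (untested; the shutdown-program name "reboot" and the paths are my guesses):

```
# condor config on the execute node
MASTER_SHUTDOWN_REBOOT = /usr/sbin/reboot

# then, from an administrator machine:
condor_set_shutdown -exec reboot -name nmpost120
condor_off -master -name nmpost120
```

The idea is that the master runs the named MASTER_SHUTDOWN_<Name> program as it exits, so the node reboots only after condor is down; adding -peaceful to condor_off would let running jobs finish first.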


Felipe's code

Felipe to share his job visualization software with Greg and maybe present at Throughput 2024.

https://github.com/ARDG-NRAO/LibRA/tree/main/frameworks/htclean/read_htclean_logs


Versions and falling behind

We are still using HTCondor-10.0.2.  How far can/should we fall behind before catching up again? 

ANSWER: Version 24 is coming out around condor week in 2024.  It is suggested to drift no more than one major version, e.g. don't be older than 23 once 24 is available.



Sams question

A DAG of three nodes: fetch -> envoy -> deliver. Submit host and cluster are far apart, and we need to propagate large quantities of data from one node to the next. How do we make this transfer quickly (i.e. without going through the submit host) without knowing the data's location at submit time?

krowe: Why do this as a dag?  Why not make it one job instead of a dag?  Collapsing the DAG into just one job has the advantage that it can use the local condor scratch area and can easily restart if the job fails without need for cleaning up anything.  And of course making it one job means all the steps know where the data is.

Greg: condor_chirp set_job_attr attributeName 'Value'.  You could do something like

condor_chirp set_job_attr DataLocation '"/path/to/something"'

or

condor_chirp put_file local remote

Each DAG has a prescript that runs before the dag nodes.

Another idea is to define the directory before submitting the job (e.g. /lustre/naasc/.../jobid)


Condor history for crashed node

We have nodes crashing sometimes.  1. Should HTCondor recover from a crashed node?  Will the jobs be restarted somewhere else?  2. How can I see what jobs were running on a node when it crashed?

How about this

condor_history -name mcilroy -const "stringListMember(\"alias=nmpost091.aoc.nrao.edu\", StarterIpAddr, \"&\") == true"

ANSWER: There is a global event log but it has to be enabled, and it isn't in our case: EVENT_LOG = $(LOG)/EventLog

ANSWER: to show jobs that have restarted: condor_q -name mcilroy -allusers -const 'NumShadowStarts > 1'


STARTD_ATTRS in glidein nodes

We add the following line to /etc/condor/condor_config on all our Slurm nodes so that if they get called as a glidein node, they can set some special glidein settings.

LOCAL_CONFIG_FILE = /var/run/condor/condor_config.local

Our /etc/condor/config.d/99-nrao file effectively sets the following

STARTD_ATTRS =  PoolName NRAO_TRANSFER_HOST HASLUSTRE BATCH

Our /var/run/condor/condor_config.local, which is run by glidein nodes, sets the following

STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN

The problem is glidein nodes don't get all the STARTD_ATTRS set by 99-nrao.  They just get NRAOGLIDEIN.  It is like condor_master reads 99-nrao to set STARTD_ATTRS, then reads condor_config.local and sets STARTD_ATTRS again without expanding $(STARTD_ATTRS).

ANSWER:  The last line in /var/run/condor/condor_config.local is re-writing STARTD_ATTRS because it reads

STARTD_ATTRS = NRAOGLIDEIN

It should have $(STARTD_ATTRS) appended: STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN



Output to two places

Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (maybe a subset) to a different area for the User to look at.  Is there a condor way to do this?  transfer_output_remaps?

ANSWER: Greg doesn't think there is a Condor way to do this.  We could make a copy of the subset and use transfer_output_remaps on the copy but that is a bit of a hack.


Pelican?

Felipe is playing with it and we will probably want it at NRAO.

ANSWER: Greg will ask around.


RHEL8 Crashing

We have had many NMT VLASS nodes crash since we upgraded to RHEL8.  I think the nodes were busy when they crashed.  So I changed our SLOT_TYPE_1 from 100% to 95%.  Is this a good idea?

ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95% and put SLOT_TYPE_1=100% again.



getenv

Did it change since 10.0?  Can we still use getenv in DAGs or regular jobs?

#krowe Nov  5 2024: getenv no longer includes your entire environment as of version 10.7 or so.  Instead it only includes the environment variables you list, e.g. with the "ENV GET" syntax in the .dag file.
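A sketch of the newer, explicit-list syntax (the variable names below are just examples of what a job might need):

```
# in the .dag file: forward only these variables
ENV GET PATH HOME

# in a plain submit description file, getenv takes a list too:
getenv = PATH, HOME
```

Either form avoids shipping the whole submit-side environment with every job, which is what CHTC wants users to stop doing.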


https://git.ligo.org/groups/computing/-/epics/30

ANSWER: Yes this is true.  CHTC would like users to stop using getenv=true.  There may be a knob to restore the old behavior.

DONE: check out docs and remove getenv=true


condor_userlog

condor_userlog /users/krowe/htcondor/condor_userlog/tmprn04xnqo/condor.log shows over 100% CPU Utilization.  How does that happen?  Hyperthreading is disabled.

nmpost-master krowe >condor_userlog condor.log 

Job      Host            Start Time  Evict Time  Wall Time Good Time CPU Usage
7315.0   10.7.7.168       2/11 19:42  2/11 23:35   0+03:52   0+03:52   0+08:31
7316.0   10.7.7.168       2/11 23:35  2/12 05:03   0+05:27   0+05:27   0+05:01
7317.0   10.7.7.168       2/12 05:03  2/12 06:13   0+01:09   0+01:09   0+00:33

Host/Job        Wall Time Good Time CPU Usage Avg Alloc  Avg Lost Goodput  Util.

10.7.7.168        0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

7315.0            0+03:52   0+03:52   0+08:31   0+03:52   0+00:00  100.0% 219.9%
7316.0            0+05:27   0+05:27   0+05:01   0+05:27   0+00:00  100.0%  92.1%
7317.0            0+01:09   0+01:09   0+00:33   0+01:09   0+00:00  100.0%  47.7%

Total             0+10:29   0+10:29   0+14:06   0+03:29   0+00:00  100.0% 134.4%

ANSWER: Greg is not aware of any such bugs or reasons this would happen.



Seeing hostnames in condor_q output

What is the condor way to see the hostnames in condor_q output?  Say a user wants to see what jobs are running on host nmpost037.

The reason I want to know is so when I am helping some user with our HTCondor install I can show them how to see the hostnames without telling them to use my script.

ANSWER: condor_q -run -all -g


Using Pelican to replace nraorsync?

nraorsync does three things: uses rsync so only what has changed is written back; uses the faster network (IB, 10g, etc); and uses our "data move" host gibson, since our AP, nmpost-master, doesn't have an external IP.

Greg thinks Pelican can write back only what has changed.

ANSWER: I found a ?recursive option to the URL but I don't know if that does any deduplication like rsync does or not.



Writing to NRAO Origin

If we want to write to our Origin do we need to enable authentication?

What is involved with doing that?

Greg doesn't know.  I will look at the docs.


transfer_output_files change in version 23

My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files.  Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy.  But with version 23, this no longer works.  I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files.  Is that right?  If so, I think I can work around it.  Just wanted to know.

ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before.  Greg will see if there is a better, first-class way to trigger transfers.

DONE: We will use condor_transfer since it needs to be there anyway.


Installing version 23

I am looking at upgrading from version 10 to 23 LTS.  I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only.  It doesn't provide repos to install the LTS.

https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html

ANSWER: Greg will find it and get back to me.

DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/



campuschampions

Have you heard of this email list https://campuschampions.cyberinfrastructure.org/


pools

This is pretty amazing.  I can check the status and queue of OSG and PATh from nmpost-master

condor_status -pool cm-1.ospool.osg-htc.org

or

condor_status -pool htcondor-cm-path.osg.chtc.io

As long as I have a token for them in ~/.condor/tokens.d