Current Questions
PATh libraries
It seems that not all the PATh execution hosts have the same set of libraries. For example my program requires libmvec.so.1 which seems to only exist on some hosts like FIU-*, WISC-*, and sometimes UNL-* and SYRA-* hosts. But never on Expanse-* and often not on SYRA-* hosts.
Docker at PATh
Doesn't seem to work. My docker universe job just stayed idle for three days.
Singularity at PATh
My singularity jobs run but get the following error output
INFO Discarding path '/hadoop'. File does not exist
INFO Discarding path '/ceph'. File does not exist
INFO Discarding path '/hdfs'. File does not exist
INFO Discarding path '/lizard'. File does not exist
INFO Discarding path '/mnt/hadoop'. File does not exist
INFO Discarding path '/mnt/hdfs'. File does not exist
WARNING: Environment variable HAS_SINGULARITY already has value [True], will not forward new value [1] from parent process environment
WARNING: Environment variable REQUIRED_OS already has value [default], will not forward new value [] from parent process environment
/srv/.gwms-user-job-wrapper.sh: line 882: /usr/bin/singularity: No such file or directory
Nvidia GPUDirect Storage?
https://developer.nvidia.com/blog/gpudirect-storage/
Using containers at PATh
Anything special to do? Singularity/Apptainer?
ANSWER: Greg doesn't think so but Singularity would be the first to test.
2FA
I don't seem to be prompted for two-factor authentication when I login to CHTC. Should I?
ANSWER: Greg will ask.
Cluster domain names
This is not an HTCondor question but perhaps Greg has some insight. Let's say I am setting up a turn-key cluster. We deliver a rack of compute nodes with one head node. That one head node will need Internet access for SSH, DNS, etc. But the compute nodes don't need any Internet access. You can only get to them via ssh by first sshing to the head node, and the only host they use as a name server is the head node. So, my question is what TLD should all these compute nodes be in? I would like to use a local, non-routable TLD analogous to non-routable IP ranges like 10.0.0.0/8 or 192.168.0.0/16. But there doesn't seem to be such a thing defined by ICANN.
- .example is only for documentation.
- .test is only for testing and our product is in production.
- .invalid is meant for initial scripts that must be changed.
- .home.arpa is for home networking (thank you IETF). Not ideal but better than the above.
- .local is all tied up with mDNS thanks to Apple. But Kubernetes uses .local without mDNS.
- .internal is used by both Google and Amazon which is pretty compelling.
- There have been many Internet-Draft documents since 2017 attempting to address this but none have become RFCs.
Have you heard of any clusters using any "private" TLDs? Can we just use IPs and not use names?
ANSWER: Greg looked at their nodes and saw both .local and .internal in use
In progress
Blocking on upload
Don't have condor block on the transfer plugin uploading. It doesn't block on download. When it blocks on upload and the upload is large, the job may get killed if NOT_RESPONDING_TIMEOUT isn't set to something larger than the 3600 second default.
stdout and stderr with plugins
When using a transfer plugin to transfer output files, stdout and stderr are copied back as _condor_stdout and _condor_stderr. It doesn't rename them to what output and error are set to in the submit description file. If I use a transfer plugin for just input files and not output files, then stdout and stderr are copied back as requested in the submit description file.
This seems like a bug to me. Since my plugin isn't transferring these files, HTCondor must be doing it, so HTCondor should honor what is set in the submit description file whether I am using a transfer plugin or not.
Perhaps have rsync transfer these instead? Or use another custom classad instead of output_destination? Or what if I put _condor_stdout in +nrao_output_files?
Actually, it doesn't seem to be triggered by just output_destination. If I set output_destination = $ENV(PWD) and don't use a plugin for output files, I get stdout.jobid.log like I requested.
From coatsworth@cs.wisc.edu Mon Nov 29 12:26:12 2021
I've looked into this in the file transfer code. On the execution
side, we always write stdout and stderr to the _condor_stdout and
_condor_stderr files, then we remap them back to user-provided names
after a job completes. When you have output_destination set, our File
Transfer mechanism does not send files back to the submit machine by
default. However since your plugin is explicitly rysnc-ing files back
there, they get moved without going through the remapping.
I think your File Transfer Mechanism does send files back to the submit machine by default. My transfer plugin is not transferring _condor_stderr nor _condor_stdout.
Apr. 22, 2022 krowe: Actually I think that since I set *output_destination = nraorsync://...* in the submit description file, the FTM *is* using the plugin. It has to because output_destination requires that everything use the plugin. So the FTM calls the plugin with _condor_stdout and _condor_stderr which activates the upload_file() function in the plugin and the files are copied. This is why they aren't remapped. I configured the plugin to also copy these files and rename them. Perhaps instead I could watch for them in upload_file() and remap them there?
ANSWER: Greg is going to tell Mark to put all this plugin work on the back burner or maybe stop altogether. Our plugin works with its work-arounds so this stuff is not critical.
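The idea of remapping inside upload_file() could look roughly like the sketch below. It is not the code in nraorsync_plugin.py. The function name and the requested_out and requested_err parameters are hypothetical and stand for the values of output and error from the submit description file, however the plugin obtains them.

```python
import os


def remap_std_name(local_file_name, requested_out, requested_err):
    """Return the destination name for stdout/stderr, or None if no remap applies.

    requested_out and requested_err are hypothetical parameters holding the
    values of `output` and `error` from the submit description file.
    """
    base = os.path.basename(local_file_name)
    # _condor_stdout and _condor_stderr are the scratch-directory names
    # described in the email quoted above.
    if base == "_condor_stdout" and requested_out:
        return requested_out
    if base == "_condor_stderr" and requested_err:
        return requested_err
    return None
```

A caller would use the returned name as the destination file name when it is not None and leave every other file untouched.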
nraorsync_plugin.py
Since HTCondor walks the directories in transfer_output_files and submits the files to the plugin one at a time, which doesn't work with rsync, we decided to work around the problem.
# Trick HTCondor into launching the plugin to handle output files
transfer_output_files = .job.ad
# custom job ad of files/dirs using nraorsync_plugin.py
+nrao_output_files = "software data"
Call our upload_rsync() function before calling upload_file() in main()
with open(args['outfile'], 'w') as outfile:
    # krowe Oct 28 2021:
    if args['upload']:
        if upload_rsync() != 0:
            raise err
    for ad in infile_ads:
ANSWER: I explained this to CHTC. They think it is at least an elegant hack. :^)
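The body of upload_rsync() is not shown on this page. As a rough sketch of the approach, and not the actual implementation, the two helpers below read the custom attribute from the text of .job.ad and build a single rsync command for everything it lists. The helper names and the destination argument are assumptions made for illustration.

```python
import re


def read_output_list(job_ad_text, attr="nrao_output_files"):
    """Return the whitespace-separated names stored in a custom job ad attribute."""
    # In .job.ad the attribute appears without the leading '+' used in the
    # submit description file, e.g.  nrao_output_files = "software data"
    pattern = re.compile(
        r'^\s*' + re.escape(attr) + r'\s*=\s*"(.*)"\s*$',
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(job_ad_text)
    return match.group(1).split() if match else []


def build_rsync_command(names, destination):
    """Build one rsync invocation covering all listed files and directories."""
    if not names:
        return []
    # destination is a hypothetical rsync target such as "host:/path/".
    return ["rsync", "-a", "--"] + list(names) + [destination]
```

Handing whole directories to one rsync call avoids the per-file walk that HTCondor performs on transfer_output_files.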
Transfer Plugin Upload
Working with Mark Coatsworth on this.
I have added my nraorsync_plugin.py to /usr/libexec/condor on the execution host and added the following configuration to the execution host:
FILETRANSFER_PLUGINS = $(LIBEXEC)/nraorsync_plugin.py, $(FILETRANSFER_PLUGINS)
I have the following job:
#!/bin/sh
mkdir newdir
date > newdir/date
/bin/sleep ${1}
and the following submit file:
executable = smaller.sh
arguments = "27"
output = stdout.$(ClusterId).log
error = stderr.$(ClusterId).log
log = condor.$(ClusterId).log
should_transfer_files = YES
transfer_input_files = /users/krowe/.ssh/condor_transfer
transfer_output_files = newdir
output_destination = nraorsync://$ENV(PWD)
+WantIOProxy = True
queue
The resulting input file that is fed to my plugin when the plugin is called with the -upload argument (.nraorsync_plugin.in) contains this:
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stderr"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stderr" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stdout"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stdout" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/newdir/date"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/newdir/date" ]
I am surprised to see that it sets LocalFileName and Url to the file inside newdir instead of newdir itself. Needless to say, this makes rsync unhappy as newdir doesn't exist on the destination yet.
If I create 'newdir' in the destination directory before submitting the job, the plugin will correctly copy the 'date' file back to the 'newdir' directory but the condor log file shows the following:
022 (4149.000.000) 08/05 09:22:04 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_4@testpost003.aoc.nrao.edu <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
...
023 (4149.000.000) 08/05 09:22:04 Job reconnected to slot1_4@testpost003.aoc.nrao.edu
startd address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
starter address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_4_28601_bde4_612>
...
condor re-runs the upload portion of the plugin four more times before finally giving up with this error
007 (4149.000.000) 08/05 09:22:31 Shadow exception!
Error from slot1_4@testpost003.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
1007 - Run Bytes Received By Job
If I create a file like 'outputfile' instead of 'newdir' and transfer that, everything works fine.
I have an example in /home/nu_kscott/htcondor/plugin_small
ANSWER: Greg will look into this. K. Scott is working with Mark Coatsworth on this.
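Because the plugin is handed files inside newdir rather than newdir itself, one possible work-around is to work out which destination subdirectories must exist before any file is copied. The sketch below only computes that list from the plugin input text. The function name and the destination_root parameter are hypothetical, and the regular expression assumes the ad layout shown above.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches one ad of the form: [ LocalFileName = "..."; Url = "..." ]
AD_PATTERN = re.compile(r'LocalFileName\s*=\s*"([^"]*)";\s*Url\s*=\s*"([^"]*)"')


def destination_dirs(plugin_input_text, destination_root):
    """Return subdirectories of destination_root that the listed URLs need."""
    dirs = set()
    for _local, url in AD_PATTERN.findall(plugin_input_text):
        dest_path = urlparse(url).path
        rel = posixpath.relpath(posixpath.dirname(dest_path), destination_root)
        # Skip files that land directly in the root or outside of it.
        if rel != "." and not rel.startswith(".."):
            dirs.add(rel)
    return sorted(dirs)
```

The plugin could create each returned directory on the destination (for example with mkdir -p) before running rsync on the individual files.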
condor_off vs condor_drain
I would like to be able to issue a command to an execute host telling it to stop accepting new jobs and let the current jobs finish. I would also like that host to stay in the condor_status output with a message indicating what I have done (e.g. draining, offline, etc.). I think I want something that does some of condor_off and some of condor_drain. Is there such a beast?
For example, a -peaceful option to condor_drain might be perfect.
condor_off
- Pro: Immediately prevents new jobs from starting on the node with the -startd option
- Pro: Is supposed to let existing jobs finish running with the -peaceful option but
- Con: Stopped working with -peaceful for me sometime after upgrading to 9.0.4
- Con: Doesn't have a -reason option
- Con: removes host from condor_status
condor_drain
- Pro: Leaves host in condor_status
- Pro: Has a -reason option
- Con: Immediately evicts all running jobs on node
ANSWER: condor_status -master and use condor_off
ANSWER: Greg thinks condor_drain should have a -peaceful option. (bug)
2022-10-05 krowe: HTCondor 9.12.0 "Added -drain option to condor_off and condor_restart". I think this might be the solution I wanted; they just went in a different direction. Instead of 'condor_drain -peaceful' there is now a 'condor_off -drain'. The feature isn't in the LTS release yet. Perhaps it will be in 9.0.18. Then I will test it.
Bug: condor_off -peaceful
testpost-cm-vml root >condor_off -peaceful -name testpost002
Sent "Set-Peaceful-Shutdown" command to startd testpost002.aoc.nrao.edu
Can't find address for schedd testpost002.aoc.nrao.edu
Can't find address for testpost002.aoc.nrao.edu
Perhaps you need to query another pool.
Yet it works without the -peaceful option
testpost-cm-vml root >condor_off -name testpost002
Sent "Kill-All-Daemons" command to master testpost002.aoc.nrao.edu
ANSWER: Add the -startd option. E.g. condor_off -peaceful -startd -name <hostname> Greg thinks it might be a regression (another bug). This still happens even after I set all the CONDOR_HOST knobs to testpost-cm-vml.aoc.nrao.edu. So it is still a bug and not because of some silly config I had at NRAO.
Show offline nodes
Say I set a few nodes to offline with a command like condor_off -startd -peaceful -name nmpost120. How can I later check to see which nodes are offline?
- condor_status -offline returns nothing
- condor_status -long nmpost120 returns nothing about being offline
- The following shows nodes where startd has actually stopped but it doesn't show nodes that are set offline but still running jobs (e.g. Retiring)
- condor_status -master -constraint 'STARTD_StartTime == 0'
- This shows nodes that are set offline but still running jobs (a.k.a. Retiring)
- condor_status |grep Retiring
ANSWER: 2022-06-27
condor_status -const 'Activity == "Retiring"'
offline ads, which is a way for HTCondor to update the status of a node after the startd has exited.
condor_drain -peaceful # CHTC is working on this. I think this might be the best solution.
Glidein
The only documentation I can find on glidein (https://htcondor.readthedocs.io/en/latest/grid-computing/introduction-grid-computing.html?highlight=glidein#introduction) seems to imply that glidein only works with Globus: "HTCondor permits the temporary addition of a Globus-controlled resource to a local pool. This is called glidein." Is this correct? Is there better documentation? Is glidein even a technology or software package or is it just a generic term?
ANSWER: Greg will look at rewriting this.
request_virtualmemory
If I set request_virtualmemory = 2G, condor_submit accepts it as a valid knob but the job stays idle and never runs.
request_memory = 1G
request_virtualmemory = 2G
If I set request_virtualmemory = 2000000, which should be the same as 2G, the job runs but doesn't set memory.memsw.limit_in_bytes in the cgroup.
Oct. 11, 2021 krowe: Checked with HTCondor-9.0.6. Problem still exists unchanged.
ANSWER: krowe sent mail to Greg about it
Answered Questions
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they? Are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A Job ID (ClusterID.ProcID)
- DNS name of the schedd and ctime of the job_queued.log file.
- It is unique to a schedd.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on just it.
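As one way to "craft their own ID", the sketch below combines the three pieces mentioned above: the DNS name of the schedd, the ctime of its job queue log, and ClusterID.ProcID. The function name, the separator, and the field order are assumptions; any stable combination of these values would serve the same purpose.

```python
def make_tracking_id(schedd_name, queue_created, cluster_id, proc_id):
    """Combine schedd name, queue creation time (epoch seconds), and job ID.

    queue_created is assumed to be the ctime of the schedd's job queue log,
    so an ID stays distinct even if the schedd's queue is ever recreated and
    ClusterId values start over.
    """
    return "{}#{}#{}.{}".format(
        schedd_name, int(queue_created), int(cluster_id), int(proc_id)
    )
```

For example, make_tracking_id("submit.example.org", 1600000000, 2385, 0) gives "submit.example.org#1600000000#2385.0", where the host name and timestamp are made up for illustration.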
- Upgrading HTCondor without killing jobs?
- schedd can be upgraded and restarted without losing state assuming the restart is less than the timeout.
- currently restarting execute services will kill jobs. CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure 8.8.x is compatible with 8.8.y or 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones)
- There is another software filesystem where CASA could live that is more used for admin usage but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
- Environment: Is there a way to have condor "login" when a job starts thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node. Perhaps the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable as /bin/bash and then have the arguments be the executable you used to have. I just changed our stuff to statically set $HOME and I think that is good enough.
- Flocking: Suppose I have two hosts in the same pool. testpost-master is a submit-host and testpost-serv-1 is both a submit-host and the central-manager. testpost-serv-1 is configured to flock to CHTC but testpost-master is not. Is it possible to submit a job on testpost-master that will flock to CHTC by somehow leveraging testpost-serv-1? In other words, do I have to setup flocking and an external IP on every submit host?
- ANSWER: there isn't a good way to do this. So eventually we will need to make testpost-master flock to CHTC and possibly remove the ability of testpost-serv-1 to flock.
- It seems the transfer mechanism won't transfer symlinks to directories (e.g. data/vlass.ms → /lustre/aoc/...) Is there a way around this?
- ANSWER: there is no flag to chase symlinks at the moment. The top level dir (e.g. data) could be a symlink if transfer_input_files=data/ but it will then transfer the contents of data instead of data itself.
- If symlink → data and transfer_input_files=symlink I get the error Transfer of symlinks to directories is not supported.
- if symlink → ../data and transfer_input_files=symlink/ it transfers the contents not the directory. In other words I don't have a data directory in scratch I have a VLASS... directory.
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/VLASS/
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/ I get the error Transfer of symlinks to directories is not supported.
- DAG log time stamps, is there a way to differentiate data import/export time and process run time.
- Look in the job log file not the dag log file
- 040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
Transferring to host: <10.64.10.172:9618?addrs=10.64.10.172-9618&alias=nmpost072.aoc.nrao.edu&noUDP&sock=slot1_1_72656_7984_60>
...
040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files
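To separate transfer time from run time, the two 040 events can be subtracted. The sketch below does that for the input transfer of a single job. It assumes the ISO-style timestamps shown in this excerpt (other logs on this page use a month/day format) and that the log holds one job; the function name is hypothetical.

```python
import re
from datetime import datetime

# Event header: code, (cluster.proc.subproc), timestamp, description
EVENT = re.compile(
    r'^(\d{3}) \(([\d.]+)\) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$'
)


def input_transfer_seconds(log_text):
    """Return seconds between the start and finish of input file transfer."""
    started = finished = None
    for line in log_text.splitlines():
        match = EVENT.match(line.strip())
        if not match:
            continue  # detail lines and "..." separators
        stamp = datetime.strptime(match.group(3), "%Y-%m-%d %H:%M:%S")
        if match.group(4) == "Started transferring input files":
            started = stamp
        elif match.group(4) == "Finished transferring input files":
            finished = stamp
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()
```

Applied to the excerpt above, the two timestamps are 13:05:45 and 13:06:04, so the input transfer took 19 seconds.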
- Rank and Preemption: Can we use Rank to set "preferences" without requiring job preemption?
- ANSWER: There are 2 kinds of rank (job rank, machine rank). job rank (RANK=... in a submit file) is purely a preference. That does not preempt. Machine rank (in startd.config) will preempt. Negotiator pre-job rank is a third type of rank that works at a pool level and is often used to pack jobs efficiently.
- Update on software store for CASA either on shared Ceph storage or admin software storage
- Staging area for datasets 100MB - TBs. This is where we could try keeping the cfcache assuming doing so doesn't overwhelm the filesystem.
- /staging/nu_jrobnett
Requirements = (Target.HasCHTCStaging == true)
- Quota: 100GB, 100K files
- Squid area for 100MB - 1GB input or shared software. This is where we could keep casa.tgz and then have the execution host retrieve it via HTTP.
- /squid/nu_jrobnett
- only accessible via this path on the submit hosts. Execution hosts will need to access it via HTTP.
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/nu_jrobnett/casa.tgz
- Software area. We can use this in run-time applications. Think of it like /usr/local.
/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7
- export PATH=/opt/local/bin:/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7/bin:${PATH}
- Quota: 5GB, 100K files
- Public_input_files: How is this different than transfer_input files and when would one want to use it instead of files or URLs with transfer_input_files?
- https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html#public-input-files
- This is still a work in progress. It may allow for caching on a squid server, fetchable by others someday.
- Flocking: When we flock to CHTC what is the data path for transfer_input_files? Is it between our submit host and CHTC's execution host, or is CHTC's submit host involved?
- Dataflow is from our schedd (submit host) to their execute host but CCB will reverse the connection. Their execution hosts are publicly addressable but that may not be necessary.
- How can we choose the data path for transfer_input_files to our clients given multiple networks. Currently we assume it will use the 1Gb link but we have IB links. Is there a way for condor to use the IB link just for transferring files, is that hostname based ? Other ideas?
- CHTC doesn't have a good solution for this.
- We could upgrade from 1Gb to 10Gb
- We could use the IB names for everything (problematic for submit hosts that don't have IB)
- We could not use transfer mechanism and instead use something else like scp
- We could use a custom transfer plugin
- Are there known issues with distributed scratch via NFS or Lustre w.r.t tmpdir or other, e.g. OpenMPI complains about tmpdir being on network FS?
- Some problems with log files on the submit host but rare.
- Any general best practices to support MPI in terms of class ads or other.
- Use the shared memory transport for security
- Is there a way DAGMan can be told to ignore errors, in some cases we want a DAG to mindlessly continue vs retry.
- The job is considered successful based on the return of the post script. If there isn't a post script, the success is based on the return of the job.
- Transfer mechanism: Documentation implies that only files with an mtime newer than when the transfer_input_files finished will be transferred back to the submit host. While running a dag, the files in my working directory (which is in both transfer_input_files and transfer_output_files) seem to always have an mtime around the most recent step in the DAG suggesting that the entire working directory is copied from the execution host to the submit host at the end of each DAG step. Perhaps this means the transfer mechanism only looks at the mtime of the files/dirs specified in transfer_output_files and doesn't descend into the directories.
- Subdirectories are treated differently
- SOLUTION: I think casa just touches every file and therefore condor is forced to copy everything in the working directory. I have been unable to reproduce the problem outside of casa.
- SOLUTION: If you specify a directory, HTCondor will transfer the entire directory not just files with new mtime.
- Does the transfer mechanism accept any sort of regular expression? E.g. transfer_input_files=*.txt
- No
- Can the transfer mechanism accept manifest files? E.g. a file that is a list of files?
- Use include : <some file> in the submit script where <some file> contains the full transfer_input_files line
- Use queue FILES from manifest, which defines the submit variable $(FILES). It can then be used in a line like: transfer_input_files = $(FILES)
- Perhaps a plugin
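For the include approach, the included file has to contain the full transfer_input_files line. A small sketch of how that line could be produced from a manifest follows. The manifest format (one path per line, blank lines and lines starting with # ignored) and the function name are assumptions.

```python
def manifest_to_submit_line(manifest_text):
    """Turn a manifest (one path per line) into a transfer_input_files line."""
    paths = []
    for raw in manifest_text.splitlines():
        entry = raw.strip()
        # Skip blank lines and comment lines (an assumed manifest convention).
        if entry and not entry.startswith("#"):
            paths.append(entry)
    return "transfer_input_files = " + ", ".join(paths)
```

The returned line would be written to the file named in the include statement before running condor_submit.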
- What other options are there than holding a job? I find myself not noticing, sometimes for hours, that a job is on hold. Is there a way to make jobs fail instead of getting held? I assume others will make this mistake like me.
- I see I can set periodic_remove = (JobStatus == 5) but HTCondor doesn't seem to think that is an error so if I have notification = Error I don't get any email.
- Greg will look into adding a Hold option to notification
- The HTCondor idea of held jobs is that you submitted a large DAG of jobs, one step is missing a file and you would like to put that file in place and continue the job instead of the whole DAG failing and having to be resubmitted. This makes sense but it would be nice to be notified when a job gets held.
Greg wrote "notification = error in the submit file is supposed to send email when the job is held by the system, but there's a bug now where it doesn't. I'll fix this."
- What limits are there to transfer_input_files? I would sometimes get Failed to transfer files when the number of files was around 10,000
- ANSWER: There is a memory limit because of ClassAds but in general there isn't a defined limit.
- Is there a way to generate the dag.dot file without having to submit the job?
- The -no_submit option doesn't create the .dot file
- Is adding NOOP to all the JOB commands the right thing to do? The DAG still gets submitted but then quickly ends.
- ANSWER: You need to submit the DAG. NOOP is the current solution
- Is there a way to start a dag at a given point? E.g. if there are 5 steps in the dag, can you start the job at step 3?
- Is the answer again to add NOOP to the JOB commands you don't want to run?
- ANSWER: Splices may work here but NOOP is certainly a tool to use.
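Adding NOOP by hand gets tedious for larger DAGs, so it could be scripted. The sketch below appends NOOP to every JOB line that appears before a chosen node. It assumes the JOB lines are written in the file in the order the steps run and that none of them already ends with DONE; the function name and the node and file names in the example are hypothetical.

```python
def noop_before(dag_text, first_job_to_run):
    """Append NOOP to every JOB line listed before first_job_to_run."""
    out = []
    reached = False
    for line in dag_text.splitlines():
        words = line.split()
        if len(words) >= 3 and words[0].upper() == "JOB":
            if words[1] == first_job_to_run:
                reached = True
            already_noop = "NOOP" in (w.upper() for w in words[3:])
            if not reached and not already_noop:
                line = line.rstrip() + " NOOP"
        out.append(line)
    return "\n".join(out) + "\n"
```

For a five-step DAG, noop_before(text, "step3") would mark step1 and step2 as NOOP and leave step3 onward and all PARENT/CHILD lines unchanged.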
- I see at CHTC jobs now require a request_disk setting. How does one launch interactive jobs?
- ANSWER: This is a bug.
- For our initial tests, we want to flock jobs to CHTC that transfer about 60GB input and output. Eventually we will reduce this significantly but for now what can we do?
- Globus or rsync to move data? If Globus, how to do so in an automated way (E.g. no password)?
- ANSWER: Using the Transfer Mechanism from our submit server to CHTC's execution host is ok with CHTC because it doesn't interfere with their submit host. Outbound from their execute hosts is also allowed (scp, rsync, etc).
- Use the +LongJob = true attribute for 72+ hour jobs. This is a CHTC-specific knob.
- How can I know if my job swapped?
- ANSWER: CHTC nodes have no or minimal swap space.
- Condor Annex processing in AWS. Is there support for spot market
- ANSWER: Condor Annex does indeed support the spot market. It is a bit more work to set up because you don't say "give me X of Y", but "I'll pay d1 dollars for machines like X1 and d2 for machines like X2, etc.".
- What network mask should we use to allow ssh from CHTC into NRAO? Is it a class B or several class Cs?
- ANSWER: The ip (v4 !) ranges for CHTC execute nodes are
128.104.100.0/22
128.104.55.0/24
128.104.58.0/23
128.105.244.0/23
- Is there a ganglia server or some other monitor service at CHTC we can view?
ANSWER: We have a bunch of ganglia and grafana graphs for the system, but I think they are restricted to campus folks and tend to show system-wide utilization and problems.
I have a machine with an externally accessible, non-NATed address (146.88.10.48) and an internal, non-routable address (10.64.1.226). I want to install condor_annex on this machine such that I can submit jobs to AWS from it. I don't necessarily need to submit jobs to local execute hosts from this machine. Should I make this machine a central manager, a submit host, both, or does it matter?
- ANSWER: I think instances in AWS will need to contact both the schedd (submit host) and collectord (central manager) from the internet using port 9618. So either both submit host and central manager need external connections and IPs with port 9618 open or combine them into one host with an external IP and port 9618 open to the Internet.
Last time I configured condor_annex I was using an older version of condor (8.8.3 I think) and used a pool password for security. Now I am using 8.9.7. Is there a newer/better security method I should use?
- ANSWER: The annex still primarily uses pool password, so let's stick with that for now.
- How can I find out what hosts are available for given requirements (LongJob, memory, staging)
- condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
- Answer: yes this is correct but it doesn't show what other jobs are waiting on the same resources. Which is fine.
- It looks to me like most hosts at CHTC are set up to run LongJobs. The following shows a small list of about 20 hosts so I assume all others can run long jobs. Is this correct?
- condor_status -compact -constraint "CanRunLongJobs is Undefined"
- LongJob is for something like 72 hours. So it might be best to not set it unless we really need it like step23.
- Answer: yes this is correct but it doesn't show what other jobs are waiting on the same resources. Which is fine.
- Is port 9618 needed for flocking or just for condor_annex?
- ANSWER: Greg thinks yes 9618 is needed for both flocking and condor_annex.
- Are there bugs in the condor.log output of a DAG node? For example, I have a condor.log file that clearly shows the job taking about three hours to run yet at the bottom lists user time of 13 hours and system time of 1 hour. https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
And as for the cpu usage report, there could very well be a bug, but first, is your job multi-threaded or multi-process? If so, the cpu usage will be the aggregate across all cpu cores.
- Yes they are all parallel jobs to some extent so I accept your answer for that job. But I have another job that took 21 hours of wallclock time and yet the condor.log shows 55 minutes of user and 5:34 hours of system time. https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
- ANSWER: if you look, the user time is actually 6 days and 55 minutes. I missed the 6 in there.
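The missed "6" is easy to overlook because the usage lines put a day count in front of the hh:mm:ss value. The sketch below parses such a line into timedelta objects. The function name is hypothetical and the sample line in the note after the code is constructed for illustration from the numbers quoted above, not copied from the attached log.

```python
import re
from datetime import timedelta

# Usage line layout assumed here: "Usr <days> hh:mm:ss, Sys <days> hh:mm:ss"
USAGE = re.compile(
    r'Usr (\d+) (\d+):(\d+):(\d+), Sys (\d+) (\d+):(\d+):(\d+)'
)


def parse_usage(line):
    """Return (user, system) CPU time as timedeltas, or None if no match."""
    match = USAGE.search(line)
    if not match:
        return None
    d1, h1, m1, s1, d2, h2, m2, s2 = (int(x) for x in match.groups())
    user = timedelta(days=d1, hours=h1, minutes=m1, seconds=s1)
    system = timedelta(days=d2, hours=h2, minutes=m2, seconds=s2)
    return user, system
```

For an illustrative line such as "Usr 6 00:55:00, Sys 0 05:34:00  -  Run Remote Usage", the user time comes out as 6 days and 55 minutes, which is plausible for a parallel job with 21 hours of wallclock time.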
- Given a DAG where some steps are to always run at AOC and some are to always run in CHTC how can we dictate this. Right now local jobs flock to CHTC if local resources are full. Can we make local jobs idle instead of flock?
- ANSWER: Use PoolNames. I need to make a testpost PoolName.
- PoolName = "TESTPOST"
STARTD_ATTRS = $(STARTD_ATTRS) PoolName
- It seems that when using DAGs the recommended method is to define variables in the DAG script instead of submit scripts. This makes sense as it allows for only one file, the DAG script, that needs to be edited to make changes. But, is there a way to get variables into the DAG script from the command line or environment or an include_file or something?
- ANSWER: There is an INCLUDE syntax but there is no command-line or environment variable way to get vars into a DAG.
- We are starting to run 10s of jobs in CHTC requiring 40GB as part of a local DAG. Are there any options we can set to improve their execution chance. What memory footprint (32, 20, 16, 8GB) would significantly improve their chances.
- ANSWER: only use +LongJob if the job needs more than 72 hours, which is the default "walltime".
- How can we set AWS Tags with condor_annex? We'd like this to track jobs and set billing tags. Looks like there isn't really a way.
- SOLUTION: Greg wrote (Oct. 5, 2020) "we've just now added code to the condor_annex to support setting multiple aws tags on the annex command line". K. Scott expects it will take a while before that code makes it to released software.
- Launch Templates didn't work. I don't think condor_annex supports Launch Templates.
- Use aws-user-data options to condor_annex?
- I have tried all sorts of user-data and default-user-data-file options. On-demand apparently no longer works and I was never able to get something working with spot-fleet. I think all things user-data are non-functional.
- I tried setting a tag in the role defined in config.json (aws-ec2-spot-fleet-tagging-role) but that tag didn't translate to the instance.
- I tried adding a tag to the AMI when creating a new AMI (EC2 → Instances → Actions → Image → Create Image). Didn't work.
- What about selftagging? The instance figures out its instance id and runs aws.
- wget -qO- http://instance-data/latest/meta-data/instance-id
- returns nothing when logged in as nobody (condor_ssh_to_job)
- returns nothing when logged in as centos (ssh -i ~/.ssh/...)
- returns instanceid when logged in as root (ssh as centos then sudo su)
- Aha! There is a firewall (iptables) rule blocking exactly this. But I can't figure out what file sets this iptables rule on boot.
- I tried adding tags to the JSON file using ResourceType set to both instance and spot-fleet-request. Neither created an instance with my tag.
"TagSpecifications": [
    {
        "ResourceType": "instance",
        "Tags": [
            {
                "Key": "budget",
                "Value": "VLASS"
            }
        ]
    }
],
- What are the clever solutions to submitting N different DAG jobs with each having different parameters?
T10t34
J220200-003000
bin, working, data
J220600-003000
bin, working, data
...
T10t35
J170743-393000
bin, working, data
J171241-383000
bin, working, data
...
ANSWERS:
INCLUDE syntax for DAGs
include syntax for submit files
make a template of files
use a PRE script that populates things
-usedagdir
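The "make a template of files" answer could be sketched as a small generator script. This is only a minimal sketch, assuming one DAG node per tile/field pair; the submit file name field.htc and the macro names tile and field are hypothetical and not part of our actual setup.

```python
def build_dag(tiles, submit_file="field.htc"):
    """Return DAG text with one JOB/VARS pair per (tile, field).

    submit_file and the macro names tile/field are hypothetical.
    """
    lines = []
    for tile, fields in tiles.items():
        for field in fields:
            node = f"{tile}_{field}"
            lines.append(f"JOB {node} {submit_file}")
            lines.append(f'VARS {node} tile="{tile}" field="{field}"')
    return "\n".join(lines) + "\n"


# Tile and field names taken from the directory layout above.
tiles = {
    "T10t34": ["J220200-003000", "J220600-003000"],
    "T10t35": ["J170743-393000", "J171241-383000"],
}
dag_text = build_dag(tiles)
print(dag_text)
```

The shared submit file could then reference $(tile) and $(field) in places such as arguments, so only the generated DAG differs between runs.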
- I had a job killed because it exceeded 72 hours even though I set +LongJobs = true in the submit file
- 2385.0 krowe 9/22 20:43 Error from slot1_1@e2008.chtc.wisc.edu: Job failed to complete in 72 hrs
- ANSWER: the knob is singular: +LongJob = true
- What are the options to setting up HTCondor to both flock to CHTC and annex to AWS? Multiple submit hosts? Multiple CMs? etc.
- ANSWER: Philosophy is for everyone to submit in one place and let condor sort out where it goes.
- CHTC flocks annex jobs to a different CM that actually starts the annex.
- Submit annex job on SM. It then flocks to a different CM that can create the annex
- As we feared, referencing the cache of convolution functions (cfcache) directly from staging performed poorly. This is due to a fstat() pathology that fares poorly on distributed filesystems. Jobs ran 3 to 4 times faster when we copied cfcache from /staging to local disk. I ran a small data set test with full parameters at CHTC that copied cfcache from /staging to local disk and step05 took only 16.7 hours instead of the 56.8 hours it had taken using cfcache on /staging.
- Condor_annex bug: Edit /usr/libexec/condor/condor-annex-ec2 and comment out the line chkconfig condor || exit 1 because this line is a hold-over from older versions that put condor in init.d. Now that it is in systemd, this line causes condor to exit.
- SOLUTION: Greg submitted ticket on this.
- Debugging held jobs. I had thought that setting when_to_transfer_output = ON_EXIT_OR_EVICT would copy the scratch area back to the submit machine so files there could be inspected. But that doesn't seem to happen for me.
- https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html#specifying-if-and-when-to-transfer-files
- Evict means your job was killed because of policy like priority or time but not memory or disk.
- Hold is often because of an error like missing file transfer or out of memory
- SOLUTION: there isn't a good solution to hold.
- Memory issue: Greg did find a bug deep in the code that may cause jobs to be killed because of memory issues. HTCondor occasionally gets a short read when looking at the process table via /proc, and then something like 2/3 of the processes are missing.
- SOLUTION: CHTC will work towards a solution.
- How can we have the .dag.* files written to a different directory? -usedagdir doesn't help.
- ANSWER: There isn't a way to tell condor_submit_dag where to put the logs
- Is there a for-loop structure available to DAG scripts or a range mechanic?
- No
- If 8.9.9 requires Globus from EPEL then it may have trouble being installed on a Globus endpoint because the EPEL version of Globus conflicts with the Globus.org version.
- I told them about it. I have not tried installing HTCondor-8.9.9 so I am only guessing it will be a problem.
- Is there a recommended way to start annexes from a DAG? We have been using PRE scripts but sometimes it seems to fail.
- SOLUTIONS:
- CHTC is working on a BEGIN syntax (provision) that will block a DAG node from starting until the annex is ready.
- We could have the script not return until the annex is ready.
- We could also have the job require a specific name that the create_annex creates.
- How can I set a variable in a DAG file that I can then use in the submit file in a conditional? None of the following seem to work
- DAG:
VARS step01 CHTC=""
- VARS step05 CHTC="True"
- Submit:
- if defined $(CHTC)
- requirements = PoolName == "CHTC"
- endif
- or
- DAG:
- #VARS step01 CHTC="True"
- VARS step05 CHTC="True"
- Submit:
- if defined $(CHTC)
- requirements = PoolName == "CHTC"
- endif
- or
- DAG:
- VARS step01 CHTC="False"
- VARS step05 CHTC="True"
- Submit:
- chtc_var = $(CHTC)
- if $(chtc_var)
- requirements = PoolName == "CHTC"
- endif
- even though when I pass $(chtc_var) as arguments to the shell script, the shell script sees it as True.
- or
- DAG:
VARS node1 file="chtc.htc"
- VARS node2 file="aws.htc"
- Submit:
- include : $(file)
- DAG:
10/20/20 08:54:36 From submit: ERROR: on Line 9 of submit file:
10/20/20 08:54:36 From submit: Submit:-1:Error "", Line 0, Include Depth 1: can't open file
10/20/20 08:54:36 From submit:
10/20/20 08:54:36 From submit: ERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 failed while reading from pipe.
10/20/20 08:54:36 Read so far: Submitting job(s)ERROR: on Line 9 of submit file: Submit:-1:Error "", Line 0, Include Depth 1: can't open fileERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 ERROR: submit attempt failed
- Yet I can use a variable defined in a DAG for things like arguments and request_memory.
- I can also use file = $CHOICE(myindex, chtc.htc, aws.htc), where myindex is defined in the DAG, to set $(file) to the file I want to include. But again, if I use include : $(file) I get an error
10/20/20 11:58:58 From submit: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 failed while reading from pipe.
10/20/20 11:58:58 Read so far: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 ERROR: submit attempt failed
- Perhaps use requirements. Greg will send an example
- SOLUTION:
- DAG:
- JOB step05 step05.htc
- #VARS step05 SITE="chtc"
- #VARS step05 SITE="aws"
- Submit:
- +NRAOAttr = "$(SITE)"
- Requirements = My.NRAOAttr == "chtc" ? PoolName == "CHTC" : PoolName =!= "CHTC"
Requirements = My.NRAOAttr == "chtc" ? (Target.HasCHTCStaging == true) : (Target.HasCHTCStaging =!= true)
- myannex = "krowe-annex"
- +MayUseAWS = True
Requirements = My.NRAOAttr == "aws" ? AnnexName == $(myannex) : AnnexName =!= $(myannex)
- I would set myannex in the DAG but when I do that it tries to find an AnnexName of "krowe - annex" (note spaces)
- ANSWER: My conclusion is that there are limitations on what one can do with variables in the submit file that were defined in the DAG file.
- Is there a config option that will cause condor to not start? We have diskless nodes and it is easier to modify the config file than change systemd.
- SOLUTION: Either set START_MASTER = False or START_DAEMONS = False depending on desired outcome.
- Torque has this command called pbsnodes that can not only offline/drain a node but keeps a note about it that all can see in one place. I know I can use condor_off to drain a node but is there a central place to keep notes so I can remember a month later why I set a certain node to drain?
- ANSWER: there is no place to keep such notes but Greg likes the idea and may look into it.
- May want to use condor_drain instead of condor_off. condor_off will kill the startd when all jobs finish and it no longer shows up in condor_status. condor_drain will leave the node in condor_status.
- condor_drain doesn't work for me because it immediately sets jobs idle instead of letting them run to completion. This is why I use condor_off -startd -peaceful instead.
- How can you tell which job is associated with an email given the email message doesn't include a working dir or the assigned batch_name?
- CHTC will look into adding such information to the email condor sends.
- Bug in condor_annex: Underscores in the AnnexName prevent the annex from moving into the pool.
- Also when I try to terminate an annex with underscores (e.g. krowe_annex_casa5) with the command condor_off -annex krowe_annex_casa5 I get the following error
- Found no ClassAds when querying pool (local)
- Can't find addresses for master's for constraint 'AnnexName =?= "krowe_annex_casa5"'
Perhaps you need to query another pool.
- Greg has noted this bug
- Bug in condor_annex: The following will wait for an annex named krowe - annex - casa5 (note the spaces). If I pass $(myannex) as an argument to a shell script, the spaces are not there.
- include.htc
- myannex = krowe-annex-casa5
- submit.htc
- include : include.htc
- executable = /bin/sleep
- arguments = 127
- +MayUseAWS = True
- requirements = AnnexName == $(myannex)
- queue
- Actually, I think this isn't a bug but a limitation on using macros. The AnnexName needs to be quoted but how can I quote a macro? Note, I have the same problems with AnnexNames that don't have hyphens (E.g. krowetest).
- No: requirements = AnnexName == "$(myannex)"
- No: myannex = "krowe-annex-casa5"
- No: myannex = \"krowe-annex-casa5\"
- No: myannex = "\"krowe-annex-casa5\""
- Idea: +annex = "krowe-annex-casa5"
- requirements = AnnexName == my.annex
- Greg has noted this bug
Nodesfree
How can one see nodes that are entirely unclaimed?
SOLUTION: condor_status -const 'PartitionableSlot && Cpus == TotalCpus'
HERA queue
I want a proper subset of machines to be for the HERA project. These machines will only run HERA jobs and HERA jobs will only run on these machines. This seems to work but is there a better way?
machine config | submit file |
---|---|
HERA = True | requirements = (HERA == True) |
STARTD_ATTRS = $(STARTD_ATTRS) HERA | +partition = "HERA" |
START = ($(START)) && (TARGET.partition =?= "HERA") | |
SOLUTION: yes, this is good. Submit Transforms could also be set on herapost-master (Submit Host)
https://htcondor.readthedocs.io/en/latest/misc-concepts/transforms.html?highlight=submit%20transform
Reservations
What if you know certain nodes will be unavailable for a window of time say the second week of next month. Is there a way to schedule that in advance in HTCondor? For example in Slurm
scontrol create reservation starttime=2021-02-8T08:00:00 duration=7-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
ANSWER: HTCondor doesn't have a feature like this.
Bug: All on one core
- Bug where James's jobs are all put on the same core. Here is top -u krowe showing the Last Used Cpu (SMP) after I submitted five sleep jobs to the same host.
- Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
- Is this a failure of the Linux kernel to schedule things on separate cores?
- Is this because cpu.shares is set to 100 instead of 1024?
- Check if CPU affinity is set in /proc/self/status
- Is sleep cpu-intensive enough to properly test this? Perhaps submit a while 1 loop instead?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22
From jrobnett@nrao.edu Tue Nov 10 16:38:18 2020
As (bad) luck would have it I had some jobs running where I forgot to set the #cores to do so they triggered the behavior.
Sshing into the node I see three processes sharing the same core and the following for the 3 python processes:
bash-4.2$ cat /proc/113531/status | grep Cpus
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list: 0
If I look at another node with 3 processes where they aren't sharing the same core I see:
bash-4.2$ cat /proc/248668/status | grep Cpu
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00555555
Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22
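To double-check how the Cpus_allowed mask maps to Cpus_allowed_list, here is a small decoder. It is a sketch that assumes the mask is printed as comma-separated 32-bit hex words with the most significant word first, as in the output above; the masks in the example are shortened to their last two words.

```python
def cpus_from_mask(mask):
    """Decode a Cpus_allowed hex mask into a sorted list of CPU ids."""
    # Drop the commas so the words form one big hexadecimal number.
    value = int(mask.replace(",", ""), 16)
    # Bit N set means CPU N is allowed.
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


pinned = cpus_from_mask("00000000,00000001")
spread = cpus_from_mask("00000000,00555555")
print(pinned)
print(spread)
```

The first mask allows only CPU 0, which matches the three processes sharing one core; the second allows every even-numbered CPU from 0 to 22.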
Dec. 8, 2020 krowe: I launched five sqrt(rand()) jobs and each one landed on its own CPU.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0
and the masks aren't restricting them to specific cpus. So I am yet unable to reproduce James's problem.
st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447
We can reproduce this without HTCondor. So this is either being caused by our mpicasa program or the openmpi libraries it uses. Even better, I can reproduce this with a simple shell script executed from two shells at the same time on the same host. Another MPI implementation (mvapich2) didn't show this problem.
#!/bin/sh
export PATH=/usr/lib64/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:${LD_LIBRARY_PATH}
mpirun -np 2 /users/krowe/work/doc/all/openmpi/busy/busy
Array Jobs
Does HTCondor support array jobs like Slurm? For example in Slurm #SBATCH --array=0-3%2 or is one supposed to use queue options and DAGMan throttling?
ANSWER: HTCondor does reduce the priority of a user the more jobs they run so there may be less need of a maxjob or modulus option. But here are some other things to look into.
queue from seq 10 5 30 |
queue item in 1, 2, 3
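As a quick preview of what the first queue form iterates over: seq 10 5 30 prints the values from 10 to 30 in steps of 5, and my understanding is that the queue statement then submits one job per printed value. The helper below is only a Python stand-in for seq with positive integer steps, not anything HTCondor runs.

```python
def seq(first, increment, last):
    """Mimic `seq FIRST INCREMENT LAST` for positive integer steps."""
    return list(range(first, last + 1, increment))


items = seq(10, 5, 30)
print(items)  # [10, 15, 20, 25, 30]
```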
combined cluster (Slurm and HTCondor)
Slurm starts and stops condor. CHTC does this because their HTCondor can preempt jobs. So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.
Node Priority
Is there a way to set an order to which nodes are picked first or a weight system? We want certain nodes to be chosen first because they are faster, or have less memory or other such criteria.
NEGOTIATOR_PRE_JOB_RANK on the negotiator
HPC Cluster
Could I have access to the HPC cluster? To learn Slurm.
ANSWER: https://chtc.cs.wisc.edu/hpc-overview I need to login to submit2 first but that's fine.
How does CHTC keep shared directories (/tmp, /var/tmp, /dev/shm) clean with Slurm?
ANSWER: CHTC doesn't do any cleaning of shared directories. But they suggested looking at https://derekweitzel.com/2016/03/22/fedora-copr-slurm-per-job-tmp/ I don't know if this plugin will clean files created by an interactive ssh, but I suspect it won't because it is a Slurm plugin and ssh'ing to the host is outside of the control of Slurm except for the pam_slurm_adopt that adds you to the cgroup. So I may still need a reaper script to keep these directories clean.
vmem exceeded in Torque
We have seen a problem in Torque recently that reminds us of the memory fix you recently implemented in HTCondor. Was that fix related to any recent changes in the Linux kernel or was it a pure HTCondor bug? What was it that you guys did to fix it?
ANSWER: There are two problems here. The first is the short read, whose root cause we are still trying to understand. We've worked around the problem in the short term by re-polling when the number of processes we see drops by 10% or more. The other problem is that when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat. Memory is categorized into a number of different types in this file, and we were omitting some types of memory when summing up the total.
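To illustrate the second problem, here is a sketch of summing fields from a cgroup v1 memory.stat file. Which fields HTCondor actually sums is not stated in the answer, so the two field lists below are assumptions chosen only to show how leaving out a category lowers the total. The sample numbers are made up.

```python
def parse_memory_stat(text):
    """Parse `name value` lines from memory.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def sum_fields(stats, fields):
    """Add up the selected categories; missing ones count as zero."""
    return sum(stats.get(name, 0) for name in fields)


# Made-up sample in the memory.stat layout (values are bytes).
sample = """cache 1000
rss 5000
mapped_file 200
swap 0
"""
stats = parse_memory_stat(sample)
partial = sum_fields(stats, ["rss"])           # hypothetical: one category omitted
fuller = sum_fields(stats, ["rss", "cache"])   # hypothetical: category included
print(partial, fuller)
```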
cpuset issues
ANSWER: git bisect could be useful. Maybe we could ask Ville.
Distant execute nodes
Are there any problems having compute nodes at a distant site?
ANSWER: no intrinsic issues. Be sure to set requirements.
Memory bug fix?
What version of condor has this fix?
ANSWER: 8.9.9
When is it planned for 8.8 or 9.x inclusion?
ANSWER: 9.0 in Apr. 2021
Globus
You mentioned that the globus RPMs are going away. Yes?
ANSWER: They expect to drop globus support in 9.1 around May 2021.
VNC
Do you have any experience using VNC with HTCondor?
ANSWER: no they don't have experience like this. But mount_under_scratch= will use the real /tmp
Which hosts do the flocking?
Lustre is going to be a problem. Our new virtual CMs can't see lustre. Can just a submit host see lustre and not the CM in order to flock?
ANSWER: Only submit machines need to be configured to flock. It goes from a local submit host to a remote CM. So we could keep gibson as a flocking submit host. This means the new CMs don't need the firewall rules.
Transfer Mechanism Plugin
- Our environment has a complex network topology. We have a prototype rsync plugin but may want to specify a specific network interface for a host as a function of where the execute host resides.
- Do file transfer plugins have access to the JobAd, either internally or via an external command condor_q -l? For example can they tell what PoolName a job requested?
- Can we make use of logic during the match making where 'if execute host is in set of X, then set some variable to Y' and then the plugin inspects some variable to determine where it is topologically and therefore which interface to use?
- ANSWER: look at .job.ad or .machine.ad in the scratch area. Could set some attributes in the config file for the nodes.
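A plugin could read those ad files with something like the sketch below. It is a naive parser that only handles simple "Attr = value" lines, not full ClassAd expressions. The NetworkZone attribute and the interface names ib0 and eth0 are hypothetical; they stand in for whatever attribute we would set in the node config. The demo writes a stand-in .machine.ad to a temporary directory instead of a real scratch area.

```python
import os
import tempfile


def read_ad(path):
    """Parse simple `Attr = value` lines into a dict (naive, not full ClassAd)."""
    ad = {}
    with open(path) as handle:
        for line in handle:
            name, sep, value = line.partition("=")
            if not sep:
                continue
            value = value.strip()
            # Drop the surrounding quotes of a plain string literal.
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                value = value[1:-1]
            ad[name.strip()] = value
    return ad


def choose_interface(machine_ad, interfaces, default):
    """Map the hypothetical NetworkZone attribute to an interface name."""
    return interfaces.get(machine_ad.get("NetworkZone"), default)


# Demo with a stand-in for the .machine.ad file in the scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    ad_path = os.path.join(scratch, ".machine.ad")
    with open(ad_path, "w") as handle:
        handle.write('Machine = "testpost001.aoc.nrao.edu"\n')
        handle.write('NetworkZone = "aoc"\n')
    machine_ad = read_ad(ad_path)

interface = choose_interface(machine_ad, {"aoc": "ib0", "chtc": "eth0"}, "eth0")
print(interface)
```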
Containers
- Is HTC basically committed to distributing container implementations with each new release?
- ANSWER: CHTC is planning to release containers with each HTCondor release.
- Is this migrating toward a recommended implementation method for things like the submit hosts and possibly even execute hosts where the transactions could be lightweight?
- ANSWER: The jobs are tied to a submit host. If that submit host goes away the job may be orphaned.
Remote
condor_submit -remote what does it do? The manpage makes me think it submits your job using a different submit host but when I run it I get lots of authentication errors. Can it not use host-based authentication (e.g. ALLOW_WRITE = *.aoc.nrao.edu)?
Here is an example of me running condor_submit on one of our Submit Hosts (testpost-master) trying to remote to our Central Manager (testpost-cm) which is also a submit host.
condor_submit -remote testpost-cm tiny.htc
Submitting job(s)
ERROR: Failed to connect to queue manager testpost-cm-vml.aoc.nrao.edu
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There
is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
ANSWER:
condor_submit -remote does indeed tell the condor_submit tool to submit to a remote schedd (it also implies -spool).
Because the schedd can run the job as the submitting owner, and runs the shadow as the submitting owner, the remote schedd needs to not just authorize the remote user to submit jobs, but must authenticate the remote user as some allowed user.
Condor's IP host-based authentication is really just authorization: it can say "everyone coming from this IP address is allowed to do X, but I don't know who that entity is".
So, for remote submit to work, we need some kind of authentication method as well, such as IDTOKENS or munge.
Authentication
- We're currently using host-based authentication. Is there a 'future proof' recommended authentication system for HTCondor-9.x for a site planning to use both an on-premises cluster and CHTC flocking and/or glide-ins to other facilities? host_based? password? Tokens? SSL? Munge? Munge might be my preferred method as Slurm already requires it.
- If we're using containers for submit hosts is there a preferred authentication scheme (host-based doesn't scale well)?
- ANSWER: idtokens
HTCondor+Slurm
- Do people do HTCondor glide-ins to Slurm where the HTCondor jobs are not preempted, as a way to share resources with both schedulers?
- ANSWER: You can glide in to Slurm.
- You can have Slurm preempt HTCondor jobs in favor of its own jobs (HTCondor jobs presumably will be resubmitted)
- You can have HTCondor preempt Slurm jobs in the same sort of way.
Transfer Plugin Order
HTCondor guarantees that the condor file transfer happens before the plugin transfer, but only when using the "multi-file" plugin style,
like we have in our curl plugin. If you used the curl plugin as the model for rsync, you should be good.
AMQP
The AMQP gateway that we had developed was called Qpid, and worked by tailing the user job log and turning it into qpid events. I suspect
there's also ways to have condor plugins directly send amqp events as well.
CPU Shares
Torque uses cpusets, which is pretty straightforward, but HTCondor uses cpu.shares, which confuses me a bit. For example, a job with request_cpus = 8 executing on a 24-core machine gets cpu.shares = 800. If there are no other jobs on the node, does this job essentially get more CPU time than 1024/800?
ANSWER: yes, it is opportunistic. If there are no other jobs running on a node you essentially get the whole node.
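A rough way to think about cpu.shares: it is a relative weight that only matters when CPUs are contended. The sketch below assumes every competing cgroup is fully busy and ignores any other cgroups on the node (system services, for example). The 1600-share job is a hypothetical 16-core job following the same 100-shares-per-core pattern as the example above.

```python
def cpu_fraction(job_shares, all_shares):
    """Fraction of contended CPU time a cgroup gets under cpu.shares."""
    return job_shares / sum(all_shares)


# Alone on the node: nothing to compete with, so the job may use everything.
alone = cpu_fraction(800, [800])

# Competing with a hypothetical 16-core job (1600 shares), both fully busy.
shared = cpu_fraction(800, [800, 1600])

print(alone, shared)
```

With both jobs busy, the 8-core job gets one third of the machine, which on 24 cores works out to the 8 cores it asked for.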
Nodescheduler
We found a way to implement our nodescheduler script in Slurm using the --exclude option. Is there a way to exclude certain hosts from a job? Or perhaps a constraint that prevents a job from running on a node that is already running a job of that user? Is there a better way than this?
requirements = Machine != "nmpost097.aoc.nrao.edu" && Machine != "nmpost119.aoc.nrao.edu"
badmachines=one+two+three
requirements not in $(badmachines)
I didn't get the actual syntax from Greg and I am apparently not able to look it up. The long syntax I suggested should work; I just don't know what Greg's more efficient syntax is.
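Until the shorter syntax turns up, the long form can at least be generated from a host list. This is a minimal sketch that only builds the requirements line shown above; it is not Greg's more efficient syntax.

```python
def exclude_machines(hosts):
    """Build a ClassAd expression that rejects each listed machine."""
    return " && ".join(f'Machine != "{host}"' for host in hosts)


bad_machines = ["nmpost097.aoc.nrao.edu", "nmpost119.aoc.nrao.edu"]
line = "requirements = " + exclude_machines(bad_machines)
print(line)
```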
condor_ssh_to_job
Is there a way to use condor_ssh_to_job to connect to a job submitted from a different submit host (schedd) or do you have to run it from the submit host used to submit the job? I have tried using the -name option to condor_ssh_to_job but I always get Failed to send GET_JOB_CONNECT_INFO to schedd
ANSWER: idtokens. Host-based and poolpassword are not sufficient to identify users and allow for this (and probably condor_submit -remote).
HTCondor Workshop vs Condor Week
ANSWER: Essentially it is "Condor Week Europe". Mostly the same talks but different customer presentations. Could be interesting for the different customer presentations.
Shutdown
STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600
MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0
But I was running a job when it shut down.
07/19/21 11:45:01 The DaemonShutdown expression "State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600" evaluated to TRUE: starting graceful shutdown
Could this be because we use dynamic slots?
testpost-cm-vml krowe >condor_status
Name OpSys Arch State Activity LoadAv Mem
slot1@testpost001.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 193
slot1_1@testpost001.aoc.nrao.edu LINUX X86_64 Claimed Busy 0.000
slot1@testpost002.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 144
slot1_1@testpost002.aoc.nrao.edu LINUX X86_64 Claimed Busy 0.810 49
slot1@testpost003.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 193
I see that with dynamic slots, the parent slot (slot1) seems always unclaimed and idle and the child slots (slot1_1) are Claimed and Busy. So I tried checking the ChildState attribute which looks to be a list but doesn't behave like one. For example, none of these show any slots
condor_status -const 'ChildState == { "Claimed" }'
condor_status -const 'sum(ChildState) == 0'
Even though this produces true
classad_eval 'a = { }' 'sum(a) == 0'
ANSWER: Try this
condor_status -const 'size(ChildState) == 0'
HTCondor and Slurm
NRAO has effectively two use cases: 1) Operations triggered jobs. These are well formulated pipeline jobs, they're still fairly monolithic and long running (many hours to few days). 2) User triggered jobs, these are of course not well formulated. We will be moving the operations jobs to htcondor. We plan to move the user triggered jobs to SLURM from Torque. There's enough noise in the two job loads that we don't want to have strict host carve outs for type 1 and type 2 jobs. What we anticipate doing is having a set of nodes known only to htcondor for the bulk of operations and a set of hosts controlled by SLURM for the user facing jobs. Periodically when they have a large set of operations jobs we'd like for them to burst into the SLURM controlled nodes. We neither anticipate nor want the slurm jobs to burst into the htcondor set of nodes.
Say we have two clusters (HTCondor and Slurm) and both can be submitted to from the same host. We want the HTCondor jobs to use the Slurm cluster resources when the HTCondor cluster resources are full, but we probably don't want to support preemption. How could we have HTCondor submit jobs to a Slurm cluster? (HTCondor-C, flocking, overlapping, batch-grid-type, HTCondor-CE, etc)
ANSWER: write our own 'factory' that watches HTCondor and, when it is full, submits Pilot jobs to Slurm that launch startd daemons, thus allowing the Payload jobs waiting in HTCondor to run. Will want to set the startd to exit after being idle for a little while, run the Pilot job as root, and figure out how to do cgroups properly.
Shadow jobs and Lustre
We had some jobs get restarted because they lost contact with their shadow jobs. I assume this is because the shadow jobs keep the condor.log file open and if that file is on Lustre and Lustre goes down then the shadow job fails to communicate with the job and the job gets killed. Does that seem accurate to you?
nmpost-master root >ps auxww|grep shadow|grep krowe
krowe 1631810 0.0 0.0 38708 3676 ? S 09:29 0:00 condor_shadow -f 486.0 --schedd=<10.64.10.100:9618?addrs=10.64.10.100-9618&noUDP&sock=5837_96cc_3> --xfer-queue=limit=upload,download;addr=<10.64.10.100:14115> <10.64.10.100:14115> -
nmpost-master root >ls -la /proc/1631810/fd
total 0
dr-x------ 2 root root 0 Jul 27 09:29 ./
dr-xr-xr-x 8 krowe nmstaff 0 Jul 27 09:29 ../
lr-x------ 1 root root 64 Jul 27 09:29 0 -> pipe:[16358528]
lr-x------ 1 root root 64 Jul 27 09:29 1 -> pipe:[16358540]
lrwx------ 1 root root 64 Jul 27 09:29 18 -> socket:[16358529]
l-wx------ 1 root root 64 Jul 27 09:29 2 -> pipe:[16358540]
l-wx------ 1 root root 64 Jul 27 09:29 3 -> /lustre/aoc/sciops/krowe/condor.486.log
lrwx------ 1 root root 64 Jul 27 09:29 4 -> socket:[16358542]
Here are some logs of a failed job
07/26/21 14:38:38 (479.0) (1188418): Job 479.0 is being evicted from slot1_1@nmpost114.aoc.nrao.edu
07/26/21 14:38:38 (479.0) (1188418): logEvictEvent with unknown reason (108), not logging.
07/26/21 14:38:38 (479.0) (1188418): **** condor_shadow (condor_SHADOW) pid 1188418 EXITING WITH STATUS 108
Exit Code 108 = can not connect to the condor_startd or request refused
2021-07-26 14:16:39 (pid:91673) Lost connection to shadow, waiting 2400 secs for reconnect
ANSWER: Greg thinks this is an accurate description of the problem. Greg thinks this 2400 second timeout may be adjustable, but do we want to adjust it? How long is long enough? Two choices: 1) decide we don't care, or 2) write log files to something other than Lustre.
Rebooting Submit Host
What happens to running jobs if the submit host reboots? Shadow processes? What if the submithost is replaced with a new server? I think we have shown there is a 2400 second (40 minute) timeout.
ANSWER: state files are in $(condor_config_val SPOOL) and you only have 40 minutes by default and that timeout is set at job submission time.
Chirp in upload_file
While I seem to be able to use chirp in the download_file() function of a plugin, I cannot seem to use it in the upload_file() portion. Something like the following will produce a line in the condor log file but not when executed from the upload_file() function. This I have tested at CHTC.
import subprocess
message = 'in upload_file()'
subprocess.call(['/usr/libexec/condor/condor_chirp', 'ulog', message])
I have an example in /home/nu_kscott/htcondor/plugin_small
The plugin hangs during the output and the processes running on the execution host look like this
krowe 36107 0.0 0.0 58420 8128 ? Ss 13:44 0:00 condor_starter -f -local-name slot_type_1 -a slot1_1 testpost-master.aoc.nrao.edu
krowe 36571 0.8 0.0 182728 15004 ? S 13:46 0:00 /usr/bin/python3 /usr/libexec/condor/nraorsync_plugin.py -infile /lustre/aoc/admin/tmp/condor/testpost002/execute/dir_36107/.nraorsync_plugin.py.in -outfile /lustre/aoc/admin/tmp/condor/testpost002/execute/dir_36107/.nraorsync_plugin.py.out -upload
krowe 36572 0.0 0.0 17084 1288 ? S 13:46 0:00 /usr/libexec/condor/condor_chirp ulog in upload_file()
If I kill the condor_chirp process (36572), the plugin moves on to the next file to upload at which point it runs condor_chirp again and hangs again. If I keep killing the condor_chirp processes, eventually the job finishes properly.
ANSWER: Greg looked into this and said there is no good workaround. "This is simply a deadlock between chirp and the file transfer plugin. When transfering the output sandbox back to the submit machine, the HTCondor starter runs the file transfer code synchronously wrt the starter (it forks to do this while transfering the input sandbox...), and the starter also handles chirp calls."
Timeout
What is the timeout setting called and can we increase it? Is it JobLeaseDuration? Can it be altered on a running job?
ANSWER: yes it is JobLeaseDuration and it can be changed in the execution host
condor_gpu_discovery
I can't find the condor_gpu_discovery on my cluster (HTCondor-9.0.4) or CHTC (9.1.4) even on a GPU host.
ANSWER: /usr/libexec/condor/condor_gpu_discovery
idtokens with RPMs
It seems that installing HTCondor-9.0.4 via RPMs doesn't automatically create signing key in /etc/condor/passwords.d/POOL
like the documentation reads https://htcondor.readthedocs.io/en/latest/admin-manual/security.html?highlight=idtokens#quick-configuration-of-security
Also with the RPM install, ALLOW_WRITE = * which seems insecure. Does this even matter when using security:recommended_v9_0?
ANSWER: this can probably just be ignored. Greg didn't think fresh installs actually created signing keys so this may be an error in documentation.
idtokens
We are using HTCondor-9.0.4 and switched from using host_based security to idtoken security with the following procedure.
On just the Central Manager named testpost-cm (which is the collector and schedd)
openssl rand -base64 32 | condor_store_cred add -c -f /etc/condor/passwords.d/POOL
condor_token_create -identity condor@testpost-cm.aoc.nrao.edu > /etc/condor/tokens.d/condor@testpost-cm.aoc.nrao.edu
echo 'SEC_TOKEN_POOL_SIGNING_KEY_FILE = /etc/condor/passwords.d/POOL' >> /etc/condor/config.d/99-nrao
then switch to use security:recommended_v9_0 in 00-htcondor-9.0.config
On the worker nodes (startd's)
scp testpost-cm:/etc/condor/passwords.d/POOL /etc/condor/passwords.d
scp testpost-cm:/etc/condor/tokens.d/condor\@testpost-cm.aoc.nrao.edu /etc/condor/tokens.d
echo 'SEC_TOKEN_POOL_SIGNING_KEY_FILE = /etc/condor/passwords.d/POOL' >> /etc/condor/config.d/99-nrao
then switch to use security:recommended_v9_0 in 00-htcondor-9.0.config
But then things like condor_off don't work
testpost-cm-vml root >condor_off -name testpost002
ERROR
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using SSL
AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
AUTHENTICATE:1004:Failed to authenticate using FS
Can't send Kill-All-Daemons command to master testpost002.aoc.nrao.edu
ANSWER: The CONDOR_HOST on the startd was not fully qualified. Also, both the startd and the collector/schedd were using the cname (testpost-cm) instead of the hostname (testpost-cm-vml). I changed them both to the following and now I can use both condor_off and condor_drain without error.
CONDOR_HOST = testpost-cm-vml.aoc.nrao.edu
Docs wrong for evaluating ClassAds?
This web page https://htcondor.readthedocs.io/en/latest/man-pages/classads.html?highlight=evaluate#testing-classad-expressions suggests that the following will produce false, but for me it produces an error
condor_status -limit 1 -af 'regexp( "*tr*", "string" )'
ANSWER: The first asterisk shouldn't be there. This is a regex not globbing. Greg will look into updating this document.
Oct. 11, 2021 krowe: The documentation looks to have been corrected.
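The glob-versus-regex distinction in the answer can be reproduced outside HTCondor. The sketch below uses Python's fnmatch and re modules rather than the ClassAd regexp() function, so it only illustrates the general idea; HTCondor's own regex engine may report the problem differently. The helper name is_valid_regex is made up for this example.

```python
import fnmatch
import re


def is_valid_regex(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# As a glob, a leading asterisk means "any characters", so this matches.
glob_match = fnmatch.fnmatch("string", "*tr*")

# As a regex, an asterisk repeats the preceding item. At the start of
# the pattern there is nothing to repeat, so compilation fails.
leading_star_ok = is_valid_regex("*tr*")

# Without the leading asterisk the pattern is valid: "t" then zero or more "r".
fixed_match = re.search("tr*", "string") is not None

print(glob_match, leading_star_ok, fixed_match)
```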
Memory usage report
The memory usage report at the end of the condor log seems incorrect. I can watch the memory.max_usage_in_bytes in the cgroup get over 8,400MB yet the report in the condor log reads 6,464MB. Does the log only report the memory usage of the parent process and not include all the children? Is it an average memory usage over time?
ANSWER: It is a report of a sum of certain fields in memory.stat in the cgroup. Get Greg an example. Try it on two machines in case this is a problem of re-using the same cgroup. Or reboot and try again.
Oct. 11, 2021 krowe: With HTCondor-9.0.6, it looks like my tests are now reporting consistent values between memory.max_usage_in_bytes in the cgroup and Memory in the condor log. Except the memory.max_usage_in_bytes is in base-10 while the condor log is in base-2.
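The base-10 versus base-2 difference alone accounts for roughly a 5% gap between two readings of the same byte count. A minimal sketch of the conversion, using a made-up sample value and hypothetical helper names:

```python
def bytes_to_mb(n_bytes):
    """Decimal megabytes: 1 MB = 10**6 bytes."""
    return n_bytes / 10**6


def bytes_to_mib(n_bytes):
    """Binary mebibytes: 1 MiB = 2**20 bytes."""
    return n_bytes / 2**20


# Hypothetical byte count, not taken from a real job.
sample = 8_400_000_000
print(round(bytes_to_mb(sample)))   # decimal MB
print(round(bytes_to_mib(sample)))  # binary MiB
```

So when comparing the cgroup value with the condor log, convert both to the same unit first.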
Tracking jobs through various log files
What is the preferred method of tracking jobs through various log files like condor.log, StarterLog.slot1_2, etc?
The condor.log uses a jobid but the StarterLogs use pid
ANSWER: Go from condor.log to the StartLog on the execute host, then to StarterLog.slot* on the execute host, and search for "Job <jobid>"
ANSWER: condor_history <jobid> -af LastRemoteHost will give the slot id
Flocking with idtokens
Does the following seem correct?
I setup rastan-vml as a standalone Central Manager, Schedd, and Startd (I'm starting to talk like an HTCondor admin now). This is what I had in the config on rastan-vml
UID_DOMAIN = aoc.nrao.edu
JOB_DEFAULT_NOTIFICATION = ERROR
CONDOR_ADMIN = krowe@nrao.edu
CONDOR_HOST = rastan-vml.aoc.nrao.edu
PoolName = "rastan"
FLOCK_TO = testpost-cm-vml.aoc.nrao.edu
I then created a token for me in ~/.condor/tokens.d but this did not allow jobs to flock from rastan-vml to testpost-cm.
I then copied the token from testpost-cm:/etc/condor/tokens.d to rastan-vml:/etc/condor/tokens.d and that was enough to get the job flocking.
ANSWER: Yes.
gdrive example
I tried to use the gdrive plugin but couldn't find any documentation and failed to figure it out on my own.
ANSWER: ask coatsworth
I swear this wasn't in the docs last week.
But CHTC doesn't have a Google credential, so I can't use the gdrive plugin at CHTC.
Submitting job(s)
OAuth error: Failed to securely read client secret for service gdrive; Tell your admin that gdrive_CLIENT_SECRET_FILE is not configured
condor_watch_q
nmpost-master krowe >condor_watch_q
ERROR: Unhandled error: [Errno 2] No such file or directory: '/proc/sys/user/max_inotify_instances'. Re-run with -debug for a full stack trace.
ANSWER: It is in beta. Send email to htcondor-admin about it.
https://htcondor.readthedocs.io/en/latest/overview/support-downloads-bug-reports.html
Lauren suggests I ask for a ticket account.
Nevermind. It is slated to be fixed in 9.0.7.
Launch numbers
Are there knobs to control how many jobs get launched at the same time and/or delay between launches? We are wondering because we hit our MaxStartups limit of 10:30:60 in sshd.
- FILE_TRANSFER_DISK_LOAD_THROTTLE ?
- JOB_START_COUNT /JOB_START_DELAY ?
ANSWER: JOB_START_COUNT looks like the right thing.
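For context on the limit being hit: sshd's MaxStartups value has the form start:rate:full. The sketch below follows the description in sshd_config(5), where refusals begin at rate percent once start unauthenticated connections exist and rise linearly to 100 percent at full. The function name is hypothetical and this is an illustration of that description, not sshd's actual code.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=60):
    """Chance (0.0 to 1.0) that sshd refuses a new connection.

    Below `start` nothing is dropped, at `start` the chance is
    `rate` percent, and it grows linearly to 100 percent at `full`.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    span = full - start
    return (rate + (100 - rate) * (unauthenticated - start) / span) / 100


for n in (5, 10, 35, 60):
    print(n, drop_probability(n))
```

With 10:30:60, a burst of job starts that opens ten or more simultaneous unauthenticated ssh connections begins to see refusals, which is why throttling job starts helps.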
plugin_small
Can one of you please try the instructions in
/home/nu_kscott/plugin_small/small.htc
ANSWER: CHTC admins are required to use two-factor authentication via PAM. This means they can't use a passwordless ssh key in a job.
Transferring back .ad files
I can add the following to my job and it will not cause an error but it also won't transfer the file
transfer_output_files = .job.ad, .machine.ad, .chirp.config
This isn't really important, I just thought it could be useful to diagnose jobs if I had a copy of the .job.ad and thought this would be a convenient way to get it. I am surprised that it neither causes an error, which it would if the file didn't actually exist, nor copies it. So I am guessing either the file is removed after it is checked for existence or HTCondor knows about its internal files and refuses to copy them.
ANSWER: Greg is not surprised by this.
FILESYSTEM_DOMAIN as requirement
I want to submit jobs that require a different filesystem but none of the following seem to work
requirements = (FILESYSTEM_DOMAIN == "aoc.nrao.edu")
FILESYSTEM_DOMAIN = "aoc.nrao.edu"
+FILESYSTEM_DOMAIN = "aoc.nrao.edu"
Looks like the answer is
requirements = (FileSystemDomain == "aoc.nrao.edu")
<sarcasm>because that's perfectly obvious</sarcasm>
But let's say we have two clusters (aoc.nrao.edu and cv.nrao.edu) with different filesystems. I want jobs submitted in aoc.nrao.edu with a requirement of cv.nrao.edu to glidein to the cv.nrao.edu cluster. How can a factory script at cv.nrao.edu look for such jobs? I can't seem to use condor_q -constraint to look for such jobs. The following doesn't work.
condor_q -pool nmpost-cm-vml.aoc.nrao.edu -global -allusers -constraint 'Requirements = ((FileSystemDomain == "cv.nrao.edu"))'
ANSWER: I think the answer is not to use FileSystemDomain but to create our own custom classad like we do with the VLASS partition. Greg says it is possible to query for this requirement but the syntax is pretty gnarly. I think making a partition is a better solution.
Removing tokens
Let's say I have a schedd that authenticates with an idtoken in /etc/condor/tokens.d. If I remove that token, I am still able to submit jobs from that host until condor is restarted. It has to be a restart as condor_reconfig seems insufficient. This indicates to me that HTCondor is caching the token. Although it is strange that condor_token_list returns nothing immediately after removing the token yet HTCondor can still submit jobs. This is not really a problem but I was surprised by it and wanted to point it out in case it was unexpected. There doesn't seem to be a timeout either.
ANSWER: Greg knows about this. HTCondor establishes a relationship once authenticated and continues to use that relationship. It may timeout after 24 hours, not sure.
Signing key
Given two separate clusters (testpost and nmpost), what should the signing keys and tokens look like?
Now that we use idtokens, I thought that to get a VM to be able to submit jobs I only needed to add our cluster's token to /etc/condor/tokens.d. But apparently I also need to add our cluster's signing key to /etc/condor/passwords.d. I since learned that this is probably because I created the signing key and token on our testpost cluster and then copied them to our nmpost cluster.
ANSWER: yes. create signing keys for each cluster.
Jobs with a little swap
Say we had jobs that need 40GB of memory but occasionally, very briefly, spike to 60GB. With Torque this is not a problem because it will just let the job swap. It is not a big performance hit because the amount of time that memory is needed is very short compared to the runtime of the job. How could we do this in HTCondor? We really don't want to set a memory requirement of 60GB because we want to run multiple jobs on a node and doing so will significantly reduce the number of jobs we could put on a node.
Does the new DISABLE_SWAP_FOR_JOB=false knob, introduced in 8.9.9, mean that HTCondor now swaps if needed by default?
ANSWER: try setting memory.swappiness for the condor cgroup.
ANSWER: The VLASS nodes don't have a swap partition. Make a swapfile on the vlass node (nmpost110) and see if that works.
Allocated in the log file
If I submit a job at CHTC with request_disk = 1 G the log output looks like
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 49 1048576 1485064
IoHeavy : 0
Memory (MB) : 1 1024 1024
But if I submit a job at CHTC with a request_disk = 2 G the log output looks like
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 49 2097152 7258993
IoHeavy : 0
Memory (MB) : 0 1024 1024
What does the "Allocated" disk space mean in these examples?
ANSWER: with partitionable slots HTCondor allocates more disk space than you ask because then that slot might be used by a follow up job. This is because destroying and creating partitionable slots takes a full negotiation cycle which is measured in minutes.
Setting MODIFY_REQUEST_EXPR_REQUEST_DISK=RequestDisk on the startd (execute host) can alter this behavior. Check the docs.
Rebooting Execute Hosts
When an Execute Host unexpectedly reboots, what happens to the job? What are the options? Currently it looks like the job just "hangs". condor_q indicates that it is still running but it isn't. Looks like it eventually times out after the magic 40 minutes.
ANSWER: Correct
condor_off -reason
You added a -reason to condor_drain, could the same be added to condor_off?
ANSWER: Greg likes this idea and will look into it. Only recently did they implement offline ads that would allow this sort of thing.
Security email
On Mar. 3, 2022 James Robnett received the Security Release email. Is there an email list for these? It looks like he was just BCC'd. Could we change it from James's address to a non-human address?
ANSWER: Greg updated their security list with nrao-scg@nrao.edu
condor_rm -addr
No matter what IP I use, condor_rm -addr (e.g. condor_rm -addr 10.64.1.178:9618 361) always responds with something like this
condor_rm: "10.64.1.178:9618" is not a valid address
Should be of the form <ip.address.here:port>
For example: <123.456.789.123:6789>
Yet this works condor_rm -pool 146.88.10.46:9618 361
ANSWER: condor_rm -addr "<10.64.1.178:9618>" 361
It actually needs the angle brackets. Weird.
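The quotes in the answer matter too, because an unquoted < or > would be treated by the shell as redirection. One way to sidestep the quoting is to build the argument list in a script and run it without a shell. A small sketch, with hypothetical helper names:

```python
def sinful(ip, port):
    """Wrap ip:port in the angle brackets that condor_rm -addr expects."""
    return f"<{ip}:{port}>"


def condor_rm_args(ip, port, job_id):
    """Build an argument list suitable for subprocess.run (no shell involved)."""
    return ["condor_rm", "-addr", sinful(ip, port), str(job_id)]


print(condor_rm_args("10.64.1.178", 9618, 361))
```

Passing this list to subprocess.run hands the bracketed address to condor_rm unchanged.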
condor_startd blocking on plugin
I modified our nraorsync plugin on the testpost cluster to sleep for 3600 seconds before calling upload_rsync() and then started my small, test job that uses the plugin. Here is what I see in the condor.log
000 (401.000.000) 2022-03-21 11:18:01 Job submitted from host: <10.64.1.178:9618?addrs=10.64.1.178-9618&alias=testpost-master.aoc.nrao.edu&noUDP&sock=schedd_2991_27db>
...
040 (401.000.000) 2022-03-21 11:18:01 Started transferring input files
Transferring to host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_3_5133_69aa_48>
...
040 (401.000.000) 2022-03-21 11:18:04 Finished transferring input files
...
001 (401.000.000) 2022-03-21 11:18:04 Job executing on host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>
...
006 (401.000.000) 2022-03-21 11:18:12 Image size of job updated: 58356
1 - MemoryUsage of job (MB)
312 - ResidentSetSize of job (KB)
...
040 (401.000.000) 2022-03-21 11:18:32 Started transferring output files
...
022 (401.000.000) 2022-03-21 12:17:09 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@testpost003.aoc.nrao.edu <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>...
024 (401.000.000) 2022-03-21 12:17:09 Job reconnection failed
Job disconnected too long: JobLeaseDuration (2400 seconds) expired
Can not reconnect to slot1_3@testpost003.aoc.nrao.edu, rescheduling job
...
040 (401.000.000) 2022-03-21 12:19:04 Started transferring input files
Transferring to host: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_1_5133_69aa_49>
Here is the StartLog on testpost003 for the time the job was disconnected
03/21/22 11:29:43 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
03/21/22 12:17:09 ERROR: Child pid 39127 appears hung! Killing it hard.
03/21/22 12:17:09 Starter pid 39127 died on signal 9 (signal 9 (Killed))
03/21/22 12:17:09 slot1_3: State change: starter exited
03/21/22 12:17:09 slot1_3: Changing activity: Busy -> Idle
03/21/22 12:17:09 slot1_3: State change: idle claim shutting down due to CLAIM_WORKLIFE
03/21/22 12:17:09 slot1_3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
03/21/22 12:17:09 slot1_3: State change: No preempting claim, returning to owner
03/21/22 12:17:09 slot1_3: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/21/22 12:17:09 slot1_3: State change: IS_OWNER is false
03/21/22 12:17:09 slot1_3: Changing state: Owner -> Unclaimed
03/21/22 12:17:09 slot1_3: Changing state: Unclaimed -> Delete
03/21/22 12:17:09 slot1_3: Resource no longer needed, deleting
03/21/22 12:17:09 Error: can't find resource with ClaimId (<10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>#1645569601#178#...) for 443 (RELEASE_CLAIM); perhaps this claim was removed already.
03/21/22 12:17:09 condor_write(): Socket closed when trying to write 45 bytes to <10.64.1.178:18477>, fd is 11
03/21/22 12:17:09 Buf::write(): condor_write() failed
03/21/22 12:19:04 slot1_1: New machine resource of type -1 allocated
Here is the ShadowLog on testpost-master (the submit host) for the time the job was disconnected
03/21/22 11:18:04 (401.0) (3260086): File transfer completed successfully.
03/21/22 12:17:09 (401.0) (3260086): Can no longer talk to condor_starter <10.64.1.173:9618>
03/21/22 12:17:09 (401.0) (3260086): Trying to reconnect to disconnected job
03/21/22 12:17:09 (401.0) (3260086): LastJobLeaseRenewal: 1647883111 Mon Mar 21 11:18:31 2022
03/21/22 12:17:09 (401.0) (3260086): JobLeaseDuration: 2400 seconds
03/21/22 12:17:09 (401.0) (3260086): JobLeaseDuration remaining: EXPIRED!
03/21/22 12:17:09 (401.0) (3260086): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (2400 seconds) expired
03/21/22 12:17:09 (401.0) (3260086): Exiting with JOB_SHOULD_REQUEUE
03/21/22 12:17:09 (401.0) (3260086): **** condor_shadow (condor_SHADOW) pid 3260086 EXITING WITH STATUS 107
03/21/22 12:19:04 ******************************************************
Does the condor_starter block waiting for the plugin to finish and therefore not respond to queries from the condor_startd? Will setting JobLeaseDuration to something longer than 2400 seconds help with this?
ANSWER: Yes, the condor_starter blocks on the output transfer (but not on the input transfer). Greg thinks adding JobLeaseDuration to the submit description file should fix the problem.
Greg agrees that not blocking on upload would be preferable.
ANSWER: It turns out it was a different knob than JobLeaseDuration. I set the following on our execution hosts to solve the problem.
NOT_RESPONDING_TIMEOUT = 86400
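The timestamps in the ShadowLog excerpt above show why the reconnect could not succeed. A minimal sketch of the arithmetic, parsing the log's own timestamp format; the helper name seconds_between is made up for this example.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "03/21/22 12:17:09"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return int((t1 - t0).total_seconds())


# LastJobLeaseRenewal and the disconnect time, both from the ShadowLog.
elapsed = seconds_between("03/21/22 11:18:31", "03/21/22 12:17:09")
job_lease_duration = 2400

print(elapsed, elapsed > job_lease_duration)
```

By the time the starter was killed, the last lease renewal was already older than the 2400 second lease, so the shadow gave up and requeued the job.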
Removing a job from a container schedd
Let's say I submit a condor job from a schedd running inside a container (hamilton). How can I remove that job from outside the container (nmpost-master)?
I can see the job using condor_q
nmpost-master krowe >condor_q -name hamilton 4.0
-- Schedd: hamilton.aoc.nrao.edu : <146.88.1.44:9618?... @ 03/04/22 11:33:38
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
condor tiny1 3/4 11:31 _ _ _ 1 1 4.0
but when I try to remove it I get an auth error
nmpost-master krowe >condor_rm -name hamilton 4.0
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using SCITOKENS
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:101). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using IDTOKENS
AUTHENTICATE:1004:Failed to authenticate using FS
No result found for job 4.0
Could this be because the container doesn't have a Condor Signing Key but only has a Condor Token?
But I get the same problem when trying to kill a job on nmpost-master submitted from nmpost-cm and they both have the same passwords and tokens. Do I need a token signing key and a token in ~/.condor/tokens.d?
ANSWER: yes, you need both the token signing key in /etc/condor/passwords.d and the token in ~/.condor/tokens.d
JobLeaseDuration
So setting JobLeaseDuration works if I don't choose a 'bad' value. So far values of 7200 and 14400 cause the job to be disconnected after about 60 minutes while values of 4000, 4800, and 8000 let the job finish normally. Why?
Why does it seem to take 60 minutes for the job to disconnect instead of 40 minutes (2400 seconds)?
Is there a way I can set JobLeaseDuration at a system level instead of in the submit description file?
Why is it that if I set JOB_DEFAULT_LEASE_DURATION = 4000 in the submit host config, the job.ad gets JobLeaseDuration = 4000 and yet the job still disconnects?
I compared the jobads (condor_q -l) of a job where *JobLeaseDuration = 4000* is set in the submit description file and *JOB_DEFAULT_LEASE_DURATION = 4000* set in the submithost config. The only differences I see in the jobads are times, jobids, logfiles and diskprovisioned. So I don't understand why altering the submit host config doesn't work.
ANSWER: Daemons send keep-alive messages. The startd expects keep-alives from the starter, and if they are not received the starter gets killed. This is outside JobLeaseDuration. It dates from the old days of Condor when it was scavenging cycles and didn't want to get in the user's way. Look into NOT_RESPONDING_TIMEOUT in the config file on the worker node. Default is 3600 seconds. Try setting it to something LARGE.
Comments
This has probably already been mentioned but would it be possible to put comments after a condor command like so
batch_name = "test script" # dont show this
without the batch_name being set to test script # dont show this
ANSWER: Not likely to be changed as doing so may break other things.
Removing jobs with tokens
You can use tokens to remove jobs as other users but strangely not on the same host. For example: krowe and krowe2 have the same token (~/.condor/tokens.d/testpost). If I submit a job as krowe on testpost-master I cannot remove that job as krowe2 on testpost-master.
testpost-master$ condor_q -g -all -af clusterid owner jobstatus globaljobid
452 krowe 2 testpost-master.aoc.nrao.edu#452.0#1648820298
testpost-master$ condor_rm 452
Couldn't find/remove all jobs in cluster 452
testpost-master$ condor_rm -name testpost-master 452
Couldn't find/remove all jobs in cluster 452
However, if I submit a job as krowe on testpost-cm I *can* remove that job from testpost-master (condor_rm -name testpost-cm 123). Is this a bug? Is it because when you are on the same host, HTCondor is trying UID authentication instead of token authentication? If so, is there a way to force token authentication?
ANSWER: Greg thinks this is because they choose the authentication type first and then stick with that type.
WORKAROUND: I *think*
_condor_SEC_DEFAULT_AUTHENTICATION_METHODS=IDTOKENS condor_rm will use idtokens but Greg thinks this may not work so be warned.
Condor Week
- transportation (airports, busses, cars, etc) Greg recommends a cab from the airport to campus.
- The Getting to the University of Wisconsin-Madison campus (by car, bus or plane) link under the Local Arrangements link is broken.
- The Ground Transportation from the Madison airport into download link seems to work but I bet you meant downtown.
RADIAL CHTC support
- What role does CHTC have for this if any?
- ANSWER: They PXE boot but use a local OS installation with puppet to keep them in sync.
- Singularity or Apptainer?
- ANSWER: They are using Singularity now but may switch to Apptainer.
Flocking and networking
Say we have a pool named cvpost at some remote site and we want to flock jobs to it from our pool named nmpost. What kind of networking is necessary? Do the execute hosts need a routable IP (NAT or real) for download and/or upload? What about the submit host and central manager?
- Job itself: submit host sends job to remote central manager?
- transfer input files: The submit host sends files to the execute host?
- transfer output files: The execute host sends files to the submit host?
ANSWER: These paths need to be open
- From the local schedd to the remote collector on condor port 9618
- From the remote negotiator and execute hosts to the local schedd. Here the execute hosts can be NATed.
- From the local shadow to the remote starter. Use CCB. It allows execute hosts to live behind a firewall and be NATed.
/tmp
executable = /bin/bash
arguments = "-c '/bin/date > /tmp/date'"
should_transfer_files = yes
transfer_output_files = /tmp/date
#transfer_output_files = tmp/date
queue
If I write to /tmp/date and set transfer_output_files = /tmp/date I get errors like
Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from /tmp/date: (errno 2) No such
file or directory; SHADOW failed to receive file(s) from <10.64.10.140:35386>
It works if I set transfer_output_files = tmp/date
/dev/shm
executable = /bin/bash
arguments = "-c '/bin/date > /dev/shm/date'"
should_transfer_files = yes
transfer_output_files = /dev/shm/date
#transfer_output_files = dev/shm/date
queue
If I write to /dev/shm/date I get errors setting transfer_output_files = /dev/shm/date
Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from /dev/shm/date: (errno 2) No
such file or directory; SHADOW failed to receive file(s) from
<10.64.10.140:41516>
If I write to /dev/shm/date I get errors setting transfer_output_files = dev/shm/date
Error from slot1_4@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send
file(s) to <10.64.10.100:9618>: error reading from
/lustre/aoc/admin/tmp/condor/nmpost040/execute/dir_30401/dev/shm/date: (errno 2)
No such file or directory; SHADOW failed to receive file(s) from
<10.64.10.140:40380>
ANSWER: These are known issues and not surprising. It's debatable whether they are bugs or not. The issue is that the job is "done" by the time transfer_output_files is used, and since the job is done the bind mounts for /tmp and /dev/shm (which is a little different) are gone.
pro-active glideins
Need to investigate gliding in based on lack of free slots rather than idle jobs. Can one query HTCondor for a CARTA-shaped slot (core, mem, disk)?
ANSWER: Greg thinks this is a good idea and might be useful as a condor-week talk.
condor_off vs condor_drain
a -peaceful option to condor_drain might be perfect. Low priority for NRAO.
ANSWER: Yes condor_drain is being worked on and this is one of the things.
Transfer Plugins
Don't have condor block on the transfer plugin uploading. Low priority for NRAO.
ANSWER: This requires some serious work. Greg will ask Todd about it.
More plugin woes
So let's say you have a plugin to transfer output files and this plugin fails because a destination directory, like nosuchdir, doesn't exist. All the plugin can do is indicate success or failure so it indicates failure. But that seems to cause HTCondor to disconnect/reconnect four times, then fail, then set the job to idle so it can try again later, which then disconnects/reconnects four times and ... Is there anything else the plugin can do to tell HTCondor to hold the job instead of restart?
executable = /bin/sleep
arguments = "27"
output = nosuchdir/condor_out.log
error = nosuchdir/condor_err.log
log = condor.log
should_transfer_files = YES
transfer_output_files = _condor_stdout
# output_destination = nraorsync://$ENV(PWD)
+WantIOProxy = True
queue
If you set either output or error to a directory that doesn't exist like output = nosuchdir/condor_out.log, then when the job ends, HTCondor will put the job on hold with a message like the following in the condor.log
040 (5062.000.000) 2022-09-30 08:28:58 Finished transferring output files
...
007 (5062.000.000) 2022-09-30 08:28:58 Shadow exception!
Error from slot1_2@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send file(s) to <10.64.10.100:9618>; SHADOW at 10.64.10.100 failed to write to file /users/krowe/htcondor/nraorsync/dir/stdout.5062.log: (errno 2) No such file or directory
13 - Run Bytes Sent By Job
354 - Run Bytes Received By Job
...
012 (5062.000.000) 2022-09-30 08:28:58 Job was held.
Error from slot1_2@nmpost040.aoc.nrao.edu: STARTER at 10.64.10.140 failed to send file(s) to <10.64.10.100:9618>; SHADOW at 10.64.10.100 failed to write to file /users/krowe/htcondor/nraorsync/dir/stdout.5062.log: (errno 2) No such file or directory
Code 12 Subcode 2
But if you have set output_destination to use the nraorsync plugin like so output_destination = nraorsync://$ENV(PWD) then you get four disconnect/reconnect events followed by a shadow exception (see below). Then HTCondor sets the job to idle so it can try again instead of putting it on hold. I assume this is because it doesn't know why the job failed because there isn't really a mechanism for the plugin to tell it why.
040 (5061.000.000) 2022-09-30 08:23:20 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:20 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:20 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:20 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:20 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:20 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:20 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:20 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:21 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:21 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:21 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:21 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:21 Finished transferring output files
...
022 (5061.000.000) 2022-09-30 08:23:21 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_3@nmpost040.aoc.nrao.edu <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
...
023 (5061.000.000) 2022-09-30 08:23:21 Job reconnected to slot1_3@nmpost040.aoc.nrao.edu
startd address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=startd_5631_f9e2>
starter address: <10.64.10.140:9618?addrs=10.64.10.140-9618&alias=nmpost040.aoc.nrao.edu&noUDP&sock=slot1_3_5795_da05_282>
...
040 (5061.000.000) 2022-09-30 08:23:21 Started transferring output files
...
040 (5061.000.000) 2022-09-30 08:23:22 Finished transferring output files
...
007 (5061.000.000) 2022-09-30 08:23:22 Shadow exception!
Error from slot1_3@nmpost040.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
354 - Run Bytes Received By Job
ANSWER: This is a bug. CHTC would like to implement better error handling here.
A workaround could be to set the following in the config file on the submit host, but this may be problematic on SSA's container submit host. It should cause condor to fail the job if nosuchdir doesn't exist. I think it's best to note this as a bug and wait for CHTC to implement better error handling that allows the plugin to tell HTCondor how to fail. Or something like that.
SUBMIT_SKIP_FILECHECKS=false
DAG hosts
Is there a way to guarantee all the nodes of a DAG run on the same hostname without specifying the specific hostname?
An example would be the first node copies some data to local host storage, then all the other nodes read that data.
ANSWER: have the DAG post script figure out what hostname the node ran on and then modify or create the submit file for the next node.
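A sketch of what such a POST script might do, assuming it can read the node's job event log: find the "Job executing on host" event (the format shown in the condor.log excerpts earlier on this page), pull the hostname out of the alias field, and emit a requirements line for the next node's submit file. The function names are hypothetical, pinning via the Machine attribute is one possible choice, and none of this has been tested under DAGMan.

```python
import re

# Matches the alias field in a "Job executing on host" event line.
EXECUTE_RE = re.compile(r"Job executing on host: <[^>]*[?&]alias=([^&>]+)")


def last_execute_host(log_text):
    """Return the hostname from the last execute event, or None."""
    hosts = EXECUTE_RE.findall(log_text)
    return hosts[-1] if hosts else None


def requirements_line(hostname):
    """Submit-file line that pins a job to one machine."""
    return f'requirements = (Machine == "{hostname}")'


sample_log = (
    "001 (401.000.000) 2022-03-21 11:18:04 Job executing on host: "
    "<10.64.1.173:9618?addrs=10.64.1.173-9618"
    "&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_5045_0762>\n"
)
host = last_execute_host(sample_log)
print(requirements_line(host))
```

The POST script would then write that line into the submit file of the following node before DAGMan submits it.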
glidein memory requirements
Twice now I have doubled the memory for the pilot job. On Jul. 14, 2022 from 1GB to 2GB and just now (Aug. 25, 2022) from 2GB to 4GB. This is because a condor daemon such as condor_starter exceeded the memory limit and was OOM killed. This second time was during nraorsync uploading files. Is there a suggested amount of memory for Slurm for glidein jobs?
ANSWER: The startd assumes it has control of all physical memory and doesn't check whether it is in a cgroup. If I run into this again, try to track down what is actually happening. Greg would like to know. He is surprised because the HTCondor daemons should only need MBs not GBs.
no match found
When a job stays idle for a long time and its LastRejMatchReason = "no match found " what are some good places to look to see why it isn't finding a match? For example, if you make a typo and set the following (note the misspelling of nraorsync)
transfer_input_files = $ENV(HOME)/.ssh/condor_transfer, nraorysnc://$ENV(PWD)/testdir
ANSWER: condor_q -better
It doesn't know about quotas or fairshare or per-machine resource limits like +IOHeavy or other such ads.
Accounts
Can Felipe get an account? Also, you might want to ask James if he still needs his account now that he no longer works for the NRAO.
ANSWER: done.
Start glidein node if there isn't one free
Previously, our factory.sh script would start a glidein job in Slurm if there was a job idle. But now that we want jobs to start as quickly as possible, our factory.sh script starts a glidein job if there aren't enough free resources available. To do this we had to define what "free resources" meant so we went with MIN_SLOTS=8 and MIN_MEMORY=16384 since we use dynamic slots. We also had to set a "default" machine ad on all the nodes that we wanted available to this glidein job. This is so that the factory.sh script doesn't check nodes that are in the VLASS or CVPOST or other such groups. I could have explicitly excluded those groups but that wouldn't scale well if we ever created more groups.
ANSWER: Greg thinks this is a perfectly cromulent way to do this.
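The decision described above can be sketched roughly as follows. This is not factory.sh: the data layout and function names are hypothetical, and it assumes a node counts as free when it has at least MIN_SLOTS unused cores and MIN_MEMORY MB of unused memory, which the notes above do not spell out exactly.

```python
MIN_SLOTS = 8        # cores, value from the notes above
MIN_MEMORY = 16384   # MB, value from the notes above


def has_free_node(nodes):
    """True if any node in the 'default' group has enough free resources."""
    return any(
        n["group"] == "default"
        and n["free_cpus"] >= MIN_SLOTS
        and n["free_memory_mb"] >= MIN_MEMORY
        for n in nodes
    )


def should_start_glidein(nodes):
    """Start a glidein only when no default node has room."""
    return not has_free_node(nodes)


# Hypothetical pool state: the VLASS node has room but is not in "default".
pool = [
    {"group": "default", "free_cpus": 4, "free_memory_mb": 32000},
    {"group": "VLASS", "free_cpus": 16, "free_memory_mb": 64000},
]
print(should_start_glidein(pool))
```

Filtering on a positive "default" marker rather than excluding named groups is what keeps the check unchanged when new groups are added.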
OS
What RHEL8-like OS is CHTC going to use or is using? CentOS8/stream, Alma, Rocky, etc? Looks like CentOS8 Stream. Any thoughts?
ANSWER: yes they are using CentOS8/stream. So far so good.
PATh getting data to execute hosts
What are the preferred methods? http? nraorsync? other?
ANSWER: http, s3 or other plugins. Or OSDF.
https://osg-htc.org/services/osdf.html
OSDF (Open Science Data Federation): There are "data origins" which are basically webservers with cache. Long term we might be able to have our own data origin that authenticates to NRAO and shares data from our Lustre filesystem to their Ceph system via some way. This is like an object store so you can't really update files but you can make new ones.
/mnt/stash/ospool/PROTECTED/ Copy data from NRAO to this path and then you can access it in the job via
transfer_input_files = stash:///ospool/PROTECTED/user/file
transfer_output_files = stash:///ospool/PROTECTED/user/file
https://portal.osg-htc.org/documentation/htc_workloads/managing_data/stashcache/
I think this is cooler than using the nraorsync plugin.
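The mapping from the mounted path to the stash URL looks mechanical, so it could be scripted. A small sketch, assuming the URL is simply the path below /mnt/stash (which is what the two examples above suggest); stash_url is a made-up helper name.

```python
from pathlib import PurePosixPath

# Mount point taken from the path above.
STASH_MOUNT = PurePosixPath("/mnt/stash")


def stash_url(local_path):
    """Turn a path under /mnt/stash into a stash:/// URL for a submit file.

    Raises ValueError if the path is not under the mount point.
    """
    rel = PurePosixPath(local_path).relative_to(STASH_MOUNT)
    return "stash:///" + str(rel)


if __name__ == "__main__":
    print("transfer_input_files = " + stash_url("/mnt/stash/ospool/PROTECTED/user/file"))
```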
PATh GPUs
We only see four GPUs on PATh right now. What is the timeline to get more? Does PATh flock to other sites with GPUs?
ANSWER: Hosts may be dynamic and only come online as needed with some k8s magic. Christina is checking. Greg is pretty sure there should be way more than just 4 GPUs in PATh. PATh is made up of six different sites (https://path-cc.io/facility/index.html), each of which provides hardware.
Disk Space
Since neither HTCondor nor cgroups control scratch space, how can we keep jobs from using up all the scratch space on a node and causing other jobs to fail as well?
ANSWER: Specify a periodic hold in the startd. Every 6 seconds the startd can check and put the job on hold. Greg can look up the syntax. Someday, the condor starter will make an ephemeral filesystem (loopback) on the scratch area with the requested disk space. This is coming soon.
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitDiskUsageOfJobs
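Until we have the real startd syntax, the check itself is simple enough to sketch. This is not HTCondor configuration, just the comparison a periodic hold would make. It assumes usage and request are both in KiB (the units HTCondor uses for DiskUsage and RequestDisk); the function names are made up.

```python
import os


def dir_usage_kib(path):
    """Sum the apparent sizes of all files under `path`, in KiB.

    Uses st_size, so sparse files and block rounding are not accounted for.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished while walking; ignore it
    return total // 1024


def should_hold(usage_kib, request_kib):
    """True if the job is using more scratch space than it requested."""
    return usage_kib > request_kib


if __name__ == "__main__":
    usage = dir_usage_kib(".")
    print(usage, should_hold(usage, 1024 * 1024))  # compare against a 1 GiB request
```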
Squid on PATh?
Does PATh use a squid server so that we can wget something once and have it cached for a while?
ANSWER: Greg thinks PATh does this as well.
RADIAL workload balance
If there are only two users submitting jobs to a cluster, will HTCondor try to balance the workload between the two users? For example will it prioritize user2 jobs if user1 jobs are using the majority of resources? I think I read this about HTCondor's fair-share algorithm but I am not sure.
ANSWER: Yes, condor does this. There are knobs to adjust user priorities (user1 is twice the priority of user2, etc). You can also specify the length of the half-life. There are many other ways to do something like this.
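To get a feel for what the half-life does, here is a rough sketch of the exponential smoothing that the HTCondor manual describes for a user's real priority. It is an illustration under that assumption, not the negotiator's code; update_priority is a made-up name, and the one-day default mirrors PRIORITY_HALFLIFE as I understand it, so check the manual before relying on it.

```python
def update_priority(old_priority, resources_in_use, dt, half_life=86400.0):
    """One smoothing step of a user's real priority.

    old_priority     priority value before this step
    resources_in_use number of resources the user holds right now
    dt               seconds since the last update
    half_life        seconds for past usage to decay by half (assumed default: one day)
    """
    beta = 0.5 ** (dt / half_life)
    return beta * old_priority + (1.0 - beta) * resources_in_use


if __name__ == "__main__":
    # user1 stops running jobs: the priority value decays toward zero.
    p = 100.0
    for day in range(1, 4):
        p = update_priority(p, 0.0, 86400.0)
        print(day, p)
```

A lower value is a better priority, so a user who has been using most of the cluster drifts back toward parity with an idle user as the past usage decays.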