condor_submit -i job.htc
To run a job interactively for debugging. I've used this to get interactive resources but never to debug jobs.
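A minimal submit file for grabbing interactive resources might look like this (resource numbers made up):

    # no executable needed; condor_submit -i drops you into a shell on the slot
    request_cpus   = 4
    request_memory = 8G
    queue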
Access Points and Execute Points
The HTCondor project now uses "Access Point" where I use "Submit Host", and "Execute Point" where I use "Execute Host".
I will update my slides.
transfer_output_remaps
transfer_output_remaps = "count.Dracula.txt=output/count.Dracula.txt"
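This renames an output file as it is transferred back: here the job's count.Dracula.txt lands in output/count.Dracula.txt under the submit directory. Multiple remaps are separated by semicolons inside the quotes.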
condor_tail
View the stdout and stderr files of a running job. Only works if should_transfer_files = YES
condor_tail JobID
condor_tail -stderr JobID
GPUs
HTCondor 9.0 has trouble handling different models of GPUs in one pool. Later versions (9.8+) are needed for a cluster with more than one type of GPU. See John Knoeller's talk.
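With 9.8+, I believe a submit file can then pick among GPU models with require_gpus; a hedged sketch (property names from memory, values made up):

    request_gpus = 1
    require_gpus = (Capability >= 8.0) && (GlobalMemoryMb >= 16000)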
condor_submit -spool
Can SSA use this with their containers? Have I already looked at this? No, I have not tried it, because it doesn't do what I first thought: it spools all input files on the schedd, in the spool directory (/var/lib/condor/spool). For SSA the schedd is mcilroy, so this option would be disastrous for them.
checkpointing
This is a crazy idea, but what about using checkpointing with SSA's workflow? Right now they have a three-step process: download, process, upload, all of which use Lustre. But what if we checkpointed after each step? Would this allow the data to be downloaded directly to local storage instead of Lustre, then processed, then uploaded? Now that I write it out, I don't see how this is much better than the current process of copying from archive to Lustre to local to Lustre to local to Lustre. Have to think about it more.
This checkpointing is kind of a trick to get multiple jobs (actually checkpoints of one job) to run on the same host, something we wanted a while ago.
Let me see if I can explain what I think the process is for SSA's std_calibration, which is a DAG (rough sketch after this list):
- fetch - Copies data from someplace (perhaps the archive) to local storage on an nmpost node.
- Then the DAG node ends and the data is returned to Lustre.
- envoy - Copies data from Lustre to local storage and runs the calibration.
- Then the DAG node ends and the data is returned to Lustre.
- convey - Copies data from Lustre to local storage and then delivers it someplace.
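A rough DAGMan sketch of how I picture it (submit file names are my guesses, not SSA's actual files):

    # hypothetical std_calibration DAG: three nodes run in sequence
    JOB fetch  fetch.sub
    JOB envoy  envoy.sub
    JOB convey convey.sub
    PARENT fetch CHILD envoy
    PARENT envoy CHILD convey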
Though probably the best solution is to keep SSA from doing their unnecessary three-step process.
rrsync
from Rafi Rubin
For security, rsync ships a script in its source tree, rrsync; you use it in authorized_keys to restrict what rsync can do over SSH. I usually recommend single-purpose keypairs for that. You can also just run a standing rsyncd.
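Something like this line in authorized_keys (the script path, export directory, and key are made up):

    command="/usr/bin/rrsync -ro /export/data" ssh-ed25519 AAAA...restricted-key... backup@client

The -ro flag makes the key read-only; drop it to allow writes under /export/data.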
GlideinWMS
Is this better than my cheesy factory/pilot scripts?
Job Sets
Introduced in HTCondor-9.4
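If I understand right, these are driven through the new htcondor CLI tool, something like the following (subcommand names from memory, unverified):

    htcondor jobset submit my-set.set
    htcondor jobset list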
Extended submit commands
Promote +commands to first-class commands.
Newer versions of HTCondor allow an admin to make custom commands (say NRAO_TRANSFER_FILES) into standard commands that no longer require the plus sign.
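I believe the knob is EXTENDED_SUBMIT_COMMANDS on the schedd; a hedged sketch (the attribute name and its string type are my guesses for our case):

    # schedd config: nrao_transfer_files becomes a first-class submit command
    EXTENDED_SUBMIT_COMMANDS @=end
        NRAO_TRANSFER_FILES = "string"
    @end

After that, a submit file can say nrao_transfer_files = "..." instead of +NRAO_TRANSFER_FILES.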
condor_new
Todd Miller would like folks to test it
condor_adstash
Outputs ClassAd history to things like an Elasticsearch DB.
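Something like this, I think (flags from memory, unverified):

    condor_adstash --standalone --schedd_history --es_host localhost:9200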
grid universe
The grid universe looks like a way for HTCondor to submit jobs directly to Slurm without a glidein. I may be wrong about this.
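A minimal sketch of what I think the submit file looks like (untested; the script name is made up):

    universe      = grid
    grid_resource = batch slurm
    executable    = my_job.sh
    output        = job.out
    error         = job.err
    log           = job.log
    queue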