...
Using my Cluster Translation Table at https://staff.nrao.edu/wiki/bin/view/NM/ClusterCommands, here is what I suggest for Slurm. Notable differences to be aware of: Slurm doesn't provide user-level prologue/epilogue scripts, Slurm can't set the umask of a job, and Slurm exports all environment variables to the job by default.
[/usr/bin/sudo, -u, almapipe, /usr/bin/sbatch, -p, batch, -N, 1, -n, 1, --mem=18G, -t, 12-00:00:00, --export=ALL,CAPSULE_CACHE_DIR=~/.capsule-vatest,CAPO_PROFILE=vatest, -D, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/, --mail-type=FAIL, --mail-user=jgoldste,dlyons,jsheckar, -J, PrepareWorkingDirectoryJob.vatest.86b484f2-dfda-4f51-ad71-c808066441de, -o, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.out.txt, -e, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.err.txt, /lustre/naasc/web/almapipe/workflows/vatest/bin/job-runner.sh, 18 -c edu.nrao.archive.workflow.jobs.PrepareWorkingDirectoryJob -p vatest -w /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298]
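Since Slurm can't set a job's umask, one workaround is to set it at the top of the job script itself. A minimal sketch of such a wrapper (this body is an assumption for illustration, not the contents of the real job-runner.sh; the 0027 policy is likewise assumed):

```shell
#!/bin/sh
# Hypothetical job wrapper: Slurm has no per-job umask option,
# so set the desired umask here before exec'ing the real payload.
umask 0027   # group-readable, world-inaccessible (assumed policy)

# Slurm already exports the submitter's environment (--export=ALL),
# so variables like CAPO_PROFILE are visible here with no extra work.
exec "$@"
```

Any files the payload creates then inherit the wrapper's umask, which approximates Torque's per-job umask setting.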
...
Replacement options for Torque/Moab (Pros and Cons)
| Torque | OpenPBS | Slurm | HTCondor
---|---|---|---|---
Working directory | Yes both -d and -w | No -d nor -w to set working directory | Yes -D | |
Passed args | Yes -F | No. At least what the man page describes didn't work for me. | Yes | |
Prolog/Epilog | Yes | No user-level prolog/epilog scripts. | No user-level prolog/epilog scripts. | |
Array jobs | Yes | Yes | Yes | Uses DAGs instead of array jobs |
Complex queues | Can handle vlass/vlasstest queues | Can handle vlass/vlasstest queues | Can handle vlass/vlasstest queues but they are partitions not queues. Should be fine. | Uses requirements instead of queues but should be sufficient |
Reservations | Yes | Reservations work differently but may still be useful. Version 2021.1 may do this better. | Yes | No way to reserve nodes for maintenance or special occasions. |
Authorization | Yes. PAM module | No PAM module. The MoM can kill processes that are not part of a job and not owned by one of up to 10 special users. | Has a PAM module similar to Torque's | |
Remote Jobs | Maybe with Nodus but I was unimpressed | Presumably with Altair Control | Yes to CHTC, OSG, AWS | |
cgroups | Yes with cpuset | Yes both cpuset and cpuacct | Yes with cpuset | Yes with cpuacct |
Multiple Submit Hosts | Yes | Yes | Yes | Yes |
Pack jobs | Yes | Yes | Yes | Yes |
Multi-node MPI | Yes | Yes | Yes | Yes but needs the Parallel Universe |
Preemption | Yes but can be disabled | Yes but can be disabled | Yes but can be disabled | |
nodescheduler | Yes because of cgreg and uniqueuser | No | No | No |
nodevnc | Yes | Yes | Yes but is buggy | |
Cleans Up files and processes | No. Will require a reaper script | No. Will require a reaper script | No. Will require a reaper script | Yes |
Node order | Yes. The nodefile defines order | Not really a way to set the order in which the scheduler will give out nodes | Not really a way to set the order in which the scheduler will give out nodes | |
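Since none of the Torque replacements clean up stray files and processes, a reaper script will be needed. Its core could simply diff the users who own processes on a node against the users holding active jobs there. A sketch of that set logic (the hard-coded `active_users`/`proc_owners` values are placeholders; in practice they would come from commands like `squeue -h -w "$(hostname)" -o '%u'` and `ps -eo user=`):

```shell
#!/bin/bash
# Sketch of reaper logic: find process owners with no active job.
# Placeholder inputs stand in for live squeue/ps output.
active_users="almapipe
vlapipe"
proc_owners="almapipe
jdoe
vlapipe"

# comm -13 keeps lines present only in the second (sorted) input:
# users owning processes but holding no job are reap candidates.
stragglers=$(comm -13 \
    <(printf '%s\n' "$active_users" | sort -u) \
    <(printf '%s\n' "$proc_owners" | sort -u))
echo "$stragglers"
```

A real script would then kill the stragglers' processes and sweep their scratch files, with an allow list for system accounts.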
...