...
https://staff.nrao.edu/wiki/bin/view/NM/Slurm
Queues: We want to keep the queue functionality of Torque/Moab where, for example, hera jobs go to hera nodes, vlass jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
Slurm
- Queues are called partitions. At some level they are called partitions in Torque as well.
- Job preemption is disabled by default
- Allows for simple priority settings in partitions with the default PriorityType=priority/basic plugin.
- E.g. PartitionName=vlass Nodes=testpost[002-004] MaxTime=144000 State=UP Priority=1000
- HTCondor
- HTCondor doesn't have queues or partitions like Torque/Moab or Slurm but there are still ways to do what we need.
- Constraints and/or separate pools are good options.
- I don't know how to simulate the vlass/vlasstest queues. Perhaps by the time we move to HTCondor we won't need vlasstest anymore.
Interactive: The ability to assign all or part of a node to a user with shell level access (nodescheduler, qsub -I, etc), minimal granularity is per NUMA node, finer would be useful.
- What is it that we like about nodescheduler over something like qsub -I?
- It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring screen or tmux or VNC.
- Its creation is asynchronous. If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
- It's time limited (e.g. two weeks). We might be able to do the same with a queue/partition setting but could we then extend that reservation?
- We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes which may be less efficient. Then again it may be more efficient. But either way it is simpler for our users.
Access: Would like to prevent users from being able to login to nodes unless they have a proper reservation.
- Slurm
- Has a pam_slurm.so module similar to pam_pbssimpleauth.so.
- HTCondor
- Since I don't think we will be using nodescheduler with HTCondor, this isn't needed.
Reservations: The ability to reserve nodes far in the future for things like CASA classes and SIW would be very helpful. It would need to prevent HTCondor from starting jobs on these nodes as reservation time approaches.
- Slurm
- scontrol create reservation starttime=now duration=5 nodes=testpost001 user=root
- scontrol create reservation starttime=2022-05-3T08:00:00 duration=21-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
- scontrol show res The output of this kinda sucks. Hopefully there is a better way to see all the reservations.
- HTCondor
- This isn't really something HTCondor is designed to do. We will use Slurm for this.
Ability to run jobs remotely (AWS, CHTC, OSG, etc)
- Slurm
- I don't think we will need this ability with Slurm
- HTCondor
- I have tested both condor_annex to AWS and flocking to CHTC.
Array jobs: Do we want to keep the Torque array job functionality?
- Slurm
- #SBATCH --array=0-3%2 This syntax is very similar to Torque.
- HTCondor
- To some extent, this isn't how HTCondor is meant to be used. That said, DAGMan and the queue command can simulate this.
MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
- mpich2
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
- OpenMPI
- Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal.
- HTCondor
- While there is a parallel universe for HTCondor, I think we will use Slurm for MPI jobs.
Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
- /etc/slurm/cgroup.conf
- HTCondor
- Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
Submit hosts: we may have several hosts that will need to be able to submit and delete jobs. (wirth, mcilroy, hamilton, etc)
- Slurm
- Slurm-20 requires systemd so hosts must be RHEL7 or later.
- HTCondor
Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm
- Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node
- HTCondor
- Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
Reaper: Clean nodes of unwanted files, dirs and procs. Condor seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job. But /dev/shm is still an issue. What about errant processes?
Slurm
There is the pam_slurm_adopt.so that supposedly tracks and kills errant processes but it conflicts with systemd and therefore requires some special tweaking.
- HTCondor
- Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
- but /dev/shm is still an issue.
- What about errant processes?
Reaper: Cancel jobs when accounts are closed.
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default. Can HTCondor do this?
- Slurm
- The order of the nodes in PartitionName is not important. But you can set a Weight to a NodeName. Nodes with the lowest weight will be chosen first.
While preemption can be useful in some circumstances I expect we will want it disabled for the foreseeable future.
Slurm
The default is PreemptType=preempt/none which means Slurm will not preempt jobs.
- Run both Slurm and HTCondor on the same nodes
...
Now that we have a list of requirements I think the next step is to create a step-by-step procedure document listing everything that needs to be done to migrate from Torque/Moab to HTCondor and perhaps also Slurm.
To Do
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes by altering the order of the nodes file. This allows us to run jobs on the faster nodes by default.
- Slurm
- I don't know how to set priorities in Slurm like we do in Torque where batch jobs get the faster nodes and interactive jobs get the slower nodes. There is a Weight feature to a NodeName where the lowest weight will be chosen first but that will affect batch and interactive partitions equally. I need another axis. Actually, this might work at least for hera and hera-jupyter.
NodeName=herapost[001-007] Sockets=2 CoresPerSocket=8 RealMemory=193370 Weight=10
NodeName=herapost011 Sockets=2 CoresPerSocket=10 RealMemory=515790 Weight=1
PartitionName=batch Nodes=herapost[001-007] Default=YES MaxTime=144000 State=UP
PartitionName=hera-jupyter Nodes=ALL MaxTime=144000 State=UP
- The order in which the nodes are defined in slurm.conf has no bearing on which node the scheduler will choose, even though the man page for slurm.conf reads "the order the nodes appear in the configuration file".
- Perhaps I can use some sbatch option in nodescheduler to choose slower nodes first.
- Perhaps use Gres to set a resource like KLARNS for various nodes (Gold 6135, E-5 2400, etc). The slower the node, the more KLARNS we will assign it. Then if Slurm assigns jobs to nodes with the most KLARNS then we can use that to select the slowest nodes first. Hinky? You betcha.
- HTCondor
- There isn't a simple list like pbsnodes in Torque but there is NEGOTIATOR_PRE_JOB_RANK which can be used to weight nodes by cpu, memory, etc.
- OpenPBS
- Doesn't have a nodes file so I don't know what drives the order of the nodes chosen for jobs.
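As a rough illustration of the Weight idea in the Slurm notes above, here is a toy Python model that picks the idle node with the lowest weight inside a partition. It is not Slurm's actual selection code, which also considers resources and other factors. The weights and partition membership mirror the example NodeName and PartitionName lines; the function name `pick_node` and the hand-expanded node lists are assumptions made for this sketch.

```python
# Toy model only: it mimics "lowest Weight first" and nothing else.

# Weights taken from the example NodeName lines.
WEIGHTS = {f"herapost{i:03d}": 10 for i in range(1, 8)}
WEIGHTS["herapost011"] = 1

# Partition membership from the example PartitionName lines
# (Nodes=ALL is expanded by hand here).
PARTITIONS = {
    "batch": [f"herapost{i:03d}" for i in range(1, 8)],
    "hera-jupyter": sorted(WEIGHTS),
}


def pick_node(partition, busy=()):
    """Return the idle node with the lowest weight, or None if all are busy."""
    idle = [n for n in PARTITIONS[partition] if n not in busy]
    if not idle:
        return None
    # Ties are broken by node name so the result is deterministic.
    return min(idle, key=lambda n: (WEIGHTS[n], n))
```

In this model hera-jupyter jobs land on herapost011 first, while batch jobs never see that node because it is not in the batch partition. That is the "other axis" the example configuration relies on.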
- Node packing: Doesn't seem to pack jobs on to one node and then move to the next. The documentation mentions a "best fit algorithm" but never explains what that is. This problem is probably related to the Node priority issue.
- SchedulerParameters=pack_serial_at_end This puts serial jobs (jobs with only one core) at the end of the node list. E.g. sbatch --cpus-per-task=2 tiny.sh will get put on testpost001 while sbatch --cpus-per-task=1 tiny.sh will get put on testpost004. So that isn't a good solution.
Reaper: Cancel jobs when accounts are closed.
This could be a cron job on the Central Manager that looks at all the owners of jobs and kills jobs of any user that is not active.
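A minimal sketch of what that cron job might look like in Python, assuming that "not active" means the account no longer has a passwd entry (the real test could be something else, such as an LDAP flag). The function names are hypothetical. The list of job owners is assumed to come from something like `condor_q -allusers -af Owner`, and the sketch only builds `condor_rm <user>` command lines rather than running them.

```python
import pwd


def account_is_active(user):
    """Assumption: an account counts as closed when it has no passwd entry."""
    try:
        pwd.getpwnam(user)
        return True
    except KeyError:
        return False


def removal_commands(job_owners, is_active=account_is_active):
    """Return one condor_rm command per job owner whose account is not active."""
    closed = sorted({owner for owner in job_owners if not is_active(owner)})
    return [["condor_rm", owner] for owner in closed]
```

Each returned list could be handed to `subprocess.run` once the output has been checked by hand for a while. Keeping the activity test as a parameter makes it easy to swap in a different definition of a closed account.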
- Flocking: When I initially installed Torque/Maui I wanted a "secret server" that users would not log in to that ran the important services and therefore would not suffer any user shenanigans like running out of memory or CPU. This is what nmpost-serv-1 is. Then I wanted a submit host that all users would use to submit jobs and if it suffered shenanigans, jobs would not be lost or otherwise interrupted. This is what nmpost-master is. This system has performed pretty well. For HTCondor, this setup seemed to map well to their ideas of execution host, submit host, and central manager. But if we want to do flocking or condor_annex, we will need both the submit host and central manager to have external IPs. This makes me question keeping the central manager and submit hosts as separate hosts.
- The central manager will need an external IP. Will it need access to Lustre?
- The submit host that can flock will need an external IP. It will need access to Lustre.
- The submit host that runs condor_annex will need an external IP. It will need access to Lustre.
Done
- DONE: Run both Slurm and HTCondor on the same nodes
- Slurm starts and stops condor. CHTC does this because their HTCondor can preempt jobs. So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.
- https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToScavengeCycles
- I think we should just try to keep the clusters separate until there is a need to combine them.
- Glidein to Slurm
- https://staff.nrao.edu/wiki/bin/view/NM/HTCondor-glidein
- 2021-08-02 krowe: I have a working OS image that can run Torque/Moab, Slurm, and HTCondor depending on files in /etc/sysconfig. It allows for HTCondor jobs to glidein to a Slurm cluster.
DONE: Reaper: Clean nodes of unwanted files, dirs and procs. I don't think HTCondor will need this.
Slurm - Needs a reaper script to delete files/dirs and kill processes.
- If I run vncserver via Torque, my reaper script has to kill a bunch of processes when the job is done. But when I run vncserver via Slurm, those processes remain. So we will need some sort of reaper-type script for Slurm.
- There is https://slurm.schedmd.com/pam_slurm_adopt.html that tracks and kills errant processes but it conflicts with systemd and therefore requires some special installation instructions.
- Aha. I may have it working. You have to add PrologFlags=contain to both the client and server slurm.conf files.
- But it doesn't delete files or directories from /tmp, /var/tmp, or /dev/shm when the job ends.
- I will have to write a reaper script for files/dirs to use in Slurm system epilogs.
- Reading up on how pam_slurm_adopt works, it will probably never cooperate with systemd and therefore it is a hack and not future-proof. https://github.com/systemd/systemd/issues/13535 I am unsure how wise it is to start using pam_slurm_adopt in the first place.
- If we aren't going to use pam_slurm_adopt.so then reaper will need to kill procs and delete files/dirs just like it does with Torque/Moab.
- Done: slurm_reaper.py seems to work.
- HTCondor - Doesn't seem to need file/dirs nor proc reaped.
- Seems to handle /tmp, /var/tmp, and /dev/shm properly because it uses fake versions of these dirs for each job.
- It seems to handle errant processes as well.
- There is also condor_preen that cleans condor directories like /var/lib/condor/spool/...
- OpenPBS
- Probably needs a reaper script just like Torque does.
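For the file and directory half of a reaper, the core step is finding what a user left behind. The following is a hypothetical sketch, not the actual slurm_reaper.py: it walks the scratch locations named above and reports entries owned by a given UID. The function name and the default list of roots are assumptions.

```python
import os


def files_owned_by(uid, roots=("/tmp", "/var/tmp", "/dev/shm")):
    """Return paths under the given roots that are owned by uid.

    Nothing is deleted here. A real reaper would also need to skip users
    who still have other jobs running on the node.
    """
    found = []
    for root in roots:
        # os.walk does not follow symlinks by default and ignores
        # directories it cannot read.
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.lstat(path).st_uid == uid:
                        found.append(path)
                except OSError:
                    # The entry vanished or is unreadable; skip it.
                    continue
    return sorted(found)
```

An epilog could call this with the job owner's UID and remove the results deepest-first. Reporting before deleting makes it easier to test on a live node without losing anything.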
DONE: Queues: We want to keep the multiple queue functionality of Torque/Moab where, for example, HERA jobs go to hera nodes and VLASS jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
Slurm
- Queues are called partitions. At some level they are called partitions in Torque as well.
- Job preemption is disabled by default
- Allows for simple priority settings in partitions with the default PriorityType=priority/multifactor plugin.
HERA
PartitionName=hera Nodes=herapost[001-010] Default=YES MaxTime=144000 State=UP
- User: #SBATCH -p hera
- VLASS/VLASSTEST
- Server: PartitionName=vlass Nodes=nmpost[061-090] MaxTime=144000 State=UP Priority=1000
- Server: PartitionName=vlasstest Nodes=nmpost[061-070] MaxTime=144000 State=UP
- User: #SBATCH -p vlass
- There may not be a point to having both a vlass and vlasstest partition in Slurm. All the automated jobs (workflows) will be run in HTCondor. Slurm VLASS nodes will be for users to submit manual jobs. There will only be a few Slurm VLASS nodes and they may be used to glidein HTCondor jobs as needed. So I don't think we will need a vlasstest partition.
- HTCondor
- HTCondor doesn't have queues or partitions like Torque/Moab or Slurm but there are still ways to do what we need.
- Constraints, Custom ClassAds, and Ranks are all options. For example, HERA nodes could set the following in their configs
- HERA = True
- STARTD_ATTRS = $(STARTD_ATTRS) HERA
START = ($(START)) && (TARGET.partition =?= "HERA")
- and users could set the following in their submit files
- Requirements = (HERA =?= True) or Requirements = (HERA == True) The differences may not be important.
+partition = "HERA"
- We could do the same for VLASS/VLASSTEST but I don't know if HTCondor can prioritize VLASS over VLASSTEST the way we do with Moab. We could also do something like this for interactive nodes and nodescheduler if we end up using that.
- VLASS = True
- VLASSTEST = True
- STARTD_ATTRS = $(STARTD_ATTRS) VLASS VLASSTEST
START = ($(START)) && (TARGET.partition =?= "VLASS")
- and users could set the following in their submit files
- requirements = (VLASS =?= True) or requirements = (VLASSTEST =?= True)
+partition = "VLASS" or +partition = "VLASSTEST" depending on which they want
Rank = (VLASS =?= True) + (VLASSTEST =!= True) Force onto VLASS nodes first, then VLASSTEST nodes if necessary.
- Using separate pools for things like HERA and VLASS is an option, but may be overkill as it would require separate Central Managers.
- HTCondor does support accounting groups that may work like queues.
- Because of the design of HTCondor there isn't a central place to define the order and "queue" of nodes like there is in Torque.
- OpenPBS
- VLASS/VLASSTEST
- qmgr -c 'create resource qlist type=string_array, flag=h'
- Set resources: "ncpus, mem, arch, host, vnode, qlist" in sched_priv/sched_config and restart pbs_sched
- qmgr -c 'set queue vlass default_chunk.qlist = vlass'
- qmgr -c 'set queue vlasstest default_chunk.qlist = vlasstest'
- qmgr -c 'set node acedia resources_available.qlist = vlass'
- qmgr -c 'set node rastan resources_available.qlist = "vlass,vlasstest"'
DONE: Access: Would like to prevent users from being able to login to nodes unless they have a proper reservation. Right now we restrict access via /etc/security/access.conf and use Torque's pam_pbssimpleauth.so to allow access for any user running a job.
- Slurm
- Has a pam_slurm.so module which does seem to work like the pam_pbssimpleauth.so module.
- HTCondor
- How do we restrict access to condor nodes to only those users with valid jobs running?
- With the restrictions in access.conf, HTCondor can still run jobs as users like krowe2. I think this is because HTCondor doesn't use the login mechanism but just starts shells as the user.
- OpenPBS
- Doesn't come with a PAM module and the Torque PAM module doesn't work with OpenPBS.
- restrict_user and restrict_user_exceptions work in the mom_priv/config file but there is a max of 10 user exceptions. With a PAM module we could make as many exceptions as we like and can use groups and netgroups.
DONE: Ability to run jobs remotely (AWS, CHTC, OSG, etc)
- Slurm
- I don't think we will need this ability with Slurm
- HTCondor
- We have successfully tested both condor_annex to AWS, and flocking to CHTC.
- OpenPBS
- I don't think we will need this ability with OpenPBS
DONE: Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
- /etc/slurm/cgroup.conf
- HTCondor
- Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
- OpenPBS
- qmgr -c "set hook pbs_cgroups enabled = true"
DONE: Submit hosts: we may have several hosts that will need to be able to submit and delete jobs. (wirth, mcilroy, hamilton, etc)
- Slurm
- Slurm-20 requires systemd so hosts must be RHEL7 or later.
- HTCondor
- OpenPBS
- rpm -Uvh pbspro-client-19.1.3-0.x86_64.rpm and I am liking using Munge instead of other options like acl_hosts or hosts.equiv.
DONE: Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm
- Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node
- HTCondor
- Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
- OpenPBS
- Defaults to packing jobs. Set smp_cluster_dist: pack in /var/spool/pbs/sched_priv/sched_config on the central server.
- qmgr -c 'set server backfill_depth = 10'
DONE: Reservations: The ability to reserve nodes far in the future for things like CASA classes and SIW would be very helpful. It would need to prevent HTCondor from starting jobs on these nodes as reservation time approaches.
- Slurm
- scontrol create reservation starttime=now duration=5 nodes=testpost001 user=root
- scontrol create reservation starttime=2022-05-3T08:00:00 duration=21-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
- scontrol show res The output of this kinda sucks. Hopefully there is a better way to see all the reservations.
- HTCondor
- There isn't a reservation feature in HTCondor. Since CHTC makes use of preemption, their nodes can be removed at almost any time without adversely affecting running jobs. Sadly NRAO cannot really use preemption.
- OpenPBS
- pbs_rsub makes reservations but they are very different from Torque's. The following will make a reservation for the host nmpost001 as the user that runs the command. Root cannot make reservations.
- pbs_rsub -N siw2022 -R 202205030800 -E 202205240800 -l select=host=nmpost001 -l place=exclhost
- Then when this reservation time starts you will need to do something like kill pbs_mom and edit access.conf. Not as sexy as Torque/Moab.
- Version 2021.1, which requires RHEL8, has support for maintenance reservations (pbs_rsub --hosts <hostname>), which might be more useful.
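The Slurm and OpenPBS examples above describe the same window in two notations: a start plus a duration for scontrol, and a start and end for pbs_rsub. Here is a small Python sketch for converting between them, assuming the Slurm duration is always written in the full days-hours:minutes:seconds form. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_slurm_duration(text):
    """Parse a Slurm 'days-hours:minutes:seconds' duration into a timedelta.

    Only the full form (e.g. 21-0:0:0) is handled in this sketch.
    """
    days, _, clock = text.partition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)


def pbs_rsub_window(start, duration):
    """Return (-R, -E) strings in pbs_rsub's YYYYMMDDhhmm form."""
    begin = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S")
    end = begin + parse_slurm_duration(duration)
    fmt = "%Y%m%d%H%M"
    return begin.strftime(fmt), end.strftime(fmt)
```

For a start of 2022-05-03T08:00:00 and a duration of 21-0:0:0 this gives 202205030800 and 202205240800, which are the values used in the pbs_rsub example.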
- DONE: Array jobs: Do we want to keep the Torque array job functionality (e.g. #PBS -t 0-9%2)?
- Slurm
- #SBATCH --array=0-9%2 This syntax is very similar to Torque.
- HTCondor
- To some extent, this isn't how HTCondor is meant to be used. That said, DAGMan and the queue command can simulate this.
- queue 100 starts 100 copies of the job
- queue from seq 10 5 30 | will launch five jobs with $(item) set to 10, 15, 20, 25, 30
- queue item in 0, 1, 2, 3 Is another example. I don't think you can do the modulus feature with queue (i.e. %)
- You can throttle DAGMan jobs https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#throttling-nodes-by-category
- OpenPBS
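- #PBS -J 0-9:2 Note that :2 here is a step (indices 0, 2, 4, 6, 8), not a limit on concurrently running subjobs like %2 in Torque and Slurm.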
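To make the difference between the array notations concrete, here is a small Python sketch that expands the simple forms used above. It assumes, as I understand the syntax, that %N in Torque and Slurm caps the number of simultaneously running tasks while :N in OpenPBS is a step. Only the plain first-last forms are handled (Slurm also accepts comma lists and steps), and the function names are hypothetical.

```python
def slurm_array(spec):
    """Expand a 'first-last%limit' spec into (indices, max_running)."""
    rng, _, limit = spec.partition("%")
    first, last = (int(x) for x in rng.split("-"))
    return list(range(first, last + 1)), (int(limit) if limit else None)


def pbs_array(spec):
    """Expand an OpenPBS 'first-last:step' spec into its indices."""
    rng, _, step = spec.partition(":")
    first, last = (int(x) for x in rng.split("-"))
    return list(range(first, last + 1, int(step) if step else 1))
```

So `0-9%2` means ten tasks with at most two running at once, while `0-9:2` means five tasks with indices 0, 2, 4, 6, 8.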
DONE: MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
- mpich2
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
- OpenMPI
- Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal.
- HTCondor
- Single-node MPI jobs do work in the Vanilla universe.
- Multi-node MPI jobs require the creation of a Parallel universe. But it might be best to tell users that want multi-node MPI to use Slurm and not HTCondor.
- OpenPBS
- Provides a PBS_NODES_FILE just like Torque so should be pretty similar.
- DONE: While preemption can be useful in some circumstances I expect we will want it disabled for the foreseeable future.
Slurm
The default is PreemptType=preempt/none which means Slurm will not preempt jobs.
- HTCondor
- Setting a Machine Rank will cause jobs to be preempted https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- OpenPBS
- Defaults to preemption. Disable it by setting preemptive_sched: false ALL in /var/spool/pbs/sched_priv/sched_config on the central server.
DONE: Interactive: The ability to assign all or part of a node to a user with shell level access (nodescheduler, qsub -I, etc). Current minimal granularity is per NUMA node, but finer could be useful. Slurm and HTCondor lack the uniqueuser feature of Moab so implementing nodescheduler will at least be different if not difficult and at most be impossible. One thought is to ditch nodescheduler and just use the interactive commands that come with Slurm and HTCondor, but I am having some success implementing nodescheduler in Slurm with the --exclude syntax.
- nodescheduler: Was written before I understood what qsub -I did. Had I known, I may have argued to use qsub -I instead of nodescheduler as it is much simpler, is consistent with other installations of Torque, and may have forced some users to use batch processing which is much more efficient.
- nodescheduler features we like
- It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring something like screen, tmux, or vnc. It also means that users aren't all going through nmpost-master.
- Its creation is asynchronous. If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
- It's time limited (e.g. two weeks). We might be able to do the same with a queue/partition setting but could we then extend that reservation?
- We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes which may be less efficient. Then again it may be more efficient. But either way I think nodescheduler is simpler for our users.
- nodescheduler features we dislike
- With Torque/Moab asking for a NUMA node doesn't work as I would like. Because of bugs and limitations, I still have to ask for a specific amount of memory. The whole point of asking for a NUMA node was that I didn't need to know the resources of a node ahead of time but could just ask for half of a node. Sadly, that doesn't work with Torque/Moab.
- Because of the way I maintain the cgroup for the user, with /etc/cgrules.conf, I cannot let a user have more than one nodescheduler job on the same node or it will be impossible to know which cgroup an ssh connection should use. The interactive commands (qsub -I, etc) don't have this problem.
- Done: Slurm
- cgroup cleanup. Looking back at my epilogue script for Torque, I see that I did add a cgdelete line because the pbs_mom was getting put in the cgroup and therefore the cgroup wasn't getting deleted easily. Something similar may be happening with Slurm. I think the slurmstepd and perhaps other processes are getting put in the cgroup which causes difficulty deleting the cgroup. I think the slurmstepd is launched by root but changes its UID to krowe to do some stuff at which point it gets captured in the cgroup. So, perhaps the sleep SIGUSR2 combined with a cgdelete just in case may be a good solution.
- srun -p interactive --pty bash This logs the user into an interactive shell on a node with defaults (1 core, 1 GB memory) in the interactive partition.
- NUMA I don't see how Slurm can reserve NUMA nodes so we may have to just reserve X tasks with Y memory.
- naccesspolicy=uniqueuser I don't know how to keep Slurm from giving a user multiple portions of the same host. With Moab I used naccesspolicy=uniqueuser which prevents the ambiguity of which ssh connection goes to which cgroup. I could have nodescheduler check the nodes and assign one that the user isn't currently using but this is starting to turn nodescheduler into a scheduler of its own and I think may be more complication than we want to maintain.
- one job only What about enforcing one interactive job per user? nodescheduler exiting with an error if the user already has an interactive job running.
- routing queue What if I create a routing queue (Slurm can do those, yes?) and then walk that queue and assign them to nodes. Yes this would be starting to implement my own scheduler.
- exclude There is a -x , --exclude=<node name list> argument to sbatch.
But -x will only work if nodescheduler can find a free node at the moment. If it has to wait, then that excluded node may no longer be running a job by the user. Worse yet, the node that nodescheduler is about to give the user may have a new job by this user.
- What about combining -x with a test-and-resubmit function in the prolog script? Before setting up cgreg, if there is already an interactive job running on this node as the user, add this node to the exclude list and resubmit the interactive job.
- What if nodescheduler excludes nodes that the user is running interactive jobs on instead of letting it go to the prolog?
- SOLUTION: Using the ExcNodeList option combined with requeuehold and some other jiggery pokery in prolog and epilog scripts seems to have nodescheduler working.
- Better SOLUTION: Just have nodescheduler build a list of the nodes running interactive_j jobs for the user and then pass that list to sbatch with the --exclude option. This was James's idea. I hate it when he has good ideas.
- cgrules Slurm has system-level prolog/epilog functionality that should allow nodescheduler to set /etc/cgrules.conf which we will need because pam_slurm_adopt.so, which could do what /etc/cgrules.conf does, isn't an option.
- PAM The pam_slurm.so module can be used without modifying systemd and will block users that don't have a job running from logging in. The pam_slurm_adopt.so module requires removing some pam_systemd modules and does what pam_slurm.so does plus will put the user's login shell in the same cgroup as the slurm job expected to run the longest, which could replace my /etc/cgrules.conf hack. This still doesn't solve the problem of multiple interactive jobs by the same user on the same node. Removing the pam_systemd.so module prevents the creation of things like /run/user/<UID> and the XDG_RUNTIME_DIR and XDG_SESSION_ID which breaks VNC. So we may want to use just pam_slurm.so and not pam_slurm_adopt.so.
- But slurm at CHTC has neither pam_slurm.so nor pam_slurm_adopt.so configured and their nodes don't create /run/user/<UID> either. So it might just be slurm itself and not the pam modules causing the problem.
- Also, in order to install pam_slurm_adopt.so you have to not only disable systemd-logind but you must mask it as well. This prevents /run/user/<UID> from being created even if you login with ssh (e.g. no Slurm, Torque, or HTCondor involved).
- nodeextendjob Can Slurm extend the MaxTime of an interactive job? Yes scontrol update timelimit=+7-0:0:0 jobid=489 This extends the job's time limit by seven days.
- HTCondor
- condor_submit -i This logs the user into an interactive shell on a node with defaults (1 core equivalent, 0.5 GB memory)
- NUMA I don't see how HTCondor can reserve NUMA nodes so we may have to just reserve X tasks with Y memory.
- naccesspolicy=uniqueuser I don't think I need to worry about giving a user multiple portions of the same host if we are using condor_ssh_to_job. But if we aren't using condor_ssh_to_job then we could exclude hosts with requirements = Machine != hostname
- cgrules I don't know if HTCondor has the prologue/epilogue functionality to implement my /etc/cgrules.conf hack.
- PAM How can we allow a user to login to a node they have an interactive job running on via nodescheduler? With Torque or Slurm there are PAM modules but there isn't one for HTCondor.
- Could run a sleep job just like we do with Torque and use condor_ssh_to_job which seems to do X11 properly. We would probably want to make gygax part of the nmpost pool.
- cgreg I don't know if HTCondor has system-level prolog and epilog scripts to edit /etc/cgrules.conf
- OpenPBS
- Does not have a uniqueuser option so cannot do nodescheduler like Torque/Moab.
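The --exclude approach described for Slurm above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual nodescheduler code. It assumes the list of nodes comes from something like `squeue --noheader --user=<user> --partition=interactive --states=RUNNING --format=%N`, which prints one node list per running job, and that the interactive job is the sleep.sh script submitted to a partition named interactive.

```python
def nodes_in_use(squeue_output):
    """Collect unique node names from squeue output, one node list per line."""
    return sorted({line.strip() for line in squeue_output.splitlines() if line.strip()})


def sbatch_command(script, squeue_output, partition="interactive"):
    """Build an sbatch argument list that avoids nodes the user already holds."""
    cmd = ["sbatch", "--partition", partition]
    nodes = nodes_in_use(squeue_output)
    if nodes:
        # --exclude takes a comma-separated list of node names.
        cmd.append("--exclude=" + ",".join(nodes))
    cmd.append(script)
    return cmd
```

Because the exclude list is computed at submit time, this still has the race noted above if the job waits in the queue, but it keeps the logic in nodescheduler rather than in the prolog.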
- Nodevnc Given the limitation of Slurm and HTCondor and that we already recommend users use VNC on their interactive nodes, why don't we just provide a nodevnc script that reserves a node (via Torque, Slurm or HTCondor), starts a VNC server and then tells the user it is ready and how to connect to it? If someone still needs/wants just simple terminal access, then qsub -I or srun --pty bash or condor_submit -i might suffice.
- DONE: Torque
- I can actually successfully launch a VNC session using my nodevnc-pbs script even though there is no /run/user/<UID> on the node. I have not changed this nodevnc-pbs script in six months. It works because, although Torque (like Slurm) doesn't create /run/user/<UID>, Torque also doesn't set the XDG_RUNTIME_DIR variable, whereas Slurm does. This is good news: since Torque neither creates /run/user/<UID> nor sets XDG_RUNTIME_DIR and we have been using RHEL7 since late 2020 without issue, unsetting XDG_RUNTIME_DIR in Slurm is not likely to cause us problems.
- DONE: Slurm
- /run/user/<UID> Slurm doesn't actually run /bin/login so things like /run/user/<UID> are not created yet XDG_RUNTIME_DIR is still set for some reason which causes vncserver to produce errors like Call to lnusertemp failed upon connection with vncviewer.
- If I unset XDG_RUNTIME_DIR in the slurm script, I can successfully connect to VNC. Why is Slurm setting this when it isn't making the directory? I think this may be a bug in Slurm. Perhaps Slurm is setting this variable instead of letting pam_systemd.so and/or systemd-logind set it. There is a bug report https://bugs.schedmd.com/show_bug.cgi?id=5920 where the developers think this is being caused because of their pam_slurm_adopt.so module but I don't think that is the case.
- Would it be best to just unset XDG_RUNTIME_DIR in a system prolog? I don't think the prolog can unset an environment variable.
- Since the default behavior of Slurm is to export all environment variables to the job, I think this is why XDG_RUNTIME_DIR is getting set on the execute host.
- YES: loginctl enable-linger krowe, run VNC, then loginctl disable-linger krowe. I could maybe put this in a prolog/epilog.
- I can successfully run Xvfb without /run/user/<UID>.
- I have successfully run small CASA tests with xvfb-run.
- A work-around could be something like the following, but there might be other things broken because of the missing /run/user/<UID> and ${XDG_RUNTIME_DIR}/gvfs is actually a fuse mount and reaper does not know how to umount things.
- mkdir /tmp/${USER}
- export XDG_RUNTIME_DIR=/tmp/${USER}
- Reading up on how pam_slurm_adopt works, it will probably never cooperate with systemd and therefore it is a hack and not future-proof. https://github.com/systemd/systemd/issues/13535 I am unsure how wise it is to start using pam_slurm_adopt in the first place.
- So if I don't install pam_slurm_adopt.so, which I only installed because it seemed better than my /etc/cgrules.conf hack, which I only created for nodescheduler after we started using cgroups, then I think I can get nodevnc working as a pseudo-replacement for nodescheduler. If we do use nodevnc and don't use nodescheduler (which we mostly can't) then we may not want to use the pam_slurm.so module either, so that users can't log in to nodes they have reserved and possibly use resources they aren't scheduled for. If they really need to log in to a node where they are running a job, Slurm has something similar to HTCondor's condor_ssh_to_job, which is srun --jobid jobid --pty bash -l. But you need to set PrologFlags=x11 in slurm.conf, only one terminal can connect with srun in this way at a time, and the DISPLAY seems to only work under certain situations. Basically, this is not a useful mechanism for users. X11 forwarding works a little better if I use salloc instead of sbatch sleep.sh but it still only allows one terminal at a time and doesn't work with the --no-shell option.
- HTCondor
- HTCondor doesn't seem to create /run/user/<UID> either here (8.9.7) or at CHTC (8.9.11). I can get vncserver to run at CHTC by setting HOME=$TMPDIR and transferring ~/.vnc but I am unable to connect to it via vncviewer. The connection times out. This makes me think that even if I can get vncserver working, which I may have done at CHTC, it still will give me the lnusertemp error because of the missing /run/user/<UID>.
- Xserver Since we run an X server on our nmpost nodes (ironically, to allow VNC and remote X from thin clients), starting a vncserver from HTCondor fails. HTCondor bind mounts a fresh /tmp for the job, so vncserver doesn't see the /tmp/.X11-unix socket of the running X server. It therefore tries to start an X server, which fails because the port is already in use.
- Mar. 30, 2021 krowe: I upgraded all the execute hosts to 8.9.11 for the fix to James's memory problem (actually fixed in 8.9.9) and now my nodevnc-htcondor script works. Perhaps something in the new version of condor fixed things? It still isn't creating a /run/user/UID but maybe that isn't really necessary.
- Apr. 12, 2021 krowe: nodevnc-htcondor did not start when HTCondor selected a node that was running a job for James on nmpost106. Yet it seems to let me run two nodevnc jobs on the same testpost node. Is it because of James, the nmpost node or something else? After James's job finished I was able to run nodevnc on nmpost106, so it was James's job. The problem is xvfb-run is preventing nodevnc from establishing a listening socket.
- screen?
- tmux?
- nodescheduler: Was written before I understood what qsub -I did. Had I known, I may have argued to use qsub -I instead of nodescheduler as it is much simpler, is consistent with other installations of Torque, and may have forced some users to use batch processing which is much more efficient.
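The unset XDG_RUNTIME_DIR step from the Slurm nodevnc notes above could look something like the following. This is only a sketch, assuming it runs inside the job script itself (not a prolog) before vncserver is started. The function name fix_xdg_runtime_dir is made up for illustration and is not part of Slurm or any existing nodevnc script.

```shell
# Sketch only: fix_xdg_runtime_dir is a hypothetical helper name.

# Unset XDG_RUNTIME_DIR when it names a directory that does not exist on
# this host. A directory that does exist is left alone.
fix_xdg_runtime_dir() {
    if [ -n "${XDG_RUNTIME_DIR:-}" ] && [ ! -d "$XDG_RUNTIME_DIR" ]; then
        unset XDG_RUNTIME_DIR
    fi
}

# In the job script this would be called once, before starting vncserver:
#   fix_xdg_runtime_dir
```

Checking for the directory first, rather than always unsetting, means the same script should behave the same on a host where /run/user/<UID> does get created.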
Define how SSA uses Torque/Moab and what they should do for Slurm
Here is an example of a 'java-heavy' SSA call using Torque/Moab
[/usr/bin/sudo, -u, almapipe, /opt/services/torque/bin/qsub, -q, batch, -l, nodes=1:ppn=1,mem=18gb,vmem=19gb,epilogue=/lustre/naasc/web/almapipe/workflows/vatest/bin/epilogue,walltime=12:00:00:00, -v, CAPSULE_CACHE_DIR=~/.capsule-vatest, -v, CAPO_PROFILE=vatest, -V, -d, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/, -m, a, -M, jgoldste,dlyons,jsheckar, -N, PrepareWorkingDirectoryJob.vatest.86b484f2-dfda-4f51-ad71-c808066441de, -o, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.out.txt, -e, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.err.txt, -W, umask=0117, -F, 18 -c edu.nrao.archive.workflow.jobs.PrepareWorkingDirectoryJob -p vatest -w /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298, /lustre/naasc/web/almapipe/workflows/vatest/bin/job-runner.sh]
Using my Cluster Translation Table at https://staff.nrao.edu/wiki/bin/view/NM/ClusterCommands here is what I suggest for Slurm. Notable differences:
- Slurm doesn't provide user-level prologue/epilogue scripts with sbatch
- Slurm can't set the umask of a job
- Slurm exports all environment variables to the job by default
- arguments to the script are added to the end of the sbatch command
[/usr/bin/sudo, -u, almapipe, /usr/bin/sbatch, -p, batch, -N, 1, -n, 1, --mem=18G, -t, 12-00:00:00, --export=ALL,CAPSULE_CACHE_DIR=~/.capsule-vatest,CAPO_PROFILE=vatest, -D, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/, --mail-type=FAIL, --mail-user=jgoldste,dlyons,jsheckar, -J, PrepareWorkingDirectoryJob.vatest.86b484f2-dfda-4f51-ad71-c808066441de, -o, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.out.txt, -e, /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298/PrepareWorkingDirectoryJob.err.txt, /lustre/naasc/web/almapipe/workflows/vatest/bin/job-runner.sh, 18 -c edu.nrao.archive.workflow.jobs.PrepareWorkingDirectoryJob -p vatest -w /lustre/naasc/web/almapipe/pipeline/vatest/tmp/ArchiveWorkflowStartupTask_runAlmaBasicRestoreWorkflow_4276995994868118298]
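The sbatch call above drops the umask=0117 and epilogue= settings of the Torque call. One possible way to get them back is a small wrapper that sbatch runs in place of job-runner.sh. This is only a sketch under assumptions: run_job is a made-up name, and the epilogue is called with no arguments, which is not what Torque does.

```shell
# Sketch only: run_job is a hypothetical wrapper, not an existing tool.

# Usage: run_job UMASK EPILOGUE COMMAND [ARGS...]
# Runs COMMAND with the given umask, then runs EPILOGUE if it is an
# executable file, and returns the exit status of COMMAND.
run_job() {
    local mask=$1 epilogue=$2
    shift 2
    local status=0
    # The subshell keeps the umask change away from the caller.
    (
        umask "$mask"
        "$@"
    ) || status=$?
    # Run the epilogue whether or not the command succeeded.
    if [ -x "$epilogue" ]; then
        "$epilogue" || true
    fi
    return "$status"
}
```

With something like this saved as a script that ends in run_job "$@", the sbatch arguments would name the wrapper, followed by 0117, the epilogue path, and then job-runner.sh with its existing arguments.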
Why can't we implement nodescheduler in Slurm?
In a word, uniqueuser.
Moab has an option to qsub (-l naccesspolicy=uniqueuser) that prevents a user's job from running on a node where that same user is already running a job. This allows my /etc/cgrules.conf hack to add a user's login shell to the cgroup of the interactive job running on that node. Without uniqueuser, a user could have two interactive jobs running on the same node and when they log in to that node, my /etc/cgrules.conf hack would have no way of knowing to which cgroup it should add the login shell.
I don't see a similar function in Slurm so there is no way for my /etc/cgrules.conf hack to put shells into the right cgroup.
We could have nodescheduler check the nodes and assign one that the user isn't currently using but this is starting to turn nodescheduler into a scheduler of its own and I think may be more complication than we want to maintain. It would also have to find nodes with all the other requirements (free cores, free mem, etc). Also, it will introduce a race condition where nodescheduler may reserve a node that Moab just gave to some other job.
pam_slurm_adopt.so seems like it might help because it moves shells into the cgroup of the user's job that is expected to run the longest. That's nice, but now if a user has two interactive jobs on the same node, all shells will be put in the cgroup of just one of the jobs, thus never utilizing the resources reserved for the other job. Also, in order to install pam_slurm_adopt.so you have to not only disable systemd-logind but you must mask it as well. This prevents /run/user/<UID> from being created, even if you login with ssh (e.g. no Slurm, Torque, or HTCondor involved). As I understand it, pam_slurm_adopt.so is not planned to ever work properly with systemd so I think it may be a non-starter.
srun --jobid seems like it might help because it logs a user into a running reservation based on the jobid. But, it doesn't tunnel X11 reliably and only allows for one connection to a job at a time.
Well, it seems I can, if I check which nodes the user already has jobs on and pass them to sbatch with the --exclude argument.
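A rough sketch of that --exclude idea, assuming squeue and sbatch are in the PATH. The function names nodes_in_use, join_nodes and submit_unique are made up for illustration. The squeue options used are -h (no header), -u (user), -t (job state) and -o '%N' (node list).

```shell
# Sketch only: all three function names are hypothetical.

# Print the node list of every running job owned by user $1, one per line.
nodes_in_use() {
    squeue -h -u "$1" -t RUNNING -o '%N'
}

# Read node names on stdin, drop blank lines and duplicates, and join the
# rest with commas, which is the form sbatch --exclude accepts.
join_nodes() {
    awk 'NF' | sort -u | paste -sd, -
}

# Usage: submit_unique USER SBATCH_ARGS...
# Submit a job that avoids every node USER is already running on.
submit_unique() {
    local user=$1
    shift
    local exclude
    exclude=$(nodes_in_use "$user" | join_nodes)
    if [ -n "$exclude" ]; then
        sbatch --exclude="$exclude" "$@"
    else
        sbatch "$@"
    fi
}
```

This only looks at running jobs, so two reservations submitted while both are still pending could presumably still land on the same node.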
Replacement options for Torque/Moab (Pros and Cons)
Feature | Torque | OpenPBS | Slurm | HTCondor |
---|---|---|---|---|
Working directory | Yes both -d and -w | No -d nor -w to set working directory | Yes -D | |
Passed args | Yes -F | No. At least, what the man page describes doesn't work for me. | Yes | |
Prolog/Epilog | Yes | No user-level prolog/epilog scripts. | No user-level prolog/epilog scripts. | |
Array jobs | Yes | Yes | Yes | Uses DAGs instead of array jobs |
Complex queues | Can handle vlass/vlasstest queues | Can handle vlass/vlasstest queues | Can handle vlass/vlasstest queues but they are partitions not queues. Should be fine. | Uses requirements instead of queues but should be sufficient |
Reservations | Yes | Reservations work differently but may still be useful. Version 2021.1 may do this better. | Yes | No way to reserve nodes for maintenance or special occasions. |
Authorization | Yes. PAM module | No PAM module. The MoM can kill processes not running a job and not owned by up to 10 special users. | Has a PAM module similar to Torque | |
Remote Jobs | Maybe with Nodus but I was unimpressed | Presumably with Altair Control | Yes to CHTC, OSG, AWS | |
cgroups | Yes with cpuset | Yes both cpuset and cpuacct | Yes with cpuset | Yes with cpuacct |
Multiple Submit Hosts | Yes | Yes | Yes | Yes |
Pack jobs | Yes | Yes | Yes | Yes |
Multi-node MPI | Yes | Yes | Yes | Yes but needs the Parallel Universe |
Preemption | Yes but can be disabled | Yes but can be disabled | Yes but can be disabled | |
nodescheduler | Yes because of cgreg and uniqueuser | No | Yes with --exclude | No |
nodevnc | Yes | Yes | Yes but is buggy | |
Cleans Up files and processes | No. Will require a reaper script | No. Will require a reaper script | No. Will require a reaper script. Doesn't clean up cgroups well either. | Yes |
Node order | Yes. The nodefile defines order | Not really a way to set the order in which the scheduler will give out nodes | Not really a way to set the order in which the scheduler will give out nodes | |