This document proposes a broad change to NRAO's computational approach with respect to CASA in order to improve overall efficiency. Within industry, strategies are generally framed as either High Performance Computing (HPC) or High Throughput Computing (HTC). Neither approach is optimal for NRAO, so what is proposed here is a High Efficiency Computing approach that considers all aspects of the problem, including hardware configuration and performance, PI time, and the costs of hardware and software development. Because the goal is broad efficiency across multiple disjoint axes, there is no single final target; instead there is a finite series of stages, each achieving some incremental improvement in efficiency, with each subsequent stage going through a review and prototype process before commencing.
https://staff.nrao.edu/wiki/bin/view/NM/HTCondor#Conversion
https://staff.nrao.edu/wiki/bin/view/NM/Slurm
Queues: We want to keep the multiple queue functionality of Torque/Moab where, for example, HERA jobs go to hera nodes and VLASS jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
- Slurm
- Queues are called partitions. At some level they are called partitions in Torque as well.
- Job preemption is disabled by default.
- Allows for simple priority settings in partitions with the default PriorityType=priority/basic plugin.
- E.g. PartitionName=vlass Nodes=testpost[002-004] MaxTime=144000 State=UP Priority=1000
- HTCondor
- HTCondor doesn't have queues or partitions like Torque/Moab or Slurm but there are still ways to do what we need.
- Using separate pools for things like HERA and VLASS is an option, but may be overkill as it would require separate Central Managers.
- Requirements (constraints) are an option. For example, HERA nodes could set the following in their configs:
- HERA = True
- STARTD_ATTRS = HERA, $(STARTD_ATTRS)
- and users could set the following in their submit files
- Requirements = (HERA =?= True)
- We could do the same for VLASS/VLASSTEST but I don't know if HTCondor can prioritize VLASS over VLASSTEST the way we do with Moab. We could also do something like this for interactive nodes and nodescheduler if we end up using that.
- VLASS = True
- VLASSTEST = True
- STARTD_ATTRS = VLASS, VLASSTEST, $(STARTD_ATTRS)
- then users would set either requirements = (VLASS =?= True) or requirements = (VLASSTEST =?= True)
- Or, to keep the current priority behavior where VLASS jobs only run on VLASSTEST nodes when the VLASS nodes are busy, the user could set Rank = (VLASS == 1) and requirements = (VLASS =?= True) so the job prefers a VLASS node and only lands on a VLASSTEST node when no VLASS nodes are available (a combined sketch follows this list).
- HTCondor does support accounting groups that may work like queues.
- Because of the design of HTCondor there isn't a central place to define the order and "queue" of nodes like there is in Torque.
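- A combined sketch of the constraint-plus-rank idea above, assuming vlasstest nodes advertise both VLASS and VLASSTEST while dedicated vlass nodes advertise only VLASS; ranking on "not VLASSTEST" is one possible way to express the preference and is an untested assumption, not a recipe.
    # node config fragment on a vlasstest node, e.g. /etc/condor/config.d/99-nrao
    VLASS = True
    VLASSTEST = True
    STARTD_ATTRS = VLASS, VLASSTEST, $(STARTD_ATTRS)

    # user submit file (plus the usual executable, output, and queue lines):
    # run anywhere VLASS is advertised, but prefer nodes that are not vlasstest
    requirements = (VLASS =?= True)
    rank = (VLASSTEST =!= True)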
Interactive: The ability to assign all or part of a node to a user with shell level access (nodescheduler, qsub -I, etc), minimal granularity is per NUMA node, finer would be useful.
- nodescheduler: What is it that we like about nodescheduler over something like qsub -I or srun --pty bash or condor_submit -i?
- It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring screen or tmux or vnc. It also means that users aren't all going through nmpost-master.
- Its creation is asynchronous. If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
- It's time limited (e.g. two weeks). We might be able to do the same with a queue/partition setting but could we then extend that reservation?
- We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes which may be less efficient. Then again it may be more efficient. But either way it is simpler for our users.
- Slurm
- srun --pty bash
- I don't see how Slurm can reserve NUMA nodes so we will have to just reserve X tasks with Y memory.
- I don't know how to keep Slurm from giving a user multiple portions of the same host. With Moab I used naccesspolicy=uniqueuser, which prevents the ambiguity of which ssh connection goes to which cgroup.
- HTCondor
- Can HTCondor even do this?
- condor_submit -interactive can be shortened to just condor_submit -i
- Could run a sleep job just like we do with Torque and use condor_ssh_to_job which seems to do X11 properly. We would probably want to make gygax part of the nmpost pool.
- nodevnc
- Given the limitations of Slurm and HTCondor, and since we already recommend users run VNC on their interactive nodes, why not just provide a nodevnc script that reserves a node (via Torque, Slurm, or HTCondor), starts a VNC server, and then tells the user it is ready and how to connect to it? If someone still needs or wants simple terminal access, then qsub -I, srun --pty bash, or condor_submit -i might suffice. A rough sketch follows this list.
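- A rough sketch of what a Slurm-backed nodevnc could look like; the script paths, partition name, time limit, display number, and mail delivery are all assumptions, and Torque or HTCondor variants would differ only in the submission command.
    #!/bin/sh
    # nodevnc (hypothetical wrapper): queue the VNC job and tell the user it is pending
    sbatch --partition=interactive --time=14-0 --job-name=nodevnc /usr/local/bin/nodevnc.sbatch
    echo "Your VNC session has been requested; you will get mail when it is ready."

    # /usr/local/bin/nodevnc.sbatch (hypothetical path)
    #!/bin/sh
    vncserver :1                       # display number hard-coded for simplicity
    echo "VNC ready on $(hostname):5901" | mail -s "nodevnc ready" "$USER"
    sleep 14d                          # hold the node for the full reservation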
Access: Would like to prevent users from being able to login to nodes unless they have a proper reservation.
- Slurm
- Has a pam_slurm.so module similar to pam_pbssimpleauth.so.
- HTCondor
- Since I don't think we will be using nodescheduler with HTCondor, this isn't needed.
Reservations: The ability to reserve nodes far in the future for things like CASA classes and SIW would be very helpful. It would need to prevent HTCondor from starting jobs on these nodes as reservation time approaches.
- Slurm
- scontrol create reservation starttime=now duration=5 nodes=testpost001 user=root
- scontrol create reservation starttime=2022-05-03T08:00:00 duration=21-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
- scontrol show res lists reservations, but the output is hard to read; hopefully there is a better way to see them all (see the note after this list).
- HTCondor
- This isn't really something HTCondor is designed to do. We will use Slurm for this.
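- As a possible alternative for listing reservations, sinfo has a reservation-only view; this is untested here and offered only as a suggestion.
    # one line per reservation (name, state, start/end times, duration, node list)
    sinfo --reservation
    # short form
    sinfo -T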
Ability to run jobs remotely (AWS, CHTC, OSG, etc)
- Slurm
- I don't think we will need this ability with Slurm
- HTCondor
- I have tested both condor_annex to AWS and flocking to CHTC (a flocking config sketch follows this list).
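- A minimal sketch of the submit-host side of flocking; the file name and the CHTC host name are placeholders, and the remote pool would also have to allow us in via its FLOCK_FROM and security settings.
    # /etc/condor/config.d/99-flock on the submit host (hypothetical file name)
    # jobs that cannot match in our pool may be sent to this remote pool
    FLOCK_TO = cm.chtc.wisc.edu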
Array jobs: Do we want to keep the Torque array job functionality?
- Slurm
- #SBATCH --array=0-3%2 (this syntax is very similar to Torque's).
- HTCondor
- To some extent, this isn't how HTCondor is meant to be used; in other respects, DAGMan and the queue command can simulate it (a sketch follows this list).
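- A minimal sketch of simulating a four-element array job with the queue command; the executable and file names are assumptions, and throttling (the %2 part of the Slurm example) is not shown since that would need DAGMan or similar.
    # submit file: $(Process) plays the role of the array index (0-3)
    executable = process_chunk.sh
    arguments  = $(Process)
    output     = chunk_$(Process).out
    error      = chunk_$(Process).err
    log        = chunks.log
    queue 4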
MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
- mpich2
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
- OpenMPI
- Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal (see the sketch after this list).
- HTCondor
- Single-node MPI jobs should work in the Vanilla universe.
- Multi-node MPI jobs would require setting up the Parallel universe or using Slurm instead.
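- A minimal sketch of a Slurm batch script for an OpenMPI job as described above; the task count, memory, time limit, and program names are assumptions and would need to match our actual installs.
    #!/bin/sh
    #SBATCH --ntasks=16          # MPI ranks (cores)
    #SBATCH --mem-per-cpu=8G     # assumed memory request
    #SBATCH --time=2-0           # two days; adjust as needed
    mpiexec -n $SLURM_NTASKS ./my_mpi_program
    # or, for CASA, something like:
    # mpicasa -n $SLURM_NTASKS casa --nogui -c my_script.py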
Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
- /etc/slurm/cgroup.conf (a minimal sketch follows this list)
- HTCondor
- Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
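- A minimal sketch of what cgroup enforcement could look like; the exact parameters should be checked against the Slurm version we deploy.
    # /etc/slurm/cgroup.conf
    ConstrainCores=yes        # pin tasks to their allocated cores
    ConstrainRAMSpace=yes     # enforce the job's memory request
    ConstrainSwapSpace=yes    # keep jobs from spilling into swap

    # /etc/slurm/slurm.conf needs the matching plugins
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup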
Submit hosts: We may have several hosts that will need to be able to submit and delete jobs (wirth, mcilroy, hamilton, etc.).
- Slurm
- Slurm-20 requires systemd so hosts must be RHEL7 or later.
- HTCondor
Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm
- Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node
- HTCondor
- Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
Reaper: Clean nodes of unwanted files, directories, and processes.
- Slurm
- There is a pam_slurm_adopt.so module that supposedly tracks and kills errant processes, but it conflicts with systemd and therefore requires some special tweaking.
- HTCondor
- Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
- /dev/shm is now handled as of HTCondor-8.9.7
- What about errant processes? A rough reaper sketch follows this list.
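- A rough sketch of a reaper pass over /dev/shm and stray processes on a Slurm execute node; this only illustrates the idea, the uid threshold and the decision to report rather than kill are assumptions, and a real version would need careful testing.
    #!/bin/sh
    # users that currently have jobs on this node, according to Slurm
    active=$(squeue -h -o %u -w "$(hostname -s)" | sort -u)
    # clean /dev/shm entries owned by regular users with no active job
    for path in /dev/shm/*; do
        [ -e "$path" ] || continue
        [ "$(stat -c %u "$path")" -ge 1000 ] || continue   # skip system-owned entries
        owner=$(stat -c %U "$path")
        echo "$active" | grep -qx "$owner" || rm -rf "$path"
    done
    # report (not kill) processes owned by regular users with no active job
    ps -eo user:32,pid,comm --no-headers | while read -r owner pid comm; do
        uid=$(id -u "$owner" 2>/dev/null) || continue
        [ "$uid" -ge 1000 ] || continue
        echo "$active" | grep -qx "$owner" || echo "stray process: $owner $pid $comm"
    done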
Reaper: Cancel jobs when accounts are closed.
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default.
- Slurm
- The order of the nodes in PartitionName is not important, but you can set a Weight on each NodeName; nodes with the lowest weight are chosen first (see the example after this list).
- HTCondor
- Can HTCondor do this?
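- A minimal slurm.conf sketch of node weighting; the node ranges and weight values are placeholders, and the Weight parameter would simply be added to our existing NodeName lines.
    # /etc/slurm/slurm.conf: lower Weight is chosen first
    NodeName=nmpost[061-090] Weight=10   # newer, faster nodes
    NodeName=nmpost[001-060] Weight=50   # older nodes, used when the fast ones are busy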
Preemption: While preemption can be useful in some circumstances, I expect we will want it disabled for the foreseeable future.
- Slurm
- The default is PreemptType=preempt/none, which means Slurm will not preempt jobs.
- Run both Slurm and HTCondor on the same nodes
- Server Layout: When I initially installed Torque/Maui I wanted a "secret server" that users would not log in to, which ran the important services and therefore would not suffer any user shenanigans like running out of memory or CPU; this is what nmpost-serv-1 is. Then I wanted a submit host that all users would use to submit jobs, so that if it suffered shenanigans, jobs would not be lost or otherwise interrupted; this is what nmpost-master is. This system has performed pretty well. For HTCondor, this setup seemed to map well to their ideas of execution host, submit host, and central manager. But if we want to do flocking or condor_annex, we will need both the submit host and the central manager to have external IPs, which makes me question keeping the central manager and submit host as separate hosts.
- The central manager will need an external IP.
- The submit host that can flock will need an external IP.
- The submit host that runs condor_annex will need an external IP.
https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.
Now that we have a list of requirements, I think the next step is to create a step-by-step procedure document listing everything that needs to be done to migrate from Torque/Moab to HTCondor and perhaps also Slurm. See: Procedure to replace Torque/Moab with HTCondor/Slurm.