This document proposes a broad change to NRAO’s computational approach with respect to CASA to improve overall efficiency. Industry strategies are generally classed as either High Performance Computing (HPC) or High Throughput Computing (HTC). Neither approach is optimal for NRAO, so what is proposed here is a High Efficiency Computing approach that considers all aspects of the problem, including hardware configuration and performance, PI time, and the costs of hardware and software development. Because the goal is broad efficiency across multiple disjoint axes, there is no single end state being aimed at. Instead there is a finite series of stages, each achieving some incremental improvement in efficiency, with the next stage going through a review and prototype process before it commences.
https://staff.nrao.edu/wiki/bin/view/NM/HTCondor#Conversion
https://staff.nrao.edu/wiki/bin/view/NM/Slurm
Queues: We want to keep the queue functionality of Torque/Moab where, for example, hera jobs go to hera nodes, vlass jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
- In Slurm, queues are called partitions. At some level they are called partitions in Torque as well.
- In Slurm, job preemption is disabled by default.
- Slurm allows for simple priority settings in partitions with the default PriorityType=priority/basic plugin.
- E.g. PartitionName=vlass Nodes=testpost[002-004] MaxTime=144000 State=UP Priority=1000
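As a rough sketch of the vlasstest requirement above, two partitions could share the same nodes, with the test partition given a higher scheduling tier and preemption left off. The vlasstest line, the node list reuse, and the tier values are assumptions for illustration only, not a tested configuration.

```
# slurm.conf fragment (illustrative sketch, values are assumptions)
PreemptType=preempt/none        # running jobs are never preempted
PriorityType=priority/basic     # simple priority ordering, as noted above

# Both partitions point at the same nodes. Pending jobs in the partition
# with the higher PriorityTier are considered first by the scheduler.
PartitionName=vlass     Nodes=testpost[002-004] MaxTime=144000 State=UP PriorityTier=1000
PartitionName=vlasstest Nodes=testpost[002-004] MaxTime=144000 State=UP PriorityTier=2000
```

With a layout like this, a pending vlasstest job should be considered ahead of pending vlass jobs when nodes free up, while anything already running is left alone.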
Interactive: The ability to assign all or part of a node to a user with shell level access (nodescheduler, qsub -I, etc.). Minimal granularity is per NUMA node; finer would be useful.
- What is it that we like about nodescheduler over something like qsub -I?
- It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring screen or tmux or VNC.
- Its creation is asynchronous. If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
- It's time limited (e.g. two weeks). We might be able to do the same with a queue/partition setting but could we then extend that reservation?
- We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes which may be less efficient. Then again it may be more efficient. But either way it is simpler for our users.
Access: We would like to prevent users from logging in to nodes unless they have a proper reservation.
- Slurm has a pam_slurm.so module similar to pam_pbssimpleauth.so.
Reservations: The ability to reserve nodes far in the future for things like CASA classes and SIW would be very helpful. It would need to prevent HTCondor from starting jobs on these nodes as reservation time approaches.
- In Slurm
- scontrol create reservation starttime=now duration=5 nodes=testpost001 user=root
- scontrol create reservation starttime=2022-05-03T08:00:00 duration=21-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
- scontrol show res The output of this kinda sucks. Hopefully there is a better way to see all the reservations.
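One possible way to get a more readable list of reservations is to ask scontrol for one record per line (its -o / --oneliner option) and pick out a few fields. The helper below is a minimal sketch: the name res_table and the choice of columns are assumptions, and it only expects the usual Key=Value layout of scontrol output.

```shell
# res_table: hypothetical helper that reads one-record-per-line
# "Key=Value Key=Value ..." text on stdin and prints four columns:
# reservation name, start time, end time, nodes.
res_table() {
    awk '{
        name = start = stop = nodes = "-"
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")            # split at the first "=" only
            if (eq == 0) continue
            key = substr($i, 1, eq - 1)
            val = substr($i, eq + 1)
            if (key == "ReservationName") name = val
            else if (key == "StartTime")  start = val
            else if (key == "EndTime")    stop = val
            else if (key == "Nodes")      nodes = val
        }
        printf "%-12s %-20s %-20s %s\n", name, start, stop, nodes
    }'
}

# Intended usage on a Slurm host:
#   scontrol -o show res | res_table
```

sinfo -T also prints reservations in a compact tabular form, which may be enough without any extra scripting.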
Ability to run jobs remotely (AWS, CHTC, OSG, etc)
Array jobs: Do we want to keep the Torque array job functionality?
- Slurm
- #SBATCH --array=0-3%2 This syntax is very similar to Torque.
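To make the array syntax concrete, here is a minimal sketch of a Slurm batch script using --array=0-3%2 (four tasks, at most two running at once). The job name, output pattern, and input file names are hypothetical. For comparison, the corresponding Torque directive would be along the lines of #PBS -t 0-3%2.

```shell
#!/bin/bash
#SBATCH --array=0-3%2                  # tasks 0..3, at most 2 running at once
#SBATCH --job-name=array_demo          # hypothetical job name
#SBATCH --output=array_demo_%A_%a.out  # %A = array job ID, %a = task index

# Hypothetical input list, one entry per array task.
INPUTS="obs_a.ms obs_b.ms obs_c.ms obs_d.ms"

# Print the input assigned to a zero-based task index.
input_for_task() {
    idx=$1
    # Intentional word splitting: load the list into the positional parameters.
    set -- $INPUTS
    shift "$idx"
    echo "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
echo "task ${task_id}: $(input_for_task "$task_id")"
```

Each array task runs the same script and selects its own input from the task index, so the per-task work stays in one file.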
MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
Submit hosts: We may have several hosts that will need to be able to submit and delete jobs (wirth, mcilroy, hamilton, etc.).
Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm has a sched/backfill plugin that backfills jobs similar to Torque/Moab.
Reaper: Clean nodes of unwanted files, dirs and procs. Condor seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job. But /dev/shm is still an issue. What about errant processes?
Reaper: Cancel jobs when accounts are closed.
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default. Can HTCondor do this?
While preemption can be useful in some circumstances, I expect we will want it disabled for the foreseeable future.