Procedure to replace Torque/Moab with HTCondor/Slurm

Currently, the nmpost cluster is a mix of Torque/Moab (nmpost{001..090}) and HTCondor (nmpost{091..120}, devhost{001..002}). Eventually we would like to replace Torque/Moab with Slurm, as we think it can do most, if not all, of what Torque/Moab does, but it is free and seems to be more commonly used these days than Torque/Moab.

We upgraded to Torque-6/Moab-9 and thus started having to pay for Torque/Moab in 2018. We did this because Torque-6 understood cgroups and NUMA nodes (although it doesn't handle NUMA nodes how I would like). Torque-6 was no longer compatible with the free scheduler Maui, which forced us to purchase the Moab scheduler. Since then we have leveraged a couple of things Moab can do that Maui never could, such as increasing the number of jobs the scheduler looks ahead to schedule. This allows Moab to start reserving space for pending vlass jobs on vlasstest nodes, but it is not a critical requirement. Largely, the win was cgroups for resource separation and NUMA nodes to double the number of interactive nodes. Both of these only required the new version of Torque, which in turn required Moab, which in turn we had to pay for. See what they did there? You can read more about it at https://staff.nrao.edu/wiki/bin/view/DMS/SCGTorque6Moab9Presentation

I did look at openpbs, which seems to be the free version of PBS Pro maintained by Altair Engineering. I found it lacking in a few important things: it doesn't support a working directory like Torque does with -d or -w, and it has no PAM module allowing users to log in if they have an active job, which would make nodescheduler very hard to implement. So I don't think openpbs is a suitable replacement for Torque/Moab.

Once nmpost is transitioned, we can look at doing cvpost with all the lessons learned in nmpost. Before cvpost is transitioned, we should tell CV users about the coming transition and possibly let them use nmpost for testing.

To Do

Prep

Done: upgrade testpost-master to RHEL7 so it can run Slurm 122408
Done: upgrade nmpost-master to RHEL7 so it can run Slurm 122408
Done: Look at upgrading to the latest version of Slurm

Work

DONE: Port nodeextendjob to Slurm, e.g. scontrol update jobid=974 timelimit=+7-0:0:0
DONE: Port nodesfree to Slurm
DONE: Port nodereboot to Slurm, e.g. scontrol reboot ASAP reason=testing testpost001
DONE: Create a subset of testpost cluster that only runs Slurm for admins to test.
- Done: Install Slurmctld on testpost-serv-1, testpost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new testpost-master a Slurm submit host
Create a small subset of nmpost cluster that only runs Slurm for users to test.
- Done: Install Slurmctld on nmpost-serv-1, nmpost-master, herapost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new nmpost-master a Slurm submit host
- Done: Make the new, disked herapost-master a Slurm submit host.
- Need at least 3 nodes: batch/interactive, vlass/vlasstest, hera/hera-i
Identify stakeholders (e.g. operations, VLASS, DAs, sci-staff, SSA, HERA, observers, ALMA, CV) and give them the chance to test Slurm and provide opinions.
Implement useful opinions.
- For MPI jobs we should either create hardware ssh keys so users can launch MPI worker processes like they currently do in Torque (with mpiexec or mpirun), or compile Slurm with PMIx to work with OpenMPI3, or compile OpenMPI with the libpmi that Slurm creates. I expect changing mpicasa to use OpenMPI3/PMIx instead of its current OpenMPI version will be difficult, so it might be easier to just add hardware ssh keys. This makes me sad because that was one of the things I was hoping to stop doing with Slurm. Sigh.
Set a date to transition the remaining cluster to Slurm, possibly before we have to pay for Torque again around Jun. 2022.
Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
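
The scontrol calls noted above for nodeextendjob and nodereboot could be wrapped roughly as follows. This is only a sketch, not the actual ported scripts: the function names, the dry-run default, and the argument checks are all hypothetical. The time-limit form is the one recorded in the Work list, and the reboot arguments follow scontrol's documented order (reboot, then ASAP, then reason=, then the node list).

```python
import shlex
import subprocess
from typing import List


def extend_job_cmd(job_id: int, days: int) -> List[str]:
    """Build the scontrol call that adds `days` days to a job's time limit."""
    if job_id <= 0 or days <= 0:
        raise ValueError("job_id and days must be positive integers")
    # A leading "+" increments the current limit; the format is D-H:M:S.
    return ["scontrol", "update", f"jobid={job_id}", f"timelimit=+{days}-0:0:0"]


def reboot_node_cmd(node: str, reason: str) -> List[str]:
    """Build the scontrol call that reboots a node as soon as it is idle."""
    if not node or not reason:
        raise ValueError("node and reason must be non-empty")
    # ASAP drains the node so that no new jobs start before the reboot.
    return ["scontrol", "reboot", "ASAP", f"reason={reason}", node]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command and execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only: prints the two commands without contacting slurmctld.
    run(extend_job_cmd(974, 7))
    run(reboot_node_cmd("testpost001", "testing"))
```

Building the argument list separately from running it keeps the commands easy to check on a host without Slurm installed.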

Launch

Switch remaining nmpost nodes from Torque/Moab to Slurm.
Replace the Torque versions of nodescheduler, nodeextendjob, and nodesfree with the Slurm versions.
Publish new documentation https://info.nrao.edu/computing/guide/cluster-processing

Clean

Remove nodefindfphantoms
Remove nodereboot and associated cron job on servers

Done

DONE: Set a PoolName for the testpost and nmpost clusters, e.g. NRAO-NM-PROD and NRAO-NM-TEST. They don't have to be all caps.
DONE: Change Slurm so that nodes come up properly after a reboot instead of being marked "unexpectedly rebooted" (ReturnToService=2).
DONE: Document how to use HTCondor and Slurm with emphasis on transitioning from Torque/Moab
- https://staff.nrao.edu/wiki/bin/view/NM/HTCondor#Simple_Documentation
- https://staff.nrao.edu/wiki/bin/view/NM/SlurmExampleSubmit
- I will convert these into pages in https://info.nrao.edu/computing/guide/cluster-processing/

DONE: Sort out the various memory settings (ConstrainRAMSpace, ConstrainSwapSpace, AllowedSwapSpace, etc.)
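
For orientation, the node-recovery and memory items above correspond to settings along the following lines. This is a sketch under assumptions: ReturnToService=2 is the value recorded above, but the TaskPlugin line and every cgroup.conf value shown here are illustrative, not the values deployed on nmpost or testpost.

```
# slurm.conf
ReturnToService=2        # a DOWN node returns to service once it registers with a valid configuration
TaskPlugin=task/cgroup   # assumption: cgroup.conf memory limits are enforced by the task/cgroup plugin

# cgroup.conf (all values below are illustrative assumptions)
ConstrainRAMSpace=yes    # limit each job to its allocated memory
ConstrainSwapSpace=yes   # also limit swap usage
AllowedSwapSpace=0       # extra swap allowed, as a percentage of allocated memory
```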

References

Local cluster processing use cases and requirements
