Currently, the nmpost cluster is a mix of Torque/Moab nodes, nmpost{001..090}, and HTCondor nodes, nmpost{091..120} and devhost{001..002}. Eventually we would like to replace Torque/Moab with Slurm because we think it can do most, if not all, of what Torque/Moab does, it is free, and it seems to be more commonly used these days than Torque/Moab.
We upgraded to Torque-6/Moab-9 and thus started having to pay for Torque/Moab in 2018. This was done because Torque-6 understood cgroups and NUMA nodes (although it doesn't handle NUMA nodes how I would like). Torque-6 was no longer compatible with the free scheduler Maui, which forced us to purchase the Moab scheduler. Since then we have leveraged a couple of things Moab can do that Maui never could, like increasing the number of jobs the scheduler looks ahead to schedule. This allows Moab to start reserving space for pending vlass jobs on vlasstest nodes, but it is not a critical requirement. Largely, the win was cgroups for resource separation and NUMA nodes to double the number of interactive nodes, both of which only required the new version of Torque, which in turn required Moab, which in turn we had to pay for. See what they did there? You can read more about it at https://staff.nrao.edu/wiki/bin/view/DMS/SCGTorque6Moab9Presentation
I did look at OpenPBS, which seems to be the free version of PBS Pro maintained by Altair Engineering. I have found it lacking in a few important things: it doesn't support setting a working directory like Torque does with -d or -w, and it has no PAM module that allows users to log in if they have an active job, which would make nodescheduler very hard to implement. So I don't think OpenPBS is a suitable replacement for Torque/Moab.
Once nmpost is transitioned we can look at doing cvpost with all the lessons learned in nmpost. Before cvpost is transitioned we should tell CV users about the coming transition and possibly let them use nmpost for testing.
To Do
Prep
- Done: upgrade testpost-master to RHEL7 so it can run Slurm 122408
- Done: upgrade nmpost-master to RHEL7 so it can run Slurm 122408
- Done: Look at upgrading to the latest version of Slurm
Work
- DONE: Port nodeextendjob to Slurm: scontrol update jobid=974 timelimit=+7-0:0:0
- DONE: Port nodesfree to Slurm
- DONE: Port nodereboot to Slurm: scontrol reboot ASAP reason=testing testpost001
- DONE: Create a subset of testpost cluster that only runs Slurm for admins to test.
- Done: Install Slurmctld on testpost-serv-1, testpost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new testpost-master a Slurm submit host
- Done: Create a small subset of nmpost cluster that only runs Slurm for users to test.
- Done: Install Slurmctld on nmpost-serv-1, nmpost-master, herapost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new nmpost-master a Slurm submit host
- Done: Make the new, disked herapost-master a Slurm submit host.
- Done: Need at least 2 nmpost nodes for testing: batch/interactive, vlass/vlasstest
- done: test nodescheduler
- done: test mpicasa single-node and multi-node, both without -n or -machinefile
- Identify stake-holders (E.g. operations, VLASS, DAs, sci-staff, SSA, HERA, observers, ALMA, CV) and give them the chance to test Slurm and provide opinions
- implement useful opinions
- Done: for MPI jobs we should either create hardware ssh keys so users can launch MPI worker processes like they currently do in Torque (with mpiexec or mpirun), or compile Slurm with PMIx to work with OpenMPI3, or compile OpenMPI with the libpmi that Slurm creates. I expect changing mpicasa to use OpenMPI3/PMIx instead of its current OpenMPI version will be difficult, so it might be easier to just add hardware ssh keys. This makes me sad because that was one of the things I was hoping to stop doing with Slurm. sigh. Actually, this may not be needed. mpicasa figures things out from the Slurm environment and doesn't need a -n or a machinefile. I will test all this without hardware ssh keys.
- Done: Figure out why cgroups for nodescheduler jobs aren't being removed.
- Done: Make heranodescheduler, heranodesfree, and herascancel.
- Done: Document that the submit program will not be available with Slurm.
- Done: How can a user get a report of what was requested and used for the job like '#PBS -m e' does in Torque? SOLUTION: accounting.
- Done: Update the cluster-nmpost stow package to be Slurm-aware (txt 136692)
- done: Alter /etc/slurm/epilog to use nodescheduler out of /opt/local/ instead of /users/krowe on nmpost-serv-1
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost
- edit root/etc/slurm/epilog
- Done: vncserver doesn't work with Slurm from an interactive session started with srun --mem=8G --pty bash -l. I get a popup window reading "Call to lnusertemp failed (temporary directories full?)". I also see this in the log in ~/.vnc: Error: cannot create directory "/run/user/5213/ksocket-krowe": No such file or directory
- Unsetting XDG_RUNTIME_DIR then starting vncserver allows me to connect via vncviewer successfully. I am not aware of anyone that will actually be affected by this because the nm-#### users use nodescheduler and ssh so for them /run/user/<UID> will exist, and I doubt we have any users that currently use qsub -I other than me. So probably the best thing for now is just to document this and move on.
- Done: jobs are being restarted after a node reboot. I thought I had that disabled but apparently not. SOLUTION: JobRequeue=0
- Done: There is a newer version (21.x). It might be simple to upgrade. I will test on the testpost cluster.
- Done: MailProg=/usr/bin/smail. This produces more information in the end-of-job email message, though not as much as Torque does.
- cgroups are not being removed when jobs end. E.g. nmpost035 has uid_1429, uid_25654, and uid_5572 in /sys/fs/cgroup/memory/slurm and all of those jobs have ended.
- This is not a critical problem, but more of an annoyance. It does sometimes set the node state to drain because of Reason=Kill task failed
- This has been a recurring problem since they introduced cgroup support back around version 17. Sadly, upgrading to version 21 didn't fix the problem.
- done: A common suggestion is to increase UnkillableStepTimeout=120 or more. I have set this and will see if it helps.
- As of Mar. 1, 2022, after some reboots, there are no extraneous cgroups on nmpost{035..039}. Let's see if it stays that way.
- Document and perhaps script ways to see if your job is swapping. Slurm doesn't seem to track this which is really unfortunate.
- The sstat command doesn't produce any information unless you use the task name instead of the job name. E.g. sstat -j 1904.batch
- While the use of PrologFlags=contain, which is set because we use PrologFlags=X11, does create the <JOBID>.extern step, it does not prevent sstat -j <JOBID> from working. There must be some other reason. Why did I set X11? Sure, it allows for things like srun --x11 xclock, but why do we need it? It isn't needed for nodescheduler. PrologFlags=contain doesn't seem to be needed by reaper.
- If you add the -a option it works. E.g. sstat -a -j 1234 but this shouldn't be necessary. I don't need the -a at CHTC.
- Change to TaskPlugin=task/affinity,task/cgroup in accordance with the recommendation in https://slurm.schedmd.com/slurm.conf.html
- Try job_container/tmpfs Does it remove the need for the reaper script? Does it break things?
- Seems like it could be a good idea. Like how HTCondor works. But when I try it I get this in the slurmd.log
[2022-03-01T15:00:09.965] error: container_p_join: open failed for /var/tmp/slurm/1975/.ns: No such file or directory
- Try commenting out PrologFlags=x11, thus setting PrologFlags=none. It isn't needed for nodescheduler. PrologFlags=contain doesn't seem to be needed by reaper.
- Done: Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
- Done: Publish new documentation https://info.nrao.edu/computing/guide/cluster-processing
- Set a date to transition remaining cluster to Slurm. Preferably before we have to pay for Torque again around Jun. 2022 or before the license expires on Sep. 30, 2022.
- Could this coincide with the server room PDU upgrade?
- Send email to all parties, especially the nm observer accounts, about the change.
- Note that the submit script (which Lorant still uses) will be going away.
- Make a draft message now with references to documentation and leave variables for dates.
- Think about configuring RHEL-7.8.1.5 now to default to Slurm. As of Jan. 7, 2022 all non-test nmpost nodes under nmpost091 are using RHEL-7.8.1.1 so it would make Launch day easier if all we needed to do was change DHCP and reboot the nodes.
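One of the open items above asks for a scripted way to see whether a job is swapping. The following is a minimal sketch, not a tested tool. It assumes cgroup v1 with swap accounting enabled, and it assumes job cgroups live under /sys/fs/cgroup/memory/slurm/uid_<UID>/job_<JOBID>/ (the uid_ level matches the paths noted above; the job_ level and the total_swap counter are assumptions about the node's cgroup layout). The function names swap_bytes and report_user_swap are hypothetical.

```shell
#!/bin/bash
# Sketch: report swap use recorded in cgroup v1 memory.stat files.
# Assumed layout (not verified on nmpost):
#   /sys/fs/cgroup/memory/slurm/uid_<UID>/job_<JOBID>/memory.stat

# Print the swap bytes recorded in the given memory.stat file.
# Prefers the hierarchical total_swap counter (which includes child
# cgroups such as job steps), falls back to swap, and prints 0 if
# neither line is present.
swap_bytes() {
    awk 'BEGIN { own = "0" }
         $1 == "total_swap" { total = $2; have_total = 1 }
         $1 == "swap"       { own = $2 }
         END { print (have_total ? total : own) }' "$1"
}

# Print one line per job cgroup found for the given UID on this node.
# The second argument overrides the cgroup root (useful for testing).
report_user_swap() {
    local uid="$1"
    local root="${2:-/sys/fs/cgroup/memory/slurm}"
    local stat
    for stat in "${root}/uid_${uid}"/job_*/memory.stat; do
        # Skip the unexpanded glob when the user has no job cgroups.
        [ -r "${stat}" ] || continue
        echo "$(basename "$(dirname "${stat}")") swap_bytes=$(swap_bytes "${stat}")"
    done
}
```

A user could source this on the node running the job and call report_user_swap "$(id -u)". A non-zero value suggests the job has pages in swap, but it says nothing about how actively it is swapping.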
Launch
- Configure DHCP on zia to boot all nmpost nodes to RHEL-7.8.1.5
- Switch remaining nmpost nodes from Torque/Moab to Slurm on nmpost-serv-1
- Change to the snapshot directory
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot
- Enable slurmd to start on boot
- for x in nmpost{001..090}* ; do echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > ${x}/etc/sysconfig/slurmd ; done
- for x in nmpost{001..090}* ; do echo '/etc/sysconfig/slurmd' >> ${x}/files ; done
- Disable Torque from starting on boot
- rm -f nmpost*/etc/sysconfig/pbs_mom
- for x in nmpost* ; do sed -i '/^\/etc\/sysconfig\/pbs_mom/d' ${x}/files ; done
- for x in nmpost* ; do sed -i '/^\/etc\/ssh\/shosts.equiv/d' ${x}/files ; done
- for x in nmpost* ; do sed -i '/^\/etc\/ssh\/ssh_known_hosts/d' ${x}/files ; done
- Reboot each node
- Switch Torque nodescheduler, nodeextendjob, nodesfree with Slurm versions on zia
- cd /home/local/Linux/rhel7/x86_64/stow
- stow -D cluster
- (cd cluster/bin ; rm -f nodescheduler ; ln -s nodescheduler-slurm nodescheduler)
- (cd cluster/bin ; rm -f nodescheduler-test ; ln -s nodescheduler-test-slurm nodescheduler-test)
- (cd cluster/bin ; rm -f nodeextendjob ; ln -s nodeextendjob-slurm nodeextendjob)
- (cd cluster/bin ; rm -f nodesfree ; ln -s nodesfree-slurm nodesfree)
- stow cluster
- Uncomment nmpost lines in nmpost-serv-1:/etc/slurm/slurm.conf
- On nmpost-serv-1 restart with systemctl restart slurmctld
- On nmpost-master restart with systemctl restart slurmd
- Remove the bold note about Slurm in the docs on info.nrao.edu
- Remove pam_pbssimpleauth.so from files in /etc/pam.d in the OS image
- Remove /usr/lib64/security/pam_pbssimpleauth.* from the OS image
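The per-node snapshot edits in the Launch steps above could be wrapped in one function so they can be rehearsed on a copy of the snapshot tree and safely re-run. This is a sketch under assumptions, not the procedure actually used: the function name switch_to_slurm is hypothetical, the pbs_mom entry is matched by any path ending in /pbs_mom, and GNU sed is assumed for -i.

```shell
#!/bin/bash
# Sketch: switch one node's snapshot directory from Torque to Slurm.
# Usage: switch_to_slurm <node_snapshot_dir> [conf_server]
switch_to_slurm() {
    local dir="$1"
    local server="${2:-nmpost-serv-1}"
    local files="${dir}/files"

    # Enable slurmd to start on boot.
    mkdir -p "${dir}/etc/sysconfig"
    echo "SLURMD_OPTIONS=\"--conf-server ${server}\"" > "${dir}/etc/sysconfig/slurmd"
    touch "${files}"
    # Append the entry only if it is missing, so re-running is harmless.
    grep -qxF '/etc/sysconfig/slurmd' "${files}" \
        || echo '/etc/sysconfig/slurmd' >> "${files}"

    # Disable Torque from starting on boot and drop its ssh host files
    # from the list of per-node files.
    rm -f "${dir}/etc/sysconfig/pbs_mom"
    sed -i -e '\|/pbs_mom$|d' \
           -e '\|^/etc/ssh/shosts\.equiv$|d' \
           -e '\|^/etc/ssh/ssh_known_hosts$|d' "${files}"
}
```

From the snapshot directory it might be called as: for x in nmpost{001..090}* ; do switch_to_slurm "${x}" ; done. Unlike the plain echo >> loop, running it twice does not add a second /etc/sysconfig/slurmd line to files.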
Clean
- Remove nodefindfphantoms
- Remove cancelmanyjobs
- Remove nodereboot and associated cron job on servers
- Remove Torque reaper
- Uninstall Torque from OS image.
- Uninstall Torque from nmpost and testpost servers
- Remove snapshot/*/etc/ssh/shosts.equiv
- Remove snapshot/*/etc/ssh/ssh_known_hosts
Done
- DONE: Set a PoolName for the testpost and nmpost clusters. E.g. NRAO-NM-PROD and NRAO-NM-TEST. They don't have to be all caps.
- DONE: Change slurm so that nodes come up properly after a reboot instead of "unexpectedly rebooted" ReturnToService=2
- DONE: Document how to use HTCondor and Slurm with emphasis on transitioning from Torque/Moab
- DONE: Sort out the various memory settings (ConstrainRAMSpace, ConstrainSwapSpace, AllowedSwapSpace, etc)
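Several fixes recorded on this page are single slurm.conf settings (JobRequeue=0, ReturnToService=2, UnkillableStepTimeout=120, MailProg=/usr/bin/smail). A small check like the sketch below could confirm they survive future edits or upgrades. The function name check_slurm_conf is hypothetical, and the match is deliberately simple: it assumes each setting sits on its own line, spelled with the same case as the argument, with no trailing comment.

```shell
#!/bin/bash
# Sketch: verify that a slurm.conf-style file contains expected settings.
# Usage: check_slurm_conf <conf_file> Key=Value [Key=Value ...]
# Returns 0 if every setting is present, 1 otherwise.
check_slurm_conf() {
    local conf="$1"; shift
    local want key actual rc=0
    for want in "$@"; do
        key="${want%%=*}"
        # Take the last uncommented "Key=..." line and trim whitespace.
        actual=$(grep -E "^[[:space:]]*${key}=" "${conf}" | tail -n 1 \
                 | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
        if [ "${actual}" = "${want}" ]; then
            echo "ok      ${want}"
        else
            echo "MISSING ${want} (found: ${actual:-nothing})"
            rc=1
        fi
    done
    return "${rc}"
}
```

For example, on nmpost-serv-1: check_slurm_conf /etc/slurm/slurm.conf JobRequeue=0 ReturnToService=2 UnkillableStepTimeout=120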