...
- On zia, configure DHCP to boot each nmpost node to RHEL-7.8.1.5
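- A rough sketch, assuming zia runs ISC dhcpd and the diskless image is selected per-host in dhcpd.conf (the actual file and option names may differ): point each nmpost host entry at the RHEL-7.8.1.5 image, then restart the DHCP service
- emacs -nw /etc/dhcp/dhcpd.conf
- systemctl restart dhcpd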
- Switch each nmpost node from Torque/Moab to Slurm on nmpost-serv-1
- Change to the snapshot directory
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot
- Enable slurmd to start on boot
- for x in nmpost{001..060}* ; do echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > ${x}/etc/sysconfig/slurmd ; done
- for x in nmpost{001..060}* ; do echo '/etc/sysconfig/slurmd' >> ${x}/files ; done
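- Spot-check one node's snapshot to confirm both changes took, e.g. cat nmpost001*/etc/sysconfig/slurmd and grep slurmd nmpost001*/files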
- Disable Torque from starting on boot (The default for RHEL-7.8.1.5 is not to start pbs_mom)
- for x in nmpost{001..060}* ; do (cd ${x} ; sed -i -e 's|/etc/sysconfig/pbs_mom||' files) ; done
- for x in nmpost{001..060}* ; do \rm -f ${x}/etc/sysconfig/pbs_mom ; done
- Reboot each node you modified
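- For example, assuming root ssh access to the nodes: for x in nmpost{001..060} ; do ssh ${x} reboot ; done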
- On zia, switch nodescheduler, nodeextendjob, nodesfree from Torque to Slurm
- Change to the stow directory
- cd /home/local/Linux/rhel7/x86_64/stow
- Alter the email that is sent. The default version of nodescheduler is now the Slurm version instead of the Torque version.
- emacs -nw cluster/share/cluster/cluster_job_*.sh and change 'nodescheduler-slurm' to 'nodescheduler' in the slurm email functions
- emacs -nw cluster/share/cluster/cluster_job_*.sh and change 'nodescheduler' to 'nodescheduler-torque' in the torque email functions
- Change the symlinks so they point to the Slurm versions instead of the Torque versions. Don't unstow the package, as that could kill running jobs.
- (cd cluster/bin ; rm -f nodescheduler ; ln -s nodescheduler-slurm nodescheduler)
- (cd cluster/bin ; rm -f nodescheduler-test ; ln -s nodescheduler-test-slurm nodescheduler-test)
- (cd cluster/bin ; rm -f nodeextendjob ; ln -s nodeextendjob-slurm nodeextendjob)
- (cd cluster/bin ; rm -f nodesfree ; ln -s nodesfree-slurm nodesfree)
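- Confirm the links, e.g. ls -l cluster/bin/nodescheduler cluster/bin/nodescheduler-test cluster/bin/nodeextendjob cluster/bin/nodesfree should show each pointing at its -slurm version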
- Add nodes to the Slurm scheduler
- On nmpost-serv-1, edit /etc/slurm/slurm.conf and uncomment and modify NodeName lines. Each node can only be in one NodeName line.
- On nmpost-serv-1, edit /etc/slurm/slurm.conf and modify PartitionName lines. There cannot be more than one PartitionName line per partition.
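- For example, hypothetical lines (CPU counts, memory, and partition membership must match the real hardware and the desired layout):
- NodeName=nmpost[001-060] CPUs=16 RealMemory=192000 State=UNKNOWN
- PartitionName=batch Nodes=nmpost[001-060] Default=YES MaxTime=INFINITE State=UP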
- On nmpost-serv-1, restart slurmctld with systemctl restart slurmctld
- On nmpost-master, restart slurmd with systemctl restart slurmd
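- Verify the nodes registered with the controller, e.g. sinfo -N -l should list each node and its state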
- On info.nrao.edu, remove the bold notes about Slurm in the docs
Test
- Use nodescheduler to reserve a node. Try with various options.
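- To verify the reservation landed in Slurm, squeue -u $USER should show the corresponding job (assuming nodescheduler submits a Slurm job on your behalf)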
- Submit a Slurm job
- srun sleep 27
- Submit a Torque/Moab job
- echo "sleep 27" | qsub
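- qstat -u $USER should show the job while it runs, confirming Torque/Moab still works for nodes that have not been switched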
Later Cleanup
Much later, when we are sure we don't want certain nodes to ever run Torque/Moab again, we can do the following. This is a rough idea of what needs to be done. I would suggest making this part of a new OS image like RHEL-7.8.1.6 or later.
...