...
The RHEL-7.8.1.5 image does not start pbs_mom by default because RHEL-7.8.1.5/nmpost/root/etc/sysconfig/pbs_mom sets PBS_ARGS="-h"; the snapshots then override this with PBS_ARGS="" in their own etc/sysconfig/pbs_mom. It looks like we won't be able to do a clean transfer: we will have to keep some Torque/Moab nodes around for a while (SSA workflows, etc.), so we may want to be able to switch nodes between Torque/Moab and Slurm easily. Here is how: https://staff.nrao.edu/wiki/bin/view/NM/NmpostRHEL7#Set_schedulers_off_by_default
The instructions below assume all nodes (nmpost{001..090}), but you will likely want to change that to just the subset of nodes that you are transitioning to Slurm.
- Configure DHCP on zia to boot all nmpost nodes to RHEL-7.8.1.5
- Switch remaining nmpost nodes from Torque/Moab to Slurm on nmpost-serv-1
- Change to the snapshot directory
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot
- Enable slurmd to start on boot
- for x in nmpost{001..090}* ; do echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > ${x}/etc/sysconfig/slurmd ; done
- for x in nmpost{001..090}* ; do echo '/etc/sysconfig/slurmd' >> ${x}/files ; done
- Disable Torque from starting on boot
- for x in nmpost{001..090}* ; do echo 'PBS_ARGS="-h"' > ${x}/etc/sysconfig/pbs_mom ; done
- Reboot each node
- Switch nodescheduler, nodeextendjob, and nodesfree from the Torque versions to the Slurm versions on zia
- cd /home/local/Linux/rhel7/x86_64/stow
- #edit cluster/share/cluster/*.sh and change 'nodescheduler-slurm' to 'nodescheduler' in the slurm email functions
- stow -D cluster
- (cd cluster/bin ; rm -f nodescheduler ; ln -s nodescheduler-slurm nodescheduler)
- (cd cluster/bin ; rm -f nodescheduler-test ; ln -s nodescheduler-test-slurm nodescheduler-test)
- (cd cluster/bin ; rm -f nodeextendjob ; ln -s nodeextendjob-slurm nodeextendjob)
- (cd cluster/bin ; rm -f nodesfree ; ln -s nodesfree-slurm nodesfree)
- stow cluster
- Uncomment nmpost lines in nmpost-serv-1:/etc/slurm/slurm.conf
- On nmpost-serv-1, restart the controller with systemctl restart slurmctld
- On nmpost-master, restart the daemon with systemctl restart slurmd
- Remove the bold note about Slurm in the docs on info.nrao.edu
- Remove pam_pbssimpleauth.so from files in /etc/pam.d in the OS image
- Remove /usr/lib64/security/pam_pbssimpleauth.* from the OS image
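The snapshot and symlink steps above are repetitive enough that a small wrapper may help when only a subset of nodes is being moved. The following is a minimal sketch, not an existing tool: the function names switch_to_slurm and use_slurm_tools are hypothetical, and it assumes each snapshot directory already contains etc/sysconfig. It writes the same file contents as the manual steps, but appends to the files list only once so that it can be re-run safely.

```shell
#!/usr/bin/env bash
# Sketch only: switch_to_slurm and use_slurm_tools are hypothetical helper
# names, not existing commands. All paths are passed in as arguments.

# switch_to_slurm SNAPSHOT_DIR NODE...
# Applies the slurmd/pbs_mom snapshot changes to each named node.
switch_to_slurm() {
    local snapshot_dir=$1; shift
    local node x
    for node in "$@"; do
        # Trailing * mirrors the nmpost{001..090}* glob in the manual steps
        for x in "${snapshot_dir}/${node}"*; do
            if [ ! -d "${x}/etc/sysconfig" ]; then
                echo "skipping ${x}: no etc/sysconfig directory" >&2
                continue
            fi
            # Enable slurmd to start on boot
            echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > "${x}/etc/sysconfig/slurmd"
            # Add the entry to the files list only if it is not already there
            grep -qxF '/etc/sysconfig/slurmd' "${x}/files" 2>/dev/null \
                || echo '/etc/sysconfig/slurmd' >> "${x}/files"
            # Disable Torque from starting on boot
            echo 'PBS_ARGS="-h"' > "${x}/etc/sysconfig/pbs_mom"
        done
    done
}

# use_slurm_tools BIN_DIR
# Points each tool name at its -slurm variant (same effect as rm -f; ln -s).
use_slurm_tools() {
    local bin_dir=$1 tool
    for tool in nodescheduler nodescheduler-test nodeextendjob nodesfree; do
        if [ ! -e "${bin_dir}/${tool}-slurm" ]; then
            echo "skipping ${tool}: ${bin_dir}/${tool}-slurm not found" >&2
            continue
        fi
        ln -sfn "${tool}-slurm" "${bin_dir}/${tool}"
    done
}
```

For example, to move only the first ten nodes: switch_to_slurm /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot nmpost{001..010}. The symlink helper would be called as use_slurm_tools cluster/bin from the stow directory, still bracketed by stow -D cluster and stow cluster as in the manual steps.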
Test
- Use nodescheduler to reserve a node. Try with various options.
- Submit a Slurm job
- srun sleep 27
- Submit a Torque/Moab job
- echo "sleep 27" | qsub
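The two submission checks above can be wrapped so that each one reports a clear result. This is a minimal sketch under stated assumptions: run_check is a hypothetical helper, and it only looks at the exit status of the submit command, so for qsub a PASS means the job was accepted, not that it ran to completion.

```shell
#!/usr/bin/env bash
# Sketch only: run_check is a hypothetical helper, not an existing command.

# run_check LABEL TOOL COMMAND [ARGS...]
# Skips the check if TOOL is not in PATH, otherwise runs COMMAND and
# prints PASS or FAIL based on its exit status.
run_check() {
    local label=$1 tool=$2; shift 2
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "SKIP ${label}: ${tool} not found in PATH"
        return 0
    fi
    if "$@" >/dev/null 2>&1; then
        echo "PASS ${label}"
    else
        echo "FAIL ${label}"
        return 1
    fi
}

# Example calls (commented out so that sourcing this file submits nothing):
# run_check "Slurm job" srun srun sleep 27
# run_check "Torque/Moab job" qsub sh -c 'echo "sleep 27" | qsub'
```

On a node that has been switched to Slurm only, the Torque/Moab check is expected to be skipped or to fail, depending on whether the Torque client tools are still installed.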
Clean
- Remove nodefindfphantoms
- Remove cancelmanyjobs
- Remove nodereboot and associated cron job on servers
- Remove Torque reaper
- Uninstall Torque from OS image
- Uninstall Torque from nmpost and testpost servers
- Remove snapshot/*/etc/ssh/shosts.equiv
- Remove snapshot/*/etc/ssh/ssh_known_hosts
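Because the last two removals touch every snapshot, it may be worth listing the affected files before deleting them. The following is a minimal sketch: clean_ssh_hostfiles is a hypothetical helper, and it assumes the snapshot directory is passed in explicitly (for example the RHEL-7.8.1.5 nmpost snapshot directory used earlier). It defaults to a dry run and deletes only when asked.

```shell
#!/usr/bin/env bash
# Sketch only: clean_ssh_hostfiles is a hypothetical helper, not an
# existing command.

# clean_ssh_hostfiles SNAPSHOT_DIR [delete]
# Lists etc/ssh/shosts.equiv and etc/ssh/ssh_known_hosts under every
# snapshot; removes them only when the second argument is "delete".
clean_ssh_hostfiles() {
    local snapshot_dir=$1 mode=${2:-dry-run}
    local f
    for f in "${snapshot_dir}"/*/etc/ssh/shosts.equiv \
             "${snapshot_dir}"/*/etc/ssh/ssh_known_hosts; do
        # An unmatched glob stays literal, so skip anything that does not exist
        [ -e "$f" ] || continue
        if [ "$mode" = "delete" ]; then
            rm -f -- "$f" && echo "removed $f"
        else
            echo "would remove $f"
        fi
    done
}
```

Running it first without the second argument prints the files that would be removed; adding delete performs the removal.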
...