...

The instructions below assume all nodes (nmpost{001..090}), but you will likely want to change that to just the subset of nodes you are transitioning to Slurm.

  • On zia, configure DHCP to boot all nmpost nodes to RHEL-7.8.1.5
  • On nmpost-serv-1, switch the remaining nmpost nodes from Torque/Moab to Slurm
    • Change to the snapshot directory
      • cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot
    • Enable slurmd to start on boot
      • for x in nmpost{001..090}* ; do echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > ${x}/etc/sysconfig/slurmd ; done
      • for x in nmpost{001..090}* ; do echo '/etc/sysconfig/slurmd' >> ${x}/files ; done
    • Disable Torque from starting on boot
      • for x in nmpost{001..090}* ; do echo 'PBS_ARGS="-h"' > ${x}/etc/sysconfig/pbs_mom ; done
    • Reboot each node you modified
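The reboot step can be scripted. A minimal sketch, assuming ssh access as root to each node (the node range here is illustrative; use the nodes you actually modified). It only prints the commands so you can review them before running anything:

```shell
# Print the reboot command for each modified node rather than executing it;
# remove the quoting around the command once you have verified the list.
for x in nmpost{001..003} ; do
  echo "ssh root@${x} reboot"
done
```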
  • On zia, switch nodescheduler, nodeextendjob, and nodesfree from Torque to Slurm
    • Change to the stow directory
      • cd /home/local/Linux/rhel7/x86_64/stow
    • Alter the email that is sent.  The default version of nodescheduler is now Slurm instead of Torque.
      • Edit cluster/share/cluster/cluster_job_*.sh and change 'nodescheduler-slurm' to 'nodescheduler' in the Slurm email functions
      • Edit cluster/share/cluster/cluster_job_*.sh and change 'nodescheduler' to 'nodescheduler-torque' in the Torque email functions
    • Change the symlinks to point at the Slurm versions instead of the Torque versions.  Don't unstow the package, as that could kill running jobs.
      • (cd cluster/bin ; rm -f nodescheduler ; ln -s nodescheduler-slurm nodescheduler)
      • (cd cluster/bin ; rm -f nodescheduler-test ; ln -s nodescheduler-test-slurm nodescheduler-test)
      • (cd cluster/bin ; rm -f nodeextendjob ; ln -s nodeextendjob-slurm nodeextendjob)
      • (cd cluster/bin ; rm -f nodesfree ; ln -s nodesfree-slurm nodesfree)
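The swap pattern above can be verified before touching the live stow tree. A throwaway demonstration in a scratch directory (the directory and file names below are stand-ins; the real files live in cluster/bin on zia):

```shell
# Rehearse the rm/ln swap in a temporary directory, then confirm the
# symlink points at the Slurm version with readlink.
demo=$(mktemp -d)
cd "$demo"
touch nodescheduler-slurm nodescheduler-torque
ln -s nodescheduler-torque nodescheduler   # starting state: Torque
rm -f nodescheduler
ln -s nodescheduler-slurm nodescheduler    # swapped to the Slurm version
readlink nodescheduler
```

Running `readlink` after the swap should print nodescheduler-slurm; the same check works on the real symlinks afterwards.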
  • Add nodes to Slurm scheduler
    • On nmpost-serv-1, edit /etc/slurm/slurm.conf and uncomment and modify the NodeName lines.  Each node can appear in only one NodeName line.
    • On nmpost-serv-1, edit /etc/slurm/slurm.conf and modify the PartitionName lines.  There can be no more than one PartitionName line per partition.
    • On nmpost-serv-1, restart slurmctld with systemctl restart slurmctld
    • On nmpost-master, restart slurmd with systemctl restart slurmd
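For reference, the uncommented slurm.conf lines might look like the following. The CPU count, memory, and partition name here are illustrative assumptions only; the real values must match the actual node hardware and your site's partition layout:

```
NodeName=nmpost[001-090] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=batch Nodes=nmpost[001-090] Default=YES MaxTime=INFINITE State=UP
```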
  • On info.nrao.edu, remove the bold notes about Slurm in the docs
  • On info.nrao.edu, change the order of the docs so that Slurm comes before Torque

Test

  • Use nodescheduler to reserve a node.  Try it with various options.
  • Submit a Slurm job
    • srun sleep 27 
  • Submit a Torque/Moab job
    • echo "sleep 27" | qsub
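The Slurm test above can be wrapped in a guard so it is safe to run from any host. A minimal sketch (the guard and messages are an assumption, not part of the original procedure):

```shell
# Run the Slurm test job only where the client tools exist, so this
# snippet does nothing harmful on a host without Slurm installed.
if command -v srun >/dev/null 2>&1 ; then
  srun sleep 27 && echo "slurm test job finished"
else
  echo "srun not found; run this on a Slurm client host"
fi
```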

...