...
- Done: Port nodeextendjob to Slurm (scontrol update jobid=974 timelimit=+7-0:0:0)
- Done: Port nodesfree to Slurm
- Done: Port nodereboot to Slurm (scontrol reboot ASAP reason=testing testpost001)
- Done: Create a subset of the testpost cluster that only runs Slurm for admins to test.
- Done: Install Slurmctld on testpost-serv-1, testpost-master, and OS image
- Done: Install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new testpost-master a Slurm submit host
- Create a small subset of nmpost cluster that only runs Slurm for users to test.
- Done: Install Slurmctld on nmpost-serv-1, nmpost-master, herapost-master, and OS image
- Done: Install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new nmpost-master a Slurm submit host
- Done: Make the new, disked herapost-master a Slurm submit host.
- Need at least 3 nodes: batch/interactive, vlass/vlasstest, hera/hera-i
- Identify stakeholders (e.g. operations, VLASS, DAs, sci-staff, SSA, HERA, observers, ALMA, CV) and give them a chance to test Slurm and provide feedback
- Implement useful feedback
- Done: For MPI jobs we had three options: create hardware ssh keys so users can launch MPI worker processes as they currently do in Torque (with mpiexec or mpirun); compile Slurm with PMIx support to work with OpenMPI3; or compile OpenMPI against the libpmi that Slurm provides. Changing mpicasa to use OpenMPI3/PMIx instead of its current OpenMPI version is likely to be difficult, so it might be easier to just add hardware ssh keys, which is disappointing because dropping them was one of the things I was hoping Slurm would let us do. Actually, none of this may be needed: mpicasa figures things out from the Slurm environment and doesn't need a -n flag or a machinefile. I will test all this without hardware ssh keys.
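The planned test could look roughly like the dry-run sketch below, assuming mpicasa really does size its worker pool from the SLURM_* environment. The script name, resource sizes, and CASA invocation are placeholders, not a tested recipe; drop the 'echo' to actually submit.

```shell
# Write a hypothetical job script: no -n, no machinefile, no hardware
# ssh keys; mpicasa is expected to discover its workers from the
# Slurm-provided environment.
cat > mpicasa-test.sh <<'EOF'
#!/bin/sh
#SBATCH --job-name=mpicasa-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=0:30:00
mpicasa casa --nogui -c my-test-script.py
EOF
echo sbatch mpicasa-test.sh
```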
- Done: Figure out why cgroups for nodescheduler jobs aren't being removed.
- Done: Make heranodescheduler, heranodesfree, and herascancel.
- Done: Document that the submit program will not be available with Slurm.
- Done: How can a user get a report of what was requested and used for the job like '#PBS -m e' does in Torque? SOLUTION: accounting.
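With accounting enabled, sacct can show a job's requested versus consumed resources after it ends. A hedged sketch (12345 is a placeholder job ID; drop the 'echo' to run it for real):

```shell
# Columns pairing requests (ReqCPUS, ReqMem, Timelimit) with usage
# (MaxRSS, Elapsed); all are standard sacct format fields.
fmt="JobID,JobName,ReqCPUS,ReqMem,MaxRSS,Timelimit,Elapsed,State"
echo sacct -j 12345 --format="$fmt"
```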
- Update the cluster-nmpost stow package to be Slurm-aware (txt 136692)
- Alter /etc/slurm/epilog to use nodescheduler out of /opt/local/ instead of /users/krowe on nmpost-serv-1
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost
- edit root/etc/slurm/epilog
- Set a date to transition remaining cluster to Slurm. Possibly before we have to pay for Torque again around Jun. 2022.
- Done: Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
- Done: Publish new documentation https://info.nrao.edu/computing/guide/cluster-processing
...
- Switch remaining nmpost nodes from Torque/Moab to Slurm on nmpost-serv-1
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost/snapshot
- echo 'SLURMD_OPTIONS="--conf-server nmpost-serv-1"' > TEMPLATE/etc/sysconfig/slurmd
- echo '/etc/sysconfig/slurmd' >> TEMPLATE/files
- rm -f nmpost*/etc/sysconfig/pbs_mom
- for x in nmpost* ; do \cp -f TEMPLATE/etc/sysconfig/slurmd ${x}/etc/sysconfig ; done
- for x in nmpost* ; do \cp -f TEMPLATE/files ${x} ; done
- reboot each node
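After each reboot it is worth confirming the node came back running slurmd rather than pbs_mom. A dry-run sketch (nmpost001 is a placeholder node name; drop the 'echo' to run on the cluster):

```shell
node=nmpost001
echo sinfo -N -l                  # node-oriented view; the node should reach 'idle'
echo scontrol show node "$node"   # shows slurmd version and any drain reason
```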
- Replace the Torque nodescheduler, nodeextendjob, and nodesfree with the Slurm versions on zia
- cd ~krowe/nodescheduler/slurm
- cp nodescheduler /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/bin
- cp nodescheduler-test /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/bin
- Edit nodescheduler and nodescheduler-test to update CLUSTERDIR.
- cp cluster_job* /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/share/cluster-nmpost
- cp nodeextendjob /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/bin/nodeextendjob
- cp nodesfree /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/bin
- cp herasclean /home/local/Linux/rhel7/x86_64/stow/cluster-nmpost/bin
- Edit /etc/sudoers.d/nrao-hera on nmpost-master and herapost-master
- stow cluster-nmpost
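GNU Stow can simulate the swap before committing it, using its real -n (simulate), -v (verbose), and -d (stow directory) flags with the directory from the steps above. A hedged sketch (drop the 'echo' to actually run the simulation):

```shell
# Preview the symlinks stow would create or replace, without changing anything.
stow_dir=/home/local/Linux/rhel7/x86_64/stow
echo stow -n -v -d "$stow_dir" cluster-nmpost
```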
Cleanup
- Remove nodefindfphantoms
- Remove nodereboot and associated cron job on servers
...