...
- DONE: Port nodeextendjob to Slurm: scontrol update jobid=974 timelimit=+7-0:0:0
- DONE: Port nodesfree to Slurm
- DONE: Port nodereboot to Slurm: scontrol reboot ASAP reason=testing testpost001
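A minimal sketch of how wrappers like nodeextendjob and nodereboot might assemble their scontrol calls. The function names are hypothetical and this is not the actual implementation of either script; it assumes the `scontrol reboot ASAP ...` argument order from the scontrol man page and the days-hours:minutes:seconds time format with a leading `+` to increment.

```python
import shlex
from typing import List


def extend_job_cmd(job_id: int, days: int) -> List[str]:
    """Build the scontrol call that adds `days` days to a job's time limit."""
    if days <= 0:
        raise ValueError("days must be positive")
    # A leading '+' increments the existing limit instead of replacing it.
    return ["scontrol", "update", f"jobid={job_id}", f"timelimit=+{days}-0:0:0"]


def reboot_node_cmd(node: str, reason: str) -> List[str]:
    """Build the scontrol call that reboots a node as soon as it is idle."""
    return ["scontrol", "reboot", "ASAP", f"reason={reason}", node]


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe anywhere.
    print(shlex.join(extend_job_cmd(974, 7)))
    print(shlex.join(reboot_node_cmd("testpost001", "testing")))
```

Returning argument lists instead of strings means the result can be passed straight to `subprocess.run` without shell quoting concerns.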
- DONE: Create a subset of testpost cluster that only runs Slurm for admins to test.
- Done: Install Slurmctld on testpost-serv-1, testpost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new testpost-master a Slurm submit host
- Done: Create a small subset of nmpost cluster that only runs Slurm for users to test.
- Done: Install Slurmctld on nmpost-serv-1, nmpost-master, herapost-master, and OS image
- Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
- Done: Make the new nmpost-master a Slurm submit host
- Done: Make the new, disked herapost-master a Slurm submit host.
- Done: Need at least 2 nmpost nodes for testing: batch/interactive, vlass/vlasstest
- done: test nodescheduler
- done: test mpicasa single-node and multi-node, both without -n or -machinefile
- Identify stake-holders (E.g. operations, VLASS, DAs, sci-staff, SSA, HERA, observers, ALMA, CV) and give them the chance to test Slurm and provide opinions
- nrao-scg, cchandle, emomjian, bsvoboda, jott, lsjouwer, jmarvil, akimball, dmedlin, mmcclear
- Should include for next message: jtobin,
- implement useful opinions
- Done: for MPI jobs we should either create hardware ssh keys so users can launch MPI worker processes like they currently do in Torque (with mpiexec or mpirun), or compile Slurm with PMIx to work with OpenMPI3, or compile OpenMPI with the libpmi that Slurm creates. I expect changing mpicasa to use OpenMPI3/PMIx instead of its current OpenMPI version will be difficult, so it might be easier to just add hardware ssh keys. This makes me sad because that was one of the things I was hoping to stop doing with Slurm. sigh. Actually, this may not be needed. mpicasa figures things out from the Slurm environment and doesn't need a -n or a machinefile. I will test all this without hardware ssh keys.
- Done: Figure out why cgroups for nodescheduler jobs aren't being removed.
- Done: Make heranodescheduler, heranodesfree, and herascancel.
- Done: Document that the submit program will not be available with Slurm.
- Done: How can a user get a report of what was requested and used for the job like '#PBS -m e' does in Torque? SOLUTION: accounting.
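A minimal sketch of the accounting approach, assuming the report is pulled with sacct after the job ends. The function name and the choice of fields are illustrative only; the fields themselves are standard sacct format names covering what was requested (Timelimit, ReqMem, ReqCPUS) and what was used (Elapsed, MaxRSS, TotalCPU).

```python
from typing import List, Sequence

# Requested vs. used resources; adjust to taste (see `sacct --helpformat`).
REPORT_FIELDS = (
    "JobID", "JobName", "State",
    "Elapsed", "Timelimit",
    "ReqMem", "MaxRSS",
    "ReqCPUS", "TotalCPU",
)


def sacct_report_cmd(job_id: int, fields: Sequence[str] = REPORT_FIELDS) -> List[str]:
    """Build a sacct command that reports requested and used resources."""
    return ["sacct", "-j", str(job_id), "--format=" + ",".join(fields)]


if __name__ == "__main__":
    print(" ".join(sacct_report_cmd(974)))
```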
- Done: Update the cluster-nmpost stow package to be Slurm-aware (txt 136692)
- done: Alter /etc/slurm/epilog to use nodescheduler out of /opt/local/ instead of /users/krowe on nmpost-serv-1
- cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost
- edit root/etc/slurm/epilog
- Done: vncserver doesn't work with Slurm from an interactive session started with srun --mem=8G --pty bash -l. I get a popup window reading "Call to lnusertemp failed (temporary directories full?)". I also see this in the log in ~/.vnc: Error: cannot create directory "/run/user/5213/ksocket-krowe": No such file or directory
- Unsetting XDG_RUNTIME_DIR then starting vncserver allows me to connect via vncviewer successfully. I am not aware of anyone that will actually be affected by this because the nm-#### users use nodescheduler and ssh so for them /run/user/<UID> will exist, and I doubt we have any users that currently use qsub -I other than me. So probably the best thing for now is just to document this and move on.
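A minimal sketch of the XDG_RUNTIME_DIR workaround, assuming vncserver is on the PATH inside the srun session. The helper names are hypothetical; the only technique shown is starting vncserver with XDG_RUNTIME_DIR removed from its environment.

```python
import os
import subprocess
from typing import Dict, Mapping


def env_for_vncserver(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `env` without XDG_RUNTIME_DIR."""
    return {key: value for key, value in env.items() if key != "XDG_RUNTIME_DIR"}


def start_vncserver() -> int:
    """Launch vncserver with the cleaned environment and return its exit code."""
    result = subprocess.run(["vncserver"], env=env_for_vncserver(os.environ), check=False)
    return result.returncode
```

The equivalent in an interactive shell is simply `unset XDG_RUNTIME_DIR` before running vncserver.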
- Done: jobs are being restarted after a node reboot. I thought I had that disabled but apparently not. SOLUTION: JobRequeue=0
- Done: There is a newer version (21.x). It might be simple to upgrade. I will test on the testpost cluster.
- Done: MailProg=/usr/bin/smail. This produces more information in the end email message, though not as much as Torque.
- Done: Change to TaskPlugin=affinity,cgroup in accordance with recommendation in https://slurm.schedmd.com/slurm.conf.html
- Done: Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
- Done: Publish new documentation https://info.nrao.edu/computing/guide/cluster-processing
- Skip: cgroups are not being removed when jobs end. E.g. nmpost035 has uid_1429, uid_25654, and uid_5572 in /sys/fs/cgroup/memory/slurm and all of those jobs have ended.
- This is not a critical problem, but more of an annoyance. It does sometimes set the node state to drain because of Reason=Kill task failed
- This has been a recurring problem since they introduced cgroup support back around version 17. Sadly, upgrading to version 21 didn't fix the problem.
- done: A common suggestion is to increase UnkillableStepTimeout to 120 or more. I have set this and will see if it helps.
- As of Mar. 1, 2022, after some reboots, there are no extraneous cgroups on nmpost{035..039}. Let's see if it stays that way.
- May 3, 2022 krowe: I see some on nmpost036: uid_5684 from Mar. 11 and uid_9326 from Mar. 21. Oh well. I have noted this as an issue in our Slurm install docs on the wiki.
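A minimal sketch for spotting leftover cgroups like the ones described above, assuming the uid_<UID> directory naming under /sys/fs/cgroup/memory/slurm and that `squeue -h -w <node> -o %U` lists the UIDs of jobs currently on a node. The function names are hypothetical, and this only reports candidates; it does not remove anything.

```python
import os
import re
import subprocess
from typing import Iterable, List, Set

UID_DIR = re.compile(r"^uid_(\d+)$")


def stale_uid_cgroups(cgroup_entries: Iterable[str], active_uids: Set[int]) -> List[str]:
    """Return uid_* entries whose UID has no job currently on the node."""
    stale = []
    for name in cgroup_entries:
        match = UID_DIR.match(name)
        if match and int(match.group(1)) not in active_uids:
            stale.append(name)
    return sorted(stale)


def active_uids_on_node(node: str) -> Set[int]:
    """Ask squeue for the UIDs of jobs allocated to `node` (needs Slurm)."""
    out = subprocess.run(
        ["squeue", "-h", "-w", node, "-o", "%U"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {int(line) for line in out.split()}


if __name__ == "__main__":
    entries = os.listdir("/sys/fs/cgroup/memory/slurm")
    print(stale_uid_cgroups(entries, active_uids_on_node(os.uname().nodename)))
```

Keeping the comparison in a pure function makes it easy to check against a saved directory listing without touching a live node.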
- Document and perhaps script ways to see if your job is swapping. Slurm doesn't seem to track this which is really unfortunate.
- We already explain how to check ganglia to see if a node is swapping https://info.nrao.edu/computing/guide/cluster-processing/appendix/troubleshooting#section-2
- Adding instructions on using srun --jobid and vmstat -w 1 10 followed by top -u $USER would be good too.
- Realistically, I don't think Linux allows you to truly know which process or processes are swapping. At least I don't know of a guaranteed way to tell.
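A minimal sketch of one possible per-process check, assuming a kernel that reports a VmSwap line in /proc/<pid>/status. This is not a guaranteed answer: VmSwap shows how much of a process is currently swapped out, not whether it is actively swapping, so it complements vmstat rather than replacing it. The function names are hypothetical.

```python
import os
import re
from typing import Dict, Optional

# VmSwap is reported in kB; kernel threads have no such line.
VMSWAP = re.compile(r"^VmSwap:\s+(\d+)\s+kB\s*$", re.MULTILINE)


def vmswap_kb(status_text: str) -> Optional[int]:
    """Extract the VmSwap value (kB) from the text of /proc/<pid>/status."""
    match = VMSWAP.search(status_text)
    return int(match.group(1)) if match else None


def swapped_processes(uid: int, proc_root: str = "/proc") -> Dict[int, int]:
    """Map PID to swapped-out kB for processes owned by `uid` with VmSwap > 0."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry)
        try:
            if os.stat(path).st_uid != uid:
                continue
            with open(os.path.join(path, "status")) as handle:
                swapped = vmswap_kb(handle.read())
        except OSError:
            continue  # the process exited or is not readable
        if swapped:
            result[int(entry)] = swapped
    return result


if __name__ == "__main__":
    print(swapped_processes(os.getuid()))
```

This would need to run on the compute node itself, for example from a shell obtained with the srun --jobid approach.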
- The sstat -j <jobid> command produces blanks both at NRAO and CHTC.
- See https://bugs.schedmd.com/show_bug.cgi?id=12405
- I think this is a result of an update to sstat. The man page needs to be updated. The example shows the following.
- sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 11
- The man page should either add the -a option or use 11.batch
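A minimal sketch of the two workarounds, assuming sstat accepts either an explicit `<jobid>.batch` step or the `-a` (all steps) option. The function name is hypothetical; the default field list is the one from the man page example.

```python
from typing import List, Sequence

MAN_PAGE_FIELDS = ("AveCPU", "AvePages", "AveRSS", "AveVMSize", "JobID")


def sstat_cmd(job_id: int,
              fields: Sequence[str] = MAN_PAGE_FIELDS,
              all_steps: bool = False) -> List[str]:
    """Build an sstat command that names a step instead of a bare job ID."""
    cmd = ["sstat", "--format=" + ",".join(fields)]
    if all_steps:
        # -a reports every step of the job, so the bare job ID is enough.
        cmd += ["-a", "-j", str(job_id)]
    else:
        # Otherwise ask for the batch step explicitly.
        cmd += ["-j", f"{job_id}.batch"]
    return cmd


if __name__ == "__main__":
    print(" ".join(sstat_cmd(11)))
    print(" ".join(sstat_cmd(11, all_steps=True)))
```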
- Try job_container/tmpfs. Does it remove the need for the reaper script? Does it break things? It seems like it could be a good idea, like how HTCondor works. But when I try it I get this in slurmd.log:
[2022-03-01T15:00:09.965] error: container_p_join: open failed for /var/tmp/slurm/1975/.ns: No such file or directory
- Is this because the first thing job_container/tmpfs does is try to umount /dev/shm, which is nuts, and /dev/shm seems to always have a lldpad.state file? On a non-diskless node, like rastan, I can start lldpad, remove the file with lldpad -d, then stop the service. Still, the idea of umounting /dev/shm is ridiculous. Isn't there a bunch of other stuff that will use /dev/shm and prevent the kernel from umounting it?
- Try commenting out PrologFlags=x11, thus setting PrologFlags=none. It isn't needed for nodescheduler. PrologFlags=contain doesn't seem to be needed by reaper.
- If I comment out the PrologFlags line, then jobs only get the .batch step in sstat, not both the .batch and .extern steps.
- If PrologFlags=Alloc then jobs only get the .batch step in sstat, not both the .batch and .extern steps.
- If PrologFlags=Contain then jobs get both the .batch and .extern steps in sstat.
- If PrologFlags=X11 then jobs get both the .batch and .extern steps in sstat.
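A minimal sketch that records the four observations above as a lookup, assuming only the flags that were tested (none, Alloc, Contain, X11; the setting is spelled PrologFlags in slurm.conf). The function name is hypothetical and other flags are deliberately not covered. The X11 result is consistent with my understanding of the slurm.conf man page, which says X11 implicitly enables Contain.

```python
from typing import List


def expected_steps(prolog_flags: str) -> List[str]:
    """Return the job steps observed in sstat for a PrologFlags value."""
    flags = {flag.strip().lower() for flag in prolog_flags.split(",") if flag.strip()}
    steps = ["batch"]
    # The .extern step only appeared when Contain or X11 was set.
    if flags & {"contain", "x11"}:
        steps.append("extern")
    return steps


if __name__ == "__main__":
    for value in ("", "Alloc", "Contain", "X11"):
        print(repr(value), expected_steps(value))
```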
- Enumerate all needs that SSA still has for Torque and help them migrate to Slurm or HTCondor and figure out how many nodes may need to remain running Torque/Moab.
- casa-imaging-pipeline.sh.vlass.*.* quicklook /lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS2.2*
- casa-pipeline.sh /lustre/aoc/cluster/pipeline/dsoc-prod/spool/
- casa-calibration-pipeline.sh.vlass.*.* /lustre/aoc/cluster/pipeline/vlass_prod/spool/se_calibration/VLASS2.1*
- create-component-list.sh.vlass.*.* /lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1*
- DeliveryJob.dsoc-prod.* /lustre/aoc/cluster/pipeline/dsoc-prod/spool/
- data-fetcher.sh.dsoc-prod.* /lustre/aoc/cluster/pipeline/dsoc-prod/spool/
- ingest.dsoc-dev.* /lustre/aoc/cluster/pipeline/dsoc-dev/tmp/ProductLocatorWorkflowStartupTask_runVlbaReingestionWorkflow_*
- ingest.dsoc-test.* /lustre/aoc/cluster/pipeline/dsoc-test/tmp/ProductLocatorWorkflowStartupTask_runVlbaReingestionWorkflow_*
- run-solr-indexer.sh.dsoc-test.* /lustre/aoc/cluster/pipeline/dsoc-test/tmp/ProductLocatorWorkflowStartupTask_runSolrReindexCalForProject_*
- Set a date to transition to Slurm. This will probably be a partial transition where only a few nodes remain for SSA. Preferably this will be before our 2022 Moab license expires on Sep. 30, 2022.
- We usually start the process of renewing Moab in July and receive the new license in August.
- Send email to all parties, especially the nm observer accounts, about the change.
- Note that the submit script (which Lorant still uses) will be going away.
- Make a draft message now with references to documentation and leave variables for dates.
...