Currently, the nmpost cluster is a mix of Torque/Moab (nmpost{001..090}) and HTCondor (nmpost{091..120}, devhost{001..002}).  Eventually we would like to replace Torque/Moab with Slurm, as we think it can do most, if not all, of what Torque/Moab does but is free, is being actively developed, and seems more commonly used these days than Torque/Moab.

We upgraded to Torque-6/Moab-9 in 2018 and thus started having to pay for Torque/Moab.  This was done because Torque-6 understood cgroups and NUMA nodes (although it doesn't handle NUMA nodes the way krowe would like), and Torque-6 was no longer compatible with the free scheduler Maui, forcing us to purchase the Moab scheduler.  Since then we have leveraged a couple of things Moab can do that Maui never could, like increasing the number of jobs the scheduler looks ahead to schedule, which allows Moab to start reserving space for pending vlass jobs on vlasstest nodes.  This is not a critical requirement.  The real win was cgroups for resource separation and NUMA nodes to double the number of interactive jobs available, both of which only required Torque-6, which in turn required Moab-9, which in turn we had to pay for.  See what they did there?  You can read more about it at https://staff.nrao.edu/wiki/bin/view/DMS/SCGTorque6Moab9Presentation

I did look at OpenPBS, which seems to be the free version of PBS Pro maintained by Altair Engineering, and found it lacking in a few important ways: it doesn't support setting a working directory like Torque does with -d or -w, and it has no PAM module to allow users to log in to a node where they have an active job, which would make nodescheduler very hard to implement.  So I don't think OpenPBS is a suitable replacement for Torque/Moab.
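
For comparison, Slurm covers both of these gaps: sbatch and srun accept --chdir to set the job's working directory, and the pam_slurm_adopt PAM module restricts ssh logins on a node to users with an active job there.  A minimal sketch of both (the example path and the exact PAM stack placement are assumptions, not our tested configuration):

  # set the working directory of a batch job (roughly what Torque's -d/-w did)
  sbatch --chdir=/lustre/aoc/observers/nm-1234 job.sh

  # /etc/pam.d/sshd on an execution node: only allow ssh if the user has a job on the node
  account    required     pam_slurm_adopt.so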

Once nmpost is transitioned, we can look at doing cvpost with all the lessons learned from nmpost.  Before cvpost is transitioned we should tell CV users about the coming transition and possibly let them use nmpost for testing.

To Do

Prep

  • Done: upgrade testpost-master to RHEL7 so it can run Slurm (122408)
  • Done: upgrade nmpost-master to RHEL7 so it can run Slurm (122408)
  • Done: Look at upgrading to the latest version of Slurm


Work

  • DONE: Port nodeextendjob to Slurm: scontrol update jobid=974 timelimit=+7-0:0:0
  • DONE: Port nodesfree to Slurm
  • DONE: Port nodereboot to Slurm: scontrol reboot ASAP reason=testing testpost001
  • DONE: Create a subset of testpost cluster that only runs Slurm for admins to test.
    • Done: Install Slurmctld on testpost-serv-1, testpost-master, and OS image
    • Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
    • Done: Make the new testpost-master a Slurm submit host
  • Done: Create a small subset of nmpost cluster that only runs Slurm for users to test.
    • Done: Install Slurmctld on nmpost-serv-1, nmpost-master, herapost-master, and OS image
    • Done: install Slurm reaper on OS image (RHEL-7.8.1.3)
    • Done: Make the new nmpost-master a Slurm submit host
    • Done: Make the new, disked herapost-master a Slurm submit host.
    • Done: Need at least 2 nmpost nodes for testing: batch/interactive, vlass/vlasstest
      • done: test nodescheduler
      • done: test mpicasa single-node and multi-node, both without -n or -machinefile
  • Identify stakeholders (e.g. operations, VLASS, DAs, sci-staff, SSA, HERA, observers, ALMA, CV) and give them the chance to test Slurm and provide opinions
    • nrao-scg, cchandle, emomjian, bsvoboda, jott, lsjouwer, jmarvil, akimball, dmedlin, mmcclear
    • Should include for next message: jtobin,
    • Common batch users as of Jun. 3, 2022
      • agraham, alawson, ecarlson, ejimenez, fowen, jmarvil, jott, jtobin, lsjouwer, pbeaklin, rxue, nm-####
  • implement useful opinions
    • Done: for MPI jobs we should either create hardware ssh keys so users can launch MPI worker processes like they currently do in Torque (with mpiexec or mpirun), or compile Slurm with PMIx to work with OpenMPI3, or compile OpenMPI against the libpmi that Slurm provides.  I expect changing mpicasa to use OpenMPI3/PMIx instead of its current OpenMPI version will be difficult, so it might be easier to just add hardware ssh keys.  This makes me sad because that was one of the things I was hoping to stop doing with Slurm.  Sigh.  Actually, this may not be needed: mpicasa figures things out from the Slurm environment and doesn't need -n or a machinefile.  I will test all this without hardware ssh keys (see the example batch script after this list).
    • Done: Figure out why cgroups for nodescheduler jobs aren't being removed.
    • Done: Make heranodescheduler, heranodesfree, and herascancel.
    • Done: Document that the submit program will not be available with Slurm.
    • Done: How can a user get a report of what was requested and used for a job, like '#PBS -m e' does in Torque?  SOLUTION: accounting (see the sacct example after this list).
    • Done: Update the cluster-nmpost stow package to be Slurm-aware (txt 136692)
      • done: Alter /etc/slurm/epilog to use nodescheduler out of /opt/local/ instead of /users/krowe on nmpost-serv-1
        • cd /opt/services/diskless_boot/RHEL-7.8.1.5/nmpost
        • edit root/etc/slurm/epilog
    • Done: vncserver doesn't work with Slurm from an interactive session started with srun --mem=8G --pty bash -l.  I get a popup window reading "Call to lnusertemp failed (temporary directories full?)" and I also see this error in the log in ~/.vnc: Error: cannot create directory "/run/user/5213/ksocket-krowe": No such file or directory
      • Unsetting XDG_RUNTIME_DIR and then starting vncserver allows me to connect via vncviewer successfully (see the workaround after this list).  I am not aware of anyone who will actually be affected by this because the nm-#### users use nodescheduler and ssh, so for them /run/user/<UID> will exist, and I doubt we have any users that currently use qsub -I other than me.  So probably the best thing for now is just to document this and move on.
    • Done: jobs are being restarted after a node reboot.  I thought I had that disabled but apparently not.  SOLUTION: JobRequeue=0
    • Done: There is a newer version (21.x).  It might be simple to upgrade.  I will test it on the testpost cluster.
    • Done: Set MailProg=/usr/bin/smail.  This produces more information in the end-of-job email message, though not as much as Torque.
    • Done: Change to TaskPlugin=affinity,cgroup in accordance with recommendation in https://slurm.schedmd.com/slurm.conf.html
    • Done: Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
    • Done: Publish new documentation https://info.nrao.edu/computing/guide/cluster-processing
    • Done: Try commenting out PrologFlags=x11, thus setting PrologFlags to none.  nodescheduler doesn't need any prolog flags set, and PrologFlags=contain doesn't seem to be needed by reaper.
      • If I comment out the PrologFlags line, then jobs only get the .batch step, not both the .batch and .extern steps, in sstat.
      • If PrologFlags=Alloc, then jobs only get the .batch step, not both the .batch and .extern steps, in sstat.
      • If PrologFlags=Contain, then jobs get both the .batch and .extern steps in sstat.
      • If PrologFlags=X11, then jobs get both the .batch and .extern steps in sstat.
      • I don't know that we need any PrologFlags set, so comment it out until we need it.
    • Skip: cgroups are not being removed when jobs end.  E.g. nmpost035 has uid_1429, uid_25654, and uid_5572 in /sys/fs/cgroup/memory/slurm and all of those jobs have ended.
      • This is not a critical problem, but more of an annoyance.  It does sometimes set the node state to drain with Reason=Kill task failed.
      • This has been a recurring problem since they introduced cgroup support back around version 17.  Sadly, upgrading to version 21 didn't fix the problem.
      • done: A common suggestion is to increase UnkillableStepTimeout to 120 or more.  I have set this and will see if it helps.
      • As of Mar. 1, 2022, after some reboots, there are no extraneous cgroups on nmpost{035..039}.  Let's see if it stays that way.
      • May 3, 2022 krowe: I see some on nmpost036: uid_5684 from Mar. 11 and uid_9326 from Mar. 21.  Oh well.  I have noted this as an issue in our Slurm install docs on the wiki.
    • Skip: The sstat -j <jobid> command produces blanks both at NRAO and CHTC.
      • See  https://bugs.schedmd.com/show_bug.cgi?id=12405
      • I think this is a result of an update to sstat.  The man page needs to be updated.  The example shows the following.
        • sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j 11
      • The man page example should either add the -a option or use -j 11.batch
    • Skip: Try job_container/tmpfs.  Does it remove the need for the reaper script?  Does it break things?  It seems like a good idea, similar to how HTCondor works.  But when I try it I get this in the slurmd.log:
      • [2022-03-01T15:00:09.965] error: container_p_join: open failed for /var/tmp/slurm/1975/.ns: No such file or directory

      • Is this because the first thing job_container/tmpfs does is try to umount /dev/shm, which is nuts, and /dev/shm seems to always have an lldpad.state file?  On a non-diskless node, like rastan, I can start lldpad, remove the file with lldpad -d, then stop the service.  Still, the idea of umounting /dev/shm is ridiculous.  Isn't there a bunch of other stuff that will use /dev/shm and prevent the kernel from umounting it?
      • I made testpost-master an execution host and jobs run on it.  So perhaps job_container doesn't work with diskless nodes.
      • I filed a ticket https://bugs.schedmd.com/show_bug.cgi?id=14113
      • Turns out it was our fault.  The initial ramdisk apparently has the file /dev/shm/lldpad.state because of systemd/selinux/whatever.  Then the rhel-readonly unit makes this a bind-mounted file, which prevents Slurm from using job_container/tmpfs.  I think it is easier to just unmount this file than to try to alter the initial ramdisk.
      • This can also use /tmp on the local disk (i.e. NVMe) instead of a shared ramdisk for /tmp.  That should be good, but it also means the nodes will need local storage.  All of our nodes do have some sort of local storage.  Currently (May 26, 2022) nodes only have a 2GB tmpfs for /tmp, so we wouldn't need a lot of disk for a per-job /tmp using job_container/tmpfs, and of course /dev/shm doesn't use disk space.
        • nmpost{001..010}: /dev/sda1 558GB used for swap
        • nmpost{011..020}: /dev/sda1 894GB used for swap
        • nmpost{021..040}: /dev/sda1 558GB used for swap
        • nmpost{041..060}: /dev/sda1 894GB used for swap
        • nmpost{061..090}: /dev/sda1 447GB used for swap, /dev/sda2 447GB used for /mnt/scratch
        • nmpost{091..120}: /dev/sda1 447GB used for swap, /dev/nvme0n1p1 3.5TB used for /mnt/condor
        • herapost{001..003}: /dev/sda 558GB unused
        • herapost{004..006}: /dev/sda1 558GB used for swap (supposedly temporary)
        • herapost007: /dev/sda 558GB unused
        • herapost008: /dev/sda1 447GB used for swap (supposedly temporary)
        • herapost009: /dev/sda1 238GB used for swap, 209GB unused
        • herapost010: /dev/sda1 447GB used for swap (supposedly temporary)
        • herapost{011..014}: /dev/sda1 931GB used for swap (supposedly temporary)
      • I can use something like /var/tmp/slurm as the default, and then things work pretty much as they do without job_container except that files in /tmp and /dev/shm are cleaned up when the job is over without needing the reaper script.  Then, if we want to override this and put /tmp on an actual block device, we can do that with a NodeName line.  Here is an example:
        • AutoBasePath=true
          BasePath=/var/tmp/slurm
          NodeName=testpost001 BasePath=/mnt/condor/slurm
      • Should test with glideins and other things before putting into production.
        • Worked with glide-ins once I switched the EXECUTE and SPOOL settings from referencing /lustre/aoc to /.lustre/aoc.  Apparently the automounter gets in the way of mount namespaces.
      • Test with interactive reservations.  If I log in via ssh to an interactive reservation, do I get the same namespaces for /tmp and /dev/shm?
        • No.  I see the real /tmp which means reaper will still be needed.
        • Is there a hack I can do to put a shell in a namespace like what I do with /etc/cgrules.conf?
          • Maybe.  There is /etc/security/namespace.conf
            • I would think this would work: "/tmp /var/tmp/slurm/132/.132 level:shared ~krowe", but it doesn't.  In fact, the shared option doesn't seem to work at all.  Perhaps a bug, or perhaps because I have SELinux disabled.
            • add ignore_instance_parent_mode to pam_namespace.so in /etc/pam.d/login
            • Even a simpler test like "/tmp /tmp-krowe/ level:shared root,adm" doesn't work.
            • At this point I don't think I can get job_container/tmpfs to work with nodescheduler.
            • I suppose we could use job_container/tmpfs and still run reaper only for interactive jobs, but I don't see an advantage in doing that, and I see a disadvantage in that it is more confusing for the admins.
    • Done: Figure out how many nodes may need to remain running Torque/Moab, how many can become Slurm, and how many are in question.  The goal is to move as many as possible before Sep. 30, 2022.  We should be able to move all but a few between 001 and 060.  We will need to work with jkern and akimball about moving 061 through 090.
      • Need to remain Torque until Aug. 2023
        • 120 cores for SSA.  I am thinking SSA needs about 30 cores of 48GB each, which is at least 3 512GB nodes across the batch and vlass queues.  Daniel and Jim think 5 nodes (120 cores) would be sufficient for SSA: one in the vlass queue and the rest in the batch queue.
        • 100 cores for batch.  Jun. 3, 2022: just checked the queue and there are about 100 cores in use for non-vlapipe batch jobs spread across 13 nodes.
      • Can become Slurm:
        • 120 cores are already Slurm for testing nmpost{035..039}
        • 360 cores for interactive.  We often have about 40 interactive jobs running, so at least 320 cores could be Slurm (default interactive jobs are 8 cores).  In fact we could start migrating interactive nodes to Slurm before the big switch by just switching to the slurm version of nodescheduler.
        • 360 cores for CASA Class.  The 15 nodes reserved for Data Reduction Class starting Oct. 3, 2022 could become Slurm nodes.  Then stay Slurm nodes after the class is over.
      • In question
        • jott vlasstest jobs
        • jtobin vlasstest jobs
        • 720 cores for VLASS nodes nmpost{061..090}.  Are some or all of these moving to HTCondor?  Ask Daniel and Amy.
      • Jul. 5, 2022: 10 batch-only nodes between 001..060 and 5 VLASS nodes between 061..090 (but maybe more).  That's 360 cores or 960 cores depending on VLASS.  I propose keeping nmpost{011..020} for the batch-only nodes as they are currently the batch-only nodes.  We still need to talk with VLASS and SSA to see when they want to migrate the VLASS nodes.
    • Done: Set a date to transition to Slurm.  This will probably be a partial transition where only a few nodes remain for SSA.  Preferably this will be before our 2022 Moab license expires on Sep. 30, 2022.
      • We usually start the process of renewing Moab in July and receive the new license in August.
      • SIW 2022 ends May 26, 2022 and uses 14 nodes.
      • Summer Students start May 17, 2022.  Don't know when they end.  It's at least 4 nodes.
      • Data Reduction Workshop (CASA Class) 2022 starts Oct. 2, 2022 and ends Oct. 20, 2022 and uses 16 nodes.
      • Jul. 5, 2022: We have announced that Jul. 12, 2022 is the date we start migrating from Torque to Slurm.
    • Done: Send email to all parties, especially the nm observer accounts, about the change.
      • Note that the submit script (which Lorant still uses) will be going away.
      • Make a draft message now with references to documentation and leave variables for dates.
    • Done: Upgraded Slurm to 22.05.2.  This has my MaxVMSize fix and allows a user to cancel all of their jobs with one simple command.
    • Submitting simple mpicasa jobs produces the following error.
      • mpirun has exited due to process rank 0 with PID 25000 on node testpost002 exiting improperly. There are three reasons this could occur:

      • I commented out reaper and still got the error.
    • Think about disabling the parsing of PBS options?  The only way I can see to do this is to add SBATCH_IGNORE_PBS=1 to all users' environments via something like /etc/profile.d/nrao-slurm.sh on the submit hosts (see the sketch after this list).
    • Document and perhaps script ways to see if your job is swapping.  Slurm doesn't seem to track this, which is really unfortunate (a rough cgroup-based sketch follows this list).
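
Example batch script for the MPI item above: a minimal multi-node mpicasa submission of the kind tested, relying on mpicasa picking up the allocation from the Slurm environment (no -n, no -machinefile).  The CASA install path, resource numbers, and script name are placeholders, not a tested recipe:

  #!/bin/sh
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=8
  #SBATCH --mem=16G
  #SBATCH --time=1-0:0:0
  # mpicasa reads the node/task layout from the Slurm environment,
  # so no -n and no -machinefile are passed here
  /path/to/casa/bin/mpicasa /path/to/casa/bin/casa --nogui -c my_script.py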
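
sacct example for the accounting item above: after a job finishes, a report of what was requested and used can be pulled from the accounting database.  The job ID and the exact set of fields shown are illustrative:

  sacct -j 12345 --format=JobID,JobName,Partition,ReqMem,MaxRSS,MaxVMSize,Elapsed,State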
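
Workaround for the vncserver item above, run inside the interactive session:

  # inside e.g. srun --mem=8G --pty bash -l
  unset XDG_RUNTIME_DIR
  vncserver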
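
Sketch for the PBS-option-parsing item above; the file name is the hypothetical one from that item and this has not been deployed:

  # /etc/profile.d/nrao-slurm.sh (submit hosts only)
  # tell sbatch to ignore #PBS directives in job scripts (same as sbatch --ignore-pbs)
  export SBATCH_IGNORE_PBS=1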
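
Rough sketch for the swap-detection item above, assuming the cgroup v1 memory controller with swap accounting enabled and the /sys/fs/cgroup/memory/slurm/uid_*/job_* layout mentioned earlier; an untested illustration, not a finished tool:

  #!/bin/sh
  # report swap charged to a running job's cgroup; run on the execution node
  jobid=$1
  for dir in /sys/fs/cgroup/memory/slurm/uid_*/job_"$jobid"; do
      [ -d "$dir" ] || continue
      # the "swap" counter in memory.stat is bytes of swap used by the cgroup
      swap=$(awk '$1 == "swap" {print $2}' "$dir/memory.stat")
      echo "job $jobid swap: $(( ${swap:-0} / 1048576 )) MiB"
  done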


Launch

Test


Later Clean


Enumerate all needs that SSA still has for Torque and help them transition to HTCondor.

Job Name | Directory | Queue | Memory
casa-imaging-pipeline.sh.vlass.*.* | /lustre/aoc/cluster/pipeline/vlass_prod/spool/quicklook/VLASS2.2* | vlass | 31gb
casa-calibration-pipeline.sh.vlass.*.* | /lustre/aoc/cluster/pipeline/vlass_prod/spool/se_calibration/VLASS2.1* | vlass | 24gb
casa-pipeline.sh | /lustre/aoc/cluster/pipeline/dsoc-prod/spool | batch | 48gb
create-component-list.sh.vlass.*.* | /lustre/aoc/cluster/pipeline/vlass_prod/spool/se_continuum_imaging/VLASS2.1* |  |
DeliveryJob.dsoc-prod.* | /lustre/aoc/cluster/pipeline/dsoc-prod/spool | batch | 48gb
data-fetcher.sh.dsoc-prod.* | /lustre/aoc/cluster/pipeline/dsoc-prod/spool | batch | 16gb
ingest.dsoc-dev.* | /lustre/aoc/cluster/pipeline/dsoc-dev/tmp/ProductLocatorWorkflowStartupTask_runVlbaReingestionWorkflow_* |  |
ingest.dsoc-test.* | /lustre/aoc/cluster/pipeline/dsoc-test/tmp/ProductLocatorWorkflowStartupTask_runVlbaReingestionWorkflow_* |  |
ingest.dsoc-prod.* |  | batch | 8gb
run-solr-indexer.sh.dsoc-test.* | /lustre/aoc/cluster/pipeline/dsoc-test/tmp/ProductLocatorWorkflowStartupTask_runSolrReindexCalForProject_* |  |
run-solr-indexer.sh.dsoc-prod.* | /lustre/aoc/cluster/pipeline/dsoc-prod/tmp/ProductLocatorWorkflowStartupTask_runSolrReindexProject_* | batch | 120gb
check-summer.sh | /lustre/aoc/cluster/pipeline/dsoc-prod/spool/1026791188 | batch | 48gb
PrepareWorkingDirectoryJob.dsoc-prod.* |  | batch | 18gb
ValidateUserDirectoryJob.dsoc-prod.* |  | batch | 18gb


Done


  • DONE: Sort out the various memory settings (ConstrainRAMSpace, ConstrainSwapSpace, AllowedSwapSpace, etc.)
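
For reference, those knobs live in cgroup.conf.  A sketch of the kind of combination that was considered; the values are illustrative, not necessarily what is deployed:

  # cgroup.conf (illustrative values)
  ConstrainRAMSpace=yes      # enforce the job's requested memory via the memory cgroup
  ConstrainSwapSpace=yes     # also constrain RAM+swap for the job
  AllowedRAMSpace=100        # percent of requested memory the job may use
  AllowedSwapSpace=0         # percent of requested memory allowed as additional swap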



References
