This document proposes a broad change to NRAO's computational approach with respect to CASA to improve overall efficiency. Industry strategies are commonly framed as either High Performance Computing (HPC) or High Throughput Computing (HTC). Neither approach is optimal for NRAO, so what is proposed here is a High Efficiency Computing approach that considers all aspects of the problem, including hardware configuration and performance, PI time, and the cost of hardware and software development. Because the goal is broad efficiency across multiple disjoint axes, there is no single end state being aimed at; rather, there is a finite series of stages, each achieving some incremental improvement in efficiency, with each subsequent stage going through a review and prototype process before commencing.
https://staff.nrao.edu/wiki/bin/view/NM/HTCondor#Conversion
https://staff.nrao.edu/wiki/bin/view/NM/Slurm
DONE: Queues: We want to keep the multiple queue functionality of Torque/Moab where, for example, HERA jobs go to hera nodes and VLASS jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
- Slurm
- Queues are called partitions. At some level they are called partitions in Torque as well.
- Job preemption is disabled by default.
- Allows for simple priority settings in partitions with the default PriorityType=priority/basic plugin.
- HERA
- Server: PartitionName=hera Nodes=herapost[001-010] Default=YES MaxTime=144000 State=UP
- User: #SBATCH -p hera
- VLASS/VLASSTEST
- Server: PartitionName=vlass Nodes=nmpost[061-090] MaxTime=144000 State=UP Priority=1000
- Server: PartitionName=vlasstest Nodes=nmpost[061-070] MaxTime=144000 State=UP
- User: #SBATCH -p vlass
- HTCondor
- HTCondor doesn't have queues or partitions like Torque/Moab or Slurm but there are still ways to do what we need.
- Constraints, custom ClassAds, and Ranks are one option. For example, HERA nodes could set the following in their configs
- HERA = True
- STARTD_ATTRS = $(STARTD_ATTRS) HERA
- START = ($(START)) && (TARGET.partition =?= "HERA")
- and users could set the following in their submit files
- Requirements = (HERA =?= True) or Requirements = (HERA == True). The difference is that =?= never evaluates to UNDEFINED when the HERA attribute is missing, while == does; in practice it probably doesn't matter here.
+partition = "HERA"
- We could do the same for VLASS/VLASSTEST but I don't know if HTCondor can prioritize VLASS over VLASSTEST the way we do with Moab. We could also do something like this for interactive nodes and nodescheduler if we end up using that.
- VLASS = True
- VLASSTEST = True
- STARTD_ATTRS = $(STARTD_ATTRS) VLASS VLASSTEST
- START = ($(START)) && ((TARGET.partition =?= "VLASS") || (TARGET.partition =?= "VLASSTEST"))
- and users could set the following in their submit files
- requirements = (VLASS =?= True) or requirements = (VLASSTEST =?= True)
+partition = "VLASS" or +partition = "VLASSTEST" depending on which they want
- Rank = (VLASS =?= True) + (VLASSTEST =!= True) would make VLASS jobs prefer dedicated VLASS nodes while still allowing them to run on unused VLASSTEST nodes.
- Using separate pools for things like HERA and VLASS is an option, but may be overkill as it would require separate Central Managers.
- HTCondor does support accounting groups, which may work like queues (see the sketch at the end of this section).
- Because of the design of HTCondor there isn't a central place to define the order and "queue" of nodes like there is in Torque.
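- A rough sketch of how accounting groups with quotas could look, assuming we go that route (the group names and quota numbers below are illustrative, not a tested config):
# Central Manager config (quota values are examples)
GROUP_NAMES = group_vlass, group_vlasstest
GROUP_QUOTA_group_vlass = 30
GROUP_QUOTA_group_vlasstest = 10
GROUP_ACCEPT_SURPLUS = True
# User submit file
accounting_group = group_vlass
accounting_group_user = krowe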
DONE: Access: Would like to prevent users from being able to login to nodes unless they have a proper reservation. Right now we restrict access via /etc/security/access.conf and use Torque's pam_pbssimpleauth.so to allow access for any user running a job.
- Slurm
- Has a pam_slurm.so module which does seem to work like the pam_pbssimpleauth.so module.
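- A minimal sketch of how this would hook into PAM (the exact position in the sshd stack is an assumption and varies by OS release):
# /etc/pam.d/sshd on an execute node: deny ssh logins unless the user has a job on this node
account    required     pam_slurm.so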
- HTCondor
- How do we restrict access to condor nodes to only those users with valid jobs running?
- With the restrictions in access.conf, HTCondor can still run jobs as users like krowe2. I think this is because HTCondor doesn't use the login mechanism but just starts shells as the user.
DONE: Ability to run jobs remotely (AWS, CHTC, OSG, etc)
- Slurm
- I don't think we will need this ability with Slurm
- HTCondor
- We have successfully tested both condor_annex to AWS, and flocking to CHTC.
DONE: Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
- /etc/slurm/cgroup.conf
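- A minimal sketch of cgroup.conf and the matching slurm.conf knobs (assuming we want to constrain cores and memory; exact settings still to be reviewed):
# /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# /etc/slurm/slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup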
- HTCondor
- Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
DONE: Submit hosts: we may have several hosts that will need to be able to submit and delete jobs. (wirth, mcilroy, hamilton, etc)
- Slurm
- Slurm-20 requires systemd so hosts must be RHEL7 or later.
- HTCondor
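- A submit-only host should roughly just need a schedd pointed at the Central Manager, e.g. (the hostname below is a placeholder):
# /etc/condor/config.d/99-nrao on a submit host
CONDOR_HOST = nmpost-cm.aoc.nrao.edu
DAEMON_LIST = MASTER, SCHEDD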
DONE: Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm
- Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node
- HTCondor
- Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
DONE: Reservations: The ability to reserve nodes far in the future for things like CASA classes and SIW would be very helpful. It would need to prevent HTCondor from starting jobs on these nodes as reservation time approaches.
- Slurm
- scontrol create reservation starttime=now duration=5 nodes=testpost001 user=root
- scontrol create reservation starttime=2022-05-03T08:00:00 duration=21-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022
- scontrol show res The output of this is awkward to read. Hopefully there is a better way to see all the reservations.
- HTCondor
- There isn't a reservation feature in HTCondor. Since CHTC makes use of preemption, their nodes can be removed at almost any time without adversely affecting running jobs. Sadly NRAO cannot really use preemption.
DONE: Array jobs: Do we want to keep the Torque array job functionality?
- Slurm
- #SBATCH --array=0-3%2 This syntax is very similar to Torque.
- HTCondor
- To some extent, this isn't how HTCondor is meant to be used; on the other hand, DAGMan and the queue command can simulate it.
- queue 100 starts 100 copies of the job
- queue from seq 10 5 30 | will launch five jobs with $(item) set to 10, 15, 20, 25, 30
- queue item in 0, 1, 2, 3 is another example. I don't think you can do the % throttle feature with queue.
- You can throttle DAGMan jobs https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#throttling-nodes-by-category
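- A small sketch of the DAGMan equivalent of Torque's --array=0-3%2, assuming four copies of the same submit file (the file and category names are examples):
# casa.dag
JOB chunk0 casa.sub
JOB chunk1 casa.sub
JOB chunk2 casa.sub
JOB chunk3 casa.sub
CATEGORY chunk0 casa
CATEGORY chunk1 casa
CATEGORY chunk2 casa
CATEGORY chunk3 casa
MAXJOBS casa 2
- Submitted with condor_submit_dag casa.dag; MAXJOBS limits the category to two jobs in the queue at a time.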
DONE: MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
- mpich2
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
- OpenMPI
- Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal.
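- A rough sketch of such a batch script (the core count, memory, and mpicasa invocation are examples, not a tested recipe):
#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4G
# mpicasa wraps mpirun; whether it needs an explicit -n under Slurm is untested
mpicasa casa --nogui -c my_pipeline.py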
- HTCondor
- Single-node MPI jobs do work in the Vanilla universe.
- Multi-node MPI jobs require setting up the Parallel universe (dedicated scheduler). But it might be best to tell users that want multi-node MPI to use Slurm and not HTCondor.
DONE: Preemption: While preemption can be useful in some circumstances, I expect we will want it disabled for the foreseeable future.
- Slurm
- The default is PreemptType=preempt/none, which means Slurm will not preempt jobs.
- HTCondor
- Setting a Machine Rank will cause jobs to be preempted https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
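- A sketch of keeping preemption off on the HTCondor side (these knob names are standard, but the values, especially the retirement time, are examples and should be reviewed before we rely on them):
# Central Manager / execute node config
PREEMPTION_REQUIREMENTS = False
RANK = 0
PREEMPT = False
KILL = False
MAXJOBRETIREMENTTIME = 3600 * 24 * 30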
Interactive: The ability to assign all or part of a node to a user with shell-level access (nodescheduler, qsub -I, etc.); the minimum granularity is a NUMA node, though finer would be useful. Because Slurm and HTCondor lack the ability to implement nodescheduler, my current thought is to ditch nodescheduler and just use the interactive commands that come with Slurm and HTCondor.
- nodescheduler: Was written before I understood what qsub -I did. Had I known, I might have argued for qsub -I instead of nodescheduler, as it is much simpler, is consistent with other Torque installations, and might have pushed some users toward batch processing, which is much more efficient.
- nodescheduler likes
- It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring something like screen, tmux, or vnc. It also means that users aren't all going through nmpost-master.
- Its creation is asynchronous: if the cluster is full you don't wait around for your reservation to start; you get an email message when it is ready.
- It's time limited (e.g. two weeks). We might be able to do the same with a queue/partition setting but could we then extend that reservation?
- We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes, which may be less efficient. Then again it may be more efficient. But either way I think nodescheduler is simpler for our users.
- nodescheduler dislikes
- With Torque/Moab, asking for a NUMA node doesn't work as I would like. Because of bugs and limitations, I still have to ask for a specific amount of memory. The whole point of asking for a NUMA node was that I didn't need to know the resources of a node ahead of time but could just ask for half of a node. Sadly, that doesn't work with Torque/Moab.
- Because of the way I maintain the cgroup for the user, with /etc/cgrules.conf, I cannot let a user have more than one nodescheduler job on the same node or it will be impossible to know which cgroup an ssh connection should use. The interactive commands (qsub -I, etc) don't have this problem.
- Slurm - I think the biggest blocker here is that Slurm doesn't have an equivalent to naccesspolicy=uniqueuser.
- srun -p interactive --pty bash This logs the user into an interactive shell on a node with defaults (1 core, 1 GB memory) in the interactive partition.
- NUMA I don't see how Slurm can reserve NUMA nodes so we may have to just reserve X tasks with Y memory.
- naccesspolicy=uniqueuser I don't know how to keep Slurm from giving a user multiple portions of the same host. With Moab I used naccesspolicy=uniqueuser which prevents the ambiguity of which ssh connection goes to which cgroup. I could have nodescheduler check the nodes and assign one that the user isn't currently using but this is starting to turn nodescheduler into a scheduler of its own and I think may be more complication than we want to maintain.
- cgrules Slurm has system-level prolog/epilog functionality that should allow nodescheduler to set /etc/cgrules.conf, but pam_slurm_adopt.so pretty much removes the need for /etc/cgrules.conf.
- PAM The pam_slurm.so module can be used without modifying systemd and will block users that don't have a job running from logging in. The pam_slurm_adopt.so module requires removing some pam_systemd modules and does what pam_slurm.so does, plus it puts the user's login shell in the same cgroup as the Slurm job expected to run the longest, which would replace my /etc/cgrules.conf hack. This still doesn't solve the problem of multiple interactive jobs by the same user on the same node. Removing the pam_systemd.so module prevents the creation of things like /run/user/<UID>, XDG_RUNTIME_DIR, and XDG_SESSION_ID, which breaks VNC. So we may want to use just pam_slurm.so and not pam_slurm_adopt.so.
- nodeextendjob Can Slurm extend the time limit of an interactive job? Yes: scontrol update TimeLimit=7 JobId=489 sets that job's TimeLimit to seven minutes.
- HTCondor
- condor_submit -i This logs the user into an interactive shell on a node with defaults (1 core equivalent, 0.5 GB memory).
- NUMA I don't see how HTCondor can reserve NUMA nodes so we may have to just reserve X tasks with Y memory.
- naccesspolicy=uniqueuser I don't think I need to worry about giving a user multiple portions of the same host if we are using condor_ssh_to_job.
- cgrules I don't know if HTCondor has the prologue/epilogue functionality to implement my /etc/cgrules.conf hack.
- PAM How can we allow a user to login to a node they have an interactive job running on via nodescheduler? With Torque or Slurm there are PAM modules but there isn't one for HTCondor.
- Could run a sleep job just like we do with Torque and use condor_ssh_to_job which seems to do X11 properly. We would probably want to make gygax part of the nmpost pool.
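- A sketch of such a placeholder job (the resource requests and the two-week sleep are examples):
# interactive.sub -- placeholder job to hold the slot (values are examples)
executable = /bin/sleep
arguments = 14d
request_cpus = 8
request_memory = 64GB
queue
- The user would then run condor_ssh_to_job <jobid> (with X11 forwarding) to get a shell inside that job's slot.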
- nodevnc
- Given the limitations of Slurm and HTCondor, and since we already recommend users run VNC on their interactive nodes, why don't we just provide a nodevnc script that reserves a node (via Torque, Slurm, or HTCondor), starts a VNC server, and then tells the user it is ready and how to connect to it? If someone still needs/wants just simple terminal access, then qsub -I or srun --pty bash or condor_submit -i might suffice. A rough sketch is at the end of this section.
- But of course there is another problem. This time it is because Slurm and HTCondor don't actually run /bin/login, so proper things like /run/user/<UID> are not created and vncserver fails with "Call to lnusertemp failed". Yet it works for krowe on rastan but not for krowe on testpost001 nor krowe2 on rastan (see below).
- NO: try /bin/bash -l
- NO: try --get-user-eval=L
- NO: try --get-user-eval=S
- Is this because I had to remove some systemd things to get pam_slurm_adopt working? Was one of those things what created /run/user/<UID>?
- Could I just create /run/user/<UID> in a prolog script?
- Yet it works for krowe on rastan using https://hpc-aub-users-guide.readthedocs.io/en/latest/octopus/interactive_job.html#create-vnc-configuration but not krowe on testpost001 nor krowe2 on rastan. This is probably because there is a /run/user/5213 on rastan that I created a while ago as a test. pam_systemd creates /run/user/<UID> when a user logs in except when I have removed pam_systemd in favor of pam_slurm_adopt.
- If I create /run/user/5213 on the host, then I can launch vnc via slurm.
- I need to test if I can run Xvfb without /run/user/<UID>. If I can't then using pam_slurm_adopt is a non-starter for us. A simple CASA test shows that xvfb-run works.
- While it doesn't break CASA, it pretty much does break VNC, which is also a non-starter for us.
- A work-around could be something like the following, but other things might break because of the missing /run/user/<UID>, and ${XDG_RUNTIME_DIR}/gvfs is actually a FUSE mount that reaper does not know how to unmount.
- mkdir /tmp/${USER}
- export XDG_RUNTIME_DIR=/tmp/${USER}
- Or using loginctl enable-linger and disable-linger in prolog/epilog scripts?
- Reading up on how pam_slurm_adopt works, it will probably never cooperate with systemd and is therefore a hack and not future-proof (see https://github.com/systemd/systemd/issues/13535). I am unsure how wise it is to start using pam_slurm_adopt in the first place.
- So if I don't install pam_slurm_adopt.so (which I only installed because it seemed better than my /etc/cgrules.conf hack, which in turn I only created for nodescheduler after we started using cgroups), then I think I can get nodevnc working as a pseudo-replacement for nodescheduler. If we use nodevnc and not nodescheduler (which we mostly can't use anyway), then we may not want the pam_slurm.so module either, so that users can't log in to nodes where they have jobs and use resources they aren't scheduled for.
- I think I should be able to get nodevnc working with HTCondor, except it is getting the same lnusertemp errors because /run/user/<UID> isn't being created. I will try upgrading HTCondor on the testpost cluster to see if that fixes it, though it may not solve the problem entirely. I submitted a job at CHTC (htcondor-8.9.11) and while it started Xvnc and vncserver -list showed a working display (:1), I couldn't connect to it with vncviewer, so I don't know whether it is actually working properly. I wouldn't see the lnusertemp error until I connected with vncviewer anyway, and CHTC doesn't seem to create /run/user/<UID>.
- Other options
- screen
- tmux
- others?
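- A rough nodevnc sketch for Slurm, combining the pieces above (the resource values, the fixed :1 display, and the mail step are all examples and untested):
#!/bin/sh
#SBATCH --time=14-0:0:0
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
# work-around from above for the missing /run/user/<UID>
mkdir -p /tmp/${USER}
export XDG_RUNTIME_DIR=/tmp/${USER}
# ":1" assumes the node is otherwise idle; a real script would pick a free display
vncserver :1
echo "VNC ready on $(hostname):1" | mail -s nodevnc ${USER}
# keep the job (and its cgroup) alive for the length of the reservation
sleep 14d
- The nodevnc wrapper would submit this with sbatch and the user would get the mail when the server is up.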
Reaper: Clean nodes of unwanted files, dirs and procs. I don't think HTCondor will need this.
- Slurm - pam_slurm_adopt seems to kill errant procs, but we will still need a reaper script for files/dirs.
- If I run vncserver via Torque, my reaper script has to kill a bunch of processes when the job is done. But when I run vncserver via Slurm those processes remain. So we will need some sort of reaper-type script for Slurm.
- There is https://slurm.schedmd.com/pam_slurm_adopt.html that tracks and kills errant processes but it conflicts with systemd and therefore requires some special installation instructions.
- Aha. I may have it working. You have to add PrologFlags=contain to both the client and server slurm.conf files.
- But it doesn't delete files or directories from /tmp, /var/tmp, or /dev/shm when the job ends.
- I will have to write a reaper script for files/dirs to use in Slurm system epilogs.
- Feb. 22, 2021 krowe: I think I have a working reaper script for slurm. (/users/krowe/reaper/slurm/slurm_reaper.py) It needs more testing.
- HTCondor - Doesn't seem to need file/dirs nor proc reaped.
- Seems to handle /tmp, /var/tmp, and /dev/shm properly because it uses fake versions of these dirs for each job.
- It seems to handle errant processes as well.
- There is also condor_preen that cleans condor directories like /var/lib/condor/spool/...
Reaper: Cancel jobs when accounts are closed.
- This could be a cron job on the Central Manager that looks at the owners of all jobs and removes the jobs of any user whose account is no longer active.
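- A sketch of such a cron job, assuming a single submit host (is_active_account is a placeholder for whatever account check we end up using):
#!/bin/sh
# remove all jobs belonging to owners whose accounts are closed
# is_active_account is hypothetical and stands in for the real account check
for owner in $(condor_q -allusers -autoformat Owner | sort -u); do
    is_active_account "$owner" || condor_rm "$owner"
done
- A pool-wide version would need condor_q -global and condor_rm -name <schedd>.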
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default.
- Slurm
- The order of the nodes in PartitionName is not important, but you can set a Weight on each NodeName. Nodes with the lowest weight will be chosen first. The default is 1 and the value must be an integer.
NodeName=testpost001 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=193370 Weight=2
NodeName=testpost002 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=193370 Weight=1
- HTCondor
- There isn't a simple list like pbsnodes in Torque but there is NEGOTIATOR_PRE_JOB_RANK which can be used to weight nodes by cpu, memory, etc.
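- A sketch of weighting faster/larger machines first (which machine attributes and weights to use is an open question; these are examples):
# Central Manager config
NEGOTIATOR_PRE_JOB_RANK = (1000000 * Mips) + Memory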
- Run both Slurm and HTCondor on the same nodes
- Slurm starts and stops condor. CHTC does this because their HTCondor can preempt jobs. So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.
- https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToScavengeCycles
- I think we should just try to keep the clusters separate until there is a need to combine them.
- Server Layout: When I initially installed Torque/Maui I wanted a "secret server" that users would not log in to; it ran the important services and therefore would not suffer any user shenanigans like running out of memory or CPU. This is what nmpost-serv-1 is. Then I wanted a submit host that all users would use to submit jobs, so that if it suffered shenanigans, jobs would not be lost or otherwise interrupted. This is what nmpost-master is. This system has performed pretty well. For HTCondor, this setup seemed to map well to their ideas of execution host, submit host, and central manager. But if we want to do flocking or condor_annex, we will need both the submit host and central manager to have external IPs. This makes me question keeping the central manager and submit hosts as separate hosts.
- The central manager will need an external IP. Will it need access to Lustre?
- The submit host that can flock will need an external IP. It will need access to Lustre.
- The submit host that runs condor_annex will need an external IP. It will need access to Lustre.
https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.
Now that we have a list of requirements I think the next step is to create a step-by-step procedure document listing everything that needs to be done to migrate from Torque/Moab to HTCondor and perhaps also Slurm.
- define how SSA uses Torque/Moab and what they should do for Slurm