...
MPI: We have some users who run MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
  - mpich2
    - PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
    - PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
  - OpenMPI
    - Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal (see the sketch after this list).
- HTCondor
  - While there is a parallel universe for HTCondor, I think we will use Slurm for MPI jobs.
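As a sketch of the #SBATCH route mentioned under OpenMPI above (the task count, PATH, and program name are placeholders, not a tested recipe for our setup):

    #!/bin/sh
    #SBATCH --ntasks=8
    # Make the MPI launcher visible; adjust the path to whichever MPI flavor is installed.
    export PATH=${PATH}:/usr/lib64/mpich/bin
    # mpiexec should pick up the Slurm allocation and start one rank per task.
    mpiexec ./my_mpi_program

Submitted with sbatch from a submit host; an mpicasa job would presumably follow the same pattern with mpicasa in place of mpiexec.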
Cgroups: We will need the kind of protection cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
  - /etc/slurm/cgroup.conf (see the sketch after this list)
- HTCondor
  - Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
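A minimal sketch of the two files mentioned above (illustrative values, not tested settings for our nodes; the Slurm side would also need TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup in slurm.conf):

    # /etc/slurm/cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

    # /etc/condor/config.d/99-nrao (execute nodes)
    CGROUP_MEMORY_LIMIT_POLICY = hard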
Submit hosts: We may have several hosts that will need to be able to submit and delete jobs (wirth, mcilroy, hamilton, etc.).
- Slurm
  - Slurm-20 requires systemd so hosts must be RHEL7 or later.
- HTCondor
Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm has a sched/backfill plugin that backfills jobs similar to Torque/Moab.
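A sketch of the relevant slurm.conf lines (the SelectType/SelectTypeParameters choices are assumptions about how we would allocate cores and memory, not settled decisions):

    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory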
Reaper: Clean nodes of unwanted files, dirs and procs.
- HTCondor
  - Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
  - But /dev/shm is still an issue.
  - What about errant processes?
- Slurm
  - There is a pam_slurm_adopt.so PAM module that supposedly tracks and kills errant processes, but it conflicts with systemd and therefore requires some special tweaking.
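For reference, the usual pam_slurm_adopt recipe looks roughly like the lines below (the standard pattern from the Slurm documentation, untested here; the pam_systemd/systemd-logind interaction is the special tweaking referred to above):

    # slurm.conf: create the extern step that ssh sessions get adopted into
    PrologFlags=contain

    # /etc/pam.d/sshd: adopt incoming ssh sessions into the owning job's cgroup
    account    required     pam_slurm_adopt.so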
Reaper: Cancel jobs when accounts are closed.
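Both schedulers can do this from the command line, so something like the following (the username is a placeholder) could be folded into the account-closure procedure:

    scancel --user=jdoe    # Slurm: cancel all of the user's jobs
    condor_rm jdoe         # HTCondor: remove all of the user's jobs from the queue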
Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default. Can HTCondor do this?
- Slurm
  - The order of the nodes in PartitionName is not important, but you can set a Weight on each NodeName. Nodes with the lowest weight will be chosen first.
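In slurm.conf that would look something like this (node names, ranges, and weight values are placeholders):

    # Lower Weight = preferred, so give the faster nodes the lower weight.
    NodeName=nmpost[001-010] Weight=10
    NodeName=nmpost[011-020] Weight=100
    PartitionName=batch Nodes=nmpost[001-020] Default=YES State=UP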
Preemption: While preemption can be useful in some circumstances, I expect we will want it disabled for the foreseeable future.
- Slurm
  - The default is PreemptType=preempt/none, which means Slurm will not preempt jobs.
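To make that explicit in slurm.conf rather than relying on the default (a sketch with the same effect as the default):

    PreemptType=preempt/none
    PreemptMode=OFF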
Run both Slurm and HTCondor on the same nodes.
https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.