...
MPI: We have some users that use MPI across multiple nodes. It would be nice to keep that as an option.
- Slurm
- mpich2
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
- PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
- OpenMPI
- Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal (see the example batch script after this list).
- HTCondor
- Single-node MPI jobs should work in the Vanilla universe.
- Multi-node MPI jobs require the creation of a Parallel universe or using Slurm instead.
- Slurm
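A minimal batch-script sketch of the #SBATCH approach above; the task count, walltime, and program name are placeholders:
  #!/bin/sh
  #SBATCH --ntasks=8        # request 8 MPI tasks (cores)
  #SBATCH --time=01:00:00   # placeholder walltime
  # mpiexec picks up the task count and node list from the Slurm allocation
  mpiexec ./my_mpi_program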
DONE: Cgroups: We will need protection like what cgroups provide so that jobs can’t impact other jobs on the same node.
- Slurm
- /etc/slurm/cgroup.conf (example contents after this list)
- HTCondor
- Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
- Slurm
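A sketch of what /etc/slurm/cgroup.conf might contain; the exact options depend on the Slurm version, and slurm.conf also needs the cgroup plugins enabled (e.g. ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup):
  CgroupAutomount=yes
  ConstrainCores=yes        # confine jobs to their allocated CPUs
  ConstrainRAMSpace=yes     # confine jobs to their requested memory
  ConstrainSwapSpace=yes    # keep jobs from spilling into swap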
DONE: Submit hosts: we may have several hosts that will need to be able to submit and delete jobs. (wirth, mcilroy, hamilton, etc)
- Slurm
- Slurm-20 requires systemd, so hosts must be RHEL7 or later.
- HTCondor
- Slurm
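Roughly what a submit-only host needs (a sketch, not a full recipe): for Slurm, the cluster's slurm.conf and munge key plus the client commands; for HTCondor, a schedd but no startd, e.g.:
  # /etc/condor/config.d/99-nrao on a submit host (sketch)
  DAEMON_LIST = MASTER, SCHEDD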
DONE: Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.
- Slurm
- Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node (see the sketch after this list)
- HTCondor
- Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
- Slurm
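The Slurm side in context, as a sketch; SelectTypeParameters=CR_Pack_Nodes is an extra knob not mentioned above that, assuming the cons_res select plugin is in use, tells Slurm to fill nodes before spreading jobs out:
  # /etc/slurm/slurm.conf on the Management Node (sketch)
  SchedulerType=sched/backfill
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes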
Reaper: Clean nodes of unwanted files, dirs, and procs.
- Slurm
- There is pam_slurm_adopt.so, which supposedly tracks and kills errant processes, but it conflicts with systemd and therefore requires some special tweaking (see the PAM sketch after this list).
- HTCondor
- Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
- /dev/shm is now handled as of HTCondor-8.9.7
- What about errant processes?
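A sketch of the pam_slurm_adopt setup on an execute node; the special tweaking is mostly keeping pam_systemd/systemd-logind from claiming the ssh session before Slurm can adopt it:
  # /etc/pam.d/sshd on an execute node (sketch)
  account    required    pam_slurm_adopt.so
  # and comment out (or make optional) the pam_systemd session line in the
  # relevant PAM stack so it does not fight over the session, e.g.:
  #session   optional    pam_systemd.so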
...
Preemption: While preemption can be useful in some circumstances, I expect we will want it disabled for the foreseeable future.
- Slurm
- The default is PreemptType=preempt/none, which means Slurm will not preempt jobs.
- HTCondor
- Setting a Machine Rank will cause jobs to be preempted; see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers (a sketch of keeping preemption off follows this list)
- Run both Slurm and HTCondor on the same nodes
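A sketch of explicitly keeping preemption off on the HTCondor side (the Slurm default above already does this); PREEMPTION_REQUIREMENTS and NEGOTIATOR_CONSIDER_PREEMPTION belong on the Central Manager, RANK on the execute nodes:
  # /etc/condor/config.d/99-nrao (sketch)
  NEGOTIATOR_CONSIDER_PREEMPTION = False   # negotiator never preempts for user priority
  PREEMPTION_REQUIREMENTS = False          # belt and suspenders
  RANK = 0                                 # no machine rank, so no rank-based preemption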
- Server Layout: When I initially installed Torque/Maui I wanted a "secret server" that users would not log in to, which ran the important services and therefore would not suffer any user shenanigans like running out of memory or CPU. This is what nmpost-serv-1 is. Then I wanted a submit host that all users would use to submit jobs, so that if it suffered shenanigans, jobs would not be lost or otherwise interrupted. This is what nmpost-master is. This system has performed pretty well. For HTCondor, this setup seemed to map well to its ideas of execution host, submit host, and central manager. But if we want to do flocking or condor_annex, both the submit host and the central manager will need external IPs. This makes me question keeping the central manager and submit host as separate hosts.
- The central manager will need an external IP. Will it need access to Lustre?
- The submit host that can flock will need an external IP. It will need access to Lustre.
- The submit host that runs condor_annex will need an external IP. It will need access to Lustre.
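If flocking is pursued, the config is roughly as below; the remote pool name is hypothetical and both machines must be reachable on their external IPs:
  # On the submit host that flocks out (sketch)
  FLOCK_TO = cm.other-pool.example.edu
  # On the Central Manager of the pool being flocked to
  FLOCK_FROM = nmpost-master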
...