
  • MPI: We have some users who run MPI across multiple nodes. It would be nice to keep that as an option.

    • Slurm
      • mpich2
        • PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
        • PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
      • OpenMPI
        • Use #SBATCH directives to request a number of tasks (cores) and then run mpiexec or mpicasa as normal (see the sketch after this list).
    • HTCondor
      • While there is a parallel universe for HTCondor, I think we will use Slurm for MPI jobs.
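
    A minimal sbatch script sketch of the OpenMPI approach above; the script name, task count, time limit, and ./my_mpi_prog binary are illustrative placeholders, not our actual setup:

      #!/bin/sh
      #SBATCH --ntasks=8          # request 8 cores; Slurm may spread them across nodes
      #SBATCH --time=01:00:00     # wall-clock limit (placeholder value)
      # OpenMPI built with Slurm support reads the allocation from the
      # environment, so mpiexec needs no host list or -np argument.
      mpiexec ./my_mpi_prog

    Submit with "sbatch my_mpi_job.sh"; presumably mpicasa would be substituted for mpiexec the same way.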
  • Cgroups: We will need the kind of protection cgroups provide so that jobs can’t impact other jobs on the same node.

    • Slurm
      • /etc/slurm/cgroup.conf
    • HTCondor
      • Set CGROUP_MEMORY_LIMIT_POLICY = hard in /etc/condor/config.d/99-nrao on the execute nodes.
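
    A sketch of what /etc/slurm/cgroup.conf might contain; these are real cgroup.conf parameters, but the particular settings here are an assumption for illustration, not our proposed config:

      CgroupAutomount=yes       # mount cgroup subsystems if not already mounted
      ConstrainCores=yes        # confine each job to its allocated cores
      ConstrainRAMSpace=yes     # enforce the job's memory request
      ConstrainSwapSpace=yes    # keep jobs from escaping into swap

    For these to take effect, slurm.conf also needs TaskPlugin=task/cgroup.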
  • Pack Jobs: Put jobs on nodes efficiently such that as many nodes as possible are left idle and available for users with large-memory and/or large core-count requirements.

    • Slurm
      • Slurm has a sched/backfill plugin that backfills jobs, similar to Torque/Moab (see the sketch below).
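
    A sketch of the relevant slurm.conf scheduling lines; the parameters are real, but this particular combination is an illustrative assumption:

      SchedulerType=sched/backfill          # backfill small jobs into gaps left by big ones
      SelectType=select/cons_res            # allocate individual cores/memory, not whole nodes
      SelectTypeParameters=CR_Core_Memory   # treat both cores and memory as consumable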
  • Reaper: Clean nodes of unwanted files, dirs and procs.

    • HTCondor
      • Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
      • But /dev/shm is still an issue.
      • What about errant processes?
    • Slurm
      • There is pam_slurm_adopt.so, which supposedly tracks and kills errant processes, but it conflicts with systemd and therefore requires some special tweaking. See the epilog sketch below for the /dev/shm side.
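
    Neither scheduler cleans /dev/shm for us, so an epilog script is one likely approach. A minimal sketch, assuming the path /etc/slurm/epilog.sh and relying on SLURM_JOB_USER, which slurmd exports to the epilog environment; a production version should first check that the user has no other jobs still running on the node:

      #!/bin/sh
      # Hypothetical epilog, pointed to by Epilog=/etc/slurm/epilog.sh in slurm.conf.
      # Runs as root on each node after a job finishes; removes anything the
      # departing job's user left behind in /dev/shm.
      find /dev/shm -user "$SLURM_JOB_USER" -delete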
  • Reaper: Cancel jobs when accounts are closed.

  • Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes. This allows us to run jobs on the faster nodes by default. Can HTCondor do this?

    • Slurm
      • The order of the nodes in PartitionName is not important, but you can set a Weight on each NodeName. Nodes with the lowest weight are chosen first (see the sketch below).
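
    A sketch of node weighting in slurm.conf; the node names and hardware numbers are made up for illustration, not our nmpost inventory:

      # Lower Weight = scheduled first, so the fast nodes fill up by default.
      NodeName=fast[01-04] CPUs=32 RealMemory=256000 Weight=1
      NodeName=slow[01-04] CPUs=16 RealMemory=128000 Weight=10
      PartitionName=batch Nodes=fast[01-04],slow[01-04] Default=YES State=UP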


https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.