...

  • Pack Jobs: Place jobs on nodes so that as many nodes as possible are left idle and available for users with large memory and/or large core-count requirements.

    • Slurm has a sched/backfill plugin that backfills jobs, similar to Torque/Moab.
      • Add SchedulerType=sched/backfill to /etc/slurm/slurm.conf on the Management Node
    • HTCondor
      • Add NEGOTIATOR_DEPTH_FIRST = True to /etc/condor/config.d/99-nrao on the Central Manager
  • Reaper: Clean nodes of unwanted files, directories, and processes.

    • Slurm

      • The pam_slurm_adopt.so PAM module is supposed to track and kill errant processes, but it conflicts with systemd and therefore requires some special tweaking.

    • HTCondor
      • HTCondor seems to handle /tmp and /var/tmp properly because it gives each job its own fake versions of these directories.
      • /dev/shm is still an issue, though.
      • What about errant processes?
  • Reaper: Cancel jobs when accounts are closed.
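
The /dev/shm concern noted under the first Reaper item could be handled with a small cleanup script. The following is only a minimal sketch, not something either scheduler provides: the function name `stale_shm_entries`, its parameters, and the one-hour default age are all hypothetical, and it assumes the caller can obtain the set of UIDs that currently have jobs on the node from the scheduler. It only reports candidates and deletes nothing.

```python
import os
import time
from pathlib import Path


def stale_shm_entries(shm_dir, active_uids, min_age_seconds=3600, now=None):
    """Return paths directly under shm_dir that look abandoned.

    An entry counts as stale when its owner's UID is not in active_uids
    and it was last modified more than min_age_seconds ago.
    All names and defaults here are illustrative assumptions.
    """
    now = time.time() if now is None else now
    stale = []
    with os.scandir(shm_dir) as entries:
        for entry in entries:
            try:
                # Do not follow symlinks: judge the entry itself.
                info = entry.stat(follow_symlinks=False)
            except FileNotFoundError:
                continue  # entry vanished between listing and stat
            if info.st_uid in active_uids:
                continue  # owner still has a job on this node
            if now - info.st_mtime < min_age_seconds:
                continue  # too recent; may belong to a job that is just starting
            stale.append(Path(entry.path))
    return sorted(stale)
```

A reaper could call `stale_shm_entries("/dev/shm", active_uids)` from cron or a job epilog, with UID 0 and any service accounts included in `active_uids` so that system-owned segments are never listed, and log the result for a while before anyone decides to unlink anything.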

...