...


  • SKIPDONE: Interactive: The ability to assign all or part of a node to a user with shell-level access (nodescheduler, qsub -I, etc.).  Minimal granularity is per NUMA node; finer would be useful.  Because Slurm and HTCondor lack the ability to implement nodescheduler, my current thought is to ditch nodescheduler and just use the interactive commands that come with Slurm and HTCondor.

    • nodescheduler: Was written before I understood what qsub -I did.  Had I known, I may have argued to use qsub -I instead of nodescheduler as it is much simpler, is consistent with other installations of Torque, and may have forced some users to use batch processing which is much more efficient.
    • nodescheduler: likes
      • It's not tied to any tty so a user can login multiple times from multiple places to their reserved node without requiring something like screen, tmux, or vnc.  It also means that users aren't all going through nmpost-master.
      • Its creation is asynchronous.  If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
      • It's time limited (e.g. two weeks).  We might be able to do the same with a queue/partition setting but could we then extend that reservation?
      • We get to define the shape of a reservation (whole node, NUMA node, etc).  If we just let people use qsub -I they could reserve all sorts of sizes which may be less efficient.  Then again it may be more efficient.  But either way I think nodescheduler is simpler for our users.
    • nodescheduler: dislikes
      • With Torque/Moab, asking for a NUMA node doesn't work as I would like.  Because of bugs and limitations, I still have to ask for a specific amount of memory.  The whole point of asking for a NUMA node was that I didn't need to know the resources of a node ahead of time but could just ask for half of a node.  Sadly, that doesn't work with Torque/Moab.
      • Because of the way I maintain the cgroup for the user, with /etc/cgrules.conf, I cannot let a user have more than one nodescheduler job on the same node or it will be impossible to know which cgroup an ssh connection should use (see the cgrules.conf sketch after this list).  The interactive commands (qsub -I, etc) don't have this problem.
    • Slurm
      • srun --pty bash: This logs the user into an interactive shell on a node with defaults (1 core, 1 GB memory).
      • Slurm has system-level prolog/epilog functionality that should allow nodescheduler to set /etc/cgrules.conf.
      • I don't see how Slurm can reserve NUMA nodes, so we may have to just reserve X tasks with Y memory (a sketch of what such a request might look like is after this list).
      • I don't know how to keep Slurm from giving a user multiple portions of the same host.  With Moab I used naccesspolicy=uniqueuser which prevents the ambiguity of which ssh connection goes to which cgroup.  I could have nodescheduler check the nodes and assign one that the user isn't currently using, but this is starting to turn nodescheduler into a scheduler of its own and I think it may be more complication than we want to maintain.
        • One method would be to request a node, check if that node is already running one of the user's jobs, and if so release it and request another.
    • HTCondor
      • condor_submit -i: This logs the user into an interactive shell on a node with defaults (1 core equivalent, 0.5 GB memory).  A sketch of a sized-up request is after this list.
      • I don't see how HTCondor can reserve NUMA nodes so we may have to just reserve X tasks with Y memory.
      • Could run a sleep job just like we do with Torque and use condor_ssh_to_job which seems to do X11 properly.  We would probably want to make gygax part of the nmpost pool.
      • I don't think I need to worry about giving a user multiple portions of the same host if we are using condor_ssh_to_job.
      • How can we allow a user to login to a node they have an interactive job running on via nodescheduler?  With Torque or Slurm there are PAM modules but there isn't one for HTCondor.
    • nodevnc
      • Given the limitations of Slurm and HTCondor, and that we already recommend users use VNC on their interactive nodes, why don't we just provide a nodevnc script that reserves a node (via Torque, Slurm, or HTCondor), starts a VNC server, and then tells the user it is ready and how to connect to it?  (A rough sketch is after this list.)  If someone still needs/wants just simple terminal access, then qsub -I, srun --pty bash, or condor_submit -i might suffice.
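
    On the cgrules.conf point above: a minimal sketch of the kind of entry nodescheduler maintains (the username and destination path are placeholders, not our actual layout).  Because a rule matches on the user rather than on a job, two nodescheduler jobs by the same user on one node would hit the same rule, which is the ambiguity described above.

      # /etc/cgrules.conf   <user>  <controllers>  <destination>   (sketch only)
      jdoe    cpuset,memory    nodescheduler/jdoe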
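
    On the Slurm sizing point above: lacking a NUMA-node request, we would presumably ask for cores and memory explicitly.  A minimal sketch of a half-node-ish, time-limited interactive request; the core count, memory, and two-week limit are placeholders, not a measured node shape.

      # roughly half a node, held for up to two weeks (numbers are placeholders)
      srun --cpus-per-task=8 --mem=64G --time=14-00:00:00 --pty bash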
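
    Likewise for HTCondor: condor_submit -i can take a small submit description file to size the slot beyond the defaults.  A sketch with placeholder numbers; the file name interactive.sub is made up.

      # interactive.sub  (sketch; no executable is needed for an interactive submit)
      request_cpus   = 8
      request_memory = 64 GB
      queue

    and then run condor_submit -i interactive.sub.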
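
    And to make the nodevnc idea concrete, a rough sketch of what such a script might submit, written as a Slurm batch job.  The sizing, display number, geometry, and mail delivery are all placeholders; a real version would need to pick a free display and handle the VNC password setup.

      #!/bin/sh
      #SBATCH --job-name=nodevnc
      #SBATCH --cpus-per-task=8
      #SBATCH --mem=64G
      #SBATCH --time=14-00:00:00
      # placeholder display and geometry; a real script would find a free display
      vncserver :1 -geometry 1280x1024
      echo "nodevnc ready: connect to $(hostname):1" | mail -s nodevnc ${USER}
      # vncserver forks, so hold the allocation open for the life of the reservation
      sleep 14d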

  • MPI: We have some users that use MPI across multiple nodes.  It would be nice to keep that as an option.

    • Slurm
      • mpich2
        • PATH=${PATH}:/usr/lib64/mpich/bin salloc --ntasks=8 mpiexec mpiexec.sh
        • PATH=${PATH}:/usr/lib64/mpich/bin salloc --nodes=2 mpiexec mpiexec.sh
      • OpenMPI
        • Use #SBATCH to request a number of tasks (cores) and then run mpiexec or mpicasa as normal (a sketch is after this list).
    • HTCondor
      • Single-node MPI jobs do work in the Vanilla universe.
      • Multi-node MPI jobs require the creation of a Parallel universe.  But it might be best to tell users that want multi-node MPI to use Slurm and not HTCondor.
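
    To make the OpenMPI bullet above concrete, a minimal sketch of such a batch script.  The task count, memory, and program name are placeholders; mpicasa would be invoked the same way.

      #!/bin/sh
      #SBATCH --ntasks=16
      #SBATCH --mem-per-cpu=2G
      # with a Slurm-aware OpenMPI, mpiexec picks up the allocation and launches
      # ranks across however many nodes Slurm handed us
      mpiexec ./my_mpi_program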


  • Reaper: Clean nodes of unwanted files, dirs and procs.  Condor seems to handle /tmp, /var/tmp, and /dev/shm (as of 8.9.7) properly because it uses fake versions of these dirs for each job.  What about errant processes?

    • Slurm

      • There is pam_slurm_adopt.so, which supposedly tracks and kills errant processes, but it conflicts with systemd and therefore requires some special tweaking (a sketch of the pieces involved is after this list).

    • HTCondor
      • Seems to handle /tmp and /var/tmp properly because it uses fake versions of these dirs for each job.
      • What about errant processes?
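
    On the pam_slurm_adopt bullet above, a sketch of the pieces reportedly involved; the exact PAM stacking and the systemd workaround would need testing on our images, so treat this as a note rather than a recipe.

      # /etc/slurm/slurm.conf: create an "extern" step for ssh sessions to be adopted into
      PrologFlags=contain

      # /etc/pam.d/sshd: deny ssh to users with no job on the node, adopt the session otherwise
      -account    required    pam_slurm_adopt.so

    The systemd conflict mentioned above is reportedly about pam_systemd placing the ssh session in its own user slice, which would pull it back out of the job's cgroup.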

...