...

https://staff.nrao.edu/wiki/bin/view/NM/Slurm


Now that we have a list of requirements, I think the next step is to create a step-by-step procedure document listing everything that needs to be done to migrate from Torque/Moab to HTCondor and perhaps also Slurm.

To Do

  • Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes by altering the order of entries in the nodes file.  This lets us run jobs on the faster nodes by default.

    • Slurm
      • I don't know how to set priorities in Slurm the way we do in Torque, where batch jobs get the faster nodes and interactive jobs get the slower nodes.  NodeName does take a Weight option, and the node with the lowest weight is chosen first, but that affects the batch and interactive partitions equally; I need another axis.  Actually, this might work, at least for hera and hera-jupyter:
        • NodeName=herapost[001-007] Sockets=2 CoresPerSocket=8 RealMemory=193370 Weight=10

          NodeName=herapost011 Sockets=2 CoresPerSocket=10 RealMemory=515790 Weight=1

          PartitionName=batch Nodes=herapost[001-007] Default=YES MaxTime=144000 State=UP

          PartitionName=hera-jupyter Nodes=ALL MaxTime=144000 State=UP


      • The order in which the nodes are defined in slurm.conf has no bearing on which node the scheduler will choose, even though the man page for slurm.conf reads "the order the nodes appear in the configuration file".
      • Perhaps I can use some sbatch option in nodescheduler to choose slower nodes first.
      • Perhaps use Gres to set a resource like KLARNS on the various nodes (Gold 6135, E-5 2400, etc).  The slower the node, the more KLARNS we will assign it.  Then, if Slurm assigns jobs to the nodes with the most KLARNS, we can use that to select the slowest nodes first.  Hinky?  You betcha.
    • HTCondor
      • There isn't a simple list like the pbsnodes output in Torque, but there is NEGOTIATOR_PRE_JOB_RANK, which can be used to weight nodes by CPU, memory, etc.
    • OpenPBS
      • Doesn't have a nodes file, so I don't know what drives the order in which nodes are chosen for jobs.
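
The Gres idea above might look roughly like the following (a sketch, not tested; klarn is the made-up resource from the bullet, and the counts are illustrative).  One caveat: as far as I know, Slurm only guarantees that a node satisfies a --gres request, it does not prefer nodes with more of a resource, so nodescheduler would still need to pick the request size:

```
# slurm.conf (sketch) -- declare the fake resource and assign more of it
# to the slower nodes
GresTypes=klarn
NodeName=herapost[001-007] Gres=klarn:10
NodeName=herapost011 Gres=klarn:1

# gres.conf on each node (sketch) -- no device file, just a count
Name=klarn Count=10
```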
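
On the HTCondor side, a NEGOTIATOR_PRE_JOB_RANK expression along these lines (a sketch, not tested on our cluster) could steer matches toward faster machines using the standard Mips and Memory machine ClassAd attributes; the multiplier is arbitrary and just makes Mips dominate:

```
# condor_config on the central manager (sketch)
# Prefer machines with a higher benchmark score, then more memory.
NEGOTIATOR_PRE_JOB_RANK = (1000000 * Mips) + Memory
```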

...

https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.

...


Done

...