...

  • Node priority: With Torque/Moab we can control the order in which the scheduler picks nodes by altering the order of the nodes file.  This allows us to run jobs on the faster nodes by default.

    • Slurm
      • I don't know how to set priorities in Slurm the way we do in Torque, where batch jobs get the faster nodes and interactive jobs get the slower nodes.  There is a Weight option on a NodeName (the lowest weight is chosen first), but that affects the batch and interactive partitions equally; I need another axis.  Actually, this might work at least for hera and hera-jupyter.
        • NodeName=herapost[001-007] Sockets=2 CoresPerSocket=8 RealMemory=193370 Weight=10

          NodeName=herapost011 Sockets=2 CoresPerSocket=10 RealMemory=515790 Weight=1

          PartitionName=batch Nodes=herapost[001-007] Default=YES MaxTime=144000 State=UP

          PartitionName=hera-jupyter Nodes=ALL MaxTime=144000 State=UP


      • The order in which the nodes are defined in slurm.conf has no bearing on which node the scheduler chooses.
      • Perhaps I can use an sbatch option in nodescheduler to choose slower nodes first (see the Feature/--constraint sketch below this list).
      • Perhaps use GRES to set a resource like KLARNS on the various node types (Gold 6135, E5-2400, etc.).  The slower the node, the more KLARNS we assign it.  Then if Slurm assigns jobs to the nodes with the most KLARNS, we can use that to select the slowest nodes first.  Hinky?  You betcha.
    • HTCondor
      • There isn't a simple node list like pbsnodes in Torque, but there is NEGOTIATOR_PRE_JOB_RANK, which can be used to weight nodes by CPU, memory, etc. (sketch below the list).
    • OpenPBS
      • Doesn't have a nodes file so I don't know what drives the order of the nodes chosen for jobs.
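
    A sketch of the sbatch-option idea from the Slurm bullets above: give each node class a Feature tag in slurm.conf and have nodescheduler pass --constraint for interactive jobs.  Feature and --constraint are standard Slurm; the labels "fastcpu"/"slowcpu" and which node gets which label are made up for illustration.

      # slurm.conf: add a Feature tag to each NodeName line (labels illustrative)
      NodeName=herapost[001-007] Sockets=2 CoresPerSocket=8 RealMemory=193370 Weight=10 Feature=fastcpu
      NodeName=herapost011 Sockets=2 CoresPerSocket=10 RealMemory=515790 Weight=1 Feature=slowcpu

      # nodescheduler (interactive case): ask for the slower hardware explicitly
      sbatch --constraint=slowcpu interactive-job.sh
      # batch submissions omit --constraint, so only Weight decides which node is picked first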

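    A minimal sketch of the HTCondor knob mentioned above.  NEGOTIATOR_PRE_JOB_RANK is a ClassAd expression evaluated against machine ads, and the highest value wins, so negating a speed attribute hands out the slower machines first.  This assumes benchmarks are enabled so Mips is published; any machine attribute (Memory, Cpus, ...) could be used the same way.

      # condor_config on the central manager
      NEGOTIATOR_PRE_JOB_RANK = 0 - Mips
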
...

  • Reaper: Cancel jobs when accounts are closed.

    • This could be a cron job on the Central Manager that looks at all the owners of jobs in the queue and removes the jobs of any user whose account is no longer active (sketch below).

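    A minimal sketch of the reaper, assuming "not active" means the account no longer resolves in the password database (NIS/LDAP/passwd).  condor_q -allusers, -autoformat, and condor_rm <owner> are standard HTCondor; everything else is illustrative.

      #!/usr/bin/env python3
      # reaper sketch: remove the jobs of owners whose accounts no longer exist
      import pwd
      import subprocess

      # every Owner with jobs in the queue, one per line
      out = subprocess.run(
          ["condor_q", "-allusers", "-autoformat", "Owner"],
          capture_output=True, text=True, check=True,
      ).stdout

      for owner in sorted(set(out.split())):
          try:
              pwd.getpwnam(owner)        # account still exists; leave the jobs alone
          except KeyError:
              # account is gone; remove all of that owner's jobs
              subprocess.run(["condor_rm", owner], check=False)

    Run it from cron on the Central Manager (or anywhere condor_q can see the whole queue).  If closed accounts linger in LDAP, the pwd lookup would need to be swapped for whatever actually marks an account inactive.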

  • Server Layout/Flocking:  When I initially installed Torque/Maui I wanted a "secret server" that users would not log in to, running the important services, and therefore not subject to user shenanigans like running out of memory or CPU.  This is what nmpost-serv-1 is.  Then I wanted a submit host that all users would use to submit jobs, so that if it suffered shenanigans, jobs would not be lost or otherwise interrupted.  This is what nmpost-master is.  This system has performed pretty well.  For HTCondor, this setup seemed to map well to their ideas of execution host, submit host, and central manager (see the config sketch below).  But if we want to do flocking or condor_annex, both the submit host and the central manager will need external IPs.  This makes me question keeping the central manager and submit host as separate hosts.
    • The central manager will need an external IP.  Will it need access to Lustre?
    • The submit host that can flock will need an external IP.  It will need access to Lustre.
    • The submit host that runs condor_annex will need an external IP.  It will need access to Lustre.

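    For reference, a sketch of how the current two-host split could be written in HTCondor configuration.  The "use ROLE" metaknobs, CONDOR_HOST, and FLOCK_TO are standard HTCondor config; the remote pool name is made up.

      # nmpost-serv-1: the "secret server" (central manager only, no user logins)
      CONDOR_HOST = nmpost-serv-1
      use ROLE : CentralManager

      # nmpost-master: the submit host users log in to
      CONDOR_HOST = nmpost-serv-1
      use ROLE : Submit
      # flocking would add something like this, and needs an external IP here and on the central manager
      FLOCK_TO = cm.other-pool.example.edu
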
...