
Currently, the nmpost cluster is a mix of Torque/Moab (nmpost{001..090}) and HTCondor (nmpost{091..120}, devhost{001..002}).  Eventually we would like to replace Torque/Moab with Slurm: we think it can do most, if not all, of what Torque/Moab does, it is free, and it seems to be more commonly used these days.

We upgraded to Torque-6/Moab-9, and thus started having to pay for Torque/Moab, in 2018.  We did this because Torque-6 understands cgroups and NUMA nodes (although it doesn't handle NUMA nodes the way I would like it to).  Torque-6 was no longer compatible with the free scheduler Maui, which forced us to purchase the Moab scheduler.  Since then we have leveraged a couple of things Moab can do that Maui never could, like increasing the number of jobs the scheduler looks ahead to schedule.  This allows Moab to start reserving space for pending vlass jobs on vlasstest nodes, but it is not a critical requirement.  Largely, the win was cgroups for resource separation and NUMA nodes to double the number of interactive nodes.  Both only required the new version of Torque, which in turn required Moab, which in turn we had to pay for.  See what they did there?  You can read more about it at https://staff.nrao.edu/wiki/bin/view/DMS/SCGTorque6Moab9Presentation

Another option for replacing Torque/Moab, instead of Slurm, is OpenPBS, which seems to be the free version of PBS Pro maintained by Altair Engineering.  I have tested OpenPBS and found it lacking in a few important ways: it doesn't support setting a working directory like Torque does with -d or -w, and it has no PAM module allowing users to log in if they have an active job, which would make nodescheduler very hard to implement.

To Do

Prep

  • Upgrade testpost-master to RHEL7 so it can run Slurm
  • Upgrade nmpost-master to RHEL7 so it can run Slurm
  • Look at upgrading to the latest version of Slurm


Work

  • Port nodeextendjob to Slurm: scontrol update TimeLimit=+1-0:0:0 jobid=974
  • DONE: Port nodesfree to Slurm
  • DONE: Port nodereboot to Slurm: scontrol reboot ASAP reason=testing testpost001
  • Create a subset of testpost cluster that only runs Slurm for admins to test.
    • Install Slurm on testpost-serv-1, testpost-master, and OS image
    • Install Slurm reaper on OS image
  • Create a small subset of nmpost cluster that only runs Slurm for users to test.
    • Install Slurm on nmpost-serv-1, nmpost-master, herapost-master, and OS image
    • Install Slurm reaper on OS image
    • Need at least 4 nodes: batch, interactive, vlass/vlasstest, hera/hera-i
  • Identify stakeholders (e.g. operations, DAs, sci-staff, SSA, HERA, observers) and give them the chance to test Slurm and provide opinions
  • Implement the useful opinions
  • Set a date to transition the remaining cluster to Slurm, possibly before we have to pay for Torque again around Jun. 2022.
  • Do another pass on the documentation https://info.nrao.edu/computing/guide/cluster-processing
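
As a rough illustration of the nodeextendjob and nodereboot ports listed above, here is a minimal Python sketch that only builds the scontrol command lines and does not run them. The function names extend_job_cmd and reboot_node_cmd are hypothetical, not the names used in the real scripts, and the reboot command follows the argument order given in the scontrol man page (reboot, then ASAP, then reason=, then the node list).

```python
import shlex


def extend_job_cmd(job_id, extra="1-0:0:0"):
    """Build the scontrol call that adds `extra` (days-hours:minutes:seconds)
    to a job's time limit. Hypothetical helper, for illustration only."""
    return ["scontrol", "update", f"jobid={job_id}", f"TimeLimit=+{extra}"]


def reboot_node_cmd(node, reason):
    """Build the scontrol call that reboots `node` as soon as it is idle.
    Hypothetical helper, for illustration only."""
    return ["scontrol", "reboot", "ASAP", f"reason={reason}", node]


if __name__ == "__main__":
    # Print the commands instead of executing them, so this is safe to run
    # on a host without Slurm installed.
    print(shlex.join(extend_job_cmd(974)))
    print(shlex.join(reboot_node_cmd("testpost001", "testing")))
```

To actually run one of these, the argument list could be handed to subprocess.run(cmd, check=True) on a host where scontrol is available and the caller has the needed Slurm privileges.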


Launch


Clean

  • Remove nodefindfphantoms
  • Remove nodereboot and associated cron job on servers

Done


  • DONE: Sort out the various memory settings (ConstrainRAMSpace, ConstrainSwapSpace, AllowedSwapSpace, etc)
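
For context, the memory settings named above are Slurm cgroup.conf parameters. The fragment below is only a sketch of how they fit together. The values are illustrative assumptions, not the values chosen for nmpost, and it assumes the task/cgroup plugin is enabled in slurm.conf.

```
# cgroup.conf (illustrative values only, not the nmpost settings)
ConstrainRAMSpace=yes     # limit each job's RAM to what it was allocated
ConstrainSwapSpace=yes    # also limit the job's combined RAM+swap usage
AllowedSwapSpace=0        # extra swap allowed, as a percent of allocated memory
```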



References
