...
DONE: Queues: We want to keep the multiple queue functionality of Torque/Moab where, for example, HERA jobs go to hera nodes and VLASS jobs go to vlass nodes. We would also like to be able to have vlasstest jobs go to the vlass nodes with a higher priority without preempting running jobs.
- Slurm
- Queues are called partitions. At some level they are called partitions in Torque as well.
- Job preemption is disabled by default
- Allows for simple priority settings in partitions with the default PriorityType=priority/basic plugin.
- HERA
- Server: PartitionName=hera Nodes=herapost[001-010] Default=YES MaxTime=144000 State=UP
- User: #SBATCH -p hera
- VLASS/VLASSTEST
- Server: PartitionName=vlass Nodes=nmpost[061-090] MaxTime=144000 State=UP Priority=1000
- Server: PartitionName=vlasstest Nodes=nmpost[061-070] MaxTime=144000 State=UP
- User: #SBATCH -p vlass
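- For completeness, a minimal user-side sketch (the script and executable names, time limit, and memory request below are made up):
#!/bin/sh
#SBATCH -p vlasstest
#SBATCH --time=01:00:00
#SBATCH --mem=8G
./run_vlasstest_job
- Submit with "sbatch vlasstest.sh" and check the partition settings with "scontrol show partition vlasstest".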
- HTCondor
- HTCondor doesn't have queues or partitions like Torque/Moab or Slurm but there are still ways to do what we need.
- Using constraints, custom ClassAds, and Rank expressions is one option. For example, HERA nodes could set the following in their configs
- HERA = True
- STARTD_ATTRS = $(STARTD_ATTRS) HERA
START = ($(START)) && (TARGET.partition =?= "HERA")
- and users could set the following in their submit files
- Requirements = (HERA =?= True) or Requirements = (HERA == True). The difference (=?= never evaluates to UNDEFINED when HERA is missing) probably doesn't matter here.
+partition = "HERA"
- We could do the same for VLASS/VLASSTEST but I don't know if HTCondor can prioritize VLASS over VLASSTEST the way we do with Moab. We could also do something like this for interactive nodes and nodescheduler if we end up using that.
- VLASS = True
- VLASSTEST = True
- STARTD_ATTRS = $(STARTD_ATTRS) VLASS VLASSTEST
START = ($(START)) && (TARGET.partition =?= "VLASS")
- and users could set the following in their submit files
- Requirements = (VLASS =?= True) or Requirements = (VLASSTEST =?= True)
+partition = "VLASS" or +partition = "VLASSTEST" depending on which they want
- Rank = (VLASS =?= True) + (VLASSTEST =!= True) if they want VLASS jobs to prefer dedicated VLASS nodes but still run on unused VLASSTEST nodes.
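- Note that on nodes meant to accept both VLASS and VLASSTEST jobs, the START expression would presumably need to allow either value, e.g. (untested sketch):
START = ($(START)) && ((TARGET.partition =?= "VLASS") || (TARGET.partition =?= "VLASSTEST"))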
- Using separate pools for things like HERA and VLASS is an option, but may be overkill as it would require separate Central Managers.
- HTCondor does support accounting groups that may work like queues.
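- As a sketch of how accounting groups might map onto our queues (the group names, quotas, and username below are made up), the central manager could define group quotas
GROUP_NAMES = group_vlass, group_vlasstest
GROUP_QUOTA_group_vlass = 240
GROUP_QUOTA_group_vlasstest = 80
GROUP_ACCEPT_SURPLUS = True
- and users would tag their jobs in the submit file
accounting_group = group_vlass
accounting_group_user = krowe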
- Because of the design of HTCondor there isn't a central place to define the order and "queue" of nodes like there is in Torque.
- OpenPBS
- VLASS/VLASSTEST
- qmgr -c 'create resource qlist type=string_array, flag=h'
- Set resources: "ncpus, mem, arch, host, vnode, qlist" in sched_priv/sched_config and restart pbs_sched
- qmgr -c 'set queue vlass default_chunk.qlist = vlass'
- qmgr -c 'set queue vlasstest default_chunk.qlist = vlasstest'
- qmgr -c 'set node acedia resources_available.qlist = vlass'
- qmgr -c 'set node rastan resources_available.qlist = "vlass,vlasstest"'
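- Users would then submit to the queue as usual, e.g. (resource request and script name made up): qsub -q vlasstest -l select=1:ncpus=4:mem=8gb job.sh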
DONE: Access: We would like to prevent users from logging in to nodes unless they have a proper reservation. Right now we restrict access via /etc/security/access.conf and use Torque's pam_pbssimpleauth.so to allow access for any user running a job.
- Slurm
- Has a pam_slurm.so module which does seem to work like the pam_pbssimpleauth.so module.
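- A sketch of the sshd PAM stack on a compute node, analogous to the current pam_pbssimpleauth setup (exact placement within the existing stack would need testing):
account    sufficient   pam_slurm.so
account    required     pam_access.so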
- HTCondor
- How do we restrict access to condor nodes to only those users with valid jobs running?
- With the restrictions in access.conf, HTCondor can still run jobs as users like krowe2. I think this is because HTCondor doesn't go through the login mechanism (PAM) but just starts processes as the user.
- OpenPBS
- Doesn't come with a PAM module and the Torque PAM module doesn't work with OpenPBS.
- restrict_user and restrict_user_exceptions work in the mom_priv/config file but there is a max of 10 user exceptions. With a PAM module we could make as many exceptions as we like and can use groups and netgroups.
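- For example, mom_priv/config might contain something like the following (the exception list here is made up):
$restrict_user true
$restrict_user_exceptions root,nagios,ganglia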
- Slurm
...