...
https://staff.nrao.edu/wiki/bin/view/NM/Slurm
Now that we have a list of requirements, I think the next step is to create a step-by-step procedure document listing everything that needs to be done to migrate from Torque/Moab to HTCondor and perhaps also Slurm.
...
- Flocking: When I initially installed Torque/Maui, I wanted a "secret server" that users would not log in to, which ran the important services and therefore would not suffer any user shenanigans like running out of memory or CPU. This is what nmpost-serv-1 is. I also wanted a submit host that all users would use to submit jobs, so that if it suffered shenanigans, jobs would not be lost or otherwise interrupted. This is what nmpost-master is. This system has performed pretty well. For HTCondor, this setup seemed to map well onto its roles of execution host, submit host, and central manager. But if we want to do flocking or condor_annex, both the submit host and the central manager will need external IPs. This makes me question keeping the central manager and the submit host as separate hosts.
- The central manager will need an external IP. Will it need access to Lustre?
- The submit host that can flock will need an external IP. It will need access to Lustre.
- The submit host that runs condor_annex will need an external IP. It will need access to Lustre.
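As a rough illustration of why both sides need to reach each other for flocking, here is a minimal Python sketch that renders the two HTCondor configuration knobs involved: FLOCK_TO on our submit host and FLOCK_FROM on the remote pool. The remote hostname is a made-up placeholder, the helper name is hypothetical, and the security/authorization settings that a real setup also needs are not shown.

```python
from typing import Dict, List


def flocking_config(local_submit_hosts: List[str],
                    remote_central_managers: List[str]) -> Dict[str, str]:
    """Render the flocking knobs for each side (hypothetical helper).

    FLOCK_TO goes in the config of our submit host and lists the central
    managers of the pools we want to flock to. FLOCK_FROM goes in the
    config of the remote pool and lists the submit hosts allowed to flock in.
    """
    return {
        "local_submit_host": "FLOCK_TO = " + ", ".join(remote_central_managers),
        "remote_pool": "FLOCK_FROM = " + ", ".join(local_submit_hosts),
    }


if __name__ == "__main__":
    # nmpost-master is our submit host; the remote name is an assumption.
    cfg = flocking_config(["nmpost-master"], ["cm.remote.example.org"])
    for where, line in cfg.items():
        print(f"# {where}\n{line}")
```

The submit host has to reach the remote central manager and execute hosts, which is where the external IP requirement comes from.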
https://open-confluence.nrao.edu/download/attachments/40537022/nmpost-slurm.conf?api=v2 is a proposed slurm.conf for our nmpost cluster.
Done
- DONE: Run both Slurm and HTCondor on the same nodes
- Slurm starts and stops HTCondor. CHTC does this because their HTCondor jobs can be preempted: when Slurm starts a job, it kills the condor startd, and any HTCondor jobs running there get preempted and probably restarted somewhere else.
- https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToScavengeCycles
- I think we should just try to keep the clusters separate until there is a need to combine them.
- Glidein to Slurm
- https://staff.nrao.edu/wiki/bin/view/NM/HTCondor-glidein
- 2021-08-02 krowe: I have a working OS image that can run Torque/Moab, Slurm, or HTCondor, depending on files in /etc/sysconfig. It allows HTCondor jobs to glide in to a Slurm cluster.
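The "Slurm starts and stops condor" idea above could be sketched roughly as follows. This is not CHTC's actual script, just a minimal Python sketch of what a Slurm prolog/epilog pair might call, assuming condor_off and condor_on are on the node's PATH and accept `-daemon startd` and `-fast` (check the flags against the installed HTCondor version). The function names are hypothetical, and the sketch does not handle several Slurm jobs sharing one node.

```python
import subprocess
from typing import List


def condor_startd_command(phase: str) -> List[str]:
    """Return the HTCondor command for a Slurm prolog or epilog phase."""
    if phase == "prolog":
        # A Slurm job is about to start: stop the startd quickly so any
        # HTCondor jobs on this node are evicted and can rerun elsewhere.
        return ["condor_off", "-fast", "-daemon", "startd"]
    if phase == "epilog":
        # The Slurm job has finished: let HTCondor use the node again.
        return ["condor_on", "-daemon", "startd"]
    raise ValueError(f"unknown phase: {phase!r}")


def run_phase(phase: str, dry_run: bool = True) -> int:
    """Run (or, in a dry run, only print) the command for the given phase."""
    cmd = condor_startd_command(phase)
    if dry_run:
        print("would run:", " ".join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run_phase("prolog")
    run_phase("epilog")
```

With dry_run left at its default, nothing is executed, so the sketch can be tried on a host without HTCondor installed.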
...