I have an idea for how to make one OS image that can be used for both
the HTCondor cluster and the Slurm cluster, so that HTCondor jobs can
glide in to the Slurm cluster.
#
# CONDOR_CONFIG
#
The condor_startd uses the file named by the CONDOR_CONFIG environment
variable as its config file instead of the default
/etc/condor/condor_config, and exits with an error if there is a
problem reading that file.
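The same environment variable is honored by the HTCondor command-line
tools, which makes for a quick check of a candidate config before
pointing the daemons at it:
CONDOR_CONFIG=/etc/condor/condor_config condor_config_val -dump | head
CONDOR_CONFIG=/nosuchfile condor_config_val -dump   # errors out, as condor_startd would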
#
# DAEMON_SHUTDOWN
#
The condor_startd daemon will shut down gracefully and not be restarted
if the STARTD.DAEMON_SHUTDOWN expression evaluates to True. E.g.
STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (CurrentTime - EnteredCurrentActivity) > 600
https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html
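The attributes that expression depends on can be inspected for the
slots in the pool with something like (assuming a reachable collector):
condor_status -af:h Machine State Activity EnteredCurrentActivity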
#
# sysconfig
#
The condor.service unit in systemd reads /etc/sysconfig/condor as an
EnvironmentFile but does not run it through a shell, so command
substitution is never expanded. Adding something like the following to
/etc/sysconfig/condor won't work
CONDOR_CONFIG=$(cat /var/run/condor/config)
I could instead add a second EnvironmentFile like so
EnvironmentFile=-/etc/sysconfig/condor
EnvironmentFile=-/var/run/condor/config
where /var/run/condor/config sets CONDOR_CONFIG=/etc/condor/condor_config
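Since the packaged unit already reads /etc/sysconfig/condor, a systemd
drop-in only needs to add the second file (the drop-in name here is my
choice; EnvironmentFile entries accumulate, and a variable set in a
later file overrides an earlier one):
# /etc/systemd/system/condor.service.d/glidein.conf
[Service]
EnvironmentFile=-/var/run/condor/config
# then make systemd re-read the unit
systemctl daemon-reload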
But I can use this mechanism to keep HTCondor from starting, just like
I do with Torque and Slurm: set CONDOR_CONFIG=/dontstartcondor in
/etc/sysconfig/condor in the OS image and override it with a snapshot.
Then stop setting 99-nrao as a snapshot.
#
# OS image
#
All three schedulers (Torque, Slurm, HTCondor) will be configured to
start via systemd. The files pbs_mom, slurm, and condor in
/etc/sysconfig will be set such that all of these schedulers fail to
start on boot.
/etc/sysconfig/pbs_mom:
/etc/sysconfig/slurm:
/etc/sysconfig/condor: CONDOR_CONFIG=/nosuchfile
If any of these schedulers should start on boot, the
appropriate /etc/sysconfig file (pbs_mom, slurm, condor) will be
altered via a snapshot.
/etc/sysconfig/pbs_mom:
/etc/sysconfig/slurm:
/etc/sysconfig/condor: CONDOR_CONFIG=/etc/condor/condor_config
Change LOCAL_CONFIG_FILE in HTCondor to point at a file that will
contain the configuration needed for a Slurm node to run an HTCondor
Pilot job (e.g. STARTD.DAEMON_SHUTDOWN). That file will be created by
the Pilot job.
echo 'LOCAL_CONFIG_FILE = /var/run/condor/condor_config.local' >> /etc/condor/condor_config
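A quick way to confirm the new knob is picked up (assuming
condor_config_val is available on the node):
condor_config_val -verbose LOCAL_CONFIG_FILE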
The alternative was to make a complete copy of condor_config and all its
sub-config files into an /etc/condor/glidein-slurm.conf and add the
DAEMON_SHUTDOWN expression there as well. That seems dangerous to me
because the two config files could drift apart.
#
# Pilot Job
#
The Pilot job submitted to Slurm will do the following
echo 'CONDOR_CONFIG=/etc/condor/condor_config' > /var/run/condor/config
echo 'STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (CurrentTime - EnteredCurrentActivity) > 600' > /var/run/condor/condor_config.local
systemctl start condor
# loop until condor_startd is no longer a running process
while pgrep -x condor_startd > /dev/null; do sleep 60; done
systemctl stop condor
rm -f /var/run/condor/condor_config.local
rm -f /var/run/condor/config
exit
If the Payload job is very small and exits before the Pilot job can start
blocking on condor_startd, then the Pilot job may never end. So it may
need some code to give up after some amount of time if condor_startd
hasn't been seen.
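A minimal sketch of that guard, to run in place of the simple wait loop
above (the 10-minute and 60-second values are placeholders I picked,
not decided numbers):
# wait up to 10 minutes for condor_startd to be seen at all
seen=0
for i in $(seq 1 60); do
    if pgrep -x condor_startd > /dev/null; then seen=1; break; fi
    sleep 10
done
# if it was seen, wait for it to finish; otherwise fall through to cleanup
if [ "$seen" -eq 1 ]; then
    while pgrep -x condor_startd > /dev/null; do sleep 60; done
fi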