...
The Pilot job submitted to Slurm will use one of the following two options, depending on the results of my testing
echo 'CONDOR_CONFIG=/etc/condor/glidein-slurm.conf' > /var/run/condor/config
echo 'STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && size(ChildState) == 0 && Activity == "Idle" && size(ChildActivity) == 0 && (MyCurrentTime - EnteredCurrentActivity) > 600' > /var/run/condor/condor_config.local
echo 'MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0' >> /var/run/condor/condor_config.local
/usr/sbin/condor_master -f
rm -f /var/run/condor/condor_config.local
rm -f /var/run/condor/config
exit
or
echo 'CONDOR_CONFIG=/etc/condor/glidein-slurm.conf' > /var/run/condor/config
echo 'STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600' > /var/run/condor/condor_config.local
systemctl start condor
# loop until condor_startd is no longer a running process
while pgrep -x condor_startd > /dev/null ; do
    sleep 60
done
systemctl stop condor
rm -f /var/run/condor/condor_config.local
rm -f /var/run/condor/config
exit
If the Payload job is very small and exits before the Pilot job can start blocking on condor_startd, then the Pilot job may never end. So, it may need some code to exit after some amount of time if condor_startd hasn't been seen.
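Something like the following could handle that case. This is only a sketch for the systemd option above: the 600-second limit and 10-second poll interval are arbitrary example values, not tested numbers.
# Sketch: give condor_startd a bounded amount of time to appear before giving up.
waited=0
while ! pgrep -x condor_startd > /dev/null ; do
    if [ ${waited} -ge 600 ] ; then
        echo "condor_startd never appeared; exiting Pilot job" >&2
        break
    fi
    sleep 10
    waited=$((waited + 10))
done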
If the Pilot job starts condor_master itself, then I may not need to add the EnvironmentFile=-/var/run/condor/config line to the condor unit file.
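For the systemd option, that line would probably live in a drop-in something like the following; the drop-in path and filename are assumptions, not a tested setup.
# /etc/systemd/system/condor.service.d/glidein.conf  (hypothetical drop-in)
[Service]
EnvironmentFile=-/var/run/condor/config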
Factory
The factory process that watches the clusters and launches Pilot jobs should be a pretty simple cron job
PILOT_JOB=/lustre/aoc/admin/tmp/krowe/pilot.sh
idle_condor_jobs=$(condor_q -global -allusers -constraint 'JobStatus == 1' -format "%d\n" 'ServerTime - QDate' | sort -nr | head -1)
#krowe Jul 21 2021: when there are no jobs, condor_q -global returns 'All queues are empty'. Let's reset that.
if [ "${idle_condor_jobs}" = "All queues are empty" ] ; then
idle_condor_jobs=""
fi
# Is there at least one free node in Slurm?
free_slurm_nodes=$(sinfo --states=idle --Format=nodehost --noheader)

# Launch one Pilot job
if [ -n "${idle_condor_jobs}" ] ; then
    if [ -n "${free_slurm_nodes}" ] ; then
        if [ -f "${PILOT_JOB}" ] ; then
            sbatch --quiet ${PILOT_JOB}
        fi
    fi
fi

If jobs are waiting in the HTCondor cluster (perhaps only vlapipe jobs)
If nodes are available in the Slurm Cluster (If not perhaps send email)
Launch one Pilot job
Sleep some amount of time, presumably more than the time HTCondor takes to launch a job
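For the sleep step, the simplest approach may be to let cron provide the interval. A hypothetical crontab entry, where the factory script path and the 5-minute interval are assumptions, not tested values:
*/5 * * * * /lustre/aoc/admin/tmp/krowe/factory.sh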
Problems
Ideas
Instead of using systemd to start condor, I could run condor_master -f from the Pilot script. I can set both the STARTD and MASTER DAEMON_SHUTDOWN variables, which will cause condor_master to exit, and therefore I won't need to watch the condor_startd process. This may still cause weirdness with cgroups (the HTCondor processes being a subset of the Slurm processes) but I will have to try it to find out.
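One way to try it and find out would be to have the Pilot script dump the cgroup membership of condor_master after starting it. Just a diagnostic sketch, assuming condor_master is already running when it is called:
# Print which cgroup(s) condor_master landed in, to see whether it sits
# under the Slurm job's cgroup hierarchy.
pid=$(pgrep -x condor_master | head -1)
if [ -n "${pid}" ] ; then
    cat /proc/${pid}/cgroup
else
    echo "condor_master is not running" >&2
fi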
...