Shadow jobs and Lustre

We had some jobs get restarted because they lost contact with their shadow jobs.  I assume this is because the shadow jobs keep the condor.log file open, and if that file is on Lustre and Lustre goes down, then the shadow fails to communicate with the job and the job gets killed.  Does that seem accurate to you?

Docs wrong for evaluating ClassAds?

...

ANSWER: The first asterisk shouldn't be there.  This is a regex, not globbing.  Greg will look into updating this document.

Shutdown

STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600

MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0

But I was running a job when it shut down.

07/19/21 11:45:01 The DaemonShutdown expression "State == "Unclaimed" && Activity == "Idle" && (MyCurrentTime - EnteredCurrentActivity) > 600" evaluated to TRUE: starting graceful shutdown

Could this be because we use dynamic slots?

testpost-cm-vml krowe >condor_status
Name OpSys Arch State Activity LoadAv Mem

slot1@testpost001.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 193
slot1_1@testpost001.aoc.nrao.edu LINUX X86_64 Claimed Busy 0.000
slot1@testpost002.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 144
slot1_1@testpost002.aoc.nrao.edu LINUX X86_64 Claimed Busy 0.810 49
slot1@testpost003.aoc.nrao.edu LINUX X86_64 Unclaimed Idle 0.000 193

I see that with dynamic slots, the parent slot (slot1) seems always unclaimed and idle and the child slots (slot1_1) are Claimed and Busy.  So I tried checking the ChildState attribute, which looks to be a list but doesn't behave like one.  For example, none of these show any slots

condor_status -const 'ChildState == { "Claimed" }'

condor_status -const 'sum(ChildState) == 0'

Even though this produces true

classad_eval 'a = { }' 'sum(a) == 0'

ANSWER: Try this

condor_status -const 'size(ChildState) == 0'
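
A hedged follow-up sketch: since the parent (partitionable) slot stays Unclaimed/Idle even while its dynamic children are Claimed/Busy, one fix (untested, building on the size(ChildState) suggestion above) is to also require that the slot has no children before shutting down:

STARTD.DAEMON_SHUTDOWN = State == "Unclaimed" && Activity == "Idle" && size(ChildState) == 0 && (MyCurrentTime - EnteredCurrentActivity) > 600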

HTCondor and Slurm

NRAO has effectively two use cases:  1) Operations triggered jobs.  These are well formulated pipeline jobs; they're still fairly monolithic and long running (many hours to a few days).   2) User triggered jobs, which are of course not well formulated.  We will be moving the operations jobs to HTCondor.   We plan to move the user triggered jobs to Slurm from Torque.   There's enough noise in the two job loads that we don't want to have strict host carve-outs for type 1 and type 2 jobs.  What we anticipate doing is having a set of nodes known only to HTCondor for the bulk of operations and a set of hosts controlled by Slurm for the user-facing jobs.   Periodically, when there is a large set of operations jobs, we'd like for them to burst into the Slurm-controlled nodes.  We neither anticipate nor want the Slurm jobs to burst into the HTCondor set of nodes.

Say we have two clusters (HTCondor and Slurm) and both can be submitted to from the same host.  We want the HTCondor jobs to use the Slurm cluster resources when the HTCondor cluster resources are full, but we probably don't want to support preemption.  How could we have HTCondor submit jobs to a Slurm cluster?  (HTCondor-C, flocking, overlapping, batch-grid-type, HTCondor-CE, etc)

ANSWER: write our own 'factory' that watches HTCondor and, when it is full, submits Pilot jobs to Slurm that launch startd daemons, thus allowing the Payload jobs waiting in HTCondor to run.  We will want to set the startd to exit after being idle for a little while, run the Pilot job as root, and figure out how to do cgroups properly.
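
A very rough sketch of the startd side of such a Pilot job, assuming a personal-condor style install reachable from the Slurm nodes; CONDOR_HOST, paths, and the 600-second idle timeout are illustrative, and the run-as-root and cgroup details above are not addressed:

# condor_config.pilot
CONDOR_HOST = testpost-cm.aoc.nrao.edu
DAEMON_LIST = MASTER STARTD
LOCAL_DIR = /tmp/condor-pilot
STARTD_NOCLAIM_SHUTDOWN = 600
MASTER.DAEMON_SHUTDOWN = STARTD_StartTime == 0

#!/bin/sh
# pilot.sh, submitted with sbatch when the factory sees HTCondor is full
export CONDOR_CONFIG=$HOME/pilot/condor_config.pilot
exec condor_master -f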

Glidein

The only documentation I can find on glidein (https://htcondor.readthedocs.io/en/latest/grid-computing/introduction-grid-computing.html?highlight=glidein#introduction) seems to imply that glidein only works with Globus: "HTCondor permits the temporary addition of a Globus-controlled resource to a local pool. This is called glidein."  Is this correct?  Is there better documentation?  Is glidein even a technology or software package, or is it just a generic term?

ANSWER: Greg will look at re-writing this.

request_virtualmemory

If I set request_virtualmemory = 2G, condor_submit accepts it as a valid knob but the job stays idle and never runs.

request_memory = 1G
request_virtualmemory = 2G

If I set request_virtualmemory = 2000000, which should be the same as 2G, the job runs but doesn't set memory.memsw.limit_in_bytes in the cgroup.

ANSWER: krowe sent mail to Greg about it

Memory usage report

The memory usage report at the end of the condor log seems incorrect.  I can watch the memory.max_usage_in_bytes in the cgroup get over 8,400MB yet the report in the condor log reads 6,464MB.  Does the log only report the memory usage of the parent process and not include all the children?  Is it an average memory usage over time?

ANSWER: It is a report of a sum of certain fields in memory.stat in the cgroup.  Get Greg an example.  Try it on two machines in case this is a problem of re-using the same cgroup.  Or reboot and try again.
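
For comparison, the same fields can be read straight from the job's cgroup; a sketch assuming the default BASE_CGROUP of htcondor (the per-slot cgroup directory name varies by host and slot):

grep -E 'total_rss|total_cache|total_swap' /sys/fs/cgroup/memory/htcondor/*/memory.stat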

Answered Questions:

  • JOB ID question from Daniel
    • When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:

       - Are job IDs guaranteed to be unique in HTCondor?
       - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?

    • A Job ID (ClusterID.ProcID)
    • DNS name of the schedd and ctime of the job_queued.log file.
    • It is unique to a schedd.
    • We should talk with Daniel about this.  They should craft their own ID.  It could be seeded with a JobID but should not depend on it alone.
  • Upgrading HTCondor without killing jobs?
    • schedd can be upgraded and restarted without losing state, assuming the restart is less than the timeout.
    • currently restarting execute services will kill jobs.  CHTC is working on improving this.
    • negotiator and collector can be restarted without killing jobs.
    • CHTC works hard to ensure 8.8.x is compatible with 8.8.y or 8.9.x is compatible with 8.9.y.
  • Leaving data on execution host between jobs (data reuse)
    • Todd is working on this now.
  • Ask about installation of CASA locally and ancillary data (cfcache)
    • CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones)
    • There is another software filesystem where CASA could live that is more used for admin usage but might be available to us.
    • We could download the tarball each time over HTTP.  CHTC uses a proxy server so it would often be cached.
  • Environment:  Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
    • A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
    • WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node. Perhaps the cluster node is a different OS or architecture.
    • ANSWER: condor doesn't execute things with a shell.  You could set your executable as /bin/bash and then have the arguments be the executable you used to have.  I just changed our stuff to statically set $HOME and I think that is good enough.
  • Flocking: Suppose I have two hosts in the same pool.  testpost-master is a submit-host and testpost-serv-1 is both a submit-host and the central-manager.  testpost-serv-1 is configured to flock to CHTC but testpost-master is not. Is it possible to submit a job on testpost-master that will flock to CHTC by somehow leveraging testpost-serv-1?  In other words, do I have to setup flocking and an external IP on every submit host?
    • ANSWER: there isn't a good way to do this.  So eventually we will need to make testpost-master flock to CHTC and possibly remove the ability of testpost-serv-1 to flock.


  • It seems the transfer mechanism won't transfer symlinks to directories (e.g. data/vlass.ms → /lustre/aoc/...) Is there a way around this?
    • ANSWER: there is no flag to chase symlinks at the moment.  The top level dir (e.g. data) could be a symlink if transfer_input_files=data/ but it will then transfer the contents of data instead of data itself.
    • If symlink → data and transfer_input_files=symlink I get the error Transfer of symlinks to directories is not supported.
    • If symlink → ../data and transfer_input_files=symlink/ it transfers the contents, not the directory.  In other words, I don't have a data directory in scratch, I have a VLASS... directory.
    • If data/VLASS → /some/path/VLASS and transfer_input_files=data/VLASS/
    • If data/VLASS → /some/path/VLASS and transfer_input_files=data/ I get the error Transfer of symlinks to directories is not supported.

  • DAG log time stamps, is there a way to differentiate data import/export time and process run time.
    • Look in the job log file not the dag log file
    • 040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
              Transferring to host: <10.64.10.172:9618?addrs=10.64.10.172-9618&alias=nmpost072.aoc.nrao.edu&noUDP&sock=slot1_1_72656_7984_60>
      ...
      040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files

...

machine config / submit file

machine config:

HERA = True

STARTD_ATTRS = $(STARTD_ATTRS) HERA

START = ($(START)) && (TARGET.partition =?= "HERA")

submit file:

requirements = (HERA == True)

+partition = "HERA"

SOLUTION: yes, this is good.  Submit Transforms could also be set on herapost-master (Submit Host)

https://htcondor.readthedocs.io/en/latest/misc-concepts/transforms.html?highlight=submit%20transform
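
For reference, a minimal submit-transform sketch for herapost-master (the transform name is illustrative; see the URL above for the full syntax):

JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) HERA
JOB_TRANSFORM_HERA @=end
   DEFAULT partition "HERA"
@end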

Reservations

What if you know certain nodes will be unavailable for a window of time say the second week of next month.  Is there a way to schedule that in advance in HTCondor?  For example in Slurm

scontrol create reservation starttime=2021-02-8T08:00:00 duration=7-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022

ANSWER: HTCondor doesn't have a feature like this.
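
One hedged workaround, not a real reservation: push a time-windowed START expression to the affected nodes ahead of the maintenance window so they stop matching new jobs (the epoch values are illustrative); running jobs are not touched.

START = ($(START)) && (time() < 1644307200 || time() > 1644912000)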

Bug: All on one core

  • Bug where James's jobs are all put on the same core.  Here is top -u krowe showing the Last Used Cpu (SMP) after I submitted five sleep jobs to the same host.
    • Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
    • Is this a failure of the Linux kernel to schedule things on separate cores?
    • Is this because cpu.shares is set to 100 instead of 1024?
    • Check if CPU affinity is set in /proc/self/status
    • Is sleep cpu-intensive enough to properly test this?  Perhaps submit a while 1 loop instead?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22

From jrobnett@nrao.edu Tue Nov 10 16:38:18 2020

As (bad) luck would have it, I had some jobs running where I forgot to set the number of cores, so they triggered the behavior.

Sshing into the node I see three processes sharing the same core and the following for the 3 python processes:

bash-4.2$ cat /proc/113531/status | grep Cpus
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list:      0

If I look at another node with 3 processes where they aren't sharing the same core I see:

bash-4.2$ cat /proc/248668/status | grep Cpu
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00555555
Cpus_allowed_list:      0,2,4,6,8,10,12,14,16,18,20,22

Dec. 8, 2020 krowe: I launched five sqrt(rand()) jobs and each one landed on its own CPU. 

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0

and the masks aren't restricting them to specific cpus.  So I am yet unable to reproduce James's problem.

st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447

We can reproduce this without HTCondor.  So this is either being caused by our mpicasa program or the openmpi libraries it uses.  Even better, I can reproduce this with a simple shell script executed from two shells at the same time on the same host.  Another MPI implementation (mvapich2) didn't show this problem.


#!/bin/sh
export PATH=/usr/lib64/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:${LD_LIBRARY_PATH}
mpirun -np 2 /users/krowe/work/doc/all/openmpi/busy/busy


Array Jobs

Does HTCondor support array jobs like Slurm? For example in Slurm #SBATCH --array=0-3%2 or is one supposed to use queue options and DAGMan throttling?

ANSWER: HTCondor does reduce the priority of a user the more jobs they run so there may be less need of a maxjob or modulus option.  But here are some other things to look into.

https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#throttling-nodes-by-category

queue from seq 10 5 30 |

queue item in 1, 2, 3
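
For example, the Slurm --array=0-3%2 behavior (four tasks, at most two running at once) could be approximated with DAGMan category throttling; a sketch, with file names made up:

# array.dag
JOB T0 step.sub
JOB T1 step.sub
JOB T2 step.sub
JOB T3 step.sub
CATEGORY T0 array
CATEGORY T1 array
CATEGORY T2 array
CATEGORY T3 array
MAXJOBS array 2

Each node could get its index via a VARS line if the job needs it.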


combined cluster (Slurm and HTCondor)

Slurm starts and stops condor.  CHTC does this because their HTCondor can preempt jobs.  So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.
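
A sketch of one way to wire that up on the Slurm side (paths illustrative; not necessarily how CHTC actually does it):

# slurm.conf: run these on each compute node around every Slurm job
Prolog=/etc/slurm/stop_condor.sh
Epilog=/etc/slurm/start_condor.sh

# stop_condor.sh: kick the startd off the node, preempting any HTCondor jobs
condor_off -startd -fast
# start_condor.sh: let HTCondor claim the node again
condor_on -startd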


Node Priority

Is there a way to set an order in which nodes are picked first, or a weight system?  We want certain nodes to be chosen first because they are faster, or have less memory, or other such criteria.

NEGOTIATOR_PRE_JOB_RANK on the negotiator
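
A minimal sketch (the attribute name NodeWeight and the values are made up): advertise a weight from each execute node and have the negotiator prefer larger values.

# on each execute node
NodeWeight = 50
STARTD_ATTRS = $(STARTD_ATTRS) NodeWeight

# on the central manager
NEGOTIATOR_PRE_JOB_RANK = (NodeWeight ?: 0)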


HPC Cluster

Could I have access to the HPC cluster?  To learn Slurm.

ANSWER: https://chtc.cs.wisc.edu/hpc-overview  I need to login to submit2 first but that's fine.

How does CHTC keep shared directories (/tmp, /var/tmp, /dev/shm) clean with Slurm?

ANSWER: CHTC doesn't do any cleaning of shared directories.  But they suggested looking at https://derekweitzel.com/2016/03/22/fedora-copr-slurm-per-job-tmp/  I don't know if this plugin will clean files created by an interactive ssh, but I suspect it won't because it is a Slurm plugin and ssh'ing to the host is outside of the control of Slurm, except for the pam_slurm_adopt that adds you to the cgroup.  So I may still need a reaper script to keep these directories clean.


vmem exceeded in Torque

We have seen a problem in Torque recently that reminds us of the memory fix you recently implemented in HTCondor.  Was that fix related to any recent changes in the Linux kernel, or was it a pure HTCondor bug?  What was it that you guys did to fix it?

ANSWER: There are two problems here.  The first is the short read, for which we are still trying to understand the root cause.  We've worked around the problem in the short term by re-polling when the number of processes we see drops by 10% or more. The other problem is when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat.  Memory is categorized into a number of different types in this file, and we were omitting some types of memory when summing up the total.

cpuset issues

ANSWER: git bisect could be useful.  Maybe we could ask Ville.

Distant execute nodes

Are there any problems having compute nodes at a distant site?

ANSWER: no intrinsic issues.  Be sure to set requirements.
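
For example, the distant nodes could advertise a site attribute (same pattern as the HERA partition config above; the attribute name and value are illustrative) so jobs can be steered with requirements:

# config on the distant execute nodes
Site = "REMOTE"
STARTD_ATTRS = $(STARTD_ATTRS) Site

# submit file
requirements = (TARGET.Site == "REMOTE")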


Memory bug fix?

What version of condor has this fix?

ANSWER: 8.9.9

When is it planned for 8.8 or 9.x inclusion?

ANSWER: 9.0 in Apr. 2021

Globus

You mentioned that the globus RPMs are going away.  Yes?

ANSWER: They expect to drop globus support in 9.1 around May 2021.

VNC

Do you have any experience using VNC with HTCondor?

ANSWER: no, they don't have experience with this.  But mount_under_scratch= will use the real /tmp


Which hosts do the flocking?

Lustre is going to be a problem.  Our new virtual CMs can't see lustre.  Can just a submit host see lustre and not the CM in order to flock?

ANSWER: Only submit machines need to be configured to flock.  It goes from a local submit host to a remote CM.  So we could keep gibson as a flocking submit host.  This means the new CMs don't need the firewall rules.
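
The submit-side piece is just the FLOCK_TO list; a sketch for gibson (the CHTC central manager hostname here is a placeholder):

FLOCK_TO = $(FLOCK_TO) cm.chtc.wisc.edu

The remote pool also has to allow and authenticate this schedd on their side (their FLOCK_FROM and security configuration).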



Transfer Mechanism Plugin

  • Our environment has a complex network topology.   We have a prototype rsync plugin but may want to specify a specific network interface for a host as a function of where the execute host resides.
    • Do file transfer plugins have access to the JobAd, either internally or via an external command like condor_q -l?  For example, can they tell what PoolName a job requested?
    • Can we make use of logic during the matchmaking where 'if execute host is in set of X, then set some variable to Y' and then the plugin inspects some variable to determine where it is topologically and therefore which interface to use.
    • ANSWER: look at .job.ad or .machine.ad in the scratch area.  Could set some attributes in the config file for the nodes.  See the sketch below.
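
A sketch of what that could look like inside a shell-based plugin, reading a custom attribute (the name NRAOPool is made up) out of the .job.ad in the scratch directory:

# runs inside the job scratch directory
pool=$(awk -F' = ' '$1 == "NRAOPool" {gsub(/"/, "", $2); print $2}' .job.ad)
case "$pool" in
    chtc) iface=eth1 ;;
    *)    iface=eth0 ;;
esac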


Containers

  • Is HTC basically committed to distributing container implementations with each new release
    • ANSWER: CHTC is planning to release containers with each HTCondor release.
  • Is this migrating toward a recommended implementation method for things like the submit hosts and possibly even execute hosts where the transactions could be light weight.
    • ANSWER: The jobs are tied to a submit host.  If that submit host goes away the job may be orphaned.


Remote

condor_submit -remote: what does it do?  The manpage makes me think it submits your job using a different submit host, but when I run it I get lots of authentication errors. Can it not use host-based authentication (e.g. ALLOW_WRITE = *.aoc.nrao.edu)?

Here is an example of me running condor_submit on one of our Submit Hosts (testpost-master) trying to remote to our Central Manager (testpost-cm) which is also a submit host.

condor_submit -remote testpost-cm tiny.htc
Submitting job(s)
ERROR: Failed to connect to queue manager testpost-cm-vml.aoc.nrao.edu
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There
is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

ANSWER:

condor_submit -remote does indeed tell the condor_submit tool to submit to a remote schedd (it also implies -spool).

Because the schedd can run the job as the submitting owner, and runs the shadow as the submitting owner, the remote schedd needs to not just authorize the remote user to submit jobs, but must authenticate the remote user as some allowed user.

Condor's IP host-based authentication is really just authorization; it can say "everyone coming from this IP address is allowed to do X, but I don't know who that entity is".

So, for remote submit to work, we need some kind of authentication method as well, like IDTOKENS, munge.


Authentication

  • We're currently using host based authentication.   Is there a 'future proof' recommended authentication system for HTCondor-9.x for a site planning to use both an on-premises cluster and CHTC flocking and/or glide-ins to other facilities?  host_based?  password?  Tokens?  SSL?  Munge?  Munge might be my preferred method as Slurm already requires it.
  • If we're using containers for submit hosts is there a preferred authentication scheme (host based doesn't scale well).
    • ANSWER: idtokens.  See the sketch below.
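
A minimal IDTOKENS sketch (identity and paths illustrative): mint a token on a machine that holds the pool signing key, then install it for the user on the submit host.

condor_token_create -identity krowe@aoc.nrao.edu > krowe.token
# on the submit host, as the user
mkdir -p ~/.condor/tokens.d
cp krowe.token ~/.condor/tokens.d/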


HTcondor+Slurm

  • Do people do HTCondor glide-ins to Slurm where the HTCondor jobs are not preempted, as a way to share resources with both schedulers?
    • ANSWER: You can glide in to Slurm.
    • You can have Slurm preempt HTCondor jobs in favor of its own jobs (HTCondor jobs presumably will be resubmitted)
    • You can have HTCondor preempt Slurm jobs in the same sort of way.


Transfer Plugin Order

HTCondor guarantees that the condor file transfer happens before the plugin transfer, but only when using the "multi-file" plugin style,
like we have in our curl plugin.  If you used the curl plugin as the model for rsync, you should be good.


AMQP

The AMQP gateway that we had developed was called Qpid, and worked by tailing the user job log and turning it into qpid events.  I suspect
there's also ways to have condor plugins directly send amqp events as well.


CPU Shares

Torque uses cpusets, which is pretty straightforward, but HTCondor uses cpu.shares, which confuses me a bit.  For example, a job with request_cpus = 8 executing on a 24-core machine gets cpu.shares = 800.  If there are no other jobs on the node, does this job essentially get more CPU time than 1024/800?

ANSWER: yes, it is opportunistic.  If there are no other jobs running on a node, you essentially get the whole node.


Nodescheduler

We found a way to implement our nodescheduler script in Slurm using the --exclude option.  Is there a way to exclude certain hosts from a job?  Or perhaps a constraint that prevents a job from running on a node that is already running a job of that user?  Is there a better way than this?

requirements = Machine != "nmpost097.aoc.nrao.edu" && Machine != "nmpost119.aoc.nrao.edu"

badmachines=one+two+three

requirements not in $(badmachines)

I didn't get the actual syntax from Greg and I am apparently not able to look it up.  The long syntax I suggested should work; I just don't know what Greg's more efficient syntax is.
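
This is not Greg's syntax, but one more compact form that should work uses the stringListMember() ClassAd function:

requirements = !stringListMember(Machine, "nmpost097.aoc.nrao.edu,nmpost119.aoc.nrao.edu")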


condor_ssh_to_job

Is there a way to use condor_ssh_to_job to connect to a job submitted from a different submit host (schedd) or do you have to run it from the submit host used to submit the job?  I have tried using the -name option to condor_ssh_to_job but I always get Failed to send GET_JOB_CONNECT_INFO to schedd

ANSWER: idtokens.  Host-based and poolpassword are not sufficient to identify users and allow for this (and probably condor_submit -remote).


HTCondor Workshop vs Condor Week

ANSWER: Essentially it is "Condor Week Europe".  Mostly the same talks but different customer presentations.  Could be interesting for the different customer presentations.

