Transfer Plugins

I see the documentation for transfer_plugins only references input files: https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#index-123  Maybe HTCondor doesn't support transferring output files with a plugin, but then what does the -upload option in plugins like box_plugin.py and gdrive_plugin.py do?
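
For reference, here is a minimal submit-description sketch of how I imagine a custom output plugin would be wired up. The plugin path, URL scheme, and hostnames are placeholders, and my understanding (to be confirmed) is that pointing the output sandbox at a URL handled by the plugin is what makes the starter call it with -upload.

  # Hypothetical sketch only; all values are placeholders.
  executable              = my_job.sh
  transfer_plugins        = rsync=/usr/libexec/condor/rsync_plugin.py
  transfer_input_files    = rsync://data-host.aoc.nrao.edu/input/dataset.tar
  output_destination      = rsync://data-host.aoc.nrao.edu/results/
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  queue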

Authentication

  • We're currently using host-based authentication.  Is there a 'future proof' recommended authentication method for HTCondor 9.x at a site planning to use both an on-premises cluster and CHTC flocking and/or glide-ins to other facilities?  host_based?  password?  Tokens?  SSL?  Munge?  Munge might be my preferred method, as Slurm already requires it.
  • If we're using containers for submit hosts, is there a preferred authentication scheme (host-based doesn't scale well)?
    • ANSWER: idtokens (see the configuration sketch below)
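
A minimal sketch of what an idtokens setup might look like, assuming a stock HTCondor 9.x security configuration; the identity and host names are placeholders and the details should be checked against the security section of the manual.

  # Pool-wide configuration (sketch):
  SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS, FS
  SEC_DAEMON_AUTHENTICATION_METHODS  = IDTOKENS
  # Daemons read tokens from /etc/condor/tokens.d/ by default.
  #
  # On the machine holding the pool signing key, mint a token for a
  # containerized submit host (hypothetical identity):
  #   condor_token_create -identity condor@testpost-master.aoc.nrao.edu \
  #       > /etc/condor/tokens.d/testpost-master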

Transfer Mechanism Plugin

  • Our environment has a complex network topology.  We have a prototype rsync plugin, but we may want to specify a particular network interface for a host as a function of where the execute host resides.
    • Do file transfer plugins have access to the JobAd, either internally or via an external command such as condor_q -l?  For example, can they tell what PoolName a job requested?
    • Can we make use of logic during matchmaking along the lines of 'if the execute host is in set X, then set some variable to Y', so that the plugin can inspect that variable to determine where it sits topologically and therefore which interface to use?
    • ANSWER: look at .job.ad or .machine.ad in the scratch area.  Could set some attributes in the config file for the nodes (see the sketch below).
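
A sketch of the approach in that answer, with a made-up attribute name: advertise a topology attribute from the execute node's configuration, and have the plugin read it back out of .machine.ad in the scratch directory.

  # Execute-node configuration (NetworkZone is a hypothetical attribute):
  NetworkZone = "aoc-lustre"
  STARTD_ATTRS = $(STARTD_ATTRS) NetworkZone
  #
  # The plugin runs in the job's scratch directory, so it could do
  # something like this to pick an interface (illustrative only):
  #   zone=$(awk -F'"' '/^NetworkZone/ {print $2}' .machine.ad)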

Remote

What does condor_submit -remote do?  The manpage makes me think it submits your job through a different submit host, but when I run it I get lots of authentication errors.  Can it not use host-based authentication (e.g. ALLOW_WRITE = *.aoc.nrao.edu)?

Here is an example of running condor_submit on one of our submit hosts (testpost-master), trying to remote to our Central Manager (testpost-cm), which is also a submit host.

condor_submit -remote testpost-cm tiny.htc
Submitting job(s)
ERROR: Failed to connect to queue manager testpost-cm-vml.aoc.nrao.edu
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There
is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

HTCondor + Slurm

  • Do people do HTCondor glide-ins to Slurm, where the HTCondor jobs are not preempted, as a way to share resources between the two schedulers?
    • ANSWER: You can glide in to Slurm (see the rough sketch below).
    • You can have Slurm preempt HTCondor jobs in favor of its own jobs (the HTCondor jobs will presumably be resubmitted).
    • You can have HTCondor preempt Slurm jobs in the same sort of way.
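
A very rough sketch of a hand-rolled glide-in into Slurm, assuming HTCondor binaries are available on the compute node; hostnames, times, and knob values are placeholders, not a tested recipe (a real glide-in also needs LOCAL_DIR, security settings, etc.).

  #!/bin/bash
  #SBATCH --job-name=condor-glidein
  #SBATCH --time=24:00:00
  # Start a startd that reports back to our central manager and shuts down
  # after 20 minutes without a claim.
  export CONDOR_CONFIG=$PWD/glidein.conf
  {
    echo 'CONDOR_HOST = testpost-cm.aoc.nrao.edu'
    echo 'DAEMON_LIST = MASTER, STARTD'
    echo 'STARTD_NOCLAIM_SHUTDOWN = 1200'
  } > "$CONDOR_CONFIG"
  exec condor_master -f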

...

ANSWER: CHTC doesn't do any cleaning of shared directories, but they suggested looking at https://derekweitzel.com/2016/03/22/fedora-copr-slurm-per-job-tmp/  I don't know if that plugin will clean up files created by an interactive ssh session, but I suspect it won't, because it is a Slurm plugin and ssh'ing to the host is outside Slurm's control except for pam_slurm_adopt, which adds the session to the job's cgroup.  So I may still need a reaper script (sketched below) to keep these directories clean.
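
For completeness, the sort of reaper I have in mind, purely as a sketch run from cron; the age threshold and exclusions are arbitrary.

  #!/bin/bash
  # Hypothetical /tmp reaper: remove anything at the top level of /tmp that
  # has not been accessed in 7 days, leaving the X11/ICE socket dirs alone.
  find /tmp -mindepth 1 -maxdepth 1 \
       ! -name '.X11-unix' ! -name '.ICE-unix' \
       -atime +7 -exec rm -rf {} +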

vmem exceeded in Torque

We have seen a problem in Torque recently that reminds us of the memory fix you recently implemented in HTCondor.  Was that fix related to any recent changes in the Linux kernel, or was it a pure HTCondor bug?  What was it that you guys did to fix it?

ANSWER: There are two problems here.  The first is the short read, for which we are still trying to understand the root cause.  We've worked around it in the short term by re-polling when the number of processes we see drops by 10% or more.  The other problem is that when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat.  Memory is categorized into a number of different types in this file, and we were omitting some types when summing up the total.
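
Not HTCondor's actual code, just an illustration of the kind of accounting described, summing a few of the categories in a cgroup-v1 memory.stat file (the cgroup path is a placeholder):

  awk '$1 == "total_rss" || $1 == "total_cache" || $1 == "total_swap" {sum += $2}
       END {printf "%.1f MiB\n", sum/1024/1024}' \
      /sys/fs/cgroup/memory/htcondor/example_job/memory.stat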

cpuset issues

ANSWER: git bisect could be useful.  Maybe we could ask Ville.
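
In case it helps, the usual git bisect flow over the HTCondor source, with placeholder tags for the last version where cpusets behaved and the first where they did not:

  git bisect start
  git bisect bad  V8_9_9     # placeholder: first known-bad version
  git bisect good V8_9_7     # placeholder: last known-good version
  # build and test the commit git checks out, then mark it:
  git bisect good            # or: git bisect bad
  git bisect reset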

Distant execute nodes

Are there any problems having compute nodes at a distant site?

ANSWER: no intrinsic issues.  Be sure to set requirements.
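
For example, something along these lines in the submit description, assuming the distant nodes advertise a custom attribute (Site here is made up and would come from STARTD_ATTRS on those nodes):

  requirements = (TARGET.Site == "remote-site")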

Memory bug fix?

What version of condor has this fix?

ANSWER: 8.9.9

When is this fix planned for inclusion in 8.8 or 9.x?

ANSWER: 9.0 in Apr. 2021

Globus

You mentioned that the globus RPMs are going away.  Yes?

ANSWER: They expect to drop globus support in 9.1 around May 2021.

VNC

Do you have any experience using VNC with HTCondor?

ANSWER: No, they don't have experience with this.  But an empty mount_under_scratch= means jobs use the real /tmp.
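
If my reading of that knob is right, that would be something like this in the execute-node configuration:

  # The default gives each job a private /tmp and /var/tmp under the scratch
  # directory; an empty value should leave jobs on the real /tmp.
  MOUNT_UNDER_SCRATCH =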

Which hosts do the flocking?

Lustre is going to be a problem.  Our new virtual CMs can't see Lustre.  Can just a submit host see Lustre, and not the CM, in order to flock?

ANSWER: Only submit machines need to be configured to flock.  It goes from a local submit host to a remote CM.  So we could keep gibson as a flocking submit host.  This means the new CMs don't need the firewall rules.
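
A sketch of the submit-host side, with a placeholder hostname; the remote pool has to trust the flocking submit host in return (FLOCK_FROM / ALLOW settings on their end), and our own central managers need no flocking configuration.

  # On the flocking submit host (e.g. gibson):
  FLOCK_TO = cm.chtc.wisc.edu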

Containers

  • Is CHTC basically committed to distributing container implementations with each new release?
    • ANSWER: CHTC is planning to release containers with each HTCondor release (see the sketch below).
  • Is this migrating toward a recommended implementation method for things like the submit hosts, and possibly even execute hosts, where the transactions could be lightweight?
    • ANSWER: The jobs are tied to a submit host.  If that submit host goes away, the job may be orphaned.
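
For reference, a sketch of what using those release containers might look like; the image name follows the htcondor/* naming on Docker Hub, but the exact names and tags should be treated as assumptions here.

  # Assumed image name; check what CHTC actually publishes.
  docker pull htcondor/mini
  docker run --detach --name minicondor htcondor/mini
  docker exec -it minicondor condor_status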