Nodescheduler
We found a way to implement our nodescheduler script in Slurm using the --exclude option. Is there a way to exclude certain hosts from a job? Or is there a constraint that prevents a job from running on a node that is already running a job from that user?
...
Answered Questions:
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- ANSWER: A Job ID (ClusterID.ProcID) is unique only to a schedd. To make it globally unique, combine it with the DNS name of the schedd and the ctime of the job_queue.log file.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a Job ID but should not depend on it alone.
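- A hedged sketch of one option: HTCondor already stores a GlobalJobId attribute of the form schedd-name#ClusterId.ProcId#submit-time, which is unique across schedds and could seed our own tracking ID. The cluster number below is just an example.
condor_q -constraint 'ClusterId == 12345' -af GlobalJobId
# e.g. testpost-serv-1.aoc.nrao.edu#12345.0#1592241945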
- Upgrading HTCondor without killing jobs?
- The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- currently restarting execute services will kill jobs. CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure 8.8.x is compatible with 8.8.y or 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones)
- There is another software filesystem where CASA could live that is more used for admin usage but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node; perhaps the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then make the arguments the executable you used to have. I just changed our stuff to statically set $HOME and I think that is good enough.
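- A minimal submit-file sketch of that workaround (the wrapper script name and HOME path are placeholders, not our actual setup):
# Run the payload under "bash -l" so /etc/profile and rc files get sourced.
executable = /bin/bash
arguments = -l ./run_casa.sh
transfer_input_files = run_casa.sh
# Statically set HOME as described above (placeholder path).
environment = "HOME=/lustre/aoc/users/krowe"
queue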
- Flocking: Suppose I have two hosts in the same pool. testpost-master is a submit-host and testpost-serv-1 is both a submit-host and the central-manager. testpost-serv-1 is configured to flock to CHTC but testpost-master is not. Is it possible to submit a job on testpost-master that will flock to CHTC by somehow leveraging testpost-serv-1? In other words, do I have to setup flocking and an external IP on every submit host?
- ANSWER: there isn't a good way to do this. So eventually we will need to make testpost-master flock to CHTC and possibly remove the ability of testpost-serv-1 to flock.
- It seems the transfer mechanism won't transfer symlinks to directories (e.g. data/vlass.ms → /lustre/aoc/...) Is there a way around this?
- ANSWER: there is no flag to chase symlinks at the moment. The top level dir (e.g. data) could be a symlink if transfer_input_files=data/ but it will then transfer the contents of data instead of data itself.
- If symlink → data and transfer_input_files=symlink I get the error Transfer of symlinks to directories is not supported.
- if symlink → ../data and transfer_input_files=symlink/ it transfers the contents not the directory. In other words I don't have a data directory in scratch I have a VLASS... directory.
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/VLASS/
- If data/VLASS → /some/path/VLASS and transfer_input_files=data/ I get the error Transfer of symlinks to directories is not supported.
- DAG log time stamps: is there a way to differentiate data import/export time from process run time?
- Look in the job log file, not the DAG log file
- 040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
Transferring to host: <10.64.10.172:9618?addrs=10.64.10.172-9618&alias=nmpost072.aoc.nrao.edu&noUDP&sock=slot1_1_72656_7984_60>
...
040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files
- Rank and Preemption: Can we use Rank to set "preferences" without requiring job preemption?
- ANSWER: There are two kinds of rank: job rank and machine rank. Job rank (RANK = ... in a submit file) is purely a preference; it does not preempt. Machine rank (in the startd config) will preempt. Negotiator pre-job rank is a third type of rank that works at the pool level and is often used to pack jobs efficiently.
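- For example, a job rank in a submit file is a pure preference (no preemption); the machine name and attribute below are only illustrative:
# Prefer the machine offering the most memory among those that match Requirements.
Rank = Memory
# Or prefer one particular host without excluding the others:
# Rank = (Machine == "nmpost072.aoc.nrao.edu")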
- Update on software store for CASA either on shared Ceph storage or admin software storage
- Staging area for datasets 100MB - TBs. This is where we could try keeping the cfcache assuming doing so doesn't overwhelm the filesystem.
- /staging/nu_jrobnett
Requirements = (Target.HasCHTCStaging == true)
- Quota: 100GB, 100K files
- Squid area for 100MB - 1GB input or shared software. This is where we could keep casa.tgz and then have the execution host retrieve it via HTTP.
- /squid/nu_jrobnett
- only accessible via this path on the submit hosts. Execution hosts will need to access it via HTTP.
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/nu_jrobnett/casa.tgz
Software area: we can use this for run-time applications. Think of it like /usr/local.
/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7
- export PATH=/opt/local/bin:/software/nu_jrobnett/casa/casa-pipeline-release-5.6.1-8.el7/bin:${PATH}
- Quota: 5GB, 100K files
- Public_input_files: How is this different from transfer_input_files, and when would one want to use it instead of files or URLs with transfer_input_files?
- https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html#public-input-files
- This is still a work in progress. It may allow for caching on a squid server, fetchable by others someday.
- Flocking: When we flock to CHTC, what is the data path for transfer_input_files? Is it between our submit host and CHTC's execution host, or is CHTC's submit host involved?
- Dataflow is from our schedd (submit host) to their execute host but CCB will reverse the connection. Their execution hosts are publicly addressable but that may not be necessary.
- How can we choose the data path for transfer_input_files to our clients given multiple networks? Currently we assume it will use the 1Gb link, but we have IB links. Is there a way for condor to use the IB link just for transferring files? Is that hostname based? Other ideas?
- CHTC doesn't have a good solution for this.
- We could upgrade from 1Gb to 10Gb
- We could use the IB names for everything (problematic for submit hosts that don't have IB)
- We could not use transfer mechanism and instead use something else like scp
- We could use a custom transfer plugin
- Are there known issues with distributed scratch via NFS or Lustre w.r.t. tmpdir or other settings, e.g. OpenMPI complaining about tmpdir being on a network FS?
- Some problems with log files on the submit host but rare.
- Any general best practices to support MPI in terms of ClassAds or other settings?
- Use the shared memory transport for security
- Is there a way DAGMan can be told to ignore errors? In some cases we want a DAG to mindlessly continue rather than retry.
- The job is considered successful based on the return of the post script. If there isn't a post script, the success is based on the return of the job.
- Transfer mechanism: Documentation implies that only files with an mtime newer than when transfer_input_files finished will be transferred back to the submit host. While running a DAG, the files in my working directory (which is in both transfer_input_files and transfer_output_files) always seem to have an mtime around the most recent step in the DAG, suggesting that the entire working directory is copied from the execution host to the submit host at the end of each DAG step. Perhaps the transfer mechanism only looks at the mtime of the files/dirs specified in transfer_output_files and doesn't descend into the directories.
- Subdirectories are treated differently
- SOLUTION: I think casa just touches every file and therefore condor is forced to copy everything in the working directory. I have been unable to reproduce the problem outside of casa.
- SOLUTION: If you specify a directory, HTCondor will transfer the entire directory not just files with new mtime.
- Does the transfer mechanism accept any sort of regular expression? E.g. transfer_input_files=*.txt
- No
- Can the transfer mechanism accept manifest files? E.g. a file that is a list of files?
- Use include : <some file> in the submit file, where <some file> contains the full transfer_input_files line
- Use queue FILES from manifest, which defines the submit variable $(FILES) that can then be used as: transfer_input_files = $(FILES)
- Perhaps a plugin
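- A sketch of the include-file approach (file names are made up; a script could generate transfer_list.sub from the manifest):
# transfer_list.sub contains a single long line, e.g.:
#   transfer_input_files = data/a.ms, data/b.ms, data/c.ms
executable = run_step.sh
include : transfer_list.sub
queue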
- What other options are there besides holding a job? I find myself not noticing, sometimes for hours, that a job is on hold. Is there a way to make jobs fail instead of getting held? I assume others will make this mistake too.
- I see I can set periodic_remove = (JobStatus == 5) but HTCondor doesn't seem to think that is an error so if I have notification = Error I don't get any email.
- Greg will look into adding a Hold option to notification
- The HTCondor idea of held jobs is that you submitted a large DAG of jobs, one step is missing a file and you would like to put that file in place and continue the job instead of the whole DAG failing and having to be resubmitted. This makes sense but it would be nice to be notified when a job gets held.
Greg wrote "notification = error in the submit file is supposed to send email when the job is held by the system, but there's a bug now where it doesn't. I'll fix this."
- What limits are there to transfer_input_files? I would sometimes get Failed to transfer files when the number of files was around 10,000
- ANSWER: There is a memory limit because of ClassAds but in general there isn't a defined limit.
- Is there a way to generate the dag.dot file without having to submit the job?
- The -no_submit option doesn't create the .dot file
- Is adding NOOP to all the JOB commands the right thing to do? The DAG still gets submitted but then quickly ends.
- ANSWER: You need to submit the DAG. NOOP is the current solution
- Is there a way to start a dag at a given point? E.g. if there are 5 steps in the dag, can you start the job at step 3?
- Is the answer again to add NOOP to the JOB commands you don't want to run?
- ANSWER: Splices may work here but NOOP is certainly a tool to use.
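- A sketch of starting a five-step DAG at step 3 by marking the earlier nodes NOOP (node and submit-file names are illustrative):
JOB step1 step1.sub NOOP
JOB step2 step2.sub NOOP
JOB step3 step3.sub
JOB step4 step4.sub
JOB step5 step5.sub
PARENT step1 CHILD step2
PARENT step2 CHILD step3
PARENT step3 CHILD step4
PARENT step4 CHILD step5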
- I see at CHTC jobs now require a request_disk setting. How does one launch interactive jobs?
- ANSWER: This is a bug.
- For our initial tests, we want to flock jobs to CHTC that transfer about 60GB input and output. Eventually we will reduce this significantly but for now what can we do?
- Globus or rsync to move data? If Globus, how to do so in an automated way (E.g. no password)?
- ANSWER: Using the Transfer Mechanism from our submit server to CHTC's execution host is ok with CHTC because it doesn't interfere with their submit host. Outbound from their execute hosts is also allowed (scp, rsync, etc).
- Use +Longjob = true attribute for 72+ hour jobs. This is a CHTC-specific knob.
- How can I know if my job swapped?
- ANSWER: CHTC nodes have no or minimal swap space.
- Condor Annex processing in AWS. Is there support for spot market
- ANSWER: Condor Annex does indeed support the spot market. It is a bit more work to set up because you don't say "give my X of Y", but "I'll pay d1 dollars for machines like X1 and d2 for machines like X2, etc.".
- What network mask should we use to allow ssh from CHTC into NRAO? Is it a class B or several class Cs?
- ANSWER: The IP (v4) ranges for CHTC execute nodes are:
128.104.100.0/22
128.104.55.0/24
128.104.58.0/23
128.105.244.0/23
- Is there a ganglia server or some other monitor service at CHTC we can view?
ANSWER: We have a bunch of ganglia and grafana graphs for the system, but I think they are restricted to campus folks and tend to show system-wide utilization and problems.
I have a machine with an externally accessible, non-NATed address (146.88.10.48) and an internal, non-routable address (10.64.1.226). I want to install condor_annex on this machine such that I can submit jobs to AWS from it. I don't necessarily need to submit jobs to local execute hosts from this machine. Should I make this machine a central manager, a submit host, both, or does it matter?
- ANSWER: I think instances in AWS will need to contact both the schedd (submit host) and the collector (central manager) from the internet using port 9618. So either both the submit host and the central manager need external connections and IPs with port 9618 open, or combine them into one host with an external IP and port 9618 open to the Internet.
Last time I configured condor_annex I was using an older version of condor (8.8.3 I think) and used a pool password for security. Now I am using 8.9.7. Is there a newer/better security method I should use?
- ANSWER: The annex still primarily uses pool password, so let's stick with that for now.
- How can I find out what hosts are available for given requirements (LongJob, memory, staging)
- condor_status -compact -constraint "HasChtcStaging==true" -constraint 'DetectedMemory>500000' -constraint "CanRunLongJobs isnt Undefined"
- Answer: yes this is correct but it doesn't show what other jobs are waiting on the same resources. Which is fine.
- It looks to me like most hosts at CHTC are set up to run LongJobs. The following shows a small list of about 20 hosts, so I assume all others can run long jobs. Is that correct?
- condor_status -compact -constraint "CanRunLongJobs is Undefined"
- LongJobs is for something like 72 hours, so it might be best not to set it unless we really need it, like for step23.
- Is port 9618 needed for flocking or just for condor_annex?
- ANSWER: Greg thinks yes 9618 is needed for both flocking and condor_annex.
- Are there bugs in the condor.log output of a DAG node? For example, I have a condor.log file that clearly shows the job taking about three hours to run, yet at the bottom it lists a user time of 13 hours and a system time of 1 hour. https://open-confluence.nrao.edu/download/attachments/40541486/step07.py.condor.log?api=v2
And as for the cpu usage report, there could very well be a bug, but first, is your job multi-threaded or multi-process? If so, the cpu usage will be the aggregate across all cpu cores.
- Yes they are all parallel jobs to some extent so I accept your answer for that job. But I have another job that took 21 hours of wallclock time and yet the condor.log shows 55 minutes of user and 5:34 hours of system time. https://open-confluence.nrao.edu/download/attachments/40541486/step05.py.condor.log?api=v2
- ANSWER: if you look, the user time is actually 6 days and 55 minutes. I missed the 6 in there.
- Given a DAG where some steps should always run at the AOC and some should always run at CHTC, how can we dictate this? Right now local jobs flock to CHTC if local resources are full. Can we make local jobs stay idle instead of flocking?
- ANSWER: Use PoolNames. I need to make a testpost PoolName.
- PoolName = "TESTPOST"
STARTD_ATTRS = $(STARTD_ATTRS) PoolName
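- Assuming the PoolName attribute above is advertised by our startds, a local-only DAG step could then pin itself like this (sketch):
# In the step's submit file; CHTC machines leave PoolName undefined, so the job
# stays idle locally instead of flocking.
requirements = (TARGET.PoolName == "TESTPOST")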
- It seems that when using DAGs the recommended method is to define variables in the DAG file instead of in the submit files. This makes sense, as it leaves only one file, the DAG file, that needs to be edited to make changes. But is there a way to get variables into the DAG file from the command line, the environment, an include file, or something similar?
- ANSWER: There is an INCLUDE syntax but there is no command-line or environment variable way to get vars into a DAG.
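- A sketch of the INCLUDE mechanics (the node names, params.inc file, and subband variable are all made up):
# mydag.dag
JOB step1 process.sub
JOB step2 process.sub
PARENT step1 CHILD step2
INCLUDE params.inc
# params.inc, generated by hand or by a script:
#   VARS step1 subband="A"
#   VARS step2 subband="B"
# process.sub can then use the per-node value, e.g.  arguments = --subband $(subband)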
- We are starting to run tens of jobs at CHTC requiring 40GB as part of a local DAG. Are there any options we can set to improve their chances of executing? What memory footprint (32, 20, 16, 8GB) would significantly improve their chances?
- ANSWER: only use +LongJob if the job needs more than 72 hours, which is the default "walltime".
- How can we set AWS Tags with condor_annex? We'd like this to track jobs and set billing tags. Looks like there isn't really a way.
- SOLUTION: Greg wrote (Oct. 5, 2020) "we've just now added code to the condor_annex to support setting multiple aws tags on the annex command line". K. Scott expects it will take a while before that code makes it to released software.
- Launch Templates didn't work. I don't think condor_annex supports Launch Templates.
- Use aws-user-data options to condor_annex?
- I have tried all sorts of user-data and default-user-data-file options. On-demand apparently no longer works and I was never able to get something working with spot-fleet. I think all things user-data are non-functional.
- I tried setting a tag in the role defined in config.json (aws-ec2-spot-fleet-tagging-role) but that tag didn't translate to the instance.
- I tried adding a tag to the AMI when creating a new AMI (EC2 → Instances → Actions → Image → Create Image). Didn't work.
- What about self-tagging? The instance figures out its own instance ID and runs the aws CLI to tag itself.
- wget -qO- http://instance-data/latest/meta-data/instance-id
- returns nothing when logged in as nobody (condor_ssh_to_job)
- returns nothing when logged in as centos (ssh -i ~/.ssh/...)
- returns instanceid when logged in as root (ssh as centos then sudo su)
- Aha! There is a firewall (iptables) rule blocking exactly this. But I can't figure out what file sets this iptables rule on boot.
- wget -qO- http://instance-data/latest/meta-data/instance-id
- I tried adding tags to the json file using both ResourceType set to instance and spot-fleet-request. Neither created an instance with my tag.
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "budget",
"Value": "VLASS"
}
Transfer Plugins
I see the docs for transfer_plugins only reference input files: https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#index-123 Maybe HTCondor doesn't support transferring output files with a plugin. But then what does the -upload option in things like box_plugin.py and gdrive_plugin.py do?
Authentication
- We're currently using host-based authentication. Is there a 'future proof' recommended authentication system for HTCondor 9.x for a site planning to use both an on-premises cluster and CHTC flocking and/or glide-ins to other facilities? Host-based? Password? Tokens? SSL? Munge? Munge might be my preferred method as Slurm already requires it.
- If we're using containers for submit hosts is there a preferred authentication scheme (host based doesn't scale well).
- ANSWER: idtokens
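- A minimal sketch of an IDTOKENS setup (the config knobs are standard, but the identity and rollout details here are assumptions to be tested, not a working recipe):
# In the condor config on the relevant hosts:
SEC_DEFAULT_AUTHENTICATION_METHODS = FS, IDTOKENS
# On a host holding the pool signing key, mint a token for a user or daemon:
condor_token_create -identity krowe@aoc.nrao.edu > ~/.condor/tokens.d/krowe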
HTCondor+Slurm
- Do people do HTCondor glide-ins to Slurm where the HTCondor jobs are not preempted, as a way to share resources with both schedulers?
- ANSWER: You can glide in to Slurm.
- You can have Slurm preempt HTCondor jobs in favor of its own jobs (HTCondor jobs presumably will be resubmitted)
- You can have HTCondor preempt Slurm jobs in the same sort of way.
- What are the clever solutions to submitting N different DAG jobs, each having different parameters?
T10t34/
    J220200-003000/
        bin, working, data
    J220600-003000/
        bin, working, data
    ...
T10t35/
    J170743-393000/
        bin, working, data
    J171241-383000/
        bin, working, data
    ...
ANSWERS:
- INCLUDE syntax for DAGs
- include syntax for submit files
- make a template of files (see the sketch below)
- use a PRE script that populates things
- usedagdir
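One way to realize the template idea is a nested DAG: keep one templated inner DAG and point an outer DAG at each field directory with DIR, so relative paths like data/ resolve per field (the inner.dag name is hypothetical; directories follow the layout above):
# outer.dag
SUBDAG EXTERNAL J220200 inner.dag DIR T10t34/J220200-003000
SUBDAG EXTERNAL J220600 inner.dag DIR T10t34/J220600-003000
SUBDAG EXTERNAL J170743 inner.dag DIR T10t35/J170743-393000
# condor_submit_dag outer.dag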
...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0
and the masks aren't restricting them to specific CPUs, so I am as yet unable to reproduce James's problem.
st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447
We can reproduce this without HTCondor. So this is either being caused by our mpicasa program or the openmpi libraries it uses. Even better, I can reproduce this with a simple shell script executed from two shells at the same time on the same host. Another MPI implementation (mvapich2) didn't show this problem.
#!/bin/sh
export PATH=/usr/lib64/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:${LD_LIBRARY_PATH}
mpirun -np 2 /users/krowe/work/doc/all/openmpi/busy/busy
Array Jobs
Does HTCondor support array jobs like Slurm? For example in Slurm #SBATCH --array=0-3%2 or is one supposed to use queue options and DAGMan throttling?
ANSWER: HTCondor reduces the priority of a user the more jobs they run, so there may be less need for a maxjob or modulus option. But here are some other things to look into (see also the throttling sketch below):
queue from seq 10 5 30 |
queue item in 1, 2, 3
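A hedged sketch of mapping Slurm's --array=0-3%2 (four tasks, at most two running at once) onto DAGMan throttling; file and node names are made up:
# array.dag
JOB t0 task.sub
JOB t1 task.sub
JOB t2 task.sub
JOB t3 task.sub
VARS t0 idx="0"
VARS t1 idx="1"
VARS t2 idx="2"
VARS t3 idx="3"
# task.sub uses  arguments = $(idx)
# Throttle to two running node jobs at a time:
#   condor_submit_dag -maxjobs 2 array.dag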
combined cluster (Slurm and HTCondor)
Slurm starts and stops condor. CHTC does this because their HTCondor can preempt jobs. So when Slurm starts a job it kills the condor startd and any HTCondor jobs will get preempted and probably restarted somewhere else.
Node Priority
Is there a way to set an order in which nodes are picked first, or a weight system? We want certain nodes to be chosen first because they are faster, have less memory, or meet other such criteria.
ANSWER: NEGOTIATOR_PRE_JOB_RANK on the negotiator.
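A sketch of what that could look like, assuming we advertise a site-chosen weight on each startd (NodeWeight is an invented attribute name):
# On each execute node's config: larger NodeWeight = pick me first.
NodeWeight = 10
STARTD_ATTRS = $(STARTD_ATTRS) NodeWeight
# On the central manager:
NEGOTIATOR_PRE_JOB_RANK = ifThenElse(isUndefined(My.NodeWeight), 0, My.NodeWeight)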
HPC Cluster
Could I have access to the HPC cluster? To learn Slurm.
ANSWER: https://chtc.cs.wisc.edu/hpc-overview I need to login to submit2 first but that's fine.
How does CHTC keep shared directories (/tmp, /var/tmp, /dev/shm) clean with Slurm?
ANSWER: CHTC doesn't do any cleaning of shared directories, but they suggested looking at https://derekweitzel.com/2016/03/22/fedora-copr-slurm-per-job-tmp/ I don't know if this plugin will clean files created by an interactive ssh, but I suspect it won't, because it is a Slurm plugin and ssh'ing to the host is outside the control of Slurm, except for pam_slurm_adopt which adds you to the cgroup. So I may still need a reaper script to keep these directories clean.
vmem exceeded in Torque
We have seen a problem in Torque recently that reminds us of the memory fix you recently implemented in HTCondor. Was that fix related to any recent changes in the Linux kernel, or was it a pure HTCondor bug? What was it that you did to fix it?
ANSWER: There are two problems here. The first is the short read, for which we are still trying to understand the root cause. We've worked around it in the short term by re-polling when the number of processes we see drops by 10% or more. The other problem is that when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat. Memory is categorized into a number of different types in this file, and we were omitting some types of memory when summing up the total.
cpuset issues
ANSWER: git bisect could be useful. Maybe we could ask Ville.
Distant execute nodes
Are there any problems having compute nodes at a distant site?
ANSWER: no intrinsic issues. Be sure to set requirements.
Memory bug fix?
What version of condor has this fix?
ANSWER: 8.9.9
When is it planned for 8.8 or 9.x inclusion?
ANSWER: 9.0 in Apr. 2021
Globus
You mentioned that the globus RPMs are going away. Yes?
ANSWER: They expect to drop globus support in 9.1 around May 2021.
VNC
Do you have any experience using VNC with HTCondor?
ANSWER: no they don't have experience like this. But mount_under_scratch= will use the real /tmp
Which hosts do the flocking?
Lustre is going to be a problem. Our new virtual CMs can't see lustre. Can just a submit host see lustre and not the CM in order to flock?
ANSWER: Only submit machines need to be configured to flock. It goes from a local submit host to a remote CM. So we could keep gibson as a flocking submit host. This means the new CMs don't need the firewall rules.
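A minimal sketch of the submit-host side (the CHTC central manager name below is a placeholder for whatever CHTC provides; the remote pool must also allow us via its FLOCK_FROM/ALLOW settings):
# On the flocking submit host only, e.g. gibson:
FLOCK_TO = cm.chtc.wisc.edu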
Transfer Mechanism Plugin
- Our environment has a complex network topology. We have a prototype rsync plugin but may want to specify a specific network interface for a host as a function of where the execute host resides.
- Do file transfer plugins have access to the JobAd, either internally or via an external command condor_q -l? For example can they tell what PoolName a job requested?
- Can we make use of logic during matchmaking where 'if the execute host is in set X, then set some variable to Y', and then have the plugin inspect that variable to determine where it is topologically and therefore which interface to use?
- ANSWER: look at .job.ad or .machine.ad in the scratch area. Could set some attributes in the config file for the nodes.
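- A hedged sketch of a plugin reading the job ad from the scratch directory to pick an interface; the NraoTransferNet attribute and the host names are invented for illustration:
#!/bin/sh
# The starter drops .job.ad (and .machine.ad) into the job sandbox.
ad="${_CONDOR_SCRATCH_DIR:-.}/.job.ad"
# Pull a custom string attribute, e.g.  NraoTransferNet = "ib"
net=$(awk -F'"' '/^NraoTransferNet /{print $2}' "$ad")
case "$net" in
    ib) src_host=nmpost-master-ib.aoc.nrao.edu ;;  # hypothetical IB hostname
    *)  src_host=nmpost-master.aoc.nrao.edu ;;     # hypothetical 1Gb hostname
esac
# ...then rsync to/from $src_host as the prototype plugin already does.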
Containers
- Is CHTC basically committed to distributing container implementations with each new release?
- ANSWER: CHTC is planning to release containers with each HTCondor release.
- Is this migrating toward a recommended implementation method for things like the submit hosts, and possibly even execute hosts, where the transactions could be lightweight?
- ANSWER: The jobs are tied to a submit host. If that submit host goes away the job may be orphaned.
Remote
condor_submit -remote: what does it do? The manpage makes me think it submits your job using a different submit host, but when I run it I get lots of authentication errors. Can it not use host-based authentication (e.g. ALLOW_WRITE = *.aoc.nrao.edu)?
Here is an example of me running condor_submit on one of our Submit Hosts (testpost-master) trying to remote to our Central Manager (testpost-cm) which is also a submit host.
condor_submit -remote testpost-cm tiny.htc
Submitting job(s)
ERROR: Failed to connect to queue manager testpost-cm-vml.aoc.nrao.edu
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate. Globus is reporting error (851968:50). There
is probably a problem with your credentials. (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
ANSWER:
condor_submit -remote does indeed tell the condor_submit tool to submit to a remote schedd (it also implies -spool).
Because the schedd can run the job as the submitting owner, and runs the shadow as the submitting owner, the remote schedd needs to not just authorize the remote user to submit jobs, but must authenticate the remote user as some allowed user.
Condor's IP host-based authentication is really just authorization: it can say "everyone coming from this IP address is allowed to do X", but it doesn't know who that entity is.
So, for remote submit to work, we need some kind of authentication method as well, like IDTOKENS or munge.