Open questions:

Update on where to store the CASA software: either on the shared Ceph storage or on the admin software storage
- Staging area for datasets 100MB - TBs. This is where we could try keeping the cfcache assuming doing so doesn't overwhelm the filesystem.
  - ```
    Requirements = (Target.HasCHTCStaging == true)
    ```
- /staging/nu_jrobnett
- Squid area for 100MB - 1GB input or shared software. This is where we could keep casa.tgz and then have the execution host retrieve it via HTTP.
  - /squid/nu_jrobnett is only accessible via this path on the submit hosts. Execution hosts will need to access it via HTTP.
  - ```
    transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/nu_jrobnett/ca
    ```

Rank and Preemption: Can we use Rank to set "preferences" without requiring job preemption?
- ANSWER: There are two kinds of rank: job rank and machine rank. Job rank (RANK=... in a submit file) is purely a preference and does not preempt. Machine rank (in startd.config) will preempt. Negotiator pre-job rank is a third type of rank that works at the pool level and is often used to pack jobs efficiently.
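
The job-rank case can be sketched as a small Python helper that writes out a submit description. This is only an illustration: `render_submit` and the script name `run_casa.sh` are hypothetical, and `Memory` is just one example of a machine attribute a job could prefer. Per the answer above, a job rank like this expresses a preference and does not preempt.

```python
def render_submit(commands):
    """Render a dict of submit commands as submit-description text."""
    lines = [f"{key} = {value}" for key, value in commands.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"


submit_text = render_submit({
    "executable": "run_casa.sh",  # hypothetical job script
    "request_memory": "8GB",
    # Job rank: a preference only; machines with a higher value are preferred.
    "rank": "Memory",
})
print(submit_text)
```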

Answered Questions:

JOB ID question from Daniel
- When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
  - Are job IDs guaranteed to be unique in HTCondor?
  - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A job ID has the form ClusterID.ProcID.
- A more specific identifier would also use the DNS name of the schedd and the ctime of the job_queued.log file.
- A job ID is unique only within a schedd.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on it alone.
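
One way to craft such an ID, as a sketch rather than a recommendation: combine the schedd name and the ClusterID.ProcID with a random UUID, so the result stays unique even if a job ID is ever reused. The function name, the `#` separator, and the host name below are assumptions made for illustration.

```python
import uuid


def make_tracking_id(schedd_name, cluster_id, proc_id):
    """Build a tracking ID that is seeded with the job ID but not reliant on it."""
    job_id = f"{cluster_id}.{proc_id}"
    # uuid4 supplies the uniqueness; schedd name and job ID are kept for lookups.
    return f"{schedd_name}#{job_id}#{uuid.uuid4().hex}"


# Hypothetical schedd host name, for illustration only.
tracking_id = make_tracking_id("submit.example.org", 150, 0)
print(tracking_id)
```

The schedd name and job ID remain readable in the result, so the original job can still be found with the usual HTCondor tools.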
Upgrading HTCondor without killing jobs?
- The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- Currently, restarting execute services will kill jobs. CHTC is working on improving this.
- The negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
- There is another software filesystem where CASA could live. It is used mainly for admin purposes but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy: Torque does a su - _username_ while HTCondor just does a su _username_.
- WORKAROUND: Setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node, for example if the cluster node has a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then pass the executable you used to have as the arguments. I just changed our stuff to statically set $HOME and I think that is good enough.
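
The /bin/bash suggestion and the static $HOME setting can be sketched together as a Python helper that rewrites a set of submit commands. This is a hedged illustration, not our actual configuration: `wrap_with_bash`, `run_casa.sh`, and the home path are hypothetical, and the sketch assumes the original executable is a shell script that bash can read from the job's scratch directory.

```python
import os


def wrap_with_bash(commands, home):
    """Return a copy of the submit commands that runs the job through /bin/bash."""
    wrapped = dict(commands)
    old_exe = wrapped["executable"]
    old_args = wrapped.get("arguments", "")
    wrapped["executable"] = "/bin/bash"
    # bash reads the old executable as a script, followed by its old arguments.
    wrapped["arguments"] = f"{os.path.basename(old_exe)} {old_args}".strip()
    # The script is no longer the executable, so send it as an input file.
    inputs = [f for f in wrapped.get("transfer_input_files", "").split(",") if f]
    wrapped["transfer_input_files"] = ",".join(inputs + [old_exe])
    # Statically set HOME, since the job does not get a login environment.
    wrapped["environment"] = f'"HOME={home}"'
    return wrapped


job = wrap_with_bash(
    {"executable": "run_casa.sh", "arguments": "vlass.ms"},  # hypothetical job
    home="/home/example_user",  # hypothetical path
)
for key, value in job.items():
    print(f"{key} = {value}")
```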

Flocking: Suppose I have two hosts in the same pool. testpost-master is a submit-host and testpost-serv-1 is both a submit-host and the central-manager. testpost-serv-1 is configured to flock to CHTC but testpost-master is not. Is it possible to submit a job on testpost-master that will flock to CHTC by somehow leveraging testpost-serv-1? In other words, do I have to set up flocking and an external IP on every submit host?
- ANSWER: there isn't a good way to do this. So eventually we will need to make testpost-master flock to CHTC and possibly remove the ability of testpost-serv-1 to flock.

It seems the transfer mechanism won't transfer symlinks to directories (e.g. data/vlass.ms → /lustre/aoc/...). Is there a way around this?
- ANSWER: There is no flag to chase symlinks at the moment. Making the top-level directory (e.g. data) a symlink may work if transfer_input_files=data/
- If data is a symlink (e.g. data → ../data) and transfer_input_files=data, then I get the error that it won't transfer symlinks to directories.
- If data is a symlink (e.g. data → ../data) and transfer_input_files=data/, then it transfers the contents, not the directory. In other words, I don't have a data directory in scratch; I have a VLASS... directory.
DAG log time stamps: Is there a way to differentiate data import/export time from process run time?
- Look in the job log file, not the DAG log file.
- 040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
  Transferring to host: <10.64.10.172:9618?addrs=10.64.10.172-9618&alias=nmpost072.aoc.nrao.edu&noUDP&sock=slot1_1_72656_7984_60>
  ...
  040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files
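
The input transfer time can be pulled out of the job log with a short script. The sketch below assumes the event lines look exactly like the excerpt above (event code 040, job ID in parentheses, ISO-style timestamp); the function name is hypothetical and other log formats would need a different pattern.

```python
import re
from datetime import datetime

EVENT = re.compile(
    r"^\s*040 \((?P<job>[\d.]+)\) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<what>Started|Finished) transferring input files"
)


def input_transfer_seconds(log_text):
    """Map each job ID to the seconds spent transferring input files."""
    starts = {}
    durations = {}
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        if match["what"] == "Started":
            starts[match["job"]] = stamp
        elif match["job"] in starts:
            started = starts.pop(match["job"])
            durations[match["job"]] = (stamp - started).total_seconds()
    return durations


sample = """040 (150.000.000) 2020-06-15 13:05:45 Started transferring input files
...
040 (150.000.000) 2020-06-15 13:06:04 Finished transferring input files
"""
print(input_transfer_seconds(sample))  # {'150.000.000': 19.0}
```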


NRAO-CHTC HTCondor collaboration
