...

Is there a config option that will cause condor to not start? We have diskless nodes and it is easier to modify the config file than to change systemd.

Torque has a command called pbsnodes that can not only offline/drain a node but also keeps a note about it that everyone can see in one place. I know I can use condor_off to drain a node, but is there a central place to keep notes so I can remember a month later why I set a certain node to drain?

Bug where James's jobs are all put on the same core. Here is top -u krowe showing the Last Used Cpu (SMP) after I submitted five sleep jobs to the same host.
- Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
- Is this a failure of the Linux kernel to schedule things on separate cores?
- Is this because cpu.shares is set to 100 instead of 1024?
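One way to narrow down the three possibilities above is to compare the CPU each job last ran on with the set of CPUs it is allowed to run on. The following is a minimal sketch, assuming a Linux host with /proc mounted. The helper names are hypothetical, and field 39 of /proc/<pid>/stat is assumed to be the same "last used CPU" value that top displays.

```python
import os


def last_cpu_from_stat(stat_line: str) -> int:
    """Return field 39 ('processor') from the text of /proc/<pid>/stat."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so field 39 sits at index 36.
    return int(after_comm[36])


def report(pids):
    """Map each PID to (last CPU used, set of CPUs it is allowed to run on)."""
    result = {}
    for pid in pids:
        with open(f"/proc/{pid}/stat") as f:
            last_cpu = last_cpu_from_stat(f.read())
        # sched_getaffinity is Linux-only; it returns the allowed CPU set.
        result[pid] = (last_cpu, os.sched_getaffinity(pid))
    return result
```

Calling report() with the PIDs of the five sleep jobs would show whether the allowed set is a single core (which would point at affinity or cpuset) or all cores (which would point at the kernel scheduler). Sleep jobs use almost no CPU time, so the kernel may simply have no reason to move them.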

...

Bug in condor_annex: The following will wait for an annex named krowe - annex - casa5 (note the spaces). If I pass $(myannex) as an argument to a shell script, the spaces are not there. Underscores instead of hyphens cause different problems.
- include.htc
  - myannex = krowe-annex-casa5
- submit.htc
  - include : include.htc
  - executable = /bin/sleep
  - arguments = 127
  - +MayUseAWS = True
  - requirements = AnnexName == $(myannex)
  - queue
- Actually, I think this isn't a bug but a limitation of using macros. The AnnexName needs to be quoted, but how can I quote a macro? None of the following work:
  - No: requirements = AnnexName == "$(myannex)"
  - No: myannex = "krowe-annex-casa5"
  - No: myannex = \"krowe-annex-casa5\"
  - No: myannex = "\"krowe-annex-casa5\""
Bug in condor_annex: Underscores in the AnnexName prevent the annex from moving into the pool.
- Also, when I try to terminate an annex with underscores (e.g. krowe_annex_casa5) with the command condor_off -annex krowe_annex_casa5, I get the following error:
  - Found no ClassAds when querying pool (local)
  - Can't find addresses for master's for constraint 'AnnexName =?= "krowe_annex_casa5"'
  - Perhaps you need to query another pool.

Answered Questions:

JOB ID question from Daniel
- When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
  - Are job IDs guaranteed to be unique in HTCondor?
  - How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A job ID has the form ClusterID.ProcID.
- For wider uniqueness, combine it with the DNS name of the schedd and the ctime of the job_queued.log file.
- On its own, a job ID is unique only within a schedd.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on just it.
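As a rough illustration of crafting our own ID from the pieces listed above, here is a minimal sketch. The function name, the "#" separator, and the example hostname are all assumptions for illustration, not an HTCondor API.

```python
def make_tracking_id(schedd_name: str, cluster_id: int, proc_id: int,
                     queue_ctime: int) -> str:
    """Combine the schedd's DNS name, the job ID, and a timestamp.

    queue_ctime is assumed to be the ctime (seconds since the epoch) of the
    schedd's job queue log, so a rebuilt queue yields different IDs.
    """
    job_id = f"{cluster_id}.{proc_id}"  # ClusterID.ProcID
    return f"{schedd_name}#{job_id}#{queue_ctime}"
```

For example, make_tracking_id("submit.example.org", 1234, 0, 1580000000) gives "submit.example.org#1234.0#1580000000". The job ID seeds the result but the ID does not depend on it alone, which is the point made above.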
Upgrading HTCondor without killing jobs?
- The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- Currently, restarting execute services will kill jobs. CHTC is working on improving this.
- The negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure compatibility within a release series: 8.8.x with 8.8.y, and 8.9.x with 8.9.y.
Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
- There is another software filesystem where CASA could live. It is used mostly for admin purposes but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node, for example if the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then have the arguments be the executable you used to have. I just changed our stuff to statically set $HOME and I think that is good enough.
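The /bin/bash approach from the answer above could be generated roughly as follows. This is a minimal sketch: the helper names and the /home/krowe path are hypothetical, the use of bash -c is one possible way to pass the old command, and the quoting assumes HTCondor's double-quoted "new" arguments syntax, in which single quotes group an argument containing spaces and literal quotes are doubled.

```python
def quote_condor_arg(arg: str) -> str:
    """Quote one argument for HTCondor's double-quoted arguments syntax."""
    # Literal quote characters are escaped by doubling them.
    escaped = arg.replace('"', '""').replace("'", "''")
    # Whitespace or embedded single quotes require single-quote grouping.
    if any(ch.isspace() for ch in arg) or "'" in arg:
        return f"'{escaped}'"
    return escaped


def wrap_with_bash(command: str, home: str) -> str:
    """Build submit-file lines that run `command` via /bin/bash with $HOME set."""
    lines = [
        "executable = /bin/bash",
        f'arguments = "-c {quote_condor_arg(command)}"',
        f'environment = "HOME={home}"',  # statically set $HOME
        "queue",
    ]
    return "\n".join(lines)
```

For example, wrap_with_bash("/bin/sleep 127", "/home/krowe") produces a description whose arguments line is "-c '/bin/sleep 127'", so bash receives the old executable and its arguments as a single command string.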

...
