...
- Is there a config option that will cause condor to not start? We have diskless nodes and it is easier to modify the config file than to change systemd.
- Torque has a command called pbsnodes that can not only offline/drain a node but also keep a note about it that everyone can see in one place. I know I can use condor_off to drain a node, but is there a central place to keep notes so I can remember a month later why I set a certain node to drain?
- Bug where James's jobs are all put on the same core. Here is top -u krowe showing the Last Used Cpu (SMP) after I submitted five sleep jobs to the same host.
- Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
- Is this a failure of the Linux kernel to schedule things on separate cores?
- Is this because cpu.shares is set to 100 instead of 1024?
...
- Bug in condor_annex: The following will wait for an annex named krowe - annex - casa5 (note the spaces). If I pass $(myannex) as an argument to a shell script, the spaces are not there. Underscores instead of hyphens cause different problems.
- include.htc
- myannex = krowe-annex-casa5
- submit.htc
- include : include.htc
- executable = /bin/sleep
- arguments = 127
- +MayUseAWS = True
- requirements = AnnexName == $(myannex)
- queue
- Actually, I think this isn't a bug but a limitation on using macros. The AnnexName needs to be quoted but how can I quote a macro?
- No: requirements = AnnexName == "$(myannex)"
- No: myannex = "krowe-annex-casa5"
- No: myannex = \"krowe-annex-casa5\"
- No: myannex = "\"krowe-annex-casa5\""
- Bug in condor_annex: Underscores in the AnnexName prevent the annex from moving into the pool.
- Also, when I try to terminate an annex with underscores (e.g. krowe_annex_casa5) with the command condor_off -annex krowe_annex_casa5, I get the following error:
- Found no ClassAds when querying pool (local)
- Can't find addresses for master's for constraint 'AnnexName =?= "krowe_annex_casa5"'
Perhaps you need to query another pool.
Answered Questions:
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they: are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A Job ID (ClusterID.ProcID)
- DNS name of the schedd and ctime of the job_queue.log file.
- It is unique to a schedd.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on just it.
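As a rough illustration of the "craft their own ID" suggestion above, here is a minimal Python sketch. The function names, the `#` separator, and the example schedd hostname are hypothetical and are not part of HTCondor. It assumes the caller already knows the schedd's DNS name and the ClusterID.ProcID returned at submit time, and it appends a random component so the result does not depend on the Job ID alone.

```python
import uuid


def make_tracking_id(schedd_name: str, cluster_id: int, proc_id: int) -> str:
    """Build a tracking ID seeded with the HTCondor Job ID (hypothetical helper).

    The Job ID is only unique to one schedd, so the schedd name is included,
    and a random UUID is appended so the ID stays unique even if the
    schedd's job numbering were ever to start over.
    """
    job_id = f"{cluster_id}.{proc_id}"
    return f"{schedd_name}#{job_id}#{uuid.uuid4().hex}"


def job_id_of(tracking_id: str) -> str:
    """Recover the ClusterID.ProcID part from a tracking ID."""
    # Split from the right so only the last two separators are considered.
    return tracking_id.rsplit("#", 2)[1]


if __name__ == "__main__":
    # "submit.example.org" is a placeholder schedd hostname.
    tid = make_tracking_id("submit.example.org", 1234, 0)
    print(tid)
    print(job_id_of(tid))
```

The tracking ID would be stored in our own bookkeeping; the embedded Job ID is kept only as a convenience for looking the job up on that schedd.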
- Upgrading HTCondor without killing jobs?
- schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- currently restarting execute services will kill jobs. CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones)
- There is another software filesystem where CASA could live that is more used for admin usage but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server so it would often be cached.
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is Torque does a su - _username_ while HTCondor just does a su _username_
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node. Perhaps the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable as /bin/bash and then have the arguments be the executable you used to have. I just changed our stuff to statically set $HOME and I think that is good enough.
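The /bin/bash approach in the answer above could look roughly like the following Python sketch, which generates a submit description. The function name and the paths in the example are hypothetical. It assumes the original executable is a shell script that bash can read on the execute host, that none of the arguments contain whitespace or quotes, and it adds bash's `-l` (login shell) flag, which is not in the notes, so that /etc/profile is read. Setting HOME through the `environment` command mirrors the "statically set $HOME" workaround.

```python
from typing import List, Optional


def bash_wrapped_submit(script: str, args: List[str],
                        home: Optional[str] = None) -> str:
    """Return a submit description that runs `script` through /bin/bash.

    Hypothetical helper: bash is started with -l so it behaves like a
    login shell, and HOME is set statically when `home` is given.
    """
    words = ["-l", script, *args]
    for word in words:
        # Keep the sketch simple: refuse anything that would need quoting.
        if any(ch.isspace() or ch in "'\"" for ch in word):
            raise ValueError(f"argument needs quoting, not handled here: {word!r}")
    lines = [
        "executable = /bin/bash",
        "arguments = " + " ".join(words),
    ]
    if home is not None:
        lines.append(f'environment = "HOME={home}"')
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Both paths below are placeholders.
    print(bash_wrapped_submit("/home/krowe/run_casa.sh", ["--quiet"],
                              home="/home/krowe"))
```

Whether the login shell picks up the user's rc files will still depend on HOME being set before bash starts, which is why the sketch sets it in the job environment rather than inside the script.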
...