...
Lustre is going to be a problem. Our new VMs can't see Lustre at the moment. Can just a submit host see Lustre, and not the CM, in order to flock?
ANSWER: Only submit machines need to be configured to flock; flocking goes from a local submit host to a remote CM. So we could keep gibson as a flocking submit host, and the new CMs don't need the firewall rules.
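A minimal sketch of that submit-host-only configuration (the remote CM hostname below is a placeholder):

    # condor_config.local on the flocking submit host (e.g. gibson) only;
    # the CMs need no flocking config or firewall rules
    FLOCK_TO = cm.remote-pool.example.edu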
Memory bug fix?
What version of condor has this fix?
ANSWER: 8.9.9
When is it planned for 8.8 or 9.x inclusion?
ANSWER: 9.0 in Apr. 2021
Globus
You mentioned that the globus RPMs are going away. Yes?
ANSWER: They expect to drop globus support in 9.1 around May 2021.
Our MPI problem
We figured it out, mostly.
VNC
Do you have any experience using VNC with HTCondor?
ANSWER: No, they don't have experience with this. But setting mount_under_scratch= (empty) makes jobs use the real /tmp.
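A sketch of that knob, assuming the intent is for jobs (e.g. a VNC session) to share the machine's real /tmp:

    # MOUNT_UNDER_SCRATCH normally lists directories (e.g. /tmp,/var/tmp) to be
    # bind-mounted under the job's scratch dir; an empty value disables the
    # remapping, so jobs see the real /tmp
    MOUNT_UNDER_SCRATCH =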
Answered Questions:
...
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they—are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- ANSWER: A Job ID (ClusterID.ProcID) is unique only within a single schedd.
- Combining it with the DNS name of the schedd and the ctime of the job_queue.log file makes it globally unique.
- We should talk with Daniel about this. They should craft their own ID. It could be seeded with a JobID but should not depend on it alone.
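- For reference, HTCondor itself concatenates roughly those same pieces (schedd name, ClusterID.ProcID, and a submit-time timestamp) into each job's GlobalJobId attribute, which could seed such an ID. A usage sketch (the job ID is a placeholder):

    # Prints something like "gibson.nrao.edu#1234.0#1605000000"
    condor_q 1234.0 -af GlobalJobId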
- Upgrading HTCondor without killing jobs?
- The schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- Currently, restarting execute services will kill jobs. CHTC is working on improving this.
- The negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
- There is another software filesystem where CASA could live; it is used more for admin purposes but might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server, so it would often be cached. (Sketch below.)
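- A submit-file sketch of that HTTP approach (URL hypothetical); HTCondor's file-transfer plugins can fetch http:// URLs, and the download would often hit CHTC's proxy cache:

    # Fetch the CASA tarball over HTTP as a job input file
    transfer_input_files = http://casa.example.nrao.edu/casa.tar.gz
    should_transfer_files = YES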
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy: Torque does a su - _username_ while HTCondor just does a su _username_.
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node, e.g. if the cluster node runs a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable as /bin/bash and then have the arguments be the executable you used to have. (Sketch below.) I just changed our stuff to statically set $HOME, and I think that is good enough.
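- A minimal submit-file sketch of that /bin/bash approach (script name and home path hypothetical):

    # Run the real job under a login shell so /etc/profile and rc files get sourced
    executable = /bin/bash
    arguments = "-l ./myjob.sh"
    transfer_input_files = myjob.sh
    # Alternatively, statically set just the variables the job needs
    environment = "HOME=/users/jdoe"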
...
ANSWER: There are two problems here. The first is the short read, whose root cause we are still trying to understand. We've worked around it in the short term by re-polling when the number of processes we see drops by 10% or more. The other problem is that when condor uses cgroups to measure the amount of memory that all processes in a job use, it goes through the various fields in /sys/fs/cgroup/memory/cgroup_name/memory.stat. Memory is categorized into a number of different types in this file, and we were omitting some types of memory when summing up the total.
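To illustrate the accounting, a sketch of the categories being summed (cgroup name placeholder as above; exact fields vary by kernel version):

    # Each line of memory.stat is a separate category of the job's memory;
    # omitting any category undercounts the total
    cat /sys/fs/cgroup/memory/<cgroup_name>/memory.stat
    # typical output (truncated):
    #   cache 52428800
    #   rss 104857600
    #   mapped_file 8388608
    #   swap 0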
cpuset issues
ANSWER: git bisect could be useful. Maybe we could ask Ville.
Distant execute nodes
Are there any problems having compute nodes at a distant site?
ANSWER: No intrinsic issues. Be sure to set requirements (sketch below).
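A sketch of steering jobs with requirements, assuming the admins advertise a custom machine attribute such as SiteName (hypothetical) on the distant nodes:

    # Only match execute nodes at the distant site
    requirements = (SiteName == "remote-site")

The same expression negated would keep latency-sensitive jobs on local nodes.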