...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22
Try my sqrt(rand()) on gibson instead of testpost.
Look at the masks in the cgroup.
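For reference, the sqrt(rand()) test is just a CPU burner that pins one core at ~100% so placement is easy to see in top's P column. The exact source isn't recorded here, so this is a sketch of the assumed program (gcc burn.c -lm produces the a.out seen in the Dec. 8 output below):

    /* burn.c: spin forever doing floating-point work (assumed form of the test) */
    #include <math.h>
    #include <stdlib.h>

    int main(void)
    {
        volatile double x = 0.0;      /* volatile so the loop isn't optimized away */
        for (;;)
            x += sqrt((double)rand());
        return 0;                     /* never reached; kill the process when done */
    }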
From jrobnett@nrao.edu Tue Nov 10 16:38:18 2020
As (bad) luck would have it, I had some jobs running where I forgot to set the number of cores, so they triggered the behavior.
SSHing into the node, I see three processes sharing the same core, and the following for the three Python processes:
bash-4.2$ cat /proc/113531/status | grep Cpus
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list: 0
If I look at another node with three processes that aren't sharing the same core, I see:
bash-4.2$ cat /proc/248668/status | grep Cpu
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00555555
Cpus_allowed_list: 0,2,4,6,8,10,12,14,16,18,20,22
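To decode these by hand: each comma-separated group of Cpus_allowed is 32 bits, and bit N set means CPU N is allowed, so 00000001 is CPU 0 only and 00555555 (binary 0101...0101 over 24 bits) is the even CPUs 0 through 22, matching the Cpus_allowed_list lines. taskset reports the same thing without hand-decoding, and the cpuset cgroup can be read directly (the cgroup path is an assumption; it varies with cgroup version and HTCondor's naming):

    # print a PID's affinity as a CPU list (same info as Cpus_allowed_list)
    taskset -cp 248668
    # -> pid 248668's current affinity list: 0,2,4,6,8,10,12,14,16,18,20,22

    # read the cpuset cgroup itself (cgroup v1 layout; <job cgroup> is a placeholder)
    cat /sys/fs/cgroup/cpuset/<job cgroup>/cpuset.cpus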
Dec. 8, 2020 krowe: I launched five sqrt(rand()) jobs and each one landed on its own CPU.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0
and the masks aren't restricting them to specific CPUs, so I am as yet unable to reproduce James's problem.
st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447
Answered Questions:
- JOB ID question from Daniel
When I submit a job, I get a job ID back. My plan is to hold onto that job ID permanently for tracking. We have had issues in the past with Torque/Maui because the job IDs got recycled later and our internal bookkeeping got mixed up. So my questions are:
- Are job IDs guaranteed to be unique in HTCondor?
- How unique are they: are they _globally_ unique or just unique within a particular namespace (such as our cluster or the submit node)?
- A Job ID (ClusterID.ProcID) is unique only to a schedd, not globally.
- To make one globally unique, combine it with the DNS name of the schedd and the ctime of the job_queue.log file.
- We should talk with Daniel about this. They should craft their own ID; it could be seeded with a Job ID but should not depend on it alone.
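- A hedged sketch of building such an ID from the ingredients above (the SPOOL lookup, file path, and variable names are assumptions, not a recommendation from CHTC):

    # build a site-unique job ID from the schedd's DNS name, the ctime of its
    # job_queue.log, and the ClusterId.ProcId
    SCHEDD=$(condor_config_val FULL_HOSTNAME)        # DNS name of the schedd host
    SPOOL=$(condor_config_val SPOOL)                 # schedd's spool directory
    QLOG_CTIME=$(stat -c %Z "$SPOOL/job_queue.log")  # distinguishes rebuilt queues
    # CLUSTER and PROC come from your submit tooling, e.g. parsed from condor_submit output
    UNIQUE_ID="${SCHEDD}:${QLOG_CTIME}:${CLUSTER}.${PROC}"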
- Upgrading HTCondor without killing jobs?
- schedd can be upgraded and restarted without losing state, assuming the restart takes less than the timeout.
- currently, restarting execute services will kill jobs; CHTC is working on improving this.
- negotiator and collector can be restarted without killing jobs.
- CHTC works hard to ensure that 8.8.x is compatible with 8.8.y and that 8.9.x is compatible with 8.9.y.
- Leaving data on execution host between jobs (data reuse)
- Todd is working on this now.
- Ask about installation of CASA locally and ancillary data (cfcache)
- CHTC has a Ceph filesystem that is available to many of their execution hosts (notably the larger ones).
- There is another software filesystem, used more for admin purposes, where CASA could live; it might be available to us.
- We could download the tarball each time over HTTP. CHTC uses a proxy server, so it would often be cached.
- Environment: Is there a way to have condor "login" when a job starts, thus sourcing /etc/profile and the user's rc files? Currently, not even $HOME is set.
- A good analogy is that Torque does a su - _username_ while HTCondor just does a su _username_.
- WORKAROUND: setting getenv = True, which is like the -V option to qsub, may help. It doesn't source rc files but does inherit your current environment. This may be a problem if your current environment is not what you want on the cluster node; perhaps the cluster node is a different OS or architecture.
- ANSWER: condor doesn't execute things with a shell. You could set your executable to /bin/bash and then have the arguments be the executable you used to have. I just changed our stuff to statically set $HOME, and I think that is good enough.
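- A minimal submit-file sketch combining the workaround and the answer above (the script name and $HOME path are made-up placeholders):

    universe    = vanilla
    executable  = /bin/bash                 # condor runs this directly; no login shell
    arguments   = run_job.sh                # placeholder for the executable you used to have
    getenv      = True                      # like qsub -V: inherit the submit-side environment
    environment = "HOME=/users/krowe"       # statically set $HOME (placeholder path)
    # transfer_input_files = run_job.sh     # may be needed without a shared filesystem
    queue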
...