Open questions:
- How can you tell which job an email refers to, given that the email message doesn't include a working dir or the assigned batch_name?
- CHTC will look into adding such information to the email condor sends.
- Bug where James's jobs are all put on the same core. Here is top -u krowe showing the Last Used Cpu (SMP) after I submitted five sleep jobs to the same host.
- Is this just a side effect of condor using cpuacct instead of cpuset in cgroup?
- Is this a failure of the Linux kernel to schedule things on separate cores?
- Is this because cpu.shares is set to 100 instead of 1024?
- Check if CPU affinity is set in /proc/self/status
- Is sleep cpu-intensive enough to properly test this? Perhaps submit a while 1 loop instead?
...
- Torque has a command called pbsnodes that can not only offline/drain a node but also keep a note about it that everyone can see in one place. I know I can use condor_off to drain a node, but is there a central place to keep notes so I can remember a month later why I set a certain node to drain?
- ANSWER: there is no place to keep such notes but Greg likes the idea and may look into it.
- May want to use condor_drain instead of condor_off. condor_off will kill the startd when all jobs finish, and the node then no longer shows up in condor_status. condor_drain will leave the node in condor_status.
- condor_drain doesn't work for me because it immediately sets jobs idle instead of letting them run to completion. This is why I use condor_off -startd -peaceful instead.
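
For the same-core question above, a sleep job spends nearly all of its time blocked, so its Last Used Cpu may say little about how busy jobs would be placed. The following is a minimal sketch of a CPU-bound test job that could be submitted in place of sleep. It prints the Cpus_allowed_list line from /proc/self/status (the affinity check suggested above) and then spins, periodically reporting the CPU it last ran on (field 39 of /proc/self/stat). The script name spin_test.py and the function names are hypothetical, and the script is Linux-only because it reads /proc.

```python
import os
import sys
import time


def status_field(status_text, key):
    """Return the value of `key` from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == key:
            return value.strip()
    return None


def last_cpu(stat_text):
    """Return field 39 (processor) of /proc/<pid>/stat text."""
    # The command name (field 2) is in parentheses and may itself contain
    # spaces or parentheses, so split after the final ')'. What remains
    # starts at field 3, which puts field 39 at index 36.
    rest = stat_text.rpartition(")")[2].split()
    return int(rest[36])


def spin(seconds):
    """Burn CPU for roughly `seconds`; return the number of iterations."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1
    return count


def main(duration, interval=5.0):
    """Report the allowed CPUs once, then spin and report the last CPU used."""
    with open("/proc/self/status") as f:
        allowed = status_field(f.read(), "Cpus_allowed_list")
    print("pid", os.getpid(), "Cpus_allowed_list:", allowed, flush=True)
    elapsed = 0.0
    while elapsed < duration:
        step = min(interval, duration - elapsed)
        spin(step)
        elapsed += step
        with open("/proc/self/stat") as f:
            print("last cpu:", last_cpu(f.read()), flush=True)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(float(sys.argv[1]))
    else:
        print("usage: spin_test.py SECONDS")
```

Submitting five copies to the same host with an argument such as 60 and comparing the "last cpu" lines in each job's output should show whether busy jobs still pile onto one core. If Cpus_allowed_list covers every core on the host, then affinity is probably not what is pinning the jobs.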