...
Some of our jobs were restarted because they lost contact with their shadow processes. I assume this is because each condor_shadow keeps the job's condor.log file open: if that file is on Lustre and Lustre goes down, the shadow can no longer communicate with the job, and the job gets killed. Does that seem accurate to you? The listing below shows one shadow (PID 1631810) holding /lustre/aoc/sciops/krowe/condor.486.log open on file descriptor 3.
nmpost-master root >ps auxww|grep shadow|grep krowe
krowe 1631810 0.0 0.0 38708 3676 ? S 09:29 0:00 condor_shadow -f 486.0 --schedd=<10.64.10.100:9618?addrs=10.64.10.100-9618&noUDP&sock=5837_96cc_3> --xfer-queue=limit=upload,download;addr=<10.64.10.100:14115> <10.64.10.100:14115> -
nmpost-master root >ls -la /proc/1631810/fd
total 0
dr-x------ 2 root root 0 Jul 27 09:29 ./
dr-xr-xr-x 8 krowe nmstaff 0 Jul 27 09:29 ../
lr-x------ 1 root root 64 Jul 27 09:29 0 -> pipe:[16358528]
lr-x------ 1 root root 64 Jul 27 09:29 1 -> pipe:[16358540]
lrwx------ 1 root root 64 Jul 27 09:29 18 -> socket:[16358529]
l-wx------ 1 root root 64 Jul 27 09:29 2 -> pipe:[16358540]
l-wx------ 1 root root 64 Jul 27 09:29 3 -> /lustre/aoc/sciops/krowe/condor.486.log
lrwx------ 1 root root 64 Jul 27 09:29 4 -> socket:[16358542]
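To run the same check across every shadow on the submit host rather than one PID at a time, something like the following Python sketch might help. It only automates the ps and ls steps above by walking /proc; it does not confirm that an open log file on Lustre is what causes the lost contact. The function name shadow_files_under and its parameters are made up for this illustration and are not part of HTCondor. It assumes a Linux /proc layout and enough privilege (for example root) to read other users' fd directories.

```python
import os

def shadow_files_under(prefix, proc_root="/proc", name="condor_shadow"):
    """Return {pid: [paths]} for processes whose argv[0] basename equals
    `name` and that hold open files under `prefix`.

    Hypothetical helper for illustration; not an HTCondor API.
    """
    prefix = prefix.rstrip("/") + "/"
    hits = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        try:
            # cmdline is NUL-separated; the first field is argv[0]
            with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
                argv0 = f.read().split(b"\0")[0].decode(errors="replace")
            if os.path.basename(argv0) != name:
                continue
            fd_dir = os.path.join(pid_dir, "fd")
            # each fd entry is a symlink to a path, pipe:[...] or socket:[...]
            targets = [os.readlink(os.path.join(fd_dir, fd))
                       for fd in os.listdir(fd_dir)]
        except OSError:
            continue  # process exited or we lack permission to inspect it
        paths = sorted(t for t in targets if t.startswith(prefix))
        if paths:
            hits[int(entry)] = paths
    return hits

# Example call on the submit host (run as root to see other users' fds):
#   for pid, paths in shadow_files_under("/lustre").items():
#       print(pid, *paths)
```

Any PID this reports has a file open under the given prefix, so those are the shadows that could be exposed if that filesystem becomes unavailable.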
Docs wrong for evaluating ClassAds?
...