...
I am working on a transfer plugin that uses rsync and I ran into a situation that confounds me. I have the following job:
#!/bin/sh
mkdir newdir
date > newdir/date
/bin/sleep ${1}
that creates a directory newdir and a file newdir/date. If I set transfer_output_files = newdir and output_destination = nraorsync://$ENV(PWD) in the submit file, the resulting input file that is fed to my plugin when the plugin is called with the -upload argument is this:
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stderr"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stderr" ][ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stdout"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stdout" ][ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/newdir/date"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/newdir/date" ]
I am surprised to see that it sets LocalFileName and Url to the file inside newdir instead of newdir itself. Needless to say, this makes rsync unhappy as newdir doesn't exist on the destination yet. Is it possible that setting preserve_relative_paths = true will affect this?
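One possible workaround on the plugin side, sketched here under assumptions, is to derive the destination's parent directory from each Url and make sure it exists before rsync copies the file. The helper names url_to_path and remote_parent are hypothetical and not part of HTCondor or the existing plugin; the sketch only assumes that every Url starts with the nraorsync:// scheme, as in the input shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of HTCondor): strip the "nraorsync://"
# scheme from a Url value, leaving the absolute destination path.
url_to_path() {
    printf '%s\n' "${1#nraorsync://}"
}

# Hypothetical helper: print the directory that must already exist on
# the destination before the file named by the Url can be copied.
remote_parent() {
    dirname "$(url_to_path "$1")"
}

# Example with the Url from the plugin input above.
remote_parent "nraorsync:///lustre/aoc/sciops/krowe/plugin/newdir/date"
```

The plugin could then create that directory on the destination (for example with mkdir -p) before calling rsync. If the rsync in use is version 3.2.3 or newer, its --mkpath option may make the extra step unnecessary, though that has not been tested against this setup.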
What's weirder is the condor log file shows the following even if 'newdir' exists in the output_destination. The file 'newdir/date' ends up on the submit host and looks correct.
022 (4149.000.000) 08/05 09:22:04 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1_4@testpost003.aoc.nrao.edu <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
...
023 (4149.000.000) 08/05 09:22:04 Job reconnected to slot1_4@testpost003.aoc.nrao.edu
startd address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=startd_28565_cae3>
starter address: <10.64.1.173:9618?addrs=10.64.1.173-9618&alias=testpost003.aoc.nrao.edu&noUDP&sock=slot1_4_28601_bde4_612>
...
yet condor re-runs the upload portion of the plugin four more times before finally giving up with this error:
007 (4149.000.000) 08/05 09:22:31 Shadow exception!
Error from slot1_4@testpost003.aoc.nrao.edu: Repeated attempts to transfer output failed for unknown reasons
0 - Run Bytes Sent By Job
1007 - Run Bytes Received By Job
Shadow jobs and Lustre
We had some jobs get restarted because they lost contact with their shadow jobs. I assume this is because the shadow jobs keep the condor.log file open: if that file is on Lustre and Lustre goes down, the shadow job fails to communicate with the job, and the job gets killed. Does that seem accurate to you?
...