...
What happens to running jobs if the submit host reboots? What about their shadow processes? What if the submit host is replaced with a new server? I think we have shown there is a 2400-second (40-minute) timeout.
Transfer Plugin Upload
I have added my nraorsync_plugin.py to /usr/libexec/condor and added the following to the execution host configuration:
FILETRANSFER_PLUGINS = $(LIBEXEC)/nraorsync_plugin.py, $(FILETRANSFER_PLUGINS)
I am working on a transfer plugin that uses rsync and I ran into a situation that confounds me. I have the following job:
#!/bin/sh
mkdir newdir
date > newdir/date
/bin/sleep ${1}
...
and the following submit file:
executable = small.sh
arguments = "27"
output = stdout.$(ClusterId).log
error = stderr.$(ClusterId).log
log = condor.$(ClusterId).log
should_transfer_files = YES
transfer_input_files = /users/krowe/.ssh/condor_transfer
transfer_output_files = newdir
...
output_destination = nraorsync://$ENV(PWD)
+WantIOProxy = True
queue
With this submit file, the input file that is fed to my plugin when it is called with the -upload argument (.nraorsync_plugin.in) contains this:
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stderr"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stderr" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/_condor_stdout"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/_condor_stdout" ]
[ LocalFileName = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453/newdir/date"; Url = "nraorsync:///lustre/aoc/sciops/krowe/plugin/newdir/date" ]
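To make it easier to see what the plugin is actually handed, here is a minimal sketch that pulls the (LocalFileName, Url) pairs out of text shaped like the file above. It is not taken from nraorsync_plugin.py. The helper names parse_transfer_ads and upload_pairs are hypothetical, and the regular expressions are a simplification that only handles quoted string attributes inside [ ... ] blocks, not the full ClassAd syntax.

```python
import re

# Simplified patterns (an assumption, not a real ClassAd parser):
# each ad is one "[ ... ]" block, each attribute is Name = "quoted string".
AD_RE = re.compile(r"\[(.*?)\]", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_transfer_ads(text):
    """Return one dict per [ ... ] block, mapping attribute name to its string value."""
    return [dict(ATTR_RE.findall(body)) for body in AD_RE.findall(text)]


def upload_pairs(text, scheme="nraorsync"):
    """Return (local_path, remote_path) tuples, with the URL scheme stripped."""
    prefix = scheme + "://"
    pairs = []
    for ad in parse_transfer_ads(text):
        url = ad["Url"]
        remote = url[len(prefix):] if url.startswith(prefix) else url
        pairs.append((ad["LocalFileName"], remote))
    return pairs


# The three ads shown above, rebuilt from their common prefixes.
EXEC_DIR = "/lustre/aoc/admin/tmp/condor/testpost003/execute/dir_29453"
DEST = "nraorsync:///lustre/aoc/sciops/krowe/plugin"
SAMPLE = "".join(
    '[ LocalFileName = "%s/%s"; Url = "%s/%s" ]' % (EXEC_DIR, name, DEST, name)
    for name in ("_condor_stderr", "_condor_stdout", "newdir/date")
)

if __name__ == "__main__":
    for local, remote in upload_pairs(SAMPLE):
        print(local, "->", remote)
```

Run against the sample, this yields three pairs, and the last one refers to newdir/date rather than to newdir: the file inside the directory gets its own entry, and there is no entry for the directory itself.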
...