...
Every DAG or task creates .log, .out and maybe .png files that we want to keep. Also, .last files like tclean.last are often created. These are not necessary but can be useful for debugging. I assume that almost all tasks require the Measurement Set (MS). The open question is which tasks actually modify the MS. run_tclean() defaults to using the corrected datacolumn. Does that mean it is changing this column? No: datacolumn states which column to read from and does not imply any change to the MS. Task07 sets savemodel='modelcolumn', which does modify the MS.
This document is not complete. I am sure I am missing inputs and perhaps outputs as well.
In this document, "data" when referenced as an input or an output is a directory containing the Measurement Set (E.g. VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/). The jobs are run in the working directory so any file references are relative to that.
How do we handle wanting to start a job at a given task? For example, say a job ran to completion but you want to re-run it after altering something in task17. It would be unfortunate to have to run tasks 1 through 16 again. It would be better to start at task17 and run through to the end of task25. This requires saving the output of each task. But how? Incremental or differential? Using prolog and epilog scripts? Other?
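One possible approach, assuming the outputs of the earlier steps have been saved somewhere the later steps can read them, is to mark the earlier nodes as DONE in the DAG file so DAGMan skips them. The sketch below only generates the DAG text for a simple linear chain; the node names and the stepNN.sub submit file names are hypothetical and not taken from the real DAG.

```python
def dag_from_step(n_steps, start_step):
    """Return DAG-file text for a linear chain of steps 1..n_steps.

    Steps before start_step carry the DONE keyword so DAGMan treats
    them as already completed. Node and submit file names are
    hypothetical placeholders.
    """
    if not 1 <= start_step <= n_steps:
        raise ValueError("start_step must be between 1 and n_steps")
    lines = []
    for i in range(1, n_steps + 1):
        done = " DONE" if i < start_step else ""
        lines.append(f"JOB step{i:02d} step{i:02d}.sub{done}")
    # Linear chain: each step depends on the one before it
    for i in range(1, n_steps):
        lines.append(f"PARENT step{i:02d} CHILD step{i + 1:02d}")
    return "\n".join(lines) + "\n"


dag_text = dag_from_step(25, 17)
```

This does not answer the harder question of how the saved outputs get back onto the execute host; it only shows how the restart point could be expressed.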
This process doesn't need the SYSPOWER table. How can we remove it from the MS? Presumably we can just cp /dev/null SYSPOWER/table.f0 and cp /dev/null SYSPOWER/table.f0i
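A minimal Python sketch of the same idea as the cp /dev/null commands: it truncates the two SYSPOWER storage files to zero bytes. Whether CASA tolerates the emptied files is an assumption carried over from the note above, not something this sketch checks; the function name is hypothetical.

```python
import os


def truncate_syspower(ms_path, names=("table.f0", "table.f0i")):
    """Truncate SYSPOWER storage files to zero bytes (like cp /dev/null)."""
    truncated = []
    for name in names:
        path = os.path.join(ms_path, "SYSPOWER", name)
        if os.path.isfile(path):
            # Opening for writing in binary mode empties the file
            with open(path, "wb"):
                pass
            truncated.append(path)
    return truncated
```

It would be called with the path of the MS directory, and it leaves every other file in SYSPOWER untouched.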
Does run_tclean() need just the .psf directories or does it need more than that? Tclean will need all image types (suffixes) for the named image. For instance Task01 makes a set of 'iter0' images and Task04 makes an 'iter1' set of images. Task05 references both. It would be acceptable to pass all of iter0* and iter1*, but in practice it only needs the PSF from both, so something like iter0*.psf and iter1*.psf should work.
Is it safe to assume I don't need to transfer lockfiles like table.lock even if they have been modified?
How do we transfer input files for each DAG?
- Explicitly list every file/directory in transfer_input_files (it doesn't grok regexps). This would be a large list, e.g.
- transfer_input_files = "working/VIP_iter0.gridwt, working/VIP_iter0.pb.tt0, working/VIP_iter0.psf.tt0, working/VIP_iter0.psf.tt1, working/VIP_iter0.psf.tt2, working/VIP_iter0.sumwt.tt0, working/VIP_iter0.sumwt.tt1, working/VIP_iter0.sumwt.tt2, working/VIP_iter0.weight.tt0, working/VIP_iter0.weight.tt1, working/VIP_iter0.weight.tt2"
- Can transfer_input_files take a manifest, e.g. a file containing the list of files to transfer? Something like an include syntax.
- Make a temporary directory on the submit host and transfer that (possibly tarring it up). PRE and POST scripts might be useful here.
- Set the inputs and outputs for both data and working as variables in the unified DAG file for each DAG step. The task.sh script deletes and then makes working-<dagstep>, copies the inputs into this directory, and transfers it to the scratch area via transfer_input_files=working-<dagstep>. When finished, it explicitly transfers out of working-<dagstep> the things we know changed, as listed in an outputs variable defined in the DAG file.
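Since transfer_input_files does not expand patterns, one way to avoid maintaining the long explicit list by hand is to expand the patterns on the submit host and paste the result into the submit file. This is only a sketch: the function name and the base argument are hypothetical, and it assumes the image directories already exist on the submit host when it runs.

```python
import glob
import os


def transfer_list(patterns, base="."):
    """Expand shell-style patterns under base into a comma-separated list."""
    paths = []
    for pattern in patterns:
        matches = sorted(glob.glob(os.path.join(base, pattern)))
        # Keep paths relative to base, as in the explicit list above
        paths.extend(os.path.relpath(m, base) for m in matches)
    # Drop duplicates while keeping the original order
    unique = list(dict.fromkeys(paths))
    return ", ".join(unique)
```

For example, transfer_list(["working/VIP_iter0.psf.*", "working/VIP_iter1.psf.*"]) would produce only the PSF entries for both image sets.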
Task01
Doesn't alter the MS
run_tclean( 'iter0', cfcache=cfcache_nowb, robust=-2.0, uvtaper='3arcsec', calcres=False )
- input: ../data
- Input: cfcache_nowb='/mnt/scratch/cfcache/cfcache_spw2-17_imsize16384_cell0.6arcsec_w32_conjT_psf_wbawp_False.cf'
- output: VIP_iter0.*
Task02
This task creates VIP_iter0b.* but I don't see those files ever referenced in this script again. What does this task do that is necessary to other tasks? Josh Marvil said that this is a leftover task and can be removed.
Doesn't alter the MS
run_tclean( 'iter0b', cfcache=cfcache_nowb, calcres=False )
- input: ../data
- input: cfcache_nowb='/mnt/scratch/cfcache/cfcache_spw2-17_imsize16384_cell0.6arcsec_w32_conjT_psf_wbawp_False.cf'
- output: VIP_iter0b.*
Task03
This task doesn't parallelize and only takes tens of seconds to run.
Doesn't alter the MS
mask_from_catalog(inext=inext,outext="QLcatmask.mask",catalog_search_size=1.5,catalog_fits_file='../VLASS1Q.fits')
- input: ../data
- input: ../VLASS1Q.fits, VIP_iter0.psf.tt0
- output: mask_from_cat.crtf, VIP_QLcatmask.mask
Task04
Doesn't alter the MS
run_tclean( 'iter1', robust=-2.0, uvtaper="3arcsec" )
- input: ../data
- output: VIP_iter1.*
- Use rsync to merge the various data_inputs together into one data directory and the various working_inputs together into one working directory. Then at the end, task.sh moves data to data-<dagstep> and working to working-<dagstep>, and the appropriate dirs/files from these are transferred back to the submit host. The result of all this is that the data needed as an input for a step (e.g. Task08) may need to be combined from multiple places (initial data and data output from Task07).
To Do
- DONE: write .log, .out and .png files one level up so they are not in the working directory and therefore not copied to execute hosts.
- DONE: add rm -f *.last to the sh script?
- DONE: Task24 and Task25 swapped with 8 cores, so they need to run with fewer cores. I may need to make another variable to pass to the sh script for this.
- DONE: re-create my DOT graph after finishing task01-25-parallel-dag4. Also make a PDF instead of a PS.
- DONE: Craft parallel-dag5 with the concept of running tclean at CHTC and everything else locally.
- DONE: I don't like using the name Task as that has meaning to CASA. A better term might be Step as in DAG Step.
- DONE: Figure out how to not copy SYSPOWER in the MS. Presumably we can just cp /dev/null SYSPOWER/table.f0 and cp /dev/null SYSPOWER/table.f0i. I am doing this with my dag5 test now.
- DONE: Set up testpost-serv-1 with Lustre access over IB so we can start submitting to CHTC.
- DONE: Need to make the MS in my DAG script a variable. Right now I specify data/VLASS.../table.f23_TSM1
- Task19 needs to be unraveled from NRAO filesystems. Actually maybe not. This can be a task we always run here. But it does need to move into a VLASS area instead of Josh's home account.
- Task06 has parallel=False and I am using 8-core mpicasa, which is a waste of cores.
- Update this document to reflect the changes needed for CASA-6.
Possible Improvements
- Task03 could perhaps be run concurrently with Task01 as long as Task04 is run after both.
- Task13 could perhaps be run concurrently with Task12.
- Task13 could perhaps be run concurrently with Task11 and Task12 as long as Task14 is run after both Task11 and Task13.
- Task21 could perhaps be run concurrently with Task11 or later as long as Task22 is run after both Task21 and Task12.
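To see which steps could run side by side, the ordering constraints can be written down as a dependency map and grouped into batches. The sketch below uses Python's graphlib (3.9 or later); the small map encodes only the first bullet above (Task04 after both Task01 and Task03) plus an assumed Task05 after Task04, purely for illustration.

```python
from graphlib import TopologicalSorter


def concurrent_batches(deps):
    """Group steps into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Everything returned here could run at the same time
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Illustrative constraints only, not the full pipeline
deps = {
    "Task04": {"Task01", "Task03"},
    "Task05": {"Task04"},
}
batches = concurrent_batches(deps)
```

Each inner list is a set of steps that could share a DAG level; the same map could be used to emit PARENT/CHILD lines.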
Task01 - Step01
...
Doesn't alter the MS
run_tclean( 'iter0', cfcache=cfcache_nowb, robust=-2.0, uvtaper='3arcsec', calcres=False )
- input: ../data
- Input: cfcache_nowb='/mnt/scratch/cfcache/cfcache_spw2-17_imsize16384_cell0.6arcsec_w32_conjT_psf_wbawp_False.cf'
- output: VIP_iter0.*
Task02
This task doesn't parallelize and only takes tens of seconds to run. Should this be stuck on the end of task01?
Doesn't alter the MS
mask_from_catalog(inext=inext,outext="QLcatmask.mask",catalog_search_size=1.5,catalog_fits_file='../VLASS1Q.fits')
- input: ../data
- input: ../VLASS1Q.fits, VIP_iter0.psf.tt0
- output: mask_from_cat.crtf, VIP_QLcatmask.mask
Task03
Doesn't alter the MS
This task could possibly run at the same time as Task01 except that I have combined this with Task04 which requires both Task01 and Task03.
run_tclean( 'iter1', robust=-2.0, uvtaper="3arcsec" )
- input: ../data
- output: VIP_iter1.*
Task04
Doesn't alter the MS
replace_psf('iter1','iter0')
This is just some python that deletes VIP_iter1.psf.[tt0|tt1|tt2] and copies VIP_iter0.psf.[tt0|tt1|tt2] to VIP_iter1.psf.[tt0|tt1|tt2]. It would be inefficient to make this task be its own DAG step because the job would have to transfer iter0 and iter1 to the scratch area just to make the copy. I suggest it always be in the same DAG step as Task03. It will produce an error because *.workdirectory doesn't exist but that error is ignorable.
- input: VIP_iter0.psf.*, VIP_iter1.psf.*
- output: VIP_iter1.psf.*
...
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f23_TSM1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/SOURCE/table.lock
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.lock
Task05 - Step05
Doesn't alter the MS
run_tclean( 'iter1', robust=-2.0, uvtaper="3arcsec", niter=20000, nsigma=5.0, mask="QLcatmask.mask", calcres=False, calcpsf=False )
- input: ../data
- input: VIP_iter1.*, VIP_QLcatmask.mask
- output: VIP_iter1.*
Task06 - Step06
Alters the MS
Note that this sets parallel=False which means running mpicasa may be a waste of cores. Hopefully this step will not be necessary with CASA-6.
run_tclean( 'iter1', calcres=False, calcpsf=False, savemodel='modelcolumn', parallel=False )
- input: ../data
- input: VIP_iter1.*
- output: ../data
Task07 - Step07
Alters the MS
Tasks 08, 09, 10 and 11 take only minutes to run so could be combined into one DAG step.
flagdata(vis=vis, mode='rflag', datacolumn='residual_data',timedev='tdev.txt',freqdev='fdev.txt',action='calculate')
replace_rflag_levels()
flagdata(vis=vis, mode='rflag', datacolumn='residual_data',timedev='tdev.txt',freqdev='fdev.txt',action='apply',extendflags=False)
flagdata(vis=vis, mode='extend', extendpols=True, growaround=True)
- input: ../data
- output: tdev.txt, fdev.txt
- output: ../data
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f6
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f20_TSM1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/HISTORY/table.f0
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/HISTORY/table.lock
- output: ../data/VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions
...
Task08
Alters the MS
statwt(vis=vis,combine='field,scan,state,corr',chanbin=1,timebin='1yr', datacolumn='residual_data' )
- input: ../data
- output: ../data
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/FLAG_VERSION_LIST
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.f1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.dat
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.f0
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.lock
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.info
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms.flagversions/flags.statwt_1/table.f0_TSM1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f25_TSM1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f22_TSM1
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.f6
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/HISTORY/table.f0
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/HISTORY/table.lock
- VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/table.lock
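File lists like the one above can be produced by comparing a snapshot of the data directory taken before a step with one taken after it. This is a sketch under the assumption that a change in modification time or size is a good enough signal; the function names are hypothetical, and the optional skip argument is there only to experiment with leaving out lockfiles such as table.lock.

```python
import os


def snapshot(root, skip=()):
    """Map each file under root (relative path) to (mtime_ns, size)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name in skip:
                continue
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[os.path.relpath(path, root)] = (st.st_mtime_ns, st.st_size)
    return state


def changed_files(before, after):
    """Return files that are new or whose mtime or size differ."""
    return sorted(p for p, sig in after.items() if before.get(p) != sig)
```

Taking snapshot("../data") before and after a step and passing both to changed_files() gives the candidates for that step's outputs variable.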
...
Task09
Doesn't alter the MS
gaincal(vis=vis,caltable='g.0',gaintype='T',calmode='p',refant='0',combine='field,spw',minsnr=5)
- input: ../data
- output: g.0
...
Task10
Alters the MS
applycal(vis=vis,calwt=False,applymode='calonly',gaintable='g.0',spwmap=18*[2], interp='nearest')
- input: ../data
- input: g.0
- output: ../data
...
Task11
Doesn't alter the MS
run_tclean( 'iter0c', datacolumn='corrected', cfcache=cfcache_nowb, robust=-2.0, uvtaper='3arcsec', calcres=False )
- input: ../data
- output: VIP_iter0c.*
...
Task12
Doesn't alter the MS
Could this run in parallel with one or more previous run_tclean calls like Task11?
run_tclean( 'iter0d', datacolumn='corrected', cfcache=cfcache_nowb, calcres=False )
- input: ../data
- output: VIP_iter0d.*
...
Task13
Doesn't alter the MS
This task could possibly run at the same time as Task11 and/or Task12 except that I have combined this with Task14 which requires both Task13 and Task11.
run_tclean( 'iter1b', datacolumn='corrected', robust=-2.0, uvtaper="3arcsec" )
- input: ../data
- output: VIP_iter1b.*
...
Task14
Doesn't alter the MS
replace_psf('iter1b','iter0c')
This is just some python that deletes VIP_iter1b.psf.* and copies VIP_iter0c.psf.* to VIP_iter1b.psf.*. It would be inefficient to make this task its own DAG step. I suggest it always be in the same DAG step as Task13. It will produce an error because *.workdirectory doesn't exist but that error is ignorable.
We could remove iter0c because it is never used again.
- input: VIP_iter1b.psf.*, VIP_iter0c.psf.*
- output: VIP_iter1b.psf.*
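For reference, the delete-then-copy behaviour described for replace_psf can be sketched as below. This is not the actual replace_psf code, which I have not reproduced here; the function name copy_psf, the base prefix "VIP_" and the workdir argument are assumptions for illustration.

```python
import glob
import os
import shutil


def copy_psf(dst_iter, src_iter, base="VIP_", workdir="."):
    """Replace <base><dst_iter>.psf.* with copies of <base><src_iter>.psf.*."""
    for old in glob.glob(os.path.join(workdir, f"{base}{dst_iter}.psf.*")):
        # CASA images are directories, but handle plain files too
        if os.path.isdir(old):
            shutil.rmtree(old)
        else:
            os.remove(old)
    copied = []
    src_prefix = os.path.join(workdir, f"{base}{src_iter}.psf.")
    for src in sorted(glob.glob(src_prefix + "*")):
        suffix = src[len(src_prefix):]
        dst = os.path.join(workdir, f"{base}{dst_iter}.psf.{suffix}")
        if os.path.isdir(src):
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Calling copy_psf('iter1b', 'iter0c') in the working directory would mirror this step. It also shows why the step is cheap locally but expensive as its own job: both image sets have to be present on the same filesystem.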
Task15 - Step15
Doesn't alter the MS
run_tclean( 'iter1b', datacolumn='corrected', robust=-2.0, uvtaper="3arcsec", niter=20000, nsigma=5.0, mask="QLcatmask.mask", calcres=False, calcpsf=False )
- input: ../data
- input: VIP_iter1b.*, VIP_QLcatmask.mask
- output: VIP_iter1b.*
Task16 - Step16
Doesn't alter the MS
imsmooth(imagename=imagename_base+"iter1b.image.tt0", major='5arcsec', minor='5arcsec', pa='0deg', outfile=imagename_base+"iter1b.image.smooth5.tt0")
- input: ../data
- input: VIP_iter1b.image.tt0
- output: VIP_iter1b.image.smooth5.tt0
Task17
Doesn't alter the MS
exportfits(imagename=imagename_base+"iter1b.image.smooth5.tt0", fitsimage=imagename_base+"iter1b.image.smooth5.fits")
- input: ../data
- input: VIP_iter1b.image.smooth5.tt0
- output: VIP_iter1b.image.smooth5.fits
Task18
This needs some modification. It calls a script from Josh's homedir and runs bdsf out of /lustre. Also, I have been unable to run this task by itself; I get the following errors. I am going to combine this with tasks 17, 18, 20 and 21 so it isn't an issue right now.
2020-07-29 16:13:07 SEVERE exportfits::image::tofits (file ../../tools/images/image_cmpt.cc, line 6211) Exception Reported: Exception: File VIP_iter1b.image.smooth5.fits exists, and the user does not want to remove it..
2020-07-29 16:13:07 SEVERE exportfits::image::tofits (file ../../tools/images/image_cmpt.cc, line 6211)+ ... thrown by static void casa::ImageFactory::_checkOutfile(const casacore::String&, casacore::Bool) at File: ../../imageanalysis/ImageAnalysis/ImageFactory2.cc, line: 568
2020-07-29 16:13:07 SEVERE exportfits::::@testpost001:MPIClient An error occurred running task exportfits.
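The errors above say that VIP_iter1b.image.smooth5.fits already exists, presumably left over from an earlier run. One way to make a re-run possible, sketched here with a hypothetical helper name, is to remove the stale file before exportfits is called again. exportfits also has an overwrite parameter that may serve the same purpose; I have not tested that here.

```python
import os


def remove_stale(path):
    """Delete a leftover output file so a re-run can recreate it."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False
```

For this step it would be called as remove_stale("VIP_iter1b.image.smooth5.fits") just before the exportfits call.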
I de-linked everything in bdsf_1.8.15-py27-rh7-env to make it more portable but then would sometimes get "Failed to transfer files" errors in the condor.log, which seems to be related to the number of files (about 12,000). I could reproduce the problem by transferring a directory with about 10,008 files in it, but could not reproduce it reliably. So there is something else going wrong. Anyway, since these tasks are not planned to run at CHTC I am going to go back to calling this out of Josh's home account. A longer-term solution might be to move it out of Josh's area and into VLASS, or to tar up the portable directory and copy that instead.
Doesn't alter the MS
subprocess.call(['/users/jmarvil/scripts/run_bdsf.py', imagename_base+'iter1b.image.smooth5.fits'],env={'PYTHONPATH':''})
- input: ../data
- input: VIP_iter1b.image.smooth5.fits
- input: ??
- output:
- VIP_iter1b.image.smooth5.cat.ds9.reg
- VIP_iter1b.image.smooth5.cat.fits
- VIP_iter1b.image.smooth5.fits.island.mask
- VIP_iter1b.image.smooth5.fits.pybdsf.log
- VIP_iter1b.image.smooth5.fits.rms
- VIP_iter1b.image.smooth5.fits
Task19
This needs some modification. It calls a script from Josh's homedir and runs bdsf out of /lustre.
...
iter0.psf.tt0 which is set to the variable inext. VIP_iter0 was copied to VIP_iter1 back in Task04
Doesn't alter the MS
edit_pybdsf_islands(catalog_fits_file=imagename_base+'iter1b.image.smooth5.cat.fits')
...
- input: VIP_iter1b.image.smooth5.cat.fits
- input: VIP_iter1b.image.smooth5.cat.edited.fits, VIP_iter0.psf.tt0
- output: secondmask.mask
...
Task20
Doesn't alter the MS
immath(imagename=[imagename_base+'secondmask.mask',imagename_base+'QLcatmask.mask'],expr='IM0+IM1',outfile=imagename_base+'sum_of_masks.mask')
im.mask(image=imagename_base+'sum_of_masks.mask',mask=imagename_base+'combined.mask',threshold=0.5)
- input: secondmask.mask, VIP_QLcatmask.mask
- output: sum_of_masks.mask
- input: sum_of_masks.mask
- output: combined.mask
...
- output: VIP_combined.mask
Task21
Doesn't alter the MS
As far as I can tell at this point, ../data has not changed since Task10 (applycal).
Could this run in parallel with one or more previous run_tclean calls like Task15?
run_tclean( 'iter2', datacolumn='corrected' )
- input: ../data
- output: VIP_iter2.*
...
Task22
Doesn't alter the MS
replace_psf('iter2', 'iter0d')
This is just some python that deletes VIP_iter2.psf.* and copies VIP_iter0d.psf.* to VIP_iter2.psf.*. It would be inefficient to make this task its own DAG step. I suggest it always be in the same DAG step as Task21.
- input: VIP_iter2.psf.*, VIP_iter0d.psf.*
- output: VIP_iter2.psf.*
...
Task23 - Step23
Doesn't alter the MS
At this point I think we are using iter2's image with iter0d's psf.
I ran this on a node with 512GB, asking for 500GB and 8 cores and using mpicasa -n 9, and this task swapped. So I am restarting it with 4 cores and -n 5.
run_tclean( 'iter2', datacolumn='corrected', scales=[0,5,12], nsigma=3.0, niter=20000, cycleniter=3000, mask="QLcatmask.mask", calcres=False, calcpsf=False )
- input: ../data
- input: VIP_iter2.*, VIP_QLcatmask.mask
- output: VIP_iter2.*
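The core counts above follow the pattern of one more mpicasa process than requested cores (8 cores with -n 9, 4 cores with -n 5). If the core count becomes a per-step variable, the command line could be built as in this sketch. The function name, the script name and the casa flags shown are assumptions for illustration and should be checked against how task.sh actually invokes CASA.

```python
import shlex


def mpicasa_command(cores, script):
    """Build an mpicasa command line that starts cores + 1 processes."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    argv = ["mpicasa", "-n", str(cores + 1),
            "casa", "--nogui", "--nologger", "-c", script]
    # Quote each argument so the result is safe to paste into a shell script
    return " ".join(shlex.quote(a) for a in argv)
```

With cores=4 and a hypothetical step23.py this yields a command that starts with "mpicasa -n 5", matching the restart described above.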
...
Task24 - Step24
os.system('rm -rf *.workdirectory')
...
- input: ../data
- input: VIP_iter2.*, VIP_combined.mask
- output: VIP_iter2.*