I list each task here and what is necessary to run them in HTCondor.  I am assuming this will be running without a shared filesystem and also without access to NRAO filesystems.  So any call to /lustre/aoc or /users/<username> or other such things need to be altered to be site agnostic.

Every DAG or task creates .log, .out, and maybe .png files that we want to keep.  Also, .last files like tclean.last are often created; they are not necessary but can be useful for debugging.  I assume that almost all tasks require the Measurement Set (MS), but I question which tasks actually modify it.  A task's datacolumn parameter only states which column it should read from; it does not imply any change to the MS.  Task06, however, sets savemodel='modelcolumn', which actually modifies the MS.

This document is not complete.  I am sure I am missing inputs and perhaps outputs as well.

In this document, "data" when referenced as an input or an output is a directory containing the Measurement Set (e.g. VLASS1.2.sb36491855.eb36574404.58585.53016267361_split.ms/).  The jobs are run in the working directory, so any file references are relative to it.

How do we handle the desire to start a job at a given task?  For example, say a job ran to completion but you want to re-run it after altering something in task17.  It would be unfortunate to have to re-run tasks 1 through 16.  It would be better to start at task17 and run through to the end of task25.  Doing this requires saving the output of each task.  But how?  Incremental or differential backups?  Prolog and epilog scripts?  Something else?
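One possible approach, sketched below, is to archive each task's outputs into a per-task tarball so a later run can restore everything up to some task and resume from there.  The helper names (save_checkpoint, restore_through) and the checkpoints/ directory are hypothetical, not part of the existing scripts:

```python
import os
import tarfile

def save_checkpoint(task_name, paths, checkpoint_dir='checkpoints'):
    """Archive one task's outputs so a later run can resume after that task."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    archive = os.path.join(checkpoint_dir, task_name + '.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        for path in paths:
            tar.add(path)   # paths are relative to the working directory
    return archive

def restore_through(last_task, checkpoint_dir='checkpoints'):
    """Unpack every checkpoint up to and including last_task, in name order."""
    for name in sorted(os.listdir(checkpoint_dir)):
        with tarfile.open(os.path.join(checkpoint_dir, name)) as tar:
            tar.extractall()
        if name == last_task + '.tar.gz':
            break
```

With something like this, re-running from task17 would mean calling restore_through('task16') on the submit host (or in a prolog script) before launching the remaining DAG steps.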

Does run_tclean() need just the .psf directories or does it need more than that?  tclean will need all image types (suffixes) for the named image.  For instance, Task01 makes a set of 'iter0' images and Task04 makes an 'iter1' set of images; Task05 references both.  It would be acceptable to pass all the iter0* and iter1* images, but in practice it only needs the PSFs from both, so something like iter0*.psf* and iter1*.psf* should work.
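If only the PSFs are needed, the transfer list for a step could be built by globbing for the .psf products.  A minimal sketch, assuming the VIP_ image-name prefix used elsewhere in this document (psf_inputs is a hypothetical helper):

```python
import glob

def psf_inputs(prefixes):
    """Collect just the .psf image directories for each image prefix."""
    files = []
    for prefix in prefixes:
        # Matches VIP_iter0.psf.tt0, VIP_iter0.psf.tt1, etc.
        files.extend(sorted(glob.glob('VIP_%s.psf*' % prefix)))
    return files

# e.g. transfer_input_files = ','.join(psf_inputs(['iter0', 'iter1']))
```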

Is it safe to assume I don't need to transfer lockfiles like table.lock even if they have been modified?


How do we transfer input files for each DAG?

To Do

Possible Improvements


Task01 - Step01

Doesn't alter the MS

run_tclean( 'iter0', cfcache=cfcache_nowb, robust=-2.0, uvtaper='3arcsec', calcres=False  )


Task02

This task doesn't parallelize and only takes tens of seconds to run.  Should this be stuck on the end of task01?

Doesn't alter the MS

mask_from_catalog(inext=inext,outext="QLcatmask.mask",catalog_search_size=1.5,catalog_fits_file='../VLASS1Q.fits')


Task03

Doesn't alter the MS

This task could possibly run at the same time as Task01, except that I have combined it with Task04, which requires both Task01 and Task03.

run_tclean( 'iter1', robust=-2.0, uvtaper="3arcsec"  )


Task04

Doesn't alter the MS

replace_psf('iter1','iter0')

This is just some Python that deletes VIP_iter1.psf.[tt0|tt1|tt2] and copies VIP_iter0.psf.[tt0|tt1|tt2] to VIP_iter1.psf.[tt0|tt1|tt2].  It would be inefficient to make this task its own DAG step because the job would have to transfer iter0 and iter1 to the scratch area just to make the copy.  I suggest it always be in the same DAG step as Task03.  It will produce an error because *.workdirectory doesn't exist, but that error is ignorable.
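A minimal sketch of that delete-and-copy logic, assuming the VIP_ prefix and the tt0/tt1/tt2 Taylor-term suffixes (the real replace_psf may differ in detail):

```python
import shutil

def replace_psf(new, old, prefix='VIP_'):
    """Replace the .psf.tt* images of `new` with copies of those from `old`."""
    for term in ('tt0', 'tt1', 'tt2'):
        target = '%s%s.psf.%s' % (prefix, new, term)
        source = '%s%s.psf.%s' % (prefix, old, term)
        # CASA images are directories, so remove and copy whole trees.
        shutil.rmtree(target, ignore_errors=True)
        shutil.copytree(source, target)
```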


Task05 - Step05

Doesn't alter the MS

run_tclean( 'iter1', robust=-2.0, uvtaper="3arcsec", niter=20000, nsigma=5.0, mask="QLcatmask.mask", calcres=False, calcpsf=False  )


Task06 - Step06

Alters the MS

Note that this sets parallel=False which means running mpicasa may be a waste of cores.  Hopefully this step will not be necessary with CASA-6.

run_tclean( 'iter1', calcres=False, calcpsf=False, savemodel='modelcolumn', parallel=False  )

Task07 - Step07

Alters the MS

Tasks 08, 09, 10 and 11 take only minutes to run so could be combined into one DAG step.

flagdata(vis=vis, mode='rflag', datacolumn='residual_data',timedev='tdev.txt',freqdev='fdev.txt',action='calculate')

replace_rflag_levels()

flagdata(vis=vis, mode='rflag', datacolumn='residual_data',timedev='tdev.txt',freqdev='fdev.txt',action='apply',extendflags=False)

flagdata(vis=vis, mode='extend', extendpols=True, growaround=True)


Task08

Alters the MS

statwt(vis=vis,combine='field,scan,state,corr',chanbin=1,timebin='1yr', datacolumn='residual_data' )


Task09

Doesn't alter the MS

gaincal(vis=vis,caltable='g.0',gaintype='T',calmode='p',refant='0',combine='field,spw',minsnr=5)


Task10

Alters the MS

applycal(vis=vis,calwt=False,applymode='calonly',gaintable='g.0',spwmap=18*[2], interp='nearest')
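As an aside, the spwmap argument here is plain Python list repetition: 18*[2] builds a list of eighteen 2s, telling applycal to use spw 2's gain solution for every spectral window (the combine='field,spw' solve in Task09 yields one shared solution).

```python
# spwmap=18*[2] is list repetition: a list of eighteen 2s, one entry
# per spectral window, each pointing at the solution from spw 2.
spwmap = 18 * [2]
assert spwmap == [2] * 18
assert len(spwmap) == 18
```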


Task11

Doesn't alter the MS

run_tclean( 'iter0c', datacolumn='corrected', cfcache=cfcache_nowb, robust=-2.0, uvtaper='3arcsec', calcres=False  )


Task12

Doesn't alter the MS

Could this run in parallel with one or more previous run_tclean calls like Task11?

run_tclean( 'iter0d', datacolumn='corrected', cfcache=cfcache_nowb, calcres=False  )


Task13

Doesn't alter the MS

This task could possibly run at the same time as Task11 and/or Task12 except that I have combined this with Task14 which requires both Task13 and Task11.

run_tclean( 'iter1b', datacolumn='corrected', robust=-2.0, uvtaper="3arcsec" )


Task14

Doesn't alter the MS

replace_psf('iter1b','iter0c')

This is just some Python that deletes VIP_iter1b.psf.* and copies VIP_iter0c.psf.* to VIP_iter1b.psf.*.  It would be inefficient to make this task its own DAG step, so I suggest it always be in the same DAG step as Task13.  It will produce an error because *.workdirectory doesn't exist, but that error is ignorable.

We could remove iter0c because it is never used again.


Task15 - Step15

Doesn't alter the MS

run_tclean( 'iter1b', datacolumn='corrected', robust=-2.0, uvtaper="3arcsec", niter=20000, nsigma=5.0, mask="QLcatmask.mask", calcres=False, calcpsf=False  )


Task16 - Step16

Doesn't alter the MS

imsmooth(imagename=imagename_base+"iter1b.image.tt0", major='5arcsec', minor='5arcsec', pa='0deg', outfile=imagename_base+"iter1b.image.smooth5.tt0")


Task17

Doesn't alter the MS

exportfits(imagename=imagename_base+"iter1b.image.smooth5.tt0", fitsimage=imagename_base+"iter1b.image.smooth5.fits")


Task18

This needs some modification.  It calls a script from Josh's home directory and runs bdsf out of /lustre.  Also, I have been unable to run this task by itself; I get the following errors.  I am going to combine this with tasks 17, 19, 20, and 21, so it isn't an issue right now.

2020-07-29 16:13:07     SEVERE  exportfits::image::tofits (file ../../tools/images/image_cmpt.cc, line 6211)    Exception Reported: Exception: File VIP_iter1b.image.smooth5.fits exists, and the user does not want to remove it..
2020-07-29 16:13:07     SEVERE  exportfits::image::tofits (file ../../tools/images/image_cmpt.cc, line 6211)+   ... thrown by static void casa::ImageFactory::_checkOutfile(const casacore::String&, casacore::Bool) at File: ../../imageanalysis/ImageAnalysis/ImageFactory2.cc, line: 568
2020-07-29 16:13:07     SEVERE  exportfits::::@testpost001:MPIClient    An error occurred running task exportfits.

I de-linked everything in bdsf_1.8.15-py27-rh7-env to make it more portable, but then would sometimes get "Failed to transfer files" errors in the condor.log, which seem to be related to the number of files (about 12,000).  I could reproduce the problem by transferring a directory with about 10,008 files in it, but not repeatedly, so something else is also going wrong.  Anyway, since these tasks are not planned to run at CHTC, I am going to go back to calling this out of Josh's home account.  A longer-term solution might be to move it out of Josh's area and into VLASS, or to tar up the portable directory and copy that instead.
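If the portable directory route is revisited, tarring it before transfer would turn the ~12,000 files into a single input file for HTCondor.  A sketch of that idea (the output name bdsf_env.tar.gz is hypothetical):

```python
import os
import tarfile

def make_transfer_tarball(src_dir, out_name='bdsf_env.tar.gz'):
    """Bundle a many-file directory into one archive so HTCondor
    transfers a single file; the job untars it in the scratch area."""
    with tarfile.open(out_name, 'w:gz') as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir.rstrip('/')))
    return out_name
```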


Doesn't alter the MS

subprocess.call(['/users/jmarvil/scripts/run_bdsf.py', imagename_base+'iter1b.image.smooth5.fits'],env={'PYTHONPATH':''})


Task19

This needs iter0.psf.tt0, which is set in the variable inext.  VIP_iter0 was copied to VIP_iter1 back in Task04.

Doesn't alter the MS

edit_pybdsf_islands(catalog_fits_file=imagename_base+'iter1b.image.smooth5.cat.fits')

mask_from_catalog(inext=inext,outext="secondmask.mask",catalog_fits_file=imagename_base+'iter1b.image.smooth5.cat.edited.fits', catalog_search_size=1.5)


Task20

Doesn't alter the MS

immath(imagename=[imagename_base+'secondmask.mask',imagename_base+'QLcatmask.mask'],expr='IM0+IM1',outfile=imagename_base+'sum_of_masks.mask')

im.mask(image=imagename_base+'sum_of_masks.mask',mask=imagename_base+'combined.mask',threshold=0.5)


Task21

Doesn't alter the MS

As far as I can tell at this point, ../data has not changed since Task10 (applycal).

Could this run in parallel with one or more previous run_tclean calls like Task15?

run_tclean( 'iter2', datacolumn='corrected' )


Task22

Doesn't alter the MS

replace_psf('iter2', 'iter0d')

This is just some Python that deletes VIP_iter2.psf.* and copies VIP_iter0d.psf.* to VIP_iter2.psf.*.  It would be inefficient to make this task its own DAG step, so I suggest it always be in the same DAG step as Task21.


Task23 - Step23

Doesn't alter the MS

At this point I think we are using iter2's image with iter0d's psf.

I ran this on a node with 512GB of memory, asking for 500GB and 8 cores; with mpicasa -n 9 this task swapped.  So I am restarting it with 4 cores and -n 5.

run_tclean( 'iter2', datacolumn='corrected', scales=[0,5,12], nsigma=3.0, niter=20000, cycleniter=3000, mask="QLcatmask.mask", calcres=False, calcpsf=False  )


Task24 - Step24

os.system('rm -rf *.workdirectory')

os.mkdir('iter2_intermediate_results')

os.system('cp -r *iter2* iter2_intermediate_results')

shutil.rmtree(imagename_base+'iter2.mask')

shutil.copytree(imagename_base+'combined.mask',imagename_base+'iter2.mask')

run_tclean( 'iter2', datacolumn='corrected', scales=[0,5,12], nsigma=3.0, niter=20000, cycleniter=3000, mask="", calcres=False, calcpsf=False  )

This does some file cleaning and then runs run_tclean.  Where do we want to do that file cleaning?  In the previous task?  On the submit host?