Checklist for ALMA optimized imaging workflow:

  1. The DA is notified by email when there is a new image dataset to review; the Operations Manager will assign it for review (for now). The data are in /lustre/naasc/web/almapipe/pipeline/naasc-prod/image-qa. Note: the Operations Manager will still need to check the processing area for stalled or failed jobs; the initiator of any such job can be found in spool/<jobid>/metadata.json for contact by Helpdesk email if needed.
  2. Login as almapipe (ssh almapipe@localhost or almapipe@<your desktop>, using the ssh key) and set the appropriate environment (type activate_profile naasc-prod on the command line).
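     For example, a minimal login sketch (hedged; the exact host name and ssh key setup depend on your account):
       ssh almapipe@localhost        # or ssh almapipe@<your desktop>
       activate_profile naasc-prod   # select the naasc-prod environment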
  3. QA: in the pilot, only channel widths and ranges can be set, so QA should focus on those, along with general QA issues. Note that firefox (to browse the weblog), casa and casaviewer are not in the default path for almapipe; they live in /opt/local/bin, so they need to be invoked as e.g. /opt/local/bin/firefox or /opt/local/bin/casaviewer (see the example after this checklist):
    1. Was the restoration of the calibration successful?
    2. Is the continuum subtraction satisfactory (hif_findcont task)? If not, the DA should pick a new continuum range by editing cont.dat and removing findcont from the PPR, then rerun (step 4 below), or consult with an SRDP Scientist (note that the old work-around of making cont.dat no longer works, as the file is not copied to the new working directory).
    3. Are there artifacts in the image suggesting that target flagging is needed? If so, flag the data and rerun the PPR using the pipeline rerun script supplied by SSA (see step 4 below).
    4. Does the cube as made seem likely to have covered the region of interest requested by the PI (in the PPR)? If large parts of the cube are blank, and/or a line is cut off at the edge of the cube, the DA should consult with a scientist.
    5. Is the synthesized beam highly elliptical (axial ratio > 3:1)? If so, check that this is not due to heavy flagging of the target. If flagging is the cause (and not the observing HA and Dec), consult with a scientist as to whether or not the job should fail QA.
    6. Check that the continuum images look sensible; compare the theoretical to the achieved RMS and note in the QA report (step 7) if any are dynamic range limited.
    7. If the user has requested a non-default angular resolution, the imageprecheck task (stage 3 in the weblog) will indicate the approximate requested beam size, and the last two lines of the table in the stage 3 weblog will report the taper used (if any) and the expected beam size of the product (the task itself may show a fail until the score heuristics are updated; this can be ignored). The achieved beam size in the final product should be checked against the values in imageprecheck (it need not be exactly the same, but should be within ~20%).
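     Example of invoking the tools by full path from the almapipe account (the weblog and image locations below are placeholders):
       /opt/local/bin/firefox <weblog html directory>/index.html &
       /opt/local/bin/casaviewer <image file> &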
  4. To rerun a job, modify the necessary files in the working directory (e.g. PPR.xml, cont.dat and/or flagtargetstemplate.txt) and run almaReimageCube -r <job id> <UID of job directory>, e.g. almaReimageCube -r 320578787 uid___A002_Xe29133_X3610. Note that the UID of the job directory is not the MOUS uid, but rather that of one of the ASDMs included; it can be found by going to image_qa/<jobID>/ and copying the name of the UID directory there.
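     A sketch of the rerun sequence for the example job above (hedged; paths assume the naasc-prod area from step 1):
       # first edit PPR.xml, cont.dat and/or flagtargetstemplate.txt in the job's working directory
       ls /lustre/naasc/web/almapipe/pipeline/naasc-prod/image-qa/320578787/   # the uid___... directory listed here is the second argument
       almaReimageCube -r 320578787 uid___A002_Xe29133_X3610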
  5. If the job is still not passing QA, please contact a scientist. QA-fail jobs are not archived, but in most cases we will email the user via the Helpdesk stating the reasons for the failure and making helpful suggestions, or suggesting an ALMA QA3 report if there are problems with the data that were missed by ALMA QA2.
  6. Update the Google Spreadsheet (https://docs.google.com/spreadsheets/d/1USJ5rQRNbR3ORj80-UuEuqGnSYm4l6h_A6ac5s1FyJ0/edit#gid=0) with the QA state. To find the user who requested the reprocessing, go to the spool directory /lustre/naasc/web/almapipe/pipeline/naasc-prod/spool/<job id> and look for the userEmail field in the metadata.json file.
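     For example, a quick way to pull the requester's address (field and path as given above):
       grep userEmail /lustre/naasc/web/almapipe/pipeline/naasc-prod/spool/<job id>/metadata.json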
  7. Login as almapipe (if not already done above). Optional: especially if there were issues that needed a rerun, write a short QA report suitable for transmission to the user by making a file called qa_notes.html in the weblog html directory (a minimal example follows the list below):
    • Note any target flagging (on the level of antennas/spw).
    • Note any change to the continuum range(s).
    • Add any other comments (e.g. if the image is dynamic range limited and self calibration is recommended).
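     A minimal sketch of what qa_notes.html might contain (the specific notes are hypothetical examples; tailor them to the job):
       <h3>QA notes</h3>
       <ul>
         <li>Flagged antenna DA41 in spw 25 on the target (example).</li>
         <li>Continuum range for spw 25 changed in cont.dat (example).</li>
         <li>The continuum image is dynamic range limited; self-calibration is recommended (example).</li>
       </ul>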
  8. Initiate archive ingest of image products that passed QA using the audiPass script: audiPass <job id> -E <your email>, e.g. audiPass 320755390 -E mlacy@nrao.edu.
  9. Reporting issues: if a software problem is encountered during this process, please alert the Operations Manager, who will file a JIRA ticket with SSA if it is not already a known issue.

Comments:

  1. Known issues in 3.8.2:

    1) Imaging will fail on Cycle 5 datasets that can only be restored with CASA 5.1, as the imaging has to use CASA 5.6. The job will send a failure message: the pipeline will appear to run to completion, but the software checks the logs for the SEVERE error that the flagging task reports when it finds a row mis-match in this situation, and terminates the job. Grepping the casa log (in the spool/<jobid>/<uid>/working directory) for SEVERE will find these cases. This mostly seems to affect 12m data, but some 7m data breaks too. Partially fixed through the "Kludge" workflow, though the imaging pipeline recipe still needs work (SSA-5519, SSA-6285, PIPE-579).
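
      A quick check for this case (hedged; the casa log file name pattern may vary by run):
        grep SEVERE /lustre/naasc/web/almapipe/pipeline/naasc-prod/spool/<jobid>/<uid>/working/casa-*.log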

    2) Large jobs: jobs can fail if they exceed the memory limit; the error appears in casa-alma-pipeline.sh.err.txt in the spool/<jobid>/<uid>/working directory (SRDP-543). Even if a large job completes, it may not ingest into the archive (SSA-6245, SSA-6282).
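
      To inspect the error file (a hedged example, assuming the standard spool layout):
        less /lustre/naasc/web/almapipe/pipeline/naasc-prod/spool/<jobid>/<uid>/working/casa-alma-pipeline.sh.err.txt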

    3) Ingest will fail if there is a prior cube for the same dataset still in the staging area (SSA-6142, SSA-6476).

    4) Downloads to local disk in Charlottesville only work to directories on /lustre/naasc, and download directories must be world writable and executable.
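
      One way to open up a download directory (a hedged example; adjust to local policy):
        chmod o+wx /lustre/naasc/<your download directory>   # world writable and executable, as required above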

    5) Under high load, ASDM binaries can sometimes fail to download. This gives the same error code (2) and symptoms as issue (4), but can happen to any Cycle's data. To establish this as the cause, in the spool/<jobid>/<uid> directory type ls rawdata/uid*/ASDMBinary/*.missing - if it comes back with a .missing file then this is the problem. Usually a rerun will fix it.

    6) It is possible for users to specify frequency ranges too close to the edge of the spw that cause the cube imaging to fail, even though the inputs pass validation on the front end. In these cases, simply rerun with a higher starting frequency and/or fewer channels specified in the PPR so it fits comfortably within the spw.


  2. It is sometimes necessary to kill a "rogue" job given the job ID from the AAT system; the following seems to work, though it is a bit inelegant and there may be better ways (the steps are consolidated in the sketch after this list):

    1) login to cvpost-master and ssh to almapipe@localhost

    2) type: qstat | grep almapipe to find the running batch jobs

    3) type: qstat -f | grep almapipe -B 3 to get a fuller output, and relate the job number from qstat to the jobid in the AAT system, so you can identify the correct job number to kill.

    4) type: nodescheduler -t <job number from qstat> (works for both batch and interactive jobs).

    5) Remove the directories corresponding to the jobid in the AAT system in vaprod/spool and vaprod/image-qa.
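
    Consolidated, the sequence looks roughly like this (a hedged sketch; <vaprod> stands for the production area, and the paths should be double-checked before deleting anything):
      qstat | grep almapipe                      # list running batch jobs
      qstat -f | grep -B 3 almapipe              # fuller output, to match the qstat job number to the AAT jobid
      nodescheduler -t <job number from qstat>   # kill the batch or interactive job
      rm -rf <vaprod>/spool/<AAT jobid> <vaprod>/image-qa/<AAT jobid>   # remove the AAT job directories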

  3. Every time the SSA team redeploys the AAT software into production, any running jobs will fail to complete. If SSA gives enough advance notice, running job numbers should be noted before a redeploy and the users (identifiable from the "userEmail" field in the metadata.json file in the spool/<jobid> directory) notified via the Helpdesk that they will need to resubmit their jobs.


  4. Another bug was introduced in 3.9.3 by a combination of pipeline and/or ALMA DB changes; it requires "Kludged" restores of Cycle 6 (and possibly other cycles') data with CASA 5.4.2 (SSA-6883).

  5. We have now had a few failures where the calibration products fail to be downloaded from the archive for early (Cycle 4) data. These look like they date from before the calibration tar.gz files were ingested separately into the ALMA archive, though the AAT still shows the calibrations as present.

  6. Workaround for duplicate continuum products: delete the combined continuum and individual mfs images in the products directory; ingest should then work OK.

  7. Workaround for session mis-matches: in the cal manifest file in the rawdata directory, change the session name from session_2 to session_1 and rerun.
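
     For example (hedged; <cal manifest file> is whatever the manifest is called in that rawdata directory, and a backup copy is kept):
       sed -i.bak 's/session_2/session_1/g' rawdata/<cal manifest file>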