The DSOC cluster will be offline October 28 and 29 for Lustre improvements and to switch the cluster from RHEL6 to RHEL7. This document enumerates the steps we will take beforehand to prepare for the shutdown, and the steps we will take afterwards to confirm things are working properly and restore access. Note that these upgrades won't touch the control systems for the AAT/PPI or VLASS, but they will touch the environment on which the workflows for both execute.

1 Week Before the Shutdown (October 21)

  • Stephan Witz to work with CIS to make sure the DNS TTLs for archive.nrao.edu and archive-new.nrao.edu are low
  • Stephan Witz (SSA) to put MOTD banners up on the legacy archive announcing the downtime; will work with John Tobin on the messaging
  • Stephan Witz (SSA) to work with CIS to replace the message on http://offline.nrao.edu; will work with John Tobin on the messaging
  • Stephan Witz (SSA) to modify the vlass.test.scripts to change /usr/local/bin/python2.7 to /opt/local/bin/python2.7 (this is done, but the scripts can't execute until CIS adds a dependency to /opt/local/bin/python2.7, tracked in helpdesk ticket 116753)
  • Drew Medlin (Operations) to disable the CIPL auto-start CAPO setting
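
The interpreter change for the vlass.test scripts could be sketched as below. This is only a minimal sketch, assuming the path appears on each script's shebang line (the plan does not say where in the scripts the path lives). The function name rewrite_interpreter is hypothetical.

```python
OLD_INTERPRETER = "/usr/local/bin/python2.7"
NEW_INTERPRETER = "/opt/local/bin/python2.7"


def rewrite_interpreter(text, old=OLD_INTERPRETER, new=NEW_INTERPRETER):
    """Return text with `old` replaced by `new` on the shebang line only.

    Assumption: the interpreter path is on the first line, after '#!'.
    Text without a matching shebang is returned unchanged.
    """
    lines = text.splitlines(True)  # keep the original line endings
    if lines and lines[0].startswith("#!") and old in lines[0]:
        lines[0] = lines[0].replace(old, new, 1)
    return "".join(lines)
```

Restricting the change to the shebang line leaves any other mention of the old path (in comments or strings) untouched, which may or may not be what is wanted for a given script.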

Morning of the Shutdown (October 28)

  • CIS to change external DNS of archive.nrao.edu and archive-new.nrao.edu to point to offline.nrao.edu
  • SSA to change the CASA CAPO properties from the RHEL6 paths to the RHEL7 paths
  • Operations to change the CASA symlink CIPL uses to point to the RHEL7 version
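
Repointing a symlink can be done atomically, so that a reader such as CIPL never sees a missing link. The following is a minimal Python 3 sketch, assuming a POSIX filesystem. repoint_symlink is a hypothetical helper, and the paths in the usage note are placeholders, not the actual CIPL paths.

```python
import os


def repoint_symlink(link_path, new_target):
    """Make link_path a symlink to new_target without a window where it is missing."""
    # Create the new link under a temporary name next to the real one...
    tmp_path = link_path + ".tmp-%d" % os.getpid()
    os.symlink(new_target, tmp_path)
    try:
        # ...then rename it over the old link; rename is atomic on POSIX.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)  # clean up the temporary link on failure
        raise
```

Usage would look like repoint_symlink("/path/to/pipeline/current", "/path/to/rhel7-casa"), with both arguments replaced by the real locations.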

Morning After the Shutdown (October 30)

  • Once SSA gives them the all-clear, stakeholders (John Tobin, Mark Lacy) test critical user-facing functions of the AAT/PPI under RHEL7
  • User-facing things that need to be tested on the AAT/PPI production system (https://archive-new.nrao.edu):
    • Downloads of VLA EBs: SDM-only (ML - job 316460859 OK), SDM (DM - job 316336595 OK), basic MS (DM - test OK, but tar'd even when I ask for untar'd!), CMS - ops tests on SRDP-348 (pass) and SRDP-356 (pass); DM: FAIL (no CASA 5.6.2, SSA-5935).
    • Downloads of VLA calibrations - ML: job 316444531 started for non-proprietary data and worked, so calling this good. X - ML: failed to download my own proprietary data (but did not try before the upgrade). JT: I have not had trouble yet. DM: no issue for me, either. CASA 5.6.2 issue: SSA-5935
    • Downloads of VLASS images (ML- Job 316404900)
    • Downloads of VLBA UVFits files (SRDP-415)
    • AUDI imaging (JT-309617329) - produces a cube and moves to the image-qa; not sure about beyond that
    • ALMA restored MS download (JT - 309621798)
    • ASDM download (JT - 309629153)
    • ALMA basic MS download (JT - 309625789) (ML - my job 316420925 failed)
    • Searches work
    • Proprietary periods respected for VLA and ALMA - ML: I cannot access my proprietary VLA dataset 19A-306 (authentication through NRAO); SSA-5934 submitted. SW: I can't access 19A-306 via my test account, which is proper
  • User-facing things that need to be tested on the legacy archive production system (https://archive.nrao.edu):
    • Downloads of VLA EBs: basic MS (DM - pass for VLA SDM FAIL. Nothing written to destination path, but test ping successful. Tried with the two latest 19A-020 EBs).
  • Stakeholders (Mark Lacy, Drew Medlin) test critical operations-facing functions of the AAT/PPI under RHEL7
  • VNC functionality to cluster nodes (DM - currently failing with some people's default xstartup, workaround appears to exist, can move forward.)
  • Operations-facing things that need to be tested:
    • EB ingestion (SW: I'd note the system has been up for a day now, so we can test anything that should have come in last night)
    • CIPL being triggered (manual CIPL start works, so does the auto-trigger - DM Update: I set the workflow to RUN to trigger something, will remove it if it starts)
    • CIPL working (DM pipeline manually started and works to completion, including moving of files)
    • Calibration ingestion (QAPass) (DM: ingested test run of 19A-020 from 2019-10-27; ingested cal file shows, can be downloaded, and is the same size as the original, with qa_notes.html indicating the results shouldn't be used; will replace with real run later)
  • GO/NO-GO decision, 3pm MDT October 30th (Drew Medlin, John Tobin, Mark Lacy, Stephan Witz, Amy Kimball):
    • If GO: Stephan Witz to work with CIS to undo the DNS change and MOTD banners, and re-enable CIPL triggers. SW: Decision is GO; in the morning we'll walk back the DNS changes, undo the banners and send the all-clear.
    • If NO-GO: SSA to iterate with stakeholders until the result is GO. Note that this may mean putting out a bugfix release of AAT/PPI 3.6 and doing the same for VLASS
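
The calibration ingestion test above compares the downloaded cal file with the original by size. A slightly stronger comparison also checks a checksum. The following is a minimal sketch of that idea, not the procedure actually used in the test. The helper names are hypothetical, and SHA-256 is an arbitrary choice.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_content(original, downloaded):
    """True if both files have the same size and the same SHA-256 digest."""
    # The size check is cheap, so do it first and skip hashing on a mismatch.
    if os.path.getsize(original) != os.path.getsize(downloaded):
        return False
    return sha256_of(original) == sha256_of(downloaded)
```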

Stakeholder tests/actions

VLASS actions

    • Update CASA versions in production Manager for active product types (QL imaging, SE calibration) to CASA RHEL7 versions

VLASS tests

    • Run QL calibration job (test epoch on VLASS production manager) AEK: partial SUCCESS
      • 1st try FAIL (workflows not running): Test epoch product #34132 Test.ql.cal.vlassRF-sqdeg-3C286_rise: VLASS Manager "job" successfully created and submitted (execution #59093), but no cluster jobs running. First cluster job should create working directory.
      • 2nd try FAIL (workflow had been built around RHEL6): same product but new "job" and execution (#59094). Cluster job 707.nmpost-serv-1.aoc.nrao.edu-PrepareWorkingDirectoryJob.vlass.w7.ab510b59-c86e-425b-9a40-ac3985f21b49 successful, but status in Manager of downloadDataFormat, jobid, queue, etc. are all "undefined". Next step in execution is cluster job 710.nmpost-serv-1.aoc.nrao.edu-get-files.sh.vlass.w7.8d88844c-d361-4bef-a5b5-82c1f27c453d, which appears as "started" in VLASS Manager but doesn't appear on the cluster and seems to be hanging
      • 3rd try PARTIAL SUCCESS: execution completed correctly, but status of cluster job and vlass-job execution did not update in Manager
    • Run QL imaging job (production manager) AEK: partial SUCCESS: job completed successfully, but not all steps tracked correctly in Manager
      • Launched successfully: execution 59096, VLASS1.2_T17t10.J064711+243000
    • A&A / R&A QL imaging jobs  (the original attempts to A&A and R&A later completed after workflows turned on)
      • AEK: attempted to A&A QL image (execution 59091) SUCCESS; execution/job status changed in Manager but nothing happened on disk in spool or cache
      • AEK: attempted to R&A QL image (execution 59092) SUCCESS; execution/job status changed in Manager but nothing happened on disk in spool or cache
    • Create scheduling block and products (test epoch)



22 Comments

  1. Monday, Oct. 21, I'll turn off auto-starts of CIPL first thing when I'm in. At some point, I'll also switch the production and current symlinks for /home/casa/packages/pipeline/ to the new CASA installation we'll use for CIPL/SRDP.

  2. G'head and edit the pages, it's all good.

  3. CIPL CASA versions updated to point to casa-pipeline-release-5.6.2-2.el7 (current and production).

  4. CIPL tar file download looks OK via direct download (316449565), qa_notes.html included when qaPass was run, files moved as expected. Restore test fails as I cannot select CASA 5.6.2 for the restore.

  5. Legacy archive fails to write even the *.loading directory for two test EBs.


    1. I have a lead on the legacy archive VLA problem. Meanwhile, give VLBA on the legacy archive a shot?

      1. Now I get a loading data as expected. However, Meri reports VLBA data failed as the e2e area for the proprietary data wasn't world writable.

        1. I think I have a lead on the VLBA one as well.

        2. Try the VLBA one again, please.

          1. This completed OK.
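
The VLBA failure in this thread was reported as a directory that wasn't world writable. A check of that kind could be sketched as below. This is a minimal sketch only: is_world_writable is a hypothetical helper, and it looks at the permission bits on the path itself, not at ACLs or parent directories.

```python
import os
import stat


def is_world_writable(path):
    """True if the 'other' write permission bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)
```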

  6. Amy Kimball I suspect vlass test is trying to hit a queue that doesn't exist any more, hold on.


    1. Stephan Witz to be clear this is the test epoch but in the production manager (as opposed to the test manager)

      1. Production workflows were not running, please try again in a few.

    2. Nope, I'm nuts. Must be something else.

    3. Stephan Witz but speaking of queues that don't exist anymore, VLASS Test Manager was previously sending to rhel7 queue which shouldn't exist anymore (right?) so heads-up that I'll be testing creation of jobs from there soon too

  7. Here is one thing:


    Traceback (most recent call last):
      File "/users/vlapipe/workflows/vlass.w7/bin/notify.py", line 3, in <module>
        import pika
    ImportError: No module named pika
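
A pre-flight check along the following lines could surface a missing module such as pika before a workflow runs. This is a minimal sketch: missing_modules is a hypothetical helper, and the list of modules to check is an assumption (only pika is known from the traceback above). It uses importlib.import_module, which is available in both Python 2.7 and Python 3, and it does import each module it checks.

```python
import importlib


def missing_modules(names):
    """Return the entries of `names` that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Running it under the same interpreter the workflows use, for example missing_modules(["pika"]), would return an empty list once the dependency is installed.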

  8. Archive-new VLA MS file is returned, but is always given as FSID.ms.tgz despite deliberately unchecking the return as a tar file. SSA-5936

  9. Meri's VLBA ingestion is failing for the production/legacy archive (archive.nrao.edu).

  10. Drew Medlin can we check off 'Downloads of VLA EBs: basic MS'?

    1. I think so, the important part is that a user gets the data, and they can untar it.

  11. Drew Medlin I guess I'd note the CIPL trigger runs on a RHEL6 machine and shouldn't be affected by the cluster work.

  12. I launched this VLASS imaging job as a test: production manager, job ID 42321, execution ID 59097.

    The first cluster job was to prepare the working directory and this was successful, showing green in the Manager at the QA tab:  822.nmpost-serv-1.aoc.nrao.edu-PrepareWorkingDirectoryJob.vlass.w7.bd7b20ff-9c05-44c8-bdbd-eac677ae8b01

    The second cluster job was to create the necessary files and this was successful, showing green in the Manager at the QA tab: 823.nmpost-serv-1.aoc.nrao.edu-get-files.sh.vlass.w7.48352215-4501-4c04-811e-82345096cb02

    The third cluster job is to run the pipeline, and this has started, showing blue in the Manager at the QA tab:  824.nmpost-serv-1.aoc.nrao.edu-casa-imaging-pipeline.sh.vlass.w7.7ecc6846-ecf4-438c-b1b3-85e555fb5d2c

    What's odd is that once the third cluster job started, the first cluster job above (PrepareWorkingDirectoryJob) reverted in the Manager to showing again as "Started" and blue instead of "Success" and green.