The DSOC cluster will be offline October 28 and 29 for Lustre improvements and to switch the cluster from RHEL6 to RHEL7. This document enumerates the steps we will take beforehand to prepare for the shutdown, and the steps we will take afterwards to confirm things are working properly and restore access. Note that these upgrades won't touch the control systems for the AAT/PPI or VLASS, but they will touch the environment the workflows for both execute on.
1 Week Before the Shutdown (October 21)
- Stephan Witz to work with CIS to make sure the DNS TTLs for archive.nrao.edu and archive-new.nrao.edu are low
- Stephan Witz (SSA) to put MOTD banners up on the legacy archive announcing the downtime; will work with John Tobin on the messaging
- Stephan Witz (SSA) to work with CIS to replace the message on http://offline.nrao.edu; will work with John Tobin on the messaging
- Stephan Witz (SSA) to modify the vlass.test.scripts to change /usr/local/bin/python2.7 to /opt/local/bin/python2.7 (this is done, but the scripts can't execute until CIS adds a dependency to /opt/local/bin/python2.7, tracked in helpdesk ticket 116753)
- Drew Medlin (Operations) to disable the CIPL auto-start CAPO setting
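The interpreter-path change in the vlass.test.scripts item above could be applied with something like the following. This is only a sketch: the function name repoint_interpreter and the idea of walking a directory tree are assumptions made for illustration, not a description of how the change was actually made. Only the two interpreter paths come from this page.

```python
from pathlib import Path

# The two interpreter paths named in the checklist.
OLD_INTERPRETER = "/usr/local/bin/python2.7"
NEW_INTERPRETER = "/opt/local/bin/python2.7"


def repoint_interpreter(script_dir, old=OLD_INTERPRETER, new=NEW_INTERPRETER):
    """Replace `old` with `new` in every text file under script_dir.

    Hypothetical helper: returns the sorted list of files it changed.
    """
    changed = []
    for path in sorted(Path(script_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        if old not in text:
            continue
        # Writing in place keeps the existing file mode (e.g. the execute bit).
        path.write_text(text.replace(old, new), encoding="utf-8")
        changed.append(path)
    return changed
```

Running it a second time over the same directory should change nothing, which makes it a cheap way to confirm no references to the old path remain.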
Morning of the Shutdown (October 28)
- CIS to change external DNS of archive.nrao.edu and archive-new.nrao.edu to point to offline.nrao.edu
- SSA to change the casa CAPO properties from the RHEL6 paths to RHEL7 paths
- Operations to change the CASA symlink CIPL uses to point to RHEL7 version
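For the CASA symlink step, one way to switch the link without a moment where it does not exist is to create the new link under a temporary name and rename it over the old one. The sketch below assumes a POSIX filesystem; the function name repoint_symlink is hypothetical, and this page does not give the actual link and target paths CIPL uses.

```python
import os


def repoint_symlink(link_path, new_target):
    """Atomically make link_path a symlink pointing at new_target.

    Hypothetical helper. The temporary link is created in the same
    directory as link_path so the final rename stays on one filesystem.
    """
    tmp_path = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(new_target, tmp_path)
    try:
        # rename(2) replaces the old link itself and is atomic on POSIX.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)  # do not leave the temporary link behind
        raise
```

Because the rename either fully succeeds or leaves the old link in place, a pipeline that resolves the link mid-switch sees either the RHEL6 or the RHEL7 target, never a missing path.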
Morning After the Shutdown (October 30)
- Once SSA gives them the all-clear, stakeholders (John Tobin, Mark Lacy) test critical user-facing functions of the AAT/PPI under RHEL7
- User-facing things that need to be tested on the AAT/PPI production system (https://archive-new.nrao.edu):
- Downloads of VLA EBs: SDM-only (ML - job 316460859 OK), SDM (DM - job 316336595 OK), basic MS (DM - test OK, but tar'd even when I ask for untar'd!), CMS - ops tests on SRDP-348 (pass) (DM FAIL, no CASA 5.6.2, SSA-5935), SRDP-356 (pass).
- Downloads of VLA calibrations. ML - job 316444531 started for non-prop data and worked, so calling this good. ML - failed to download my own proprietary data (but did not try before the upgrade). JT - I have not had trouble yet. DM - no issue for me, either. CASA 5.6.2 issue - SSA-5935
- Downloads of VLASS images (ML- Job 316404900)
- Downloads of VLBA UVFits files (SRDP-415)
- AUDI imaging (JT-309617329) - produces a cube and moves to the image-qa; not sure about beyond that
- ALMA restored MS download (JT - 309621798)
- ASDM Download (JT - 309629153)
- ALMA basic MS download (JT - 309625789) (ML - my job 316420925 failed)
- Searches work
- Proprietary periods respected for VLA and ALMA. ML - I cannot access my proprietary VLA dataset 19A-306 (authentication through NRAO); SSA-5934 submitted. SW - I can't access 19A-306 via my test account, which is proper
- User-facing things that need to be tested on the legacy archive production system (https://archive.nrao.edu):
- Downloads of VLA EBs: basic MS (DM - pass for VLA SDM. FAIL: nothing written to destination path but test ping successful. Tried with two latest 19A-020 EBs).
- Stakeholders (Mark Lacy, Drew Medlin) test critical operations-facing functions of the AAT/PPI under RHEL7
- VNC functionality to cluster nodes (DM - currently failing with some people's default xstartup, workaround appears to exist, can move forward.)
- Operations-facing things that need to be tested:
- EB ingestion (SW: I'd note the system has been up for a day now, so we can test anything that should have come in last night)
- CIPL being triggered (DM - manual CIPL start works, so does the auto-trigger. Update: I set the workflow to RUN to trigger something, will remove it if it starts)
- CIPL working (DM - pipeline manually started and works to completion, including moving of files)
- calibration ingestion (QAPass) (DM - ingested test run of 19A-020 from 2019-10-27; ingested cal file shows, can be downloaded, and is the same size as the original, with qa_notes.html indicating the results shouldn't be used; will replace with real run later)
- GO/NO-GO decision, 3pm MDT October 30th (Drew Medlin, John Tobin, Mark Lacy, Stephan Witz, Amy Kimball):
- If GO: Stephan Witz to work with CIS to undo the DNS change and MOTD banners, re-enable CIPL triggers
- If NO-GO: SSA to iterate with stakeholders until result is GO. Note that this may mean putting out a bugfix release of AAT/PPI 3.6 and doing the same for VLASS
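The calibration-ingestion check above compares the downloaded cal file with the original by size. A slightly stronger check also compares a checksum. The following is a minimal sketch of that idea, not the procedure actually used in these tests; the function names sha256_of and same_file_content are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_content(original, downloaded):
    """True if both files have the same size and the same SHA-256 digest."""
    # Cheap size comparison first; hash only when the sizes agree.
    if os.path.getsize(original) != os.path.getsize(downloaded):
        return False
    return sha256_of(original) == sha256_of(downloaded)
```

Reading in chunks keeps memory use flat even for large calibration tarballs.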
Stakeholder tests
VLASS
- Run QL calibration job (test epoch on VLASS production manager) AEK: partial SUCCESS
- 1st try FAIL (workflows not running): Test epoch product #34132 Test.ql.cal.vlassRF-sqdeg-3C286_rise: VLASS Manager "job" successfully created and submitted (execution # 59093), but no cluster jobs running. First cluster job should create working directory.
- 2nd try FAIL (workflow had been built around RHEL6): same product but new "job" and execution (#59094). Cluster job 707.nmpost-serv-1.aoc.nrao.edu-PrepareWorkingDirectoryJob.vlass.w7.ab510b59-c86e-425b-9a40-ac3985f21b49 successful, but the status in Manager of downloadDataFormat, jobid, queue, etc. are all "undefined". Next step in execution is cluster job 710.nmpost-serv-1.aoc.nrao.edu-get-files.sh.vlass.w7.8d88844c-d361-4bef-a5b5-82c1f27c453d, which appears as "started" in VLASS Manager but doesn't appear on the cluster and seems to be hanging
- 3rd try SUCCESS SO FAR (jobs launched successfully; calibration step will take several hours)
- Run QL imaging job (production manager)
- Launched successfully: execution 59096, VLASS1.2_T17t10.J064711+243000
- A&A / R&A QL imaging jobs (the original attempts to A&A and R&A later completed after workflows turned on)
- AEK: attempted to A&A QL image (execution 59091) SUCCESS; execution/job status changed in Manager but nothing happened on disk in spool or cache
- AEK: attempted to R&A QL image (execution 59092) SUCCESS; execution/job status changed in Manager but nothing happened on disk in spool or cache
- Create scheduling block and products (test epoch)