The DSOC cluster will be offline October 28 and 29 for Lustre improvements and to switch the cluster from RHEL6 to RHEL7. This document enumerates the steps we will take beforehand to prepare for the shutdown, and the steps we will take afterwards to confirm things are working properly and to restore access. Note that these upgrades won't touch the control systems for the AAT/PPI or VLASS, but they will touch the environment in which the workflows for both execute.

1 Week Before the Shutdown (October 21)

  • Stephan Witz to work with CIS to make sure the DNS TTLs for archive.nrao.edu and archive-new.nrao.edu are low
  • Stephan Witz (SSA) to put MOTD banners up on the legacy archive announcing the downtime; will work with John Tobin on the messaging
  • Stephan Witz (SSA) to work with CIS to replace the message on http://offline.nrao.edu; will work with John Tobin on the messaging
  • Stephan Witz (SSA) to modify the vlass.test.scripts to change /usr/local/bin/python2.7 to /opt/local/bin/python2.7 (this is done, but the scripts can't execute until CIS adds a dependency to /opt/local/bin/python2.7, tracked in helpdesk ticket 116753)
  • Drew Medlin (Operations) to disable the CIPL auto-start CAPO setting
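
The interpreter change for the vlass.test.scripts listed above can be sketched as follows. This is only a minimal sketch, assuming the old path appears literally in each script (for example in the shebang line); the function name rewrite_interpreter is hypothetical and is not part of the actual scripts.

```python
# Paths taken from the checklist item above.
OLD_INTERPRETER = "/usr/local/bin/python2.7"
NEW_INTERPRETER = "/opt/local/bin/python2.7"


def rewrite_interpreter(text, old=OLD_INTERPRETER, new=NEW_INTERPRETER):
    """Return (new_text, count), replacing every literal occurrence of `old`.

    `count` is the number of replacements, so a caller can flag scripts
    that did not reference the old interpreter at all.
    """
    count = text.count(old)
    return text.replace(old, new), count
```

Returning the replacement count alongside the text makes it easy to report which scripts were actually changed before anything is written back to disk.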

Morning of the Shutdown (October 28)

  • CIS to change external DNS of archive.nrao.edu and archive-new.nrao.edu to point to offline.nrao.edu
  • SSA to change the CASA CAPO properties from the RHEL6 paths to the RHEL7 paths
  • Operations to change the CASA symlink CIPL uses to point to the RHEL7 version
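
One way to repoint a symlink such as the CASA link mentioned above, without a window in which the link is missing, is to create the new link under a temporary name and rename it over the old one. The sketch below assumes a POSIX filesystem; the helper name repoint_symlink and any paths passed to it are hypothetical, since the actual link location is not given here.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Atomically make `link_path` a symlink pointing at `new_target`."""
    link_dir = os.path.dirname(os.path.abspath(link_path))
    # Build the new link inside a scratch directory on the same filesystem,
    # then rename it over the existing link in a single step.
    tmp_dir = tempfile.mkdtemp(dir=link_dir)
    tmp_link = os.path.join(tmp_dir, "link")
    try:
        os.symlink(new_target, tmp_link)
        os.replace(tmp_link, link_path)
    finally:
        # Clean up the scratch link (if the rename failed) and directory.
        if os.path.lexists(tmp_link):
            os.unlink(tmp_link)
        os.rmdir(tmp_dir)
```

Because the rename replaces the old link in one operation, a workflow that resolves the link during the switch sees either the old target or the new one, never a missing path.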

Morning After the Shutdown (October 30)

  • Once SSA gives them the all-clear, stakeholders (John Tobin, Mark Lacy) test critical user-facing functions of the AAT/PPI under RHEL7
  • User-facing things that need to be tested on the AAT/PPI production system (https://archive-new.nrao.edu):
    • Downloads of VLA EBs: SDM-only (ML - job 316460859 OK), SDM (DM - job 316336595 OK), basic MS, CMS - ops tests on SRDP-348 (pass) (DM: FAIL, no CASA 5.6.2, SSA-5935), SRDP-356 (pass).
    • Downloads of VLA calibrations - ML: job 316444531 started for non-proprietary data and worked, so calling this good. X - ML: failed to download my own proprietary data (but did not try before the upgrade). JT: I have not had trouble yet. DM: no issue for me, either. CASA 5.6.2 issue: SSA-5935
    • Downloads of VLASS images (ML- Job 316404900)
    • Downloads of VLBA UVFits files (SRDP-415)
    • AUDI imaging (JT-309617329)
    • ALMA restored MS download (JT-309621798)
    • ASDM Download (JT-309629153)
    • ALMA basic MS download (JT-309625789) (ML - my job 316420925 failed)
    • Searches work
    • Proprietary periods respected for VLA and ALMA - ML: I cannot access my proprietary VLA dataset 19A-306 (authentication through NRAO); SSA-5934 submitted. SW: I can't access 19A-306 via my test account, which is proper
  • User-facing things that need to be tested on the legacy archive production system (https://archive.nrao.edu):
    • Downloads of VLA EBs: basic MS (DM - FAIL. nothing written to destination path but test ping successful. Tried with two latest 19A-020 EBs).
  • Stakeholders (Mark Lacy, Drew Medlin) test critical operations-facing functions of the AAT/PPI under RHEL7
  • Operations-facing things that need to be tested:
    • EB ingestion
    • CIPL being triggered (manual CIPL start works, DM. Update: I set the workflow to RUN to trigger something, will remove it if it starts)
    • CIPL working (DM pipeline manually started and works to completion, including moving of files)
    • calibration ingestion (QAPass) (DM: ingested test run of 19A-020 from 2019-10-27; the ingested cal file shows, can be downloaded, and is the same size as the original. It was ingested with qa_notes.html indicating the results shouldn't be used; will replace with a real run later)
  • GO/NO-GO decision, 3pm MDT October 30th (Drew Medlin, John Tobin, Mark Lacy, Stephan Witz):
    • If GO: Stephan Witz to work with CIS to undo the DNS change and remove the MOTD banners, and to re-enable CIPL triggers
    • If NO-GO: SSA to iterate with stakeholders until the result is GO; note that this may mean putting out a bugfix release of AAT/PPI 3.6 and doing the same for VLASS
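
The calibration ingestion test above compared the size of the downloaded cal file with the original. A slightly stronger version of that check also compares a checksum. The sketch below is an illustration only: the SHA-256 comparison is an addition that the test as recorded did not perform, and the function names are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_content(original, downloaded):
    """True if both files have the same size and the same SHA-256 digest."""
    # Cheap size comparison first; only hash when the sizes agree.
    if os.path.getsize(original) != os.path.getsize(downloaded):
        return False
    return sha256_of(original) == sha256_of(downloaded)
```

Checking the size first keeps the common mismatch case cheap, which matters for large calibration products.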

Stakeholder tests

VLASS

    • Run QL calibration job (test epoch)
    • Run QL imaging job (as part of reprocessing)
    • A&A / R&A QL imaging job (as part of reprocessing)
    • Create scheduling block and products (test epoch)

