On July 25, 2022, Jeff Kern asked K. Scott Rowe to lead a tiger team to investigate the various issues that have affected the ALMA Archive hosted in CV over the past few weeks to months. The team initially consisted of just K. Scott.
Documented Issues
- https://ictjira.alma.cl/browse/AES-52
- https://confluence.alma.cl/pages/viewpage.action?pageId=91826715
People (not necessarily team members)
- K. Scott Rowe - Tiger Team Lead
- CJ Allen - sysadmin
- Tom Booth - programmer
- Liz Sharp - sysadmin
- Brian Mason - DRM Scientist
- Zhon Butcher - sysadmin
- Tracy Halstead - sysadmin
- Alvaro Aguirre - ALMA software
- Pat Murphy - CIS lead
- Rachel Rosen - previous ICT lead
- Laura Jenson - current ICT lead
- Catherine Vlahakis - Scientist
Communication lines
- Mattermost NAASC Systems - Mostly used by NAASC sysadmins
- asg@listmgr.nrao.edu email list run by rrosen (Sadly, no archives are kept)
Timeline
- 2020-03-19: ALMA suspends science observing and stows the array because of COVID-19.
- 2020-06-24: Archive webapps (aq, asaz, rh, etc., but not SP) moved to the new Docker Swarm (na-arc-*) system.
- 2021-03-17: ALMA restarts limited science observations, resuming Cycle 7.
- 2021-10-01: ALMA starts Cycle 8 observations.
- 2022-02-03: Science Portal (SP) upgraded (Plone, Python, RHEL) and moved into Docker Swarm; all other webapps were already in Docker Swarm.
- 2022-04-18: First documented report of performance issues. Webapps moved to the pre-production Docker Swarm (natest-arc-*).
- 2022-05-09: Moved the Science Portal (SP) from Docker Swarm to an rsync copy on http://almaportal.cv.nrao.edu/ because of performance issues.
- 2022-05-31: Moved the Science Portal (SP) from the rsync copy back to Docker Swarm.
- 2022-06-30: Tracy changed the eth0 MTU on the production Docker Swarm nodes (na-arc-*) from the default 1500 to 9000; the test swarm nodes are still at 1500 (see the check sketched below).
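Since the production and test swarms now differ in MTU, a quick way to confirm what each node is actually using is to read the value the kernel exposes under /sys. This is a minimal sketch, not part of the team's tooling; the interface name eth0 is taken from the entry above and may differ on some nodes.

```python
#!/usr/bin/env python3
"""Print the MTU configured on a network interface of the local node."""
import sys
from pathlib import Path

def interface_mtu(iface: str = "eth0") -> int:
    """Return the MTU the kernel reports for `iface` via sysfs."""
    return int(Path(f"/sys/class/net/{iface}/mtu").read_text().strip())

if __name__ == "__main__":
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    # Expected: 9000 on na-arc-* (after the 2022-06-30 change), 1500 on natest-arc-*.
    print(f"{iface}: MTU {interface_mtu(iface)}")
```

Run on each na-arc-* and natest-arc-* node, this should report 9000 and 1500 respectively if the 2022-06-30 change is still in effect.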
Benchmarks
- Using ApacheBench every hour on rastan.aoc.nrao.edu to load http://almascience.nrao.edu/ (a rough stand-in is sketched after this list)
- Using a download script to get 2013.1.00226.S-small (no ASDM tarballs) every hour on cvpost-master.aoc.nrao.edu
- Using a download script to get 2013.1.00226.S-large (with ASDM tarballs) every hour on testpost-master.aoc.nrao.edu
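For context, here is a rough stand-in for the hourly page-load check. The real jobs use ApacheBench and a separate download script, neither of which is reproduced here; only the URL comes from this page, and the timing/logging format is an assumption.

```python
#!/usr/bin/env python3
"""Approximate the hourly load check of http://almascience.nrao.edu/."""
import time
import urllib.request

URL = "http://almascience.nrao.edu/"

def timed_fetch(url: str = URL) -> float:
    """Fetch the page once and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=60) as resp:
        resp.read()
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = timed_fetch()
    # One line per run so an hourly cron job can append the trend to a log file.
    print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {URL} {elapsed:.2f}s")
```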
iperf tests between hosts, measured in Gb/s; each cell is the throughput between the row host and the column host. A value of 0 means less than 0.01 Gb/s. (A collection sketch follows the tables.)
There is clearly something wrong with na-arc-3: it gets about 0.002 Gb/s throughput (reported as 0 Gb/s) to the other nodes.
| | na-arc-1 | na-arc-2 | na-arc-3 | na-arc-4 | na-arc-5 |
|---|---|---|---|---|---|
| na-arc-1 | | 18 | 0 | 20 | 10 |
| na-arc-2 | 20 | | 0 | 20 | 10 |
| na-arc-3 | 0 | 0 | | 0 | 0 |
| na-arc-4 | 20 | 19 | 0 | | |
| na-arc-5 | 10 | 10 | 0 | 10 | 10 |
The test Docker Swarm nodes (natest-arc-*) are performing as expected. The VM hosts have 1 Gb/s links, so getting 80% to 90% of that bandwidth is about as good as one can expect.
| | natest-arc-1 | natest-arc-2 | natest-arc-3 |
|---|---|---|---|
| natest-arc-1 | | 0.9 | 0.8 |
| natest-arc-2 | 0.9 | | 0.8 |
| natest-arc-3 | 0.3 | 0.4 | |
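The sketch below shows one way pairwise numbers like those in the tables could be collected from a single node, assuming iperf3 is installed and an iperf3 server (`iperf3 -s`) is already listening on every other node. The host list and the Gb/s rounding follow the tables; this is not the team's actual measurement script.

```python
#!/usr/bin/env python3
"""Measure throughput from this node to each production swarm node with iperf3."""
import json
import subprocess

HOSTS = ["na-arc-1", "na-arc-2", "na-arc-3", "na-arc-4", "na-arc-5"]

def throughput_gbps(dest: str) -> float:
    """Run an iperf3 client against `dest` and return received throughput in Gb/s."""
    out = subprocess.run(
        ["iperf3", "-c", dest, "-J"],           # -J asks iperf3 for JSON output
        capture_output=True, text=True, check=True,
    ).stdout
    bits = json.loads(out)["end"]["sum_received"]["bits_per_second"]
    return bits / 1e9

if __name__ == "__main__":
    for dest in HOSTS:
        try:
            # As in the tables, anything under 0.01 Gb/s effectively prints as 0.00.
            print(f"{dest}: {throughput_gbps(dest):.2f} Gb/s")
        except subprocess.CalledProcessError as err:
            print(f"{dest}: iperf3 failed: {err.stderr.strip() if err.stderr else err}")
```

Running the same script on each node in turn would rebuild the full matrix above (the na-arc-3 row and column should stand out immediately).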
References
- Prepare offline infrastructure from scratch (describes the Docker Swarm setup)
- file:///tmp/ALMA%20Offline%20Software%20Test_Deployment%20Concept(2).pdf