On Jul. 25, 2022, Jeff Kern asked K. Scott Rowe to head a tiger team to investigate the various issues that have affected the ALMA Archive hosted in CV over the past few weeks to months.  The team initially consisted of just K. Scott.


Documented Issues


Timeline of events

Benchmarks


Table 1: Production docker swarm iperf tests measured in Gb/s.

2022-08-11: Results below were taken after re-creating na-arc-3 (as a clone of na-arc-2) and setting its MTU to 1500.  The VM host interfaces (p5p1.97 and br97 on naasc-vs-3) were still 1500, so we changed the interface on the VM guest (na-arc-3) to 1500 instead of changing the interfaces on the VM host to 9000, because there was concern that doing so might interfere with other running VM guests on that host.
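A minimal sketch of how one might verify and align those MTUs with standard iproute2 commands (interface names are the ones mentioned above; note that an ip link set change is not persistent across reboots):

    # On the VM guest (na-arc-3): show the current MTU, then set it to 1500.
    ip link show eth0
    ip link set dev eth0 mtu 1500

    # On the VM host (naasc-vs-3): confirm the VLAN and bridge interfaces agree.
    ip link show p5p1.97
    ip link show br97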


              na-arc-1      na-arc-2      na-arc-3      na-arc-4      na-arc-5
              (naasc-vs-4)  (naasc-vs-4)  (naasc-vs-3)  (naasc-vs-4)  (naasc-vs-5)
na-arc-1          -             19             9            21            10
na-arc-2         22              -             9            20            10
na-arc-3          7              7             -             7             7
na-arc-4         21             21             9             -            10
na-arc-5         10              9             8            10             -



Table 2: Test docker swarm iperf tests measured in Gb/s.


              natest-arc-1    natest-arc-2    natest-arc-3
              (naasc-dev-vs)  (naasc-vs-1)    (naasc-vs-5)
natest-arc-1       -               0.9             0.8
natest-arc-2      0.9               -              0.8
natest-arc-3      0.3              0.4              -

The test docker swarm nodes (natest-arc-*) are performing as expected.  The VM hosts have 1Gb/s links, so getting 80% to 90% of line rate is about as good as one can expect.
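The document does not record the exact iperf invocation, but a matrix like the ones above can be gathered pairwise; a minimal sketch, assuming iperf3 with its default port open between the nodes:

    # On the receiving node (e.g. na-arc-2): start an iperf3 server.
    iperf3 -s

    # On the sending node (e.g. na-arc-1): run a test against it and note
    # the reported bandwidth; this fills in one cell of the matrix.
    iperf3 -c na-arc-2

    # Repeat for each ordered pair of nodes.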

Diagrams

Questions

To Do

  1. Done: Recreate na-arc-3 so it gets the same performance as the other na-arc-* nodes, which is apparently at least 10Gb/s. (pmurphy)
    1. 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
    2. 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
    3. 2022-08-11: Renamed the clone to na-arc-3.  We connected it to the swarm successfully, but it had a low connection speed.
    4. 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes.  Performance was still poor.
    5. 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500.  This is different from all the other na-arc-* nodes, but it was either that or change p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which might have impacted other VM guests on that host.  Performance was now reasonable: 7Gb/s.  I was expecting about 9Gb/s, but perhaps the 1500 MTU is affecting performance.
    6. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
  2. Done: Launch services on production swarm (sbooth; see the swarm join sketch after this list)
    1. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
  3. Test the production docker swarm with a test web interface. (lsharp)
    1. 2022-08-12: http://almaportal.cv.nrao.edu/
    2. 2022-08-12 krowe: ran tcpdump on all five na-arc-{1..5} nodes (tcpdump dst almaportal), then downloaded a datafile with wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar.  With each execution of the wget, I could see the next na-arc host report the traffic, because the web proxy on almaportal selects the next na-arc node via round-robin (see the tcpdump/wget sketch after this list).  All five nodes were providing only about 6KB/s to cvpost-master.
    3. 2022-08-12 krowe: I ran iperf tests from host to host along the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master), and at each step the performance was at least 900Mb/s, yet downloading with wget was about 0.06Mb/s.
  4. Done: Ask the other ARCs if they use MTU 9000 on 10Gb links. (krowe)
    1. JAO uses MTU of 1500
    2. ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
  5. Switch the production docker swarm back to MTU 1500, since the test docker swarm uses MTU 1500 and is performing better?
  6. Fix natest-arc-3 so its NIC model is virtio instead of rtl8139 (see the libvirt sketch after this list).
  7. Upgrade the production swarm to meet ALMA requirements (16 cores, 32GB RAM).
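A minimal sketch of the swarm join-and-verify steps behind items 1.6 and 2, assuming the standard docker swarm workflow (the token and manager address are placeholders):

    # On an existing swarm manager: print the worker join command and token.
    docker swarm join-token worker

    # On na-arc-3: join the swarm using the token and address printed above.
    docker swarm join --token <TOKEN> <MANAGER-ADDRESS>:2377

    # Back on the manager: confirm the node shows Ready and the services
    # have their expected replica counts.
    docker node ls
    docker service ls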
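The round-robin check from item 3.2, spelled out as commands (both taken from the log above):

    # On each of the five na-arc nodes: watch for traffic to the portal.
    tcpdump dst almaportal

    # On cvpost-master: download the test datafile.  Each run of wget should
    # show up on the next na-arc node in turn as the proxy round-robins.
    wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar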
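For item 6 (the same change item 1.4 describes on na-arc-3), a minimal sketch of switching the NIC model with libvirt, assuming the guest is defined in virsh on its VM host:

    # On the VM host: edit the guest definition.
    virsh edit natest-arc-3

    # In the <interface> stanza, change
    #     <model type='rtl8139'/>
    # to
    #     <model type='virtio'/>
    # then restart the guest so the new model takes effect.
    virsh shutdown natest-arc-3
    virsh start natest-arc-3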

People (not necessarily team members)


Communication lines


Answers

References