NAASC Archive Stabilization Tiger Team

On Jul. 25, 2022 Jeff Kern asked K. Scott Rowe to head a tiger team to investigate the various issues that have affected the ALMA Archive hosted in CV for the past few weeks to months. The team was initially just K. Scott.

Documented Issues

Timeline of events

2020-03-19: ALMA suspends science observing and stows the array because of COVID-19.
2020-06-24: Archive webapps (aq, asaz, rh, etc, but not SP) moved to new Docker Swarm (na-arc-*) system. See more.
2021-03-17: ALMA re-starts limited science observations, resuming Cycle 7. See more.
2021-10-01: ALMA starts Cycle 8 observations. See more.
2022-02-03: Science Portal (SP) upgraded Plone, Python, RHEL and moved into Docker Swarm. All other webapps had already been in Docker Swarm.
2022-04-18: First documented report of performance issues. Webapps moved to pre-production Docker Swarm (natest-arc-*). See more
2022-05-09: moved Science Portal (SP) from Docker Swarm to an rsync copy on http://almaportal.cv.nrao.edu/ for performance issues
2022-05-31: moved Science Portal (SP) from rsync copy back to Docker Swarm
2022-06-30: Tracy changed the eth0 MTU on the production docker swarm nodes (na-arc-*) from the default 1500 to 9000. The test swarm is still 1500.

Benchmarks

Using Apache Benchmarks every hour to load http://almascience.nrao.edu/ on rastan.aoc.nrao.edu
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almascience.nrao.edu/data (times are in milliseconds)
  - Mode load time is 98ms
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almaportal.cv.nrao.edu/data (times are in milliseconds)
  - Mode load time is 123ms
Using wget to get 2013.1.00226.S-small (about 700MB) every hour on cvpost-master.aoc.nrao.edu
- ssh.cv.nrao.edu:/lustre/cv/users/krowe/tickets/scg-207/benchmarks/almascience.nrao.edu/2013.1.00226.S-small
  - 2022-08-16: average time to download is about 42 seconds which is about 16MB/s
iperf tests using iperf3 -s -B <local IP> and iperf3 -B <local IP> -c <dest IP>
2022-08-15 krowe: I had tcpdump running on each na-arc-{1..5} nodes watching for traffic from almaportal tcpdump dst almaportal. Then I would run the following wget on cvpost-master. The first execution would be shown by tcpdump on na-arc-1, the second execution on na-arc-2 and so forth. This is because of the round-robin nature of the web proxy on almaportal and was a nice confirmation of that process. However, each execution also downloaded at about 32KB/s (0.3Mb/s) after a minute or so of downloading, which is about 300 times slower than expected. Using the test swarm (natest-arc-{1..3}) I can download the same file at about 10MB/s (100Mb/s). Also, I did not see any difference in performance across the five nodes which was also surprising given that one of the nodes runs the downloader container and the other four need to forward traffic to the one download container.
- cvpost-master wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar
2022-08-15 krowe: I ran iperf tests from end to end and don't see any unexpected performance.
- [nangas11] -- ~900Mb/s --> [rh-download container on na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
- [nangas11] -- ~900Mb/s --> [na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
2022-08-17 krowe: doing scp tests of a 784MB file
- [root@rh-download-na-production-2022jun tmp]# scp krowe@nangas13:/NGAS1/volume1/afa/2022-08-17/1/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz /tmp (93MB/s)
- [root@rh-download-na-production-2022jun tmp]# scp member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@almaportal:/tmp (70MB/s)
- almaportal krowe >scp /tmp/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@cvpost-master:/tmp (110MB/s)
tcpdump bandwidth tests
- When I download a file from na-arc-5 like so `wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits` which lives on nangas13, to cvpost-master, the download runs at about 32KB/s.
  - On nangas13 I see about that much traffic (32KB/s to 50KB/s) almost all of it going to na-arc-5.
  - on na-arc-5 (rh-download container) I see between about 200KB/s and 300KB/s of traffic.
  - on na-arc-2 (httpd container) I see between about 100KB/s and 150KB/s of traffic. It seems like it is about half the traffic na-arc-5 sees.
2022-08-19 krowe: For some reason, all the swarm services on na-arc-5 shutdown about 11am Central Aug. 18, 2022. Now my wget tests are getting about 100MB/s and I tested this five times to walk through all four nodes.
- na-arc-5 was running
  - acralmaprod001.azurecr.io/offline-production/asax-elasticsearch:2022.02.01.2022feb (now on na-arc-3)
  - acralmaprod001.azurecr.io/offline-production/asax-explorer:2022.04.01.2022apr (now on na-arc-2)
  - acralmaprod001.azurecr.io/offline-production/asax-ingestor:2022.06.01.2022jun (now on na-arc-3)
  - acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun (now on na-arc-2)
  - acralmaprod001.azurecr.io/offline-production/rh-logging:2022.06.01.2022jun (now on na-arc-4)
- Looks like na-arc-5 lost its heartbeat with the swarm. This is around the time I learned that setting net.ipv4.tcp_timestamps = 0 makes wget performance drop to 0.0KB/s. So it may have been me that caused na-arc-5 to loose its hearbeat with the swarm although setting net.ipv4.tcp_timestamps=0 on na-arc-5 again didn't cause docker to move the service off it this time.
  - Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:14.131474019-04:00" level=warning msg="memberlist: Refuting a suspect message (from: c30261b68826)"
    Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:15.929428007-04:00" level=info msg="memberlist: Suspect 886f1454e2b4 has failed, no acks received"
    Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:16.061224152-04:00" level=error msg="heartbeat to manager {xojanp58fu1ysx3yk0rpvjsft 10.2.97.71:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=1l5cnfmt16f6hyg5it0rq39rr session.id=fl2thh44rmjfxgu7xidnakjg8 sessionID=fl2thh44rmjfxgu7xidnakjg8
- I moved the rh-download service back to na-arc-5 with docker service update --force production_requesthandler_download and wget performance is back to about 32KB/s.
- I moved rh-download from na-arc-5 back to na-arc-2 by draining na-arc-5 docker node update --availability drain na-arc-5 and wget performance was back to about 100MB/s. I ran it four times to make sure.
- Then I moved rh-download from na-arc-2 to na-arc-1 by forcing a rebalance again with docker service update --force production_requesthandler_download. This is because na-arc-5 was still drained. wget performance was back to about 100MB/s. I ran it four times to make sure. I wanted to make sure the performance was good because rh-download wasn't on na-arc-5 and not because it was on na-arc-2. I think I have shown that. So, the question is why is performance so poor when rh-download is on na-arc-5?
- I moved production_httpd from na-arc-2 to na-arc-5 and wget performance is vaiable.
  - wget --no-check-certificate http://na-arc-1.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
  - wget --no-check-certificate http://na-arc-2.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
  - wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 100MB/s
  - wget --no-check-certificate http://na-arc-4.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
  - wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 100MB/s
  - My first thought was this has something to do with naasc-vs-4 since na-arc-[1,2,4] are on that VM host. But iperf tests still show about 900MB/s between all hosts and cvpost-master.
    - There are some differences in the VM hosts sysctl settings
      - naasc-vs-3
        net.ipv4.conf.all.accept_redirects = 0
        net.ipv4.conf.all.forwarding = 1
      - naasc-vs-4
        net.ipv4.conf.all.accept_redirects = 0
        net.ipv4.conf.all.forwarding = 1
      - naasc-vs-5
        net.ipv4.conf.all.accept_redirects = 1
        net.ipv4.conf.all.forwarding = 0

Table1

Production docker swarm iperf tests measured in Gb/s.

2022-08-11: After re-creating na-arc-3 (a clone of na-arc-2). Also set the MTU to 1500. The VM Host interfaces (p5p1.97 and br97 on naasc-vs-3) were still 1500 so we changed the interface on the VM guest (na-arc-3) to 1500 instead of changing the interfaces on the VM host to 9000 because there was concern that may interfere with other running VM guests on that host.

	na-arc-1 (naasc-vs-4)	na-arc-2 (naasc-vs-4)	na-arc-3 (naasc-vs-3)	na-arc-4 (naasc-vs-4)	na-arc-5 (naasc-vs-5)
na-arc-1		19	9	21	10
na-arc-2	22		9	20	10
na-arc-3	7	7		7	7
na-arc-4	21	21	9		10
na-arc-5	10	9	8	10

Test docker swarm iperf tests measured in Gb/s

natest-arc-1

(naasc-dev-vs)

natest-arc-2

(naasc-vs-1)

natest-arc-3

(naasc-vs-5)

natest-arc-1

0.9

natest-arc-2

0.9

natest-arc-3

0.5

The test docker swarm (natest-arc-*) are performing as expected. The VM hosts have 1Gb/s links so getting 80% to 90% bandwidth is about as good as one can expect.

Diagrams

Questions

I see sysctl differences between the natest-arc servers and the na-arc servers. Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5

< #net.ipv4.tcp_tw_recycle = 1
---
> net.ipv4.tcp_tw_recycle = 1
22,39d21
< net.ipv4.conf.all.accept_redirects=0
< net.ipv4.conf.default.accept_redirects=0
< net.ipv4.conf.all.secure_redirects=0
< net.ipv4.conf.default.secure_redirects=0
<
< #net.ipv6.conf.all.disable_ipv6 = 1
< #net.ipv6.conf.default.disable_ipv6 = 1
<
< # Mellanox recommends the following
< net.ipv4.tcp_timestamps = 0
< net.core.netdev_max_backlog = 250000
<
< net.core.rmem_default = 16777216
< net.core.wmem_default = 16777216
< net.core.optmem_max = 16777216
< net.ipv4.tcp_mem = 16777216 16777216 16777216
< net.ipv4.tcp_low_latency = 1

If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).
If I set all the above sysctl options, execpt net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change. It is still about 32KB/s. Also I still zeeo ZeroWindow packets.
Try rebooting VMs after making changes?

I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget. This is na-arc-5 telling nangas13 to wiat because its network buffer is full.
- Is this because of qdisc pfifo_fast? No. krowe changed eth0 to *qdisc fq_codel* and still seeing ZeroWait packets.
- Now that I have moved the rh_download to na-arc-1 and put httpd on na-arc-5 I no longer see ZeroWindow packets on na-arc-5. But I am seeing them on na-arc-1 which is where the rh_downloader is now. Is this because the rh_downloader is being stalled talking to something else like httpd and therefore telling nangas13 to wait?
Why does almaportal use ens3 while almascience uses eth0?
What if we move the rh-downloader container to a different node? In fact walk it through all five nodes and test.

Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not on natest-arc-1

[root@na-arc-5 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
 1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
 2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
 3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms

[root@natest-arc-1 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
 1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
 2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms

Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509

Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
- virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
- When I run ethtool eth0 on nar-arc-{1..5} natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed while na-arc-3 shows 100Mb/s.
Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the performance (0.5Gb/s) expected especially when the reverse tests get expected performance (0.9Gb/s).
Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea? Sure it makes a fast connection to cvsan but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11)
When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like unknown user 1009 I get the same errors on the natest-arc-1 container.
Does it matter that the na-arc nodes are on 10.2.97.x, their VM host is on 10.2.99.x while the natest-arc nodes are on 10.2.96.x and their VM hosts (well 2 out of 3) are also on 10.2.96.x? Is this why I see cv-509.cv.nrao.edu when running traceroute from the na-arc nodes?
When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.

To Do

Now that we have changed sysctl settings it would be good to reboot. I suspect some things only look at the sysctl settings when they are created. Reboot naasc-vs-5 would be my suggestion.

Done

Recreate na-arc-3 so it gets the same performance as other na-arc-* nodes which is apparently at least 10Gb/s. (pmurphy)
1. 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
2. 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
3. 2022-08-11: Renamed the clone to na-arc-3. We connected it to the swarm successfully, but it had a low connection speed.
4. 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes. Performance was still poor.
5. 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500. This is different than all the other na-arc-* nodes but it was either that or change the p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500 which my have impacted other VM guests on that host. Performance was now reasonable. 7Gb/s. I was expecting about 9Gb/s but perhaps the 1500 MTU is affecting performance.
6. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
Launch services on production swarm (sbooth)
1. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
Test the production docker swarm with a test web interface. (lsharp)
1. 2022-08-12: http://almaportal.cv.nrao.edu/
2. 2022-08-12 krowe: rant tcpdump on all five na-arc-{1..5} nodes tcpdump dst almaportal and then downloaded a datafile wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar and with each execution of the wget, I could see the nex na-arc host report the traffic. This is because the web proxy on almaportal will select the next na-arc node via round-robin. All five nodes were providing about 6KB/s speeds to cvpost-master.
3. 2022-08-12 krowe: I did iperf tests from host to host in the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master) and each step the performance was at least 900Mb/s yet downloading with wget was about 0.06Mb/s.
Ask other ARC if they use MTU 9000 on 10Gb. (krowe)
1. JAO uses MTU of 1500
2. ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes. This seemed to have no affect on performance.
- tc qdisc replace dev eth0 root fq_codel
2022-08-25 krowe: Tracy cahnged the following sysctl options on na-arc-5 to match the other VM Hosts. Sadly it seems to have had no effect on wget performance. na-arc-1, na-arc-2, na-arc-4 are 32KB/s while na-arc-3 and na-arc-5 are 45MB/s.
- net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1

People (not necessarily team members)

K. Scott Rowe - Tiger Team Lead
CJ Allen - sysadmin
Tom Booth - programmer
Liz Sharp - sysadmin
Brian Mason - DRM Scientist
Zhon Butcher - sysadmin
Tracy Halstead - sysadmin
Alvaro Aguirre - ALMA software
Pat Murphy - CIS lead
Rachel Rosen - previous ICT lead
Laura Jenson - current ICT lead
Catherine Vlahakis - Scientist

Communcation lines

asg@listmgr.nrao.edu email list run by rrosen (Sadly, no archives are kept)
Mattermost NAASC Systems - Mostly used by NAASC sysadmins

Answers

Why does iperf show 10Gb/s between na-arc-5 and na-arc-[1,2,4]? How is this possible if the default interface on the respective VM Hosts is 1Gb/s?
- ANSWER: The vnets for the VM guests are tied to the 10Gb/s NICs on the VM hosts not the 1Gb/s NICs.
Why do natest-arc-{1..3} have 9 veth* interfaces in ip addr show while na-arc-{1..5} don't have any veth* interfaces?
- Each container creates a veth* interface.

Why does na-arc-3 have such poor network performance to the other na-arc nodes?

ping na-arc-[1,2,4,5] with anything larger than -s 1490 drops all packets
iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120). So it isn't a bad card in either of the VM hosts.
iptables on na-arc-3 looks different than iptables on na-arc-[2,3,5]. na-arc-1 also looks a bit different.
docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.