Table of Contents |
---|
Overview
On Jul. 25, 2022 Jeff Kern asked K. Scott Rowe to head a tiger team to investigate the various issues that have affected the ALMA Archive hosted in CV for the past few weeks to months. The team was initially just K. Scott.
Communcation lines
- asg@listmgr.nrao.edu email list run by rrosen (Sadly, no archives are kept)
- Mattermost NAASC Systems - Mostly used by NAASC sysadmins
- Slack - Mostly to communicate with Tom Booth
Documented Issues
- https://ictjira.alma.cl/browse/AES-52
- https://confluence.alma.cl/pages/viewpage.action?pageId=91826715
- You can see poor performance with a command like
- wget --no-check-certificate http://almaportal.cv.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits
- krowe has narrowed this down to the ingress overlay network created by docker swarm which is used to re-route traffic sent to the wrong host.
- On na-arc-2
- nsenter --net=/var/run/docker/netns/ingress_sbox
- iperf -B 10.0.0.21 -s
- On na-arc-3
- nsenter --net=/var/run/docker/netns/ingress_sbox
- iperf3 -B 10.0.0.19 -c 10.0.0.21
- SOLUTION: Set rx-gro-hw=off on naasc-vs-4. See Conclusions for more details.
- On na-arc-2
- 2022-09-19 krowe: With rx-gro-hw=off, retransmissions have reduced, but I still see them when sending to naasc-vs-5 and even more when sending to naasc-vs-2. Sending to naasc-vs-3 or naasc-vs-4 does not produce retransmissions. This is surprising given how similar naasc-vs-3 and naasc-vs-5 are. I expect this is caused by congestion as the retransmissions are not easily reproducable.
- On naasc-vs-2
- iperf3 -B 10.2.120.107 -s
- On naasc-vs-3 or naasc-vs-4 or naasc-vs-5 or naasc-cont-1 or...
- iperf3 -B <LOCAL_IP> -c 10.2.120.107
- 2022-09-21 cfultz: Replaced the 10Gb network cable on naasc-vs-2. "the cable was nearly bent in half at the router". Sadly, I am still seeing retransmissions.
- On naasc-vs-2
- 2022-09-19 krowe: With rx-gro-hw=off, throughput over the vxlan/overlay network (like ingress_sbox) has improved but still ranges from 1Gb/s to 4.6Gb/s. The network is rated at 10Gb/s. VLAN and VXLAN introduce about a 10% overhead penalty. So I would expect throughput to be more like 8Gb/s to 9Gb/s. Granted this isn't really an issue since at the moment the NGAS servers are on a 1Gb/s network, but that may change someday.
Diagrams
- naasc-archive-network.pdf - This is probably out of date because the docker services can be moved to a different na-arc host which also changes their IP addresses. It also doesn't list all the docker containers or VM guests, but it is a good approximation.
Timeline of events
- 2020-03-19: ALMA suspends science observing and stows the array because of COVID-19.
- 2020-06-24: Archive webapps (aq, asaz, rh, etc, but not SP) moved to new Docker Swarm (na-arc-*) system. See more.
- 2021-03-17: ALMA re-starts limited science observations, resuming Cycle 7. See more.
- 2021-10-01: ALMA starts Cycle 8 observations. See more.
- 2022-02-03: Science Portal (SP) upgraded Plone, Python, RHEL and moved into Docker Swarm. All other webapps had already been in Docker Swarm.
- 2022-04-18: First documented report of performance issues. Webapps moved to pre-production Docker Swarm (natest-arc-*). See more
- 2022-05-09: moved Science Portal (SP) from Docker Swarm to an rsync copy on http://almaportal.cv.nrao.edu/ for performance issues
- 2022-05-31: moved Science Portal (SP) from rsync copy back to Docker Swarm
- 2022-06-30: Tracy changed the eth0 MTU on the production docker swarm nodes (na-arc-*) from the default 1500 to 9000. The test swarm is still 1500.
- 2022-07-25: Jeff Kern asked K. Scott Rowe to head a tiger team to investigate the various issues that have affected the ALMA Archive.
- 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 as na-arc-3 and change MTU to 1500. Other na-arc nodes are 9000 but changing na-arc-3 to 9000 would require changing naasc-vs-3 which could affect other, non-archive, VM guests.
- 2022-08-12: setup http://almaportal.cv.nrao.edu/ which uses the five na-arc nodes. This is for internal testing. Results show download speed at about 32KB/s regaurdless of which na-arc node the web proxy chooses.
- 2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes. This seemed to have no affect on performance.
- tc qdisc replace dev eth0 root fq_codel
- 2022-08-19 krowe: For some reason, all the swarm services on na-arc-5 shutdown about 11am Central Aug. 18, 2022. Now my wget tests are getting about 100MB/s and I tested this five times to walk through all four nodes. I then moved the httpd to na-arc-5 and now na-arc-[1,2,4] download at ~32KB/s while na-arc-[3,5] download at ~100MB/s.
- 2022-08-25 krowe: Tracy cahnged the following sysctl options on naasc-vs-5 to match the other VM Hosts. Sadly it seems to have had no effect on wget performance. na-arc-1, na-arc-2, na-arc-4 are 32KB/s while na-arc-3 and na-arc-5 are 45MB/s.
- net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1
- 2022-09-01: Tracy rebooted naasc-vs-5 which hosts na-arc-5 just in case this was necessary for the net.ipv4.conf.all.forwarding sysctl change to take effect. Sadly, no change in performance.
- 2022-09-07 krowe: rebooted na-arc-5 and now net.ipv4.conf.all.accept_redirects = 0 but no change in performance.
- 2022-09-16 krowe: Set ethtool -K em1 gro off on naasc-vs-4.
Benchmarks
- Using Apache Benchmarks every hour to load http://almascience.nrao.edu/ on rastan.aoc.nrao.edu
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almascience.nrao.edu/data (times are in milliseconds)
- Mode load time is 98ms
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almaportal.cv.nrao.edu/data (times are in milliseconds)
- Mode load time is 123ms
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almascience.nrao.edu/data (times are in milliseconds)
- Using wget to get 2013.1.00226.S-small (about 700MB) every hour on cvpost-master.aoc.nrao.edu
- ssh.cv.nrao.edu:/lustre/cv/users/krowe/tickets/scg-207/benchmarks/almascience.nrao.edu/2013.1.00226.S-small
- 2022-08-16: average time to download is about 42 seconds which is about 16MB/s
- ssh.cv.nrao.edu:/lustre/cv/users/krowe/tickets/scg-207/benchmarks/almascience.nrao.edu/2013.1.00226.S-small
- iperf tests using iperf3 -s -B <local IP> and iperf3 -B <local IP> -c <dest IP>
2022-08-11: After re-creating na-arc-3 (a clone of na-arc-2). Also set the MTU to 1500. The VM Host interfaces (p5p1.97 and br97 on naasc-vs-3) were still 1500 so we changed the interface on the VM guest (na-arc-3) to 1500 instead of changing the interfaces on the VM host to 9000 because there was concern that may interfere with other running VM guests on that host.
Table1: iperf3 to/from hosts (Gb/s) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-1 19 9 21 10 na-arc-2
22 9 20 10 na-arc-3 7 7 7 7 na-arc-4 21 21 9 10 na-arc-5 10 9 8 10 Test docker swarm iperf tests measured in Gb/s
Table2: iperf3 to/from hosts (Gb/s) natest-arc-1
(naasc-dev-vs)
natest-arc-2
(naasc-vs-1)
natest-arc-3
(naasc-vs-5)
natest-arc-1 0.9 0.9 natest-arc-2 0.9 0.9 natest-arc-3 0.5 0.5 The test docker swarm (natest-arc-*) are performing as expected. The VM hosts have 1Gb/s links so getting 80% to 90% bandwidth is about as good as one can expect.
- 2022-08-15 krowe: I had tcpdump running on each na-arc-{1..5} nodes watching for traffic from almaportal tcpdump dst almaportal. Then I would run the following wget on cvpost-master. The first execution would be shown by tcpdump on na-arc-1, the second execution on na-arc-2 and so forth. This is because of the round-robin nature of the web proxy on almaportal and was a nice confirmation of that process. However, each execution also downloaded at about 32KB/s (0.3Mb/s) after a minute or so of downloading, which is about 300 times slower than expected. Using the test swarm (natest-arc-{1..3}) I can download the same file at about 10MB/s (100Mb/s). Also, I did not see any difference in performance across the five nodes which was also surprising given that one of the nodes runs the downloader container and the other four need to forward traffic to the one download container.
- cvpost-master wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar
- 2022-08-15 krowe: I ran iperf tests from end to end and don't see any unexpected performance.
- [nangas11] -- ~900Mb/s --> [rh-download container on na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
- [nangas11] -- ~900Mb/s --> [na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
- 2022-08-17 krowe: doing scp tests of a 784MB file
- [root@rh-download-na-production-2022jun tmp]# scp krowe@nangas13:/NGAS1/volume1/afa/2022-08-17/1/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz /tmp (93MB/s)
- [root@rh-download-na-production-2022jun tmp]# scp member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@almaportal:/tmp (70MB/s)
- almaportal krowe >scp /tmp/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@cvpost-master:/tmp (110MB/s)
- tcpdump bandwidth tests
- When I download a file from na-arc-5 like so `wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits` which lives on nangas13, to cvpost-master, the download runs at about 32KB/s.
- On nangas13 I see about that much traffic (32KB/s to 50KB/s) almost all of it going to na-arc-5.
- on na-arc-5 (rh-download container) I see between about 200KB/s and 300KB/s of traffic.
- on na-arc-2 (httpd container) I see between about 100KB/s and 150KB/s of traffic. It seems like it is about half the traffic na-arc-5 sees.
- When I download a file from na-arc-5 like so `wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits` which lives on nangas13, to cvpost-master, the download runs at about 32KB/s.
- 2022-08-19 krowe: For some reason, all the swarm services on na-arc-5 shutdown about 11am Central Aug. 18, 2022. Now my wget tests are getting about 100MB/s and I tested this five times to walk through all four nodes.
- na-arc-5 was running
- acralmaprod001.azurecr.io/offline-production/asax-elasticsearch:2022.02.01.2022feb (now on na-arc-3)
- acralmaprod001.azurecr.io/offline-production/asax-explorer:2022.04.01.2022apr (now on na-arc-2)
- acralmaprod001.azurecr.io/offline-production/asax-ingestor:2022.06.01.2022jun (now on na-arc-3)
- acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun (now on na-arc-2)
- acralmaprod001.azurecr.io/offline-production/rh-logging:2022.06.01.2022jun (now on na-arc-4)
- Looks like na-arc-5 lost its heartbeat with the swarm. This is around the time I learned that setting net.ipv4.tcp_timestamps = 0 makes wget performance drop to 0.0KB/s. So it may have been me that caused na-arc-5 to loose its heartbeat with the swarm although setting net.ipv4.tcp_timestamps=0 on na-arc-5 again didn't cause docker to move the service off it this time.
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:14.131474019-04:00" level=warning msg="memberlist: Refuting a suspect message (from: c30261b68826)"
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:15.929428007-04:00" level=info msg="memberlist: Suspect 886f1454e2b4 has failed, no acks received"
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:16.061224152-04:00" level=error msg="heartbeat to manager {xojanp58fu1ysx3yk0rpvjsft 10.2.97.71:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=1l5cnfmt16f6hyg5it0rq39rr session.id=fl2thh44rmjfxgu7xidnakjg8 sessionID=fl2thh44rmjfxgu7xidnakjg8
- I moved the rh-download service back to na-arc-5 with docker service update --force production_requesthandler_download and wget performance is back to about 32KB/s.
- I moved rh-download from na-arc-5 back to na-arc-2 by draining na-arc-5 docker node update --availability drain na-arc-5 and wget performance was back to about 100MB/s. I ran it four times to make sure.
- Then I moved rh-download from na-arc-2 to na-arc-1 by forcing a rebalance again with docker service update --force production_requesthandler_download. This is because na-arc-5 was still drained. wget performance was back to about 100MB/s. I ran it four times to make sure. I wanted to make sure the performance was good because rh-download wasn't on na-arc-5 and not because it was on na-arc-2. I think I have shown that. So, the question is why is performance so poor when rh-download is on na-arc-5?
- I moved production_httpd from na-arc-2 to na-arc-5 and wget performance is vaiable.
- wget --no-check-certificate http://na-arc-1.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
- wget --no-check-certificate http://na-arc-2.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
- wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 100MB/s
- wget --no-check-certificate http://na-arc-4.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 32KB/s
- wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits is about 100MB/s
- My first thought was this has something to do with naasc-vs-4 since na-arc-[1,2,4] are on that VM host. But iperf tests still show about 900MB/s between all hosts and cvpost-master.
- There are some differences in the VM hosts sysctl settings
- naasc-vs-3
net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1
- naasc-vs-4
- net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1
- naasc-vs-5
- net.ipv4.conf.all.accept_redirects = 1
- net.ipv4.conf.all.forwarding = 0
- naasc-vs-3
- There are some differences in the VM hosts sysctl settings
- na-arc-5 was running
- 2022-08-31 krowe: I have learned how to do iperf tests ingress network (docker mesh).
- Login to an VM guest and run the following to get a shell in the ingress_sbox namespace
- nsenter --net=/var/run/docker/netns/ingress_sbox
- You can now run ip -c addr to see the IPs of the ingress network
- Below is a table using default iperf3 test. The values are rounded for simplicity. Hosts accross the top row are receiving while hosts along the left column are transmitting.
- Login to an VM guest and run the following to get a shell in the ingress_sbox namespace
Table3: iperf3 to/from ingress_sbox (Mb/s) | |||||
---|---|---|---|---|---|
na-arc-1 10.0.0.2 | na-arc-2 10.0.0.21 | na-arc-3 10.0.0.19 | na-arc-4 10.0.0.5 | na-arc-5 10.0.0.6 | |
na-arc-1 | 4,000 | 2,000 | 4,000 | 3,000 | |
na-arc-2 | 4,000 | 2,000 | 4,000 | 3,000 | |
na-arc-3 | 0.3 | 0.3 | 0.3 | 3,000 | |
na-arc-4 | 4,000 | 4,000 | 2,000 | 3,000 | |
na-arc-5 | 0.3 | 0.3 | 2,000 | 0.3 |
Table4: iperf3 to/from ingress_sbox (Mb/s) | |||
---|---|---|---|
natest-arc-1 10.0.0.2 | natest-arc-2 10.0.0.8 | natest-arc-3 10.0.0.4 | |
natest-arc-1 | 900 | 700 | |
natest-arc-2 | 900 | 700 | |
natest-arc-3 | 300 | 300 |
- 2022-09-08 krowe: I have tested the other overlay networks (production_agent_network 10.0.1.0/24 and production_default 10.0.2.0/24) and they perform similarly to the ingress overlay network 10.0.0.0/24.
- 2022-09-09 krowe: na-arc-6 is now online served from naasc-vs-2. Here are the iperf3 tests from ingress_sbox to ingress_sbox. When throughput is slow (Kb/s) I see that the congestion window size is reduced from about 1MB to about 2.73KB.
Table6: iperf3 TCP throughput from/to ingress_sbox (Mb/s) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-6
(naasc-vs-2)
na-arc-1 3920 2300 4200 3110 3280 na-arc-2 3950 2630 4000 3350 3530 na-arc-3 0.2 0.3 0.2 2720 2810 na-arc-4 3860 3580 2410 3390 3290 na-arc-5 0.2 0.2 2480 0.2 2550 na-arc-6 0.005 0.005 2790 0.005 3290 - 2022-09-09 krowe: The ingress network (docker mesh) that I have been testing using the ingress_sbox namespace uses a veth interface (this is like a pipe) that connects to its corrosponding veth interface in another namespace on the same host which connects to a vxlan over a bridge in that second namespace. vxlan is a tunneling protocol that uses UDP over port 4789. This is why I am seeing my TCP packets turn into UDP packets. Using tcpdump in the ingress_sbox to watch iperf TCP traffic going from na-arc-2 to na-arc-3 looks clean. Watching traffic going from na-arc-3 to na-arc-2, which is slow (32KB/s), shows lots of TCP Retransmission and TCP Out-Of-Order packets.
TCP Retransmissions
- 2022-09-15 krowe: Even with rx-gro-hw=off on naasc-vs-4, I am still seeing some retransmissions in iper3 tests. These are the same as TCP Retransmissions seen previously. On a modern, well-designed network I would expect to see almost no TCP Retransmissions. So this may indicate that there are still improvements to be made. The number of retransmissions seems to vary over time from 0 retransmissions to over a thousand retransmissions on certain directions. This makes me think there is something else using the 10Gb network that is interfering with my tests.
This is a 10 second iper3 test using TCP from the host in the left column to the host in the top row.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off naasc-vs-2
(10.2.120.107)
naasc-vs-3
(10.2.120.109)
naasc-vs-4
(10.2.120.110)
naasc-vs-5
(10.2.120.112)
naasc-vs-2 0, 0, 0 0, 0, 0 45, 52, 59 naasc-vs-3 87, 0, 19, 1734 0, 0, 0 74, 52, 56 naasc-vs-4 0, 342, 1147, 363 0, 0, 0 83, 51, 50 naasc-vs-5 494, 0, 1296, 24 0, 0, 0 0, 0, 0 This looks like some sort of misconfiguration on the receiving ends of naasc-vs-2 and naasc-vs-5. This may be congestion. For example if I start an iper3 test from naasc-vs-4 to naasc-vs-2 I can often see 0 retransmissions every second. But then while doing that I also start in iper3 test from na-arc-1 (a guest on naasc-vs-4) to na-arc-6 (a guest on naasc-vs-2), I can see around 20 to 70 retransmissions per second on both naasc-vs-4 and na-arc-1. They are clearly interfering with each other. I don't see congestion when I reverse the test (naasc-vs-2 to naasc-vs-4 while na-arc-6 to na-arc-1). Doing the same test from naasc-vs-5 to naasc-vs-3 while doing na-arc-5 (a guest on naasc-vs-5) to na-arc-3 (a guest on naasc-vs-3) I don't see any congestion. If I turn off TSO on naasc-vs-2 with ethtool -K ens1f0np0 tx-tcp-segmentation off I then no longer can create congestion by doing two simultaneous iperf3 tests but I still get occational retransmissions (like 100+ per second) when testing from naasc-vs-{3..5} to naasc-vs-2. Disabling TSO also doesn't seem to reduce the number of retransmissions when testing from na-arc-* to na-arc-6.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off na-arc-1
10.2.97.71
(naasc-vs-4)
na-arc-2
10.2.97.72
(naasc-vs-4)
na-arc-3
10.2.97.73
(naasc-vs-3)
na-arc-4
10.2.97.74
(naasc-vs-4)
na-arc-5
10.2.97.75
(naasc-vs-5)
na-arc-6
10.2.97.76
(naasc-vs-2)
na-arc-1 0, 0, 0 0, 0, 0 0, 0, 0 55, 75, 50 323, 501, 538 na-arc-2 0, 0, 0 0, 0, 0 0, 0, 0 68, 81, 64 768, 1050, 658 na-arc-3 1692, 1627, 2071 0, 1326, 592 1471, 3376, 686 360, 2477, 664 1873, 1872, 2384 na-arc-4 0, 0, 0 0, 0, 0 0, 0, 0 58, 86, 65 4, 9, 38 na-arc-5 108, 6, 6 6, 6, 6 2, 1, 1 6, 6, 6 1293, 1197, 33 na-arc-6 106, 0, 28 0, 0, 21 0, 88, 0 7, 0, 28 89, 75, 52 - I see a lot of dropped packets on the Rx side of all the naasc-vs hosts.
- I think the large number of retransmissions when transmissing from naasc-vs-* to naasc-vs-2 the cause for the large number of retransmissions when transmitting from na-arc-* to na-arc-6.
- I don't know what explains the retransmissions when transmitting from na-arc-3 to na-arc-*.
I don't think the retransmissions from na-arc-3 to na-arc-* can be atributed to MTU. Sure eth0 on na-arc-3 is 1500 while all the other na-arc nodes are 9000 but that should not cause a problem. If anything it sould be a problem the other way around. Also I tested changing na-arc-6 to 1500 and retransmissions didn't change. The lack of retransmissions between na-arc-1, na-arc-2, and na-arc-4 is because they are all on the same VM Host (naasc-vs-4).
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
ping -c 3 -M do -s 1500 na-arc-1
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
- Try increasing Rx buffers (ethtool -G) and see if that helps retransmits
- ethtool -G ens1f0np0 rx 4096
- ethtool -G ens1f0np0 tx 2048
- Setting these didn't seem to help with the TCP Retransmissions.
- 2022-09-28 krowe: Not so fast. I was running a long iperf3 test from naasc-vs-4 to naasc-vs-2 and while seeing over 100 TCP Retransmissions per second I ran the following on naasc-vs-2 ethtool -G ens1f0np0 rx 4096 and immediatly saw the TCP Retransmissions drop to 0. I put the rx ring parameter back to 1024 and performed the test again and saw an immediate drop to 0. So there may be something here. Perhaps it helps but does not do away with all the TCP Retransmissions. Theoretically, there will always be some TCP Retransmissions over a long enough sample. I then succeded in two more tests. I think there is something here. Though I can still produce TCP retransmissions by running an iper3 test from na-arc-4 to na-arc-6 while running a test from naasc-vs-4 to naasc-vs-2.
- 2022-09-28 krowe: But more iper3 tests once rx=4096 show occational bouts of over 100 TCP Retransmissions per second. So perhaps increasing the ring parameter just reduces that perticular storm but other storms still occur.
- ethtool -G ens1f0np0 rx 4096
- Map the retransmissions. Is there a pattern over time? A regular cadence?
- map background traffic
- Is the fact that APIPA (zeroconf) is configured an indication that some network device was not brought up properly
- 2022-09-28 krowe: No. APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth which is installed from the network-scripts RPM. This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specificly. It is not installed on any other RHEL8 machine I have checked.
- 2022-09-28 krowe: NO. I think APIPA is a red herring. I created routes on naasc-vs-3 against p5p1, p5p1.120, and br97 and didn't see any TCP retransmissions. I also didn't see any increase in dropped packets. I also removed the routes from naasc-vs-2 and still occationally see 110 TCP Retransmissions per second.
- What about ethtool -K ens1f0np0 tx-tcp-segmentation off
- 2022-09-26 krowe: didn't make a difference.
- What about net.core.netdev_budget
- https://access.redhat.com/articles/1391433
- dropwatch - Couldn't get it to compile on naasc-vs-2 or any other RHEL8 machine because of missing libraries
- naasc-vs-2 is RHEL8. Could that be the problem?
- Solarflare card is a slightly newer model. Could that be the problem?
- 2022-09-28 krowe: Don't know. CV tried putting the same model card in naasc-vs-3 and 5 but then naasc-vs-2 wouldn't boot.
- Replace the Solarflare card with a different card of the same model. Perhaps this card has bad RAM or somesuch.
- 2022-09-29 krowe: BCC https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_networking/index
- 2022-09-30 krowe: I set all three of these on naasc-vs-2 sysctl -w net.core.rmem_max=26214400 and sysctl -w net.ipv4.tcp_rmem="4096 87380 26214400" and ethtool -G ens1f0np0 rx 4096. I can't say if it reduced the likehood of TCP Retransmissions but I still saw them eventually.
- 2022-10-03 krowe: dhart and thalstead inserted a second 10Gb/s card in naasc-vs-2. This one is supposedly a Solarflare SNF8522 even though Linux detects it as the same model as the original card (Solarflare Communications SFC9220 10/40G Ethernet Controller [1924:0a03]). Tracy configured this card to be the 10Gb/s NIC of naasc-vs-2 (ens2f0np0). I don't know why they didn't just remove the original card an insert the new card thus requiring no changes to the configuration but whatever. My iperf3 tests still show over 100 TCP Retransmissions per second. So the idea that the original card had some hardware flaw (like bad memory or something) is disproven.
Dropped packets
I see dropped Rx packets on interface ens1f0np0 on naasc-vs-2 at a rate of about 100 packets per minute. You can see this with watch ifconfig ens1f0np0. This is especially interesting given that there isn't much traffic on naasc-vs-2 right now. It is only hosting one VM guest (na-arc-6) and that guest is only running the docker agent container. I am not seeing any dropped packets on na-arc-6.
- 2022-09-26 krowe: ethtool -G ens1f0np0 tx 2048 and ethtool -G ens1f0np0 rx 4096 made no difference.
- 2022-09-26 krowe: ethtool -K ens1f0np0 tx-tcp-segmentation off made no difference
I see dropped Rx packets on all the other naasc-vs hosts as well. Hosts naasc-vs-3 and naasc-vs-5 show only about 2 packets dropped per minute while naasc-vs-4 shows about 100 packets per minute. I don't think this is related to the TCP Retransmissions as I don't see any of them when sending to naasc-vs-4.
- I don't think this is because of broadcast noise on the 10Gb/s network (10.2.120.0/24) as I don't see see large dropped packet counts on all naasc-vs hosts.
- 2022-09-26 krowe: Interestingly, if I watch the number of packets dropped per minute (I worte a script) and run the scripts at the same time on all four naasc-vs hosts, I see patterns. The number of dropped packets each minute is identical between naasc-vs-2 and naasc-vs-4 and hovers around 100. The number of dropped packets each minute is identical between naasc-vs-3 and naasc-vs-5 and hovers around 2. This tells me that naasc-vs-2 and naasc-vs-4 are getting the same traffic and dropping it the same way. What is this traffic?
- 2022-09-26 krowe: I set na-arc-6, the only guest on naasc-vs-2, to drain in docker swarm to see if that reduced the number of dropped packets seen on naasc-vs-2. Thinking it was docker swarm creating this traffic. There was no change in dropped packet rate. It continued to match naasc-vs-4.
- 2022-09-26 krowe: On naasc-vs-3 and naasc-vs-5 I see the dropped packet count per minute at about 2 but every 5 or 6 minutes the count inreases to 10 or 11.
- 2022-09-26 krowe: I tried looking at other nodes on the 10Gb 10.2.120.0/24 network but I couldn't login to most of them. One I could login to is cv-vs-4 and it is also seeing dropped Rx packets on its 10Gb interface at about the same rate as naasc-vs-3 and naasc-vs-5. This makes me think that these dropped packets have nothing to do with docker swarm. Perhaps there is just something on that network (some misconfigured Windows box or something) that is throwing bad packets around. That doesn't explain the difference in the dropped packet rates though.
- 2022-09-27 krowe: try clearing the ARP cache on the switch? Perhaps the switch is sending packets to the node to an IP address that is no longer there like because the container moved.
- use tcpdump and sort by destination looking for the number of dropped packets per minute.
- 2022-10-03 krowe: dhart and thalstead inserted a second 10Gb/s card in naasc-vs-2. This one is supposedly a Solarflare SNF8522 even though Linux detects it as the same model as the original card (Solarflare Communications SFC9220 10/40G Ethernet Controller [1924:0a03]). Tracy configured this card to be the 10Gb/s NIC of naasc-vs-2 (ens2f0np0). I don't know why they didn't just remove the original card an insert the new card thus requiring no changes to the configuration but whatever. I am still seeing about 60 dropped packets per minute, and it still matches the dropped packets on naasc-vs-4. So the idea that the original card had some hardware flaw (like bad memory or something) is disproven.
Comparisons
naasc-vs-2, 3, 4, 5
Identical
- 2022-09-02 krowe: sysctl -a | grep br97 across naasc-vs-{3..5} are identical.
- 2022-09-02 krowe: sysctl -a | grep <vnet> across naasc-vs-{3..5} are identical except for the vnet name (e.g. vnet2, vnet4, etc)
- 2022-09-02 krowe: sysctl -a across naasc-vs-3 and naasc-vs-5 have no significant differences.
- 2022-09-06 krowe: CV-NEXUS switch port capabilities for naasc-vs-{3..5} are identical.
- 2022-09-06 krowe: CV-NEXUS9K switch port capabilities for naasc-vs-{2..5} are identical.
- 2022-09-06 krowe: ethtool -k <NIC> across naasc-vs-3 and naasc-vs-5 are identical execpt for the NIC name.
- 2022-09-07 krowe: iptables -L and iptables -S across naasc-vs-{3..5} are identical.
- 2022-09-02 krowe: sysctl -a | grep <10Gb NIC> across naasc-vs-3 and naasc-vs-5 are identical except for the name of the NIC.
- 2022-09-19 krowe: ethtool -g <NIC> across naasc-vs-3 and naasc-vs-5 are identical.
Differences
- 2022-09-02 krowe: sysctl -a | grep <10Gb NIC> between naasc-vs-3/naasc-vs-5 and naasc-vs-4 are different
- naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
- 2022-09-02 krowe: sysctl -a on naasc-vs-4 and naasc-vs-5 and found many questionable differences
- naasc-vs-4: net.iw_cm.default_backlog = 256
- Is this because the IB modules are loaded?
- naasc-vs-4: net.rdma_ucm.max_backlog = 1024
- Is this because the IB modules are loaded?
- naasc-vs-4: sunrpc.rdma*
- Is this because the IB modules are loaded?
- naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
- nfnetlink is a module for packet mangling. Could this interfear with the docker swarm networking?
- Though the recorded output rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 300Kb/s.
- And the recorded input rate of naasc-vs-5 is about 500 Mb/s while naasc-vs{3..4} is about 5 Mb/s.
- This is very strange as it seemed naasc-vs-5 was the limiting factor but the switch ports suggest not. Perhaps this data rate is caused by other VM guests on naasc-vs-5 (helpdesk-prod, naascweb2-prod, cartaweb-prod, natest-arc-3, cobweb2-dev)
- naasc-vs-4: net.iw_cm.default_backlog = 256
- 2022-09-06 krowe: ethtool -k <NIC> for naasc-vs-3/naasc-vs-5 are very different from naasc-vs-4.
- hw-tc-offload: off vs hw-tc-offload: on
- rx-gro-hw: off vs rx-gro-hw: on
- rx-vlan-offload: off vs rx-vlan-offload: on
- rx-vlan-stag-hw-parse: off vs rx-vlan-stag-hw-parse: on
- tcp-segmentation-offload: off vs tcp-segmentation-offload: on
- tx-gre-csum-segmentation: off vs tx-gre-csum-segmentation: on
- tx-gre-segmentation: off vs tx-gre-segmentation: on
- tx-gso-partial: off vs x-gso-partial: on
- tx-ipip-segmentation: off vs tx-ipip-segmentation: on
- tx-sit-segmentation: off vs tx-sit-segmentation: on
- tx-tcp-segmentation: off vs tx-tcp-segmentation: on
- tx-udp_tnl-csum-segmentation: off vs tx-udp_tnl-csum-segmentation: on
- tx-udp_tnl-segmentation: off vs tx-udp_tnl-segmentation: on
- tx-vlan-offload: off vs tx-vlan-offload: on
- tx-vlan-stag-hw-insert: off vs tx-vlan-stag-hw-insert: on
- 2022-09-12 krowe: I found the rx and tx buffers for em1 on naasc-vs-4 were 511 while on naasc-vs-2, 3, and 5 were 1024. You can see this with ethtool -g em1. I changed naasc-vs-4 with the following ethtool -G em1 rx 1024 tx 1024 but it didn't change iperf performance.
- 2022-09-12 krowe: I found an article suggesting that gro can make traffic slower when it is enabled. I see that rx-gro-hw is enabled on naasc-vs-4 but disabled on naasc-vs-3 and 5. You can see this with ethtool -k em1 | grep gro.So I disabled it on naasc-vs-4 with ethtool -K em1 gro off and iperf3 tests now show about 2Gb/s both directions!!!
- GRO = Generic Receive Offload. It is hardware on the physical NIC. GRO is an aggregation technique to coalesce several receive packets from a stream into a single large packet, thus saving CPU cycles as fewer packets need to be processed by the kernel.
- https://bugzilla.redhat.com/show_bug.cgi?id=1424076
- https://access.redhat.com/solutions/20278
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/nic-tuning_22/gro-generic-receive-offload.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/ip-forwarding-tunings/nic-tuning_48.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/os-tuning-linux.html
- After disabling rx-gro-hw, I no longer see TCP Retransmission or TCP Out-Of-Order packets when tracing the iperf3 test from na-arc-3 to na-arc-2.
Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-6
(naasc-vs-2)
na-arc-1 4460
2580 4630 2860 3150 na-arc-2 4060
2590 4220 3690 2570 na-arc-3 2710
2580
3080 2770 2920 na-arc-4 1090
3720 2200 2970 3200 na-arc-5 4010
3970 2340 4010 3080 na-arc-6 3380
3060 3060 3010 3080 - This definatly improves performance buth I am still seeing lots of retransmists in the iperf3 tests so perhaps there is more that can be done.
- 2022-09-15 krowe: The VM Hosts have different 10Gb network cards
- naasc-vs-2 uses a Solarflare Communications SFC9220
- naasc-vs-3 uses a Solarflare Communications SFC9020
- naasc-vs-4 uses a Broadcom BCM57412 NetXtreme-E
- naasc-vs-5 uses a Solarflare Communications SFC9020
na-arc-1,2,3,4,5
Identical
Differences
Questions
- 2022-09-28 krowe: Why was the network-scripts RPM installed on naasc-vs-2? No other RHEL8 machine has this RPM. Was it because nobody knew how to configure vlans and other complicated networking using NetworkManager, which is the new standard in RHEL8?
- 2022-09-26 krowe: Can someone who is able to login, login to the nodes on the 10.2.120 network and see if those interfaces are showing dropped Rx packets? I would, but I can't login to most of them because CV.
- 2022-09-21 krowe: Why are there dozens of stuck inventory processes on naasv-vs-2?
- 2022-09-20 krowe: ifconfig shows dropped RX packes on all naasc-vs-* nodes. Is that increasing still with time? What is causing this? CJ mentioned this two months ago. I am finally looking at it now. sigh.
- 2022-09-20 krowe: It looks like device eno1 on naasc-vs-2 is configured via DHCP instead of STATIC. Is that correct?
- Why, with rx-gro-hw=off on naasc-vs-4, does na-arc-6 see so many retransmissions and small Congestion Window (Cwnd)?
[root@na-arc-6 ~]# iperf3 -B 10.0.0.16 -c 10.0.0.21
Connecting to host 10.0.0.21, port 5201
[ 4] local 10.0.0.16 port 38534 connected to 10.0.0.21 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 302 MBytes 2.54 Gbits/sec 523 207 KBytes
[ 4] 1.00-2.00 sec 322 MBytes 2.70 Gbits/sec 596 186 KBytes
[ 4] 2.00-3.00 sec 312 MBytes 2.62 Gbits/sec 687 245 KBytes
[ 4] 3.00-4.00 sec 335 MBytes 2.81 Gbits/sec 638 278 KBytes
[ 4] 4.00-5.00 sec 309 MBytes 2.60 Gbits/sec 780 146 KBytes[root@na-arc-3 ~]# iperf3 -B 10.0.0.19 -c 10.0.0.21
Connecting to host 10.0.0.21, port 5201
[ 4] local 10.0.0.19 port 52986 connected to 10.0.0.21 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 309 MBytes 2.59 Gbits/sec 232 638 KBytes
[ 4] 1.00-2.00 sec 358 MBytes 3.00 Gbits/sec 0 967 KBytes
[ 4] 2.00-3.00 sec 351 MBytes 2.95 Gbits/sec 0 1.18 MBytes
[ 4] 3.00-4.00 sec 339 MBytes 2.84 Gbits/sec 74 1.36 MBytes
[ 4] 4.00-5.00 sec 359 MBytes 3.01 Gbits/sec 0 1.54 MBytes- Actuqally the retransmissions seem to very quite a lot from one run to another. That is the more important question. Also the throughput seems to vary as well from 1Gb/s to 4Gb/s. Of course the more retransmissions the less throughput. Granted this is a second order force and given that the nangas hosts have 1Gb/s links, probably won't be seen. But if we ever put 10Gb/s cards in the nangas nodes we will see this and be sad.
- Why does naasc-vs-3 have a br120 in state UNKNOWN? none of the other naasc-vs nodes have a br120.
- Why does naasc-vs-4 have all the infiniband modules loaded? I don't see an IB card. naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
- Tracy will look into this
- Why is nfnetlink logging enabled on naasc-vs-4? You can see this with cat /proc/net/netfilter/nf_log and lsmod|grep -i nfnet
- nfnetlink is a module for packet mangling. Could this interfear with the docker swarm networking?
- why is the eth1 interfaces in all the containers and docker_gwbridge on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range? Does it matter?
- Here are some diffs in sysctl on na-arc nodes. I tried changed na-arc-4 and na-arc-5 to match the others but performance was the same. I then changed all the nodes to match na-arc-{1..3} and still no change in performance. I still don't understand how na-arc-{4..5} got different setttings. I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d but that isn't why these are different.
- na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 1
- na-arc-4, na-arc-5
net.bridge.bridge-nf-call-arptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
- na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
- I see sysctl differences between the natest-arc servers and the na-arc servers. Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5
< #net.ipv4.tcp_tw_recycle = 1
---
> net.ipv4.tcp_tw_recycle = 1
22,39d21
< net.ipv4.conf.all.accept_redirects=0
< net.ipv4.conf.default.accept_redirects=0
< net.ipv4.conf.all.secure_redirects=0
< net.ipv4.conf.default.secure_redirects=0
<
< #net.ipv6.conf.all.disable_ipv6 = 1
< #net.ipv6.conf.default.disable_ipv6 = 1
<
< # Mellanox recommends the following
< net.ipv4.tcp_timestamps = 0
< net.core.netdev_max_backlog = 250000
<
< net.core.rmem_default = 16777216
< net.core.wmem_default = 16777216
< net.core.optmem_max = 16777216
< net.ipv4.tcp_mem = 16777216 16777216 16777216
< net.ipv4.tcp_low_latency = 1If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).
- If I set all the above sysctl options, execpt net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change. It is still about 32KB/s. Also I still zeeo ZeroWindow packets.
- Try rebooting VMs after making changes?
- I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget. This is na-arc-5 telling nangas13 to wiat because its network buffer is full.
- Is this because of qdisc pfifo_fast? No. krowe changed eth0 to *qdisc fq_codel* and still seeing ZeroWait packets.
- Now that I have moved the rh_download to na-arc-1 and put httpd on na-arc-5 I no longer see ZeroWindow packets on na-arc-5. But I am seeing them on na-arc-1 which is where the rh_downloader is now. Is this because the rh_downloader is being stalled talking to something else like httpd and therefore telling nangas13 to wait?
- Why does almaportal use ens3 while almascience uses eth0?
- What if we move the rh-downloader container to a different node? In fact walk it through all five nodes and test.
- Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not on natest-arc-1
[root@na-arc-5 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
1 cv-6509-vlan97.cv.nrao.edu (10.2.97.1) 0.426 ms 0.465 ms 0.523 ms
2 cv-6509.cv.nrao.edu (10.2.254.5) 0.297 ms 0.277 ms 0.266 ms
3 nangas13.cv.nrao.edu (10.2.140.33) 0.197 ms 0.144 ms 0.109 ms[root@natest-arc-1 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
1 cv-6509-vlan96.cv.nrao.edu (10.2.96.1) 0.459 ms 0.427 ms 0.402 ms
2 nangas13.cv.nrao.edu (10.2.140.33) 0.184 ms 0.336 ms 0.311 ms- Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
- Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
- virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
- When I run ethtool eth0 on nar-arc-{1..5} natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed while na-arc-3 shows 100Mb/s.
- Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the performance (0.5Gb/s) expected especially when the reverse tests get expected performance (0.9Gb/s).
- Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea? Sure it makes a fast connection to cvsan but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11)
- When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like unknown user 1009 I get the same errors on the natest-arc-1 container.
- Does it matter that the na-arc nodes are on 10.2.97.x, their VM host is on 10.2.99.x while the natest-arc nodes are on 10.2.96.x and their VM hosts (well 2 out of 3) are also on 10.2.96.x? Is this why I see cv-509.cv.nrao.edu when running traceroute from the na-arc nodes?
- When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.
- Why does the httpd container have eth0(10.0.0.8). This is the ingress network. I don't see any other conrainter with an interface on 10.0.0.0/24.
- Do we want to use jumbo frames? If so, some recommend using mtu=8900 and there are a lot of places it needs to be set.
To Do
- Create na-arc-6 on new naasc-vs-2 (https://support.nrao.edu/show-ticket.php?ticketid=144552)
- Test iperf between ingress_sbox on new na-arc-6 when it is available
- Set ethtool -K em1 gro off perminantly on naasc-vs-4 and document it. How do we do this?
- Double check switch port settings for naasc-vs-2. I am seeing many TCP retransmissions (dhart)
- Check and perhaps replace 10Gb network cable to naas-vs-2. Does that help with TCP retransmissions?
- are the retarnsmissions to naasc-vs-2 causing my wget to na-arc-6 to fail?
- Strawman proposal for reassigning VM guests
Done
- Recreate na-arc-3 so it gets the same performance as other na-arc-* nodes which is apparently at least 10Gb/s. (pmurphy)
- 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
- 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
- 2022-08-11: Renamed the clone to na-arc-3. We connected it to the swarm successfully, but it had a low connection speed.
- 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes. Performance was still poor.
- 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500. This is different than all the other na-arc-* nodes but it was either that or change the p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500 which my have impacted other VM guests on that host. Performance was now reasonable. 7Gb/s. I was expecting about 9Gb/s but perhaps the 1500 MTU is affecting performance.
- 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
- Launch services on production swarm (sbooth)
- 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
- Test the production docker swarm with a test web interface. (lsharp)
- 2022-08-12: http://almaportal.cv.nrao.edu/
- 2022-08-12 krowe: ran tcpdump on all five na-arc-{1..5} nodes tcpdump dst almaportal and then downloaded a datafile wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar and with each execution of the wget, I could see the next na-arc host report the traffic. This is because the web proxy on almaportal will select the next na-arc node via round-robin. All five nodes were providing about 6KB/s speeds to cvpost-master.
- 2022-08-12 krowe: I did iperf tests from host to host in the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master) and each step the performance was at least 900Mb/s yet downloading with wget was about 0.06Mb/s.
- Ask other ARC if they use MTU 9000 on 10Gb. (krowe)
- JAO uses MTU of 1500
- ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
- 2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes. This seemed to have no affect on performance.
- tc qdisc replace dev eth0 root fq_codel
- 2022-08-25 krowe: Tracy cahnged the following sysctl options on na-arc-5 to match the other VM Hosts. Sadly it seems to have had no effect on wget performance. na-arc-1, na-arc-2, na-arc-4 are 32KB/s while na-arc-3 and na-arc-5 are 45MB/s.
- net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1
- 2022-09-01: Tracy rebooted naasc-vs-5 which hosts na-arc-5 just in case this was necessary for the net.ipv4.conf.all.forwarding sysctl change to take effect. Sadly, no change in performance.
- Why does na-arc-5 still have net.ipv4.conf.all.accept_redirects = 1 even after a reboot while all the other na-arc nodes have this set to 0?
- 2022-09-06 krowe: probably because na-arc-5 didn't reboot when naasc-vs-5 rebooted. I expect it was suspended instead of rebooted. Yet natest-arc-3 and naascweb2-prod were rebooted. I just checked virt-manager and na-arc-5 is hosted by naasc-vs-5. Can we reboote na-arc-5?
- 2022-09-07 krowe: rebooted na-arc-5 and now net.ipv4.conf.all.accept_redirects = 0
- 2022-09-21 cfultz: Replaced the 10Gb network cable on naasc-vs-2. "the cable was nearly bent in half at the router".
People (not necessarily team members)
- K. Scott Rowe - Tiger Team Lead
- CJ Allen - sysadmin
- Tom Booth - programmer
- Liz Sharp - sysadmin
- Brian Mason - DRM Scientist
- Zhon Butcher - sysadmin
- Tracy Halstead - sysadmin
- Alvaro Aguirre - ALMA software
- Pat Murphy - CIS lead
- Rachel Rosen - previous ICT lead
- Laura Jenson - current ICT lead
- Catherine Vlahakis - Scientist
Answers
- Why does iperf show 10Gb/s between na-arc-5 and na-arc-[1,2,4]? How is this possible if the default interface on the respective VM Hosts is 1Gb/s?
- ANSWER: The vnets for the VM guests are tied to the 10Gb/s NICs on the VM hosts not the 1Gb/s NICs.
- Why do natest-arc-{1..3} have 9 veth* interfaces in ip addr show while na-arc-{1..5} don't have any veth* interfaces?
- Each container creates a veth* interface.
- Why does na-arc-3 have such poor network performance to the other na-arc nodes?
- ping na-arc-[1,2,4,5] with anything larger than -s 1490 drops all packets
- iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120). So it isn't a bad card in either of the VM hosts.
- iptables on na-arc-3 looks different than iptables on na-arc-[2,3,5]. na-arc-1 also looks a bit different.
- docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
- na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.
Production docker swarm iperf tests measured in Gb/s.
na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-1 18 0.002 20 10 na-arc-2
20 0.002 20 10 na-arc-3 0.002 0.002 0.002 0.002 na-arc-4 20 19 0.002 na-arc-5 10 10 0.002 10 10 There is clearly something wrong with na-arc-3
- ANSWER: Since there were so many problems with na-arc-3, it was decided to recreate it. It was recreated from a clone of na-arc-2.
- Is putting all the 1Gb/s production docker swarm nodes on the same ASIC on the same Fabric Extender of the cv-nexus switch a good idea?
- I am thinking it does not matter because it looks like the production docker swarm nodes use the 10Gb/s network which is on cv-nexus9k
- Can we set up a test archive query that uses the "other" docker swarm which in this case would be the production swarm (na-arc-*)?
- Why are there VLANs on the VM hosts. e.g. em1.97 on naasc-vs-4?
2022-08-12 dhart: If you want all of your guest VMs to be on the same subnet as the VM host, then VLAN awareness isn't needed. However, in most cases we want the flexibility of being able to have VM guests on different networks (from one another and/or the VM host) so the VM host is configured with a trunk interface to the network to allow for any VLAN to be passed to the underlying VM guests housed on that VM host machine
- 2022-08-12 dhart: 10.2.97.x (and 10.2.96.x) = internal VLAN for servers (primarily) 10.2.99.x = internal VLAN for server management
- 10.2.120.x = internal VLAN for 10 GE connections
- Where is the main docker config (yaml file)?
- 2022-09-20 krowe: Why does naasc-vs-2 have APIPA configured networks (169.254.0.0)? Aren't these usually created only if there are misconfigured network(s)?
[root@naasc-vs-2 ~]# netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.2.99.1 0.0.0.0 UG 0 0 0 eno1
10.2.99.0 0.0.0.0 255.255.255.0 U 0 0 0 eno1
10.2.120.0 0.0.0.0 255.255.255.0 U 0 0 0 ens1f0np0.120
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ens1f0np0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ens1f0np0.120
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br97
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br101
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0- 2022-09-28 krowe: APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth which is installed from the network-scripts RPM. This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specificly. It is not installed on any other RHEL8 machine I have checked.
- 2022-09-26 krowe: Can an older solarflare card (Solarflare Communications SFC9020) replace the card in naasc-vs-2 to see if that helps with the TCP Retransmissions?
- 2022-09-28 krowe: No. When thalstea replaced the card with an old SFP9020 card from cv-vs-1, the machine would not boot. So the original SFC9022 is back in naasc-vs-2. See ticket https://support.nrao.edu/show-ticket.php?ticketid=145153 for deatils.
- Why can't I download via na-arc-6? I don't think it is properly setup yet.
- wget --no-check-certificate http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
--2022-09-15 10:22:32-- http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
Resolving na-arc-6.cv.nrao.edu (na-arc-6.cv.nrao.edu)... 10.2.97.76
Connecting to na-arc-6.cv.nrao.edu (na-arc-6.cv.nrao.edu)|10.2.97.76|:8088... failed: Connection timed out. - 2022-09-29 krowe: Apparently docker just needed to be restarted on na-arc-6. Now I can download files via wget at the same rate using na-arc-6 as other na-arc nodes.
- wget --no-check-certificate http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
Conclusions
NAASC Archive Stabilization Solutions
References
- Prepare offline infrastructure from the scratch (Describes docker swarm setup)
- file:///tmp/ALMA%20Offline%20Software%20Test_Deployment%20Concept(2).pdf