...
- 2022-09-15 krowe: Even with rx-gro-hw=off on naasc-vs-4, I am still seeing some retransmissions in iper3 tests. These are the same as TCP Retransmissions seen previously. On a modern, well-designed network I would expect to see almost no TCP Retransmissions. So this may indicate that there are still improvements to be made. The number of retransmissions seems to vary over time from 0 retransmissions to over a thousand retransmissions on certain directions. This makes me think there is something else using the 10Gb network that is interfering with my tests.
This is a 10 second iper3 test using TCP from the host in the left column to the host in the top row.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off naasc-vs-2
(10.2.120.107)
naasc-vs-3
(10.2.120.109)
naasc-vs-4
(10.2.120.110)
naasc-vs-5
(10.2.120.112)
naasc-vs-2 0, 0, 0 0, 0, 0 45, 52, 59 naasc-vs-3 87, 0, 19, 1734 0, 0, 0 74, 52, 56 naasc-vs-4 0, 342, 1147, 363 0, 0, 0 83, 51, 50 naasc-vs-5 494, 0, 1296, 24 0, 0, 0 0, 0, 0 This looks like some sort of misconfiguration on the receiving ends of naasc-vs-2 and naasc-vs-5. This may be congestion. For example if I start an iper3 test from naasc-vs-4 to naasc-vs-2 I can often see 0 retransmissions every second. But then while doing that I also start in iper3 test from na-arc-1 (a guest on naasc-vs-4) to na-arc-6 (a guest on naasc-vs-2), I can see around 20 to 70 retransmissions per second on both naasc-vs-4 and na-arc-1. They are clearly interfering with each other. I don't see congestion when I reverse the test (naasc-vs-2 to naasc-vs-4 while na-arc-6 to na-arc-1). Doing the same test from naasc-vs-5 to naasc-vs-3 while doing na-arc-5 (a guest on naasc-vs-5) to na-arc-3 (a guest on naasc-vs-3) I don't see any congestion. If I turn off TSO on naasc-vs-2 with ethtool -K ens1f0np0 tx-tcp-segmentation off I then no longer can create congestion by doing two simultaneous iperf3 tests but I still get occational retransmissions (like 100+ per second) when testing from naasc-vs-{3..5} to naasc-vs-2. Disabling TSO also doesn't seem to reduce the number of retransmissions when testing from na-arc-* to na-arc-6.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off na-arc-1
10.2.97.71
(naasc-vs-4)
na-arc-2
10.2.97.72
(naasc-vs-4)
na-arc-3
10.2.97.73
(naasc-vs-3)
na-arc-4
10.2.97.74
(naasc-vs-4)
na-arc-5
10.2.97.75
(naasc-vs-5)
na-arc-6
10.2.97.76
(naasc-vs-2)
na-arc-1 0, 0, 0 0, 0, 0 0, 0, 0 55, 75, 50 323, 501, 538 na-arc-2 0, 0, 0 0, 0, 0 0, 0, 0 68, 81, 64 768, 1050, 658 na-arc-3 1692, 1627, 2071 0, 1326, 592 1471, 3376, 686 360, 2477, 664 1873, 1872, 2384 na-arc-4 0, 0, 0 0, 0, 0 0, 0, 0 58, 86, 65 4, 9, 38 na-arc-5 108, 6, 6 6, 6, 6 2, 1, 1 6, 6, 6 1293, 1197, 33 na-arc-6 106, 0, 28 0, 0, 21 0, 88, 0 7, 0, 28 89, 75, 52 - I see a lot of dropped packets on the Rx side of all the naasc-vs hosts.
- I think the large number of retransmissions when transmissing from naasc-vs-* to naasc-vs-2 the cause for the large number of retransmissions when transmitting from na-arc-* to na-arc-6.
- I don't know what explains the retransmissions when transmitting from na-arc-3 to na-arc-*.
I don't think the retransmissions from na-arc-3 to na-arc-* can be atributed to MTU. Sure eth0 on na-arc-3 is 1500 while all the other na-arc nodes are 9000 but that should not cause a problem. If anything it sould be a problem the other way around. Also I tested changing na-arc-6 to 1500 and retransmissions didn't change. The lack of retransmissions between na-arc-1, na-arc-2, and na-arc-4 is because they are all on the same VM Host (naasc-vs-4).
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
ping
-c
3
-M
do
-s
1500
na-arc-1
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
- Try increasing Rx buffers (ethtool -G) and see if that helps retransmits
- ethtool -G ens1f0np0 rx 4096
- ethtool -G ens1f0np0 tx 2048
- Setting these didn't seem to help with the TCP Retransmissions.
- ethtool -G ens1f0np0 rx 4096
- Map the retransmissions. Is there a pattern over time? A regular cadence?
- map background traffic
Comparisons
naasc-vs-2, 3, 4, 5
...
- Create na-arc-6 on new naasc-vs-2 (https://support.nrao.edu/show-ticket.php?ticketid=144552)
- Test iperf between ingress_sbox on new na-arc-6 when it is available
- Set ethtool -K em1 gro off perminantly on naasc-vs-4 and document it. How do we do this?
- Double check switch port settings for naasc-vs-2. I am seeing many TCP retransmissions (dhart)
- Check and perhaps replace 10Gb network cable to naas-vs-2. Does that help with TCP retransmissions?
- are the retarnsmissions to naasc-vs-2 causing my wget to na-arc-6 to fail?
- Strawman proposal for reassigning VM guests
- Try increasing Rx buffers (ethtool -G) and see if that helps retransmits
- ethtool -G ens1f0np0 rx 4096
- ethtool -G ens1f0np0 tx 2048
- Setting these didn't seem to help with the TCP Retransmissions.
- ethtool -G ens1f0np0 rx 4096
- Map the retransmissions. Is there a pattern over time? A regular cadence?
- map background traffic
Done
- Recreate na-arc-3 so it gets the same performance as other na-arc-* nodes which is apparently at least 10Gb/s. (pmurphy)
- 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
- 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
- 2022-08-11: Renamed the clone to na-arc-3. We connected it to the swarm successfully, but it had a low connection speed.
- 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes. Performance was still poor.
- 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500. This is different than all the other na-arc-* nodes but it was either that or change the p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500 which my have impacted other VM guests on that host. Performance was now reasonable. 7Gb/s. I was expecting about 9Gb/s but perhaps the 1500 MTU is affecting performance.
- 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
- Launch services on production swarm (sbooth)
- 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
- Test the production docker swarm with a test web interface. (lsharp)
- 2022-08-12: http://almaportal.cv.nrao.edu/
- 2022-08-12 krowe: ran tcpdump on all five na-arc-{1..5} nodes tcpdump dst almaportal and then downloaded a datafile wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar and with each execution of the wget, I could see the next na-arc host report the traffic. This is because the web proxy on almaportal will select the next na-arc node via round-robin. All five nodes were providing about 6KB/s speeds to cvpost-master.
- 2022-08-12 krowe: I did iperf tests from host to host in the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master) and each step the performance was at least 900Mb/s yet downloading with wget was about 0.06Mb/s.
- Ask other ARC if they use MTU 9000 on 10Gb. (krowe)
- JAO uses MTU of 1500
- ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
- 2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes. This seemed to have no affect on performance.
- tc qdisc replace dev eth0 root fq_codel
- 2022-08-25 krowe: Tracy cahnged the following sysctl options on na-arc-5 to match the other VM Hosts. Sadly it seems to have had no effect on wget performance. na-arc-1, na-arc-2, na-arc-4 are 32KB/s while na-arc-3 and na-arc-5 are 45MB/s.
- net.ipv4.conf.all.accept_redirects = 0
- net.ipv4.conf.all.forwarding = 1
- 2022-09-01: Tracy rebooted naasc-vs-5 which hosts na-arc-5 just in case this was necessary for the net.ipv4.conf.all.forwarding sysctl change to take effect. Sadly, no change in performance.
- Why does na-arc-5 still have net.ipv4.conf.all.accept_redirects = 1 even after a reboot while all the other na-arc nodes have this set to 0?
- 2022-09-06 krowe: probably because na-arc-5 didn't reboot when naasc-vs-5 rebooted. I expect it was suspended instead of rebooted. Yet natest-arc-3 and naascweb2-prod were rebooted. I just checked virt-manager and na-arc-5 is hosted by naasc-vs-5. Can we reboote na-arc-5?
- 2022-09-07 krowe: rebooted na-arc-5 and now net.ipv4.conf.all.accept_redirects = 0
- 2022-09-21 cfultz: Replaced the 10Gb network cable on naasc-vs-2. "the cable was nearly bent in half at the router".
...