...
- 2022-09-15 krowe: Even with rx-gro-hw=off on naasc-vs-4, I am still seeing some retransmissions in iperf3 tests. These are the same as the TCP Retransmissions seen previously. On a modern, well-designed network I would expect to see almost no TCP Retransmissions, so this may indicate that there are still improvements to be made. The number of retransmissions seems to vary over time, from 0 to over a thousand in certain directions. This makes me think there is something else using the 10Gb network that is interfering with my tests.
This is a 10-second iperf3 test using TCP from the host in the left column to the host in the top row.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off

| From \ To  | naasc-vs-2 (10.2.120.107) | naasc-vs-3 (10.2.120.109) | naasc-vs-4 (10.2.120.110) | naasc-vs-5 (10.2.120.112) |
|------------|---------------------------|---------------------------|---------------------------|---------------------------|
| naasc-vs-2 |                           | 0, 0, 0                   | 0, 0, 0                   | 45, 52, 59                |
| naasc-vs-3 | 87, 0, 19, 1734           |                           | 0, 0, 0                   | 74, 52, 56                |
| naasc-vs-4 | 0, 342, 1147, 363         | 0, 0, 0                   |                           | 83, 51, 50                |
| naasc-vs-5 | 494, 0, 1296, 24          | 0, 0, 0                   | 0, 0, 0                   |                           |

This looks like some sort of misconfiguration on the receiving ends of naasc-vs-2 and naasc-vs-5.

This may be congestion. For example, if I start an iperf3 test from naasc-vs-4 to naasc-vs-2, I often see 0 retransmissions every second. But if, while that is running, I also start an iperf3 test from na-arc-1 (a guest on naasc-vs-4) to na-arc-6 (a guest on naasc-vs-2), I see around 20 to 70 retransmissions per second on both naasc-vs-4 and na-arc-1. They are clearly interfering with each other. I don't see congestion when I reverse the test (naasc-vs-2 to naasc-vs-4 while na-arc-6 to na-arc-1). Doing the same test from naasc-vs-5 to naasc-vs-3 while running na-arc-5 (a guest on naasc-vs-5) to na-arc-3 (a guest on naasc-vs-3), I don't see any congestion.

If I turn off TSO on naasc-vs-2 with ethtool -K ens1f0np0 tx-tcp-segmentation off, I can no longer create congestion by running two simultaneous iperf3 tests, but I still get occasional retransmissions (like 100+ per second) when testing from naasc-vs-{3..5} to naasc-vs-2. Disabling TSO also doesn't seem to reduce the number of retransmissions when testing from na-arc-* to na-arc-6.
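The counts in the table above could be collected by a small script instead of being read off the iperf3 summary by hand. The following is a minimal sketch, assuming iperf3 is run with its JSON output option (-J) and that the sender-side total appears under end.sum_sent.retransmits, as it does for TCP tests on Linux. The function names are hypothetical and were not part of the original tests.

```python
import json
import subprocess


def build_iperf3_command(server, seconds=10):
    """Return the argv for a TCP iperf3 client run with JSON output."""
    return ["iperf3", "-c", server, "-t", str(seconds), "-J"]


def total_retransmits(iperf3_json_text):
    """Extract the sender-side retransmission total from `iperf3 -J` output."""
    report = json.loads(iperf3_json_text)
    return report["end"]["sum_sent"]["retransmits"]


def run_once(server, seconds=10):
    """Run one test from this host to `server` and return its retransmits.

    Assumes an iperf3 server (`iperf3 -s`) is already listening on `server`.
    """
    result = subprocess.run(
        build_iperf3_command(server, seconds),
        capture_output=True,
        text=True,
        check=True,
    )
    return total_retransmits(result.stdout)
```

Calling run_once("naasc-vs-2") three times on naasc-vs-4, for example, would produce one cell of the table. Keeping the parsing separate from the run makes it possible to re-check saved JSON output later.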
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off

| From \ To | na-arc-1 10.2.97.71 (naasc-vs-4) | na-arc-2 10.2.97.72 (naasc-vs-4) | na-arc-3 10.2.97.73 (naasc-vs-3) | na-arc-4 10.2.97.74 (naasc-vs-4) | na-arc-5 10.2.97.75 (naasc-vs-5) | na-arc-6 10.2.97.76 (naasc-vs-2) |
|-----------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|----------------------------------|
| na-arc-1  |                                  | 0, 0, 0                          | 0, 0, 0                          | 0, 0, 0                          | 55, 75, 50                       | 323, 501, 538                    |
| na-arc-2  | 0, 0, 0                          |                                  | 0, 0, 0                          | 0, 0, 0                          | 68, 81, 64                       | 768, 1050, 658                   |
| na-arc-3  | 1692, 1627, 2071                 | 0, 1326, 592                     |                                  | 1471, 3376, 686                  | 360, 2477, 664                   | 1873, 1872, 2384                 |
| na-arc-4  | 0, 0, 0                          | 0, 0, 0                          | 0, 0, 0                          |                                  | 58, 86, 65                       | 4, 9, 38                         |
| na-arc-5  | 108, 6, 6                        | 6, 6, 6                          | 2, 1, 1                          | 6, 6, 6                          |                                  | 1293, 1197, 33                   |
| na-arc-6  | 106, 0, 28                       | 0, 0, 21                         | 0, 88, 0                         | 7, 0, 28                         | 89, 75, 52                       |                                  |

- I see a lot of dropped packets on the Rx side of all the naasc-vs hosts.
- I think the large number of retransmissions when transmitting from naasc-vs-* to naasc-vs-2 is the cause of the large number of retransmissions when transmitting from na-arc-* to na-arc-6.
- I don't know what explains the retransmissions when transmitting from na-arc-3 to na-arc-*.
I don't think the retransmissions from na-arc-3 to na-arc-* can be attributed to MTU. Sure, eth0 on na-arc-3 is 1500 while all the other na-arc nodes are 9000, but that should not cause a problem. If anything, it should be a problem the other way around. Also, I tested changing na-arc-6 to 1500 and the retransmissions didn't change. The lack of retransmissions between na-arc-1, na-arc-2, and na-arc-4 is because they are all on the same VM host (naasc-vs-4).
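As a rough check on the MTU reasoning above: with ping's don't-fragment option (-M do), the -s value is the ICMP payload only, so the packet on the wire is 28 bytes larger (20 bytes of IPv4 header without options plus 8 bytes of ICMP header). The sketch below computes the largest payload that fits a given MTU; the helper names are hypothetical and the host name is only an example.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def ping_command(host, mtu, count=3):
    """Build a don't-fragment ping that exactly fills the given MTU."""
    return ["ping", "-c", str(count), "-M", "do",
            "-s", str(max_ping_payload(mtu)), host]


# Payload sizes for the two MTUs mentioned in the text.
print(max_ping_payload(1500))  # 1472
print(max_ping_payload(9000))  # 8972
```

By this arithmetic, -s 1500 produces a 1528-byte packet, which is expected to fail across a 1500-byte MTU and to pass across a 9000-byte one; -s 8972 would probe the full 9000-byte path.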
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
ping -c 3 -M do -s 1500 na-arc-1
- Try increasing Rx buffers (ethtool -G) and see if that helps retransmits
- ethtool -G ens1f0np0 rx 4096
- ethtool -G ens1f0np0 tx 2048
- Setting these didn't seem to help with the TCP Retransmissions.
- 2022-09-28 krowe: Not so fast. I was running a long iperf3 test from naasc-vs-4 to naasc-vs-2 and, while seeing over 100 TCP Retransmissions per second, I ran the following on naasc-vs-2: ethtool -G ens1f0np0 rx 4096. I immediately saw the TCP Retransmissions drop to 0. I put the rx ring parameter back to 1024, performed the test again, and again saw an immediate drop to 0. I then succeeded in two more tests. So there may be something here. Perhaps it helps but does not do away with all the TCP Retransmissions. Theoretically, there will always be some TCP Retransmissions over a long enough sample. I can still produce TCP retransmissions, though, by running an iperf3 test from na-arc-4 to na-arc-6 while running a test from naasc-vs-4 to naasc-vs-2.
- 2022-09-28 krowe: But more iperf3 tests with rx=4096 show occasional bouts of over 100 TCP Retransmissions per second. So perhaps increasing the ring parameter just reduces that particular storm but other storms still occur.
- ethtool -G ens1f0np0 rx 4096
- Map the retransmissions. Is there a pattern over time? A regular cadence?
- map background traffic
- Is the fact that APIPA (zeroconf) is configured an indication that some network device was not brought up properly?
- 2022-09-28 krowe: No. APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth, which is installed from the network-scripts RPM. This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specifically. It is not installed on any other RHEL8 machine I have checked.
- 2022-09-28 krowe: No. I think APIPA is a red herring. I created routes on naasc-vs-3 against p5p1, p5p1.120, and br97 and didn't see any TCP retransmissions. I also didn't see any increase in dropped packets. I also removed the routes from naasc-vs-2 and still occasionally see 110 TCP Retransmissions per second.
- What about ethtool -K ens1f0np0 tx-tcp-segmentation off
- 2022-09-26 krowe: didn't make a difference.
- What about net.core.netdev_budget
- https://access.redhat.com/articles/1391433
- dropwatch - Couldn't get it to compile on naasc-vs-2 or any other RHEL8 machine because of missing libraries
- naasc-vs-2 is RHEL8. Could that be the problem?
- Solarflare card is a slightly newer model. Could that be the problem?
- 2022-09-28 krowe: Don't know. CV tried putting the same model card in naasc-vs-3 and 5 but then naasc-vs-2 wouldn't boot.
- Replace the Solarflare card with a different card of the same model. Perhaps this card has bad RAM or somesuch.
- 2022-09-29 krowe: BCC https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_networking/index
- 2022-09-30 krowe: I set all three of these on naasc-vs-2: sysctl -w net.core.rmem_max=26214400, sysctl -w net.ipv4.tcp_rmem="4096 87380 26214400", and ethtool -G ens1f0np0 rx 4096. I can't say whether it reduced the likelihood of TCP Retransmissions, but I still saw them eventually.
- 2022-10-03 krowe: dhart and thalstead inserted a second 10Gb/s card in naasc-vs-2. This one is supposedly a Solarflare SNF8522 even though Linux detects it as the same model as the original card (Solarflare Communications SFC9220 10/40G Ethernet Controller [1924:0a03]). Tracy configured this card to be the 10Gb/s NIC of naasc-vs-2 (ens2f0np0). I don't know why they didn't just remove the original card and insert the new one, which would have required no changes to the configuration, but whatever. My iperf3 tests still show over 100 TCP Retransmissions per second. So the idea that the original card had some hardware flaw (like bad memory or something) is disproven.
- 2022-10-04 krowe: thalstea updated the sfc driver on naasc-vs-2 using dkms to version 5.3.12.1021. Sadly I still see occasional periods of TCP Retransmissions on the order of 100 per second.
- 2022-10-05 krowe: now with the new sfc driver, I set ethtool -G ens1f0np0 rx 4096 but still see TCP Retransmissions.
- 2022-10-05 krowe: I think it is also interesting that when I run my iperf3 tests for 400 seconds (which often produced TCP retransmissions on naasc-vs-2) the Congestion window (Cwnd) never gets above 0.578MB while with tests to naasc-vs-4 the Cwnd gets up to 2.2MB.
- 2022-10-06 krowe: thalstea re-installed naasc-vs-2. It was RHEL8 and it is now RHEL7. Since the re-installation I have been unable to reproduce the 100+ TCP retransmissions per second. However I do see times where the throughput to naasc-vs-2 drops to zero bytes per second. I see reports in /var/log/messages on naasc-vs-2 of the p1p1 NIC's link going down and up at the same times my throughput drops to zero.
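One way to approach the "map the retransmissions" item in the list above is to look at the per-interval counts of a long iperf3 run and check whether the bursts of 100+ retransmissions per second recur at a regular spacing. The following is a minimal sketch, assuming the test was run with -J and that each entry in the JSON intervals array carries sum.start and sum.retransmits. The function names and the threshold of 100 are illustrative choices.

```python
import json


def retransmits_per_interval(iperf3_json_text):
    """Return (start_time, retransmits) pairs from `iperf3 -J` output."""
    report = json.loads(iperf3_json_text)
    return [(iv["sum"]["start"], iv["sum"]["retransmits"])
            for iv in report["intervals"]]


def burst_starts(series, threshold=100):
    """Times at which a burst begins.

    A burst begins at the first interval at or above `threshold`
    that follows an interval below it (or the start of the test).
    """
    starts = []
    in_burst = False
    for start, count in series:
        if count >= threshold and not in_burst:
            starts.append(start)
        in_burst = count >= threshold
    return starts


def gaps(times):
    """Differences between consecutive burst start times."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

If the gaps between burst starts cluster around one value, that would point toward something periodic on the network; widely scattered gaps would be more consistent with random background traffic.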
Dropped packets
I see dropped Rx packets on interface ens1f0np0 on naasc-vs-2 at a rate of about 100 packets per minute. You can see this with watch ifconfig ens1f0np0. This is especially interesting given that there isn't much traffic on naasc-vs-2 right now. It is only hosting one VM guest (na-arc-6) and that guest is only running the docker agent container. I am not seeing any dropped packets on na-arc-6.
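Rather than watching ifconfig, the drop rate can be sampled from the kernel's per-interface counters. The sketch below assumes the counter is available at /sys/class/net/<interface>/statistics/rx_dropped, which is standard on Linux; whether it matches exactly what ifconfig reports as "dropped" for this driver is an assumption. The function names are hypothetical.

```python
import time
from pathlib import Path


def read_rx_dropped(interface, sysfs_root="/sys/class/net"):
    """Read the cumulative Rx dropped-packet counter for an interface."""
    path = Path(sysfs_root) / interface / "statistics" / "rx_dropped"
    return int(path.read_text().strip())


def drops_per_minute(count_before, count_after, elapsed_seconds):
    """Convert two counter samples into a per-minute drop rate."""
    return (count_after - count_before) * 60.0 / elapsed_seconds


def sample_drop_rate(interface, interval_seconds=60):
    """Sample the counter twice and return dropped packets per minute."""
    before = read_rx_dropped(interface)
    time.sleep(interval_seconds)
    after = read_rx_dropped(interface)
    return drops_per_minute(before, after, interval_seconds)
```

For example, sample_drop_rate("ens1f0np0") run on naasc-vs-2 would be expected to return a value near 100 if the rate described above still holds.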
...