...
- 2022-09-15 krowe: Even with rx-gro-hw=off on naasc-vs-4, I am still seeing some retransmissions in iperf3 tests. These are the same as the TCP Retransmissions seen previously. On a modern, well-designed network I would expect to see almost no TCP Retransmissions, so this may indicate that there are still improvements to be made. The number of retransmissions seems to vary over time from 0 to over a thousand in certain directions. This makes me think there is something else using the 10Gb network that is interfering with my tests.
These are 10-second iperf3 tests using TCP from the host in the left column to the host in the top row.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off

| Source | naasc-vs-2 (10.2.120.107) | naasc-vs-3 (10.2.120.109) | naasc-vs-4 (10.2.120.110) | naasc-vs-5 (10.2.120.112) |
| --- | --- | --- | --- | --- |
| naasc-vs-2 | | 0, 0, 0 | 0, 0, 0 | 45, 52, 59 |
| naasc-vs-3 | 87, 0, 19, 1734 | | 0, 0, 0 | 74, 52, 56 |
| naasc-vs-4 | 0, 342, 1147, 363 | 0, 0, 0 | | 83, 51, 50 |
| naasc-vs-5 | 494, 0, 1296, 24 | 0, 0, 0 | 0, 0, 0 | |

This looks like some sort of misconfiguration on the receiving ends of naasc-vs-2 and naasc-vs-5. This may be congestion. For example, if I start an iperf3 test from naasc-vs-4 to naasc-vs-2, I often see 0 retransmissions every second. But if, while doing that, I also start an iperf3 test from na-arc-1 (a guest on naasc-vs-4) to na-arc-6 (a guest on naasc-vs-2), I see around 20 to 70 retransmissions per second on both naasc-vs-4 and na-arc-1. They are clearly interfering with each other. I don't see congestion when I reverse the test (naasc-vs-2 to naasc-vs-4 while na-arc-6 to na-arc-1). Doing the same test from naasc-vs-5 to naasc-vs-3 while doing na-arc-5 (a guest on naasc-vs-5) to na-arc-3 (a guest on naasc-vs-3), I don't see any congestion.

If I turn off TSO on naasc-vs-2 with ethtool -K ens1f0np0 tx-tcp-segmentation off, I can no longer create congestion by doing two simultaneous iperf3 tests, but I still get occasional retransmissions (like 100+ per second) when testing from naasc-vs-{3..5} to naasc-vs-2. Disabling TSO also doesn't seem to reduce the number of retransmissions when testing from na-arc-* to na-arc-6.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off

| Source | na-arc-1 10.2.97.71 (naasc-vs-4) | na-arc-2 10.2.97.72 (naasc-vs-4) | na-arc-3 10.2.97.73 (naasc-vs-3) | na-arc-4 10.2.97.74 (naasc-vs-4) | na-arc-5 10.2.97.75 (naasc-vs-5) | na-arc-6 10.2.97.76 (naasc-vs-2) |
| --- | --- | --- | --- | --- | --- | --- |
| na-arc-1 | | 0, 0, 0 | 0, 0, 0 | 0, 0, 0 | 55, 75, 50 | 323, 501, 538 |
| na-arc-2 | 0, 0, 0 | | 0, 0, 0 | 0, 0, 0 | 68, 81, 64 | 768, 1050, 658 |
| na-arc-3 | 1692, 1627, 2071 | 0, 1326, 592 | | 1471, 3376, 686 | 360, 2477, 664 | 1873, 1872, 2384 |
| na-arc-4 | 0, 0, 0 | 0, 0, 0 | 0, 0, 0 | | 58, 86, 65 | 4, 9, 38 |
| na-arc-5 | 108, 6, 6 | 6, 6, 6 | 2, 1, 1 | 6, 6, 6 | | 1293, 1197, 33 |
| na-arc-6 | 106, 0, 28 | 0, 0, 21 | 0, 88, 0 | 7, 0, 28 | 89, 75, 52 | |

- I see a lot of dropped packets on the Rx side of all the naasc-vs hosts.
- I think the large number of retransmissions when transmitting from naasc-vs-* to naasc-vs-2 is the cause of the large number of retransmissions when transmitting from na-arc-* to na-arc-6.
- I don't know what explains the retransmissions when transmitting from na-arc-3 to na-arc-*.
I don't think the retransmissions from na-arc-3 to na-arc-* can be attributed to MTU. Sure, eth0 on na-arc-3 is 1500 while all the other na-arc nodes are 9000, but that should not cause a problem. If anything, it should be a problem the other way around. Also, I tested changing na-arc-6 to 1500 and retransmissions didn't change. The lack of retransmissions between na-arc-1, na-arc-2, and na-arc-4 is because they are all on the same VM Host (naasc-vs-4).
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
ping -c 3 -M do -s 1500 na-arc-1
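When sizing the ping above, note that -s sets only the ICMP payload. With -M do (don't fragment), the packet on the wire is the payload plus a 20-byte IPv4 header and an 8-byte ICMP header, so -s 1500 only gets through if the path MTU is at least 1528. A minimal bash sketch of that arithmetic, assuming IPv4 without options; the function name max_ping_payload is made up for illustration:

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
# Assumes a 20-byte IPv4 header (no IP options) and an 8-byte ICMP header.
max_ping_payload() {
    local mtu=$1
    echo $(( mtu - 20 - 8 ))
}

# Example (not run here): probe whether a 9000-byte MTU path to na-arc-1
# really carries full-size frames.
#   ping -c 3 -M do -s "$(max_ping_payload 9000)" na-arc-1
```

For an MTU of 1500 this gives a payload of 1472, and for an MTU of 9000 it gives 8972.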
- Try increasing Rx buffers (ethtool -G) and see if that helps retransmits
- ethtool -G ens1f0np0 rx 4096
- ethtool -G ens1f0np0 tx 2048
- Setting these didn't seem to help with the TCP Retransmissions.
- 2022-09-28 krowe: Not so fast. I was running a long iperf3 test from naasc-vs-4 to naasc-vs-2 and, while seeing over 100 TCP Retransmissions per second, I ran the following on naasc-vs-2: ethtool -G ens1f0np0 rx 4096. I immediately saw the TCP Retransmissions drop to 0. I put the rx ring parameter back to 1024 and performed the test again and saw an immediate drop to 0. So there may be something here. Perhaps it helps but does not do away with all the TCP Retransmissions. Theoretically, there will always be some TCP Retransmissions over a long enough sample. I then succeeded in two more tests. I think there is something here. Though I can still produce TCP retransmissions by running an iperf3 test from na-arc-4 to na-arc-6 while running a test from naasc-vs-4 to naasc-vs-2.
- 2022-09-28 krowe: But more iperf3 tests with rx=4096 show occasional bouts of over 100 TCP Retransmissions per second. So perhaps increasing the ring parameter just reduces that particular storm but other storms still occur.
- ethtool -G ens1f0np0 rx 4096
- Map the retransmissions. Is there a pattern over time? A regular cadence?
- map background traffic
- Is the fact that APIPA (zeroconf) is configured an indication that some network device was not brought up properly?
- 2022-09-28 krowe: No. APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth which is installed from the network-scripts RPM. This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specifically. It is not installed on any other RHEL8 machine I have checked.
- 2022-09-28 krowe: NO. I think APIPA is a red herring. I created routes on naasc-vs-3 against p5p1, p5p1.120, and br97 and didn't see any TCP retransmissions. I also didn't see any increase in dropped packets. I also removed the routes from naasc-vs-2 and still occasionally see 110 TCP Retransmissions per second.
- What about ethtool -K ens1f0np0 tx-tcp-segmentation off
- 2022-09-26 krowe: didn't make a difference.
- What about net.core.netdev_budget
- https://access.redhat.com/articles/1391433
- dropwatch - Couldn't get it to compile on naasc-vs-2 or any other RHEL8 machine because of missing libraries
- naasc-vs-2 is RHEL8. Could that be the problem?
- Solarflare card is a slightly newer model. Could that be the problem?
- 2022-09-28 krowe: Don't know. CV tried putting the same model card in naasc-vs-3 and 5 but then naasc-vs-2 wouldn't boot.
- Replace the Solarflare card with a different card of the same model. Perhaps this card has bad RAM or somesuch.
- 2022-09-29 krowe: BCC https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_networking/index
- 2022-09-30 krowe: I set all three of these on naasc-vs-2: sysctl -w net.core.rmem_max=26214400 and sysctl -w net.ipv4.tcp_rmem="4096 87380 26214400" and ethtool -G ens1f0np0 rx 4096. I can't say if it reduced the likelihood of TCP Retransmissions but I still saw them eventually.
- 2022-10-03 krowe: dhart and thalstead inserted a second 10Gb/s card in naasc-vs-2. This one is supposedly a Solarflare SFN8522 even though Linux detects it as the same model as the original card (Solarflare Communications SFC9220 10/40G Ethernet Controller [1924:0a03]). Tracy configured this card to be the 10Gb/s NIC of naasc-vs-2 (ens2f0np0). I don't know why they didn't just remove the original card and insert the new card, thus requiring no changes to the configuration, but whatever. My iperf3 tests still show over 100 TCP Retransmissions per second. So the idea that the original card had some hardware flaw (like bad memory or something) is disproven.
- 2022-10-04 krowe: thalstea updated the sfc driver on naasc-vs-2 using dkms to version 5.3.12.1021. Sadly I still see occasional periods of TCP Retransmissions on the order of 100 per second.
- 2022-10-05 krowe: now with the new sfc driver, I set ethtool -G ens1f0np0 rx 4096 but still see TCP Retransmissions.
- 2022-10-05 krowe: I think it is also interesting that when I run my iperf3 tests for 400 seconds (which often produced TCP retransmissions on naasc-vs-2) the Congestion window (Cwnd) never gets above 0.578MB while with tests to naasc-vs-4 the Cwnd gets up to 2.2MB.
- 2022-10-06 krowe: thalstea re-installed naasc-vs-2. It was RHEL8 and it is now RHEL7. Since the re-installation I have been unable to reproduce the 100+ TCP retransmissions per second. However I do see times where the throughput to naasc-vs-2 drops to zero bytes per second. I see reports in /var/log/messages on naasc-vs-2 of the p1p1 NIC's link going down and up at the same times my throughput drops to zero.
- Later this same day I finally saw 100+ TCP retransmissions per second. Dang. So now we have both TCP retransmissions and a flapping NIC. Sigh.
- I increased both Rx and Tx ring buffers on naasc-vs-2 from 1024 to 2048 to see if that helps with TCP retransmissions. It seems that the downgrade to RHEL7 may have reduced the number of TCP retransmissions, but I am still occasionally seeing them. And shortly after doing this I saw the 100+ TCP retransmissions per second.
- 2022-10-12: thalstea tried adding an old solarflare card to naasc-vs-2 but it wouldn't even get past the BIOS. Here is a screenshot.
- 2022-10-12: thalstea upgraded the firmware on the solarflare card (SFN8511) from 6.5.1.1023 to 8.5.0.1002
- 2022-10-12 krowe: Sadly I have still seen periods of 100+ TCP retransmissions per second when doing iperf3 tests. I have noticed that when I see these periods, if I quickly kill and restart the iperf3 test, the TCP retransmissions drop back to 0. So it is possible that these TCP retransmissions are just a side effect of my iperf3 tests and may not happen during normal usage. Also, the throughput doesn't drop below 9Gb/s during these periods, so while I am sad that I don't understand why this happens, I am thinking this may not be as much of an issue as I once feared.
- 2022-10-14 krowe: I noticed that while these TCP retransmissions are happening, if I quickly kill and restart the iperf3 test they go back to zero. This is starting to make me think these TCP retransmissions are only a problem when the network is significantly stressed, and when they do happen it is like a firestorm where the kernel or the NIC lose ground and can't ever catch up. This could mean that we may never see these TCP retransmissions in production, or if we do it will be rare enough to not worry about. I also found another tool, /usr/share/bcc/tools/tcpretrans, which is part of a suite of tools called BPF Compiler Collection (BCC). There are several useful tools in there. With tcpretrans, I can see the TCP retransmissions happening on naasc-vs-3 or naasc-vs-4 or whatever the source host is in my iperf3 tests. This also means we can watch for them in production.
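One way to map the retransmissions over time and look for a cadence is to filter the per-second lines of the iperf3 client output down to the intervals where the Retr column is high. The following is a rough bash/awk sketch, not a tested tool. It assumes the usual iperf3 TCP client text layout, where an interval line has nine whitespace-separated fields after the "[ ID]" prefix (interval, sec, transfer and unit, rate and unit, Retr, Cwnd and unit). The function name retrans_intervals and the threshold argument are made up for illustration.

```shell
# Print "<interval-start-seconds> <retransmits>" for every per-interval line
# of iperf3 TCP client output on stdin whose Retr count is at least the
# threshold given as $1 (default 1).
# Assumption: interval lines look like
#   [  5]   1.00-2.00   sec  1.09 GBytes  9.40 Gbits/sec  112    594 KBytes
# Summary lines (sender/receiver) have fewer fields and are skipped.
retrans_intervals() {
    awk -v min="${1:-1}" '
        { sub(/^\[[^]]*\][ \t]*/, "") }        # drop the "[ ID]" prefix
        NF == 9 && $2 == "sec" && $7 ~ /^[0-9]+$/ && $7 + 0 >= min {
            split($1, t, "-")                   # "1.00-2.00" -> start, end
            print t[1], $7
        }
    '
}
```

For example, iperf3 -c naasc-vs-2 -t 400 | retrans_intervals 100 would list only the seconds with 100 or more retransmissions, which can then be compared against timestamps of background traffic.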
Dropped packets
I see dropped Rx packets on interface ens1f0np0 on naasc-vs-2 at a rate of about 100 packets per minute. You can see this with watch ifconfig ens1f0np0. This is especially interesting given that there isn't much traffic on naasc-vs-2 right now. It is only hosting one VM guest (na-arc-6) and that guest is only running the docker agent container. I am not seeing any dropped packets on na-arc-6.
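Instead of watching ifconfig by eye, the drop rate can be measured by sampling the kernel's per-interface counter twice. This is a minimal bash sketch that reads /sys/class/net/<iface>/statistics/rx_dropped. It assumes that file reflects roughly the same counter ifconfig shows as "dropped", which I have not verified on these hosts. The function name rx_dropped_delta and the SYSFS_NET override variable are made up for illustration.

```shell
# Report how many Rx packets an interface dropped during a sampling window.
# Reads the kernel counter /sys/class/net/<iface>/statistics/rx_dropped.
# SYSFS_NET is a hypothetical override for the sysfs directory; it defaults
# to /sys/class/net.
rx_dropped_delta() {
    local iface=$1 seconds=${2:-60}
    local counter="${SYSFS_NET:-/sys/class/net}/$iface/statistics/rx_dropped"
    local before after
    before=$(cat "$counter") || return 1
    sleep "$seconds"
    after=$(cat "$counter") || return 1
    echo $(( after - before ))
}

# Example (not run here): drops on naasc-vs-2 over one minute.
#   rx_dropped_delta ens1f0np0 60
```

Running this in a loop and logging the result with a timestamp would show whether the drops are steady or arrive in bursts.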
...
- Set ethtool -K em1 gro off permanently on naasc-vs-4 and document it. How do we do this?
- Strawman proposal for reassigning VM guests
- krowe to make tickets for solutions
- 2022-10-05 krowe: Change the NIC Model on natest-arc-3. It is currently rtl8139 instead of virtio and its speed is 100Mb/s instead of 1000Mb/s.
- You can see this with virsh domiflist natest-arc-3 on naasc-vs-5.
- 2022-10-05 krowe: This should be fixed but after the test swarm is no longer acting as the production swarm.
Answers
- Why does iperf show 10Gb/s between na-arc-5 and na-arc-[1,2,4]? How is this possible if the default interface on the respective VM Hosts is 1Gb/s?
- ANSWER: The vnets for the VM guests are tied to the 10Gb/s NICs on the VM hosts not the 1Gb/s NICs.
- Why do natest-arc-{1..3} have 9 veth* interfaces in ip addr show while na-arc-{1..5} don't have any veth* interfaces?
- Each container creates a veth* interface.
- Why does na-arc-3 have such poor network performance to the other na-arc nodes?
- ping na-arc-[1,2,4,5] with anything larger than -s 1490 drops all packets
- iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120). So it isn't a bad card in either of the VM hosts.
- iptables on na-arc-3 looks different than iptables on na-arc-[2,4,5]. na-arc-1 also looks a bit different.
- docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
- na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.
Production docker swarm iperf tests measured in Gb/s.
Columns: na-arc-1 (naasc-vs-4), na-arc-2 (naasc-vs-4), na-arc-3 (naasc-vs-3), na-arc-4 (naasc-vs-4), na-arc-5 (naasc-vs-5)
na-arc-1: 18, 0.002, 20, 10
na-arc-2: 20, 0.002, 20, 10
na-arc-3: 0.002, 0.002, 0.002, 0.002
na-arc-4: 20, 19, 0.002
na-arc-5: 10, 10, 0.002, 10, 10
There is clearly something wrong with na-arc-3.
- ANSWER: Since there were so many problems with na-arc-3, it was decided to recreate it. It was recreated from a clone of na-arc-2.
- Is putting all the 1Gb/s production docker swarm nodes on the same ASIC on the same Fabric Extender of the cv-nexus switch a good idea?
- I am thinking it does not matter because it looks like the production docker swarm nodes use the 10Gb/s network which is on cv-nexus9k
- Can we set up a test archive query that uses the "other" docker swarm which in this case would be the production swarm (na-arc-*)?
- Why are there VLANs on the VM hosts, e.g. em1.97 on naasc-vs-4?
- 2022-08-12 dhart: If you want all of your guest VMs to be on the same subnet as the VM host, then VLAN awareness isn't needed. However, in most cases we want the flexibility of being able to have VM guests on different networks (from one another and/or the VM host), so the VM host is configured with a trunk interface to the network to allow for any VLAN to be passed to the underlying VM guests housed on that VM host machine.
- 2022-08-12 dhart: 10.2.97.x (and 10.2.96.x) = internal VLAN for servers (primarily)
- 10.2.99.x = internal VLAN for server management
- 10.2.120.x = internal VLAN for 10 GE connections
- Where is the main docker config (yaml file)?
- 2022-09-20 krowe: Why does naasc-vs-2 have APIPA configured networks (169.254.0.0)? Aren't these usually created only if there are misconfigured network(s)?
[root@naasc-vs-2 ~]# netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 10.2.99.1 0.0.0.0 UG 0 0 0 eno1
10.2.99.0 0.0.0.0 255.255.255.0 U 0 0 0 eno1
10.2.120.0 0.0.0.0 255.255.255.0 U 0 0 0 ens1f0np0.120
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ens1f0np0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 ens1f0np0.120
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br97
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br101
192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0
- 2022-09-28 krowe: APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth which is installed from the network-scripts RPM. This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specifically. It is not installed on any other RHEL8 machine I have checked.
- 2022-09-26 krowe: Can an older solarflare card (Solarflare Communications SFC9020) replace the card in naasc-vs-2 to see if that helps with the TCP Retransmissions?
- 2022-09-28 krowe: No. When thalstea replaced the card with an old SFC9020 card from cv-vs-1, the machine would not boot. So the original SFC9220 is back in naasc-vs-2. See ticket https://support.nrao.edu/show-ticket.php?ticketid=145153 for details.
- Why can't I download via na-arc-6? I don't think it is properly setup yet.
- wget --no-check-certificate http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
--2022-09-15 10:22:32-- http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
Resolving na-arc-6.cv.nrao.edu (na-arc-6.cv.nrao.edu)... 10.2.97.76
Connecting to na-arc-6.cv.nrao.edu (na-arc-6.cv.nrao.edu)|10.2.97.76|:8088... failed: Connection timed out.
- 2022-09-29 krowe: Apparently docker just needed to be restarted on na-arc-6. Now I can download files via wget at the same rate using na-arc-6 as other na-arc nodes.
- wget --no-check-certificate http://na-arc-6.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
- Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not on natest-arc-1?
[root@na-arc-5 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
1 cv-6509-vlan97.cv.nrao.edu (10.2.97.1) 0.426 ms 0.465 ms 0.523 ms
2 cv-6509.cv.nrao.edu (10.2.254.5) 0.297 ms 0.277 ms 0.266 ms
3 nangas13.cv.nrao.edu (10.2.140.33) 0.197 ms 0.144 ms 0.109 ms
[root@natest-arc-1 ~]# traceroute nangas13
traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
1 cv-6509-vlan96.cv.nrao.edu (10.2.96.1) 0.459 ms 0.427 ms 0.402 ms
2 nangas13.cv.nrao.edu (10.2.140.33) 0.184 ms 0.336 ms 0.311 ms
- Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
- 2022-09-28 krowe: Why was the network-scripts RPM installed on naasc-vs-2? No other RHEL8 machine has this RPM. Was it because nobody knew how to configure vlans and other complicated networking using NetworkManager, which is the new standard in RHEL8?
- 2022-10-05 krowe: Yes. RHEL8 makes bridges and vlans really complicated so Tracy installed the network-scripts RPM and configured things the old way.
- 2022-09-21 krowe: Why are there stuck inventory processes on naasc-vs-2?
- 2022-10-05 krowe: This is an RHEL8 issue, not a network issue. All the RHEL8 machines in CV have this problem.
- Why does naasc-vs-3 have a br120 in state UNKNOWN? none of the other naasc-vs nodes have a br120.
- 2022-10-05 krowe: This is because it is easier to create it and not use it than to not create it.
- Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
- virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
- When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while natest-arc-3 shows 100Mb/s.
- 2022-10-05 krowe: This should be fixed but after the test swarm is no longer acting as the production swarm.
- I think this is just another example of why CV needs good documentation to create VMs
...