...
- I don't think this is because of broadcast noise on the 10Gb/s network (10.2.120.0/24) as I don't see large dropped packet counts on all naasc-vs hosts.
- 2022-09-26 krowe: Interestingly, if I watch the number of packets dropped per minute (I wrote a script) and run the script at the same time on all four naasc-vs hosts, I see patterns. The number of dropped packets each minute is identical between naasc-vs-2 and naasc-vs-4 and hovers around 100. The number of dropped packets each minute is identical between naasc-vs-3 and naasc-vs-5 and hovers around 2. This tells me that naasc-vs-2 and naasc-vs-4 are getting the same traffic and dropping it the same way. What is this traffic?
- 2022-09-26 krowe: I set na-arc-6, the only guest on naasc-vs-2, to drain in docker swarm to see if that reduced the number of dropped packets seen on naasc-vs-2, thinking it was docker swarm creating this traffic. There was no change in the dropped packet rate. It continued to match naasc-vs-4.
- 2022-09-26 krowe: On naasc-vs-3 and naasc-vs-5 I see the dropped packet count per minute at about 2, but every 5 or 6 minutes the count increases to 10 or 11.
- 2022-09-26 krowe: I tried looking at other nodes on the 10Gb 10.2.120.0/24 network but I couldn't log in to most of them. One I could log in to is cv-vs-4, and it is also seeing dropped Rx packets on its 10Gb interface at about the same rate as naasc-vs-3 and naasc-vs-5. This makes me think that these dropped packets have nothing to do with docker swarm. Perhaps there is just something on that network (some misconfigured Windows box or something) that is throwing bad packets around. That doesn't explain the difference in the dropped packet rates though.
- 2022-09-27 krowe: Try clearing the ARP cache on the switch? Perhaps the switch is sending packets to the node for an IP address that is no longer there, for example because the container moved.
- Use tcpdump and sort by destination, looking for the number of dropped packets per minute.
- 2022-10-03 krowe: dhart and thalstead inserted a second 10Gb/s card in naasc-vs-2. This one is supposedly a Solarflare SFN8522 even though Linux detects it as the same model as the original card (Solarflare Communications SFC9220 10/40G Ethernet Controller [1924:0a03]). Tracy configured this card to be the 10Gb/s NIC of naasc-vs-2 (ens2f0np0). I don't know why they didn't just remove the original card and insert the new card, thus requiring no changes to the configuration, but whatever. I am still seeing about 60 dropped packets per minute, and it still matches the dropped packets on naasc-vs-4. So the idea that the original card had some hardware flaw (like bad memory or something) is disproven.
- Later this same day I finally saw 100+ TCP retransmissions per second. Dang. So now we have both TCP retransmissions and a flapping NIC. Sigh.
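
For reference, the per-minute dropped packet watch mentioned in the 2022-09-26 notes could look something like the sketch below. This is not the script that was actually run (that script is not recorded here); it is a minimal illustration that assumes the counter of interest is the kernel's cumulative `rx_dropped` value under `/sys/class/net/<iface>/statistics/`. The function names `read_rx_dropped`, `per_interval_drops`, and `watch` are made up for this example.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Return the cumulative rx_dropped counter for one interface.

    Assumes the Linux sysfs layout <sysfs_root>/<iface>/statistics/rx_dropped.
    """
    counter = Path(sysfs_root, iface, "statistics", "rx_dropped")
    return int(counter.read_text().strip())


def per_interval_drops(samples):
    """Convert cumulative counter samples into drops per sampling interval."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def watch(iface, interval=60, sysfs_root="/sys/class/net"):
    """Print the number of newly dropped Rx packets every `interval` seconds.

    Runs until interrupted (Ctrl-C).
    """
    previous = read_rx_dropped(iface, sysfs_root)
    while True:
        time.sleep(interval)
        current = read_rx_dropped(iface, sysfs_root)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {iface} rx_dropped +{current - previous}")
        previous = current
```

Calling `watch("ens2f0np0")` on naasc-vs-2 (or the matching 10Gb interface name on the other hosts) at roughly the same moment on each host would give timestamped per-minute deltas that can be lined up side by side, which is the kind of comparison described above.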
Comparisons
naasc-vs-2, 3, 4, 5
...