...
TL;DR: ethtool -K em1 gro off needs to be set permanently on naasc-vs-4.
This was first reported on 2022-04-18 and is documented in https://ictjira.alma.cl/browse/AES-52. What we have seen, and what has been reported, is that downloads are sometimes incredibly slow (tens of kB/s) and that the transfer is sometimes closed with data missing from the download. Other times we see perfectly reasonable download speeds (~10 MB/s). This was reproducible with a command like the following:
...
Some of the NAASC VM hosts show lots of dropped Rx packets. The rate ranges from 2 to over 100 per minute. This is really unacceptable on a modern, well-designed network. While I can't say these dropped packets are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these dropped packets be found and resolved.
Further tests show patterns. The same packets may be getting dropped on both naasc-vs-2 and naasc-vs-4, since the two hosts report nearly the same dropped-packet counts. To check this, I wrote a simple script that prints dropped packets per time interval and ran it at the same time on all four naasc-vs hosts. naasc-vs-2 and naasc-vs-4 show a similar pattern, while naasc-vs-3 and naasc-vs-5 show a different pattern.
naasc-vs-2 | naasc-vs-3 | naasc-vs-4 | naasc-vs-5 |
---|---|---|---|
30 | 0 | 30 | 0 |
22 | 0 | 24 | 0 |
13 | 1 | 11 | 1 |
9 | 0 | 9 | 0 |
8 | 0 | 8 | 0 |
12 | 1 | 12 | 1 |
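
The script used to collect the table above is not included here. A minimal sketch of that kind of script might look like the following. It assumes the interface is em1 (taken from the TL;DR), that the counter of interest is the kernel's cumulative rx_dropped value under /sys/class/net, and that the sampling interval is 60 seconds. The function names are hypothetical, and the interval is a guess rather than the one used for the table.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, root="/sys/class/net"):
    """Return the cumulative rx_dropped counter the kernel keeps for iface."""
    return int(Path(root, iface, "statistics", "rx_dropped").read_text())


def per_interval(samples):
    """Turn cumulative counter samples into per-interval differences."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def watch(iface, interval=60.0, count=10, root="/sys/class/net"):
    """Sample the counter count times, printing the drops seen in each interval.

    interval=60.0 is an assumption; the original script's interval is not stated.
    """
    samples = [read_rx_dropped(iface, root)]
    for _ in range(count):
        time.sleep(interval)
        samples.append(read_rx_dropped(iface, root))
        # One line per interval: wall-clock time, interface, packets dropped.
        print(time.strftime("%H:%M:%S"), iface, samples[-1] - samples[-2], flush=True)
    return per_interval(samples)
```

Calling watch("em1") on each host at roughly the same moment gives columns that can be lined up side by side. Because the counter is cumulative, only the differences between samples are meaningful.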
TCP retransmissions
The newest NAASC VM Host (naasc-vs-2) shows over 100 TCP retransmissions per second when doing iperf3 tests. Other nodes like naasc-vs-3 and naasc-vs-4 do not show these at all. While I can't say these TCP retransmissions are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these TCP retransmissions be found and resolved.
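
One way to put a number on retransmissions during an iperf3 run is to sample the kernel's RetransSegs counter before and after the test. The sketch below is an illustration, not the method used for the figure above. It assumes a Linux host exposing /proc/net/snmp, and the function names are hypothetical. Note that RetransSegs is host-wide, so other traffic on the machine is counted as well.

```python
import time


def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp into a name -> value dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    names, values = rows[0], rows[1]  # first line is the header, second the values
    return dict(zip(names, map(int, values)))


def read_retrans_segs(path="/proc/net/snmp"):
    """Return the cumulative count of retransmitted TCP segments for this host."""
    with open(path) as f:
        return parse_tcp_counters(f.read())["RetransSegs"]


def retrans_per_second(seconds=10.0, path="/proc/net/snmp"):
    """Average retransmitted segments per second over the sampling window."""
    before = read_retrans_segs(path)
    time.sleep(seconds)
    after = read_retrans_segs(path)
    return (after - before) / seconds
```

Running retrans_per_second() while iperf3 is active, and again while the host is idle, helps separate retransmissions caused by the test from background traffic.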
...