Poor Download Performance
TL;DR: ethtool -K em1 gro off on naasc-vs-4
This was first reported on 2022-04-18 and documented in https://ictjira.alma.cl/browse/AES-52. What we have seen/has been reported is that downloads are sometimes incredibly slow (tens of kB/s) and sometimes the transfer is closed with data missing from the download, while at other times we see perfectly reasonable download speeds (~10 MB/s). This was reproducible with a command like the following:
...
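A hypothetical equivalent, which just times a large download and reports the average rate (the URL here is a placeholder, not the actual file from the report):

```
# Hypothetical reproduction: download a large file, discard it, and report the
# average transfer speed so slow runs (tens of kB/s) are easy to spot.
curl -o /dev/null -w 'average speed: %{speed_download} bytes/s\n' \
    "https://<download-host>/<large-file>"
```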
From there you can use ip -c addr show to see the IPs and interfaces of the ingress network namespace on that node. You can also use iperf3 to test this ingress network (a sketch of the test procedure follows the table below). Here are the results from our nodes, with values rounded for simplicity. Hosts across the top row are receiving while hosts along the left column are transmitting. You can see that na-arc-3 and na-arc-5 show poor performance when transmitting to na-arc-1, na-arc-2, and na-arc-4. This seems to implicate either naasc-vs-4 as the culprit, or na-arc-3 and na-arc-5 (or their VM hosts) as the culprits. We weren't sure.
Table 3: iperf3 to/from ingress_sbox (Mb/s)

| TX \ RX | na-arc-1 (10.0.0.2) | na-arc-2 (10.0.0.21) | na-arc-3 (10.0.0.19) | na-arc-4 (10.0.0.5) | na-arc-5 (10.0.0.6) |
|---|---|---|---|---|---|
| na-arc-1 | | 4,000 | 2,000 | 4,000 | 3,000 |
| na-arc-2 | 4,000 | | 2,000 | 4,000 | 3,000 |
| na-arc-3 | 0.3 | 0.3 | | 0.3 | 3,000 |
| na-arc-4 | 4,000 | 4,000 | 2,000 | | 3,000 |
| na-arc-5 | 0.3 | 0.3 | 2,000 | 0.3 | |
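For reference, a minimal sketch of how these measurements can be taken, assuming the Docker-created ingress_sbox namespace lives under /var/run/docker/netns on each swarm node and using the 10.0.0.x addresses from Table 3:

```
# On the receiving node: start an iperf3 server inside the ingress namespace
nsenter --net=/var/run/docker/netns/ingress_sbox iperf3 -s

# On the transmitting node: check the namespace's addresses, then run the client
nsenter --net=/var/run/docker/netns/ingress_sbox ip -c addr show
nsenter --net=/var/run/docker/netns/ingress_sbox iperf3 -c 10.0.0.2
```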
On 2022-09-09 a sixth docker swarm node (na-arc-6) was created on a new VM host (naasc-vs-2). We ran the iperf3 tests again over the ingress network and found the following:
Table 6: iperf3 TCP throughput from/to ingress_sbox (Mb/s)

| TX \ RX | na-arc-1 (naasc-vs-4) | na-arc-2 (naasc-vs-4) | na-arc-3 (naasc-vs-3) | na-arc-4 (naasc-vs-4) | na-arc-5 (naasc-vs-5) | na-arc-6 (naasc-vs-2) |
|---|---|---|---|---|---|---|
| na-arc-1 | | 3920 | 2300 | 4200 | 3110 | 3280 |
| na-arc-2 | 3950 | | 2630 | 4000 | 3350 | 3530 |
| na-arc-3 | 0.2 | 0.3 | | 0.2 | 2720 | 2810 |
| na-arc-4 | 3860 | 3580 | 2410 | | 3390 | 3290 |
| na-arc-5 | 0.2 | 0.2 | 2480 | 0.2 | | 2550 |
| na-arc-6 | 0.005 | 0.005 | 2790 | 0.005 | 3290 | |
Seeing na-arc-6 also perform poorly when transmitting to nodes on naasc-vs-4 told us that something was wrong with the receive end of naasc-vs-4. So we started to look at kernel network settings (sysctl), the network hardware, and the network hardware features (ethtool -k). We found that the Network Interface Card (NIC) on naasc-vs-4 was very different from the NICs in the other naasc-vs hosts (a sketch for checking this follows the list):
- naasc-vs-2 uses a Solarflare Communications SFC9220
- naasc-vs-3 uses a Solarflare Communications SFC9020
- naasc-vs-4 uses a Broadcom BCM57412 NetXtreme-E
- naasc-vs-5 uses a Solarflare Communications SFC9020
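For reference, the NIC model and the driver behind an interface can be checked with something like the following (a sketch; em1 is the interface name used for naasc-vs-4 later on this page, the other hosts may use a different name):

```
# Identify the physical Ethernet controller and the driver bound to the interface
lspci | grep -i ethernet    # reports e.g. Broadcom BCM57412 or a Solarflare controller
ethtool -i em1              # driver name, firmware version, PCI bus address
```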
There were some sysctl settings that were suspicious (a sketch for comparing them across hosts follows the list):
- naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
- naasc-vs-4: net.iw_cm.default_backlog = 256 Is this because the IB modules are loaded?
- naasc-vs-4: net.rdma_ucm.max_backlog = 1024 Is this because the IB modules are loaded?
- naasc-vs-4: sunrpc.rdma* Is this because the IB modules are loaded?
- naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
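One way to collect and compare these settings across the VM hosts is to dump the sysctl output from each and diff the files; a minimal sketch, assuming ssh access to each naasc-vs host:

```
# Dump all kernel settings from each VM host, then diff any pair of them
for h in naasc-vs-2 naasc-vs-3 naasc-vs-4 naasc-vs-5; do
    ssh "$h" 'sysctl -a 2>/dev/null | sort' > "sysctl-$h.txt"
done
diff sysctl-naasc-vs-3.txt sysctl-naasc-vs-4.txt
```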
But the real breakthrough was in the NIC features, which you can see with ethtool -k <NIC>. There were many differences, but the key one was that naasc-vs-4 had rx-gro-hw: on while all the other naasc-vs hosts had it set to off. This feature is Generic Receive Offload (GRO) implemented in hardware on the physical NIC. GRO is an aggregation technique that coalesces several received packets from a stream into a single large packet, saving CPU cycles because the kernel has fewer packets to process. The Solarflare cards don't have this feature. I found articles suggesting that GRO can make traffic slower when enabled, especially with VXLAN, which the docker swarm ingress network uses (a sketch for checking the feature follows the links):
- https://bugzilla.redhat.com/show_bug.cgi?id=1424076
- https://access.redhat.com/solutions/20278
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/nic-tuning_22/gro-generic-receive-offload.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/ip-forwarding-tunings/nic-tuning_48.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/os-tuning-linux.html
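A quick way to check the feature on each host (a sketch; em1 is assumed as the interface name, as on naasc-vs-4):

```
# List the GRO-related offload features of the physical interface
ethtool -k em1 | grep -E 'generic-receive-offload|rx-gro-hw'
# naasc-vs-4 reported rx-gro-hw: on, while the other naasc-vs hosts reported off
```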
On 2022-09-16 we disabled this feature on naasc-vs-4 with ethtool -K em1 gro off, and iperf3 tests now show between roughly 1 Gb/s and 4 Gb/s in both directions.
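For completeness, a sketch of applying and verifying the change. Note that settings applied with ethtool -K do not persist across reboots, so the change needs to be reapplied at boot (for example from the host's network configuration):

```
# On naasc-vs-4: disable generic receive offload on the physical NIC; with gro
# off the card's rx-gro-hw feature is also reported as off (hence the
# "rx-gro-hw=off" in the Table 7 caption).
ethtool -K em1 gro off

# Confirm the new feature state
ethtool -k em1 | grep -E 'generic-receive-offload|rx-gro-hw'
```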
Table 7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)

| TX \ RX | na-arc-1 (naasc-vs-4) | na-arc-2 (naasc-vs-4) | na-arc-3 (naasc-vs-3) | na-arc-4 (naasc-vs-4) | na-arc-5 (naasc-vs-5) | na-arc-6 (naasc-vs-2) |
|---|---|---|---|---|---|---|
| na-arc-1 | | 4460 | 2580 | 4630 | 2860 | 3150 |
| na-arc-2 | 4060 | | 2590 | 4220 | 3690 | 2570 |
| na-arc-3 | 2710 | 2580 | | 3080 | 2770 | 2920 |
| na-arc-4 | 1090 | 3720 | 2200 | | 2970 | 3200 |
| na-arc-5 | 4010 | 3970 | 2340 | 4010 | | 3080 |
| na-arc-6 | 3380 | 3060 | 3060 | 3010 | 3080 | |