...
- 2022-09-02 krowe: sysctl -a | grep <10Gb NIC> between naasc-vs-3/naasc-vs-5 and naasc-vs-4 are different
- naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
- 2022-09-02 krowe: sysctl -a on naasc-vs-4 and naasc-vs-5 and found many questionable differences
- naasc-vs-4: net.iw_cm.default_backlog = 256
- Is this because the IB modules are loaded?
- naasc-vs-4: net.rdma_ucm.max_backlog = 1024
- Is this because the IB modules are loaded?
- naasc-vs-4: sunrpc.rdma*
- Is this because the IB modules are loaded?
- naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
- nfnetlink is a module for packet mangling. Could this interfear with the docker swarm networking?
- Though the recorded output rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 300Kb/s.
- And the recorded input rate of naasc-vs-5 is about 500 Mb/s while naasc-vs{3..4} is about 5 Mb/s.
- This is very strange as it seemed naasc-vs-5 was the limiting factor but the switch ports suggest not. Perhaps this data rate is caused by other VM guests on naasc-vs-5 (helpdesk-prod, naascweb2-prod, cartaweb-prod, natest-arc-3, cobweb2-dev)
- naasc-vs-4: net.iw_cm.default_backlog = 256
- 2022-09-06 krowe: ethtool -k <NIC> for naasc-vs-3/naasc-vs-5 are very different from naasc-vs-4.
- hw-tc-offload: off vs hw-tc-offload: on
- rx-gro-hw: off vs rx-gro-hw: on
- rx-vlan-offload: off vs rx-vlan-offload: on
- rx-vlan-stag-hw-parse: off vs rx-vlan-stag-hw-parse: on
- tcp-segmentation-offload: off vs tcp-segmentation-offload: on
- tx-gre-csum-segmentation: off vs tx-gre-csum-segmentation: on
- tx-gre-segmentation: off vs tx-gre-segmentation: on
- tx-gso-partial: off vs x-gso-partial: on
- tx-ipip-segmentation: off vs tx-ipip-segmentation: on
- tx-sit-segmentation: off vs tx-sit-segmentation: on
- tx-tcp-segmentation: off vs tx-tcp-segmentation: on
- tx-udp_tnl-csum-segmentation: off vs tx-udp_tnl-csum-segmentation: on
- tx-udp_tnl-segmentation: off vs tx-udp_tnl-segmentation: on
- tx-vlan-offload: off vs tx-vlan-offload: on
- tx-vlan-stag-hw-insert: off vs tx-vlan-stag-hw-insert: on
- 2022-09-12 krowe: I found the rx and tx buffers for em1 on naasc-vs-4 were 511 while on naasc-vs-2, 3, and 5 were 1024. You can see this with ethtool -g em1. I changed naasc-vs-4 with the following ethtool -G em1 rx 1024 tx 1024 but it didn't change iperf performance.
- 2022-09-12 krowe: I found an article suggesting that gro can make traffic slower when it is enabled. I see that rx-gro-hw is enabled on naasc-vs-4 but disabled on naasc-vs-3 and 5. You can see this with ethtool -k em1 | grep gro.So I disabled it on naasc-vs-4 with ethtool -K em1 gro off and iperf3 tests now show about 2Gb/s both directions!!!
- GRO = Generic Receive Offload. It is hardware on the physical NIC. GRO is an aggregation technique to coalesce several receive packets from a stream into a single large packet, thus saving CPU cycles as fewer packets need to be processed by the kernel.
- https://bugzilla.redhat.com/show_bug.cgi?id=1424076
- https://access.redhat.com/solutions/20278
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/nic-tuning_22/gro-generic-receive-offload.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/ip-forwarding-tunings/nic-tuning_48.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/os-tuning-linux.html
- After disabling rx-gro-hw, I no longer see TCP Retransmission or TCP Out-Of-Order packets when tracing the iperf3 test from na-arc-3 to na-arc-2.
Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-6
(naasc-vs-2)
na-arc-1 4460
2580 4630 2860 3150 na-arc-2 4060
2590 4220 3690 2570 na-arc-3 2710
2580
3080 2770 2920 na-arc-4 1090
3720 2200 2970 3200 na-arc-5 4010
3970 2340 4010 3080 na-arc-6 3380
3060 3060 3010 3080 - This definatly improves performance buth I am still seeing lots of retransmists in the iperf3 tests so perhaps there is more that can be done.
- 2022-09-15 krowe: The VM Hosts have different 10Gb network cards
- naasc-vs-2 uses a Solarflare Communications SFC9220
- naasc-vs-3 uses a Solarflare Communications SFC9020
- naasc-vs-4 uses a Broadcom BCM57412 NetXtreme-E
- naasc-vs-5 uses a Solarflare Communications SFC9020
- 2022-10-07 krowe: bare metal differences
- naasc-vs-2: Dell PowerEdge R7525, dual AMD EPYC 7352 24-Core Processor
- naasc-vs-3: Dell PowerEdge R730, dual Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
- naasc-vs-4: Dell PowerEdge R740, dual Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
- naasc-vs-5: Dell PowerEdge R740, dual Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Questions
- 2022-09-26 krowe: Can someone who is able to login, login to the nodes on the 10.2.120 network and see if those interfaces are showing dropped Rx packets? I would, but I can't login to most of them because CV.
- Why does naasc-vs-4 have all the infiniband modules loaded? I don't see an IB card. naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
- Tracy will look into this
- 2022-10-05 krowe: I don't think this is a significant issue.
- Why is nfnetlink logging enabled on naasc-vs-4? You can see this with cat /proc/net/netfilter/nf_log and lsmod|grep -i nfnet
- nfnetlink is a module for packet mangling. Could this interfear with the docker swarm networking?
- 2022-10-05 krowe: I don't think this is a significant issue.
- why is the eth1 interfaces in all the containers and docker_gwbridge on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range? Does it matter?
- Here are some diffs in sysctl on na-arc nodes. I tried changing na-arc-4 and na-arc-5 to match the others but performance was the same. I then changed all the nodes to match na-arc-{1..3} and still no change in performance. I still don't understand how na-arc-{4..5} got different setttings. I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d but that isn't why these are different.
- na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 1
- na-arc-4, na-arc-5
net.bridge.bridge-nf-call-arptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
- na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
- Why does almaportal use ens3 while almascience uses eth0?
...