...

  • 2022-09-02 krowe: sysctl -a | grep <10Gb NIC> output on naasc-vs-4 differs from naasc-vs-3/naasc-vs-5
    • naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
  • 2022-09-02 krowe: Compared sysctl -a on naasc-vs-4 and naasc-vs-5 and found many questionable differences
    • naasc-vs-4: net.iw_cm.default_backlog = 256
      • Is this because the IB modules are loaded?
    • naasc-vs-4: net.rdma_ucm.max_backlog = 1024
      • Is this because the IB modules are loaded?
    • naasc-vs-4: sunrpc.rdma*
      • Is this because the IB modules are loaded?
    • naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
      • nfnetlink is a module for packet mangling.  Could this interfere with the docker swarm networking?
    • The recorded output rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 300Kb/s.
    • And the recorded input rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 5 Mb/s.
    • This is very strange, as it seemed naasc-vs-5 was the limiting factor, but the switch ports suggest not.  Perhaps this data rate is caused by other VM guests on naasc-vs-5 (helpdesk-prod, naascweb2-prod, cartaweb-prod, natest-arc-3, cobweb2-dev).
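A quick way to isolate these sysctl differences (a sketch, not the exact commands used; assumes ssh access to the hypervisors) is to sort both dumps and let comm pick out lines unique to one host:

```shell
# Real-world usage against the hypervisors named above:
#   ssh naasc-vs-4 'sysctl -a 2>/dev/null' | sort > vs4.txt
#   ssh naasc-vs-5 'sysctl -a 2>/dev/null' | sort > vs5.txt
# Tiny sample dumps so the technique can be demonstrated locally:
printf 'net.core.somaxconn = 128\nnet.iw_cm.default_backlog = 256\n' | sort > vs4.txt
printf 'net.core.somaxconn = 128\n' > vs5.txt
# Settings present only on the first host (e.g. ones added by the IB modules)
comm -23 vs4.txt vs5.txt
```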
  • 2022-09-06 krowe: ethtool -k <NIC> output on naasc-vs-3 and naasc-vs-5 is very different from naasc-vs-4.
    • hw-tc-offload: off vs hw-tc-offload: on
    • rx-gro-hw: off vs rx-gro-hw: on
    • rx-vlan-offload: off vs rx-vlan-offload: on
    • rx-vlan-stag-hw-parse: off vs rx-vlan-stag-hw-parse: on
    • tcp-segmentation-offload: off vs tcp-segmentation-offload: on
    • tx-gre-csum-segmentation: off vs tx-gre-csum-segmentation: on
    • tx-gre-segmentation: off vs tx-gre-segmentation: on
    • tx-gso-partial: off vs tx-gso-partial: on
    • tx-ipip-segmentation: off vs tx-ipip-segmentation: on
    • tx-sit-segmentation: off vs tx-sit-segmentation: on
    • tx-tcp-segmentation: off vs tx-tcp-segmentation: on
    • tx-udp_tnl-csum-segmentation: off vs tx-udp_tnl-csum-segmentation: on
    • tx-udp_tnl-segmentation: off vs tx-udp_tnl-segmentation: on
    • tx-vlan-offload: off vs tx-vlan-offload: on
    • tx-vlan-stag-hw-insert: off vs tx-vlan-stag-hw-insert: on
  • 2022-09-12 krowe: I found the rx and tx buffers for em1 on naasc-vs-4 were 511 while on naasc-vs-2, 3, and 5 they were 1024.  I changed naasc-vs-4 with ethtool -G em1 rx 1024 tx 1024 but it didn't change iperf performance.
  • 2022-09-12 krowe: I found an article suggesting that gro can make traffic slower when it is enabled.  I see that rx-gro-hw is enabled on naasc-vs-4 but disabled on naasc-vs-3 and 5.  You can see this with ethtool -k em1 | grep gro.  So I disabled it on naasc-vs-4 with ethtool -K em1 gro off and iperf3 tests now show about 2Gb/s in both directions!
    • GRO = Generic Receive Offload, here implemented in hardware on the physical NIC.  GRO is an aggregation technique that coalesces several received packets from a stream into a single large packet, saving CPU cycles because the kernel has fewer packets to process.
    • https://bugzilla.redhat.com/show_bug.cgi?id=1424076
    • After disabling rx-gro-hw, I no longer see TCP Retransmission or TCP Out-Of-Order packets when tracing the iperf3 test from na-arc-3 to na-arc-2.
    • Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)

      from \ to     na-arc-1      na-arc-2      na-arc-3      na-arc-4      na-arc-5      na-arc-6
                    (naasc-vs-4)  (naasc-vs-4)  (naasc-vs-3)  (naasc-vs-4)  (naasc-vs-5)  (naasc-vs-2)
      na-arc-1           -            4460          2580          4630          2860          3150
      na-arc-2         4060             -           2590          4220          3690          2570
      na-arc-3         2710          2580             -           3080          2770          2920
      na-arc-4         1090          3720          2200             -           2970          3200
      na-arc-5         4010          3970          2340          4010             -           3080
      na-arc-6         3380          3060          3060          3010          3080             -

na-arc-1,2,3,4,5: Identical
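The ethtool -K em1 gro off fix above does not survive a reboot.  One way to persist it on a RHEL-style host (a sketch; assumes the initscripts ifcfg files are in use and that this build supports ETHTOOL_OPTS) is:

```shell
# Append to the interface config so gro is turned off at every boot
echo 'ETHTOOL_OPTS="-K em1 gro off"' >> /etc/sysconfig/network-scripts/ifcfg-em1
# Apply now and verify the running state
ethtool -K em1 gro off
ethtool -k em1 | grep generic-receive-offload
```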

...

  • Where is the main docker config (yaml file)?
  • Why, with rx-gro-hw=off on naasc-vs-4, does na-arc-6 see so many retransmissions and small Congestion Window (Cwnd)?
    • [root@na-arc-6 ~]# iperf3 -B 10.0.0.16 -c 10.0.0.21
      Connecting to host 10.0.0.21, port 5201
      [  4] local 10.0.0.16 port 38534 connected to 10.0.0.21 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   302 MBytes  2.54 Gbits/sec  523    207 KBytes
      [  4]   1.00-2.00   sec   322 MBytes  2.70 Gbits/sec  596    186 KBytes
      [  4]   2.00-3.00   sec   312 MBytes  2.62 Gbits/sec  687    245 KBytes
      [  4]   3.00-4.00   sec   335 MBytes  2.81 Gbits/sec  638    278 KBytes
      [  4]   4.00-5.00   sec   309 MBytes  2.60 Gbits/sec  780    146 KBytes

    • [root@na-arc-3 ~]# iperf3 -B 10.0.0.19 -c 10.0.0.21
      Connecting to host 10.0.0.21, port 5201
      [  4] local 10.0.0.19 port 52986 connected to 10.0.0.21 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   309 MBytes  2.59 Gbits/sec  232    638 KBytes
      [  4]   1.00-2.00   sec   358 MBytes  3.00 Gbits/sec    0    967 KBytes
      [  4]   2.00-3.00   sec   351 MBytes  2.95 Gbits/sec    0   1.18 MBytes 
      [  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   74   1.36 MBytes
      [  4]   4.00-5.00   sec   359 MBytes  3.01 Gbits/sec    0   1.54 MBytes
    • Actually, the retransmissions seem to vary quite a lot from one run to another.  That is the more important question.
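To put numbers on that run-to-run variance, iperf3's JSON output can be looped and the retransmit count pulled out per run (a sketch; assumes an iperf3 server is still listening on 10.0.0.21):

```shell
# Loop against the live hosts:
#   for i in 1 2 3 4 5; do
#       iperf3 -J -B 10.0.0.16 -c 10.0.0.21 > run$i.json
#   done
# Extracting the per-run retransmit count from iperf3's JSON output,
# shown here on a minimal sample of that JSON layout:
echo '{"end":{"sum_sent":{"retransmits":523}}}' > sample.json
python3 -c 'import json; print(json.load(open("sample.json"))["end"]["sum_sent"]["retransmits"])'
```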
  • Why does naasc-vs-3 have a br120 in state UNKNOWN?  None of the other naasc-vs nodes have a br120.
  • Why does naasc-vs-4 have all the infiniband modules loaded?  I don't see an IB card.  naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
    • Tracy will look into this
  • Why is nfnetlink logging enabled on naasc-vs-4?  You can see this with cat /proc/net/netfilter/nf_log and lsmod | grep -i nfnet
    • nfnetlink is a module for packet mangling.  Could this interfere with the docker swarm networking?
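If the nfnetlink logger is a suspect, it can be unbound without unloading any modules (a sketch; the .2 suffix is the protocol family, AF_INET, matching the net.netfilter.nf_log.2 entry above):

```shell
# Show which logger each protocol family is currently bound to
cat /proc/net/netfilter/nf_log
# Rebind the IPv4 logger to the built-in default (no nfnetlink_log)
sysctl -w net.netfilter.nf_log.2=NONE
```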
  • Why are the eth1 interfaces in all the containers and docker_gwbridge on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range?  Does it matter?
  • Here are some diffs in sysctl on the na-arc nodes.  I tried changing na-arc-4 and na-arc-5 to match the others but performance was the same.  I then changed all the nodes to match na-arc-{1..3} and still saw no change in performance.  I still don't understand how na-arc-{4..5} got different settings.  I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d but that isn't why these are different.
    • na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
      • net.bridge.bridge-nf-call-arptables = 0

        net.bridge.bridge-nf-call-ip6tables = 0

        net.bridge.bridge-nf-call-iptables = 1

    • na-arc-4, na-arc-5
      • net.bridge.bridge-nf-call-arptables = 1

        net.bridge.bridge-nf-call-ip6tables = 1

        net.bridge.bridge-nf-call-iptables = 1
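For reference, the na-arc-{1..3} values can be applied persistently on a node like this (a sketch; the file name 98-bridge-nf.conf is an assumption, any /etc/sysctl.d/*.conf is read at boot):

```shell
# Write the settings to a sysctl.d drop-in, then load them immediately
cat > /etc/sysctl.d/98-bridge-nf.conf <<'EOF'
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl -p /etc/sysctl.d/98-bridge-nf.conf
```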

  • I see sysctl differences between the natest-arc servers and the na-arc servers.  Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5
    • < #net.ipv4.tcp_tw_recycle = 1
      ---
      > net.ipv4.tcp_tw_recycle = 1
      22,39d21
      < net.ipv4.conf.all.accept_redirects=0
      < net.ipv4.conf.default.accept_redirects=0
      < net.ipv4.conf.all.secure_redirects=0
      < net.ipv4.conf.default.secure_redirects=0
      <
      < #net.ipv6.conf.all.disable_ipv6 = 1
      < #net.ipv6.conf.default.disable_ipv6 = 1
      <
      < # Mellanox recommends the following
      < net.ipv4.tcp_timestamps = 0
      < net.core.netdev_max_backlog = 250000
      <
      < net.core.rmem_default = 16777216
      < net.core.wmem_default = 16777216
      < net.core.optmem_max = 16777216
      < net.ipv4.tcp_mem = 16777216 16777216 16777216
      < net.ipv4.tcp_low_latency = 1
    • If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).

    • If I set all the above sysctl options, except net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change.  It is still about 32KB/s.  Also, I still see ZeroWindow packets.
    • Try rebooting VMs after making changes?
  • I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget.  This is na-arc-5 telling nangas13 to wait because its network buffer is full.
    • Is this because of qdisc pfifo_fast?  No.  krowe changed eth0 to *qdisc fq_codel* and is still seeing ZeroWindow packets.
    • Now that I have moved the rh_download to na-arc-1 and put httpd on na-arc-5 I no longer see ZeroWindow packets on na-arc-5.  But I am seeing them on na-arc-1 which is where the rh_downloader is now.  Is this because the rh_downloader is being stalled talking to something else like httpd and therefore telling nangas13 to wait?
  • Why does almaportal use ens3 while almascience uses eth0?
  • What if we move the rh-downloader container to a different node?  In fact walk it through all five nodes and test.
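One way to walk it through the nodes (a sketch; the service name rh_downloader is an assumption, check docker service ls for the real one) is a placement constraint:

```shell
# Pin the service to one node, run the wget test, then move it along
docker service update --constraint-add 'node.hostname==na-arc-2' rh_downloader
# ...test, then swap the constraint for the next node:
docker service update --constraint-rm 'node.hostname==na-arc-2' \
                      --constraint-add 'node.hostname==na-arc-3' rh_downloader
```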
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not from natest-arc-1?
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       
    • [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
    • Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while natest-arc-3 shows 100Mb/s.
  • Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the expected performance (0.5Gb/s), especially when the reverse tests get expected performance (0.9Gb/s)?
  • Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea?  Sure it makes a fast connection to cvsan but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11)
  • When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like "unknown user 1009".  I get the same errors on the natest-arc-1 container.
  • Does it matter that the na-arc nodes are on 10.2.97.x and their VM host is on 10.2.99.x, while the natest-arc nodes are on 10.2.96.x and their VM hosts (well, 2 out of 3) are also on 10.2.96.x?  Is this why I see cv-6509.cv.nrao.edu when running traceroute from the na-arc nodes?
  • When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.
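A veth can usually be matched to a container through its peer ifindex (a sketch using standard iproute2 and docker commands):

```shell
# The "@ifNN" suffix on the veth names the peer's ifindex in its namespace
idx=$(ip -o link show veth14ce034 | sed -n 's/.*@if\([0-9]*\).*/\1/p')
# Search every running container's network namespace for that ifindex
for c in $(docker ps -q); do
    pid=$(docker inspect --format '{{.State.Pid}}' "$c")
    if nsenter -t "$pid" -n ip -o link | grep -q "^$idx:"; then
        docker inspect --format '{{.Name}}' "$c"
    fi
done
```

If nothing matches, the veth may belong to a hidden swarm namespace such as the ingress_sbox under /var/run/docker/netns rather than to a container.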
  • Why does the httpd container have eth0 (10.0.0.8)?  This is the ingress network.  I don't see any other container with an interface on 10.0.0.0/24.
  • Do we want to use jumbo frames?  If so, some recommend using mtu=8900 and there are a lot of places it needs to be set.
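Before setting mtu=8900 everywhere, the path can be probed end to end with do-not-fragment pings (a sketch; ping -s takes the ICMP payload size, i.e. the MTU minus 28 bytes of IP+ICMP headers):

```shell
# 8900 - 20 (IP header) - 8 (ICMP header) = 8872 byte payload
ping -M do -s 8872 -c 3 nangas13
# "Frag needed" errors instead of replies mean some hop won't pass
# jumbo frames.  The per-NIC change itself (also needed on bridges,
# bonds, and VM guests along the path):
ip link set dev em1 mtu 8900
```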

...