...
- https://ictjira.alma.cl/browse/AES-52
- https://confluence.alma.cl/pages/viewpage.action?pageId=91826715
- You can see poor performance with a command like
- wget --no-check-certificate http://almaportal.cv.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits
- krowe has narrowed this down to the ingress overlay network created by docker swarm which is used to re-route traffic sent to the wrong host.
- On na-arc-2
- nsenter --net=/var/run/docker/netns/ingress_sbox
- iperf -B 10.0.0.21 -s
- On na-arc-3
- nsenter --net=/var/run/docker/netns/ingress_sbox
- iperf3 -B 10.0.0.19 -c 10.0.0.21
- On na-arc-2
- 2022-09-19 krowe: Retransmissions when sending to naasc-vs-5 and even more to naasc-vs-2 but not on naasc-vs-3 or naasc-vs-4
- On naasc-vs-2
- iperf3 -B 10.2.120.107 -s
- On naasc-vs-3 or naasc-vs-4 or naasc-vs-5 or naasc-cont-1 or...
- iperf3 -B <LOCAL_IP> -c 10.2.120.107
- On naasc-vs-2
Diagrams
...
- 2022-09-07 krowe: Doing tcpdumps of iperf3 tests between ingress_sbox namespaces shows that the TCP iperf3 packets are being NATed into UDP packets. So I used iperf3 from across na-arc nodes (not in the ingress_sbox namespaces)
- iperf3 -B <LOCAL IP> -c <REMOTE IP> -u -b 2000000000 -t 100
Table5: iperf3 UDP to/from hosts (% packet loss) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-1 na-arc-2 na-arc-3 na-arc-4 na-arc-5
- iperf3 -B <LOCAL IP> -c <REMOTE IP> -u -b 2000000000 -t 100
- 2022-09-08 krowe: I have tested the other overlay networks (production_agent_network 10.0.1.0/24 and production_default 10.0.2.0/24) and they perform similarly to the ingress overlay network 10.0.0.0/24.
- 2022-09-09 krowe: na-arc-6 is now online served from naasc-vs-2. Here are the iperf3 tests from ingress_sbox to ingress_sbox. When throughput is slow (Kb/s) I see that the congestion window size is reduced from about 1MB to about 2.73KB.
Table6: iperf3 TCP throughput from/to ingress_sbox (Mb/s) na-arc-1
(naasc-vs-4)
na-arc-2
(naasc-vs-4)
na-arc-3
(naasc-vs-3)
na-arc-4
(naasc-vs-4)
na-arc-5
(naasc-vs-5)
na-arc-6
(naasc-vs-2)
na-arc-1 3920 2300 4200 3110 3280 na-arc-2 3950 2630 4000 3350 3530 na-arc-3 0.2 0.3 0.2 2720 2810 na-arc-4 3860 3580 2410 3390 3290 na-arc-5 0.2 0.2 2480 0.2 2550 na-arc-6 0.005 0.005 2790 0.005 3290 - 2022-09-09 krowe: The ingress network (docker mesh) that I have been testing using the ingress_sbox namespace uses a veth interface (this is like a pipe) that connects to its corrosponding veth interface in another namespace on the same host which connects to a vxlan over a bridge in that second namespace. vxlan is a tunneling protocol that uses UDP over port 4789. This is why I am seeing my TCP packets turn into UDP packets. Using tcpdump in the ingress_sbox to watch iperf TCP traffic going from na-arc-2 to na-arc-3 looks clean. Watching traffic going from na-arc-3 to na-arc-2, which is slow (32KB/s), shows lots of TCP Retransmission and TCP Out-Of-Order packets.
- 2022-09-15 krowe: Even with rx-gro-hw=off on naasc-vs-4, I am still seeing some retransmissions in iper3 tests. These are the same as TCP Retransmissions seen previously. On a modern, well-designed network I would expect to see almost no TCP Retransmissions. So this may indicate that there are still improvements to be made. The number of retransmissions seems to vary over time from 0 retransmissions to over a thousand retransmissions on certain directions. This makes me think there is something else using the 10Gb network that is interfering with my tests.
This is a 10 second iper3 test using TCP from the host in the left column to the host in the top row.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off naasc-vs-2
(10.2.120.107)
naasc-vs-3
(10.2.120.109)
naasc-vs-4
(10.2.120.110)
naasc-vs-5
(10.2.120.112)
naasc-vs-2 0, 0, 0 0, 0, 0 45, 52, 59 naasc-vs-3 87, 0, 19, 1734 0, 0, 0 74, 52, 56 naasc-vs-4 0, 342, 1147, 363 0, 0, 0 83, 51, 50 naasc-vs-5 494, 0, 1296, 24 0, 0, 0 0, 0, 0 This looks like some sort of misconfiguration on the receiving ends of naasc-vs-2 and naasc-vs-5.
TableXX iperf3 Retransmissions over 10Gb and rx-gro-hw=off na-arc-1
10.2.97.71
na-arc-2
10.2.97.72
na-arc-3
10.2.97.73
na-arc-4
10.2.97.74
na-arc-5
10.2.97.75
na-arc-6
10.2.97.76
na-arc-1 0, 0, 0 0, 0, 0 0, 0, 0 55, 75, 50 323, 501, 538 na-arc-2 0, 0, 0 0, 0, 0 0, 0, 0 68, 81, 64 768, 1050, 658 na-arc-3 1692, 1627, 2071 0, 1326, 592 1471, 3376, 686 360, 2477, 664 1873, 1872, 2384 na-arc-4 0, 0, 0 0, 0, 0 0, 0, 0 58, 86, 65 4, 9, 38 na-arc-5 108, 6, 6 6, 6, 6 2, 1, 1 6, 6, 6 1293, 1197, 33 na-arc-6 106, 0, 28 0, 0, 21 0, 88, 0 7, 0, 28 89, 75, 52 I don't think the retransmissions from na-arc-3 to na-arc-* can be atributed to MTU. Sure eth0 on na-arc-3 is 1500 while all the other na-arc nodes are 9000 but that should not cause a problem. If anything it sould be a problem the other way around. Also I tested changing na-arc-6 to 1500 and retransmissions didn't change. The lack of retransmissions between na-arc-1, na-arc-2, and na-arc-4 is because they are all on the same VM Host (naasc-vs-4). There does seem to be a problem with na-arc-6 is the receiver.
- You can use ping to see if your packet size actually gets through. This is a good way to test MTU sizes.
ping -c 3 -M do -s 1500 na-arc-1
...