Poor Download Performance
- Permanently set ethtool -K em1 gro off on naasc-vs-4
TL;DR: ethtool -K em1 gro off needs to be permanently set on naasc-vs-4.
This was first reported on 2022-04-18 and documented in https://ictjira.alma.cl/browse/AES-52. What we have seen, and what has been reported, is that sometimes downloads are incredibly slow (tens of kB/s) and sometimes the transfer is closed with data missing from the download. Other times we see perfectly reasonable download speeds (tens of MB/s). This was reproducible with a command like the following:
wget --no-check-certificate http://almascience.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits
Shortly after this report, the almascience portal was redirected from the production docker swarm to the test-prod docker swarm because that swarm produced better download performance, although still not as good as was expected (tens of MB/s). Also, somewhere around this time, the MTU on the production docker swarm nodes was changed from 1500 to 9000.
It was noticed that one of the production docker swarm nodes, na-arc-3, was configured differently than the other na-arc-* nodes:
- ping na-arc-[1,2,4,5] from na-arc-3 with anything larger than -s 1490 drops all packets
- iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120). So it isn't a bad card in either of the VM hosts.
- iptables on na-arc-3 looks different than iptables on na-arc-[2,4,5]. na-arc-1 also looks a bit different.
- docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
- na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.
iperf3 tests between all the na-arc-* nodes showed na-arc-3 was performing about 10e4 times slower on both sending and receiving.
Given the number of issues with na-arc-3, it was decided to just recreate it from a clone of na-arc-2. Also, we changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes. Finally, we changed the MTU of na-arc-3 eth0 from 9000 to 1500. This is different from all the other na-arc-* nodes, but it was either that or change p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which may have impacted other VM guests on that host. This all happened on 2022-08-11, and since then iperf3 tests between all the na-arc-* nodes have shown expected performance.
On 2022-08-12 http://almaportal.cv.nrao.edu/ was created so that we could internally test the production docker swarm nodes in a manner similar to how external users would use it. Now tests could be run on almaportal just like on almascience. E.g.
wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar
On 2022-08-19, naasc-vs-5 lost its heartbeat with the docker swarm, which caused all the swarm services on na-arc-5 to shut down at about 11am Central and move to other na-arc nodes. The reason for this lost heartbeat is unknown, but it could have been user error. After this event, wget tests started downloading at around 100MB/s, which is about what the 1Gb/s link to the NGAS nodes allows (1Gb/s is 125MB/s before protocol overhead). The node na-arc-5 had been running several services, including the rh-download service. So I moved the rh-download service back to na-arc-5 with docker service update --force production_requesthandler_download and found wget performance was back to about 32KB/s. I then moved rh-download from na-arc-5 back to na-arc-2 with docker node update --availability drain na-arc-5 and found wget performance was back to about 100MB/s. I ran the wget test four times to make sure the web proxy walked through all the na-arc nodes. I then moved the httpd service from na-arc-2 to na-arc-5 and found wget performance varied from about 32KB/s to about 100MB/s from test to test. Using wget to access each na-arc node directly, instead of going through the web proxy's round-robin selection process, showed that performance depended on the na-arc node used in the wget command. E.g.
- wget --no-check-certificate http://na-arc-1.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits 32KB/s
- wget --no-check-certificate http://na-arc-2.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits 32KB/s
- wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits 100MB/s
- wget --no-check-certificate http://na-arc-4.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits 32KB/s
- wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits 100MB/s
This was a huge breakthrough because now we could see both the poor performance that users were seeing before the almascience portal was redirected and the desired and expected performance. It also implicated naasc-vs-4 as the problem, since na-arc-1, na-arc-2, and na-arc-4 were all hosted on naasc-vs-4.
On 2022-08-31 we learned how to perform iperf3 tests over the docker swarm overlay network known as ingress. This is the network docker swarm uses to redirect traffic sent to the wrong host. You can do this by logging into a docker swarm node like na-arc-1 and starting a shell in the ingress_sbox namespace like so:
nsenter --net=/var/run/docker/netns/ingress_sbox
From there you can use ip -c addr show to see the IPs and interfaces of the ingress network namespace on that node. You can also use iperf3 to test this ingress network. Here are the results for our nodes. The values are rounded for simplicity. Hosts across the top row are receiving while hosts along the left column are transmitting. You can see that na-arc-3 and na-arc-5 show poor performance when transmitting to na-arc-1, na-arc-2, and na-arc-4. This seems to implicate either naasc-vs-4 as the culprit, or na-arc-3 and na-arc-5 or their VM hosts as the culprits. We weren't sure.
Table3: iperf3 to/from ingress_sbox (Mb/s)

| | na-arc-1 10.0.0.2 | na-arc-2 10.0.0.21 | na-arc-3 10.0.0.19 | na-arc-4 10.0.0.5 | na-arc-5 10.0.0.6 |
|---|---|---|---|---|---|
| na-arc-1 | | 4,000 | 2,000 | 4,000 | 3,000 |
| na-arc-2 | 4,000 | | 2,000 | 4,000 | 3,000 |
| na-arc-3 | 0.3 | 0.3 | | 0.3 | 3,000 |
| na-arc-4 | 4,000 | 4,000 | 2,000 | | 3,000 |
| na-arc-5 | 0.3 | 0.3 | 2,000 | 0.3 | |

On 2022-09-09 a sixth docker swarm node was created (na-arc-6) on a new VM host (naasc-vs-2). We ran iperf3 tests again over the ingress network and found the following:
Table6: iperf3 TCP throughput from/to ingress_sbox (Mb/s)

| | na-arc-1 (naasc-vs-4) | na-arc-2 (naasc-vs-4) | na-arc-3 (naasc-vs-3) | na-arc-4 (naasc-vs-4) | na-arc-5 (naasc-vs-5) | na-arc-6 (naasc-vs-2) |
|---|---|---|---|---|---|---|
| na-arc-1 | | 3920 | 2300 | 4200 | 3110 | 3280 |
| na-arc-2 | 3950 | | 2630 | 4000 | 3350 | 3530 |
| na-arc-3 | 0.2 | 0.3 | | 0.2 | 2720 | 2810 |
| na-arc-4 | 3860 | 3580 | 2410 | | 3390 | 3290 |
| na-arc-5 | 0.2 | 0.2 | 2480 | 0.2 | | 2550 |
| na-arc-6 | 0.005 | 0.005 | 2790 | 0.005 | 3290 | |

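One way to pick the pattern out of Table6 is to group the slow measurements by VM host rather than by swarm node. The following is only a sketch: the values and the node-to-host mapping are copied from Table6, while the 1 Mb/s cut-off and the function name are assumptions made for illustration.

```python
# Node-to-VM-host mapping, taken from the column headers of Table6.
HOST = {
    "na-arc-1": "naasc-vs-4", "na-arc-2": "naasc-vs-4", "na-arc-3": "naasc-vs-3",
    "na-arc-4": "naasc-vs-4", "na-arc-5": "naasc-vs-5", "na-arc-6": "naasc-vs-2",
}

# THROUGHPUT[sender][receiver] in Mb/s, copied from Table6.
# The diagonal (a node sending to itself) was not measured.
THROUGHPUT = {
    "na-arc-1": {"na-arc-2": 3920, "na-arc-3": 2300, "na-arc-4": 4200, "na-arc-5": 3110, "na-arc-6": 3280},
    "na-arc-2": {"na-arc-1": 3950, "na-arc-3": 2630, "na-arc-4": 4000, "na-arc-5": 3350, "na-arc-6": 3530},
    "na-arc-3": {"na-arc-1": 0.2, "na-arc-2": 0.3, "na-arc-4": 0.2, "na-arc-5": 2720, "na-arc-6": 2810},
    "na-arc-4": {"na-arc-1": 3860, "na-arc-2": 3580, "na-arc-3": 2410, "na-arc-5": 3390, "na-arc-6": 3290},
    "na-arc-5": {"na-arc-1": 0.2, "na-arc-2": 0.2, "na-arc-3": 2480, "na-arc-4": 0.2, "na-arc-6": 2550},
    "na-arc-6": {"na-arc-1": 0.005, "na-arc-2": 0.005, "na-arc-3": 2790, "na-arc-4": 0.005, "na-arc-5": 3290},
}

SLOW_MBPS = 1.0  # hypothetical cut-off; anything below this counts as "slow"


def slow_host_pairs(throughput, host, threshold=SLOW_MBPS):
    """Return the set of (sender VM host, receiver VM host) pairs with a slow result."""
    pairs = set()
    for sender, row in throughput.items():
        for receiver, mbps in row.items():
            if mbps < threshold:
                pairs.add((host[sender], host[receiver]))
    return pairs


pairs = slow_host_pairs(THROUGHPUT, HOST)
print("slow (sender host, receiver host) pairs:", sorted(pairs))
print("receiving hosts involved:", sorted({rx for _, rx in pairs}))
```

With the Table6 values, every slow pair has naasc-vs-4 on the receiving side and a different VM host on the sending side. Traffic between two nodes that are both on naasc-vs-4 is not slow.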
Seeing na-arc-6 also performing poorly when transmitting to nodes on naasc-vs-4 told us that there is something wrong with the receive end of naasc-vs-4. So we started to look at network settings in the kernel (sysctl) and network hardware features (ethtool -k). We found that the Network Interface Card (NIC) on naasc-vs-4 was very different from those on the other naasc-vs hosts:
- naasc-vs-2 uses a Solarflare Communications SFC9220
- naasc-vs-3 uses a Solarflare Communications SFC9020
- naasc-vs-4 uses a Broadcom BCM57412 NetXtreme-E
- naasc-vs-5 uses a Solarflare Communications SFC9020
There were some sysctl settings that were suspicious:
- naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
- naasc-vs-4: net.iw_cm.default_backlog = 256 Is this because the IB modules are loaded?
- naasc-vs-4: net.rdma_ucm.max_backlog = 1024 Is this because the IB modules are loaded?
- naasc-vs-4: sunrpc.rdma* Is this because the IB modules are loaded?
- naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
But the real breakthrough was in the NIC features. You can see them with ethtool -k <NIC>. There were many differences but we found that naasc-vs-4 had rx-gro-hw: on while all the other naasc-vs hosts had it set to off. This feature is for Generic Receive Offload. It is hardware on the physical NIC. GRO is an aggregation technique to coalesce several receive packets from a stream into a single large packet, thus saving CPU cycles as fewer packets need to be processed by the kernel. The Solarflare cards don't have this feature. I found articles suggesting that GRO can make traffic slower when it is enabled, especially when using vxlan which the docker swarm ingress network uses.
- https://bugzilla.redhat.com/show_bug.cgi?id=1424076
- https://access.redhat.com/solutions/20278
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/nic-tuning_22/gro-generic-receive-offload.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/ip-forwarding-tunings/nic-tuning_48.html
- https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/ethernet-nic-controllers/bcm957xxx/adapters/Tuning/tcp-performance-tuning/os-tuning-linux.html
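Comparing the ethtool -k output of two hosts by eye is tedious because the list of features is long. The sketch below parses saved ethtool -k output and reports only the features that differ. The function names are hypothetical, and the two sample strings are short illustrative excerpts, not output captured from the naasc-vs hosts.

```python
def parse_ethtool_features(text):
    """Parse the output of `ethtool -k <NIC>` into {feature: "on" | "off"}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            # Skips blank lines and the "Features for <NIC>:" header line.
            continue
        # Keep only the state; drop annotations such as "[fixed]".
        features[name.strip()] = value.split()[0]
    return features


def feature_differences(a, b):
    """Return {feature: (state_a, state_b)} for every feature whose state differs."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Illustrative excerpts only (assumed, not captured from the real hosts).
SAMPLE_VS4 = "Features for em1:\ngeneric-receive-offload: on\nrx-gro-hw: on\n"
SAMPLE_VS3 = "Features for p5p1:\ngeneric-receive-offload: on\nrx-gro-hw: off [fixed]\n"

print(feature_differences(parse_ethtool_features(SAMPLE_VS4),
                          parse_ethtool_features(SAMPLE_VS3)))
```

In practice the two inputs would be the saved output of ethtool -k from each host.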
On 2022-09-16 we disabled this feature on naasc-vs-4 with ethtool -K em1 gro off and iperf3 tests now show between 1Gb/s and 4Gb/s in both directions.
Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)

| | na-arc-1 (naasc-vs-4) | na-arc-2 (naasc-vs-4) | na-arc-3 (naasc-vs-3) | na-arc-4 (naasc-vs-4) | na-arc-5 (naasc-vs-5) | na-arc-6 (naasc-vs-2) |
|---|---|---|---|---|---|---|
| na-arc-1 | | 4460 | 2580 | 4630 | 2860 | 3150 |
| na-arc-2 | 4060 | | 2590 | 4220 | 3690 | 2570 |
| na-arc-3 | 2710 | 2580 | | 3080 | 2770 | 2920 |
| na-arc-4 | 1090 | 3720 | 2200 | | 2970 | 3200 |
| na-arc-5 | 4010 | 3970 | 2340 | 4010 | | 3080 |
| na-arc-6 | 3380 | 3060 | 3060 | 3010 | 3080 | |

Poorer vxlan performance than expected
Doing iperf3 tests between na-arc nodes using the ingress overlay vxlan network created by docker swarm shows between 1Gb/s and 4Gb/s over a 10Gb/s network. This is at best about half the performance I would expect. Granted, there is a performance hit for using vxlan, but I would expect that to be around the 10% range, meaning I would still expect about 8Gb/s.
While I can't say this poor performance is indicative of a problem, it could become a problem with increased load or a faster link to NGAS, and it certainly will make debugging more difficult when there is a problem. I suggest doing another benchmark test once the TCP retransmissions and dropped packets have been resolved.
TCP retransmissions
The newest NAASC VM host (naasc-vs-2) often shows over 100 TCP retransmissions per second when doing iperf3 tests. Other nodes like naasc-vs-3 and naasc-vs-4 show 0 TCP retransmissions per second. While I can't say these TCP retransmissions are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these TCP retransmissions be found and resolved before naasc-vs-2 is put into production. The retransmissions can be seen as follows:
- On naasc-vs-2 run iperf3 -B 10.2.120.107 -s
- On another naasc-vs host like naasc-vs-4 run iperf3 -B 10.2.120.110 -c 10.2.120.107
- Note the Retr column in the output
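Reading the Retr column by eye works for a single run. For repeated runs it may be easier to have the iperf3 client write JSON (iperf3 --json) and extract the retransmission count from it. The sketch below assumes the JSON layout produced by iperf3 3.x for a TCP test on Linux, where the sender summary is under end.sum_sent and each interval summary is under intervals[].sum. The function names are hypothetical.

```python
import json


def load_report(path):
    """Load a report saved with e.g. `iperf3 --json -c <server> > report.json`."""
    with open(path) as fh:
        return json.load(fh)


def retransmits_per_second(report):
    """Average TCP retransmissions per second over the whole test (sender side)."""
    sent = report["end"]["sum_sent"]
    return sent["retransmits"] / sent["seconds"]


def retransmits_per_interval(report):
    """Retransmission count for each reporting interval, in order."""
    return [interval["sum"]["retransmits"] for interval in report["intervals"]]


# Hand-written example report containing only the fields used above.
example = {
    "intervals": [{"sum": {"retransmits": 130}}, {"sum": {"retransmits": 110}}],
    "end": {"sum_sent": {"retransmits": 240, "seconds": 2.0}},
}
print(retransmits_per_second(example))
print(retransmits_per_interval(example))
```

The example dictionary is made up to show the shape of the data. Real values would come from load_report.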
MTU
- Make all MTUs 1500
At some point the Maximum Transmission Unit (MTU) for ethernet frames on the production servers was changed from 1500 to 9000. This is a common technique to improve performance in certain situations. But in order to benefit from a 9000 MTU, all ethernet devices in the data path must be set to a 9000 MTU. Simply changing the interfaces on the naasc-vs and na-arc nodes is not enough. All the NGAS nodes, docker containers, and namespaces in the data path must also be changed. This means recreating the entire ingress overlay network, among other changes. Also, since it is unlikely the end user is going to have an MTU of 9000, there is little advantage in setting an MTU of 9000 if your goal is to improve download speeds to the user. Finally, because of the overhead of vxlan, an MTU of 8900 would be better than 9000. I suggest leaving the MTU at the default 1500 until there is good evidence that a larger MTU is an improvement.
- https://en.wikipedia.org/wiki/Maximum_transmission_unit
- https://docs.docker.com/engine/swarm/networking/
- https://vswitchzero.com/2018/08/02/jumbo-frames-and-vxlan-performance/
- https://vswitchzero.com/2017/09/26/vmxnet3-rx-ring-buffer-exhaustion-and-packet-loss/
- https://engineering.telefonica.com/maximizing-performance-in-vxlan-overlay-networks-ec35ebe29440
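The two MTU arguments above can be written as simple arithmetic. This is a sketch under stated assumptions: an IPv4 underlay with no extra VLAN tag inside the tunnel, so vxlan encapsulation costs 50 bytes (outer IPv4 20, UDP 8, VXLAN 8, inner Ethernet 14). The function names and the example hop list are hypothetical.

```python
# Bytes added inside the underlay MTU by vxlan encapsulation (IPv4 underlay assumed):
# outer IPv4 header + UDP header + VXLAN header + inner Ethernet header.
VXLAN_OVERHEAD_BYTES = 20 + 8 + 8 + 14


def overlay_mtu(underlay_mtu, overhead=VXLAN_OVERHEAD_BYTES):
    """Largest MTU an overlay interface can use without fragmenting on the underlay."""
    return underlay_mtu - overhead


def path_mtu(hop_mtus):
    """The usable MTU of a path is set by its smallest hop."""
    return min(hop_mtus)


print(overlay_mtu(1500))  # overlay MTU over a standard 1500-byte underlay
print(overlay_mtu(9000))  # overlay MTU over a 9000-byte underlay

# Hypothetical path: two jumbo-frame hops followed by an end user at 1500.
print(path_mtu([9000, 9000, 1500]))
```

Under these assumptions a 9000-byte underlay leaves at most 8950 bytes for the overlay, so the 8900 suggested above includes some extra headroom. The last line illustrates why a single 1500-byte hop, such as the end user's connection, caps the whole path at 1500.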
Dropped packets
Some of the NAASC VM hosts show a high number of dropped Rx packets. The rate ranges from 2 to over 100 per minute. This is really unacceptable on a modern, well-designed network. While I can't say these dropped packets are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these dropped packets be found and resolved before naasc-vs-2 is put into production.
Further tests show patterns. It looks like the same packets may be being dropped on both naasc-vs-2 and naasc-vs-4, since they report almost the same dropped packet rate. For example, I wrote a simple script to print dropped packets per time interval and ran it at the same time on all four naasc-vs hosts. You can see that naasc-vs-2 and naasc-vs-4 show a similar pattern, while naasc-vs-3 and naasc-vs-5 show a different pattern.
Dropped packets per 10 second interval

| naasc-vs-2 | naasc-vs-3 | naasc-vs-4 | naasc-vs-5 |
|---|---|---|---|
| 30 | 0 | 30 | 0 |
| 22 | 0 | 24 | 0 |
| 13 | 1 | 11 | 1 |
| 9 | 0 | 9 | 0 |
| 8 | 0 | 8 | 0 |
| 12 | 1 | 12 | 1 |

I don't think these dropped packets are viewable with tcpdump. At least I haven't seen a set of packets in a tcpdump that matches the number of dropped packets. I suppose there may be more than one type of packet being dropped, but that is very difficult to tell. An old ARP cache on some host or switch could cause this. I am sure there are other possibilities.
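The script used for the per-interval drop counts above is not reproduced here. A minimal sketch of the same idea, assuming the Linux per-interface counter at /sys/class/net/<iface>/statistics/rx_dropped is the number of interest, could look like this. The function names are hypothetical, and em1 is used only because it is the interface named earlier for naasc-vs-4.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Read the cumulative dropped-Rx-packet counter for one interface (Linux)."""
    return int(Path(sysfs_root, iface, "statistics", "rx_dropped").read_text())


def interval_deltas(samples):
    """Turn cumulative counter samples into per-interval counts."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def watch(iface, interval=10, count=6, read=read_rx_dropped, sleep=time.sleep):
    """Print and return the newly dropped Rx packets for each interval."""
    samples = [read(iface)]
    for _ in range(count):
        sleep(interval)
        samples.append(read(iface))
        print(samples[-1] - samples[-2])
    return interval_deltas(samples)


# Example with made-up cumulative counter values:
print(interval_deltas([100, 130, 152, 165]))
```

Calling watch("em1") at the same moment on each host would give columns that can be compared side by side, as in the table above.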
Documentation
- Document how to create a VM, especially for NAASC Archive
- Document how to create a docker swarm node for NAASC Archive
The NAASC doesn't have a documented procedure for creating a VM guest nor making it a docker swarm node. This needs to be documented so that the creation of such nodes can be repeated without error or change. Alvaro's documentation is a good start but far from sufficient. https://confluence.alma.cl/display/OFFLINE/Documentation
This to-be-written documentation should include one-off settings like ethtool -K em1 gro off.
Also, I think it would be useful for each ARC to document their archive system. This would help other ARCs when they are having problems as well as help all the ARCs be as similar as is feasible. Below is an attempt at a diagram of the NAASC archive system. It is meant as an example not an accurate diagram of every container, interface, namespace, etc.
Upgrade swarm to meet ALMA requirements
- Implement the strawman plan to make NAASC Archive nodes meet ALMA requirements
According to Alvaro's document https://confluence.alma.cl/display/OFFLINE/Documentation docker swarm nodes should have a minimum of 16 cores and 32GB of memory. None of the production docker swarm nodes meet this requirement. There is a plan to address this.
Monitoring
- Configure naasc-vs hosts and na-arc guests for ganglia monitoring
Consistent Hardware
The VM hosts used at the NAASC are of various hardware. This led to the largest performance issue, the GRO feature on naasc-vs-4. I suggest making hardware as consistent as possible to avoid such issues in the future.
NGAS network limit
There has been much effort to put the docker swarm nodes on a 10Gb/s network, yet the links to the NGAS nodes are only 1Gb/s. This means that even though there could be a 10Gb/s connection between the docker swarm nodes and the download site of the archive user, it will still be limited to 1Gb/s.
RHEL8 shortcomings
The version of RHEL8 installed on naasc-vs-2 seems to be some small subset of the full RHEL8 distribution. For over a decade, NRAO installed all packages that came with the Operating System because disk space is cheap and we might need tools like iperf3 or dropwatch or tcpretrans.
ARC benchmarks
I think it would be worthwhile for each ARC to benchmark their download performance. This should be done regularly (weekly, monthly, quarterly, etc.) and using as similar a procedure at each ARC as possible. This will provide two useful sets of data: 1. it will show when performance has dropped at an ARC, hopefully before users start complaining, and 2. it will provide a history of benchmarks to measure current benchmarks against. A simple wget script could be used to do this and shared among the ARCs. E.g.
- Use wget to download a file (the same file) from each ARC
wget --no-check-certificate https://almascience.nrao.edu/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
- Use iperf3 to test throughput over the docker swarm overlay ingress network (vxlan).
  - Log in to a docker swarm node as root (e.g. na-arc-1) and run the following to get a shell in the ingress_sbox namespace
    - nsenter --net=/var/run/docker/netns/ingress_sbox
  - Find the IP address of the ingress-endpoint. It should look something like "IPv4Address": "10.0.0.2/24"
    - docker network inspect ingress | grep "ingress-endpoint" -A 4
    - You can also run ip -c addr in the namespace to see the IPs of the ingress network
  - Run iperf3 in server mode in this namespace
    - iperf3 -B <IP ADDR> -s
    - e.g. iperf3 -B 10.0.0.2 -s
  - Log in to each of the other docker swarm nodes, get a shell in their ingress_sbox namespaces, find their IPs and run iperf3
    - iperf3 -B <IP ADDR> -c <REMOTE IP ADDR>
    - e.g. iperf3 -B
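The wget download benchmark suggested above could also be scripted so that each run appends to a history file. The sketch below is one possible shape for such a script, not an agreed ARC procedure. The function names and the CSV layout are assumptions. Unlike wget --no-check-certificate, urllib verifies certificates by default, and this sketch leaves that default unchanged.

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone


def throughput_mb_per_s(nbytes, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


def timed_download(url, chunk_size=1 << 20, opener=urllib.request.urlopen):
    """Download `url`, discard the data, and return (bytes received, seconds taken)."""
    start = time.monotonic()
    total = 0
    with opener(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start


def record_benchmark(url, history_path, opener=urllib.request.urlopen):
    """Run one download and append (timestamp, url, bytes, seconds, MB/s) to a CSV file."""
    nbytes, seconds = timed_download(url, opener=opener)
    rate = throughput_mb_per_s(nbytes, seconds)
    with open(history_path, "a", newline="") as fh:
        csv.writer(fh).writerow(
            [datetime.now(timezone.utc).isoformat(), url, nbytes, f"{seconds:.1f}", f"{rate:.2f}"]
        )
    return rate
```

Each ARC would call record_benchmark with the same file URL on its own portal and a local history file, for example from a scheduled job. Comparing the newest row with earlier rows then shows whether performance has dropped.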
More effective use of Docker Swarm
The web proxy points each connection to the next na-arc node in a round-robin manner. Each na-arc node runs no more than one copy of each of the docker containers. There are five na-arc nodes. This means that 80% of requests go to the wrong host and have to be re-routed to the correct host using the docker swarm overlay ingress network (vxlan). This seems very inefficient.
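The 80% figure follows directly from the round-robin arithmetic: if a service runs on one of five nodes and requests are spread evenly, four out of five requests land on a node that has to forward them. A small sketch of that calculation, with a hypothetical function name:

```python
def rerouted_fraction(nodes, nodes_running_service=1):
    """Fraction of evenly distributed requests that land on a node without the service."""
    if not 0 < nodes_running_service <= nodes:
        raise ValueError("nodes_running_service must be between 1 and nodes")
    return (nodes - nodes_running_service) / nodes


print(rerouted_fraction(5))      # one copy of the service on five nodes
print(rerouted_fraction(5, 5))   # a copy of the service on every node
```

The second call shows the other extreme: if every node ran a copy of the service, no request would need to cross the ingress overlay network.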