NAASC Archive Stabilization Solutions

Poor Download Performance

TL;DR: ethtool -K em1 gro off needs to be set permanently on naasc-vs-4.

This was first reported on 2022-04-18 and documented in https://ictjira.alma.cl/browse/AES-52. What we have seen, and what has been reported, is that sometimes downloads are incredibly slow (10s of kB/s) and sometimes the transfer is closed with data missing from the download. Other times we see perfectly reasonable download speeds (~10 MB/s). This was reproducible with a command like the following:

wget --no-check-certificate http://almascience.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits

Shortly after this report, the almascience portal was redirected from the production docker swarm to the test-prod docker swarm because it produced better download performance, although still not as good as was expected (10s of MB/s). Also, somewhere around this time the MTUs on the production docker swarm nodes were changed from 1500 to 9000.

It was noticed that one of the production docker swarm nodes, na-arc-3, was configured differently than the other na-arc-* nodes:

ping na-arc-[1,2,4,5] from na-arc-3 with anything larger than -s 1490 drops all packets
iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120). So it isn't a bad card in either of the VM hosts.
iptables on na-arc-3 looks different than iptables on na-arc-[2,4,5]. na-arc-1 also looks a bit different.
docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.
iperf3 tests between all the na-arc-* nodes showed na-arc-3 was performing about 10e4 times slower on both sending and receiving.
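
The ping observation above can be made more precise by forbidding fragmentation, so that an oversized packet fails instead of being silently split. This is a minimal sketch, assuming the Linux iputils ping (where -M do sets the Don't Fragment bit) and IPv4. The helper names max_ping_payload and probe_mtu are hypothetical and were not part of the original investigation.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
max_ping_payload() {
    echo $(( $1 - 28 ))
}

# Probe a path at a given MTU without allowing fragmentation.
# Usage: probe_mtu <host> <mtu>
probe_mtu() {
    ping -M do -c 3 -s "$(max_ping_payload "$2")" "$1"
}
```

For example, probe_mtu na-arc-5 9000 sends 8972-byte payloads. If any hop in the path has a smaller MTU, the probe fails or times out instead of succeeding through fragmentation.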

Given the number of issues with na-arc-3, it was decided to just recreate it from a clone of na-arc-2. Also, we changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes. Finally, we changed the MTU of na-arc-3 eth0 from 9000 to 1500. This is different than all the other na-arc-* nodes, but it was either that or change the p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which may have impacted other VM guests on that host. This all happened on 2022-08-11, and since then iperf3 tests between all the na-arc-* nodes have shown expected performance.

On 2022-08-12 http://almaportal.cv.nrao.edu/ was created so that we could internally test the production docker swarm nodes in a manner similar to how external users would use it. Now tests could be run on almaportal just like on almascience. E.g.

wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar

On 2022-08-19, naasc-vs-5 lost its heartbeat with the docker swarm, which caused all the swarm services on na-arc-5 to shut down at about 11am Central and move to other na-arc nodes. The reason for this lost heartbeat is unknown, but it could have been user error. After this event, wget tests started downloading at around 100MB/s. The node na-arc-5 had been running several services, including the rh-download service. So I moved the rh-download service back to na-arc-5 with docker service update --force production_requesthandler_download and found wget performance was back to about 32KB/s. I then moved rh-download from na-arc-5 back to na-arc-2 with docker node update --availability drain na-arc-5 and found wget performance was back to about 100MB/s. I ran the wget test four times to make sure the web proxy walked through all the na-arc nodes. I then moved the httpd service from na-arc-2 to na-arc-5 and found that wget performance varied from about 32KB/s to about 100MB/s from test to test. Using wget to access each na-arc node directly, instead of going through the web proxy's round-robin selection process, showed that performance depended on the na-arc node used in the wget command. E.g.

wget --no-check-certificate http://na-arc-1.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits   # 32KB/s
wget --no-check-certificate http://na-arc-2.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits   # 32KB/s
wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits   # 100MB/s
wget --no-check-certificate http://na-arc-4.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits   # 32KB/s
wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits   # 100MB/s
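
The per-node test above can be scripted so that every node is checked the same way. This is a sketch, assuming the same port (8088) and dataPortal path used above. The helper names node_url and test_nodes are hypothetical.

```shell
# Build the direct (proxy-bypassing) download URL for one swarm node.
node_url() {
    echo "http://$1.cv.nrao.edu:8088/dataPortal/$2"
}

# Download the same file from each node in turn, discarding the data.
# wget reports the average transfer rate in its final summary line.
# Usage: test_nodes <file> <node>...
test_nodes() {
    file=$1
    shift
    for node in "$@"; do
        echo "== $node"
        wget --no-check-certificate -O /dev/null "$(node_url "$node" "$file")" 2>&1 | tail -n 2
    done
}
```

For example: test_nodes member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits na-arc-1 na-arc-2 na-arc-3 na-arc-4 na-arc-5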

This was a huge breakthrough because now we could see both the poor performance that users were seeing before the almascience portal was redirected and the desired, expected performance. It also implicated naasc-vs-4 as the problem, since na-arc-1, na-arc-2, and na-arc-4 were all hosted on naasc-vs-4.

On 2022-08-31 we learned how to perform iperf3 tests over the docker swarm overlay network known as ingress. This is the network docker swarm uses to redirect traffic sent to the wrong host. You can do this by logging into a docker swarm node like na-arc-1 and starting a shell in the ingress_sbox namespace like so:

nsenter --net=/var/run/docker/netns/ingress_sbox

From there you can use ip -c addr show to see the IPs and interfaces of the ingress network namespace on that node. You can also use iperf3 to test this ingress network. Here are the results for our nodes. The values are rounded for simplicity. Hosts across the top row are receiving while hosts along the left column are transmitting. You can see that na-arc-3 and na-arc-5 show poor performance when transmitting to na-arc-1, na-arc-2, and na-arc-4. This seems to implicate either naasc-vs-4 as the culprit, or na-arc-3 and na-arc-5 or their VM hosts as the culprits. We weren't sure.

Table3: iperf3 to/from ingress_sbox (Mb/s)
	na-arc-1 10.0.0.2	na-arc-2 10.0.0.21	na-arc-3 10.0.0.19	na-arc-4 10.0.0.5	na-arc-5 10.0.0.6
na-arc-1		4,000	2,000	4,000	3,000
na-arc-2	4,000		2,000	4,000	3,000
na-arc-3	0.3	0.3		0.3	3,000
na-arc-4	4,000	4,000	2,000		3,000
na-arc-5	0.3	0.3	2,000	0.3
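
Measurements like those in Table3 can be collected with a small wrapper around the nsenter command shown above. This is a sketch, not the procedure actually used. It assumes iperf3 is installed on each node, that the commands are run as root, and that an iperf3 server is already listening in the peer's ingress_sbox namespace. The function names and the DRY_RUN switch are hypothetical.

```shell
INGRESS_NS=/var/run/docker/netns/ingress_sbox

# Run a command inside the ingress network namespace.
# With DRY_RUN=1 the command is printed instead of executed.
in_ingress() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "nsenter --net=$INGRESS_NS $*"
    else
        nsenter --net="$INGRESS_NS" "$@"
    fi
}

# Receiver side: start an iperf3 server in the namespace.
ingress_server() { in_ingress iperf3 -s; }

# Sender side: transmit to the ingress IP of another node for 10 seconds.
ingress_client() { in_ingress iperf3 -c "$1" -t 10; }
```

For example, run ingress_server on na-arc-1, then run ingress_client 10.0.0.2 on na-arc-3, where 10.0.0.2 is the ingress IP of na-arc-1 from the table header.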

On 2022-09-09 a sixth docker swarm node was created (na-arc-6) on a new VM host (naasc-vs-2). We ran iperf3 tests again over the ingress network and found the following:

Table6: iperf3 TCP throughput from/to ingress_sbox (Mb/s)
	na-arc-1 (naasc-vs-4)	na-arc-2 (naasc-vs-4)	na-arc-3 (naasc-vs-3)	na-arc-4 (naasc-vs-4)	na-arc-5 (naasc-vs-5)	na-arc-6 (naasc-vs-2)
na-arc-1		3920	2300	4200	3110	3280
na-arc-2	3950		2630	4000	3350	3530
na-arc-3	0.2	0.3		0.2	2720	2810
na-arc-4	3860	3580	2410		3390	3290
na-arc-5	0.2	0.2	2480	0.2		2550
na-arc-6	0.005	0.005	2790	0.005	3290

Seeing na-arc-6 also performing poorly when transmitting to nodes on naasc-vs-4 told us that there is something wrong with the receive end of naasc-vs-4. So we started to look at network settings in the kernel (sysctl), network hardware, and network hardware features (ethtool -k). We found that the Network Interface Card (NIC) on naasc-vs-4 was very different than those on the other naasc-vs hosts:

naasc-vs-2 uses a Solarflare Communications SFC9220
naasc-vs-3 uses a Solarflare Communications SFC9020
naasc-vs-4 uses a Broadcom BCM57412 NetXtreme-E
naasc-vs-5 uses a Solarflare Communications SFC9020

There were some sysctl settings that were suspicious:

naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
naasc-vs-4: net.iw_cm.default_backlog = 256 Is this because the IB modules are loaded?
naasc-vs-4: net.rdma_ucm.max_backlog = 1024 Is this because the IB modules are loaded?
naasc-vs-4: sunrpc.rdma* Is this because the IB modules are loaded?
naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log

But the real breakthrough was in the NIC features. You can see them with ethtool -k <NIC>. There were many differences, but we found that naasc-vs-4 had rx-gro-hw: on while all the other naasc-vs hosts had it set to off. This feature is Generic Receive Offload (GRO) performed in hardware on the physical NIC. GRO is an aggregation technique that coalesces several received packets from a stream into a single large packet, thus saving CPU cycles as fewer packets need to be processed by the kernel. The Solarflare cards don't have this feature. I found articles suggesting that GRO can make traffic slower when it is enabled, especially when using vxlan, which the docker swarm ingress network uses.
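
One way to spot such differences is to save the ethtool -k output from each host and compare the files. This is a sketch, assuming the output has already been saved (for example with ssh naasc-vs-4 ethtool -k em1 > vs4.txt). The helper name feature_diff and the file names are hypothetical.

```shell
# Print features whose state differs between two saved `ethtool -k` outputs.
# Output format: <feature> <state in file A> <state in file B>
feature_diff() {
    awk -F': *' '
        FNR == 1 { next }                  # skip "Features for <NIC>:" (NIC names differ)
        { gsub(/^[ \t]+/, "", $1) }        # sub-features are indented
        NR == FNR { a[$1] = $2; next }     # first file: remember each state
        ($1 in a) && a[$1] != $2 { print $1, a[$1], $2 }
    ' "$1" "$2"
}
```

For example, feature_diff vs4.txt vs3.txt lists only the features that are set differently on the two hosts, which is much easier to read than two full feature listings side by side.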

On 2022-09-16 we disabled this feature on naasc-vs-4 with ethtool -K em1 gro off, and iperf3 tests now show between 1Gb/s and 4Gb/s in both directions.

Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)
	na-arc-1 (naasc-vs-4)	na-arc-2 (naasc-vs-4)	na-arc-3 (naasc-vs-3)	na-arc-4 (naasc-vs-4)	na-arc-5 (naasc-vs-5)	na-arc-6 (naasc-vs-2)
na-arc-1		4460	2580	4630	2860	3150
na-arc-2	4060		2590	4220	3690	2570
na-arc-3	2710	2580		3080	2770	2920
na-arc-4	1090	3720	2200		2970	3200
na-arc-5	4010	3970	2340	4010		3080
na-arc-6	3380	3060	3060	3010	3080
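
Settings changed with ethtool -K do not survive a reboot, which is why the TL;DR asks for the setting to be made permanent. The sketch below only checks the current state, so that the result can be verified after a reboot. The helper name gro_state is hypothetical. How to persist the setting depends on how em1 is managed on naasc-vs-4, which this page does not state. The two options in the comments are common on RHEL-family hosts but are assumptions here.

```shell
# Read `ethtool -k <NIC>` output on stdin and print the state of
# generic-receive-offload ("on" or "off").
gro_state() {
    awk '$1 == "generic-receive-offload:" { print $2 }'
}

# Possible ways to persist the setting (assumptions, untested on naasc-vs-4):
#   ifcfg file:      ETHTOOL_OPTS="-K em1 gro off"
#   NetworkManager:  nmcli connection modify em1 ethtool.feature-gro off
```

Once the setting is persistent, ethtool -k em1 | gro_state should print off after every reboot.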

Poorer vxlan performance than expected

Doing iperf3 tests between na-arc nodes using the ingress overlay vxlan network created by docker swarm shows between 1Gb/s and 4Gb/s over a 10Gb/s network. This is at best about half the performance I would expect. Granted, there is a performance hit for using vxlan, but I would expect that to be in the 10% range, meaning I would still expect about 8Gb/s.

Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)
	na-arc-1 (naasc-vs-4)	na-arc-2 (naasc-vs-4)	na-arc-3 (naasc-vs-3)	na-arc-4 (naasc-vs-4)	na-arc-5 (naasc-vs-5)	na-arc-6 (naasc-vs-2)
na-arc-1		4460	2580	4630	2860	3150
na-arc-2	4060		2590	4220	3690	2570
na-arc-3	2710	2580		3080	2770	2920
na-arc-4	1090	3720	2200		2970	3200
na-arc-5	4010	3970	2340	4010		3080
na-arc-6	3380	3060	3060	3010	3080

TCP retransmissions

The newest NAASC VM Host (naasc-vs-2) often shows over 100 TCP retransmissions per second when doing iperf3 tests. Other nodes like naasc-vs-3 and naasc-vs-4 show 0 TCP retransmissions per second. While I can't say these TCP retransmissions are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these TCP retransmissions be found and resolved.
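
One way to start tracking this down is to sample the kernel's TCP retransmission counter while a test is running. This is a sketch, assuming a Linux host that exposes /proc/net/snmp. The function names are hypothetical, and the counter is host-wide, so retransmissions from other traffic on the host are included.

```shell
# Print the cumulative RetransSegs counter from a /proc/net/snmp-style file.
# The file holds a "Tcp:" line of field names followed by a "Tcp:" line of
# values; the column is located by name rather than by position.
retrans_segs() {
    awk '
        $1 == "Tcp:" && !col { for (i = 2; i <= NF; i++) if ($i == "RetransSegs") col = i; next }
        $1 == "Tcp:" { print $col; exit }
    ' "${1:-/proc/net/snmp}"
}

# Print retransmitted segments per second, once a second, until interrupted.
watch_retrans() {
    prev=$(retrans_segs)
    while sleep 1; do
        cur=$(retrans_segs)
        echo "$(( cur - prev ))"
        prev=$cur
    done
}
```

Running watch_retrans on naasc-vs-2 and on naasc-vs-3 during the same iperf3 test would show whether the retransmissions are tied to the test traffic or are present all the time.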

MTU

At some point the Maximum Transmission Unit (MTU) for ethernet frames on the production servers was changed from 1500 to 9000. This is a common technique to improve performance in certain situations. But in order to benefit from a 9000 MTU, all ethernet devices in the data path must be set to an MTU of 9000. Simply changing the interfaces on the naasc-vs and na-arc nodes is not enough. All the NGAS nodes, docker containers, and namespaces in the data path must also be changed. This means recreating the entire ingress overlay network, among other changes. Also, since it is unlikely the end user is going to have an MTU of 9000, there is little advantage in setting an MTU of 9000 if your goal is to move data to the user faster. Finally, because of the overhead of vxlan, an MTU of 8900 would be better than 9000. I suggest leaving the MTU at the default 1500 until there is good evidence that a larger MTU is an improvement.
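
The vxlan overhead mentioned above can be made concrete. VXLAN over IPv4 adds 50 bytes to every frame (14 inner Ethernet + 8 VXLAN + 8 UDP + 20 outer IPv4), so the MTU inside the overlay has to be at least that much smaller than the MTU of the underlying interface. This is a sketch of the arithmetic only. The helper name vxlan_inner_mtu is hypothetical.

```shell
# Largest MTU usable inside a VXLAN-over-IPv4 overlay for a given
# underlay MTU: subtract 50 bytes of encapsulation overhead.
vxlan_inner_mtu() {
    echo $(( $1 - 50 ))
}
```

vxlan_inner_mtu 1500 gives 1450 and vxlan_inner_mtu 9000 gives 8950, so the 8900 suggested above fits with some margin.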

Dropped packets

Some of the NAASC VM hosts show lots of dropped Rx packets. The rate ranges from 2 to over 100 per minute. This is really unacceptable on a modern, well-designed network. While I can't say these dropped packets are indicative of a problem, they could become a problem with increased load and they certainly will make debugging more difficult when there is a problem. I suggest the reason for these dropped packets be found and resolved.

Further tests show patterns. It looks like the same packets may be being dropped on naasc-vs-2 and naasc-vs-4 as they report the same dropped packet rate. For example, I wrote a simple script to print dropped packets per time interval and ran it at the same time on all four naasc-vs hosts. You can see that naasc-vs-2 and naasc-vs-4 show a similar pattern, while naasc-vs-3 and naasc-vs-5 show a different pattern.

naasc-vs-2	naasc-vs-3	naasc-vs-4	naasc-vs-5
30	0	30	0
22	0	24	0
13	1	11	1
9	0	9	0
8	0	8	0
12	1	12	1

I don't think these dropped packets are viewable with tcpdump. At least I haven't seen a set of packets in a tcpdump that matches the number of dropped packets. I suppose there may be more than one type of packet being dropped, but that is very difficult to tell.
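
A script of the kind described above can be sketched as follows. This is not the original script, which is not reproduced on this page. It assumes a Linux host where the kernel exposes per-interface counters under /sys/class/net. The function names and the SYSFS_NET override are hypothetical.

```shell
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Print the cumulative count of dropped receive packets for an interface.
rx_dropped() {
    cat "$SYSFS_NET/$1/statistics/rx_dropped"
}

# Print the number of newly dropped packets once per interval.
# Usage: watch_drops <interface> [interval in seconds, default 60]
watch_drops() {
    prev=$(rx_dropped "$1")
    while sleep "${2:-60}"; do
        cur=$(rx_dropped "$1")
        echo "$(date +%T) $(( cur - prev ))"
        prev=$cur
    done
}
```

Starting watch_drops with the same interval at the same time on each naasc-vs host gives columns that can be lined up by timestamp, as in the table above.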

Documentation

The NAASC doesn't have a documented procedure for creating a VM guest nor making it a docker swarm node. This needs to be documented so that the creation of such nodes can be repeated without error or change. Alvaro's documentation is a good start but far from sufficient. https://confluence.alma.cl/display/OFFLINE/Documentation

This to-be-written documentation should include one-off settings like ethtool -K em1 gro off.

Also, I think it would be useful for each ARC to document their archive system. This would help other ARCs when they are having problems as well as help all the ARCs be as similar as is feasible.

naasc-archive-network.pdf

Consistent Hardware

The VM hosts used at the NAASC have a variety of hardware. This led to the largest performance issue, the GRO feature on naasc-vs-4. I suggest making hardware as consistent as possible to avoid such issues in the future.

NGAS network limit

There has been much effort to put the docker swarm nodes on a 10Gb/s network, yet the links to the NGAS nodes are only 1Gb/s. This means that even though there could be a 10Gb/s connection between the docker swarm nodes and the download site of the archive user, a download will still be limited to 1Gb/s.

Upgrade swarm to meet ALMA requirements

According to Alvaro's document https://confluence.alma.cl/display/OFFLINE/Documentation docker swarm nodes should have a minimum of 16 cores and 32GB of memory. None of the production docker swarm nodes meet this requirement. There is a plan to address this.

ARC benchmarks

I think it would be worthwhile for each ARC to benchmark their download performance. This should be done regularly (weekly, monthly, quarterly, etc.) and using as similar a procedure at each ARC as possible. This will provide two useful sets of data: (1) it will show when performance has dropped at an ARC, hopefully before users start complaining, and (2) it will provide a history of benchmarks to measure current benchmarks against. A simple wget script could be used to do this and shared among the ARCs. E.g.

wget --no-check-certificate https://almascience.nrao.edu/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
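
Such a benchmark could look like the sketch below, which times one download and appends the result to a CSV file so that a history builds up. The log file name, the function names, and the choice of MB/s (10^6 bytes per second) are assumptions for illustration.

```shell
# Convert a byte count and a duration in seconds to MB/s (two decimals).
throughput_mbs() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b / s / 1000000 }'
}

# Download a URL to a temporary file, time it, and append
# "timestamp,url,bytes,seconds,MB/s" to a log file.
# Usage: benchmark <url> [logfile]
benchmark() {
    url=$1
    log=${2:-download-benchmark.csv}
    tmp=$(mktemp)
    start=$(date +%s)
    wget --no-check-certificate -q -O "$tmp" "$url"
    end=$(date +%s)
    bytes=$(( $(wc -c < "$tmp") ))
    rm -f "$tmp"
    secs=$(( end - start ))
    [ "$secs" -gt 0 ] || secs=1   # avoid dividing by zero on very fast transfers
    echo "$(date -u +%FT%TZ),$url,$bytes,$secs,$(throughput_mbs "$bytes" "$secs")" >> "$log"
}
```

Run from cron with the URL above, this records one line per run. A failed or truncated download shows up as a byte count smaller than the known file size, which covers the "data missing" symptom as well as the slow one.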

Better use of docker swarm

The web proxy points each connection to the next na-arc node in a round-robin manner. Each na-arc node runs no more than one copy of each of the docker containers. There are five na-arc nodes. This means that 80% of requests go to the wrong host and have to be re-routed to the correct host using the docker swarm overlay ingress network (vxlan). This seems very inefficient.
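
The 80% figure follows from the round-robin arithmetic: with a single copy of a service and n nodes, only 1 in n requests lands on the node running it. This is a sketch of that calculation. The helper name misrouted_pct is hypothetical.

```shell
# Percentage of round-robin requests that land on a node not running
# the (single) copy of the service, rounded down.
# Usage: misrouted_pct <number of nodes>
misrouted_pct() {
    echo $(( ($1 - 1) * 100 / $1 ))
}
```

misrouted_pct 5 gives 80, and misrouted_pct 6 gives 83, so adding nodes without changing how requests are routed makes the ratio worse.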

RHEL8 shortcomings

The version of RHEL8 installed on naasc-vs-2 seems to be some small subset of the full RHEL8 distribution. For over a decade, NRAO installed all packages that came with the Operating System because disk space is cheap and we might need tools like iperf3 or dropwatch or tcpretrans.

References

NAASC Archive Stabilization Tiger Team
