Poor Download Performance
This was first reported on 2022-04-18 and is documented in https://ictjira.alma.cl/browse/AES-52. What we have seen, and what has been reported, is that downloads are sometimes extremely slow (tens of kB/s) and that sometimes the transfer is closed with data missing from the download. At other times we see perfectly reasonable download speeds (~10 MB/s). The problem was reproducible with a command like the following:
wget --no-check-certificate http://almascience.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits
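Because the problem is intermittent, a single download says little. The following is a minimal sketch, not the procedure actually used, that repeats the download with curl and prints the average speed and exit status of each attempt. The function names and the attempt count are illustrative assumptions; --insecure mirrors the --no-check-certificate flag above.

```shell
#!/bin/sh
# Sketch: repeat a download several times and report the speed of each attempt.
# Function names and the default attempt count are assumptions for illustration.

# Convert bytes per second (as printed by curl) to MB/s with two decimals.
to_mbps() {
    awk -v bps="$1" 'BEGIN { printf "%.2f", bps / 1000000 }'
}

# Download URL $1 a total of $2 times (default 5), discarding the body.
measure_download() {
    url="$1"
    attempts="${2:-5}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # %{speed_download} is curl's average download speed in bytes/second.
        bps=$(curl --silent --insecure --location --output /dev/null \
                   --write-out '%{speed_download}' "$url")
        rc=$?
        # curl exit code 18 means the transfer ended with data still missing.
        echo "attempt $i: $(to_mbps "$bps") MB/s (curl exit code $rc)"
        i=$((i + 1))
    done
}

# Example invocation (not run automatically):
# measure_download "http://almascience.nrao.edu/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits" 5
```

A run in which some attempts report tens of kB/s or a non-zero exit code while others report ~10 MB/s would match the behaviour described above.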
Shortly after this report, the almascience portal was redirected from the production docker swarm to the test-prod docker swarm because the latter gave better download performance, although still below the expected tens of MB/s. Around the same time, the MTUs on the production docker swarm nodes were changed from 1500 to 9000.
It was noticed that one of the production docker swarm nodes, na-arc-3, was configured differently than the other na-arc-* nodes:
- ping na-arc-[1,2,4,5] from na-arc-3 with anything larger than -s 1490 drops all packets
- iperf tests show 10 Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120), so the problem isn't a bad card in either of the VM hosts.
- iptables on na-arc-3 looks different than iptables on na-arc-[2,4,5]. na-arc-1 also looks a bit different.
- docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
- na-arc-3 has a veth10fd1da@if37 interface. None of the other na-arc-* nodes have a veth interface.
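The ping check in the list above can be repeated systematically. The sketch below is one way to do it, assuming Linux iputils ping (where -M do sets the Don't Fragment bit) and that the na-arc-* names resolve from the node running it. The function names are hypothetical. The payload size is the MTU minus 28 bytes (20 bytes of IPv4 header plus 8 bytes of ICMP header).

```shell
#!/bin/sh
# Sketch: check whether full-size, unfragmented packets reach a peer.
# Function names are assumptions for illustration.

# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping host $1 with Don't Fragment set and a payload that fills MTU $2.
probe_mtu() {
    host="$1"
    mtu="$2"
    size=$(icmp_payload_for_mtu "$mtu")
    if ping -M do -c 3 -W 2 -s "$size" "$host" >/dev/null 2>&1; then
        echo "$host: MTU $mtu ok (payload $size)"
    else
        echo "$host: MTU $mtu FAILED (payload $size)"
    fi
}

# Example, run from na-arc-3 (not run automatically):
# for n in 1 2 4 5; do probe_mtu "na-arc-$n" 1500; probe_mtu "na-arc-$n" 9000; done
```

Comparing the results with the MTU that ip link show reports on each node should show whether a node's configured MTU and the path it actually gets disagree.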
iperf3 tests between all the na-arc-* nodes showed na-arc-3 was performing about 10e4 times slower on both sending and receiving.
Given the number of issues with na-arc-3, it was decided to simply recreate it from a clone of na-arc-2. This happened on 2022-08-11, and since then iperf3 tests between all the na-arc-* nodes have shown the expected performance.
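The node-to-node iperf3 checks could be scripted roughly as follows. This is a sketch under stated assumptions, not the commands actually used: it assumes an iperf3 server (iperf3 -s) is already running on every peer, that the node names resolve, and that the script is run on each node in turn. The NODES list and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: test send and receive throughput from one node to every other node.
# Assumes "iperf3 -s" is already running on each peer.
NODES="na-arc-1 na-arc-2 na-arc-3 na-arc-4 na-arc-5"

# Print every node in NODES except the one named in $1.
peers_of() {
    for node in $NODES; do
        [ "$node" = "$1" ] || echo "$node"
    done
}

# From node $1, run a 5-second test to each peer in both directions.
run_iperf_tests() {
    self="$1"
    for peer in $(peers_of "$self"); do
        echo "== $self -> $peer (send) =="
        iperf3 -c "$peer" -t 5
        echo "== $peer -> $self (receive) =="
        # -R reverses the direction: the server sends, the client receives.
        iperf3 -c "$peer" -t 5 -R
    done
}

# Example, run on one of the nodes (not run automatically):
# run_iperf_tests "$(hostname -s)"
```

Running this on every node gives a full send and receive matrix, in which a single misbehaving node shows up as one slow row and one slow column.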
On 2022-08-12, http://almaportal.cv.nrao.edu/ was created so that we could test the production docker swarm nodes internally, in a manner similar to how external users would use them.