
Table 7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)

             na-arc-1      na-arc-2      na-arc-3      na-arc-4      na-arc-5      na-arc-6
             (naasc-vs-4)  (naasc-vs-4)  (naasc-vs-3)  (naasc-vs-4)  (naasc-vs-5)  (naasc-vs-2)
na-arc-1        -          4460          2580          4630          2860          3150
na-arc-2     4060             -          2590          4220          3690          2570
na-arc-3     2710          2580             -          3080          2770          2920
na-arc-4     1090          3720          2200             -          2970          3200
na-arc-5     4010          3970          2340          4010             -          3080
na-arc-6     3380          3060          3060          3010          3080             -
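
For reference, a sketch of how a throughput figure like those above can be measured from inside Docker's ingress_sbox network namespace (the namespace path and target host here are assumptions based on Docker swarm's default layout, not necessarily the exact procedure used for Table 7):

# on the receiving node, run an iperf3 server
iperf3 -s
# on the sending node, run the client inside the ingress_sbox namespace
nsenter --net=/var/run/docker/netns/ingress_sbox iperf3 -c na-arc-2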



Documentation

The NAASC doesn't have a documented procedure for creating a VM guest or for making it a docker swarm node.  This needs to be documented so that such nodes can be recreated without error or variation.  Alvaro's documentation (https://confluence.alma.cl/display/OFFLINE/Documentation) is a good start but far from sufficient.

This to-be-written documentation should also capture one-off settings such as ethtool -K em1 gro off; a sketch of applying and verifying that setting follows below.
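
A minimal sketch of that one-off setting, assuming the interface is named em1 as on the existing hosts (the interface name will vary per host):

# disable generic receive offload on em1
ethtool -K em1 gro off
# confirm the setting took effect
ethtool -k em1 | grep generic-receive-offload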


Consistent Hardware

The VM hosts used at the NAASC are of various hardware.  This led to the largest performance issue found here, the GRO feature on naasc-vs-4.  I suggest making the hardware as consistent as possible to avoid such issues in the future.


NGAS network limit

There has been much effort to put the docker swarm nodes on a 10 Gb/s network, yet the links to the NGAS nodes are only 1 Gb/s.  This means that even though there may be a 10 Gb/s connection between the docker swarm nodes and the archive user's download site, transfers will still be limited to 1 Gb/s by the NGAS links.
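
A quick way to confirm the negotiated link speed on an NGAS node (a sketch; the interface name em1 is an assumption and will vary per host):

# report negotiated link speed, e.g. 1000Mb/s
ethtool em1 | grep Speed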


Upgrade swarm to meet ALMA requirements

According to Alvaro's document (https://confluence.alma.cl/display/OFFLINE/Documentation), docker swarm nodes should have a minimum of 16 cores and 32 GB of memory.  None of the production docker swarm nodes meet this requirement.  There are plans to address this, though.  A quick way to check the nodes against the requirement is sketched below.
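
A rough per-node check of cores and memory (a sketch; the host list is illustrative and assumes ssh access to each node):

for h in na-arc-1 na-arc-2 na-arc-3 na-arc-4 na-arc-5 na-arc-6; do
    echo "== $h =="
    ssh $h 'nproc; free -h | grep Mem'
done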


ARC benchmarks

I think it would be worthwhile for each ARC to benchmark its download performance.  This should be done regularly (weekly, monthly, quarterly, etc.) using as similar a procedure at each ARC as possible.  This will provide two useful sets of data: 1. it will show when performance has dropped at an ARC, hopefully before users start complaining, and 2. it will provide a history of benchmarks to measure current results against.  A simple wget script could be used to do this and shared among the ARCs.  E.g.

wget --no-check-certificate https://almascience.nrao.edu/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
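
A sketch of how that wget call might be wrapped into a repeatable benchmark (the log location and timing approach are assumptions, not an existing ARC script):

#!/bin/sh
# Download a known archive file, discard it, and log the elapsed time.
URL=https://almascience.nrao.edu/dataPortal/member.uid___A001_X1284_Xc9b.spt2349-56_sci.spw19.cube.I.pbcor.fits
LOG=$HOME/arc-download-benchmark.log
START=$(date +%s)
wget --no-check-certificate -q -O /dev/null "$URL"
END=$(date +%s)
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) elapsed=$((END - START))s" >> "$LOG"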



Dropped packets

Some of the NAASC VM hosts show many dropped Rx packets, at rates ranging from 2 to over 100 per minute.  This is unacceptable on a modern, well-designed network.  While I can't say these dropped packets are indicative of a problem, they could become a problem with increased load, and they will certainly make debugging more difficult when a problem does occur.  I suggest the cause of these dropped packets be found and resolved.
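
A minimal way to watch these counters and gauge the drop rate (the interface name em1 is an assumption; adjust per host):

# cumulative Rx/Tx statistics, including dropped packets, since boot
ip -s link show em1
# re-sample the NIC's detailed drop counters once per minute
watch -n 60 'ethtool -S em1 | grep -i drop'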



TCP retransmissions

The newest NAASC VM host (naasc-vs-2) shows over 100 TCP retransmissions per second during iperf3 tests, while other hosts such as naasc-vs-3 and naasc-vs-4 show none at all.  While I can't say these TCP retransmissions are indicative of a problem, they could become a problem with increased load, and they will certainly make debugging more difficult when a problem does occur.  I suggest the cause of these TCP retransmissions be found and resolved.
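
A quick way to observe retransmissions during a test (a sketch; the target host is illustrative):

# iperf3's Retr column reports retransmitted segments per interval
iperf3 -c na-arc-6
# system-wide cumulative retransmission counter (iproute2 nstat)
nstat -az TcpRetransSegs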


Better use of docker swarm