...

  • Create na-arc-6 on new naasc-vs-2 (https://support.nrao.edu/show-ticket.php?ticketid=144552)
  • Test iperf between the ingress_sbox namespaces once the new na-arc-6 is available (see the first sketch after this list)
  • Set ethtool -K em1 gro off permanently on naasc-vs-4 and document it.  How do we do this?  (see the second sketch after this list)
  • Double-check the switch port settings for naasc-vs-2.  I am seeing many TCP retransmissions (dhart)
  • Check and perhaps replace the 10Gb network cable to naasc-vs-2.  Does that help with the TCP retransmissions?
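
The ingress_sbox test can be run roughly like this (a sketch, assuming iperf3 is installed on both nodes; plain iperf works the same way, and /var/run/docker/netns/ingress_sbox is the standard swarm ingress namespace path):

      # On an existing node (e.g. na-arc-1), start a server inside the ingress namespace:
      nsenter --net=/var/run/docker/netns/ingress_sbox iperf3 -s

      # On na-arc-6, connect to the server's ingress IP (10.0.0.2 is hypothetical;
      # read the real address with "ip addr" inside the same namespace):
      nsenter --net=/var/run/docker/netns/ingress_sbox iperf3 -c 10.0.0.2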
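
For persisting the GRO setting, one option on RHEL-style hosts is ETHTOOL_OPTS in the interface config (a sketch, assuming naasc-vs-4 uses the ifcfg network-scripts; under pure NetworkManager a dispatcher script may be needed instead):

      # Append to /etc/sysconfig/network-scripts/ifcfg-em1:
      ETHTOOL_OPTS="-K em1 gro off"

      # Verify after the next ifup or reboot:
      ethtool -k em1 | grep generic-receive-offload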


Done

  • Recreate na-arc-3 so it gets the same performance as the other na-arc-* nodes, which is apparently at least 10Gb/s. (pmurphy)
    1. 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
    2. 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
    3. 2022-08-11: Renamed the clone to na-arc-3.  We connected it to the swarm successfully, but it had a low connection speed.
    4. 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes (see the sketch after this list).  Performance was still poor.
    5. 2022-08-11: Changed the MTU of na-arc-3's eth0 to 1500.  This differs from all the other na-arc-* nodes, but the alternative was changing p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which may have impacted other VM guests on that host.  Performance was now reasonable (7Gb/s).  I was expecting about 9Gb/s, but perhaps the 1500 MTU is limiting throughput.
    6. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
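
Steps 4 and 5 can be reproduced with libvirt on the VM host (a sketch; the exact XML layout on naasc-vs-3 may differ):

      # On naasc-vs-3, check the current model of the guest's NIC:
      virsh dumpxml na-arc-3 | grep -A3 '<interface'

      # Change <model type='rtl8139'/> to <model type='virtio'/> in the editor:
      virsh edit na-arc-3

      # Inside na-arc-3, set the eth0 MTU for the running system
      # (persist it in the interface config as well):
      ip link set dev eth0 mtu 1500
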
  • Launch services on production swarm (sbooth)
    1. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
  • Test the production docker swarm with a test web interface. (lsharp)
    1. 2022-08-12: http://almaportal.cv.nrao.edu/
    2. 2022-08-12 krowe: ran tcpdump on all five na-arc-{1..5} nodes (tcpdump dst almaportal) and then downloaded a data file with wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar.  With each execution of the wget, I could see the next na-arc host report the traffic, because the web proxy on almaportal selects the next na-arc node via round-robin.  All five nodes were providing about 6KB/s to cvpost-master.
    3. 2022-08-12 krowe: I did iperf tests from host to host along the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master), and at each step the performance was at least 900Mb/s, yet downloading with wget was about 0.06Mb/s (see the sketch below).
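
The hop-by-hop test looks roughly like this (a sketch, assuming iperf3; repeat the client step for each adjacent pair in the chain):

      # On the receiving hop (e.g. na-arc-1):
      iperf3 -s

      # On the sending hop (e.g. nangas14):
      iperf3 -c na-arc-1
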
  • Ask the other ARCs if they use MTU 9000 on 10Gb links. (krowe)
    1. JAO uses MTU of 1500
    2. ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
  • 2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes.  This seemed to have no effect on performance.
    • tc qdisc replace dev eth0 root fq_codel
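
The change can be verified with tc (fq_codel should now appear as the root qdisc rather than pfifo_fast):

      tc qdisc show dev eth0
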
  • 2022-08-25 krowe: Tracy changed the following sysctl options on na-arc-5 to match the other VM hosts (see the sketch after this list).  Sadly, it seems to have had no effect on wget performance: na-arc-1, na-arc-2, and na-arc-4 are at 32KB/s while na-arc-3 and na-arc-5 are at 45MB/s.
    • net.ipv4.conf.all.accept_redirects = 0
    • net.ipv4.conf.all.forwarding = 1
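
The options can be applied live and persisted like this (a sketch; the file name under /etc/sysctl.d/ is an example):

      # Apply immediately:
      sysctl -w net.ipv4.conf.all.accept_redirects=0
      sysctl -w net.ipv4.conf.all.forwarding=1

      # Persist across reboots:
      echo 'net.ipv4.conf.all.accept_redirects = 0' >  /etc/sysctl.d/90-swarm-tuning.conf
      echo 'net.ipv4.conf.all.forwarding = 1'       >> /etc/sysctl.d/90-swarm-tuning.conf
      sysctl --system
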
  • 2022-09-01: Tracy rebooted naasc-vs-5, which hosts na-arc-5, just in case that was necessary for the net.ipv4.conf.all.forwarding sysctl change to take effect.  Sadly, no change in performance.
  • Why does na-arc-5 still have net.ipv4.conf.all.accept_redirects = 1 even after a reboot, while all the other na-arc nodes have it set to 0?
    • 2022-09-06 krowe: probably because na-arc-5 didn't reboot when naasc-vs-5 rebooted; I expect it was suspended instead of rebooted.  Yet natest-arc-3 and naascweb2-prod were rebooted.  I just checked virt-manager and na-arc-5 is hosted by naasc-vs-5.  Can we reboot na-arc-5?
    • 2022-09-07 krowe: rebooted na-arc-5 and now net.ipv4.conf.all.accept_redirects = 0
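
A quick way to compare this setting across nodes (a sketch, assuming ssh access from a central host such as cvpost-master):

      for h in na-arc-{1..5}; do
          echo -n "$h: "
          ssh "$h" sysctl -n net.ipv4.conf.all.accept_redirects
      done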

...