...

  • I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget.  This is na-arc-5 telling nangas13 to wait because its network buffer is full.  (See the tcpdump/qdisc sketch after this list.)
    • Is this because of qdisc pfifo_fast?  No.  krowe changed eth0 to *qdisc fq_codel* and is still seeing ZeroWindow packets.
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while na-arc-3 shows 100Mb/s.
  • Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea?  Sure, it makes a fast connection to cvsan, but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11).
  • When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like unknown user 1009.  I get the same errors on the natest-arc-1 container.  (See the UID check sketch after this list.)
  • Can we put 10Gb/s NICs in the nangas nodes?
  • Why does almaportal use ens3 while almascience uses eth0?
  • What if we move the rh-downloader container to a different node?  In fact, walk it through all five nodes and test each one.  (See the docker service sketch after this list.)
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not from natest-arc-1?
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426184 ms  0.465336 ms  0.523311 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
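
A minimal sketch of how the ZeroWindow check and the qdisc change above could be done; the interface name eth0 and the host names are taken from the notes, so adjust as needed.

    # Watch for TCP ZeroWindow advertisements (raw window field = 0) involving nangas13
    tcpdump -ni eth0 'host nangas13 and tcp[14:2] = 0'

    # Inspect the current queueing discipline, switch the root qdisc to fq_codel, re-check with stats
    tc qdisc show dev eth0
    tc qdisc replace dev eth0 root fq_codel
    tc -s qdisc show dev eth0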
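For the unknown user 1009 errors above, a sketch of how to check whether UID 1009 has a passwd entry inside the running rh-download container; the container name filter is an assumption, so confirm it with docker ps.

    # Find the running container, then look up UID 1009 inside it
    docker ps --filter name=rh-download
    docker exec <container-id> getent passwd 1009   # empty output means no passwd entry for 1009
    docker exec <container-id> id 1009              # reports "no such user" if the UID is unmapped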
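For walking the rh-downloader container across all five nodes, a sketch using docker swarm placement constraints; the service name rh-downloader is an assumption, so confirm it with docker service ls.

    # Pin the service to one node, run the wget test, then move it to the next node
    docker service update --constraint-add 'node.hostname == na-arc-1' rh-downloader
    # ...test, then...
    docker service update --constraint-rm 'node.hostname == na-arc-1' --constraint-add 'node.hostname == na-arc-2' rh-downloader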

To Do

  1. Done: Recreate na-arc-3 so it gets the same performance as other na-arc-* nodes which is apparently at least 10Gb/s. (pmurphy)
    1. 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
    2. 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
    3. 2022-08-11: Renamed the clone to na-arc-3.  We connected it to the swarm successfully, but it had a low connection speed.
    4. 2022-08-11: Changed the model of na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes.  Performance was still poor.
    5. 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500.  This is different from all the other na-arc-* nodes, but it was either that or change p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which may have impacted other VM guests on that host.  Performance was now reasonable: 7Gb/s.  I was expecting about 9Gb/s, but perhaps the 1500 MTU is affecting performance.
    6. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
  2.  Done: Launch services on production swarm (sbooth)
    1. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
  3. Test the production docker swarm with a test web interface. (lsharp)
    1. 2022-08-12: http://almaportal.cv.nrao.edu/
    2. 2022-08-12 krowe: ran tcpdump on all five na-arc-{1..5} nodes (tcpdump dst almaportal) and then downloaded a data file with wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar.  With each execution of the wget, I could see the next na-arc host report the traffic.  This is because the web proxy on almaportal selects the next na-arc node via round-robin.  All five nodes were providing about 6KB/s speeds to cvpost-master.
    3. 2022-08-12 krowe: I did iperf tests from host to host along the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master) and at each step the performance was at least 900Mb/s, yet downloading with wget was about 0.06Mb/s.  (See the iperf sketch after this list.)
  4. Done: Ask other ARC if they use MTU 9000 on 10Gb. (krowe)
    1. JAO uses MTU of 1500
    2. ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
  5. Switch the production docker swarm back to MTU 1500 since the test docker swarm uses MTU 1500 and is performing better?  (See the MTU sketch after this list.)
  6. Fix natest-arc-3 so its NIC model is virtio instead of rtl8139.  (See the virsh sketch after this list.)
  7. Upgrade production swarm to meet ALMA requirements (16-core, 32GB)
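
A sketch of the hop-by-hop iperf test mentioned in item 3; iperf3 is assumed (the notes only say "iperf") and the host names are taken from the notes.

    # On the receiving host (e.g. na-arc-1)
    iperf3 -s
    # On the sending host (e.g. nangas14); ~900Mb/s or better was seen per step
    iperf3 -c na-arc-1 -t 10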
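A sketch for item 5, switching a node back to MTU 1500; the interface name eth0 is an assumption, and the command below is not persistent across reboots (the ifcfg/NetworkManager config would also need updating).

    # Set the MTU to 1500 and verify
    ip link set dev eth0 mtu 1500
    ip link show dev eth0 | grep -o 'mtu [0-9]*'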
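A sketch for item 6, changing natest-arc-3's NIC model from rtl8139 to virtio on its VM host; shutting the guest down before editing is an assumption about how the change would be applied.

    # On the VM host that runs natest-arc-3
    virsh domiflist natest-arc-3        # confirm the current Model is rtl8139
    virsh shutdown natest-arc-3
    virsh edit natest-arc-3             # change <model type='rtl8139'/> to <model type='virtio'/>
    virsh start natest-arc-3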

...