...

  • 2022-09-28 krowe: Why was the network-scripts RPM installed on naasc-vs-2?  No other RHEL8 machine has this RPM.  Was it because nobody knew how to configure vlans and other complicated networking using NetworkManager, which is the new standard in RHEL8?
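    For what it's worth, VLANs and bridges can be configured with plain nmcli on RHEL8 without the legacy network-scripts package.  A minimal sketch (the br120 / ens1f0np0 names are borrowed from elsewhere on this page as placeholders, not the actual naasc-vs-2 config):
      # create a bridge and attach a tagged VLAN interface to it as a port
      nmcli con add type bridge con-name br120 ifname br120
      nmcli con add type vlan con-name br120-port ifname ens1f0np0.120 dev ens1f0np0 id 120 master br120 slave-type bridge
      nmcli con up br120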
  • 2022-09-26 krowe: Can someone who is able to log in, log in to the nodes on the 10.2.120 network and see if those interfaces are showing dropped Rx packets?  I would, but I can't log in to most of them because CV.
  • 2022-09-21 krowe: Why are there dozens of stuck inventory processes on naasc-vs-2?
  • 2022-09-20 krowe: ifconfig shows dropped RX packets on all naasc-vs-* nodes.  Is that still increasing with time?  What is causing this?  CJ mentioned this two months ago.  I am finally looking at it now.  Sigh.
  • 2022-09-20 krowe: It looks like device eno1 on naasc-vs-2 is configured via DHCP instead of STATIC.  Is that correct?
  • Why, with rx-gro-hw=off on naasc-vs-4, does na-arc-6 see so many retransmissions and small Congestion Window (Cwnd)?
    [root@na-arc-6 ~]# iperf3 -B 10.0.0.16 -c 10.0.0.21
    Connecting to host 10.0.0.21, port 5201
    [  4] local 10.0.0.16 port 38534 connected to 10.0.0.21 port 5201
    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec   302 MBytes  2.54 Gbits/sec  523    207 KBytes
    [  4]   1.00-2.00   sec   322 MBytes  2.70 Gbits/sec  596    186 KBytes
    [  4]   2.00-3.00   sec   312 MBytes  2.62 Gbits/sec  687    245 KBytes
    [  4]   3.00-4.00   sec   335 MBytes  2.81 Gbits/sec  638    278 KBytes
    [  4]   4.00-5.00   sec   309 MBytes  2.60 Gbits/sec  780    146 KBytes
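    For reference, the offload state the question refers to can be confirmed on the VM host with ethtool (a sketch, assuming the NIC is em1 on naasc-vs-4 as used elsewhere in these notes):
      ethtool -k em1 | egrep 'generic-receive-offload|rx-gro-hw'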
  • [root@na-arc-3 ~]# iperf3 -B 10.0.0.19 -c 10.0.0.21
    Connecting to host 10.0.0.21, port 5201
    [  4] local 10.0.0.19 port 52986 connected to 10.0.0.21 port 5201
    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec   309 MBytes  2.59 Gbits/sec  232    638 KBytes
    [  4]   1.00-2.00   sec   358 MBytes  3.00 Gbits/sec    0    967 KBytes
    [  4]   2.00-3.00   sec   351 MBytes  2.95 Gbits/sec    0   1.18 MBytes 
    [  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   74   1.36 MBytes
    [  4]   4.00-5.00   sec   359 MBytes  3.01 Gbits/sec    0   1.54 MBytes
  • Actually the retransmissions seem to vary quite a lot from one run to another, and that is the more important question.  The throughput also varies, from 1Gb/s to 4Gb/s; of course, the more retransmissions, the less throughput.  Granted, this is a second-order effect and, given that the nangas hosts have 1Gb/s links, it probably won't be seen.  But if we ever put 10Gb/s cards in the nangas nodes we will see this and be sad.
  • Why does naasc-vs-3 have a br120 in state UNKNOWN?  None of the other naasc-vs nodes have a br120.
  • Why does naasc-vs-4 have all the infiniband modules loaded?  I don't see an IB card.  naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
    • Tracy will look into this
  • Why is nfnetlink logging enabled on naasc-vs-4?  You can see this with cat /proc/net/netfilter/nf_log and lsmod|grep -i nfnet
    • nfnetlink is a module for packet mangling.  Could this interfere with the docker swarm networking?
  • Why are the eth1 interfaces in all the containers, and the docker_gwbridge, on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range?  Does it matter?
  • Here are some diffs in sysctl on the na-arc nodes.  I tried changing na-arc-4 and na-arc-5 to match the others but performance was the same.  I then changed all the nodes to match na-arc-{1..3} and still saw no change in performance.  I still don't understand how na-arc-{4..5} got different settings.  I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d, but that isn't why these are different.  (A sketch for checking and pinning these values follows the list below.)
    • na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
      • net.bridge.bridge-nf-call-arptables = 0

        net.bridge.bridge-nf-call-ip6tables = 0

        net.bridge.bridge-nf-call-iptables = 1

    • na-arc-4, na-arc-5
      • net.bridge.bridge-nf-call-arptables = 1

        net.bridge.bridge-nf-call-ip6tables = 1

        net.bridge.bridge-nf-call-iptables = 1
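    A sketch of how these could be checked and pinned persistently, assuming we want the na-arc-{1..3} values everywhere (the drop-in filename is just a suggestion; files in /etc/sysctl.d override same-named files in /usr/lib/sysctl.d):
      # show the current values
      sysctl net.bridge.bridge-nf-call-arptables net.bridge.bridge-nf-call-ip6tables net.bridge.bridge-nf-call-iptables
      # pin the desired values in a drop-in and apply it
      printf '%s\n' \
          'net.bridge.bridge-nf-call-arptables = 0' \
          'net.bridge.bridge-nf-call-ip6tables = 0' \
          'net.bridge.bridge-nf-call-iptables = 1' > /etc/sysctl.d/90-bridge-nf.conf
      sysctl -p /etc/sysctl.d/90-bridge-nf.conf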

  • I see sysctl differences between the natest-arc servers and the na-arc servers.  Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5
    • < #net.ipv4.tcp_tw_recycle = 1
      ---
      > net.ipv4.tcp_tw_recycle = 1
      22,39d21
      < net.ipv4.conf.all.accept_redirects=0
      < net.ipv4.conf.default.accept_redirects=0
      < net.ipv4.conf.all.secure_redirects=0
      < net.ipv4.conf.default.secure_redirects=0
      <
      < #net.ipv6.conf.all.disable_ipv6 = 1
      < #net.ipv6.conf.default.disable_ipv6 = 1
      <
      < # Mellanox recommends the following
      < net.ipv4.tcp_timestamps = 0
      < net.core.netdev_max_backlog = 250000
      <
      < net.core.rmem_default = 16777216
      < net.core.wmem_default = 16777216
      < net.core.optmem_max = 16777216
      < net.ipv4.tcp_mem = 16777216 16777216 16777216
      < net.ipv4.tcp_low_latency = 1
    • If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).

    • If I set all the above sysctl options, except net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change.  It is still about 32KB/s.  I also still see ZeroWindow packets.
    • Try rebooting VMs after making changes?
  • I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget.  This is na-arc-5 telling nangas13 to wait because its network buffer is full.  (A capture-filter sketch for spotting these follows below.)
    • Is this because of qdisc pfifo_fast?  No.  krowe changed eth0 to *qdisc fq_codel* and is still seeing ZeroWindow packets.
    • Now that I have moved the rh_download to na-arc-1 and put httpd on na-arc-5 I no longer see ZeroWindow packets on na-arc-5.  But I am seeing them on na-arc-1 which is where the rh_downloader is now.  Is this because the rh_downloader is being stalled talking to something else like httpd and therefore telling nangas13 to wait?
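    A capture-filter sketch for spotting these: the TCP window field is bytes 14-15 of the TCP header, so this matches only zero-window segments (adjust the interface and host to whatever carries the nangas traffic):
      tcpdump -ni eth0 'tcp[14:2] == 0 and host nangas13'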
  • Why does almaportal use ens3 while almascience uses eth0?
  • What if we move the rh-downloader container to a different node?  In fact walk it through all five nodes and test.
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not from natest-arc-1?
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       
    • [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
    • Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while natest-arc-3 shows 100Mb/s.
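    For reference, a sketch of checking and switching the NIC model (the guest has to be shut down and started again for a virsh edit of the interface model to take effect):
      virsh domiflist natest-arc-3    # shows Interface, Type, Source, Model, MAC
      virsh edit natest-arc-3         # change the interface stanza to <model type='virtio'/>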
  • Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the expected performance (0.5Gb/s), especially when the reverse tests get the expected performance (0.9Gb/s)?
  • Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea?  Sure it makes a fast connection to cvsan but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11)
  • When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like unknown user 1009.  I get the same errors on the natest-arc-1 container.
  • Does it matter that the na-arc nodes are on 10.2.97.x and their VM host is on 10.2.99.x, while the natest-arc nodes are on 10.2.96.x and their VM hosts (well, 2 out of 3) are also on 10.2.96.x?  Is this why I see cv-6509.cv.nrao.edu when running traceroute from the na-arc nodes?
  • When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.
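    One way to map a host-side veth back to a container is to compare the veth's ifindex on the host with each container's eth0 iflink (the two ends of a veth pair reference each other), e.g. a rough sketch:
      cat /sys/class/net/veth14ce034/ifindex
      for c in $(docker ps -q); do echo -n "$c "; docker exec $c cat /sys/class/net/eth0/iflink 2>/dev/null; done
    If no container matches, the veth may belong to a network namespace docker manages directly (such as the ingress_sbox), which would explain not finding a container for it.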
  • Why does the httpd container have eth0 (10.0.0.8)?  This is the ingress network.  I don't see any other container with an interface on 10.0.0.0/24.
  • Do we want to use jumbo frames?  If so, some recommend using mtu=8900 and there are a lot of places it needs to be set.
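    If we do adopt jumbo frames, the MTU has to agree everywhere along the path (switch ports, the VM host NIC and its VLAN subinterface, the bridge, and the guest eth0); a mismatch typically shows up as large packets being silently dropped, as happened with na-arc-3 elsewhere on this page.  A rough runtime-only sketch of the interfaces to touch on one host/guest pair, using the naasc-vs-3 names from this page (persistent configs would also need updating; the mtu=8900 suggestion presumably leaves headroom for the overlay/VXLAN encapsulation):
      # VM host: physical NIC, tagged subinterface, and bridge
      ip link set p5p1 mtu 9000
      ip link set p5p1.120 mtu 9000
      ip link set br97 mtu 9000
      # VM guest
      ip link set eth0 mtu 8900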

To Do

  • Create na-arc-6 on new naasc-vs-2 (https://support.nrao.edu/show-ticket.php?ticketid=144552)
  • Test iperf between ingress_sbox on new na-arc-6 when it is available
  • Set ethtool -K em1 gro off permanently on naasc-vs-4 and document it.  How do we do this?  (See the note after this list.)
  • Double check switch port settings for naasc-vs-2.  I am seeing many TCP retransmissions (dhart)
  • Check and perhaps replace the 10Gb network cable to naasc-vs-2.  Does that help with the TCP retransmissions?
  • Are the retransmissions to naasc-vs-2 causing my wget to na-arc-6 to fail?
  • Strawman proposal for reassigning VM guests
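  Regarding the em1 GRO item above: one way to make an ethtool -K setting survive reboots and link flaps is a NetworkManager dispatcher script that reapplies it whenever the interface comes up.  A sketch, assuming naasc-vs-4 runs NetworkManager; the script would live at something like /etc/NetworkManager/dispatcher.d/90-gro-off and must be root-owned and executable:
    #!/bin/sh
    # NetworkManager passes the interface name as $1 and the action as $2
    [ "$1" = "em1" ] && [ "$2" = "up" ] && ethtool -K em1 gro off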

Done

  • Recreate na-arc-3 so it gets the same performance as other na-arc-* nodes which is apparently at least 10Gb/s. (pmurphy)
      1. 2022-08-11: cloned na-arc-2 and moved the clone to naasc-vs-3 (zbutcher)
      2. 2022-08-11: moved old na-arc-3 to na-arc-3-OLD (thalstea)
      3. 2022-08-11: Renamed the clone to na-arc-3.  We connected it to the swarm successfully, but it had a low connection speed.
      4. 2022-08-11: Changed the model of  na-arc-3's vnet5 interface on naasc-vs-3 from rtl8139 to virtio to match all the other na-arc-* nodes.  Performance was still poor.
      5. 2022-08-11: Changed the MTU of na-arc-3 eth0 to 1500.  This is different from all the other na-arc-* nodes, but it was either that or change p5p1.120 and br97 on naasc-vs-3 from 9000 to 1500, which may have impacted other VM guests on that host.  Performance was now reasonable: 7Gb/s.  I was expecting about 9Gb/s, but perhaps the 1500 MTU is affecting performance.
    1. Launch services on production swarm (sbooth)
      1. 2022-08-11: Joined na-arc-3 to the swarm and started services (sbooth)
    2. Test the production docker swarm with a test web interface. (lsharp)
      1. 2022-08-12: http://almaportal.cv.nrao.edu/
      2. 2022-08-12 krowe: ran tcpdump dst almaportal on all five na-arc-{1..5} nodes and then downloaded a data file with wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar.  With each execution of the wget I could see the next na-arc host report the traffic, because the web proxy on almaportal selects the next na-arc node via round-robin.  All five nodes were providing about 6KB/s to cvpost-master.
      3. 2022-08-12 krowe: I did iperf tests from host to host along the entire chain (nangas14 -> na-arc-{1..5} -> almaportal -> cvpost-master) and at each step the performance was at least 900Mb/s, yet downloading with wget was about 0.06Mb/s.
    3. Ask other ARC if they use MTU 9000 on 10Gb. (krowe)
      1. JAO uses MTU of 1500
      2. ESO uses two VM hosts running VMware with 10Gb/s and MTU of 1500
    4. 2022-08-17 krowe: Changed eth0 on na-arc-5 from qdisc pfifo_fast to qdisc fq_codel to match all the other na-arc and natest-arc nodes.  This seemed to have no effect on performance.
      • tc qdisc replace dev eth0 root fq_codel
    5. 2022-08-25 krowe: Tracy changed the following sysctl options on na-arc-5 to match the other VM Hosts.  Sadly it seems to have had no effect on wget performance.  na-arc-1, na-arc-2, na-arc-4 are 32KB/s while na-arc-3 and na-arc-5 are 45MB/s.
      • net.ipv4.conf.all.accept_redirects = 0
      • net.ipv4.conf.all.forwarding = 1
    6. 2022-09-01: Tracy rebooted naasc-vs-5 which hosts na-arc-5 just in case this was necessary for the net.ipv4.conf.all.forwarding sysctl change to take effect.  Sadly, no change in performance.
    7. Why does na-arc-5 still have net.ipv4.conf.all.accept_redirects = 1 even after a reboot while all the other na-arc nodes have this set to 0?
      • 2022-09-06 krowe: probably because na-arc-5 didn't reboot when naasc-vs-5 rebooted.  I expect it was suspended instead of rebooted.  Yet natest-arc-3 and naascweb2-prod were rebooted.  I just checked virt-manager and na-arc-5 is hosted by naasc-vs-5.  Can we reboot na-arc-5?
      • 2022-09-07 krowe: rebooted na-arc-5 and now net.ipv4.conf.all.accept_redirects = 0
    8. 2022-09-21 cfultz: Replaced the 10Gb network cable on naasc-vs-2.  "the cable was nearly bent in half at the router".

...

  • Why does iperf show 10Gb/s between na-arc-5 and na-arc-[1,2,4]?  How is this possible if the default interface on the respective VM Hosts is 1Gb/s?
    • ANSWER: The vnets for the VM guests are tied to the 10Gb/s NICs on the VM hosts not the 1Gb/s NICs.
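      This can be double-checked on the VM host by listing the guest's vnet and then seeing which bridge that vnet is enslaved to, e.g. a quick sketch:
        virsh domiflist na-arc-5        # vnet name and its source bridge
        bridge link show | grep vnet    # which bridge each vnet port belongs to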
  • Why do natest-arc-{1..3} have 9 veth* interfaces in ip addr show while na-arc-{1..5} don't have any veth* interfaces?
    • Each container creates a veth* interface.
  • Why does na-arc-3 have such poor network performance to the other na-arc nodes?
    • ping na-arc-[1,2,4,5] with anything larger than -s 1490 drops all packets (see the MTU probe sketch at the end of this item)
    • iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120).  So it isn't a bad card in either of the VM hosts.
    • iptables on na-arc-3 looks different than iptables on na-arc-[2,4,5].  na-arc-1 also looks a bit different.
    • docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
    • na-arc-3 has a veth10fd1da@if37 interface.  None of the other na-arc-* nodes have a veth interface.
    • Production docker swarm iperf tests measured in Gb/s.


                      na-arc-1        na-arc-2        na-arc-3        na-arc-4        na-arc-5
                      (naasc-vs-4)    (naasc-vs-4)    (naasc-vs-3)    (naasc-vs-4)    (naasc-vs-5)
      na-arc-1             -              18              0.002           20              10
      na-arc-2            20               -              0.002           20              10
      na-arc-3             0.002           0.002           -               0.002           0.002
      na-arc-4            20              19              0.002            -              10
      na-arc-5            10              10              0.002           10               -

      There is clearly something wrong with na-arc-3.

    • ANSWER: Since there were so many problems with na-arc-3, it was decided to recreate it.  It was recreated from a clone of na-arc-2.
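    The -s 1490 behaviour above is what an MTU mismatch looks like (1490 bytes of ICMP payload plus 28 bytes of ICMP/IP headers is just over 1500).  A quick way to probe where the path MTU tops out is to ping with fragmentation disabled, e.g.:
      ping -M do -s 1472 na-arc-1    # 1472 + 28 = 1500, should pass on a standard 1500 MTU path
      ping -M do -s 8972 na-arc-1    # 8972 + 28 = 9000, only passes if jumbo frames work end to end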
  • Is putting all the 1Gb/s production docker swarm nodes on the same ASIC on the same Fabric Extender of the cv-nexus switch a good idea?
    • I am thinking it does not matter because it looks like the production docker swarm nodes use the 10Gb/s network which is on cv-nexus9k
  • Can we set up a test archive query that uses the "other" docker swarm which in this case would be the production swarm (na-arc-*)?
  • Why are there VLANs on the VM hosts, e.g. em1.97 on naasc-vs-4?
    • 2022-08-12 dhart: If you want all of your guest VMs to be on the same subnet as the VM host, then VLAN awareness isn't needed.  However, in most cases we want the flexibility of being able to have VM guests on different networks (from one another and/or the VM host) so the VM host is configured with a trunk interface to the network to allow for any VLAN to be passed to the underlying VM guests housed on that VM host machine

    • 2022-08-12 dhart: 10.2.97.x (and 10.2.96.x) = internal VLAN for servers (primarily); 10.2.99.x = internal VLAN for server management
    • 10.2.120.x = internal VLAN for 10 GE connections
  • Where is the main docker config (yaml file)?
  • 2022-09-20 krowe: Why does naasc-vs-2 have APIPA configured networks (169.254.0.0)?  Aren't these usually created only if there are misconfigured network(s)?
    • [root@naasc-vs-2 ~]# netstat -nr
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
      0.0.0.0         10.2.99.1       0.0.0.0         UG        0 0          0 eno1
      10.2.99.0       0.0.0.0         255.255.255.0   U         0 0          0 eno1
      10.2.120.0      0.0.0.0         255.255.255.0   U         0 0          0 ens1f0np0.120
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ens1f0np0
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ens1f0np0.120
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 br97
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 br101
      192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
    • 2022-09-28 krowe: APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth, which is installed from the network-scripts RPM.  This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specifically.  It is not installed on any other RHEL8 machine I have checked.
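      A quick way to confirm this on any host is to check whether the legacy package is present and what owns the ifup-eth script, e.g.:
        rpm -q network-scripts
        rpm -qf /etc/sysconfig/network-scripts/ifup-eth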
  • 2022-09-26 krowe: Can an older solarflare card (Solarflare Communications SFC9020) replace the card in naasc-vs-2 to see if that helps with the TCP Retransmissions? 
  • Why can't I download via na-arc-6?  I don't think it is properly set up yet.


Conclusions

NAASC Archive Stabilization Solutions

...