Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • 2022-09-28 krowe: Why was the network-scripts RPM installed on naasc-vs-2?  No other RHEL8 machine has this RPM.  Was it because nobody knew how to configure vlans and other complicated networking using NetworkManager, which is the new standard in RHEL8?
    • Yes. RHEL8 makes bridges and vlans really complicated.
  • 2022-09-26 krowe: Can someone who is able to login, login to the nodes on the 10.2.120 network and see if those interfaces are showing dropped Rx packets?  I would, but I can't login to most of them because CV.
  • 2022-09-21 krowe: Why are there stuck inventory processes on naasc-vs-2?
    • Might be an RHEL8 issue
  • Why does naasc-vs-3 have a br120 in state UNKNOWN?  none of the other naasc-vs nodes have a br120.
    • This is because it is easier to create and not use then not create.
  • Why does naasc-vs-4 have all the infiniband modules loaded?  I don't see an IB card.  naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
    • Tracy will look into this
  • Why is nfnetlink logging enabled on naasc-vs-4?  You can see this with cat /proc/net/netfilter/nf_log and lsmod|grep -i nfnet
    • nfnetlink is a module for packet mangling.  Could this interfear with the docker swarm networking?
  • why is the eth1 interfaces in all the containers and docker_gwbridge on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range?  Does it matter?
  • Here are some diffs in sysctl on na-arc nodes.  I tried changing na-arc-4 and na-arc-5 to match the others but performance was the same.  I then changed all the nodes to match na-arc-{1..3} and still no change in performance.  I still don't understand how na-arc-{4..5} got different setttings.  I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d but that isn't why these are different.
    • na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
      • net.bridge.bridge-nf-call-arptables = 0

        net.bridge.bridge-nf-call-ip6tables = 0

        net.bridge.bridge-nf-call-iptables = 1

    • na-arc-4, na-arc-5
      • net.bridge.bridge-nf-call-arptables = 1

        net.bridge.bridge-nf-call-ip6tables = 1

        net.bridge.bridge-nf-call-iptables = 1

  • Why does almaportal use ens3 while almascience uses eth0?
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on nar-arc-{1..5} natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed while na-arc-3 shows 100Mb/s.

...

  • Create na-arc-6 on new naasc-vs-2 (https://support.nrao.edu/show-ticket.php?ticketid=144552)
  • Test iperf between ingress_sbox on new na-arc-6 when it is available
  • Set ethtool -K em1 gro off perminantly on naasc-vs-4 and document it.  How do we do this?
  • Double check switch port settings for naasc-vs-2.  I am seeing many TCP retransmissions (dhart)
  • Check and perhaps replace 10Gb network cable to naas-vs-2.  Does that help with TCP retransmissions?
  • are the retarnsmissions to naasc-vs-2 causing my wget to na-arc-6 to fail?
  • Strawman proposal for reassigning VM guests
  • switch naasc-vs-2 from RHEL8 to RHEL7.
  • krowe to make tickets for solutions


Answers

  • Why does iperf show 10Gb/s between na-arc-5 and na-arc-[1,2,4]?  How is this possible if the default interface on the respective VM Hosts is 1Gb/s?
    • ANSWER: The vnets for the VM guests are tied to the 10Gb/s NICs on the VM hosts not the 1Gb/s NICs.
  • Why do natest-arc-{1..3} have 9 veth* interfaces in ip addr show while na-arc-{1..5} don't have any veth* interfaces?
    • Each container creates a veth* interface.
  • Why does na-arc-3 have such poor network performance to the other na-arc nodes?
    • ping na-arc-[1,2,4,5] with anything larger than -s 1490 drops all packets
    • iperf tests show 10Gb/s between the VM host of na-arc-3 (naasc-vs-3 p5p1.120) and the VM host of na-arc-5 (naasc-vs-5 p2p1.120).  So it isn't a bad card in either of the VM hosts.
    • iptables on na-arc-3 looks different than iptables on na-arc-[2,3,5].  na-arc-1 also looks a bit different.
    • docker_gwbridge interface on na-arc-[1,2,4,5] shows NO_CARRIER but not on na-arc-3.
    • na-arc-3 has a veth10fd1da@if37 interface.  None of the other na-arc-* nodes have a veth interface.
    • Production docker swarm iperf tests measured in Gb/s.


      na-arc-1

      (naasc-vs-4)

      na-arc-2

      (naasc-vs-4)

      na-arc-3

      (naasc-vs-3)

      na-arc-4

      (naasc-vs-4)

      na-arc-5

      (naasc-vs-5)

      na-arc-1
      180.0022010

      na-arc-2

      20
      0.0022010
      na-arc-30.0020.002
      0.0020.002
      na-arc-420190.002

      na-arc-510100.0021010

      There is clearly something wrong with na-arc-3

    • ANSWER: Since there were so many problems with na-arc-3, it was decided to recreate it.  It was recreated from a clone of na-arc-2.
  • Is putting all the 1Gb/s production docker swarm nodes on the same ASIC on the same Fabric Extender of the cv-nexus switch a good idea?
    • I am thinking it does not matter because it looks like the production docker swarm nodes use the 10Gb/s network which is on cv-nexus9k
  • Can we set up a test archive query that uses the "other" docker swarm which in this case would be the production swarm (na-arc-*)?
  • Why are there VLANs on the VM hosts.  e.g. em1.97 on naasc-vs-4?
    • 2022-08-12 dhart: If you want all of your guest VMs to be on the same subnet as the VM host, then VLAN awareness isn't needed.  However, in most cases we want the flexibility of being able to have VM guests on different networks (from one another and/or the VM host) so the VM host is configured with a trunk interface to the network to allow for any VLAN to be passed to the underlying VM guests housed on that VM host machine

    • 2022-08-12 dhart: 10.2.97.x (and 10.2.96.x) = internal VLAN for servers (primarily) 10.2.99.x = internal VLAN for server management
    • 10.2.120.x = internal VLAN for 10 GE connections
  • Where is the main docker config (yaml file)?
  • 2022-09-20 krowe: Why does naasc-vs-2 have APIPA configured networks (169.254.0.0)?  Aren't these usually created only if there are misconfigured network(s)?
    • [root@naasc-vs-2 ~]# netstat -nr
      Kernel IP routing table
      Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
      0.0.0.0         10.2.99.1       0.0.0.0         UG        0 0          0 eno1
      10.2.99.0       0.0.0.0         255.255.255.0   U         0 0          0 eno1
      10.2.120.0      0.0.0.0         255.255.255.0   U         0 0          0 ens1f0np0.120
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ens1f0np0
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 ens1f0np0.120
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 br97
      169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 br101
      192.168.122.0   0.0.0.0         255.255.255.0   U         0 0          0 virbr0
    • 2022-09-28 krowe: APIPA routes are created via /etc/sysconfig/network-scripts/ifup-eth which is installed from the network-scripts RPM.  This RPM is legacy for RHEL8 (naasc-vs-2 is RHEL8.6) and must have been installed specificly.  It is not installed on any other RHEL8 machine I have checked.
  • 2022-09-26 krowe: Can an older solarflare card (Solarflare Communications SFC9020) replace the card in naasc-vs-2 to see if that helps with the TCP Retransmissions? 
  • Why can't I download via na-arc-6?  I don't think it is properly setup yet.
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not on natest-arc-1
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       
    • [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
    • Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509

...