...

  • 2022-09-02 krowe: sysctl -a | grep <10Gb NIC> output on naasc-vs-4 differs from naasc-vs-3/naasc-vs-5
    • naasc-vs-4 has entries for VLANs 101 and 140 while naasc-vs-3 and naasc-vs-5 have entries for VLANs 192 and 96.
  • 2022-09-02 krowe: Compared sysctl -a on naasc-vs-4 and naasc-vs-5 and found many questionable differences
    • naasc-vs-4: net.iw_cm.default_backlog = 256
      • Is this because the IB modules are loaded?
    • naasc-vs-4: net.rdma_ucm.max_backlog = 1024
      • Is this because the IB modules are loaded?
    • naasc-vs-4: sunrpc.rdma*
      • Is this because the IB modules are loaded?
    • naasc-vs-4: net.netfilter.nf_log.2 = nfnetlink_log
      • nfnetlink is a module for packet mangling.  Could this interfere with the docker swarm networking?
    • The recorded output rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 300Kb/s.
    • And the recorded input rate of naasc-vs-5 is about 500 Mb/s while naasc-vs-{3..4} is about 5 Mb/s.
    • This is very strange, as it seemed naasc-vs-5 was the limiting factor, but the switch ports suggest not.  Perhaps this data rate is caused by other VM guests on naasc-vs-5 (helpdesk-prod, naascweb2-prod, cartaweb-prod, natest-arc-3, cobweb2-dev).
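A quick way to isolate these sysctl differences (a sketch, not the exact commands used; assumes ssh access to the hypervisors) is to sort both dumps and let comm pick out lines unique to one host:

```shell
# Real-world usage against the hypervisors named above:
#   ssh naasc-vs-4 'sysctl -a 2>/dev/null' | sort > vs4.txt
#   ssh naasc-vs-5 'sysctl -a 2>/dev/null' | sort > vs5.txt
# Tiny sample dumps so the technique can be demonstrated locally:
printf 'net.core.somaxconn = 128\nnet.iw_cm.default_backlog = 256\n' | sort > vs4.txt
printf 'net.core.somaxconn = 128\n' > vs5.txt
# Settings present only on the first host (e.g. ones added by the IB modules)
comm -23 vs4.txt vs5.txt
```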
  • 2022-09-06 krowe: ethtool -k <NIC> output on naasc-vs-3 and naasc-vs-5 is very different from naasc-vs-4.
    • hw-tc-offload: off vs hw-tc-offload: on
    • rx-gro-hw: off vs rx-gro-hw: on
    • rx-vlan-offload: off vs rx-vlan-offload: on
    • rx-vlan-stag-hw-parse: off vs rx-vlan-stag-hw-parse: on
    • tcp-segmentation-offload: off vs tcp-segmentation-offload: on
    • tx-gre-csum-segmentation: off vs tx-gre-csum-segmentation: on
    • tx-gre-segmentation: off vs tx-gre-segmentation: on
    • tx-gso-partial: off vs tx-gso-partial: on
    • tx-ipip-segmentation: off vs tx-ipip-segmentation: on
    • tx-sit-segmentation: off vs tx-sit-segmentation: on
    • tx-tcp-segmentation: off vs tx-tcp-segmentation: on
    • tx-udp_tnl-csum-segmentation: off vs tx-udp_tnl-csum-segmentation: on
    • tx-udp_tnl-segmentation: off vs tx-udp_tnl-segmentation: on
    • tx-vlan-offload: off vs tx-vlan-offload: on
    • tx-vlan-stag-hw-insert: off vs tx-vlan-stag-hw-insert: on
  • 2022-09-12 krowe: I found the rx and tx buffers for em1 on naasc-vs-4 were 511 while on naasc-vs-2, 3, and 5 they were 1024.  I changed naasc-vs-4 with ethtool -G em1 rx 1024 tx 1024 but it didn't change iperf performance.
  • 2022-09-12 krowe: I found an article suggesting that gro can make traffic slower when it is enabled.  I see that rx-gro-hw is enabled on naasc-vs-4 but disabled on naasc-vs-3 and 5.  You can see this with ethtool -k em1 | grep gro.  So I disabled it on naasc-vs-4 with ethtool -K em1 gro off and iperf3 tests now show about 2Gb/s in both directions!
    • GRO = Generic Receive Offload, here implemented in hardware on the physical NIC.  GRO is an aggregation technique that coalesces several received packets from a stream into a single large packet, saving CPU cycles because the kernel has fewer packets to process.
    • https://bugzilla.redhat.com/show_bug.cgi?id=1424076
    • After disabling rx-gro-hw, I no longer see TCP Retransmission or TCP Out-Of-Order packets when tracing the iperf3 test from na-arc-3 to na-arc-2.
    • Table7: iperf3 TCP throughput from/to ingress_sbox with rx-gro-hw=off (Mb/s)

      from \ to     na-arc-1      na-arc-2      na-arc-3      na-arc-4      na-arc-5      na-arc-6
                    (naasc-vs-4)  (naasc-vs-4)  (naasc-vs-3)  (naasc-vs-4)  (naasc-vs-5)  (naasc-vs-2)
      na-arc-1           -            4460          2580          4630          2860          3150
      na-arc-2         4060             -           2590          4220          3690          2570
      na-arc-3         2710          2580             -           3080          2770          2920
      na-arc-4         1090          3720          2200             -           2970          3200
      na-arc-5         4010          3970          2340          4010             -           3080
      na-arc-6         3380          3060          3060          3010          3080             -

na-arc-1,2,3,4,5: Identical
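The ethtool -K em1 gro off fix above does not survive a reboot.  One way to persist it on a RHEL-style host (a sketch; assumes the initscripts ifcfg files are in use and that this build supports ETHTOOL_OPTS) is:

```shell
# Append to the interface config so gro is turned off at every boot
echo 'ETHTOOL_OPTS="-K em1 gro off"' >> /etc/sysconfig/network-scripts/ifcfg-em1
# Apply now and verify the running state
ethtool -K em1 gro off
ethtool -k em1 | grep generic-receive-offload
```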

...

  • Where is the main docker config (yaml file)?
  • Why, with rx-gro-hw=off on naasc-vs-4, does na-arc-6 see so many retransmissions and small Congestion Window (Cwnd)?
    • [root@na-arc-6 ~]# iperf3 -B 10.0.0.16 -c 10.0.0.21
      Connecting to host 10.0.0.21, port 5201
      [  4] local 10.0.0.16 port 38534 connected to 10.0.0.21 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   302 MBytes  2.54 Gbits/sec  523    207 KBytes
      [  4]   1.00-2.00   sec   322 MBytes  2.70 Gbits/sec  596    186 KBytes
      [  4]   2.00-3.00   sec   312 MBytes  2.62 Gbits/sec  687    245 KBytes
      [  4]   3.00-4.00   sec   335 MBytes  2.81 Gbits/sec  638    278 KBytes
      [  4]   4.00-5.00   sec   309 MBytes  2.60 Gbits/sec  780    146 KBytes

    • [root@na-arc-3 ~]# iperf3 -B 10.0.0.19 -c 10.0.0.21
      Connecting to host 10.0.0.21, port 5201
      [  4] local 10.0.0.19 port 52986 connected to 10.0.0.21 port 5201
      [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
      [  4]   0.00-1.00   sec   309 MBytes  2.59 Gbits/sec  232    638 KBytes
      [  4]   1.00-2.00   sec   358 MBytes  3.00 Gbits/sec    0    967 KBytes
      [  4]   2.00-3.00   sec   351 MBytes  2.95 Gbits/sec    0   1.18 MBytes 
      [  4]   3.00-4.00   sec   339 MBytes  2.84 Gbits/sec   74   1.36 MBytes
      [  4]   4.00-5.00   sec   359 MBytes  3.01 Gbits/sec    0   1.54 MBytes
    • Actually, the retransmissions seem to vary quite a lot from one run to another.  That is the more important question.
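To put numbers on that run-to-run variance, iperf3's JSON output can be looped and the retransmit count pulled out per run (a sketch; assumes an iperf3 server is still listening on 10.0.0.21):

```shell
# Loop against the live hosts:
#   for i in 1 2 3 4 5; do
#       iperf3 -J -B 10.0.0.16 -c 10.0.0.21 > run$i.json
#   done
# Extracting the per-run retransmit count from iperf3's JSON output,
# shown here on a minimal sample of that JSON layout:
echo '{"end":{"sum_sent":{"retransmits":523}}}' > sample.json
python3 -c 'import json; print(json.load(open("sample.json"))["end"]["sum_sent"]["retransmits"])'
```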
  • Why does naasc-vs-3 have a br120 in state UNKNOWN?  None of the other naasc-vs nodes have a br120.
  • Why does naasc-vs-4 have all the infiniband modules loaded?  I don't see an IB card.  naasc-vs-1 and naasc-dev-vs also have some IB modules loaded but naasc-vs-3 and naasc-vs-5 don't have any IB modules loaded.
    • Tracy will look into this
  • Why is nfnetlink logging enabled on naasc-vs-4?  You can see this with cat /proc/net/netfilter/nf_log and lsmod | grep -i nfnet
    • nfnetlink is a module for packet mangling.  Could this interfere with the docker swarm networking?
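If the nfnetlink logger is a suspect, it can be unbound without unloading any modules (a sketch; the .2 suffix is the protocol family, AF_INET, matching the net.netfilter.nf_log.2 entry above):

```shell
# Show which logger each protocol family is currently bound to
cat /proc/net/netfilter/nf_log
# Rebind the IPv4 logger to the built-in default (no nfnetlink_log)
sysctl -w net.netfilter.nf_log.2=NONE
```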
  • Why are the eth1 interfaces in all the containers and docker_gwbridge on na-arc-1 in the 172.18.x.x range while all the other na-arcs are in the 172.19.x.x range?  Does it matter?
  • Here are some diffs in sysctl on the na-arc nodes.  I tried changing na-arc-4 and na-arc-5 to match the others but performance was the same.  I then changed all the nodes to match na-arc-{1..3} and still saw no change in performance.  I still don't understand how na-arc-{4..5} got different settings.  I did find that there is another directory for sysctl settings in /usr/lib/sysctl.d but that isn't why these are different.
    • na-arc-1, na-arc-2, na-arc-3, natest-arc-1, natest-arc-2, natest-arc-3
      • net.bridge.bridge-nf-call-arptables = 0

        net.bridge.bridge-nf-call-ip6tables = 0

        net.bridge.bridge-nf-call-iptables = 1

    • na-arc-4, na-arc-5
      • net.bridge.bridge-nf-call-arptables = 1

        net.bridge.bridge-nf-call-ip6tables = 1

        net.bridge.bridge-nf-call-iptables = 1
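For reference, the na-arc-{1..3} values can be applied persistently on a node like this (a sketch; the file name 98-bridge-nf.conf is an assumption, any /etc/sysctl.d/*.conf is read at boot):

```shell
# Write the settings to a sysctl.d drop-in, then load them immediately
cat > /etc/sysctl.d/98-bridge-nf.conf <<'EOF'
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl -p /etc/sysctl.d/98-bridge-nf.conf
```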

  • I see sysctl differences between the natest-arc servers and the na-arc servers.  Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5
    • < #net.ipv4.tcp_tw_recycle = 1
      ---
      > net.ipv4.tcp_tw_recycle = 1
      22,39d21
      < net.ipv4.conf.all.accept_redirects=0
      < net.ipv4.conf.default.accept_redirects=0
      < net.ipv4.conf.all.secure_redirects=0
      < net.ipv4.conf.default.secure_redirects=0
      <
      < #net.ipv6.conf.all.disable_ipv6 = 1
      < #net.ipv6.conf.default.disable_ipv6 = 1
      <
      < # Mellanox recommends the following
      < net.ipv4.tcp_timestamps = 0
      < net.core.netdev_max_backlog = 250000
      <
      < net.core.rmem_default = 16777216
      < net.core.wmem_default = 16777216
      < net.core.optmem_max = 16777216
      < net.ipv4.tcp_mem = 16777216 16777216 16777216
      < net.ipv4.tcp_low_latency = 1
    • If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).

    • If I set all the above sysctl options, except net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change.  It is still about 32KB/s.  Also, I still see ZeroWindow packets.
    • Try rebooting VMs after making changes?
  • I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget.  This is na-arc-5 telling nangas13 to wait because its network buffer is full.
    • Is this because of qdisc pfifo_fast?  No.  krowe changed eth0 to *qdisc fq_codel* and is still seeing ZeroWindow packets.
    • Now that I have moved the rh_download to na-arc-1 and put httpd on na-arc-5 I no longer see ZeroWindow packets on na-arc-5.  But I am seeing them on na-arc-1 which is where the rh_downloader is now.  Is this because the rh_downloader is being stalled talking to something else like httpd and therefore telling nangas13 to wait?
  • Why does almaportal use ens3 while almascience uses eth0?
  • What if we move the rh-downloader container to a different node?  In fact walk it through all five nodes and test.
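One way to walk it through the nodes (a sketch; the service name rh_downloader is an assumption, check docker service ls for the real one) is a placement constraint:

```shell
# Pin the service to one node, run the wget test, then move it along
docker service update --constraint-add 'node.hostname==na-arc-2' rh_downloader
# ...test, then swap the constraint for the next node:
docker service update --constraint-rm 'node.hostname==na-arc-2' \
                      --constraint-add 'node.hostname==na-arc-3' rh_downloader
```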
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not from natest-arc-1?
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       
    • [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
    • Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while natest-arc-3 shows 100Mb/s.
  • Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the expected performance (0.5Gb/s), especially when the reverse tests get expected performance (0.9Gb/s)?
  • Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea?  Sure it makes a fast connection to cvsan but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11)
  • When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like "unknown user 1009".  I get the same errors on the natest-arc-1 container.
  • Does it matter that the na-arc nodes are on 10.2.97.x and their VM host is on 10.2.99.x, while the natest-arc nodes are on 10.2.96.x and their VM hosts (well, 2 out of 3) are also on 10.2.96.x?  Is this why I see cv-6509.cv.nrao.edu when running traceroute from the na-arc nodes?
  • When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.
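A veth can usually be matched to a container through its peer ifindex (a sketch using standard iproute2 and docker commands):

```shell
# The "@ifNN" suffix on the veth names the peer's ifindex in its namespace
idx=$(ip -o link show veth14ce034 | sed -n 's/.*@if\([0-9]*\).*/\1/p')
# Search every running container's network namespace for that ifindex
for c in $(docker ps -q); do
    pid=$(docker inspect --format '{{.State.Pid}}' "$c")
    if nsenter -t "$pid" -n ip -o link | grep -q "^$idx:"; then
        docker inspect --format '{{.Name}}' "$c"
    fi
done
```

If nothing matches, the veth may belong to a hidden swarm namespace such as the ingress_sbox under /var/run/docker/netns rather than to a container.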
  • Why does the httpd container have eth0 (10.0.0.8)?  This is the ingress network.  I don't see any other container with an interface on 10.0.0.0/24.
  • Do we want to use jumbo frames?  If so, some recommend using mtu=8900 and there are a lot of places it needs to be set.
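Before setting mtu=8900 everywhere, the path can be probed end to end with do-not-fragment pings (a sketch; ping -s takes the ICMP payload size, i.e. the MTU minus 28 bytes of IP+ICMP headers):

```shell
# 8900 - 20 (IP header) - 8 (ICMP header) = 8872 byte payload
ping -M do -s 8872 -c 3 nangas13
# "Frag needed" errors instead of replies mean some hop won't pass
# jumbo frames.  The per-NIC change itself (also needed on bridges,
# bonds, and VM guests along the path):
ip link set dev em1 mtu 8900
```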

...