Questions

  • Where is the main docker config (yaml file)?
  • I see sysctl differences between the natest-arc servers and the na-arc servers.  Here is a diff of /etc/sysctl.d/99-nrao.conf on natest-arc-1 and na-arc-5
    • < #net.ipv4.tcp_tw_recycle = 1
      ---
      > net.ipv4.tcp_tw_recycle = 1
      22,39d21
      < net.ipv4.conf.all.accept_redirects=0
      < net.ipv4.conf.default.accept_redirects=0
      < net.ipv4.conf.all.secure_redirects=0
      < net.ipv4.conf.default.secure_redirects=0
      <
      < #net.ipv6.conf.all.disable_ipv6 = 1
      < #net.ipv6.conf.default.disable_ipv6 = 1
      <
      < # Mellanox recommends the following
      < net.ipv4.tcp_timestamps = 0
      < net.core.netdev_max_backlog = 250000
      <
      < net.core.rmem_default = 16777216
      < net.core.wmem_default = 16777216
      < net.core.optmem_max = 16777216
      < net.ipv4.tcp_mem = 16777216 16777216 16777216
      < net.ipv4.tcp_low_latency = 1
    • If I set net.ipv4.tcp_timestamps = 0 on na-arc-5, the wget download drops to nothing (--.-KB/s).

    • If I set all the above sysctl options, except net.ipv4.tcp_timestamps, on all five na-arc nodes, wget download performance doesn't change.  It is still about 32KB/s.  Also, I still see ZeroWindow packets.
    • Try rebooting VMs after making changes?
  • I see ZeroWindow packets sent from na-arc-5 to nangas13 while downloading a file from nangas13 using wget.  This is na-arc-5 telling nangas13 to wait because its TCP receive buffer is full.
    • Is this because of qdisc pfifo_fast?  No.  krowe changed eth0 to qdisc fq_codel and still saw ZeroWindow packets.
    • Now that I have moved the rh_downloader to na-arc-1 and put httpd on na-arc-5, I no longer see ZeroWindow packets on na-arc-5.  But I am seeing them on na-arc-1, which is where the rh_downloader is now.  Is this because the rh_downloader is being stalled talking to something else, like httpd, and is therefore telling nangas13 to wait?
  • Why does almaportal use ens3 while almascience uses eth0?
  • What if we move the rh-downloader container to a different node?  In fact walk it through all five nodes and test.
  • Why do I see cv-6509 when tracerouting from na-arc-5 to nangas13 but not from natest-arc-1?
    • [root@na-arc-5 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan97.cv.nrao.edu (10.2.97.1)  0.426 ms  0.465 ms  0.523 ms
       2  cv-6509.cv.nrao.edu (10.2.254.5)  0.297 ms  0.277 ms  0.266 ms
       3  nangas13.cv.nrao.edu (10.2.140.33)  0.197 ms  0.144 ms  0.109 ms
       
    • [root@natest-arc-1 ~]# traceroute nangas13
      traceroute to nangas13 (10.2.140.33), 30 hops max, 60 byte packets
       1  cv-6509-vlan96.cv.nrao.edu (10.2.96.1)  0.459 ms  0.427 ms  0.402 ms
       2  nangas13.cv.nrao.edu (10.2.140.33)  0.184 ms  0.336 ms  0.311 ms
    • Derek wrote that 10.2.99.1 = CV-NEXUS and 10.2.96.1 = CV-6509
  • Why does natest-arc-3 have ens3 instead of eth0 and why is its speed 100Mb/s?
    • virsh domiflist natest-arc-3 shows the Model as rtl8139 instead of virtio
    • When I run ethtool eth0 on na-arc-{1..5} and natest-arc-{1..2} as root, the result is just Link detected: yes instead of the full report with speed, while natest-arc-3 shows 100Mb/s.
  • Why do iperf tests from natest-arc-1 and natest-arc-2 to natest-arc-3 get about half the expected performance (0.5Gb/s), especially when the reverse tests get the expected performance (0.9Gb/s)?
  • Is putting the production swarm nodes (na-arc-*) on the 10Gb/s network a good idea?  Sure, it makes a fast connection to cvsan, but it adds one more hop to the nangas servers (e.g. na-arc-1 -> cv-nexus9k -> cv-nexus -> nangas11).
  • When I connect to the container acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun I get errors like unknown user 1009.  I get the same errors on the natest-arc-1 container.
  • Does it matter that the na-arc nodes are on 10.2.97.x and their VM host is on 10.2.99.x, while the natest-arc nodes are on 10.2.96.x and their VM hosts (well, 2 out of 3) are also on 10.2.96.x?  Is this why I see cv-6509.cv.nrao.edu when running traceroute from the na-arc nodes?
  • When running wget --no-check-certificate http://na-arc-3.cv.nrao.edu:8088/dataPortal/member.uid___A001_X1358_Xd2.3C286_sci.spw31.cube.I.pbcor.fits I see traffic going through veth14ce034 on na-arc-3 but I can't find a container associated with that veth.
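
One possible way to look for the owner of a veth such as veth14ce034 is to compare interface indexes: the host side of a veth pair has an ifindex that its peer inside the other network namespace reports as its iflink.  The sketch below is only an illustration.  It assumes the docker CLI is usable on the node, that each container image ships a cat binary, and that eth0 through eth2 are the interface names worth checking.  All function names are hypothetical.  A veth with no match may belong to a network namespace that is not a regular container.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Sequence


def host_ifindex(veth: str) -> int:
    """Read the host-side interface index of a veth from sysfs."""
    return int(Path(f"/sys/class/net/{veth}/ifindex").read_text().strip())


def container_iflinks(
    ifaces: Sequence[str] = ("eth0", "eth1", "eth2"),  # assumed interface names
) -> Dict[str, List[int]]:
    """Map each running container name to the iflink values of its interfaces."""
    names = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    links: Dict[str, List[int]] = {}
    for name in names:
        found: List[int] = []
        for iface in ifaces:
            # Fails harmlessly if the interface or the cat binary is missing.
            proc = subprocess.run(
                ["docker", "exec", name, "cat", f"/sys/class/net/{iface}/iflink"],
                capture_output=True, text=True,
            )
            value = proc.stdout.strip()
            if proc.returncode == 0 and value.isdigit():
                found.append(int(value))
        links[name] = found
    return links


def owners(ifindex: int, links: Dict[str, List[int]]) -> List[str]:
    """Return the containers that report the given host ifindex as an iflink."""
    return sorted(name for name, idxs in links.items() if ifindex in idxs)


def find_veth_owner(veth: str) -> List[str]:
    """Combine the steps above for a single host veth name."""
    return owners(host_ifindex(veth), container_iflinks())
```

Run as root on na-arc-3, find_veth_owner("veth14ce034") would return the names of any matching containers, or an empty list if none of the checked interfaces match.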

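
The sysctl comparison between natest-arc-1 and na-arc-5 could also be done on the active settings only, so that commented-out lines and spacing differences do not clutter the result.  The following is a minimal sketch, not the method used for the diff above.  The function names are hypothetical, and the two sample strings are short excerpts that assume natest-arc-1 was the first file given to diff.

```python
from typing import Dict, Optional, Tuple


def parse_sysctl(text: str) -> Dict[str, str]:
    """Return the active key/value pairs from sysctl.conf-style text."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (sysctl.conf accepts '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        # Collapse runs of whitespace so "a  b" and "a b" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def diff_sysctl(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (value_in_a, value_in_b)} where the values differ.

    None means the key is not actively set on that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Sample excerpts only; the real files are /etc/sysctl.d/99-nrao.conf on each host.
natest_arc_1 = """
#net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 250000
"""
na_arc_5 = """
net.ipv4.tcp_tw_recycle = 1
"""

for key, (left, right) in diff_sysctl(
    parse_sysctl(natest_arc_1), parse_sysctl(na_arc_5)
).items():
    print(f"{key}: natest-arc-1={left} na-arc-5={right}")
```

This compares only what the files say.  The values actually in effect on a running node can differ if another file in /etc/sysctl.d overrides them or if a setting was changed at runtime.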