...
- Using ApacheBench (`ab`) every hour on rastan.aoc.nrao.edu to load http://almascience.nrao.edu/
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almascience.nrao.edu/data (times are in milliseconds)
- Mode load time is 98ms
- ssh.aoc.nrao.edu:/users/krowe/alma_archive/benchmarks/almaportal.cv.nrao.edu/data (times are in milliseconds)
- Mode load time is 123ms
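The mode figures above can be reproduced from a list of per-request load times. This is a minimal sketch that assumes the times have already been read from the data file into a list of integer milliseconds; the file's actual layout is not shown in these notes, and `mode_ms` is a hypothetical helper name.

```python
from collections import Counter

def mode_ms(samples_ms):
    """Return the most common load time in milliseconds.

    Ties go to the smallest value so the result is deterministic.
    """
    if not samples_ms:
        raise ValueError("no samples")
    counts = Counter(samples_ms)
    top = max(counts.values())
    return min(value for value, n in counts.items() if n == top)

# Made-up samples for illustration, not values from the benchmark data file.
example = [98, 97, 98, 102, 98, 110, 97]
print(mode_ms(example))
```

Unlike the mean, the mode is not pulled upward by an occasional slow request, which may be why it suits an hourly benchmark like this one.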
- Using wget to get 2013.1.00226.S-small (about 700MB) every hour on cvpost-master.aoc.nrao.edu
- ssh.cv.nrao.edu:/lustre/cv/users/krowe/tickets/scg-207/benchmarks/almascience.nrao.edu/2013.1.00226.S-small
- 2022-08-16: average download time is about 42 seconds, which is about 16MB/s
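The rate quoted above follows directly from the file size and the download time. A small sketch of that arithmetic, assuming decimal megabytes and using only the approximate figures from the note (`transfer_rate_mb_per_s` is a hypothetical helper name):

```python
def transfer_rate_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s for a download of size_mb megabytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds

# About 700MB in about 42 seconds, as reported in the note above.
rate = transfer_rate_mb_per_s(700, 42)
print(f"{rate:.1f} MB/s")
```

Since both inputs are approximate, the result is only good to the nearest MB/s or so.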
- iperf tests using `iperf3 -s -B <local IP>` and `iperf3 -B <local IP> -c <dest IP>`
- 2022-08-15 krowe: I had tcpdump running on each of the na-arc-{1..5} nodes, watching for traffic from almaportal with `tcpdump dst almaportal`. Then I would run the following wget on cvpost-master. The first execution would be shown by tcpdump on na-arc-1, the second execution on na-arc-2, and so forth. This is because of the round-robin nature of the web proxy on almaportal and was a nice confirmation of that process. However, each execution settled at about 32KB/s (0.3Mb/s) after a minute or so of downloading, which is about 300 times slower than expected. Using the test swarm (natest-arc-{1..3}) I can download the same file at about 10MB/s (100Mb/s). I also saw no difference in performance across the five nodes, which was surprising given that one node runs the download container and the other four must forward traffic to it.
- cvpost-master wget --no-check-certificate https://almaportal.cv.nrao.edu/dataPortal/2013.1.00226.S_uid___A001_X122_X1f1_001_of_001.tar
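The "about 300 times slower" figure above is the ratio of the test-swarm rate to the production rate. A minimal sketch of that comparison, assuming decimal units (1KB = 1000 bytes) and strict 8 bits per byte; the notes round their megabit figures more loosely, so the converted values here come out slightly lower. The helper names are hypothetical.

```python
KB = 1000          # decimal kilobyte, an assumption
MB = 1000 * KB

def slowdown(expected_bytes_per_s, observed_bytes_per_s):
    """How many times slower the observed rate is than the expected one."""
    return expected_bytes_per_s / observed_bytes_per_s

def to_megabits_per_s(bytes_per_s):
    """Convert bytes per second to megabits per second (8 bits per byte)."""
    return bytes_per_s * 8 / 1_000_000

observed = 32 * KB   # production swarm, from the note above
expected = 10 * MB   # test swarm, from the note above

print(f"slowdown: {slowdown(expected, observed):.1f}x")
print(f"observed: {to_megabits_per_s(observed):.3f} Mb/s")
print(f"expected: {to_megabits_per_s(expected):.0f} Mb/s")
```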
- 2022-08-15 krowe: I ran iperf tests from end to end and don't see any unexpected performance.
- [nangas11] -- ~900Mb/s --> [rh-download container on na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
- [nangas11] -- ~900Mb/s --> [na-arc-5] -- ~8,000Mb/s --> [almaportal] -- ~900Mb/s --> [cvpost-master]
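The end-to-end ceiling implied by the iperf figures above is the slowest hop on the path. A small sketch of that reasoning, using the approximate rates from the diagram and assuming 8 bits per byte with no protocol overhead (`bottleneck_mbps` is a hypothetical helper name):

```python
def bottleneck_mbps(hops):
    """Return the slowest link rate (Mb/s) among (label, rate) hops."""
    if not hops:
        raise ValueError("no hops")
    return min(rate for _label, rate in hops)

# Approximate iperf results from the diagram above, in Mb/s.
path = [
    ("nangas11 -> na-arc-5", 900),
    ("na-arc-5 -> almaportal", 8000),
    ("almaportal -> cvpost-master", 900),
]

limit = bottleneck_mbps(path)
print(f"path limit: {limit} Mb/s, roughly {limit / 8:.1f} MB/s")
```

On these numbers the network path could carry on the order of 100MB/s, so raw link capacity does not account for a 32KB/s download.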
- 2022-08-17 krowe: doing scp tests of a 784MB file
- [root@rh-download-na-production-2022jun tmp]# scp krowe@nangas13:/NGAS1/volume1/afa/2022-08-17/1/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz /tmp (93MB/s)
- [root@rh-download-na-production-2022jun tmp]# scp member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@almaportal:/tmp (70MB/s)
- almaportal krowe >scp /tmp/member.uid___A001_X158f_X90c.IRAS_09022-3615_sci.spw29.cube.I.pb.fits.gz krowe@cvpost-master:/tmp (110MB/s)
- tcpdump bandwidth tests
- When I download a file that lives on nangas13 through na-arc-5 to cvpost-master, like so `wget --no-check-certificate http://na-arc-5.cv.nrao.edu:8088/dataPortal/member.uid___A001_X122_X1f1.LKCA_15_13CO_cube.image.fits`, the download runs at about 32KB/s.
- On nangas13 I see about that much traffic (32KB/s to 50KB/s), almost all of it going to na-arc-5.
- On na-arc-5 (rh-download container) I see between about 200KB/s and 300KB/s of traffic.
- On na-arc-2 (httpd container) I see between about 100KB/s and 150KB/s of traffic, which is about half the traffic na-arc-5 sees.
- 2022-08-19 krowe: For some reason, all the swarm services on na-arc-5 shut down about 24 hours ago (which is around 11am Central Aug. 18, 2022). My wget tests are now getting about 100MB/s; I ran the test five times to walk through all five nodes.
- na-arc-5 was running
- acralmaprod001.azurecr.io/offline-production/asax-elasticsearch:2022.02.01.2022feb (now on na-arc-3)
- acralmaprod001.azurecr.io/offline-production/asax-explorer:2022.04.01.2022apr (now on na-arc-2)
- acralmaprod001.azurecr.io/offline-production/asax-ingestor:2022.06.01.2022jun (now on na-arc-3)
- acralmaprod001.azurecr.io/offline-production/rh-download:2022.06.01.2022jun (now on na-arc-2)
- acralmaprod001.azurecr.io/offline-production/rh-logging:2022.06.01.2022jun (now on na-arc-4)
- na-arc-5 didn't reboot. It has been up for 29 days.
- Looks like na-arc-5 lost its heartbeat with the swarm
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:14.131474019-04:00" level=warning msg="memberlist: Refuting a suspect message (from: c30261b68826)"
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:15.929428007-04:00" level=info msg="memberlist: Suspect 886f1454e2b4 has failed, no acks received"
Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:16.061224152-04:00" level=error msg="heartbeat to manager {xojanp58fu1ysx3yk0rpvjsft 10.2.97.71:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=1l5cnfmt16f6hyg5it0rq39rr session.id=fl2thh44rmjfxgu7xidnakjg8 sessionID=fl2thh44rmjfxgu7xidnakjg8
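Entries like the ones above can be pulled out of a larger syslog by matching the `level=` and `msg=` fields. This is a sketch only: it assumes every line of interest carries both fields in that order, as the three quoted lines do, and it does not handle escaped quotes inside a message. The function names and the keyword list are hypothetical.

```python
import re

# Matches the level and msg fields of the dockerd lines quoted above.
FIELDS = re.compile(r'level=(?P<level>\w+) msg="(?P<msg>[^"]*)"')

def parse_dockerd_line(line):
    """Return (level, msg) for a dockerd syslog line, or None if no match."""
    match = FIELDS.search(line)
    if match is None:
        return None
    return match.group("level"), match.group("msg")

def swarm_entries(lines, keywords=("memberlist", "heartbeat")):
    """Keep only entries whose message mentions one of the keywords."""
    hits = []
    for line in lines:
        parsed = parse_dockerd_line(line)
        if parsed and any(word in parsed[1] for word in keywords):
            hits.append(parsed)
    return hits

# One of the lines quoted above, copied verbatim.
sample = [
    'Aug 18 13:34:16 na-arc-5 dockerd: time="2022-08-18T13:34:15.929428007-04:00" '
    'level=info msg="memberlist: Suspect 886f1454e2b4 has failed, no acks received"',
]

for level, msg in swarm_entries(sample):
    print(level, msg)
```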
- I moved the rh-download service back to na-arc-5 with `docker service update --force production_requesthandler_download` and wget performance dropped back to about 32KB/s.
- I moved rh-download from na-arc-5 back to na-arc-2 by draining na-arc-5 with `docker node update --availability drain na-arc-5` and wget performance was back to about 100MB/s. I ran it four times to make sure.
- Then I moved rh-download from na-arc-2 to na-arc-1 by forcing a rebalance again with `docker service update --force production_requesthandler_download`. It did not return to na-arc-5 because na-arc-5 was still drained. wget performance was still about 100MB/s. I ran it four times to make sure. I wanted to confirm that the performance was good because rh-download wasn't on na-arc-5 and not because it was on na-arc-2. I think I have shown that. So, the question is why is performance so poor when rh-download is on na-arc-5?
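The isolation argument above can be summarized by recording which node hosted rh-download for each wget run and flagging the hosts whose rate falls below a cutoff. A minimal sketch using the approximate rates from these notes; the 1MB/s cutoff and the helper name are assumptions made for illustration.

```python
KB = 1000          # decimal units, an assumption
MB = 1000 * KB

def slow_hosts(runs, threshold_bytes_per_s):
    """Return the hosts whose measured rate is below the threshold.

    runs is a list of (host running rh-download, rate in bytes/s).
    """
    slow = {host for host, rate in runs if rate < threshold_bytes_per_s}
    return sorted(slow)

# Approximate wget rates reported above for each rh-download placement.
runs = [
    ("na-arc-5", 32 * KB),
    ("na-arc-2", 100 * MB),
    ("na-arc-1", 100 * MB),
]

print(slow_hosts(runs, 1 * MB))
```

Only the placement on na-arc-5 falls below the cutoff, which matches the conclusion that the slowdown follows that node rather than the service.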
Table 1
Production docker swarm iperf tests measured in Gb/s.
...