Numbers are in hours
CPUs at CHTC are noticeably slower than CPUs at NRAO. For example, CHTC's set of c20xx machines (e20{03..18}) each have two Intel Xeon Silver 4114 2.20GHz processors and 0.5TB to 1TB of memory, while their large memory machines (mem3, mem2001, mem2002) each have four Intel Xeon E7-4820 v4 2.00GHz processors and 2TB to 4TB of memory.
CASA-6
Small Data Set
Small data set VLASS1.2.sb36484946.eb36542800.58574.4235612037_ptgfix_split_smaller.ms with full parameters, using cfcache from local disk.
Step | NRAO (run06) | NRAO/CHTC | NRAO/AWS |
---|---|---|---|
01 | 1.0 | | |
05 | 4.8 | | |
06 | 1.0 | | |
07 | 1.2 | | |
15 | 4.0 | | |
16 | 0.8 | | |
23 | 3.0 | | |
24 | 1.3 | | |
Total | 17.1 | | |
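The Total row can be cross-checked by summing the per-step times. A small sketch in Python, with the NRAO (run06) column copied from the table above; the variable names and the 15% threshold are illustrative choices, not part of the pipeline:

```python
# Per-step run times in hours for NRAO (run06), keyed by step number.
nrao_run06 = {
    "01": 1.0, "05": 4.8, "06": 1.0, "07": 1.2,
    "15": 4.0, "16": 0.8, "23": 3.0, "24": 1.3,
}

# Round to one decimal to match the precision used in the table.
total = round(sum(nrao_run06.values()), 1)
print(f"Total: {total} hours")

# Steps that each account for at least 15% of the total run time
# (the threshold is an arbitrary choice for illustration).
dominant = [step for step, hours in nrao_run06.items() if hours / total >= 0.15]
print("Dominant steps:", ", ".join(dominant))
```

With these numbers the sum matches the 17.1 hours in the Total row, and steps 05, 15 and 23 account for most of the run time.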
CASA-5
Large Data Set
Large data set VLASS1.2.sb36491855.eb36574404.58585.53016267361_datacolumn.ms with full parameters, using cfcache from local disk.
Step | NRAO (steps-all-parallel9) | NRAO/CHTC (steps-all-parallel17) | NRAO/AWS (steps-all-parallel16) | |
---|---|---|---|---|
01 | 9.4 | 27.7 (ran at CHTC) | 6.4 | 8.912.3 |
05 | 60.2 | 171.5 | 67.365.9 | |
06 | 24 | 27.9 | 24.48 | |
07 | 11.8 | 218.4 | 11.2 | |
15 | 55.2 | 161.0 | 5861.96 | |
16 | 6.1 | 4.0 | 5.7 | |
23 | 230.8 | 140.1 | ||
24 | 46 | 54.4 | ||
Total | 443.5 | 373.9 |
Small Data Set
Small data set test.ms with full parameters, using cfcache from local disk.
Step | NRAO (steps-all-parallel21) | NRAO/CHTC (steps-all-parallel19) | NRAO/AWS (steps-all-parallel20) |
---|---|---|---|
01 | 1.2 | 1.2 | 1.2 |
05 | 5.5 | 13.2 | 5.3 |
06 | 1.6 | 2.2 | 1.8 |
07 | 1.3 | 1.4 | 1.4 |
15 | 5.4 | 12.0 | 5.2 |
16 | 1.0 | 1.0 | 1.0 |
23 | 6.6 | 13.9 | 6.4 |
24 | 3.5 | 7.2 | 3.4 |
Total | 26.1 | 52.1 | 25.0 |
Possible reasons for the slowdown at CHTC:
- cfcache on cephfs
- Slower CPUs
- Multiple users
- Hyperthreading
I ran a small data set test with full parameters at CHTC that copied cfcache from /staging to local disk and step05 took only 16.7 hours instead of the 56.8 hours it had taken using cfcache on /staging.
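The gain from copying cfcache to local disk can be expressed as a speedup factor. A minimal sketch in Python, using only the two step05 times quoted above; the function name `speedup` is illustrative and not part of any pipeline:

```python
def speedup(before_hours: float, after_hours: float) -> float:
    """Return how many times faster the second run was than the first."""
    if after_hours <= 0:
        raise ValueError("after_hours must be positive")
    return before_hours / after_hours


# step05 at CHTC: cfcache read from /staging vs. copied to local disk.
staging_hours = 56.8
local_hours = 16.7

factor = speedup(staging_hours, local_hours)
saved = staging_hours - local_hours
print(f"step05 speedup: {factor:.1f}x, {saved:.1f} hours saved")
```

For these two runs the factor comes out at roughly 3.4, which suggests that where cfcache lives accounts for a large share of the step05 slowdown, though not necessarily all of it.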
Small data set test.ms with full parameters, using the 8k (right) cfcache and copying the cfcache to local disk at CHTC.
Wallclock time from start to finish for the small data set (test.ms)
- NRAO: 26.5
- NRAO/CHTC: 111.5, so this job spent about as much time waiting for nodes as running
- NRAO/AWS: 27.3
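The time each job spent not running steps is the wallclock time minus the summed step run times. A rough sketch in Python, assuming the run totals for these three runs are 26.1, 52.1 and 25.0 hours (as read from the Total row of the small data set table); the dictionary names are illustrative:

```python
# Wallclock hours from start to finish, from the list above.
wallclock = {"NRAO": 26.5, "NRAO/CHTC": 111.5, "NRAO/AWS": 27.3}
# Summed step run times in hours (assumed, taken from the Total row).
run_total = {"NRAO": 26.1, "NRAO/CHTC": 52.1, "NRAO/AWS": 25.0}

# Hours not accounted for by running steps, e.g. waiting for nodes.
overhead = {
    site: round(wallclock[site] - run_total[site], 1) for site in wallclock
}

for site, hours in overhead.items():
    share = hours / wallclock[site]
    print(f"{site}: {hours} h not running ({share:.0%} of wallclock)")
```

Under that assumption the CHTC job spent about 59.4 of its 111.5 wallclock hours not running, compared with 0.4 hours at NRAO and 2.3 hours at AWS.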
...