All times are in hours.
Large data set VLASS1.2.sb36491855.eb36574404.58585.53016267361_datacolumn.ms with full parameters
Step | NRAO (steps-all-parallel9) | NRAO/CHTC (steps-all-parallel10) | NRAO/AWS (steps-all-parallel16) |
---|---|---|---|
01 | 9.4 | 9.2 | 12.3 |
05 | 60.2 | killed at 72 hours | 65.9 |
06 | 24 | 24.4 | |
07 | 11.8 | 14.4 (leap second and timeout errors) | |
15 | 55.2 | 0.0 (exception error) | |
16 | 6.1 | | |
23 | 230.8 | | |
24 | 46 | | |
Total | 443.5 | | |
Small data set test.ms with full parameters
Step | NRAO (steps-all-parallel12) | NRAO/CHTC (steps-all-parallel15) | NRAO/AWS (steps-all-parallel14) |
---|---|---|---|
01 | 1.8 | 2.0 | 1.9 |
05 | 8.6 | 56.8 | 5.1 |
06 | 3.0 | 3.9 | 2.0 |
07 | 2.0 | 2.3 | 2.2 |
15 | 6.9 | 56.3 | 4.3 |
16 | 1.4 | 1.7 | 1.4 |
23 | 8.3 | 47.8 | 5.3 |
24 | 14.1 | 66.0 | 16.8 |
Total | 46.1 | 236.8 | 39.0 |
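To make the per-step differences in the small data set easier to compare, the following is a minimal sketch that divides each site's time by the NRAO time for the same step. The step times are copied from the table above; the names `hours`, `slowdown`, and `ratios` are illustrative and not part of any existing tooling.

```python
# Wall-clock hours per step for the small data set (test.ms),
# copied from the table above.
hours = {
    "01": {"NRAO": 1.8, "CHTC": 2.0, "AWS": 1.9},
    "05": {"NRAO": 8.6, "CHTC": 56.8, "AWS": 5.1},
    "06": {"NRAO": 3.0, "CHTC": 3.9, "AWS": 2.0},
    "07": {"NRAO": 2.0, "CHTC": 2.3, "AWS": 2.2},
    "15": {"NRAO": 6.9, "CHTC": 56.3, "AWS": 4.3},
    "16": {"NRAO": 1.4, "CHTC": 1.7, "AWS": 1.4},
    "23": {"NRAO": 8.3, "CHTC": 47.8, "AWS": 5.3},
    "24": {"NRAO": 14.1, "CHTC": 66.0, "AWS": 16.8},
}


def slowdown(step_hours, baseline="NRAO"):
    """Return each non-baseline site's time divided by the baseline time."""
    base = step_hours[baseline]
    return {
        site: round(t / base, 2)
        for site, t in step_hours.items()
        if site != baseline
    }


# Ratio > 1 means slower than NRAO, ratio < 1 means faster.
ratios = {step: slowdown(h) for step, h in hours.items()}

for step, r in ratios.items():
    print(f"step {step}: CHTC {r['CHTC']:.2f}x, AWS {r['AWS']:.2f}x")
```

With these numbers, steps 05, 15, 23, and 24 run roughly 4.7 to 8.2 times slower at CHTC than at NRAO, while steps 01, 06, 07, and 16 stay within about 1.1 to 1.3 times.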
CPUs at CHTC are noticeably slower than CPUs at NRAO. For example, their set of c20xx machines (e20{03..18}) each have two Intel Xeon Silver 4114 2.20 GHz processors and 0.5 TB to 1 TB of memory, while their large-memory machines (mem3, mem2001, mem2002) each have four Intel Xeon E7-4820 v4 2.00 GHz processors and 2 TB to 4 TB of memory. Possible reasons for this slowdown:
- cfcache on cephfs
- Slower CPUs
- Multiple users
- Hyperthreading
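I ran a small data set test with full parameters at CHTC that first copied the cfcache from /staging to local disk. With the cfcache on local disk, step 05 took only 16.7 hours, compared with 56.8 hours in the CHTC run tabulated above (steps-all-parallel15).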
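The staging step described above could look something like the sketch below. It is not the script used for the test. The function name `stage_cfcache` and the example paths are hypothetical, and it assumes the cfcache is an ordinary directory tree that can be copied with the Python standard library.

```python
import shutil
from pathlib import Path


def stage_cfcache(src, scratch_root):
    """Copy the cfcache directory `src` into `scratch_root`.

    Returns the path of the local copy. If a copy with the same name
    already exists under `scratch_root`, it is reused rather than
    copied again.
    """
    src = Path(src)
    dest = Path(scratch_root) / src.name
    if not dest.exists():
        # copytree creates `dest` and copies the whole directory tree.
        shutil.copytree(src, dest)
    return dest


# Hypothetical usage inside a job wrapper (both paths are placeholders):
#   local_cache = stage_cfcache("/staging/<user>/<cfcache dir>", "<local scratch dir>")
# The imaging step would then be pointed at `local_cache` instead of
# the copy on /staging.
```

Copying once at the start of the job trades a one-time transfer for local reads during imaging, which fits the observation that step 05 ran much faster with the cfcache on local disk.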