Tracking Brian Kent's issue. This page tracks an apparent slowdown from CASA-5 to CASA-6 for VLASS calibration: https://open-jira.nrao.edu/browse/PIPE-568
Comparing CASA-5 and CASA-6 (casa-pipeline-validation-8) across the two CPU types available for batch processing in NM and CV shows that the newer CPUs (E5-2640v3) run a small, simple calibration job (6.7GB) about 1.25 times faster than the older CPUs (E5-2670), and that CASA-6 is slower in every case.
...
There was no significant run-time difference between NM and CV for similar hardware and software. Results are in minutes.
Here is the full pipeline script I have used for all of these tests: casa_pipescript.py. For some tests, I commented out all but hifv_importdata.
Full, serial pipeline with small dataset
RHEL7 - 6.7GB dataset with NM Lustre-2.5.5 (results are in minutes)
CASA | nmpost051 (E5-2640v3) | cvpost020 (E5-2640v3) | nmpost038 (E5-2670) | cvpost003 (E5-2670) |
---|---|---|---|---|
5 | 114, 117 | 110, 111 | 144, 143 | 140, 141 |
6 | 156*, 164* | 156*, 158* | 200*, 201* | 197*, 199* |
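The "about 1.25 times faster" figure can be checked directly from the table above. The following is a minimal sketch that averages the two runs per cell and computes the old-CPU/new-CPU ratio and the CASA-6/CASA-5 ratio. The variable and function names are illustrative only, and the CASA-6 numbers come from runs that completed with tclean() errors, so they are indicative rather than exact.

```python
from statistics import mean

# Run times in minutes, copied from the table above
# (RHEL7, 6.7GB dataset, NM Lustre-2.5.5). Keys are host -> CASA version.
# All CASA-6 runs completed with tclean() errors.
runs = {
    "nmpost051": {5: [114, 117], 6: [156, 164]},  # E5-2640v3
    "cvpost020": {5: [110, 111], 6: [156, 158]},  # E5-2640v3
    "nmpost038": {5: [144, 143], 6: [200, 201]},  # E5-2670
    "cvpost003": {5: [140, 141], 6: [197, 199]},  # E5-2670
}

# (site, host with newer CPU, host with older CPU)
pairs = [("NM", "nmpost051", "nmpost038"), ("CV", "cvpost020", "cvpost003")]

# Old-CPU time divided by new-CPU time, per site and CASA version.
cpu_speedup = {
    (site, casa): mean(runs[old][casa]) / mean(runs[new][casa])
    for site, new, old in pairs
    for casa in (5, 6)
}

# CASA-6 time divided by CASA-5 time, per host.
casa6_slowdown = {
    host: mean(times[6]) / mean(times[5]) for host, times in runs.items()
}

for (site, casa), ratio in cpu_speedup.items():
    print(f"{site} CASA-{casa}: E5-2640v3 is {ratio:.2f}x faster than E5-2670")
for host, ratio in casa6_slowdown.items():
    print(f"{host}: CASA-6 takes {ratio:.2f}x as long as CASA-5")
```

With these numbers the CPU ratio falls between roughly 1.24 and 1.27, and CASA-6 takes roughly 1.39 to 1.42 times as long as CASA-5 on every host.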
RHEL7 - 6.7GB dataset after the NM upgrade to Lustre-2.10.8; CV results are copied from the previous test (results are in minutes)
CASA | nmpost051 NM (E5-2640v3) | cvpost020 CV (E5-2640v3) | nmpost048 NM (E5-2670) | cvpost003 CV (E5-2670) |
---|---|---|---|---|
5 | 113, 110 | 110, 111 | 142, 141 | 140, 141 |
6 | 155* | 156*, 158* | 198* | 197*, 199* |
...
Mar. 3, 2020 krowe: I tried the nmpost051-casa6-rhel7 with the latest casa-pipeline-validation-17. The run-time was the same, as were the tclean() errors.
"*" Means it completed with tclean() errors
Full, new, serial pipeline with large dataset
Mar. 17, 2020 I started using the same pipeline script that Brian is currently using.
RHEL7 - 350GB dataset with NM Lustre-2.10.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)
CASA | nmpost051 NM (E5-2640v3) | cvpost020 CV (E5-2640v3) | nmpost038 NM (E5-2670) | cvpost003 CV (E5-2670) |
---|---|---|---|---|
5 | 192 | 196 | 239 | 251 |
6 | 328 | 364, 378 | 411, 427 | 453 |
5 | 3,350*^ | 3,362*^ | 4,605*^ | 4,480*^ |
6 | 4,016* | 3,943* | 5,671* | 5,253* |
"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"
"^" Means "SEVERE setjy No rows were selected"
Full, new, serial pipeline with large dataset and profiling metrics
Mar. 17, 2020 I started using the same pipeline script that Brian is currently using. Running just hifv_importdata() on a larger dataset (350GB) shows that nmpost nodes run about 2% to 10% faster than similar cvpost nodes, with CASA-6 slower in every case.
RHEL7 - 350GB dataset with NM Lustre-2.10.8.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)
CASA | nmpost051 NM (E5-2640v3) | cvpost020 CV (E5-2640v3) | nmpost048 NM (E5-2670) | cvpost003 CV (E5-2670) |
---|---|---|---|---|
5 | 187 | 196 | 244 | 251 |
6 | | 364, 378 | | 453 |
Running both hifv_importdata() and hifv_hanning().
RHEL6 - with NM Lustre-2.5.5
...
RHEL7 - with NM Lustre-2.5.5
...
Running entire pipeline with -n 8
RHEL7
...
"*" After 14 days of running the setjy task, and using Felipe's profiling metrics, I canceled the job.
Running entire pipeline with -n 9
RHEL7
5 | 3,326*^ | 4,485*^ |
6 | 4,172* | 5,572* |
"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"
"^" Means "SEVERE setjy No rows were selected"
Full, new serial pipeline with large dataset and times per pipeline task
Comparing two profiling jobs against one of Brian's jobs (/lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2) on the same hardware (E5-2670) in NM. Times were calculated from the CASA logs. Times are in minutes.
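The per-task times in this section were calculated from the CASA logs. The sketch below shows one way such a calculation could be done. It assumes each task is bracketed by log lines that start with a `YYYY-MM-DD HH:MM:SS` timestamp and contain `Begin Task: <name>` and `End Task: <name>`; the sample log text, the regular expression, and the function name are hypothetical, and the real log layout may differ between CASA versions.

```python
import re
from datetime import datetime

# Hypothetical log excerpt for illustration only; times are made up and
# the real CASA log layout may differ.
SAMPLE_LOG = """\
2020-04-08 10:00:00\tINFO\thifv_importdata::::casa\t##### Begin Task: hifv_importdata #####
2020-04-08 10:12:00\tINFO\thifv_importdata::::casa\t##### End Task: hifv_importdata #####
2020-04-08 10:12:05\tINFO\thifv_hanning::::casa\t##### Begin Task: hifv_hanning #####
2020-04-08 10:45:05\tINFO\thifv_hanning::::casa\t##### End Task: hifv_hanning #####
"""

# Assumed marker format: leading timestamp, then "Begin Task:" or "End Task:".
MARKER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?(Begin|End) Task: (\w+)"
)


def task_minutes(log_text):
    """Return a list of (task, minutes) pairs in the order tasks finished.

    A list is used instead of a dict because a task such as
    hifv_checkflag can run more than once in the same pipeline.
    """
    started = {}
    durations = []
    for line in log_text.splitlines():
        match = MARKER.match(line)
        if not match:
            continue
        stamp, kind, task = match.groups()
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == "Begin":
            started[task] = when
        elif task in started:
            elapsed = when - started.pop(task)
            durations.append((task, round(elapsed.total_seconds() / 60)))
    return durations


print(task_minutes(SAMPLE_LOG))
```

Rounding each task to whole minutes means the per-task values do not have to add up exactly to a total computed from the first and last timestamps.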
Large dataset (350GB) times are in minutes | CASA-5.6.3-9, Pipeline 43128 | CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-145-ge322387-dirty | CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-18-g2de4d78-dirty | CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-18-g2de4d78-dirty |
---|---|---|---|---|
Task | kent2-pr-c5-l-70 | kent2-pr-c6-l-70 | kent3b-no-c6-l-70 | CASA-6 Bkent |
hifv_importdata | 247 | 425 | 403 | 392 |
hifv_hanning | 175 | 188 | 334 | 460 |
hifv_flagdata | 272 | 323 | 374 | 452 |
hifv_vlasetjy | 75 | 199 | 255 | 357 |
hifv_priorcals | 254 | 281 | 539 | 494 |
hifv_testBPdcals | 74 | 84 | 98 | 123 |
hifv_flagbaddef | 0 | 1 | 0 | 0 |
hifv_checkflag | 68 | 70 | 69 | 69 |
hifv_semiFinalBPdcals | 75 | 153 | 154 | 154 |
hifv_checkflag | 189 | 254 | 250 | 253 |
hifv_solint | 66 | 89 | 105 | 105 |
hifv_fluxboot2 | 104 | 181 | 185 | 175 |
hifv_finalcals | 162 | 182 | 177 | 177 |
hifv_circfeedpolcal | 31 | 33 | 32 | 32 |
hifv_flagcal | 0 | 1 | 0 | 0 |
hifv_applycals | 205 | 212 | 358 | 437 |
hifv_checkflag | 1741 | 1840 | 2388 | 2930 |
hifv_statwt | 645 | 710 | 812 | 500 |
hifv_plotsummary | 101 | 346 | 350 | 350 |
TOTAL (minutes) | 4484 | 5573 | 6884 | 7460 |
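To see which tasks contribute most to the CASA-6 slowdown, the two profiling columns of the table above (kent2-pr-c5-l-70 and kent2-pr-c6-l-70) can be compared task by task. This is a minimal sketch using the values as listed; the variable names are illustrative, and tasks that took 0 minutes under CASA-5 get no ratio.

```python
# (task, CASA-5 minutes, CASA-6 minutes) for kent2-pr-c5-l-70 and
# kent2-pr-c6-l-70, copied from the table above in pipeline order.
tasks = [
    ("hifv_importdata", 247, 425),
    ("hifv_hanning", 175, 188),
    ("hifv_flagdata", 272, 323),
    ("hifv_vlasetjy", 75, 199),
    ("hifv_priorcals", 254, 281),
    ("hifv_testBPdcals", 74, 84),
    ("hifv_flagbaddef", 0, 1),
    ("hifv_checkflag", 68, 70),
    ("hifv_semiFinalBPdcals", 75, 153),
    ("hifv_checkflag", 189, 254),
    ("hifv_solint", 66, 89),
    ("hifv_fluxboot2", 104, 181),
    ("hifv_finalcals", 162, 182),
    ("hifv_circfeedpolcal", 31, 33),
    ("hifv_flagcal", 0, 1),
    ("hifv_applycals", 205, 212),
    ("hifv_checkflag", 1741, 1840),
    ("hifv_statwt", 645, 710),
    ("hifv_plotsummary", 101, 346),
]

# Build (extra minutes, step number, task, ratio) for each pipeline step.
# The step number keeps the three hifv_checkflag calls distinguishable.
rows = []
for step, (task, c5, c6) in enumerate(tasks, start=1):
    ratio = c6 / c5 if c5 else None  # no ratio when CASA-5 took 0 minutes
    rows.append((c6 - c5, step, task, ratio))

# Rank by added minutes, largest first.
ranked = sorted(rows, key=lambda row: row[0], reverse=True)
total_extra = sum(row[0] for row in rows)

for extra, step, task, ratio in ranked[:5]:
    ratio_text = f"{ratio:.2f}x" if ratio is not None else "n/a"
    print(f"step {step:2d} {task:22s} +{extra:4d} min ({ratio_text})")
print(f"total extra minutes: {total_extra}")
```

In this comparison hifv_plotsummary, hifv_importdata, and hifv_vlasetjy add the most time, together about half of the extra minutes.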
K. Scott finished three runs on Apr. 8, 2020 using Brian's large dataset (350GB), CASA-6.0.0.23-pipeline-validation-17, and Pipeline master-v0.1-18-g2de4d78-dirty. The runs were separated by about an hour each. Each job requested 1 node with 8 cores and 96GB, which is essentially one NUMA node. system.resources.memory was unset and _cf.validate_parameters = False. (Times are in minutes)
Task | kent3a-no-c6-l-70 | kent3b-no-c6-l-70 | kent3c-no-c6-l-70 |
---|---|---|---|
hifv_importdata | 410 | 403 | 407 |
hifv_hanning | 364 | 334 | 359 |
hifv_flagdata | 381 | 374 | 386 |
hifv_vlasetjy | 263 | 255 | 256 |
hifv_priorcals | 513 | 539 | 511 |
hifv_testBPdcals | 97 | 98 | 98 |
hifv_flagbaddef | 0 | 0 | 0 |
hifv_checkflag | 68 | 69 | 68 |
hifv_semiFinalBPdcals | 153 | 154 | 152 |
hifv_checkflag | 251 | 250 | 250 |
hifv_solint | 105 | 105 | 106 |
hifv_fluxboot2 | 174 | 185 | 174 |
hifv_finalcals | 178 | 177 | 180 |
hifv_circfeedpolcal | 31 | 32 | 31 |
hifv_flagcal | 0 | 0 | 0 |
hifv_applycals | 353 | 358 | 366 |
hifv_checkflag | 2501 | 2388 | 2302 |
hifv_statwt | 832 | 812 | 806 |
hifv_plotsummary | 348 | 350 | 345 |
TOTAL (minutes) | 7023 | 6884 | 6799 |
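Because the three runs above use identical software and data, they give a rough idea of run-to-run variation. The following sketch computes the spread (maximum minus minimum) per pipeline step from the table values; the names are illustrative, and three runs are too few for anything more than a rough estimate.

```python
from statistics import mean

# (task, kent3a, kent3b, kent3c) in minutes, copied from the table above.
runs = [
    ("hifv_importdata", 410, 403, 407),
    ("hifv_hanning", 364, 334, 359),
    ("hifv_flagdata", 381, 374, 386),
    ("hifv_vlasetjy", 263, 255, 256),
    ("hifv_priorcals", 513, 539, 511),
    ("hifv_testBPdcals", 97, 98, 98),
    ("hifv_flagbaddef", 0, 0, 0),
    ("hifv_checkflag", 68, 69, 68),
    ("hifv_semiFinalBPdcals", 153, 154, 152),
    ("hifv_checkflag", 251, 250, 250),
    ("hifv_solint", 105, 105, 106),
    ("hifv_fluxboot2", 174, 185, 174),
    ("hifv_finalcals", 178, 177, 180),
    ("hifv_circfeedpolcal", 31, 32, 31),
    ("hifv_flagcal", 0, 0, 0),
    ("hifv_applycals", 353, 358, 366),
    ("hifv_checkflag", 2501, 2388, 2302),
    ("hifv_statwt", 832, 812, 806),
    ("hifv_plotsummary", 348, 350, 345),
]

# Spread = slowest run minus fastest run, also as a percentage of the mean.
spreads = []
for step, (task, *minutes) in enumerate(runs, start=1):
    avg = mean(minutes)
    spread = max(minutes) - min(minutes)
    percent = 100 * spread / avg if avg else 0.0
    spreads.append((step, task, spread, percent))

# Show the five steps with the largest absolute spread.
for step, task, spread, percent in sorted(
    spreads, key=lambda s: s[2], reverse=True
)[:5]:
    print(f"step {step:2d} {task:22s} spread {spread:3d} min ({percent:.1f}%)")
```

The final hifv_checkflag step varies by 199 minutes across the three runs, which accounts for most of the 224-minute difference between the slowest and fastest totals.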
...