
Page for tracking an apparent slowdown between CASA-5 and CASA-6 for VLASS calibration (Brian Kent's issue): https://open-jira.nrao.edu/browse/PIPE-568

Comparing CASA-5 and CASA-6 (casa-pipeline-validation-8) across the two CPU types available for batch processing in NM and CV shows that the newer CPUs (E5-2640v3) run a small calibration job (6.7GB) about 1.25 times faster than the older CPUs (E5-2670), with CASA-6 slower in every case.  There was no significant run-time difference between NM and CV for similar hardware and software.  Results are in minutes.

Here is the full pipeline script I used for all of these tests: casa_pipescript.py. For some tests, I commented out all but hifv_importdata.

Full, serial pipeline with small dataset

RHEL7 - 6.7GB dataset with NM Lustre-2.5.5 (results are in minutes)

CASA | nmpost051 (E5-2640v3) | cvpost020 (E5-2640v3) | nmpost038 (E5-2670) | cvpost003 (E5-2670)
5    | 114, 117              | 110, 111              | 144, 143            | 140, 141
6    | 156*, 164*            | 156*, 158*            | 200*, 201*          | 197*, 199*
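As a quick check on the ratios quoted above, the run times in this table can be averaged per CPU type and CASA version. This is only a sketch: it pools the NM and CV runs for each CPU type, which assumes (as noted above) that there is no significant difference between the two sites, and the function names are illustrative.

```python
from statistics import mean

# Run times in minutes, copied from the 6.7GB RHEL7 table above.
# NM and CV runs on the same CPU type are pooled (an assumption).
runs = {
    ("5", "E5-2640v3"): [114, 117, 110, 111],
    ("5", "E5-2670"): [144, 143, 140, 141],
    ("6", "E5-2640v3"): [156, 164, 156, 158],
    ("6", "E5-2670"): [200, 201, 197, 199],
}


def mean_minutes(casa, cpu):
    """Mean run time in minutes for one CASA version on one CPU type."""
    return mean(runs[(casa, cpu)])


def cpu_speedup(casa):
    """How many times faster the E5-2640v3 is than the E5-2670."""
    return mean_minutes(casa, "E5-2670") / mean_minutes(casa, "E5-2640v3")


def casa6_slowdown(cpu):
    """How many times longer CASA-6 takes than CASA-5 on one CPU type."""
    return mean_minutes("6", cpu) / mean_minutes("5", cpu)


for casa in ("5", "6"):
    print(f"CASA-{casa}: E5-2640v3 is {cpu_speedup(casa):.2f}x faster")
for cpu in ("E5-2640v3", "E5-2670"):
    print(f"{cpu}: CASA-6 takes {casa6_slowdown(cpu):.2f}x as long as CASA-5")
```

With these numbers the CPU speedup comes out at roughly 1.26 for both CASA versions, and CASA-6 takes roughly 1.4 times as long as CASA-5 on both CPU types.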


RHEL7 - 6.7GB dataset after NM upgrade Lustre-2.10.8 and CV results copied from last test (results are in minutes)

Just serial hifv_importdata() with large dataset


Mar. 3, 2020 krowe: I reran the nmpost051-casa6-rhel7 test with the latest casa-pipeline-validation-17.  The run time was the same, as were the tclean() errors.

"*" Means the run completed with tclean() errors


Full, new, serial pipeline with large dataset

Mar. 17, 2020 I started using the same pipeline script that Brian is currently using.

RHEL7 - 350GB dataset with NM Lustre-2.10.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)

CASA | nmpost051 NM (E5-2640v3) | cvpost020 CV (E5-2640v3) | nmpost038 NM (E5-2670) | cvpost003 CV (E5-2670)
5    | 192                      | 196                      | 239                    | 251
6    | 328                      | 364, 378                 | 411, 427               | 453
     | 3,350*^                  | 3,362*^                  | 4,605*^                | 4,480*^
6    | 4,016*                   | 3,943*                   | 5,671*                 | 5,253*

"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"

"^" Means "SEVERE setjy No rows were selected"


Full, new, serial pipeline with large dataset and profiling metrics

Mar. 17, 2020 I started using the same pipeline script that Brian is currently using. Running just hifv_importdata() on a larger dataset (350GB) shows that nmpost nodes run about 2% to 10% faster than similar cvpost nodes, with CASA-6 slower in every case.

RHEL7 - 350GB dataset with NM Lustre-2.10.8.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)

CASA | nmpost051 NM (E5-2640v3) | cvpost020 CV (E5-2640v3) | nmpost048 NM (E5-2670) | cvpost003 CV (E5-2670)
5    | 187                      | 196                      | 244                    | 251
6    |                          | 364, 378                 |                        | 453

Both serial hifv_importdata() and hifv_hanning().

RHEL6 - with NM Lustre-2.5.5

...

RHEL7 - with NM Lustre-2.5.5

...

     | 3,326*^ [Image] | 4,485*^ [Image]
6    | 4,172* [Image]  | 5,572* [Image]


"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"

"^" Means "SEVERE setjy No rows were selected"


Full, new serial pipeline with large dataset and times per pipeline task

Comparing two profiling jobs against one of Brian's jobs (/lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2) on the same hardware (E5-2670) in NM.  Times were calculated from the CASA logs.  Times are in minutes.
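One way to derive per-task times from a CASA log is to pair each task's begin and end markers and subtract the timestamps. The sketch below is not the script used for these tests. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and that tasks are bracketed by lines containing `Begin Task: <name>` and `End Task: <name>`; the sample lines and their timestamps are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line layout: timestamp first, then somewhere later on the line
# "Begin Task: <name>" or "End Task: <name>".
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?\b(Begin|End) Task: (\w+)"
)


def task_minutes(lines, prefix="hifv_"):
    """Return [(task, minutes), ...] in completion order.

    Only tasks whose name starts with `prefix` are kept, so CASA tasks
    called from inside a pipeline task are ignored. A list (not a dict)
    is returned because a task such as hifv_checkflag can run several times.
    """
    started = {}
    durations = []
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        kind, task = match.group(2), match.group(3)
        if not task.startswith(prefix):
            continue
        if kind == "Begin":
            started[task] = stamp
        elif task in started:
            elapsed = stamp - started.pop(task)
            durations.append((task, elapsed.total_seconds() / 60))
    return durations


# Hypothetical log excerpt (timestamps invented for this example).
sample_log = [
    "2020-04-08 10:00:00\tINFO\thifv_importdata::::casa\t##### Begin Task: hifv_importdata #####",
    "2020-04-08 10:01:00\tINFO\timportasdm::::casa\t##### Begin Task: importasdm #####",
    "2020-04-08 16:40:00\tINFO\timportasdm::::casa\t##### End Task: importasdm #####",
    "2020-04-08 16:45:00\tINFO\thifv_importdata::::casa\t##### End Task: hifv_importdata #####",
]

for task, minutes in task_minutes(sample_log):
    print(f"{task}: {minutes:.0f} minutes")
```

For a real run, the list of lines could come from iterating over the open log file instead of `sample_log`.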

Large dataset (350GB), times are in minutes

Job               | CASA                                 | Pipeline
kent2-pr-c5-l-70  | CASA-5.6.3-9                         | Pipeline 43128
kent2-pr-c6-l-70  | CASA-6.0.0.23-pipeline-validation-17 | Pipeline master-v0.1-145-ge322387-dirty
kent3b-no-c6-l-70 | CASA-6.0.0.23-pipeline-validation-17 | Pipeline master-v0.1-18-g2de4d78-dirty
CASA-6 Bkent      | CASA-6.0.0.23-pipeline-validation-17 | Pipeline master-v0.1-18-g2de4d78-dirty

Task                  | kent2-pr-c5-l-70 | kent2-pr-c6-l-70 | kent3b-no-c6-l-70 | CASA-6 Bkent
hifv_importdata       | 247              | 425              | 403               | 392
hifv_hanning          | 175              | 188              | 334               | 460
hifv_flagdata         | 272              | 323              | 374               | 452
hifv_vlasetjy         | 75               | 199              | 255               | 357
hifv_priorcals        | 254              | 281              | 539               | 494
hifv_testBPdcals      | 74               | 84               | 98                | 123
hifv_flagbaddef       | 0                | 1                | 0                 | 0
hifv_checkflag        | 68               | 70               | 69                | 69
hifv_semiFinalBPdcals | 75               | 153              | 154               | 154
hifv_checkflag        | 189              | 254              | 250               | 253
hifv_solint           | 66               | 89               | 105               | 105
hifv_fluxboot2        | 104              | 181              | 185               | 175
hifv_finalcals        | 162              | 182              | 177               | 177
hifv_circfeedpolcal   | 31               | 33               | 32                | 32
hifv_flagcal          | 0                | 1                | 0                 | 0
hifv_applycals        | 205              | 212              | 358               | 437
hifv_checkflag        | 1741             | 1840             | 2388              | 2930
hifv_statwt           | 645              | 710              | 812               | 500
hifv_plotsummary      | 101              | 346              | 350               | 350
TOTAL (minutes)       | 4484             | 5573             | 6884              | 7460
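A rough way to see where the extra time goes is to rank the tasks by the minutes added between the two profiling jobs, kent2-pr-c5-l-70 and kent2-pr-c6-l-70. The sketch below uses the values from the table above. Note that the two jobs also use different Pipeline versions, so the differences are not attributable to CASA alone; the helper name is illustrative.

```python
# (task, kent2-pr-c5-l-70 minutes, kent2-pr-c6-l-70 minutes), in pipeline order.
TIMES = [
    ("hifv_importdata", 247, 425),
    ("hifv_hanning", 175, 188),
    ("hifv_flagdata", 272, 323),
    ("hifv_vlasetjy", 75, 199),
    ("hifv_priorcals", 254, 281),
    ("hifv_testBPdcals", 74, 84),
    ("hifv_flagbaddef", 0, 1),
    ("hifv_checkflag", 68, 70),
    ("hifv_semiFinalBPdcals", 75, 153),
    ("hifv_checkflag", 189, 254),
    ("hifv_solint", 66, 89),
    ("hifv_fluxboot2", 104, 181),
    ("hifv_finalcals", 162, 182),
    ("hifv_circfeedpolcal", 31, 33),
    ("hifv_flagcal", 0, 1),
    ("hifv_applycals", 205, 212),
    ("hifv_checkflag", 1741, 1840),
    ("hifv_statwt", 645, 710),
    ("hifv_plotsummary", 101, 346),
]


def added_minutes(times):
    """Sort tasks by the minutes the CASA-6 job adds over the CASA-5 job."""
    deltas = [(task, c6 - c5) for task, c5, c6 in times]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


ranked = added_minutes(TIMES)
total_added = sum(delta for _, delta in ranked)

print(f"Total added: {total_added} minutes")
for task, delta in ranked[:4]:
    share = delta / total_added
    print(f"{task:<24}{delta:>5} min  ({share:.0%} of the increase)")
```

In this comparison hifv_plotsummary, hifv_importdata and hifv_vlasetjy add the most time, while the final hifv_checkflag dominates the absolute run time in both jobs.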




K. Scott finished three runs on Apr. 8, 2020, separated by about an hour each, using Brian's large dataset (350GB), CASA-6.0.0.23-pipeline-validation-17 and Pipeline master-v0.1-18-g2de4d78-dirty.  Each job requested 1 node with 8 cores and 96GB of memory, which is essentially one NUMA node.  system.resources.memory was unset and _cf.validate_parameters = False.  (Times are in minutes)

Task                  | kent3a-no-c6-l-70 | kent3b-no-c6-l-70 | kent3c-no-c6-l-70
hifv_importdata       | 410               | 403               | 407
hifv_hanning          | 364               | 334               | 359
hifv_flagdata         | 381               | 374               | 386
hifv_vlasetjy         | 263               | 255               | 256
hifv_priorcals        | 513               | 539               | 511
hifv_testBPdcals      | 97                | 98                | 98
hifv_flagbaddef       | 0                 | 0                 | 0
hifv_checkflag        | 68                | 69                | 68
hifv_semiFinalBPdcals | 153               | 154               | 152
hifv_checkflag        | 251               | 250               | 250
hifv_solint           | 105               | 105               | 106
hifv_fluxboot2        | 174               | 185               | 174
hifv_finalcals        | 178               | 177               | 180
hifv_circfeedpolcal   | 31                | 32                | 31
hifv_flagcal          | 0                 | 0                 | 0
hifv_applycals        | 353               | 358               | 366
hifv_checkflag        | 2501              | 2388              | 2302
hifv_statwt           | 832               | 812               | 806
hifv_plotsummary      | 348               | 350               | 345
TOTAL (minutes)       | 7023              | 6884              | 6799
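To get a feel for the run-to-run variation across these three identical jobs, the spread (maximum minus minimum) can be computed per task and for the summed totals. This is a sketch using the values from the table above; the helper name is illustrative. The per-task sums come out 1 to 2 minutes below the listed TOTAL values, presumably because the table entries are rounded.

```python
from statistics import mean

RUNS = ("kent3a-no-c6-l-70", "kent3b-no-c6-l-70", "kent3c-no-c6-l-70")

# (task, minutes for each run in RUNS order), in pipeline order.
TIMES = [
    ("hifv_importdata", (410, 403, 407)),
    ("hifv_hanning", (364, 334, 359)),
    ("hifv_flagdata", (381, 374, 386)),
    ("hifv_vlasetjy", (263, 255, 256)),
    ("hifv_priorcals", (513, 539, 511)),
    ("hifv_testBPdcals", (97, 98, 98)),
    ("hifv_flagbaddef", (0, 0, 0)),
    ("hifv_checkflag", (68, 69, 68)),
    ("hifv_semiFinalBPdcals", (153, 154, 152)),
    ("hifv_checkflag", (251, 250, 250)),
    ("hifv_solint", (105, 105, 106)),
    ("hifv_fluxboot2", (174, 185, 174)),
    ("hifv_finalcals", (178, 177, 180)),
    ("hifv_circfeedpolcal", (31, 32, 31)),
    ("hifv_flagcal", (0, 0, 0)),
    ("hifv_applycals", (353, 358, 366)),
    ("hifv_checkflag", (2501, 2388, 2302)),
    ("hifv_statwt", (832, 812, 806)),
    ("hifv_plotsummary", (348, 350, 345)),
]


def spread(values):
    """Difference between the slowest and fastest run, in minutes."""
    return max(values) - min(values)


# Sum each run's column to get its total run time.
totals = [sum(minutes[i] for _, minutes in TIMES) for i in range(len(RUNS))]
total_spread = spread(totals)
relative_spread = total_spread / mean(totals)

# Task invocation with the largest spread across the three runs.
widest = max(((task, spread(minutes)) for task, minutes in TIMES),
             key=lambda item: item[1])

print(f"Totals: {totals}")
print(f"Spread of totals: {total_spread} min ({relative_spread:.1%} of the mean)")
print(f"Largest per-task spread: {widest[0]} at {widest[1]} min")
```

With these numbers the totals differ by about 3% of their mean, and most of that spread comes from the final hifv_checkflag call.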

...

Full parallel pipeline with -n 8

RHEL7

...

"*" After 14 days of running the setjy task, and using Felipe's profiling metrics, I canceled the job.

Full parallel pipeline with -n 9

RHEL7

...