Page for tracking an apparent slowdown between CASA-5 and CASA-6 for VLASS calibration: https://open-jira.nrao.edu/browse/PIPE-568

Comparing CASA-5 and CASA-6 (casa-pipeline-validation-8) across the two different CPUs available for batch processing in NM and CV shows that the newer CPUs (E5-2640v3) run a small calibration job (6.7GB) about 1.25 times faster than the old CPUs (E5-2670), and that CASA-6 is slower than CASA-5 in every case.  There was no significant run-time difference between NM and CV for similar hardware and software.  Results are in minutes.

Here is the full pipeline script I have used for all of these tests: casa_pipescript.py.  For some tests, I commented out all but hifv_importdata.
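For orientation, below is a minimal sketch of the shape of such a script: the serial hifv task sequence, reconstructed from the per-task tables further down this page rather than copied from the attached file.  The SDM path is a placeholder, per-task parameters are omitted, and only the checkflagmode='target-vlass' call discussed in the comments is shown explicitly.

    # casa_pipescript.py - minimal sketch (not the attached script); run inside a
    # CASA session started with the pipeline, e.g. casa --pipeline --nogui -c casa_pipescript.py
    h_init()
    try:
        hifv_importdata(vis=['<SDM+BDF path>'])        # e.g. the VLASS1.2 SDM referenced below
        hifv_hanning()
        hifv_flagdata()
        hifv_vlasetjy()
        hifv_priorcals()
        hifv_testBPdcals()
        hifv_flagbaddef()
        hifv_checkflag()                               # real script sets a checkflagmode here
        hifv_semiFinalBPdcals()
        hifv_checkflag()                               # and here
        hifv_solint()
        hifv_fluxboot2()
        hifv_finalcals()
        hifv_circfeedpolcal()
        hifv_flagcal()
        hifv_applycals()
        hifv_checkflag(checkflagmode='target-vlass')   # the long post-applycals flagging stage
        hifv_statwt()
        hifv_plotsummary()
    finally:
        h_save()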

Full, serial pipeline with small dataset

RHEL7 - 6.7GB dataset with NM Lustre-2.5.5 (results are in minutes)

CASA | nmpost051 (E5-2640v3) | cvpost020 (E5-2640v3) | nmpost038 (E5-2670) | cvpost003 (E5-2670)
5    | 114, 117              | 110, 111              | 144, 143            | 140, 141
6    | 156*, 164*            | 156*, 158*            | 200*, 201*          | 197*, 199*


RHEL7 - 6.7GB dataset after the NM upgrade to Lustre-2.10.8; CV results copied from the previous test (results are in minutes)

CASA | NM (E5-2640v3) | CV (E5-2640v3) | NM (E5-2670) | CV (E5-2670)
5    | 113, 110       | 110, 111       | 142, 141     | 140, 141
6    | 155*           | 156*, 158*     | 198*         | 197*, 199*

Mar. 3, 2020 krowe: I tried the nmpost051-casa6-rhel7 test with the latest casa-pipeline-validation-17.  The run-time was the same, as were the tclean() errors.

"*" Means it completed with tclean() errors


Full, new, serial pipeline with large dataset

Mar. 17, 2020 I started using the same pipeline script that Brian is currently using.

RHEL7 - 350GB dataset with NM Lustre-2.10.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)

CASA | NM (E5-2640v3) | CV (E5-2640v3) | NM (E5-2670) | CV (E5-2670)
5    | 3,350*^        | 3,362*^        | 4,605*^      | 4,480*^
6    | 4,016*         | 3,943*         | 5,671*       | 5,253*

"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"

"^" Means "SEVERE setjy No rows were selected"


Full, new, serial pipeline with large dataset and profiling metrics

Mar. 17, 2020 I started using the same pipeline script that Brian is currently using.

RHEL7 - 350GB dataset with NM Lustre-2.10.x, CASA-pipeline-5.6.3-9 or CASA 6.0.0.23a100.dev17 (results are in minutes)

"*" Means "SEVERE pipeline.hifv.tasks.flagging No flag summary statistics"

"^" Means "SEVERE setjy No rows were selected"


Full, new serial pipeline with large dataset and times per pipeline task

Comparing two profiling jobs against one of Brian's jobs (/lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2) on the same hardware (E5-2670) in NM, using the large dataset (350GB).  Times were calculated from the CASA logs and are in minutes.
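As a rough illustration of how per-task times can be pulled out of a casa-*.log (a hypothetical helper, not Felipe's profiling framework and not necessarily the exact procedure used here), the sketch below pairs "Begin Task" / "End Task" lines and sums elapsed minutes per task.  If the hifv_* stages also emit Begin/End Task lines, as they should when invoked as CASA tasks, the same pairing yields the per-stage rows in the table.

    # per_task_minutes.py - rough sketch, assuming the usual CASA log line format
    # "YYYY-MM-DD HH:MM:SS <tab> INFO <tab> ... Begin Task: <name>"; not the actual
    # profiling framework used for the plots referenced on this page.
    import re
    import sys
    from collections import defaultdict
    from datetime import datetime

    line_re = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s.*?(Begin|End) Task:\s*(\S+)')
    totals = defaultdict(float)   # task name -> total minutes
    open_calls = {}               # task name -> timestamp of the unmatched Begin

    with open(sys.argv[1]) as log:
        for line in log:
            m = line_re.search(line)
            if not m:
                continue
            stamp = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
            edge, task = m.group(2), m.group(3)
            if edge == 'Begin':
                open_calls[task] = stamp
            elif task in open_calls:
                totals[task] += (stamp - open_calls.pop(task)).total_seconds() / 60.0

    # Print tasks sorted by total time, largest first
    for task, minutes in sorted(totals.items(), key=lambda kv: -kv[1]):
        print('%-25s %8.1f' % (task, minutes))

It would be run once per log file, e.g. python per_task_minutes.py casa-<timestamp>.log.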

Column key for the table below:

  kent2-pr-c5-l-70:   CASA-5.6.3-9, Pipeline 43128
  kent2-pr-c6-l-70:   CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-145-ge322387-dirty
  kent3b-no-c6-l-70:  CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-18-g2de4d78-dirty
  CASA-6 Bkent:       CASA-6.0.0.23-pipeline-validation-17, Pipeline master-v0.1-18-g2de4d78-dirty

Task                  | kent2-pr-c5-l-70 | kent2-pr-c6-l-70 | kent3b-no-c6-l-70 | CASA-6 Bkent
hifv_importdata       |  247             |  425             |  403              |  392
hifv_hanning          |  175             |  188             |  334              |  460
hifv_flagdata         |  272             |  323             |  374              |  452
hifv_vlasetjy         |   75             |  199             |  255              |  357
hifv_priorcals        |  254             |  281             |  539              |  494
hifv_testBPdcals      |   74             |   84             |   98              |  123
hifv_flagbaddef       |    0             |    1             |    0              |    0
hifv_checkflag        |   68             |   70             |   69              |   69
hifv_semiFinalBPdcals |   75             |  153             |  154              |  154
hifv_checkflag        |  189             |  254             |  250              |  253
hifv_solint           |   66             |   89             |  105              |  105
hifv_fluxboot2        |  104             |  181             |  185              |  175
hifv_finalcals        |  162             |  182             |  177              |  177
hifv_circfeedpolcal   |   31             |   33             |   32              |   32
hifv_flagcal          |    0             |    1             |    0              |    0
hifv_applycals        |  205             |  212             |  358              |  437
hifv_checkflag        | 1741             | 1840             | 2388              | 2930
hifv_statwt           |  645             |  710             |  812              |  500
hifv_plotsummary      |  101             |  346             |  350              |  350
TOTAL (minutes)       | 4484             | 5573             | 6884              | 7460




K. Scott finished three runs on Apr. 8, 2020 using Brian's large dataset (350GB), CASA-6.0.0.23-pipeline-validation-17, and Pipeline master-v0.1-18-g2de4d78-dirty, separated by about an hour each.  Each job requested 1 node with 8 cores and 96 GB, essentially a NUMA node.  system.resources.memory was unset and _cf.validate_parameters = False. (Times are in minutes)

Task                  | kent3a-no-c6-l-70 | kent3b-no-c6-l-70 | kent3c-no-c6-l-70
hifv_importdata       |  410              |  403              |  407
hifv_hanning          |  364              |  334              |  359
hifv_flagdata         |  381              |  374              |  386
hifv_vlasetjy         |  263              |  255              |  256
hifv_priorcals        |  513              |  539              |  511
hifv_testBPdcals      |   97              |   98              |   98
hifv_flagbaddef       |    0              |    0              |    0
hifv_checkflag        |   68              |   69              |   68
hifv_semiFinalBPdcals |  153              |  154              |  152
hifv_checkflag        |  251              |  250              |  250
hifv_solint           |  105              |  105              |  106
hifv_fluxboot2        |  174              |  185              |  174
hifv_finalcals        |  178              |  177              |  180
hifv_circfeedpolcal   |   31              |   32              |   31
hifv_flagcal          |    0              |    0              |    0
hifv_applycals        |  353              |  358              |  366
hifv_checkflag        | 2501              | 2388              | 2302
hifv_statwt           |  832              |  812              |  806
hifv_plotsummary      |  348              |  350              |  345
TOTAL (minutes)       | 7023              | 6884              | 6799



Comments

  1. After 14 days of running the setjy task, and using Felipe's profiling metrics, I canceled the job.

  2. We need to revisit this test.  We need to run the VLASS calibration pipeline in serial mode (no parallelization) against this SDM+BDF: /lustre/aoc/sciops/bkent/pipetest/llama3/rawdata/VLASS1.2.sb36491855.eb36574404.58585.53016267361

    Tests should be done using casa-pipeline-5.6.3-9 and casa 6.0.1-8 with Felipe's profiling framework including both I/O stats and memory footprint.

  3. For the record, I have been using CASA 6.0.0.23-pipeline-validation-17 for my CASA-6 testing because that is what I think Brian Kent has been using.  For CASA-5, some of my older tests were with the standard CASA-5.6.1-8 but all of the tests I started today should be with casa-pipeline-prerelease-5.6.3-9.el7


  4. K Scott Rowe, thank you for filling in the tables - I've been following the progress.  Maybe something has changed.  My CASA 6 execution that finished in early March on nmpost036 took over 7000 minutes:

    file:///lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2/pipeline-20200227T150749/html/t1-1.html

  5. Brian Kent, I'm not sure why your CASA-6 run on nmpost036 took so long.  Do you know if you had the entire machine to yourself or were you sharing it with other jobs?  I can say, I think, that CASA-6 runs plotms() a lot more than CASA-5; enough that it might be a significant slowdown.

  6. Brian Kent, I see that you used a different pipeline script for your run than what I have been using.  The biggest difference between the two is that after hifv_applycals, you call hifv_checkflag and it runs for almost 3,000 minutes.  There are other differences but this stood out to me.  I will attach the pipeline script I have been using for all these tests.


  7. K Scott Rowe, right - that is part of the standard VLASS recipe, and was included in the CASA 5 run as well:

    file:///lustre/aoc/projects/vlass/analysts/SECALTesting_5.6/SETesting_5.6.3-6/llama/pipeline-20191009T153129/html/t1-4.html

    We need to have that task, hifv_checkflag(checkflagmode='target-vlass'), as part of the test runs too to compare apples to apples.


  8. K Scott Rowe, looking at the bar graphs posted above - are these "large dataset" runs the VLASS SDM that I've been using?  If so, that's not the VLASS recipe.


  9. I would assume the proper script is:   <pipedir>/recipes/hifvcalvlass.py.

    Brian, if that's not the correct one, could you point us at the right one?

  10. Yes, that's the correct one to use (or same commands in an appropriate casa_pipescript.py file).

  11. I'd like to be comparing the same things; we can run CASA5 and CASA6 tests and compare them, but we can't compare them to your runs if we're not running the same things.

    Brian are you using:   /lustre/aoc/projects/vlass/analysts/SECALTesting_5.6/SETesting_5.6.3-6/llama/casa_pipescript.py

    That's somewhat different from the stock hifvcalvlass.py script.  It's missing a few things pipeline-task-wise (like the cal imaging) and has different parameter values.  I doubt the latter matters much.

    For verifying CASA5 vs 6 I'd prefer to use stock pipelines but that would mean we need a run from you that uses the stock one.

  12. Thanks James Robnett [X], yes, that's what I am using.  I am looking at hifvcalvlass.py, and it does not (and shouldn't) have cal imaging.  You are correct - parameter values don't matter much.

    The tasks in the weblog posted above (or file:///lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2/pipeline-20200227T150749/html/t1-1.html for CASA 6) are what we need to be comparing.

    hifv_syspower doesn't matter so much.

  13. So, to be clear, I am going to start testing CASA-5 vs CASA-6 with and without profiling both here and in CV using a copy of this pipeline script

    /lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60_2/casa_pipescript.py

    and a copy of this dataset (which I have been using all along)

    /lustre/aoc/sciops/bkent/pipetest/llama3/rawdata/VLASS1.2.sb36491855.eb36574404.58585.53016267361

    Please let me know if any of this looks incorrect.



  14. Yes, K Scott Rowe, that is all correct.

  15. Brian, just to double check, you are using the casa-pipeline-validation-17 version of CASA-6, correct?  That is the version I am using.

    thanks


  16. Yes, absolutely, that's what I am using.

  17. I found this in my casa.out files for CASA-6 runs.  I don't know if it is interesting or not.  There are the same number of these lines in casa.out as there are calls to plotms() in the casa*.log


    NOTICE: Exporting to images in screen resolution is currently not working.  Switching to high resolution (which is slower, but works).

  18. Hi K Scott Rowe, yeah, those have always been there in the terminal window.

  19. So, if you see those messages on your console while running interactively, that means it is not something related to Xvfb and also not something I can "fix" with resolution options.  I wonder how much "slower" it is.

  20. You are correct - not related to Xvfb.  Unsure of the speed difference.  We would have to check with the plotms folks.

  21. I still have one job running in NM (CASA-6, E5-2670) but I think it is clear so far that plotms() is a large contributor to the longer run times of CASA-6.  Looking through the logs (grep "Begin Task" casa-*.log|wc) I see that CASA-5 begins about 260 plotms() tasks while CASA-6 begins about 895.

  22. K Scott Rowe, I really appreciate the detailed analysis and comparison on the JIRA ticket.  Interesting, in that the number of plots in the CASA 5 and CASA 6 weblogs should be similar, if not exactly the same.

  23. James Robnett [X], K Scott Rowe, the VLASS execution that I started yesterday is already having issues - hanning smoothing took over 7 hours.  I'm on nmpost035.  Same execution scenario that we have been using - validation-17 for CASA 6.

    file:///lustre/aoc/sciops/bkent/pipetest/llama3/workingtest60march24/pipeline-20200324T200128/html/t2-4m.html?sidebar=sidebar_stage2&ms=all&subpage=casapy.log

    That's over a factor of two difference.  Sorry to run around this again, but is it the hardware?  Are other things running on that node?

    None of K. Scott's graphs documented above show that kind of time for hanning smoothing.  I single out hanning smoothing because it is the second task, and because there is no plotting in that task.

    If you can help me with a batch job, then that would be great.

  24. Hm, one possibility is that CASA 6 is referencing your /users/bkent home dir in a way that CASA 5 doesn't.   If your home dir is in CV that could be slow.  It's a long shot but I think this is looking more like an issue with how you're running than CASA5 vs 6 or NM vs CV.


  25. It shouldn't be hardware.  nmpost011 through nmpost050 are identical.  There is another user on nmpost035 but memory and CPUs should be isolated between you two.  Infiniband traffic on nmpost035 has been around 10K packets/second which is high but still below the maximum that node can handle.  Your casa_pipescript.py file looks exactly like mine except for the path to the data.

    It might be a good test to reserve a whole node (nodescheduler -w 14 1) and try your test on that node.


  26. Good points.  My home dirs between cville and the aoc should be separate - they have been for many, many years.

    I will get a whole node and start again right away.

  27. Brian, I see that you have the following in your ~/.casarc

      system.resources.memory: 16392

    James, could this be the problem?



  28. Yes, K Scott Rowe - Ironically, the reason it is set to 16392 is from a conversation with Jeff and James from March 30, 2017:

    """

    A technical note, plotMS will by default use as much memory as it can in producing the images.  I _believe_ that this can be controlled by setting the system.resources.memory parameter in .casarc.  I think it uses units of kB but could be wrong.

    system.resources.memory: 16392
    I think for production we really want to get this into a System Variable as well as in the .casa file, I don't think I can do this uniformly on a short timescale, but we can at least get it into plotms which is where I think we need it.

    """



  29. Oh dear, well yeah, let's try without that setting.  Nothing like a 3-year-old offhand comment to bite you in random spots.

  30. You got it - we're off and running with that setting removed, on a whole node (nmpost038).

  31. Excellent.  I expect it is removing the system.resources.memory setting, and not having the whole node, that will make this work.  But trying both at the same time is fine.

    As for running your jobs in batch instead of interactive, I will mail you a shell script that I think you can use.  But, to be clear, please let the current job continue so we can solve the performance problem.



  32. I will let the current job continue and we should know about Hanning smoothing without taking too much time.  Will report back...

  33. Brian, it looks like your current job is running at a much better pace.  Would you agree?

  34. Yes, but I had the same timing a few weeks ago when hanning smoothing ran in three hours (and the  system.resources.memory: 16392 was still in there).  It was Tuesday's run that again jumped back up to 7 hours.  So maybe it is having a whole node checked out that is the great equalizer?

    I'm looking at your graphs above - how did vlasetjy do?  In CASA 5, I get 1 hr, 22 min, but in CASA 6, I get 3 hr, 16 min.
    Do you have weblog links?

    As far as the plotting goes (260 vs. 895 that you mentioned above), I looked into that too.  All the correct plots are made in CASA 5 and 6; however, in CASA 5, many of the "per antenna" plots (x 27) are not reported in the weblog.  I've asked some other pipeline folks why they are plotted but not reported.  Something in the infrastructure behavior changed, but I don't think that seems to be causing anything to be slower...

    I'm going to try your batch script too...



  35. I think we may have to do several more tests before we can be sure of the solution, be it system.resources.memory, whole-node, or something else.

    My times for vlasetjy are similar to yours: CASA-5 1 hr, 15 min; CASA-6 3 hr, 19 min.  Sadly I haven't been keeping the weblogs from  my runs.

    As for plotms, I would like to clear up some things.  The plots I have added to this page, labeled "Runtimes for pipeline and CASA tasks", are not from weblog information nor from casa-2*.log files.  We have a profiling framework that watches and collects data as CASA runs.  So in my plots above, when you see all the time in light-blue for plotms, that is not necessarily related to what plots are reported in the weblog but instead is the amount of time CASA spent in various casa-tasks per pipeline-task.

    On the other hand, my number of calls to plotms() (260 vs. 895) is from the casa-2*.log files, and those numbers were from my previous pipeline script.  Since I have been using the same pipeline script you are using, I am seeing 89 plotms() calls for CASA-5 and 726 plotms() calls for CASA-6.

      grep "Begin Task" casa-*.log | grep plotms | wc

    So, not only has the number of times CASA-6 calls plotms() increased vs CASA-5, but the amount of time it spends in plotms() has also increased.


    1. With "system.resources.memory: 16392" in my ~/.casarc, my CASA-6 job on an E5-2640v3 is running slower but only about 10% slower. It took 149 minutes to run hifv_hanning while my previous run took 138 min. Other pipeline tasks are about 10% slower as well. This is no where near the 7 hours for hifv_hanning Brian saw on his run.

      So I still can't explain Brian's slow runs. He seems to see much wider variability in his runs than I do.

    2. My CASA-6 job on an E5-2640v3 with "system.resources.memory: 16392" in my ~/.casarc finished.  It took 4,146 minutes, which is only about 4% slower than my baseline run without "system.resources.memory: 16392" in my ~/.casarc.

  36. Okay, that last point is key then - about the amount of time it spends in plotms - that is an important point.  For some reason ALMA pipeline executions have not seen that to my knowledge.

  37. The suspicious plotms runtime increase has been seen in some ALMA tests. See for example this comment with CASA 5 vs. CASA 6 plots: https://open-jira.nrao.edu/browse/CAS-12891?focusedCommentId=150011&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-150011

    The feeling at the time was:

    "Maybe the only issue to keep an eye on is the suspicious increase in plotms runtime in hif_applycal (note that it is not as big as it looks at first, because the y axis scales are unfortunately not the same - these plots are not adjusted to this type of comparison).  In the other stages plotms also seems to be slower in CASA 6.  But in any case this doesn't look like a first order concern for the moment."


  38. Thanks for the update, K Scott Rowe.  It looks like my results are similar (in that a whole node without mem limits runs a bit faster): without "system.resources.memory: 16392" I get 4830 minutes; with "system.resources.memory: 16392" I get 5010 minutes.  Both with E5-2670.

    Joseph Masters' execution with CASA 6.1 took 5520 minutes (without system.resources.memory: 16392).

    Still a range on all our pipeline executions.  Calibration results look okay though.

  39. We noticed this morning that Brian has been using a slightly different version of the pipeline (master-v0.1-18-g2de4d78-dirty) vs what I have been using (master-v0.1-145-ge322387-dirty).  So I am going to run three jobs with this pipeline (copied from /lustre/aoc/sciops/bkent/pipe/casapipeline.git) staggered by about an hour.

    This should 1) be even more similar to the jobs Brian has been running and 2) show whether my runs have any variance in runtimes and, if so, how much.


  40. The purpose of this page was to diagnose why Brian Kent and I were seeing different runtimes between CV and NM and to some extent between CASA-5 and CASA-6.  For the most part, I think this has been explained. We will look into the runtime differences between CASA-5 and CASA-6 in another page.

    Various Pipelines in CASA-5 and CASA-6