Track John Tobin's reduced VLASS imaging case. Compare against various environmental effects.
The first step should be to simply reproduce the odd user vs. system vs. idle CPU accounting distribution. Once that's done, we should consider stripping the job down to just its significant tclean() call and see whether that reproduces the problem.
If it does, we'll run that call against current RHEL7, RHEL7 with a newer kernel, an October-era RHEL6.9 image, and an NVMe device. This should help isolate whether it's a filesystem or OS issue. We'll decide where to head from there.
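To make the user/system/idle comparison repeatable across runs, a small sampler could be logged alongside each test. This is a minimal sketch (not part of the original tests) that assumes the Linux /proc/stat layout:

```python
# Sample the aggregate user/system/idle CPU split from /proc/stat so the
# odd accounting distribution can be compared across hosts and filesystems.
# Assumes a Linux /proc filesystem.
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    user, nice, system, idle = (int(v) for v in fields[1:5])
    return user + nice, system, idle

def cpu_split(interval=1.0):
    """Return fractional (user, system, idle) over a sampling interval."""
    u0, s0, i0 = cpu_times()
    time.sleep(interval)
    u1, s1, i1 = cpu_times()
    du, ds, di = u1 - u0, s1 - s0, i1 - i0
    total = du + ds + di or 1  # guard against a zero-tick interval
    return du / total, ds / total, di / total

if __name__ == "__main__":
    u, s, i = cpu_split()
    print(f"user={u:.2%} system={s:.2%} idle={i:.2%}")
```

Logging this once a minute during a tclean() run would show exactly when the distribution flips from mostly user time to mostly idle/system time.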
A copy of Tobin's job was manually started on nmpost038 (E5-2670) using Lustre at 10:11 while the entire node was reserved for an interactive user. The job was started with the -n 8 option to mpicasa, using casa-pipeline-prerelease-5.6.3-9.el7. By 13:20, it had finished its first tclean() call and still looked good as far as CPU usage. It started its second tclean() call at 13:30, and shortly thereafter the CPUs began spending much more time idle than in user time, and system time increased significantly.
A copy of Tobin's job was manually started on nmpost040 (E5-2670) using NVMe at 10:12 while the entire node was reserved for an interactive user. The job was started with the -n 8 option to mpicasa, using casa-pipeline-prerelease-5.6.3-9.el7. By 12:54, it had finished its first tclean() call and still looked good as far as CPU usage. It started its second tclean() call at 12:55 and has continued to use all seven cores very effectively, with only a slight increase in system time.
Well, nmpost040 (rhel7, NVMe) is still running on all seven CPUs as expected, while nmpost038 (rhel7, lustre) is using about 1.5 CPUs, half of which is system time. nmpost038 seems to be spending more time in raw_qspin_lock than in native_queued_spin_lock_slowpath, but it is clearly not performing well.
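To quantify that symptom per mpicasa rank, one could compare each process's accumulated user vs. system CPU seconds. This is a hypothetical helper, not part of the original tests, assuming the Linux /proc/&lt;pid&gt;/stat format:

```python
# Read accumulated user and system CPU seconds for a pid from
# /proc/<pid>/stat (fields 14 and 15). Assumes a Linux /proc filesystem.
import os

def proc_cpu_seconds(pid):
    """Return (user_seconds, system_seconds) for a pid."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Field 2 (comm) may contain spaces, so split after its closing paren.
    fields = data.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 overall
    return utime / hz, stime / hz

if __name__ == "__main__":
    import sys
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    u, s = proc_cpu_seconds(pid)
    print(f"pid {pid}: user={u:.1f}s system={s:.1f}s")
```

Sampling this for each mpicasa rank during the second tclean() call would show whether the system time is spread across all ranks or concentrated in one.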
Sadly, my jobs on nmpost051 (rhel6, lustre) and cvpost003 (rhel7, lustre) segfaulted before they started their second tclean() call. I expect this is because I had to use different versions of CASA on these hosts. nmpost040 and nmpost038 are using casa-pipeline-prerelease-5.6.3-9.el7, but that version isn't available for rhel6 in /home/casa, nor is it available in CV for rhel7. So for nmpost051 I used the older casa-pipeline-prerelease-5.6.3-4.el6, and for cvpost003 I used the older casa-pipeline-prerelease-5.6.3-2.el7.
Anyway, I think the performance plots of nmpost040 vs nmpost038 are enough to point to Lustre as the bottleneck.
Running casa-prerelease-5.7.0-49 on nmpost051 (rhel6, lustre), it is now in the poor-performance state, using only about two cores instead of seven. According to the casa log it is in the second tclean() call, so rhel6 is not the simple solution. It may still be part of a more complicated solution in which all the VLASS nodes run rhel6, which might reduce the system load on aocmds and in turn improve performance.
Runtimes with jtobin's data and -n 8 using casa-prerelease-5.6.3-9 (in minutes)
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost038 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
8339* | 12700* | 5850* | 6521* | 2989*, 3044* | 2867* | 2182* | 2418* |
Runtimes with jtobin's data and -n 9 using casa-prerelease-5.6.3-9 (in minutes)
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost038 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
6448* | 7146* | 5156* | 5756* | 2746* | 2882* | 2191* | 2284* |
Runtimes with jtobin's data and -n 9 using casa-prerelease-5.6.3-9 and Lustre-2.10.x (in minutes). I re-ran the tests in NM after Lustre was upgraded to 2.10.8. The CV and localdisk-type test results were just carried over.
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost041 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
5955* | 6413* | 5156* | 5756* | 2746* | 2882* | 2191* | 2284* |
"*" Means it produced pbcor errors
5 Comments
James Robnett
Some evidence that reproducing John's script demonstrates the problem. Let's reproduce with tests tomorrow using the stripped out tclean() call.
1) Reproduce current test
2) Reproduce with RHEL6.9
3) Reproduce against NVME
tclean(vis=['/lustre/aoc/sciops/krowe/jtobin/TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split.ms'],selectdata=True,field="11559,11560,11561,11562,11563,11564,11565,11566,11567,11568,11569,11570,11571,11572,11573,11574,11575,11576,11577,11578,11579,11580,11581,11582,11583,11584,11585,11586,11587,11588,11589,11590,11703,11704,11705,11706,11707,11708,11709,11710,11711,11712,11713,11714,11715,11716,11717,11718,11719,11720,11721,11722,11723,11724,11725,11726,11727,11728,11729,11730,11731,11732,11733,11917,11918,11919,11920,11921,11922,11923,11924,11925,11926,11927,11928,11929,11930,11931,11932,11933,11934,11935,11936,11937,11938,11939,11940,11941,11942,11943,11944,11945,11946,11947,12095,12096,12097,12098,12099,12100,12101,12102,12103,12104,12105,12106,12107,12108,12109,12110,12111,12112,12113,12114,12115,12116,12117,12118,12119,12120,12121,12122,12123,12124,12125,12236,12237,12238,12239,12240,12241,12242,12243,12244,12245,12246,12247,12248,12249,12250,12251,12252,12253,12254,12255,12256,12257,12258,12259,12260,12261,12262,12263,12264,12265,12266,12267,12448,12449,12450,12451,12452,12453,12454,12455,12456,12457,12458,12459,12460,12461,12462,12463,12464,12465,12466,12467,12468,12469,12470,12471,12472,12473,12474,12475,12476,12477,12478,12588,12589,12590,12591,12592,12593,12594,12595,12596,12597,12598,12599,12600,12601,12602,12603,12604,12605,12606,12607,12608,12609,12610,12611,12612,12613,12614,12615,12616,12617,12618,12619,12798,12799,12800,12801,12802,12803,12804,12805,12806,12807,12808,12809,12810,12811,12812,12813,12814,12815,12816,12817,12818,12819,12820,12821,12822,12823,12824,12825,12826,12827,12828,12937,12938,12939,12940,12941,12942,12943,12944,12945,12946,12947,12948,12949,12950,12951,12952,12953,12954,12955,12956,12957,12958,12959,12960,12961,12962,12963,12964,12965,12966,12967,13145,13146,13147,13148,13149,13150,13151,13152,13153,13154,13155,13156,13157,13158,13159,13160,13161,13162,13163,13164,13165,13166,13167,13168,13169,13170,13171,13172,13173,13174,13175",spw=['2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17'],timerange="",uvrange="<12km",antenna=['0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26'],scan=['189,192,195,199,202,205,208,211,214,217'],observation="",intent="OBSERVE_TARGET#UNSPECIFIED",datacolumn="data",imagename="Test_1.2.se.T26t10.J123156+643000.10.2048.v1.I.iter1",imsize=[12150, 12150],cell="0.6arcsec",phasecenter="J2000 12:31:56.185 +64.30.0.0000",stokes="I",projection="SIN",startmodel="",specmode="mfs",reffreq="3.0GHz",nchan=-1,start="",width="",outframe="LSRK",veltype="radio",restfreq=[],interpolation="linear",perchanweightdensity=False,gridder="awproject",facets=1,psfphasecenter="",chanchunks=-1,wprojplanes=32,vptable="",mosweight=False,aterm=True,psterm=False,wbawp=True,conjbeams=True,cfcache="",usepointing=True,computepastep=360.0,rotatepastep=5.0,pointingoffsetsigdev=0.0,pblimit=0.2,normtype="flatnoise",deconvolver="mtmfs",scales=[0, 5, 12],nterms=2,smallscalebias=0.0,restoration=True,restoringbeam="common",pbcor=False,outlierfile="",weighting="briggs",robust=1.5,noise="1.0Jy",npixels=0,uvtaper=[],niter=20000,gain=0.1,threshold="0.0mJy",nsigma=2.0,cycleniter=-1,cyclefactor=3.0,minpsffraction=0.05,maxpsffraction=0.8,interactive=0,usemask="auto-multithresh",mask="",pbmask=0.0,sidelobethreshold=3.0,noisethreshold=5.0,lownoisethreshold=1.5,negativethreshold=0.0,smoothfactor=1.0,minbeamfrac=0.1,cutthreshold=0.01,growiterations=75,dogrowprune=True,minpercentchange=-1.0,verbose=False,fastnoise=True,restart=True,savemodel="none",calcres=False,calcpsf=False,parallel=True)
We may want to add a 4th and 5th:
4) Reproduce vs latest kernel
5) Reproduce with older CASA version (probably 5.4.2-8)
We also probably need to consider testing in CV just to confirm that it's an NM-only problem. Depending on the result of that, we may consider AWS.
K Scott Rowe
The nmpost051-casa5-rhel6 job finished after 2,601 minutes (43.35 hours) with no SEVERE errors, using casa-prerelease-5.7.0-49.
The nmpost040-casa5-nvme-rhel7 job finished after 2,989 minutes (49 hours and 49 minutes) with several SEVERE errors using casa-pipeline-prerelease-5.6.3-9.el7.
It looks like casa-prerelease-5.7.0-51 is available for both RHEL6 and RHEL7 in both NM and CV, so that might be a version to use for consistency.
I would still like to get the runtime down, but the performance problem does seem to present itself after only a few hours, so reducing the total runtime is less critical.
K Scott Rowe
I started a tclean() run on both nmpost040 (rhel7, nvme, casa5) and nmpost051 (rhel6, lustre, casa5) using casa-prerelease-5.7.0-51. So far, they are well behaved, using upwards of seven cores at close to 100% user time, which is what I expect since I started them with mpicasa -n 8. Update: both jobs failed with tclean() exceptions, so this was not a useful test.
James Robnett
We should try five tests to help characterize the impact of disk I/O in general, Lustre more specifically, and NM Lustre in particular. The five tests should use casa-prerelease-5.6.3-9 to match John Tobin's original tests. Use the same CPU class for all tests if possible.
K Scott Rowe
I have started a tclean() run on nmpost057 with casa-prerelease-5.6.3-9, calling mstransform and the two tclean() tasks that the pipeline normally calls. I also changed the wprojplanes of the second tclean() call from 32 to 4, in the hope that this will make things run faster while still showing the performance degradation.
Drat. The job still failed after 436 minutes with the following error:
2020-02-05 00:07:36 SEVERE tclean::task_tclean::@nmpost057:MPIClient Exception from task_tclean : Error initializing the Minor Cycle for Test_1.2.se.T26t10.J123156+643000.10.2048.v1.I.iter1 : Multi-term SumWt does not exist. Please create PSFs or Residuals.