Track John Tobin's reduced VLASS imaging case. Compare its performance across various environmental factors (filesystem, OS, kernel, hardware).
First step should be to simply reproduce the odd user vs. system vs. idle CPU accounting distribution. Once that's done, we should consider stripping the job down to just its significant tclean() call and seeing whether that alone reproduces the behavior.
If it does, we'll run that call against current RHEL7, RHEL7 with a newer kernel, an October-era RHEL6.9 image, and an NVMe device. This should help isolate whether it's a filesystem or OS issue. We'll decide where to head from there.
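As a starting point for the stripped-down test, something like the following could be lifted out and run under mpicasa. This is a minimal sketch: every file name and imaging parameter below is a placeholder, not jtobin's actual values; the real call should be copied from the tclean() entry in the job's casa log.

```python
# tclean_test.py -- run under MPI the same way as the full job, e.g.:
#   mpicasa -n 8 <path-to-casa> --nogui -c tclean_test.py
# Inside a CASA session tclean() is already defined, so no imports are needed.
tclean(vis='placeholder.ms',       # placeholder: the MS from the pipeline run
       imagename='tclean_test',    # placeholder output image prefix
       imsize=[8192, 8192],        # assumed VLASS-scale image size
       cell='1.0arcsec',           # assumed cell size
       specmode='mfs',
       gridder='mosaic',           # assumed gridder
       niter=1000,                 # small but nonzero so cleaning actually runs
       parallel=True)              # distribute work across the mpicasa workers
```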
A copy of Tobin's job was manually started on nmpost038 (E5-2670) using Lustre at 10:11 while the entire node was reserved for an interactive user. The job was started with the -n 8 option to mpicasa and using casa-pipeline-prerelease-5.6.3-9.el7. By 13:20, it had finished its first tclean() call and CPU usage still looked good. It started its second tclean() call at 13:30, and shortly thereafter the CPUs began spending much more time idle than in user time, and system time increased significantly.
A copy of Tobin's job was manually started on nmpost040 (E5-2670) using NVMe at 10:12 while the entire node was reserved for an interactive user. The job was started with the -n 8 option to mpicasa and using casa-pipeline-prerelease-5.6.3-9.el7. By 12:54, it had finished its first tclean() call and CPU usage still looked good. It started its second tclean() call at 12:55 and has continued to use all seven cores very effectively, with only a slight increase in system time.
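To put numbers on the shift from "all seven cores busy" to "mostly idle plus system time", a small watcher like this can be left running on the node alongside the job (a minimal sketch, assuming psutil is installed):

```python
import time
import psutil

# Log the aggregate user/system/idle split once a minute so the state change
# around the second tclean() call shows up in the record, not just in top.
while True:
    t = psutil.cpu_times_percent(interval=60)  # blocks for the 60 s sample
    print('%s user=%5.1f%% system=%5.1f%% idle=%5.1f%%'
          % (time.strftime('%H:%M:%S'), t.user, t.system, t.idle))
```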
Well, nmpost040 (rhel7, NVMe) is still running on all seven CPUs as expected, while nmpost038 (rhel7, lustre) is using only about 1.5 CPUs, half of which is system time. nmpost038 seems to be spending more of that time in raw_qspin_lock than in native_queued_spin_lock_slowpath, but either way it is clearly not performing well.
Sadly, my jobs on nmpost051 (rhel6, lustre) and cvpost003 (rhel7, lustre) segfaulted before they started their second tclean() call. I expect this is because I had to use different versions of CASA on those hosts. nmpost040 and nmpost038 are running casa-pipeline-prerelease-5.6.3-9.el7, but that version isn't available for rhel6 in /home/casa, nor is it available in CV for rhel7. So on nmpost051 I used the older casa-pipeline-prerelease-5.6.3-4.el6, and on cvpost003 I used the older casa-pipeline-prerelease-5.6.3-2.el7.
Anyway, I think the performance plots of nmpost040 vs nmpost038 are enough to point to Lustre as the bottleneck.
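For a rough number to go with the plots, the -n 8 runtimes in the tables below give roughly a 4x penalty for Lustre on the same CPU model:

```python
# Same CPU model (E5-2670), -n 8 runtimes from the table below, in minutes.
lustre_min = 12700.0  # nmpost038, NM Lustre
nvme_min = 2989.0     # nmpost040, NM NVMe, first run
print('Lustre/NVMe runtime ratio: %.1fx' % (lustre_min / nvme_min))  # ~4.2x
```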
Running casa-prerelease-5.7.0-49 on nmpost051 (rhel6, lustre), it is now in the poor performance state where it is using only about two cores instead of seven. According to the casa log it is in the second tclean() call, so rhel6 is not the simple solution. It may still be part of a more complicated solution: running rhel6 on all the VLASS nodes might reduce the system load on aocmds, which in turn might improve performance.
Runtimes with jtobin's data and -n 8 using casa-prerelease-5.6.3-9 (in minutes)
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost038 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
8339* | 12700* | 5850* | 6521* | 2989*, 3044* | 2867* | 2182* | 2418* |
Runtimes with jtobin's data and -n 9 using casa-prerelease-5.6.3-9 (in minutes)
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost038 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
6448* | 7146* | 5156* | 5756* | 2746* | 2882* | 2191* | 2284* |
Runtimes with jtobin's data and -n 9 using casa-prerelease-5.6.3-9 on Lustre 2.10.x (in minutes). I re-ran the tests in NM after Lustre was upgraded to 2.10.8; the CV and localdisk-type results were just carried over from the table above.
NM Lustre nmpost051 E5-2640v3 | NM Lustre nmpost041 E5-2670 | CV Lustre cvpost020 E5-2640v3 | CV Lustre cvpost003 E5-2670 | NM NVMe nmpost040 E5-2670 | NM localdisk nmpost040 E5-2670 | NM localdisk nmpost073 E5-2640v4 | NM /dev/shm nmpost073 E5-2640v4 |
---|---|---|---|---|---|---|---|
n/a | n/a | 5156* | 5756* | 2746* | 2882* | 2191* | 2284* |
"*" Means it produced pbcor errors