...
The reports appear to be (in order of confidence and severity): VLASS calibration pipeline performance with CASA-5 vs CASA-6, VLASS imaging slowdown with current CASA-5 vs CASA-5 from months ago, particularly during the major cycle, ALMA cube imaging in NM vs CV, and apocryphal miscellaneous reports like DA plotms calls. Each will be tracked in a sub page.
...
Lustre: Tests vs NVMe
OS version: CASA-5.6 on RHEL6 vs RHEL7
Kernel: Current vs New.
...
This shows the heavy cost of the spin lock vs the small amount of CASA processing during that sample window.
...
We're running a test, to be reported here, with 5 clones all sharing a cache and a 6th identical sibling with its own cfcache. We'll compare run times between the coupled and isolated jobs. If this is the actual problem then the solution is twofold:
1. Modify CASA tclean() to not open the cfcache in write mode. Presumably it opens for write and, on error, creates the cache, otherwise it continues. It should instead test for existence, create and close the cache if it doesn't exist, and then open it for read.
2. In the meantime, all SE cont jobs should use their own copy of the cfcache. It would be good to have the executing scripts simply copy it in before starting CASA and delete it after CASA exits, thus avoiding the contention without keeping many copies of an otherwise large directory (33 GB or so). A sketch of such a wrapper script is below.
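As a sketch of step 2 only: the wrapper below copies the shared cfcache to node-local disk before launching CASA and removes it afterwards. The paths, the helper name, and the way CASA is invoked are assumptions for illustration, not the production scripts.

```python
import shutil
import subprocess
import tempfile

SHARED_CFCACHE = "/lustre/vlass/cfcache"  # hypothetical shared cfcache location on Lustre


def run_with_local_cfcache(casa_script):
    """Copy the shared cfcache to local disk, run CASA, then clean up."""
    workdir = tempfile.mkdtemp(prefix="cfcache_", dir="/tmp")
    local_cfcache = workdir + "/cfcache"
    try:
        # One copy-in per job; tclean() inside casa_script should point its
        # cfcache parameter at the path passed in as a command-line argument.
        shutil.copytree(SHARED_CFCACHE, local_cfcache)
        subprocess.check_call(["casa", "--nogui", "-c", casa_script, local_cfcache])
    finally:
        # Delete the local copy so ~33 GB doesn't accumulate on every node.
        shutil.rmtree(workdir, ignore_errors=True)
```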
Step 2 successfully reduced the MDS load. Below is a plot of the MDS load for the week of February 21st to 28th. It shows the initial load reduction on Wed. (Feb. 26, 2020) when all VLASS jobs were converted to local cfcache. Thu. (Feb. 27, 2020) shows a large spike while a pointed observation was tested, and Fri. (Feb. 28, 2020) shows steady state after all awproject jobs were either converted to local cfcache or stopped.
We're currently examining the access patterns while imaging SPW 7 from TSKY0001.sb36463619.eb36473386.58555.315533263885. The tests are being performed against a local cache, a cache on Lustre, and a cache on Lustre with usepointing = False.
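For reference, a minimal sketch of the three tclean() configurations is below. It is meant to be run inside a CASA session (where tclean() is defined); all imaging parameters other than the cache-related ones are omitted, and both cfcache paths are hypothetical stand-ins.

```python
# Sketch only: cache-related parameters shown, everything else omitted.
vis = 'TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms'

# 1) cfcache on node-local disk
tclean(vis=vis, imagename='spw7_localcache', gridder='awproject',
       cfcache='/tmp/cfcache', usepointing=True)

# 2) cfcache on Lustre
tclean(vis=vis, imagename='spw7_lustrecache', gridder='awproject',
       cfcache='/lustre/vlass/cfcache', usepointing=True)

# 3) cfcache on Lustre with pointing updates disabled
tclean(vis=vis, imagename='spw7_lustre_nopointing', gridder='awproject',
       cfcache='/lustre/vlass/cfcache', usepointing=False)
```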
1) I have a theory for the strange cadence pattern; if you buy me a beer and supply a whiteboard plus a few markers, I'll explain it.
...
Disproven Ideas
RHEL7/Lustre-2.10 Idea
This idea proposes that the increased system CPU usage on the MDS is caused by RHEL7, or more specifically Lustre-2.10. In the 1-year CPU graph of aocmds you can clearly see a large increase in system time starting around mid-October 2019. There is a very similar increase in user/system CPU time on nmpost061 through nmpost070.
These are the 10 newest nodes, installed for VLASS in September 2019, and they started off as RHEL7/Lustre-2.10.
To test this, we rebooted nmpost071 through nmpost090 back into RHEL6/Lustre-2.5.5 so that vlasstest jobs would run on them. I started 10 CASA jobs, each on a node in this range, about 20 minutes apart. The system CPU usage on aocmds has not changed significantly because of these jobs, which is inconclusive. The proper test of this idea may just be upgrading the Lustre servers to 2.10.
After upgrading the Lustre system in NM to 2.10.8, the effects on the MDS appear unchanged. So it seems this wasn't caused by just a client-side difference.
CFCache open-for-write lock contention
The local-disk CFCache resolved the contention on the MDS, which allowed us to examine individual runs more clearly; previously it was difficult to ascertain the impact of an individual run on the MDS. A large multi-node run using a 5.7 prerelease, which opened the CFCache in write mode, and a version Sanjay provided from the same trunk, which opened it for read, produced roughly the same contention on the MDS. This implies the issue is with data access patterns, possibly unique to VLASS, possibly unique to the gridding code.
Rapidly changing pointing in VLASS triggers multiple walks through CFcache
It does appear to be the case that VLASS data triggers more frequent loads/unloads of each SPW's worth of CFs as the pointing changes, but that occurs on timescales of seconds, while the file-level reads and opens are happening on timescales of tens of microseconds. In addition, a large pointed observation exhibits the same behavior as VLASS data. This leaves access patterns within the gridding code itself as the remaining candidate, even though initial code inspection didn't expose a logical error.
TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms
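One way such timescales could be quantified (a sketch under assumptions, not necessarily how it was measured here) is to timestamp the system calls around a run, e.g. with strace -f -ttt -e trace=open,openat -o trace.log, and then pull out the cadence of opens that touch the cfcache:

```python
# Rough sketch: extract timestamps of open()/openat() calls whose line mentions
# the cfcache from a "strace -f -ttt" log and report the gaps between them.
import re
import sys

# -ttt prefixes each call with an epoch timestamp at microsecond resolution.
OPEN_RE = re.compile(r'(\d{10}\.\d{6})\s+open(?:at)?\(')


def cfcache_open_times(log_path, needle='cfcache'):
    """Return epoch timestamps of opens on files whose path contains `needle`."""
    times = []
    with open(log_path) as log:
        for line in log:
            m = OPEN_RE.search(line)
            if m and needle in line:
                times.append(float(m.group(1)))
    return times


if __name__ == '__main__':
    ts = cfcache_open_times(sys.argv[1])
    gaps = sorted(b - a for a, b in zip(ts, ts[1:]))
    if gaps:
        print('%d opens, median gap %.6f s, min gap %.6f s'
              % (len(ts), gaps[len(gaps) // 2], gaps[0]))
```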