...

We're running a test, to be reported here, with 5 clones all sharing a single cfcache and a 6th identical sibling with its own cfcache.  We'll compare run times between the coupled and isolated jobs.  If this is the actual problem, then the solution is twofold:

  1. Modify CASA tclean() so it does not open the cfcache in write mode.  Presumably it currently opens for write and, on error, creates the cache, otherwise it continues.  Instead it should test for existence, create and close the cache if it doesn't exist, and then open it for read.
  2. In the meantime, all SE cont jobs should use their own copy of the cfcache.  It would be good to have the executing scripts simply copy it to local disk before starting CASA and delete it after CASA exits, avoiding the contention without keeping many permanent copies of an otherwise large directory (roughly 33 GB); a sketch of such a wrapper is shown after this list.
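
As a rough illustration of step 2, the wrapper below copies the shared cfcache to node-local scratch, hands the job that private copy, and removes it once CASA exits.  This is only a sketch under assumptions: the SHARED_CFCACHE and SCRATCH paths, the LOCAL_CFCACHE environment handoff, and the casa command-line form are placeholders, not the actual VLASS pipeline scripts.

    import os
    import shutil
    import subprocess
    import tempfile

    SHARED_CFCACHE = "/lustre/path/to/shared.cf"   # hypothetical shared cfcache on Lustre
    SCRATCH = "/localscratch"                      # hypothetical node-local scratch area

    def run_with_local_cfcache(casa_script):
        """Copy the cfcache to local disk, run CASA against it, then clean up."""
        workdir = tempfile.mkdtemp(dir=SCRATCH)
        local_cfcache = os.path.join(workdir, os.path.basename(SHARED_CFCACHE))
        try:
            # One sequential copy per node (~33 GB) instead of every gridding
            # process hammering the shared cache on the Lustre MDS.
            shutil.copytree(SHARED_CFCACHE, local_cfcache)
            # Hand the private cfcache path to the imaging script via the
            # environment; how the script consumes it is site-specific.
            env = dict(os.environ, LOCAL_CFCACHE=local_cfcache)
            subprocess.run(["casa", "--nogui", "-c", casa_script], env=env, check=True)
        finally:
            # Delete the copy once CASA exits so scratch space is not leaked.
            shutil.rmtree(workdir, ignore_errors=True)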

Step 2 successfully reduced the MDS load.  Below is a plot of the MDS load for the week of February 21 to 28.  It shows the initial load reduction on the 26th, when all VLASS jobs were converted to a local cfcache; the 27th shows a large spike while a pointed observation was being tested; the 28th shows the steady state after all awproject jobs were either converted to a local cfcache or stopped.

[Figure: MDS load for the week of February 21 to 28]


We're currently examining the access patterns while imaging SPW 7 from TSKY0001.sb36463619.eb36473386.58555.315533263885.  The tests are being performed against a local cache, a cache on Lustre, and a cache on Lustre with dopointing = False.
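
For reference, the three configurations differ only in where the gridder finds its convolution functions and whether pointing corrections are applied.  The sketch below shows the shape of such a comparison; the image geometry and cfcache paths are placeholders, and the dopointing toggle is assumed here to map onto tclean's usepointing parameter in this CASA version, which should be verified against the build actually being tested.

    from casatasks import tclean   # CASA 6-style import; adjust for a CASA 5 build

    vis = "TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms"

    # Hypothetical cfcache locations for the three test cases.
    configs = {
        "local":             {"cfcache": "/localscratch/test.cf", "usepointing": True},
        "lustre":            {"cfcache": "/lustre/test.cf",       "usepointing": True},
        "lustre_nopointing": {"cfcache": "/lustre/test.cf",       "usepointing": False},
    }

    for label, cfg in configs.items():
        tclean(vis=vis,
               imagename="spw7_" + label,
               datacolumn="data",
               imsize=[4096, 4096],          # placeholder image geometry
               cell="0.6arcsec",
               gridder="awproject",          # the gridder that uses the cfcache
               cfcache=cfg["cfcache"],
               usepointing=cfg["usepointing"],
               wprojplanes=-1,
               niter=0)                      # gridding-only pass for access-pattern tests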


1) I have a theory for the strange cadence pattern; if you buy me a beer and supply a whiteboard plus a few markers, I'll explain it.

...

After upgrading the Lustre system in NM to 2.10.8, the effect on the MDS appears unchanged.  So it appears this wasn't caused by just a client-side difference.

CFCache open for write lock contention

The local-disk CFCache resolved the contention on the MDS, which allowed us to examine individual runs more clearly; previously it was difficult to ascertain the impact of an individual run on the MDS.  A large multi-node run using a 5.7 prerelease, which opened the CFCache in write mode, and a version Sanjay provided from the same trunk, which opened it for read, produced roughly the same contention on the MDS.  This implies the issue is with data access patterns, possibly unique to VLASS, possibly unique to the gridding code.

Rapidly changing pointing in VLASS triggers multiple walks through CFcache

It does appear to be the case that VLASS data triggers more frequent loads/unloads of each SPW's worth of CFs as the pointing changes, but that occurs on timescales of seconds, whereas the file-level reads and opens are happening on timescales of tens of microseconds.  In addition, a large pointed observation exhibits the same behavior as VLASS data.  This leaves access patterns within the gridding code itself as the remaining candidate, even though initial code inspection didn't expose a logical error.
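
The timescales quoted above come from watching file-level activity against the cfcache while the gridder runs.  A minimal sketch of that kind of measurement is shown below, assuming an strace log captured with something like strace -f -ttt -e trace=openat,read -p <pid>; the log filename and the cfcache directory name are placeholders.

    import re
    from statistics import median

    LOGFILE = "tclean_strace.log"      # hypothetical strace -f -ttt output
    CFCACHE_DIR = "test.cf"            # substring identifying cfcache files

    # With -ttt each syscall line carries a Unix timestamp in seconds.microseconds,
    # e.g.: 1614287123.123456 openat(AT_FDCWD, "/lustre/test.cf/...", O_RDONLY) = 15
    pattern = re.compile(r"(\d+\.\d+)\s+openat\(.*" + re.escape(CFCACHE_DIR))

    times = []
    with open(LOGFILE) as log:
        for line in log:
            m = pattern.search(line)
            if m:
                times.append(float(m.group(1)))

    gaps = [b - a for a, b in zip(times, times[1:])]
    if gaps:
        print("cfcache opens:", len(times))
        print("median gap between opens: %.1f microseconds" % (median(gaps) * 1e6))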

TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms