There are various reports of performance issues that appear to be unique to NM. It could be Lustre, could be cache issues, could be a kernel bug, could be most anything. We should use this page to track the broader issue and sub-pages to track specific test cases in case there are multiple causes and effects that behave differently as a function of environmental changes.
The reports appear to be (in order of confidence and severity):
1) VLASS calibration pipeline performance with CASA-5 vs CASA-6
2) VLASS imaging slowdown with CASA-5 now vs CASA-5 a few months ago, particularly during the major cycle
3) ALMA cube imaging in NM vs CV
4) Apocryphal miscellaneous reports, such as DA plotms calls
Each will be tracked in a sub-page.
I'd like to track various causes and tests at this upper level and the results for each use case within the above pages.
Lustre: Tests vs NVME
OS version: CASA-5.6 on RHEL6 vs RHEL7
Kernel: Current vs New.
Working Assumption
It appears the problem is related to Lustre, and likely to the work done in October to bring the cluster to RHEL7 with Lustre 2.10 clients while the servers remained at Lustre 2.5.
For 16-way parallel MPI jobs we'd typically expect a CPU core accounting mix of around 95% User, 5% System, 1%-5% IO Wait and 0% Idle time. When a job exhibits the pathology the primary evidence is a mix of 20-50% User, 10-20% System, 0-1% IO Wait and 30-70% Idle time. The high idle time is the most curious aspect.
The working assumption is that the kernel charges idle time for kernel threads that lock the core in a spin-lock state. That state is explicitly not User or IO Wait, so it must be charged as either System or Idle. It appears to be a mix of both, which may be a reflection of either increased loops through the spin-lock chain or a nuance of accounting.
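For reference, this mix can be sampled directly from /proc/stat; the sketch below is a minimal illustration of that calculation (it is not the monitoring tooling actually used here, and the 5-second window is arbitrary).
#!/usr/bin/env python3
# Minimal sketch: sample /proc/stat twice and report the User/System/IO Wait/Idle
# mix over the interval. Illustration only, not the actual monitoring used here.
import time

def read_cpu_counters():
    with open("/proc/stat") as f:
        # aggregate line: cpu user nice system idle iowait irq softirq steal ...
        return [int(x) for x in f.readline().split()[1:]]

def cpu_mix(interval=5.0):
    a = read_cpu_counters()
    time.sleep(interval)
    b = read_cpu_counters()
    d = [y - x for x, y in zip(a, b)]
    total = sum(d) or 1
    return {
        "User":    100.0 * (d[0] + d[1]) / total,   # user + nice
        "System":  100.0 * d[2] / total,
        "Idle":    100.0 * d[3] / total,
        "IO Wait": 100.0 * d[4] / total,
    }

if __name__ == "__main__":
    # Healthy 16-way job: User-dominated. Pathological job: 30-70% Idle.
    print(cpu_mix())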
Looking at a single process from an AWP 16-way job that exhibits the behavior, we see the following with perf top -p <PID> -d 5 (5-second update).
Samples: 3K of event 'cycles:ppp', Event count (approx.): 1210939771
Overhead Shared Object Symbol
35.93% [kernel] [k] native_queued_spin_lock_slowpath
3.44% [kernel] [k] _raw_qspin_lock
3.17% libc-2.17.so [.] _int_malloc
2.66% libcasa_casa.so.2898.67.56 [.] casacore::real_a2_re_match_2
2.27% [kernel] [k] _raw_spin_lock_irqsave
This shows the heavy cost of the spin lock versus the small amount of CASA processing during that sample window.
Repeating with perf top -p <PID> -g -d 5 (the -g triggers call-graph accounting) shows:
Samples: 20K of event 'cycles:ppp', Event count (approx.): 6201121238
Children Self Shared Object Symbol
+ 68.93% 0.01% [kernel] [k] system_call_fastpath
+ 56.56% 0.00% [unknown] [k] 0x67616e7461686273
+ 55.58% 0.06% [kernel] [k] mdc_enqueue_base
+ 55.03% 0.14% [kernel] [k] lmv_intent_lock
+ 54.84% 0.03% [kernel] [k] mdc_intent_lock
+ 54.64% 0.00% [kernel] [k] sys_open
+ 54.62% 0.03% [kernel] [k] do_sys_open
+ 54.45% 0.00% [kernel] [k] do_filp_open
+ 54.44% 0.01% [kernel] [k] path_openat
+ 53.92% 0.04% [kernel] [k] do_last
+ 53.81% 0.01% [kernel] [k] vfs_open
+ 53.79% 0.00% [kernel] [k] do_dentry_open
+ 53.75% 0.03% [kernel] [k] ll_file_open
+ 53.56% 0.02% [kernel] [k] ll_intent_file_open
This shows that for system_call_fastpath the majority of the time is in ll_intent_file_open (work from the bottom to the top of the call graph; each entry's Children percentage includes itself plus everything it calls, so nearly all of the time charged to sys_open ends up in the Lustre intent-open path). All of those ll_* and mdc_* calls are part of the Lustre kernel modules. Time spent processing these kernel threads is charged as system time; threads that lock the core waiting for a response, like ll_file_open, likely get charged as system time rather than idle time, while threads that lock the core because they're inherently intended to wait, like spin_lock, are likely charged as idle time. If true, that explains the increased system time in addition to the idle time. It would be better if the kernel had a 5th state to account for enforced system waits rather than mixing them between system time and idle time. That's what you get with a free OS/kernel I guess.
Lastly, looking at the Ganglia plots for the metadata server (MDS) for the last year, one can see a sharp increase in system time that appears to begin around October 20th, roughly consistent with the cluster upgrade.
An up-to-date plot can be found here:
Next steps
We're testing the trimmed-down VLASS case against an NVMe device to demonstrate that the pathology disappears when not using Lustre. We're also testing against an RHEL6.9 client, though that may not be conclusive since the MDS is still unnaturally loaded.
If all of this is correct, the two solutions would be to:
1) Upgrade the Lustre servers to 2.10 as soon as possible
2) Downgrade the VLASS clients to RHEL6.9 and Lustre 2.5 with the assumption that they represent most of the load on Lustre and are the primary trigger for the MDS issues.
I'm suspicious this is the root cause of the getcwd() error, which appears to be a race condition in the kernel when removing stale dentries from the cache. The delay triggered by the loaded MDS may be sufficient to trigger it. That's a bit inconsistent with the behavior also manifesting at the JAO, so possibly not.
Current Theory
In addition to the MDS CPU 1-year load below, we can see a finer-grained pattern over the course of a week for the MDS CPU, IB bytes in, and IB packets in:
That fairly clearly shows that the load on the MDS is coupled to client metadata requests (open, close, stat, etc.) in the form of very many small packets, around 500K packets per second. There appears to be a regular periodicity of around 9 hours of load followed by 4 to 5 hours of quiescence. Initial thoughts that this implied a single job firing, then re-firing N hours after completion, were ruled out (1).
Looking at the entire cluster, it was discovered that nmpost nodes running VLASS SE imaging shared the same IB packet pattern, though at a lower rate. The plot below for nmpost077 shows a match in the earlier part of the week.
Multiple nodes showed the same pattern and appear, in aggregate, to account for the total traffic into/out of the MDS server. In theory there should be no coupling between these VLASS SE imaging jobs, but in reality there is one: many of them share the same cfcache. In addition, CASA opens the cfcache in write mode. This means any processes that use the same copy of the cache incur lock contention when accessing the same table. This is likely the origin of the ll_intent_file_open and native_queued_spin_lock_slowpath kernel time seen on the slow-running CASA jobs.
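As an illustration of the mechanism, the difference between a writable and a read-only table open can be seen with python-casacore (a sketch only; CASA's gridder goes through the C++ Table system directly, and the cfcache table path below is hypothetical):
#!/usr/bin/env python3
# Sketch of the cfcache lock-contention mechanism using python-casacore.
# Illustration only; the table path is hypothetical.
from casacore.tables import table

cf_table = "/lustre/path/to/shared/cfcache/SOME_CF_TABLE.im"  # hypothetical

# Writable open: acquires a write lock on the table. If many concurrent
# imaging jobs point at the same cfcache copy, they contend here and generate
# heavy lock/intent (open) traffic against the Lustre MDS.
t = table(cf_table, readonly=False, ack=False)
t.close()

# Read-only open: no write lock is required, so independent jobs sharing
# the same cache copy do not contend with each other.
t = table(cf_table, readonly=True, ack=False)
t.close()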
Furthermore, once a certain level of contention is reached the MDS saturates, and all queries, whether from other VLASS SE jobs, regular pipeline jobs, or even interactive sessions, are impacted. This likely explains the intermittent behavior: jobs sharing a cache with many active siblings likely impacted each other, individual jobs may or may not have been impacted, and conversely, shared-cache jobs with few or no siblings may have been fine.
We're running a test, to be reported here, with 5 clones all sharing a cache and a 6th identical sibling with its own cfcache. We'll compare run times between the coupled and isolated jobs. If this is the actual problem then the solution is twofold:
1) Modify CASA tclean() to not open the cfcache in write mode. Presumably it currently opens for write, creating the cache on error and otherwise continuing. It should instead test for existence, create and close the cache if it doesn't exist, and then open it for read.
2) In the meantime, all SE cont jobs should use their own copy of the cfcache. It would be good to have the executing scripts simply copy it to local disk before starting CASA and delete it after CASA exits, avoiding the contention without permanently keeping many copies of an otherwise large directory (33GB or so). A wrapper along those lines is sketched below.
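A minimal sketch of such a wrapper, assuming the imaging script picks up the cfcache path from an environment variable; the paths, script name, and CASA invocation are placeholders, not the actual VLASS workflow:
#!/usr/bin/env python3
# Hypothetical wrapper: stage a private copy of the cfcache on local disk,
# run the CASA imaging script against it, then clean up afterwards.
import os, shutil, subprocess, sys, tempfile

SHARED_CFCACHE = "/lustre/path/to/shared/cfcache"   # hypothetical shared copy (~33GB)
CASA_SCRIPT    = "run_se_cont_tclean.py"            # hypothetical imaging script

workdir = tempfile.mkdtemp(prefix="cfcache_", dir="/tmp")
local_cfcache = os.path.join(workdir, "cfcache")
rc = 1
try:
    shutil.copytree(SHARED_CFCACHE, local_cfcache)        # copy in before CASA starts
    env = dict(os.environ, CFCACHE=local_cfcache)         # script reads this variable
    rc = subprocess.run(["casa", "--nogui", "-c", CASA_SCRIPT], env=env).returncode
finally:
    shutil.rmtree(workdir, ignore_errors=True)            # delete after CASA exits
sys.exit(rc)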
Step 2 successfully reduced the MDS load. Below is a plot of the MDS load for the week of February 21st to 28th. It shows the initial load reduction on the 26th when all VLASS jobs were converted to a local cfcache. The 27th shows a large spike while a pointed observation was tested, and the 28th shows the steady state after all awproject jobs were either converted to a local cfcache or stopped.
We're currently examining the access patterns while imaging SPW 7 from TSKY0001.sb36463619.eb36473386.58555.315533263885. The tests are being performed against a local cache, a cache on Lustre, and a cache on Lustre with usepointing=False.
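For reference, the parameters being varied look roughly like the following tclean() calls (run inside a CASA session, or via casatasks in CASA 6); the image names and cache paths are placeholders, and all other imaging parameters are omitted:
# Sketch of the three test configurations; only gridder, cfcache and usepointing
# are the parameters under test, everything else shown is a placeholder.
vis = "TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms"

# 1) cfcache on local disk (usepointing unchanged from the production setting)
tclean(vis=vis, imagename="spw7_local", gridder="awproject",
       cfcache="/tmp/cfcache")

# 2) cfcache on Lustre (usepointing unchanged from the production setting)
tclean(vis=vis, imagename="spw7_lustre", gridder="awproject",
       cfcache="/lustre/path/to/cfcache")

# 3) cfcache on Lustre, usepointing disabled
tclean(vis=vis, imagename="spw7_lustre_noptg", gridder="awproject",
       cfcache="/lustre/path/to/cfcache", usepointing=False)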
(1) I have a theory for the strange cadence pattern; if you buy me a beer and supply a whiteboard plus a few markers, I'll explain it.
Disproven Ideas
RHEL7/Lustre-2.10 Idea
This idea proposes that the increased system CPU usage on the MDS is caused by RHEL7, but more specifically Lustre-2.10. In the 1-year CPU graph of aocmds you can clearly see a large increase in system time starting around mid-October 2019. There is a very similar increase in user/system CPU time on nmpost061 through nmpost070.
These are the 10 newest nodes, installed for VLASS in September 2019, and they started off as RHEL7/Lustre-2.10 clients.
To test this, we rebooted nmpost071 through nmpost090 back into RHEL6/Lustre-2.5.5 so that vlasstest jobs would run on them. I started 10 CASA jobs, one per node in this range, staggered about 20 minutes apart. The system CPU usage on aocmds has not significantly changed because of these jobs, which is inconclusive. The proper test of this idea may simply be upgrading the Lustre servers to 2.10.
After upgrading the Lustre system in NM to 2.10.8, the effects on the MDS seem the same, so it appears this wasn't caused by just a client-side difference.
CFCache open-for-write lock contention
The local-disk CFCache resolved the contention on the MDS, which allowed us to examine individual runs more clearly; previously it was difficult to ascertain the impact of an individual run on the MDS. A large multi-node run using a 5.7 prerelease that opened the CFCache in write mode and a version Sanjay provided from the same trunk that opened it for read produced roughly the same contention on the MDS. This implies the issue is with data access patterns, possibly unique to VLASS, possibly unique to the gridding code.
Rapidly changing pointing in VLASS triggers multiple walks through CFcache
It does appear to be the case that VLASS data triggers more frequent loads/unloads of each SPW's worth of CFs as the pointing changes, but that occurs on timescales of seconds. The file-level reads and opens are happening on timescales of tens of microseconds. In addition, a large pointed observation exhibits the same behavior as VLASS data. This leaves access patterns within the gridding code itself as the remaining candidate, even though initial code inspection didn't expose a logical error.
TSKY0001.sb36463619.eb36473386.58555.315533263885_ptgfix_split_split_SPW7.ms