
There are various reports of performance issues that appear to be unique to NM.  The cause could be Lustre, cache issues, a kernel bug, or almost anything else.  We should use this page to track the broader issue and sub-pages to track specific test cases, since there may be multiple causes and effects that behave differently as a function of environmental changes.

The reports appear to be (in order of confidence and severity): VLASS calibration pipeline performance with CASA 5 vs CASA 6, VLASS imaging slowdown with CASA 5 now vs CASA 5 a few months ago (particularly during the major cycle), ALMA cube imaging in NM vs CV, and apocryphal miscellaneous reports such as DA plotms calls.  Each will be tracked in a sub-page.

I'd like to track the various causes and tests at this upper level, and the results for each use case within the sub-pages above.

Lustre: Tests vs NVMe

OS version: CASA 5.6 on RHEL6 vs RHEL7

Kernel: Current vs New.

Working Assumption

It appears the problem is related to Lustre, and likely to the work done in October to bring the cluster to RHEL7 with Lustre 2.10 clients while the servers remained at Lustre 2.5.

For 16-way parallel MPI jobs we'd typically expect a CPU core accounting mix of around 95% User, 5% System, 1%-5% IO Wait and 0% Idle time.  When a job exhibits the pathology the primary evidence is a mix of 20-50% User, 10-20% System, 0-1% IO Wait and 30-70% Idle time.  The high idle time is the most curious aspect.
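
A quick way to capture that accounting mix outside of top is to diff the aggregate cpu line of /proc/stat over a short window.  The snippet below is only an illustrative monitoring sketch (field order follows proc(5)), not part of the pipeline itself:

#!/usr/bin/env python
# Sample the aggregate CPU accounting mix from /proc/stat over a 5 second window.
# /proc/stat "cpu" fields per proc(5): user nice system idle iowait irq softirq steal ...
import time

def read_cpu():
    with open('/proc/stat') as f:
        return [int(x) for x in f.readline().split()[1:]]  # drop the leading "cpu" label

a = read_cpu()
time.sleep(5)                                              # 5 second window, like perf top -d 5
b = read_cpu()

delta = [y - x for x, y in zip(a, b)]
total = float(sum(delta)) or 1.0
user, nice, system, idle, iowait = delta[:5]

print("User %5.1f%%  System %5.1f%%  IO Wait %5.1f%%  Idle %5.1f%%" % (
    100 * (user + nice) / total,
    100 * system / total,
    100 * iowait / total,
    100 * idle / total))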

The working assumption is that the kernel charges idle time for kernel threads that lock the core in a spin lock state.  That state is explicitly not User or IO Wait, so it must be charged as either System or Idle.  It appears to be a mix of both, which may reflect either increased passes through the spin lock chain or a nuance of the accounting.

Looking at a single process from an AWP 16-way job that exhibits the behavior, we see the following with perf top -p <PID> -d 5 (5 second update):

Samples: 3K of event 'cycles:ppp', Event count (approx.): 1210939771   
Overhead  Shared Object                      Symbol                           
  35.93%  [kernel]                           [k] native_queued_spin_lock_slowpath
   3.44%  [kernel]                           [k] _raw_qspin_lock
   3.17%  libc-2.17.so                       [.] _int_malloc
   2.66%  libcasa_casa.so.2898.67.56         [.] casacore::real_a2_re_match_2
   2.27%  [kernel]                           [k] _raw_spin_lock_irqsave

This shows the heavy cost of the spin lock versus the small amount of CASA processing during that sample window.

Repeating with perf top -p <PID> -g -d 5 (the -g flag triggers call graph accounting) shows:

Samples: 20K of event 'cycles:ppp', Event count (approx.): 6201121238                                                                                                
  Children      Self  Shared Object                      Symbol                                                                                                      
+   68.93%     0.01%  [kernel]                           [k] system_call_fastpath
+   56.56%     0.00%  [unknown]                          [k] 0x67616e7461686273
+   55.58%     0.06%  [kernel]                           [k] mdc_enqueue_base
+   55.03%     0.14%  [kernel]                           [k] lmv_intent_lock
+   54.84%     0.03%  [kernel]                           [k] mdc_intent_lock
+   54.64%     0.00%  [kernel]                           [k] sys_open
+   54.62%     0.03%  [kernel]                           [k] do_sys_open
+   54.45%     0.00%  [kernel]                           [k] do_filp_open
+   54.44%     0.01%  [kernel]                           [k] path_openat
+   53.92%     0.04%  [kernel]                           [k] do_last
+   53.81%     0.01%  [kernel]                           [k] vfs_open
+   53.79%     0.00%  [kernel]                           [k] do_dentry_open
+   53.75%     0.03%  [kernel]                           [k] ll_file_open
+   53.56%     0.02%  [kernel]                           [k] ll_intent_file_open

This shows that for system_call_fastpath the majority of the time ends up in ll_intent_file_open (work from the bottom to the top of the call graph; each entry's Children percentage includes its own overhead plus that of everything it calls, so entries higher in the chain accumulate the entries below them).  The ll_*, lmv_* and mdc_* calls are part of the Lustre client kernel modules.  Time spent actually processing these kernel threads is charged as system time, so threads that hold the core waiting for a response, like ll_file_open, likely get charged as system time rather than idle time, while threads that hold the core because they're inherently intended to wait, like the spin lock, are likely charged as idle time.  If true, that explains the increased system time in addition to the idle time.  It would be better if the kernel had a fifth state to account for enforced system waits rather than splitting them between system time and idle time.  That's what you get with a free OS/kernel I guess (smile).
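
Since the call graph funnels through the open() path, a crude way to expose the per-open cost without running CASA is to time a tight open()/close() loop over a set of scratch files on the filesystem in question.  This is only a rough sketch (file counts and paths are placeholders), but run once against a Lustre directory and once against local/NVMe storage it should make the metadata cost visible if the assumption above is right:

#!/usr/bin/env python
# Time open()/close() over a set of small files to expose per-open (metadata) cost.
# Usage: python open_bench.py /path/to/scratch/dir [nfiles]
import os
import sys
import time

path = sys.argv[1]                        # scratch directory on the filesystem under test
nfiles = int(sys.argv[2]) if len(sys.argv) > 2 else 5000
names = [os.path.join(path, 'open_bench_%d_%d.tmp' % (os.getpid(), i)) for i in range(nfiles)]

# Create the files up front so the timed loop measures only open/close.
for n in names:
    open(n, 'w').close()

t0 = time.time()
for n in names:
    fd = os.open(n, os.O_RDONLY)          # each open walks the open path seen in the call graph above
    os.close(fd)
elapsed = time.time() - t0

for n in names:
    os.unlink(n)

print("%d opens in %.2f s  (%.1f us per open)" % (nfiles, elapsed, 1e6 * elapsed / nfiles))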

Lastly, looking at the Ganglia plots for the metadata server (MDS) over the last year, one can see a sharp increase in system time beginning around October 20th, roughly consistent with the cluster upgrade.

An up-to-date plot can be found here:

http://ganglia.aoc.nrao.edu/graph.php?g=cpu_report&z=large&c=Lustre%20AOC&h=aocmds.aoc.nrao.edu&m=load_one&r=year&s=descending&hc=4&st=1580403275

Next Steps

We're testing the trimmed-down VLASS case against an NVMe device to demonstrate that the pathology disappears when not using Lustre.  We're also testing against an RHEL6.9 client, though that may not be conclusive since the MDS is still unnaturally loaded.
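
A cheap proxy for that comparison is to point the open()/close() timing sketch above at a directory on the NVMe device and then at the equivalent directory on Lustre; if the working assumption holds, the per-open time on Lustre should be dramatically larger and the node should show the same System/Idle signature while the loop runs.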

If all of this is correct, the two solutions would be to:

1) Upgrade the Lustre servers to 2.10 as soon as possible

2) Downgrade the VLASS clients to RHEL6.9 and Lustre 2.5 with the assumption that they represent most of the load on Lustre and are the primary trigger for the MDS issues.

I'm suspicious this is also the root cause of the getcwd() error, which appears to be a race condition in the kernel when removing stale dentries from the cache.  The delay introduced by the loaded MDS may be sufficient to trigger it.  That's a bit inconsistent with the behavior also manifesting at the JAO, so possibly not.
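
If we want to see whether the loaded MDS makes that failure reproducible, a trivial probe is to sit in a Lustre directory and poll getcwd() while something else churns entries around it.  This is only a hedged sketch of such a probe (the churn has to be provided separately, and the timing is a guess), not a confirmed reproducer:

#!/usr/bin/env python
# Poll getcwd() from inside a Lustre directory and log any failures,
# e.g. while another process creates and removes entries in the parent directory.
# Usage: python getcwd_probe.py /lustre/path/to/dir [seconds]
import os
import sys
import time

os.chdir(sys.argv[1])
duration = float(sys.argv[2]) if len(sys.argv) > 2 else 600.0

failures = 0
end = time.time() + duration
while time.time() < end:
    try:
        os.getcwd()                       # an ENOENT here is the reported race
    except OSError as e:
        failures += 1
        sys.stderr.write('%s  getcwd failed: %s\n' % (time.ctime(), e))
    time.sleep(0.01)

print('%d getcwd failures in %.0f s' % (failures, duration))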

Current Theory

(Plot: MDS CPU one-year load.)

In addition to the MDS CPU one-year load above, we can see a finer grained pattern over the course of a week for the MDS CPU, IB bytes in and IB packets in:

(Plots: MDS CPU, IB bytes in and IB packets in over one week.)
