...
The working assumption is that the kernel charges idle time for kernel threads that lock the core in a spin lock state. That state is explicitly not User or IO Wait so it must be charged as either System or Idle. It appears to be a mix of both which may either be a reflection of increased loops through the spin lock chain or nuance of accounting.
Looking at a single process from an AWP 16way jobs that exhibits the behavior we see the following with perf top -p <process> -d 5 (5 second update).
Samples: 3K of event 'cycles:ppp', Event count (approx.):
...
1210939771
Overhead Shared Object Symbol
35.93% [kernel] [k] native_queued_spin_lock_slowpath
3.44% [kernel] [k] _raw_qspin_lock
3.17% libc-2.17.so [.] _int_malloc
2.66% libcasa_casa.so.2898.67.56 [.] casacore::real_a2_re_match_2
2.27% [kernel] [k] _raw_spin_lock_irqsave
Which shows the heavy cost of the spin lock vs the small amount of casa processing during that sample window.
Repeating with perf top -p <process> -g -d 5 ( the -g triggers call graph accounting) shows:
Samples: 20K of event 'cycles:ppp', Event count (approx.): 6201121238
Children Self Shared Object Symbol
+ 68.93% 0.01% [kernel] [k] system_call_fastpath
+ 56.56% 0.00% [unknown] [k] 0x67616e7461686273
+ 55.58% 0.06% [kernel] [k] mdc_enqueue_base
+ 55.03% 0.14% [kernel] [k] lmv_intent_lock
+ 54.84% 0.03% [kernel] [k] mdc_intent_lock
+ 54.64% 0.00% [kernel] [k] sys_open
+ 54.62% 0.03% [kernel] [k] do_sys_open
+ 54.45% 0.00% [kernel] [k] do_filp_open
+ 54.44% 0.01% [kernel] [k] path_openat
+ 53.92% 0.04% [kernel] [k] do_last
+ 53.81% 0.01% [kernel] [k] vfs_open
+ 53.79% 0.00% [kernel] [k] do_dentry_open
+ 53.75% 0.03% [kernel] [k] ll_file_open
+ 53.56% 0.02% [kernel] [k] ll_intent_file_open
This shows that for the system_call_fastpath the majority of the time ins in ll_intent_file_open (work from both to top of the call graph, each higher entry's percentage is the sum of all predecessors plus itself). All of those calls are part of the Lustre kernel.
Lastly looking at the ganglia plots for the Metadata server (MDS) for the last year one can see a sharp increase in system time that began in October, roughly consistent with the cluster upgrade.
Up to date plot can be found here:
Next steps
We're testing the trimmed down VLASS case against an NVME device to demonstrate the pathology disappears when not using Lustre. We're also testing against an RHEL6.9 client though that may not be conclusive since the MDS is still unnaturally loaded.
...