...
1) I have a theory for the strange cadence pattern. If you buy me a beer and supply a whiteboard plus a few markers, I'll explain it.
RHEL7/Lustre-2.10 Idea
This idea proposes that the increased system CPU usage on the MDS is caused by RHEL7 or, more specifically, by Lustre-2.10. In the 1-year CPU graph of aocmds you can clearly see a large increase in system time starting around mid-October 2019. There is a very similar increase in user/system CPU time on nmpost061 through nmpost070.
These are the 10 newest nodes, installed for VLASS in September 2019, and they started off as RHEL7/Lustre-2.10.
To test this, we rebooted nmpost071 through nmpost090 back into RHEL6/Lustre-2.5.5 so that vlasstest jobs would run on them. I have started 10 casa jobs, one per node in this range, about 20 minutes apart. The system CPU usage on aocmds has not changed significantly because of these jobs, so the result is inconclusive. The proper test of this idea may simply be to upgrade the Lustre servers to 2.10.
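
To put a number on "has not changed significantly", the system CPU share can be sampled directly on the MDS before and during a test run. What follows is only a minimal sketch: it assumes the "system time" in the graphs corresponds to the `system` field of the aggregate `cpu` line in `/proc/stat`, and the function names and the 60-second interval are hypothetical, not part of any existing monitoring setup.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def system_cpu_percent(before, after):
    """Percentage of elapsed CPU time spent in system mode between two samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    # Use at most the first 8 fields: user, nice, system, idle, iowait,
    # irq, softirq, steal. Guest time is already counted in user/nice.
    deltas = [x - y for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 2 is the 'system' field; irq and softirq are not included here.
    return 100.0 * deltas[2] / total


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(60)  # hypothetical sampling interval
    second = read_cpu_line()
    print("system CPU: %.1f%%" % system_cpu_percent(first, second))
```

Running this on aocmds once while the RHEL6/Lustre-2.5.5 jobs are active and once while comparable RHEL7/Lustre-2.10 jobs are active would give two figures to compare. A single pair of samples is noisy, so several runs would be needed before drawing any conclusion.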