...

Multiple nodes showed the same pattern and, in aggregate, appear to account for the total traffic into and out of the MDS. In theory there should be no coupling between these VLASS SE imaging jobs, but in practice there is: many of them share the cfcache, and CASA opens the cfcache in write mode. Any processes that use the same copy of the cache will therefore incur lock contention when accessing the same table. This is likely the origin of the ll_intent_file_open and native_queued_spin_lock_slowpath kernel functions seen on the slow-running CASA jobs.
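The effect of write-mode opens can be illustrated with a minimal sketch. It uses generic advisory file locks (Python's fcntl.flock) on a temporary file and is not a reproduction of CASA's or Lustre's actual locking; the function names try_lock, count_granted and demo are hypothetical and exist only for this illustration. It shows that several openers can hold a shared (read) lock on one file at once, while only one opener at a time can hold an exclusive (write) lock, so the rest have to wait.

```python
import fcntl
import os
import tempfile


def try_lock(path, exclusive):
    """Open path and try to take an advisory lock without blocking.

    Returns the open file object on success, or None if another open
    handle already holds the lock in a conflicting mode.
    """
    handle = open(path, "r+" if exclusive else "r")
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(handle, mode | fcntl.LOCK_NB)
    except BlockingIOError:
        # A real job would block or retry here; that wait is the contention.
        handle.close()
        return None
    return handle


def count_granted(path, n_openers, exclusive):
    """Count how many of n_openers simultaneous openers obtain the lock."""
    handles = [try_lock(path, exclusive) for _ in range(n_openers)]
    granted = [h for h in handles if h is not None]
    for h in granted:
        h.close()  # closing the handle releases its lock
    return len(granted)


def demo():
    """Compare four read-mode openers with four write-mode openers."""
    fd, path = tempfile.mkstemp()  # stand-in for a shared cache table
    os.close(fd)
    try:
        readers = count_granted(path, 4, exclusive=False)
        writers = count_granted(path, 4, exclusive=True)
    finally:
        os.remove(path)
    return readers, writers
```

On a local Linux filesystem, demo() should return (4, 1): all four read-mode openers proceed together, but only one of the four write-mode openers gets the lock. On a shared filesystem each of those lock requests also costs a round trip to the server, which is the coupling between jobs described above.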

Furthermore, once a certain level of contention is reached the MDS saturates, and all queries are impacted, whether they come from other VLASS SE jobs, regular pipeline jobs, or even interactive sessions. This likely explains the intermittent behavior: jobs sharing a cache with several siblings likely slowed one another down, individual jobs may or may not have been impacted depending on whether the MDS was saturated at the time, and shared-cache jobs with few or no siblings may have been fine.

...