
This page summarizes the fundamentals and main findings of the ongoing multi-phase work by the Scientific Computing Group at NRAO. The main goal is to characterize the execution of the pipelines with respect to computing resources. Our expectation is that this work will help the ALMA ARCs gain a deeper understanding of the computational cost of data processing jobs, while providing developers with an additional tool for tracking specific areas where CASA can be made more resource efficient.

Data measured by the profiling framework

  • Timing
  • Memory footprint per process (see the sampling sketch after this list)
  • Memory load of a node (used, cached, swap, and the largest slab block)
  • Number of file descriptors per process
  • I/O statistics on the Lustre file system (number of files per I/O size range: 0k-4k, 4k-8k, ...)
  • Number and duration of system calls (open, close, read, write, fcntl, fsync)
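
The profiling framework itself is not reproduced on this page. As a rough illustration of how the per-process metrics above (memory footprint, file descriptors) and the node-level memory load could be sampled, the following minimal Python sketch uses psutil; that library choice, the sampling interval, and the "casa" process-name filter are assumptions for illustration, not necessarily what the framework actually uses.

# Minimal per-process sampling sketch. psutil is an assumed choice for
# illustration; it is not necessarily what the NRAO profiling framework uses.
import time
import psutil

def sample(pids, interval=5.0, samples=3):
    """Record RSS memory footprint and open file descriptors per process,
    plus node-level used/cached/swap memory, at a fixed interval."""
    for _ in range(samples):
        stamp = time.time()
        for pid in pids:
            try:
                proc = psutil.Process(pid)
                rss_mb = proc.memory_info().rss / 1024**2  # memory footprint per process
                nfds = proc.num_fds()                      # file descriptors per process
                print(f"{stamp:.0f} pid={pid} rss={rss_mb:.1f}MB fds={nfds}")
            except psutil.NoSuchProcess:
                continue
        vm = psutil.virtual_memory()
        sw = psutil.swap_memory()
        print(f"{stamp:.0f} node used={vm.used / 1024**2:.0f}MB "
              f"cached={vm.cached / 1024**2:.0f}MB swap={sw.used / 1024**2:.0f}MB")
        time.sleep(interval)

if __name__ == "__main__":
    # Example: watch processes whose name contains "casa" (hypothetical filter).
    casa_pids = [p.info["pid"] for p in psutil.process_iter(["pid", "name"])
                 if "casa" in (p.info["name"] or "").lower()]
    sample(casa_pids)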

Tests

The following tests were performed on the AOC cluster:

  • Serial benchmarks for all datasets
  • Parallelization breadth (number of MPI processes)
  • Storage type
  • Concurrency

The following tests were performed on AWS:

  • Parallelization breadth (number of MPI processes); a sweep sketch follows this list
  • Memory limit
  • Timing vs CPU type
  • Number of OpenMP threads
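
For context on how the parallelization-breadth and OpenMP-thread tests are parameterized, the sketch below drives repeated pipeline runs while varying the number of MPI processes and OMP_NUM_THREADS. It assumes CASA's standard mpicasa wrapper is on the PATH; the pipeline script name, the working directory, and the sweep values are hypothetical placeholders, not the actual test configuration.

# Sketch of a parallelization-breadth / OpenMP-thread sweep. Assumes CASA's
# mpicasa wrapper is on the PATH; the pipeline script name, sweep values, and
# working directory are hypothetical placeholders.
import os
import subprocess
import time

PIPELINE_SCRIPT = "run_pipeline.py"   # hypothetical CASA pipeline driver

def run_once(n_mpi, n_omp=1, workdir="."):
    """Run the pipeline once with n_mpi MPI processes and n_omp OpenMP threads,
    returning the wall-clock runtime in seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(n_omp))
    cmd = ["mpicasa", "-n", str(n_mpi),
           "casa", "--nogui", "--nologger", "--log2term", "-c", PIPELINE_SCRIPT]
    start = time.time()
    subprocess.run(cmd, cwd=workdir, env=env, check=True)
    return time.time() - start

if __name__ == "__main__":
    for n in (2, 4, 8, 16):                 # parallelization breadth
        print(f"{n}-way MPI: {run_once(n_mpi=n):.0f} s")
    for threads in (1, 2, 4):               # OpenMP threads at fixed breadth
        print(f"8-way MPI, {threads} OpenMP threads: {run_once(8, threads):.0f} s")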

Files

Conclusions

Summary of the main conclusions. See the reports and presentations under Files for detailed information.

  • Parallelization (MPI) of the calibration pipeline without creating an MMS reduces only the tclean times, resulting in a reduction of approximately 1-15% of the total pipeline runtime (see the sketch after this list)
  • Parallelization (MPI) of the imaging pipeline results in runtimes that decrease nearly linearly with the number of MPI processes
  • The reduction in runtime with local NVMe storage devices is less than 15% with respect to Lustre; still to be tested with larger devices that can accommodate working directories above ~1.5 TB
  • No appreciable difference in imaging runtime between 8, 16, and 32 GB of RAM per process (8-way MPI); not yet tested below 8 GB per process
  • The current recommendation is to run isolated jobs or 2-way concurrency (two jobs per node) with 8-way parallelization; more testing is planned to understand the swap memory behavior of 4-way concurrency, which is more efficient in terms of runtime
  • MPI parallelization is advantageous over OpenMP when there is enough memory to support more processes; OpenMP is advantageous when memory is exhausted and there are unused cores
  • Newer, faster CPUs with higher Passmark scores (an industry-standard benchmark - https://www.passmark.com/) are likely to yield shorter pipeline runtimes
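
The limited benefit for the calibration pipeline follows from Amdahl's law: if tclean accounts for only 1-15% of the total runtime, speeding up tclean alone cannot reduce the total runtime by more than that fraction. The sketch below is a back-of-the-envelope estimate using only the bounds quoted above; the 8-way breadth is an illustrative choice, not a measured configuration.

# Amdahl's law estimate: only the tclean fraction f of the calibration pipeline
# runtime benefits from MPI, so the overall reduction is bounded by f.
def overall_speedup(f, n):
    """Speedup when a fraction f of the runtime is accelerated n-fold."""
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.01, 0.15):     # tclean share of total runtime, bounds quoted above
    n = 8                  # illustrative 8-way MPI breadth
    s = overall_speedup(f, n)
    print(f"tclean fraction {f:.0%}, {n}-way MPI -> "
          f"speedup {s:.2f}x, runtime reduction {1 - 1 / s:.1%}")

Even with ideal tclean scaling, the overall reduction stays below the tclean fraction itself (roughly 0.9% and 13% in the two cases above), which is consistent with the small gains reported for the MMS-less calibration pipeline.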