The AOC has 30 16-core cluster nodes, which jointly provide roughly 4 million core hours per year. SE Continuum imaging takes between 30M and 60M core hours per year (50M to 100M per 18 months). To realize operational-scale imaging, the NRAO needs to identify 10 to 15 times its current capacity.
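The capacity figure above follows from simple arithmetic. The sketch below reproduces it, assuming every node is available for all 8,760 hours of a non-leap year (an idealization that the text does not state); the function and variable names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 100% availability, non-leap year


def core_hours_per_year(nodes: int, cores_per_node: int) -> int:
    """Upper bound on the core hours a cluster can deliver in one year."""
    return nodes * cores_per_node * HOURS_PER_YEAR


# AOC cluster as described in the text: 30 nodes with 16 cores each.
aoc_capacity = core_hours_per_year(nodes=30, cores_per_node=16)

# Annual SE continuum imaging demand quoted in the text, in core hours.
demand_low, demand_high = 30e6, 60e6

print(f"AOC capacity: {aoc_capacity / 1e6:.1f}M core hours/year")
print(
    "Demand relative to capacity: "
    f"{demand_low / aoc_capacity:.1f}x to {demand_high / aoc_capacity:.1f}x"
)
```

The upper end of this ratio lands near the 10 to 15 times figure quoted above.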
VLASS SE imaging in its current form presents 3 unique challenges to external processing facilities:
- Input data sizes are large
  - Calibrated MS is ~1 TB for most SBs; could be reduced to 8 GB by pre-splitting the MS to the relevant visibilities
  - CFCache is 30 GB; could be generated on the fly at a minor runtime cost (for monolithic imaging)
  - CASA code stack is ~1 GB (may only be an issue for more fine-grained decompositions)
- Memory footprints are large
  - Need to characterize the major and minor cycles
  - Need to examine CASA 6 refactored imaging code in this context
- Imaging run times are viewed as long
  - Single-node, 16-way parallel run times are in the range of 200 to 500 hours of wall-clock time. For external facilities without preemption, wall-clock time is important
...
We've identified HTCondor and CHTC+OSG as the preferred distribution stack and resource providers. As the memory footprint and runtime of jobs decrease, more resources become available. For facilities that do not preempt running jobs in favor of low-priority tasks, it is critical that they enforce a maximum runtime.
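To make the per-job resource requests concrete, here is a minimal sketch that renders an HTCondor submit description from per-job parameters. It is not the project's actual submit file: the script name `run_major_cycle.sh`, the log file name, and the example values are hypothetical, and the memory figure assumes gigabytes. The `periodic_remove` expression is a submit-side runtime cap that mirrors a facility-enforced limit rather than replacing it.

```python
def render_submit(executable: str, cpus: int, memory_gb: int,
                  max_runtime_hours: float, n_jobs: int) -> str:
    """Render an HTCondor submit description as plain text."""
    max_runtime_s = int(max_runtime_hours * 3600)
    memory_mb = memory_gb * 1024  # request_memory defaults to megabytes
    lines = [
        f"executable = {executable}",
        "arguments = $(Process)",  # job index, e.g. to select an SPW
        f"request_cpus = {cpus}",
        f"request_memory = {memory_mb}",
        # Remove a job that has been running (JobStatus == 2) past the cap.
        "periodic_remove = (JobStatus == 2) && "
        f"((time() - EnteredCurrentStatus) > {max_runtime_s})",
        "log = imaging_$(Cluster).log",  # hypothetical log name
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 320 single-core jobs, each capped at 20 hours.
submit_text = render_submit("run_major_cycle.sh", cpus=1, memory_gb=16,
                            max_runtime_hours=20, n_jobs=320)
print(submit_text)
```

Keeping the runtime cap next to the resource requests makes it easy to shrink all three together as the imaging decomposition becomes finer grained.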
Cores per job | Memory per core | Runtime per job (hours) | Concurrent jobs per imaging run | Total jobs per imaging run | Core hours per imaging run | Total pipeline wall time | Available core hours per year | Pipeline characteristics | Notes |
---|---|---|---|---|---|---|---|---|---|
16 | 32 | 400 | 1 | 1 | 6400 | 6400 | 400K | Stock pipeline as is | Stock |
16 | 32 | 20 | 1 | ~20 | ~6400 | ~7000 | 2M | Per major+minor cycle executions | Useful first step, unlocks other modes, provides upwards of an AOC's worth of hardware |
16 | 32?? | 20 | 1 | ~20 | ~6400 | ~7500 | 2M | Major cycle only | Not interesting by itself, necessary precursor |
1 | 16 | 20 | 16 | 320 | ~6400 | ~8000 | 30M+ | Major cycle only per SPW | Increase in per image runtime, access to 10x AOC |
1 | 4-8 | 0.8 | 512 | ~10000 | ~8000 | ~10000 | 100M+ | Major cycle only per SPW per W | Access to OSG, substantial increase in per job runtime |
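The core-hours column in the table can be cross-checked as cores per job × runtime per job × total jobs. The sketch below does this for each row, treating the approximate (~) entries as exact; the `Mode` class and its field names are illustrative and not part of any pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Mode:
    """One row of the table above, reduced to the fields used here."""
    name: str
    cores_per_job: int
    runtime_hours: float
    total_jobs: int

    def core_hours(self) -> float:
        # Core hours per imaging run = cores x hours per job x job count.
        return self.cores_per_job * self.runtime_hours * self.total_jobs


modes = [
    Mode("Stock pipeline as is", 16, 400, 1),
    Mode("Per major+minor cycle executions", 16, 20, 20),
    Mode("Major cycle only", 16, 20, 20),
    Mode("Major cycle only per SPW", 1, 20, 320),
    Mode("Major cycle only per SPW per W", 1, 0.8, 10000),
]

for m in modes:
    print(f"{m.name}: {m.core_hours():.0f} core hours per imaging run")
```

Under these inputs the first four modes come to 6400 core hours and the last to 8000, matching the approximate values in the table.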