...
- split workflow into separate data staging and imaging pipelines to avoid manipulating 1 TB class MSes
- per tclean call (scripted pipeline has 12 unique calls to tclean)
- per major cycle (submit jobs on a per major cycle basis; assuming a 400 hour job with 20 major cycles, this results in ~10 to ~20 hour jobs from the external host's perspective)
- Separate major and minor cycles (run major cycle on external hosts and minor cycle locally to more effectively balance memory demands)
- Separate major cycle per SPW
  - reduces input data and CFcache input size by 16x
  - reduces per job runtime by an additional 16x; 20 hour per major cycle jobs become 1+ hour jobs
  - linearly reduces CFcache creation time as well; it may still be more cost effective to build on the fly rather than distribute
- Separate by W terms (ship 1024 distinct gridding jobs, per SPW per W, for each major cycle)
  - Finest practical granularity; run time becomes close to scatter/gather time
  - Not currently possible
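
The runtime arithmetic behind the first granularity levels can be sketched as follows. This is a minimal illustration that assumes the 400 hour run, 20 major cycles, and 16x per-SPW reduction quoted in the list; the even division of runtime is an idealization, and the function and variable names are hypothetical.

```python
# Idealized per-job runtime at each level of decomposition.
# Inputs come from the list above; even division is an assumption.
TOTAL_RUNTIME_HOURS = 400.0  # stock imaging run
MAJOR_CYCLES = 20            # major cycles per imaging run
SPW_COUNT = 16               # spectral windows (the 16x reduction)


def split_runtime(total_hours: float, *factors: int) -> float:
    """Divide a runtime evenly across successive split factors."""
    hours = total_hours
    for factor in factors:
        hours /= factor
    return hours


per_cycle_hours = split_runtime(TOTAL_RUNTIME_HOURS, MAJOR_CYCLES)
per_spw_hours = split_runtime(TOTAL_RUNTIME_HOURS, MAJOR_CYCLES, SPW_COUNT)
jobs_per_run_per_spw = MAJOR_CYCLES * SPW_COUNT

print(f"per major cycle job: {per_cycle_hours:.2f} h")
print(f"per major cycle, per SPW job: {per_spw_hours:.2f} h")
print(f"per-SPW jobs per imaging run: {jobs_per_run_per_spw}")
```

Under these assumptions a per-SPW job comes out at 1.25 hours, which is in line with the "1+ hour" figure in the list. Real jobs would carry staging and scatter/gather overhead that this sketch ignores.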

We've identified HTCondor and CHTC+OSG as the preferred distribution stack and resource providers. As the memory footprint and runtime of jobs decrease, more resources become available. For facilities that do not preempt running jobs in favor of low priority tasks, it is critical that they enforce a maximum runtime.
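
As an illustration of how a per-SPW major cycle might be handed to HTCondor, the sketch below generates a submit description with a runtime cap. The text asks facilities to enforce a maximum runtime; the submit-side `periodic_remove` expression shown here is only one possible guard and is an assumption, not the project's configuration. The script name `run_major_cycle.sh`, the use of GB for memory, and the helper name are likewise hypothetical.

```python
# Build an HTCondor submit description for N single-core jobs.
# The executable name and the GB memory unit are assumptions.
def build_submit_description(
    n_jobs: int,
    memory_gb: int,
    max_runtime_hours: float,
    executable: str = "run_major_cycle.sh",  # hypothetical wrapper script
) -> str:
    max_runtime_s = int(max_runtime_hours * 3600)
    lines = [
        f"executable = {executable}",
        # $(Process) runs from 0 to n_jobs - 1; used here as the SPW index.
        "arguments = $(Process)",
        "request_cpus = 1",
        f"request_memory = {memory_gb}GB",
        # JobStatus == 2 means running; remove jobs that exceed the cap.
        "periodic_remove = (JobStatus == 2) && "
        f"((time() - EnteredCurrentStatus) > {max_runtime_s})",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


submit_text = build_submit_description(n_jobs=16, memory_gb=16, max_runtime_hours=20)
print(submit_text)
```

The values passed in the usage line (16 jobs, 16 per core, 20 hours) mirror the "Major cycle only per SPW" mode; the resulting text could be written to a file and passed to `condor_submit`.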

Cores per job | Memory per core | Runtime per job (hours) | Concurrent jobs per imaging run | Total jobs per imaging run | Core hours per imaging run | Total pipeline wall time | Available core hours per year | Pipeline characteristics | Notes |
---|---|---|---|---|---|---|---|---|---|
16 | 32 | 400 | 1 | 1 | 6400 | 6400 | 400K | Stock pipeline as is | Stock |
16 | 32 | 20 | 1 | ~20 | ~6400 | ~7000 | 2M | Per major+minor cycle executions | Useful first step, unlocks other modes, provides upwards of an AOC worth of hardware |
16 | 32?? | 20 | 1 | ~20 | ~6400 | ~7500 | 2M | Major cycle only | Not interesting by itself, necessary precursor |
1 | 16 | 20 | 16 | 320 | ~6400 | ~8000 | 30M+ | Major cycle only per SPW | Increase in per image runtime, access to 10x AOC |
1 | 4-8 | 0.8 | 512 | ~10000 | ~8000 | ~10000 | 100M+ | Major cycle only per SPW per W | Access to OSG, substantial increase in per job runtime |
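
The "Core hours per imaging run" column appears to follow from cores per job × runtime per job × total jobs. The sketch below recomputes it from the table rows as a consistency check; approximate ("~") table values are taken at face value, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """One row of the table; approximate values are taken at face value."""
    name: str
    cores_per_job: int
    runtime_hours: float
    total_jobs: int

    @property
    def core_hours(self) -> float:
        # Assumed relation: cores x hours per job x number of jobs.
        return self.cores_per_job * self.runtime_hours * self.total_jobs


MODES = [
    Mode("Stock pipeline as is", 16, 400, 1),
    Mode("Per major+minor cycle executions", 16, 20, 20),
    Mode("Major cycle only", 16, 20, 20),
    Mode("Major cycle only per SPW", 1, 20, 320),
    Mode("Major cycle only per SPW per W", 1, 0.8, 10000),
]

for mode in MODES:
    print(f"{mode.name:34s} {mode.core_hours:8.0f} core hours")
```

Under this relation the first four modes come out at 6400 core hours and the per-W mode at roughly 8000, matching the table; the wall time and availability columns are not modeled here.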