Dask Domain Model
Here is what I think our domain model should be in the context of the dask ecosystem.
I would like to find a new name for the "Cluster Resource Manager" and rethink the functions allocated to it in the previous post.
I think we should consider how various use cases map to underlying compute resources, and separate out deployment concerns, in order to help guide and organize our prototyping efforts.
Resource Managers
Resource Managers manage work on compute resources.
Compute Resource : Resource Manager Example
Local : laptop OS
HPC : Slurm, PBS, etc. (AKA job queuing systems)
HTC : HTCondor
Cloud : AWS, GCP, etc.
Cluster Managers
Cluster Managers are dask abstractions, such as those included in dask_jobqueue, that deploy a scheduler and workers by communicating with resource managers, e.g. dask_jobqueue.SLURMCluster.
A worker is a Python object and a node in a dask cluster that serves two purposes: 1) serving data, and 2) performing computations.
Jobs are units of work submitted to and managed by resource managers.
For job queuing system Resource Managers (Slurm, PBS, etc.), the Cluster Manager configures each node and generates a job script for the underlying resource manager.
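For example, a minimal sketch with dask_jobqueue.SLURMCluster; the queue name and resource numbers here are placeholders, not a recommended configuration:

    from dask_jobqueue import SLURMCluster

    # Each "job" submitted to Slurm hosts one or more dask workers.
    cluster = SLURMCluster(
        queue="regular",       # hypothetical Slurm partition
        cores=8,               # cores per job
        memory="16GB",         # memory per job
        walltime="01:00:00",
    )

    # Inspect the job script the Cluster Manager generated for the resource manager.
    print(cluster.job_script())

    # Ask Slurm for two jobs' worth of workers.
    cluster.scale(jobs=2)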
dask.distributed provides a Client that interacts with Cluster Managers. The Client abstraction provides a consistent interface to the user for any Cluster/Resource Manager combination.
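For instance, a minimal sketch showing that the same Client works unchanged whether the cluster is local or managed by Slurm, HTCondor, or a cloud backend:

    from dask.distributed import Client, LocalCluster

    # The "Local : laptop OS" row of the table above: no external resource manager.
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)

    # User code is identical for any Cluster/Resource Manager combination.
    future = client.submit(sum, [1, 2, 3])
    print(future.result())  # 6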
Dask Gateway
Dask Gateway provides a secure, multi-tenant server for managing dask clusters. It allows users to launch and use dask clusters in a shared, centrally managed cluster environment, without requiring users to have direct access to the underlying cluster backend (e.g. Kubernetes, Hadoop/YARN, HPC Job queues, etc…).
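A minimal sketch of the user side, assuming a Gateway server is already running; the address is a placeholder:

    from dask_gateway import Gateway

    gateway = Gateway("https://gateway.example.com")  # hypothetical server address
    cluster = gateway.new_cluster()  # the backend (Kubernetes, YARN, ...) is the admin's choice
    cluster.scale(4)
    client = cluster.get_client()    # the usual dask.distributed Client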
HTC
dask_jobqueue provides an HTCondor Cluster Manager, dask_jobqueue.HTCondorCluster.
dask-CHTC customizes dask_jobqueue to fit CHTC’s needs.
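A minimal sketch of the HTCondor Cluster Manager from plain dask_jobqueue (not the dask-CHTC customizations); the resource numbers are placeholders:

    from dask_jobqueue import HTCondorCluster

    cluster = HTCondorCluster(cores=4, memory="8GB", disk="4GB")

    # Let dask grow and shrink the pool of HTCondor jobs with the workload.
    cluster.adapt(minimum_jobs=0, maximum_jobs=10)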
Cloud
dask_cloudprovider provides abstractions for constructing and managing ephemeral Dask clusters on various cloud platforms.
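A minimal sketch using the AWS Fargate backend; FargateCluster is one of several cluster managers dask_cloudprovider offers, and this assumes AWS credentials are already configured:

    from dask.distributed import Client
    from dask_cloudprovider.aws import FargateCluster

    cluster = FargateCluster(n_workers=2)  # ephemeral: resources exist only while in use
    client = Client(cluster)
    # ... run computations ...
    cluster.close()  # tears the cloud resources back down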
by Daniel Lyons
Let’s refine the idea of a science product. Each science product refers to a bucket containing some files. Inside the bucket there is always a metadata file, with a fixed name, that describes the product. The metadata file has a validatable format with some mandatory and some optional fields. It’s complete enough to encompass everything stakeholders would want us to make searchable, as well as what we need to enable processing.
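As a sketch of what that might look like, here is a hypothetical metadata file written from Python; the filename (metadata.json) and every field name are illustrative, not a settled format:

    import json

    metadata = {
        # mandatory fields: enough to make the product searchable
        "product_type": "execution_block",
        "telescope": "EVLA",
        "start_time": "2021-03-01T00:00:00Z",
        # optional fields: whatever else stakeholders or processing need
        "calibration_state": "raw",
    }

    # The metadata file always sits at the top of the science product bucket.
    with open("metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)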
This science product bucket could be realized locally on disk. Ingestion is reduced to importing one of these buckets into the archive. This is the interface we provide to external folks like Josh when they show up with products for us to host.
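A minimal sketch of that ingestion step, with a bucket realized as a local directory; the mandatory field names and paths are hypothetical:

    import json
    import shutil
    from pathlib import Path

    MANDATORY = {"product_type", "telescope", "start_time"}  # hypothetical field names

    def ingest(bucket: Path, archive_root: Path) -> None:
        """Import a science product bucket into the archive."""
        meta = json.loads((bucket / "metadata.json").read_text())
        missing = MANDATORY - meta.keys()
        if missing:
            raise ValueError(f"metadata is incomplete: missing {missing}")
        shutil.copytree(bucket, archive_root / bucket.name)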
We write and maintain programs that generate the metadata file from SDMs and FITS files for our instruments. If there is metadata we need but have no way of producing, we expect a human to generate it somehow. EVLA ingestion then becomes a two-step process: generate the metadata file from the SDM+BDF, then run the generic ingestion.
Archive storage is then bucket-oriented. We pass these buckets around to storage backends. On hierarchical media, we can just make subdirectories for each bucket. On bucket-oriented media like S3 or Ceph, well, it’s already bucket-oriented.
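A sketch of what a bucket-oriented backend interface might look like; the class and method names are hypothetical:

    import shutil
    from abc import ABC, abstractmethod
    from pathlib import Path

    class StorageBackend(ABC):
        """Anything that can hold science product buckets."""

        @abstractmethod
        def put_bucket(self, name: str, source: Path) -> None: ...

        @abstractmethod
        def get_bucket(self, name: str, destination: Path) -> None: ...

    class DiskBackend(StorageBackend):
        """Hierarchical media: one subdirectory per bucket."""

        def __init__(self, root: Path):
            self.root = root

        def put_bucket(self, name: str, source: Path) -> None:
            shutil.copytree(source, self.root / name)

        def get_bucket(self, name: str, destination: Path) -> None:
            shutil.copytree(self.root / name, destination)

An S3 or Ceph backend would implement the same two methods against its native bucket operations.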
The other software in the system, like delivery, doesn’t need to talk to the archive about the products in the bucket because it can just parse the same metadata file. We document the format and provide our own parser for it publicly. Our internal systems are then decoupled from the archive’s various services. They can just parse the file. We don’t need to provide as many services.
Self-healing the archive comes in two flavors: marching through the metadata files in each bucket, or generating new metadata files.
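A sketch of the first flavor: march through the archive and flag any bucket whose metadata file is missing or fails validation (field names hypothetical, as above):

    import json
    from pathlib import Path

    MANDATORY = {"product_type", "telescope", "start_time"}  # hypothetical field names

    def check_archive(archive_root: Path):
        """Yield (bucket, problem) pairs for buckets that need healing."""
        for bucket in archive_root.iterdir():
            meta_file = bucket / "metadata.json"
            if not meta_file.exists():
                yield bucket, "metadata file missing"
                continue
            missing = MANDATORY - json.loads(meta_file.read_text()).keys()
            if missing:
                yield bucket, f"missing fields: {missing}"

The second flavor, generating new metadata files, would reuse the same generator programs we maintain for ingestion.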
Versioning can be interposed between science products and their buckets. The science product would have a “current” version which points to a certain bucket and then a list of older versions that point to other buckets.
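A sketch of that indirection; the dataclass and field names are hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class ScienceProduct:
        product_id: str
        current: str  # name of the bucket holding the current version
        older_versions: list[str] = field(default_factory=list)

        def supersede(self, new_bucket: str) -> None:
            """Point current at a new bucket, keeping the old one as history."""
            self.older_versions.insert(0, self.current)
            self.current = new_bucket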
Ancillaries I don’t have an answer for. My gut feeling is they should just be inside the science product bucket, maybe in a directory with a fixed name like “ancillary.”