...
To Do
- Talk with SSA and VLASS about how we actually use these remote clusters.
- Staging, flocking, etc
- Find a spot at DSOC for test system
Timeline
- Buy test system as soon as practical (assuming the project is still a go)
- Does Jeff Kern know if this is a go or not?
- Talk to Matthew about where to put this stuff
- May 3, 2022 krowe: talked to Matthew. He will consult with Peter and get back to me. I am thinking 253T.
- Done: Ask Jeff Long to spec a switch
- Buy production system by July
- Receive production system by Aug
- Install production system by Dec
- Running in Jan. 2023
...
This is conceptual at this point. Need to talk with SSA and VLASS about this.
- We pre-stage data on the remote head node
- We then either submit a job locally and let it flock to the remote site, or we log in to the remote site and submit from there (see the submit-file sketch after this list).
- Can we use a nifty filesystem to simplify this (Ceph or that LHC fs)?
- This might be a good phase 2 problem to solve.
- Is this kinda what nraorsync does?
- The remote execute hosts transfer data from the remote head node
- The job uploads resulting data to the remote head node
- We retrieve data from the remote head node
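A minimal submit-description sketch of the staging idea above, assuming data have already been pre-staged on the remote head node and that the nraorsync plugin (nraorsync_plugin.py) handles nraorsync:// URLs; the paths, URL forms, and resource requests below are placeholders, not a tested recipe.

    # Sketch only: stage in from, and stage out to, the remote head node.
    # Paths and the nraorsync:// URLs are placeholders.
    executable              = run_casa.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # pull the pre-staged input from the remote head node
    transfer_input_files    = nraorsync://headnode/stage/VLASS/J1234+5678/
    # push results back to the head node so we can retrieve them later
    output_destination      = nraorsync://headnode/results/VLASS/J1234+5678/
    request_cpus            = 8
    request_memory          = 32G
    queue

Whether the plugin runs on the execute hosts (pulling from the head node) or we instead rely on a shared filesystem is exactly the question to settle with SSA and VLASS.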
...
- Get NRAO jobs onto the remote racks. This may depend on how we want to use these racks: if we want them to run specific types of jobs, then ClassAd options may be the solution; if we want them as overflow for jobs run at NRAO, then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Flocking may actually be the best method because I don't think it requires the execute nodes to have external network access.
- Staging and submitting remotely?
- Flocking? What are the networking requirements? (see the flocking sketch after this list)
- ClassAd options? I think this would require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa. Could CCB help here?
- Other?
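A minimal configuration sketch of the flocking option, assuming one NRAO schedd flocks to the remote pool's central manager; the hostnames are placeholders and the exact security knobs should be checked against the HTCondor flocking documentation.

    # On the NRAO submit host: flock to the remote pool's central manager
    FLOCK_TO = cm.remote-institution.example
    # the remote pool's negotiator will contact our schedd directly
    ALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

    # On the remote pool (central manager and execute nodes):
    # allow the NRAO schedd to advertise itself and claim slots
    ALLOW_WRITE = $(ALLOW_WRITE), nrao-submit.aoc.nrao.edu

    # If the execute nodes only have private addresses, have them register
    # with a CCB broker (typically the remote collector) so our schedd can
    # still reach them without inbound routes to every node
    CCB_ADDRESS = $(COLLECTOR_HOST)

Note that with CCB the execute nodes still make outbound connections back to our schedd, so "no external network access" is only partly true; worth confirming before committing to flocking.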
- Remote HTCondor concerns
- Do we want our jobs to run as an NRAO user like vlapipe or as nobody?
- Do we want the remote institution's jobs to run as the institution's own users, some dedicated user, or nobody?
- Need to support a 50% workload for NRAO and a 50% workload for the remote institution. How? (see the group-quota sketch after this list)
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
- Share disk space on the head node: 50% NRAO and 50% remote institution
- Two partitions: one for NRAO and one for the remote institution?
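A minimal negotiator-side sketch of the 50/50 split using hierarchical group quotas (the Group Accounting option above); the group names and the use of dynamic quotas are assumptions.

    # Sketch only: two accounting groups, each guaranteed half the pool.
    GROUP_NAMES = group_nrao, group_remote
    GROUP_QUOTA_DYNAMIC_group_nrao   = 0.5
    GROUP_QUOTA_DYNAMIC_group_remote = 0.5
    # let either side use idle cycles beyond its half when the other
    # side has nothing queued
    GROUP_ACCEPT_SURPLUS = True

Jobs would then opt in with accounting_group = group_nrao (or group_remote) in the submit description; doing nothing and relying on per-user fair-share, as noted above, is the simpler fallback.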
Documentation
- A project book like the one we did for USNO could be appropriate
- Process diagrams (how systems boot, how jobs get started from NRAO and run, how the remote institutions start jobs, etc.)
Networking
...
- DNS
- What DNS domain will these hosts be in? nrao.edu? The remote institution's own domain? Other?
- DHCP
- SMTP
- NTP
- NFS
- LDAP? How do we handle accounts? I think we will want accounts on at least the head node. The execute nodes could run everything as nobody or as real users. If we want real users on the execute hosts then we should use a directory service, which should probably be LDAP. No sense in teaching folks how to use NIS anymore.
- Remote institution accounts only?
- ssh
- rsync (nraorsync_plugin.py)
- NAT so the nodes can download/upload data
- TFTP (for OSes and switch)
- condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor (see the shared-port sketch after this list)
- ganglia
- nagios
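A minimal sketch of keeping condor traffic on the single shared port 9618 so the site firewall only needs one inbound TCP rule per condor host; the source network below is a placeholder.

    # HTCondor: keep all daemons behind the shared port (the default in
    # recent releases), so only TCP 9618 must be reachable from outside
    USE_SHARED_PORT  = True
    SHARED_PORT_PORT = 9618

    # Firewall side, e.g. with firewalld (placeholder source network):
    #   firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="198.51.100.0/24" port port="9618" protocol="tcp" accept'
    #   firewall-cmd --reload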
...
- Must support CASA
- Will need a patching/updating mechanism
- How to boot diskless OS images
- I am not finding any new, sexy software packages to automate PXE+DHCP+TFTP+NFS, so we will keep doing it the way we have been for years (see the PXE sketch below) https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/setting-up-a-remote-diskless-system_managing-storage-devices
- One OS image for both our use and the remote institution's use, or multiple OS images?
- Use containers (docker, singularity/apptainer, kubernetes, mesos, etc)?
- Ask Greg at CHTC what they use
- They use disked OSes and puppet to maintain it
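A minimal sketch of the usual PXE+DHCP+TFTP+NFS chain referenced above; addresses, paths, and filenames are placeholders, and UEFI nodes would need a GRUB binary instead of pxelinux.

    # dhcpd.conf on the head node (placeholder subnet)
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.100 10.0.0.200;
        next-server 10.0.0.1;        # TFTP server (the head node)
        filename "pxelinux.0";
    }

    # pxelinux.cfg/default in the TFTP root: kernel + NFS-mounted root
    LABEL rhel8-diskless
        KERNEL vmlinuz-rhel8
        APPEND initrd=initramfs-rhel8.img ip=dhcp root=nfs:10.0.0.1:/export/diskless/rhel8 rw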
- What Linux distribution to use?
- Can we use Red Hat with our current license? I have looked in JDE and I can't find a recent subscription. Need to ask David.
- Should we buy Red Hat licenses like we did for USNO?
- USNO is between $10K and $15K per year for 81 licensed nodes. This may not be an EDU license.
- NRAO used to have a 1,000 host license for Red Hat but I don't know what they have now.
- Do we even want to use Red Hat?
- Alternatives would be Rocky Linux or AlmaLinux since CentOS is essentially dead
- What version do we use: RHEL7 or RHEL8?
...
- CASA
- HTCondor
- Will need a way to maintain the software
- stow, rpm, modules, containers?
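A minimal sketch of the modules option above for keeping CASA releases (and HTCondor builds, if not packaged as RPMs) side by side; the install prefix and version are placeholders.

    #%Module1.0
    ## Placeholder modulefile, e.g. /opt/modulefiles/casa/6.6.0
    set version 6.6.0
    set prefix  /opt/casa/$version
    prepend-path PATH            $prefix/bin
    prepend-path LD_LIBRARY_PATH $prefix/lib

Users (or pipeline wrappers) would then run "module load casa/6.6.0"; stow or plain RPMs would serve the same purpose with less indirection.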
...
Third party software for remote institution
- Will need a way to maintain software for the remote institution site
- stow, rpm, modules, containers?
...
Maintenance
- replace disk (remote institution admin)
- replace/reseat DIMM (remote institution admin)
- replace power supply (remote institution admin)
- NRAO may handle replacement hardware. Drop-ship parts? Keep spares ourselves?
- Patching OS images (NRAO) (see the patching sketch after this list)
- Patching third party software like CASA and HTCondor (NRAO)
- Altering OS images (NRAO)
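A minimal sketch of patching the diskless OS image from the head node; the image path is a placeholder and the exact workflow (snapshot first, test image, then promote) still needs to be decided.

    # Sketch only -- paths are placeholders. Patch the NFS-exported image
    # in place; nodes pick up the changes on their next reboot.
    dnf --installroot=/export/diskless/rhel8 --releasever=8 upgrade -y
    # If the kernel was updated, copy the new kernel and initramfs from the
    # image's /boot into the TFTP root that PXE serves (see the PXE sketch
    # above) before rebooting the nodes.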
...
- Cabinet/Rack: front and rear locking doors with mesh. Width: 19". Height: 42U is most common. Depth: 42" is most common. The rack must support at least a 2,000 lb static load
- https://greatcabinets.com/product/es-ms/ This is what we usually get
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Black-1991H-x-600W-x-1070D-mm/P-AR3100 APC NetShelter SX
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Shock-Packaging-2000-lbs-Black-1991H-x-600W-x-1070D-mm/P-AR3100SP APC NetShelter SX designed for re-shipping after being equipped
- PDU: one PDU or two PDUs? What plug? What voltage? This may vary across sites. What if the site has two power sources?
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-2G-switched-0U-30A-100V-to-120V-24-NEMA-5-20R-sockets/P-AP8932 24 NEMA 5-20R, 120V, 2.8kW, NEMA L5-30P 1Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-8-6kW-208V-21-C13-and-C15-3-C19-and-C21-sockets/P-APDU9965 21 C13/C15 and 3 C19/C21, 208V, 8.6kW, NEMA L21-30P 3Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-17-3kW-208V-42-C13-and-C15-6-C19-and-C21-sockets/P-APDU9967 42 C13/C15 and 6 C19/C21, 208V, 17.3kW, IEC60309 60A 3P+PE 3Phase
- Stagger startups on PDU
- UPS: for just the head node and switch? This may depend on the voltage of the PDUs.
- Switch: Cisco Catalyst C9300X-48TX
- 48 data ports: 48x 10G Multigigabit (100M, 1G, 2.5G, 5G, or 10 Gbps)
- Switching capacity: 2,000 Gbps
- Nine optional modular uplinks: 100G/40G/25G/10G/1G
- Redundant power supply: 715 W
- About $12K
- Environmental Monitoring: Could the PDU do this?
- KVM: rackmount, not remote, and patch cables
- Ethernet cables:
- Power cables: single or Y cables depending on the number of PDUs and whether there are two power sources.
- Head Node: lots of disk. Does the remote institution have access to this disk space? Maybe not.
- iDRAC: Ask CIS what they recommend
- Memory: at least 32GB of RAM to help cache the OS image. 64GB would be even better.
- Storage: mdRAID, ZFS, Btrfs, or a RAID card? Do we want the boot and data arrays to be the same type? (see the mdadm sketch at the end of this section)
- OS/OS images: RAID1, with or without a spare? (3 disks), about 1TB
- OS: We have been making 40GB partitions for / for over a decade and that still looks to work with RHEL8.
- Swap: 0 or 8GB at most
- /export/home/<hostname>: services and diskless_images
- Working data/software (remote institution and NRAO), RAID6 w/spare or RAID7 (9 disks), about 72TB
- SE imaging input data is about 10GB per job
- We need maybe 20TB+ of total space or more, so perhaps 60TB split in two
- Carve into two partitions, one for NRAO and one for the remote institution; each partition has data and software directories.
- Networking: May need more than one port. One for internal networking to nodes and one for external Internet access.
- 30 1U nodes, 15 2U nodes, or a mix? NVMe drives for the nodes. Swap drive?
- Dual ~3GHz, ~12-core CPUs
- 512GB or more RAM
- NVMe for scratch and swap, 6TB or more
- GPUs: Do we get GPUs? Do we get 1U nodes with room for 1 or 2 Tesla T4 GPUs, or 2U nodes with room for 1 or 2 full-size GPUs?
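A minimal mdadm sketch of the head-node disk layout proposed above (RAID1 for the OS, RAID6 with a spare for the shared data); device names are placeholders, and ZFS, Btrfs, or a hardware RAID card would replace this entirely.

    # Sketch only -- device names are placeholders.
    # OS/OS-image array: 2-disk RAID1 plus a spare (the 3 disks above)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 \
          /dev/sda /dev/sdb /dev/sdc
    # Data array: 8-disk RAID6 plus a spare (the 9 disks above)
    mdadm --create /dev/md1 --level=6 --raid-devices=8 --spare-devices=1 \
          /dev/sd[d-l]
    # /dev/md1 then gets split into the NRAO and remote-institution halves
    # described above.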
...