RADIAL Support

Operating System

Must support CASA
Will need a patching/updating mechanism
How to boot diskless OS images
- I am not finding any new sexy software packages to automate PXE+DHCP+TFTP+NFS
- One OS image for both our use and locals use or multiple OS images?
- Use containers (docker, singularity, kubernetes, mesos, etc)?
- Ask Greg at CHTC what they use
What Linux distrobution to use?
- Can we use Red Hat with our current license?
- Should we buy Red Hat licenses like we did for USNO?
- Do we even want to use Red Hat?
- Rocky Linux or AlmaLinux since CentOS is essentially dead?

Third party software for VLASS

CASA
HTCondor
Will need a way to maintain the software

Third party software for Local

Will need a way to maintain software for the local site

Services

DNS
DHCP
SMTP
NTP
NFS?
LDAP? How do we handle accounts?
ssh
rsync (nraorsync_plugin.py)
NAT so the nodes can download/upload data
TFTP
condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor

Management Access

PDU
UPS
BMC/IPMI
switch

Maintenance

replace disk (local admin)
replace/reseat DIMM (local admin)
replace power supply (local admin)
NRAO may handle replacement hardware. Drop ship. Spare ourselves?
Patching OS images (NRAO)
Patching third party software like CASA and HTCondor (NRAO)
Altering OS images (NRAO)

Hardware

Cabinet Rack: Doors front and rear locking with mesh. Width: 19". Height: 42U is most common. Depth: 42" is most common. Rack must support at least 2,000 lbs static load
- https://greatcabinets.com/product/es-ms/ This is what we usually get
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Black-1991H-x-600W-x-1070D-mm/P-AR3100 APC NetShelter SX
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Shock-Packaging-2000-lbs-Black-1991H-x-600W-x-1070D-mm/P-AR3100SP APC NetShelter SX designed for re-shipping after equipped.
PDU: one PDU or two PDUs? What plug? What voltage? This may very across sites. What if the site has two power sources?
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-2G-switched-0U-30A-100V-to-120V-24-NEMA-5-20R-sockets/P-AP8932 24 NEMA 5-20R, 120V, 2.8kW, NEMA L5-30P 1Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-8-6kW-208V-21-C13-and-C15-3-C19-and-C21-sockets/P-APDU9965 21 C13/C15 and 3 C19/C21, 208V, 8.6kW, NEMA L21-30P 3Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-17-3kW-208V-42-C13-and-C15-6-C19-and-C21-sockets/P-APDU9967 42 C13/C15 and 6 C19/C21, 208V, 17kW, IEC60309 60A 3P+PE 3Phase
- Stagger startups on PDU
UPS: for just the head node and switch? This may depend on the voltage of the PDUs.
Switch: 10Gb/s.
Environmental Monitoring: Could the PDU do this?
KVM: rackmount, not remote, and patch cables
Ethernet cables:
Power cables: single or Y calbes depending on number of PDUs and two power sources.
Head Node: lots of disk. Do the locals have access to this disk space? Maybe not.
- How many RAID arrays (OS, diskless images, data)?
- How many RAID volumes (OS, diskless images, nrao-data, local-data)
30 1U nodes or 15 2U nodes or mix? NVMe drives for nodes. Swap drive?
GPUs: Do we get GPUs? Do we get 1U nodes with room for 1 or 2 Tesla T4 GPUs or 2U nodes with room for 1 or 2 regular GPU?

Using

Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
- Flocking?
- Classad options?
- Other?
Remote HTCondor concerns
- Do we want our jobs to run a an NRAO user like vlapipe or nobody?
- Do we want local jobs to run as the local user, some dedicated user, or nobody?
Need to support 50% workload for NRAP and 50% workload for local. How?
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
Share disk space on head node 50% NRAO and 50% local
- Two volumes: one for NRAO and one for local?

Documentation

A projectbook like we did for USNO could be appropriate
Process diagrams (how systems boot, how jobs get started from NRAO and run, how locals start jobs, etc)

Other

Keep each rack as similar to the other racks as possible.
Test system at NRAO should be one of everything.

Since we are making our own little OSG, should we try to leverage OSG for this or not? Or do we want to make each POD a pool and flock?

Should we try to buy as much as we can from one vendor like Dell to simplify things?

APC sells a packaged rack on a pallet ready for shipping. We could fill this with gear and ship it. Not sure if that is a good idea or not.

Resources

USNO correlator (Mark Wainright)
VLBA Control Computers (William Colburn)
Red Hat maintenance (William Colburn)
Virtual kickstart (William Colburn)
HTCondor best practices (Greg Thain)
OSG (Lauren Michael)

Site Questions

Door width and height and path to server room. Can a rack-on-pallet fit? Can it fit upright on casters?
- NRAO-NM wide server door is 48"W x 84"H
Voltage in server room (110V or 208V or 240V)
Receptacles in server room (L5-30R or L21-30R or ...)
Single or dual power feeds
Firewalls

Space shortcuts

Page tree