To Do

Talk with SSA and VLASS about how we actually use these remote clusters (e.g. staging, flocking, etc.)
Buy test system
Talk with Puerto Rico about their technical situation
Think about getting drives from third party instead of Dell and sparing them ourselves.
Think about getting the same drives for the OS that we get for the data array. It's simpler but perhaps slower.
Who is going to put all the RAM and drives in the servers?
Work on a price spreadsheet
- https://docs.google.com/spreadsheets/d/1oqsg_ZD4OPOibY6EIkvAWD8HKgNdrey8kcYXaGgd1dY/edit#gid=0
Learn Ansible?

Timeline

Buy test system as soon as practical
Buy production system about 6 months after test system
Receive production system about 7 months after test system
Install production system about 10 months after test system
Running about 12 months after test system

Data Path

This is conceptual at this point. Need to talk with SSA and VLASS about this.

We pre-stage data on the remote head node
We then either submit a job locally and it flocks to the remote site, or we log in to the remote site and submit from there.
- Can we use a nifty filesystem to simplify this (Ceph or that LHC fs)?
- This might be a good phase2 problem to solve.
- Is this kinda what nraorsync does?
The remote execute hosts transfer data from the remote head node
The job uploads resulting data to the remote head node
We retrieve data from the remote head node

Using

Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
- Staging and submitting remotely?
- Flocking?
- ClassAd options? I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice-versa. Could CCB help here?
- Other?
Remote HTCondor concerns
- Do we want our jobs to run as an NRAO user like vlapipe, or nobody?
- Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
Need to support 50% workload for NRAO and 50% workload for remote institution. How?
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
Share disk space on head node 50% NRAO and 50% remote institution. It might be nice to use something like quotas or LVM so that we can change the disk usage dynamically for times when either NRAO or local needs more space.
- Two partitions: one for NRAO and one for remote institution?
- One partition with quotas? https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/xfsquota
- LVM?
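
One way to explore the Group Accounting option for the 50/50 workload is to write down what the negotiator settings might look like. The sketch below only renders candidate configuration text. The group names `group_nrao` and `group_remote` are hypothetical, and letting a group borrow the other's idle share (`GROUP_ACCEPT_SURPLUS`) is an open choice, not a decision recorded in these notes.

```python
def render_group_quota_config(shares, accept_surplus=True):
    """Render HTCondor negotiator settings for fractional (dynamic) group quotas.

    shares maps a group name to its fraction of the pool; fractions must sum to 1.
    The group names are placeholders chosen for illustration.
    """
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total}")
    lines = ["GROUP_NAMES = " + ", ".join(shares)]
    for group, fraction in shares.items():
        lines.append(f"GROUP_QUOTA_DYNAMIC_{group} = {fraction}")
    # When True, a group may use slots the other group is leaving idle.
    lines.append(f"GROUP_ACCEPT_SURPLUS = {accept_surplus}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_group_quota_config({"group_nrao": 0.5, "group_remote": 0.5}))
```

Jobs would then need to name their group in the submit description (for example `accounting_group = group_nrao`), which is something to confirm with SSA and VLASS.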

Documentation

A projectbook like we did for USNO could be appropriate
Process diagrams (how systems boot, how jobs get started from NRAO and run, how remote institutions start jobs, etc)

Networking

HTCondor flocking requires

From local schedd to remote collector on condor port 9618
From remote negotiator and execute hosts to local schedd. Here the execute hosts can be NATed.
From local shadow to remote starter. Use CCB. It allows execute hosts to live behind a firewall and be NATed.
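
The three connections above map onto a small number of HTCondor settings. As a sketch only, the helper below renders them for each side. The host names `submit.example.edu` and `head.remote.example.edu` are placeholders, not real hosts, and the security (`ALLOW_*`) settings a real pool would also need are left out.

```python
def render_flocking_config(submit_host, remote_head):
    """Return candidate config snippets for each side of a flocking setup.

    Both host names are caller-supplied placeholders.
    """
    return {
        # NRAO submit host: the schedd flocks to the remote pool's collector.
        "submit_host": f"FLOCK_TO = {remote_head}",
        # Remote head node (central manager): accept jobs flocked from NRAO.
        "remote_head": f"FLOCK_FROM = {submit_host}",
        # Remote execute hosts behind NAT: register with CCB on their collector
        # so the shadow can reach the starter through a reversed connection.
        "remote_execute": "CCB_ADDRESS = $(COLLECTOR_HOST)",
    }


if __name__ == "__main__":
    cfg = render_flocking_config("submit.example.edu", "head.remote.example.edu")
    for role, text in cfg.items():
        print(f"# {role}\n{text}\n")
```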

Non-flocking just requires ssh access from probably mcilroy, and to gibson.

NRAO side

NRAO -> remote head node on port 22 (ssh)
Submit Host -> remote head node (condor_collector) on port 9618 (HTCondor) for flocking
Submit Host <- remote head node (condor_negotiator) on port 9618 (HTCondor) for flocking
- mcilroy has external IPs (146.88.1.66 for 1Gb/s and 146.88.10.66 for 10Gb/s). Is the container listening?
Submit Host <- remote execute hosts (condor_starter) on port 9618 (HTCondor) for flocking
Submit Host (condor_shadow) -> remote execute hosts (condor_starter) on port 9618 (HTCondor) for flocking. CCB might alleviate this.

Remote side

Head node <- from nrao.edu on port 22 (ssh)
Head node -> revere.aoc.nrao.edu on port 25 (smtp)
Head node -> NRAO Submit Host on port 9618 (HTCondor) for flocking
Head node <- NRAO Submit Host on port 9618 (HTCondor) for flocking
Execute node -> NRAO Submit Host on port 9618 (HTCondor) for flocking. Execute host may be NATed.
Execute node -> gibson.aoc.nrao.edu on port 22 (ssh) for flocking with nraorsync. Execute host can be NATed.
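
Before asking a site to open firewall holes it may help to probe which of the paths listed above already work. The sketch below is a plain TCP connect test. The host name in the example is a placeholder, and a successful connect only shows that the port is reachable from the machine running the script, not that HTCondor itself is configured correctly.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_paths(paths, timeout=3.0):
    """Probe each (host, port) pair and return a dict of results."""
    return {(host, port): port_is_open(host, port, timeout) for host, port in paths}


if __name__ == "__main__":
    # Placeholder host; ports 22 (ssh) and 9618 (HTCondor) come from the lists above.
    paths = [("head.remote.example.edu", 22), ("head.remote.example.edu", 9618)]
    for (host, port), ok in check_paths(paths).items():
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Each direction has to be tested from its own end, so the same script would be run once from the NRAO submit host and once from the remote head node or an execute node.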

Services

DNS
- What DNS domain will these hosts be in? nrao.edu? remote-institution.site? other?
- Will this vary depending on site?
- 2022-10-26 krowe: it is looking like the institution will own the equipment. Either they buy it with their own money like UPR-M or AUI gives them a grant and they buy it. Either way, they own it. So, I think we can expect the hosts to be in their DNS domain, which is probably for the best. We can make CNAMEs for each head node if needed.
- So what IP range should we use? That may depend on the site as each site may use non-routable IP ranges differently.
DHCP
SMTP
NTP or chrony
- What timezone should these be in? I think the choices are
  - Mountain - Perhaps the most convenient for NRAO users and consistent between the sites.
  - Local - Makes the most sense to the local users but means differences between the sites.
  - UTC - equally annoying for all.
NFS
Directory Server
- NIS? Probably not. RHEL9 will not support NIS.
- OpenLDAP
- 389 Directory Server? (previously Fedora Directory Server)
- Identity Management
- FreeIPA
- How do we handle accounts? I think we will want accounts on at least the head node. The execution nodes could run everything as nobody or as real users. If we want real users on the execute hosts then we should use a directory service which should probably be LDAP. No sense in teaching folks how to use NIS anymore.
  - remote institution accounts only?
- 2022-10-26 krowe: RHEL8 and later don't come with OpenLDAP anymore. Red Hat wants you to use either their 389DS or IDM or RHDS or some other thing that gets them money. It's all very confusing
ssh
rsync (nraorsync_plugin.py)?
NAT so the nodes can download/upload data?
TFTP (for OSes and switch)
condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor
nagios
ganglia
- Ganglia hasn't been updated since 2015 so perhaps it is time to look for something else.
- Prometheus/Grafana
- Zabbix

Operating System

Can we use Red Hat with our current license?
Must support CASA
Will need a patching/updating mechanism
How to boot diskless OS images
- I am not finding any new sexy software packages to automate PXE+DHCP+TFTP+NFS, so we will keep doing it the way we have been for years https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_storage_devices/setting-up-a-remote-diskless-system_managing-storage-devices
- One OS image for both our use and remote institution use, or multiple OS images?
- Use containers (docker, singularity/apptainer, kubernetes, mesos, etc)
- Or they could dual boot?
- Or kubernetes?
- Ask Greg at CHTC what they use
  - They use disked OSes and puppet to maintain it
What Linux distribution to use?
- Can we use Red Hat with our current license? I have looked in JDE and I can't find a recent subscription. Need to ask David.
  - We have a 1,000 FTE license with up to 20,000 installations allowed. But since we are either selling the equipment to the institution or asking them to buy it themselves, at least with UPRM, I don't see how we can use our RHEL license legally. Asking each institution to acquire an RHEL license sounds like a recipe for failure so I think an open source OS is the answer.
- Should we buy Red Hat licenses like we did for USNO?
  - USNO is between $10K and $15K per year for 81 licensed nodes. This may not be an EDU license.
  - NRAO used to have a 1,000 host license for Red Hat but I don't know what they have now.
  - I don't want to maintain licenses for up to 10 different installs. I don't think the institutions will want to purchase and maintain a license.
- Do we even want to use Red Hat?
- Alternatives would be Rocky Linux or AlmaLinux or CentOS Stream
- Some sites will own their equipment like UPR-M. At most sites the equipment will probably be owned by NRAO.
What version do we use RHEL7 or RHEL8 or RHEL9? Remember CASA needs to support it.
What OSes is CASA verified against? I am pretty sure RHEL but what about CentOS or Rocky or Alma, etc.?
- https://casadocs.readthedocs.io/en/stable/notebooks/introduction.html#Compatibility
What version of CASA does VLASS need?
The cost of RHEL is pretty small compared to the hardware.
UPRM has their own money but the other institutions will either get a grant or money from us so we can say what OS they use and pay for.
Should pull in Matthew or Schlake on this decision.
Should we use Ansible for deployments?
2022-10-21 krowe: jkern talked to business office and they prefer that AUI gives the money to Morgan State and Morgan State buys the equipment. So Morgan State will own it. NRAO can stipulate you will only get the money if you buy what we recommend.
2022-10-24 krowe: CASA is verified against RHEL8. CentOS Stream 8 is a constantly moving target. There is no CentOS Stream 8.1 or 8.2; it is always the cutting edge of RHEL. I don't like that. I would much rather have version numbers that I can compare to RHEL. So that is a vote against CentOS Stream and for Rocky or Alma. I just checked Scientific Linux (maintained by Fermilab and CERN) and they are moving to CentOS Stream 8. So there will not be a Scientific Linux 8.

Third party software for VLASS

CASA what version?
HTCondor
Will need a way to maintain the software
- stow, rpm, modules, containers?

Third party software for remote institution

Will need a way to maintain software for the local remote institution site

Services

Will need a way to maintain the software
- stow, rpm, modules, containers?
DNS
DHCP
SMTP
NTP
NFS?
LDAP? How do we handle accounts?
ssh
rsync (nraorsync_plugin.py)
NAT so the nodes can download/upload data
TFTP
condor (port 9618)

Management Access

PDU
UPS
BMC/IPMI
switch

Maintenance

replace disk (local remote institution admin)
replace/reseat DIMM (local remote institution admin)
replace power supply (local remote institution admin)
NRAO may handle replacement hardware. Drop ship. Spare ourselves?
Patching OS images (NRAO)
Patching third party software like CASA and HTCondor (NRAO)
Altering OS images (NRAO)

Hardware

Cabinet Rack: Doors front and rear locking with mesh. Width: 19". Height: 42U is most common. Depth: 42" or 48"? Rack must support at least 2,000 lbs static load.
- Great Lakes GL840ES-2442-B-MS, $3,900 This is what we usually get https://greatcabinets.com/product/es-ms/
- APC NetShelter SX AR3100, $1,875 https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Black-1991H-x-600W-x-1070D-mm/P-AR3100
- APC NetShelter SX AR3100SP, designed for re-shipping after equipped, $2,550 https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Shock-Packaging-2000-lbs-Black-1991H-x-600W-x-1070D-mm/P-AR3100SP
PDU: one PDU or two PDUs? How many power strips? What plug? What voltage? This may vary across sites. What if the site has two power sources?
- 208V = APC APDU9965, 8.6kW, zeroU, Input is NEMA L21-30P 3phase, Outputs are 21x C13/C15 and 3x C19/C21, $2,100 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-8-6kW-208V-21-C13-and-C15-3-C19-and-C21-sockets/P-APDU9965
  - Need 2. For 30 nodes, assuming 400W each we will need 2 of these power strips (30 * 400W = 12,000W)
- 208V = APC APDU9967, 17kW, zeroU, Input is IEC60309 60A 3P+PE 3Phase, Outputs are 42x C13/C15 and 6x C19/C21, $3,775 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-17-3kW-208V-42-C13-and-C15-6-C19-and-C21-sockets/P-APDU9967
  - Need 1. For 30 nodes, assuming 400W each we will need only 1 of these power strips (30 * 400W = 12,000W)
- 120V = APC AP8932, 2.8kW, zeroU, Input is NEMA L5-30P, Outputs are 24x NEMA 5-20R, $1,450 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-2G-switched-0U-30A-100V-to-120V-24-NEMA-5-20R-sockets/P-AP8932
  - Need 5. For 30 nodes, assuming 400W each we will need 5 of these power strips (30 * 400W = 12,000W)
- Stagger startups on PDU
UPS: for just the head node and switch? This may depend on the voltage of the PDUs.
- 208V = APC Smart-UPS X SMX2200R2HVNC, 2200VA, 2U, Input is IEC-C20, Outputs are 8x C13 and 1x C19, comes with network card, $3,125 https://www.apc.com/shop/us/en/products/APC-Smart-UPS-X-Line-Interactive-2200VA-Rack-tower-2U-208V-230V-8x-C13-1x-C19-IEC-Network-card-Extended-runtime-Rail-kit-included/P-SMX2200R2HVNC
- 120V = APC Smart-UPS SMT1500RM2UC, 1500VA, 2U, Input is NEMA 5-15P, Outputs are 6x NEMA 5-15R, $1,050 https://www.apc.com/shop/us/en/categories/power/uninterruptible-power-supply-ups-/network-and-server/smart-ups/N-1h89ykeZ11ai1p3
  - Optional UPS Network Card 3 AP9640 $420 https://www.apc.com/shop/us/en/products/UPS-Network-Management-Card-3/P-AP9640
Switch: Cisco Catalyst C9300X-48TX
- 48 Data, 48x 10G Multigigabit
- 100M, 1G, 2.5G, 5G, or 10 Gbps
- Switching capacity 2,000 Gbps
- Nine optional Modular Uplinks 100G/40G/25G/10G/1G
- Redundant Power Supply 715 W
- $12K
Switch: 10Gb/s.
Environmental Monitoring: Could an add-on to the APC PDU do this?
- APC AP9335TH, Temperature and Humidity Sensor, length is 3.9m, $190 https://www.apc.com/shop/us/en/products/APC-Temperature-Humidity-Sensor/P-AP9335TH
KVM: rackmount, not remote, and patch cables
- StarTech RKCONS1901, Rackmount KVM console, 1U, $990 https://www.cdw.com/product/startech.com-rackmount-kvm-console-1u-19-lcd-vga-kvm-drawer-w-cables-usb/5103418?pfm=srh
Ethernet cables:
Power cables: single or Y cables depending on the number and types of PDUs and the number of power sources.
Head Node: lots of disk and RAID
- iDRAC: Ask CIS what they recommend
- Memory: at least 32GB of RAM to help cache the OS image. 64GB would be even better.
- Storage: mdRAID, ZFS, Btrfs, RAID card? Do we want both boot and data arrays to be the same type? Do the locals have access to this disk space? Maybe not.
  - OS/OSimages, RAID1 with or without spare? (3 disks), about 1TB
    - OS: We have been making 40GB partitions for / for over a decade and that looks to still work with RHEL8.
    - Swap: 0 or 8GB at most
    - /export/home/<hostname>: services and diskless_images
  - Working data/software (remote institution and nrao), RAID6 w/spare or RAID7 (9 disks), about 72TB
    - An SE imaging input data size is about 10GB per job
    - We need maybe 20TB+ of total space or more so maybe 60TB/2
    - Carve into two partitions (NRAO data and NRAO software, remote institution data and remote institution software) each partition has data and software directories.
- Networking: May need more than one port. One for internal networking to nodes and one for external Internet access.
30 1U nodes or 15 2U nodes or mix? NVMe drives for nodes. Swap drive?
- Dual ~3GHz, ~12Core CPUs
- 512GB or more RAM
- NVMe for scratch and swap 6TB or more
- Draws about 400W max
GPUs: Do we get GPUs? Do we get 1U nodes with room for 1 or 2 Tesla T4 GPUs or 2U nodes with room for 1 or 2 regular GPU?
- Dell R650, dual Xeon Gold (Ice Lake) 6334 3.6GHz CPUs, 768GB RAM, 6TB - 8TB NVMe
  - https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Gold+6334+%40+3.60GHz&id=4488
- Dell R650, dual Xeon Silver (Ice Lake) 4309Y 2.8GHz CPUs, 768GB RAM, 6TB - 8TB NVMe
  - https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Silver+4309Y+%40+2.80GHz&id=4462
- PassMark shows the Silver to be about 85% to 95% of the Gold but the Gold is about 2.5 times the price of the Silver. For the price of 19 Gold nodes we could get 26 Silver nodes. So we could run about 37% more jobs with Silvers (26/19) but they would take maybe 15% longer to run. That looks like a win to me (assuming 15 jobs per node). Perhaps for the test system we could purchase one of each type of CPU?
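
A quick check of the Gold versus Silver trade-off, using the 19 versus 26 node counts and the 15% slowdown quoted above. It assumes the same number of jobs per node on both CPU types and that the slowdown applies uniformly to every job, neither of which has been measured yet.

```python
def relative_throughput(nodes_a, nodes_b, slowdown_b):
    """Throughput of option B relative to option A.

    slowdown_b is the fractional extra runtime per job on B (0.15 = 15% longer).
    Assumes equal job slots per node for both options.
    """
    node_ratio = nodes_b / nodes_a
    return node_ratio / (1.0 + slowdown_b)


if __name__ == "__main__":
    gold_nodes, silver_nodes = 19, 26
    extra_slots = silver_nodes / gold_nodes - 1.0
    net = relative_throughput(gold_nodes, silver_nodes, 0.15)
    print(f"extra job slots with Silver: {extra_slots:.0%}")
    print(f"net throughput vs Gold: {net:.2f}x")
```

By this arithmetic the Silver option gives roughly 37% more job slots and a bit under 1.2 times the overall throughput for the same money.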

...

Cooling: Assuming a 30-node cluster, we will need about 12kW or 3.5 tons or 42,000 BTU/h of cooling.
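
The PDU counts and the cooling figure both follow from the same 30 nodes at 400W assumption, so they can be recomputed together when the node count or wattage changes. This sketch ignores the head node, switch, and any derating of PDU capacity for continuous load, so treat its results as lower bounds.

```python
import math

WATTS_PER_BTU_PER_HOUR = 0.29307107  # 1 BTU/h expressed in watts
BTU_PER_HOUR_PER_TON = 12000         # 1 ton of refrigeration


def pdus_needed(nodes, watts_per_node, pdu_capacity_watts):
    """Smallest number of PDUs whose combined capacity covers the node load."""
    return math.ceil(nodes * watts_per_node / pdu_capacity_watts)


def cooling_load(watts):
    """Convert an electrical load in watts to (BTU/h, tons of cooling)."""
    btu_per_hour = watts / WATTS_PER_BTU_PER_HOUR
    return btu_per_hour, btu_per_hour / BTU_PER_HOUR_PER_TON


if __name__ == "__main__":
    nodes, watts = 30, 400
    # Capacities as listed in the PDU options above.
    for model, capacity in [("APDU9965", 8600), ("APDU9967", 17000), ("AP8932", 2800)]:
        print(model, pdus_needed(nodes, watts, capacity))
    btu, tons = cooling_load(nodes * watts)
    print(f"{btu:,.0f} BTU/h, {tons:.2f} tons")
```

The exact conversion of 12kW comes out slightly under the rounded 3.5 tons and 42,000 BTU/h, so those figures already carry a small margin.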

Shipping

Drop ship everything to the site and assemble on site. This will require an NRAO person on site to assemble with a pre-built OS disk for the head node. I think this is too much work to do on site.
- Install DIMMs in nodes
- Install NVMe drives in nodes
- Rack everything
- Cable everything
- Configure switch
- Install/Configure OS
Ship everything here and assemble then ship a rack-on-pallet.
Mix the two. Ship minimal stuff here (head node, switch, couple of compute nodes, etc) and configure and drop ship most of the nodes to the site.
- Re-ship head node, switch, compute nodes to site
- Re-ship memory and drives to site
A person from the remote site could travel to NM or CV to see the test system and get instruction.

Other

Keep each pod as similar to the other pods as possible.
Test system at NRAO should be one of everything.
Since we are making our own little OSG, should we try to leverage OSG for this or not? Or do we want to make each POD a pool and flock?

How do we handle the 50% workload?

Could have 15 nodes for us and 15 nodes for them

Should we try to buy as much as we can from one vendor like Dell to simplify things?
APC sells a packaged rack on a pallet ready for shipping. We could fill this with gear and ship it. Not sure if that is a good idea or not. We will not be able to move the unit into the server room while still on the pallet because no doorway is tall enough. We would have to roll it off the pallet (it comes with a ramp and the rack is on casters) move it into the server room, fill and configure it, roll it out of the server room, roll it back onto the pallet, probably remove the bottom server(s) so we can attach it to the pallet, then re-add the bottom server(s).

We could use the double glass doors for this but there is a lip on the transition. We could use the doors in the PRA closet as it has no lip but would require a lot of moving of shelves and stuff.
APC NetShelter SX packaged:
- On Pallet: Height 85.79in (2179mm) Width 43.5in (1105mm)
- On Casters: Height 78.39in (1991mm) Width 23.62in (600mm)
NRAO Dimensions
- Double Glass doors: Height: 80in (2032mm) (because of the 2in maglock)
- NRAO-NM wide server doors: Height: 83in (2133mm) Width: 48in (1187mm)
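
The same comparison will have to be repeated for every site's doorways, so a small helper may be handy. This is a sketch that checks upright clearance only, using the inch values listed above. It allows no margin for lips, ramps, or tilting, and it skips the width check when a width is not known (no width is listed for the glass doors).

```python
def fits_through(item_height_in, door_height_in, item_width_in=None, door_width_in=None):
    """Return True if the item passes upright through the opening.

    Heights are always compared; widths only when both are provided.
    """
    if item_height_in >= door_height_in:
        return False
    if item_width_in is not None and door_width_in is not None:
        return item_width_in < door_width_in
    return True


if __name__ == "__main__":
    racks = {"on pallet": (85.79, 43.5), "on casters": (78.39, 23.62)}
    doors = {"double glass doors": (80.0, None), "wide server doors": (83.0, 48.0)}
    for rack, (rack_h, rack_w) in racks.items():
        for door, (door_h, door_w) in doors.items():
            ok = fits_through(rack_h, door_h, rack_w, door_w)
            print(f"rack {rack} through {door}: {'fits' if ok else 'does not fit'}")
```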
I could start prototyping now using AWS.
If jobs are submitted from the remote head node, does that mean SSA will want a container on that remote head node?

Site Questions

...

Voltage in server room (110V, 120V, 208V, or 240V)
Receptacles in server room (L5-30R or L21-30R or ...)
Single or dual power feeds?
Is power from below or from above?
How stable is their power?
Is there a UPS?
Is there a generator?
Door width and height and path to server room.
Can a rack-on-pallet fit upright? Height: 85.79inches (2179mm) Width: 43.5inches (1105mm)
Can a rack-on-casters fit upright? Height: 78.39inches (1991mm) Width: 23.62inches (600mm)
NRAO-NM wide server door Height: 84inches (2108mm) Width: 46.75inches (1219mm)
Firewalls
How are you going to use this?
Do you care if this is in your DNS zone or ours?
Is NAT available for the execute hosts?

Resources

USNO correlator (Mark Wainright)
VLBA Control Computers (William Colburn)
Red Hat maintenance (William Colburn)
Virtual kickstart (William Colburn)
Switch models and ethernet (Jeff Long)
HTCondor best practices (Greg Thain)
OSG (Lauren Michael)
SDSC at UCSD
TACC at UT Austin
- https://www.tacc.utexas.edu/about/directory
IDIA https://www.idia.ac.za/
