...
- Talk with SSA and VLASS about how we actually use these remote clusters.
- Staging, flocking, etc
- Done: Find a spot at DSOC for test system. 253T if we buy a rack.
- Get account number and routing number.
- Buy test system
- Talk with Puerto Rico about their technical situation
- Ask David if there are enough Red Hat licenses for RADIAL or if we will need to get more licenses
Timeline
- Buy test system as soon as practical (assuming the project is still a go)
- Does Jeff Kern know if this is a go or not?
- Talk to Matthew about where to put this stuff
- May 3, 2022 krowe: talked to Matthew and Peter. RADIAL has space 253T reserved.
- Buy production system by July
- Receive production system by Aug
- Install production system by Dec
- Running in Jan. 2023
...
- Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
- Staging and submitting remotely?
- Flocking? What are the networking requirements?
- ClassAd options? I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa. Could CCB help here?
- Other?
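As a rough sketch of the flocking and ClassAd options above, the snippet below generates the HTCondor settings each would involve. `FLOCK_TO`, `FLOCK_FROM`, `STARTD_ATTRS`, and `requirements` are real HTCondor names; the host names and the custom machine attribute `PodSite` are hypothetical placeholders, and authorization and networking settings are left out.

```python
def flocking_config(remote_central_managers, submit_hosts):
    """Return (submit_side, remote_side) config lines for flocking.

    submit_side goes on the NRAO submit host, remote_side on the remote
    pool's central manager. Host names are placeholders.
    """
    submit_side = "FLOCK_TO = " + ", ".join(remote_central_managers)
    remote_side = "FLOCK_FROM = " + ", ".join(submit_hosts)
    return submit_side, remote_side


def pod_startd_config(site):
    """Config lines that advertise a hypothetical PodSite attribute on execute hosts."""
    return [
        f'PodSite = "{site}"',
        "STARTD_ATTRS = $(STARTD_ATTRS) PodSite",
    ]


def pod_job_requirement(site):
    """Submit-file line that steers a job to hosts advertising the given PodSite."""
    return f'requirements = (TARGET.PodSite == "{site}")'


if __name__ == "__main__":
    to_line, from_line = flocking_config(
        ["cm.pod1.example.org"], ["submit.nrao.example.org"]
    )
    print(to_line)
    print(from_line)
    print("\n".join(pod_startd_config("pod1")))
    print(pod_job_requirement("pod1"))
```

Flocking would cover the overflow case and the requirements expression would cover the specific-job-type case, so the two could be combined if we want both.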
- Remote HTCondor concerns
- Do we want our jobs to run as an NRAO user like vlapipe, or nobody?
- Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
- Need to support 50% workload for NRAO and 50% workload for remote institution. How?
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
- Share disk space on head node 50% NRAO and 50% remote institution
- Two partitions: one for NRAO and one for remote institution?
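To illustrate how the 50/50 workload split differs from a hard 15/15 node split, here is a toy model of quota-with-surplus sharing. It is not HTCondor's negotiator algorithm; it only shows the behavior we would hope to get from something like group quotas with surplus sharing enabled (in HTCondor these are the `GROUP_QUOTA_DYNAMIC_<group>` and `GROUP_ACCEPT_SURPLUS` settings). The group names and the 30-slot total are assumptions for the example.

```python
def split_slots(total_slots, demand, quota):
    """Toy quota-with-surplus split.

    demand: dict of group -> number of jobs wanting a slot
    quota:  dict of group -> fraction of the pool guaranteed to that group
    """
    # First pass: each group gets up to its guaranteed share.
    alloc = {g: min(demand[g], int(total_slots * quota[g])) for g in demand}
    # Second pass: slots nobody claimed go to groups that still have demand.
    spare = total_slots - sum(alloc.values())
    for g in demand:
        extra = min(spare, demand[g] - alloc[g])
        alloc[g] += extra
        spare -= extra
    return alloc


if __name__ == "__main__":
    quota = {"nrao": 0.5, "remote": 0.5}
    # Both sides busy: each gets its half.
    print(split_slots(30, {"nrao": 100, "remote": 100}, quota))
    # Remote side mostly idle: NRAO borrows the unused slots.
    print(split_slots(30, {"nrao": 100, "remote": 4}, quota))
```

With a fixed 15/15 node split the second case would leave 11 slots idle, which is the main argument for a quota or priority based approach over static partitioning.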
...
- Cabinet Rack: Doors front and rear locking with mesh. Width: 19". Height: 42U is most common. Depth: 42" is most common. Rack must support at least 2,000 lbs static load
- https://greatcabinets.com/product/es-ms/ This is what we usually get
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Black-1991H-x-600W-x-1070D-mm/P-AR3100 APC NetShelter SX
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Shock-Packaging-2000-lbs-Black-1991H-x-600W-x-1070D-mm/P-AR3100SP APC NetShelter SX designed for re-shipping after equipped.
- PDU: one PDU or two PDUs? What plug? What voltage? This may vary across sites. What if the site has two power sources?
- 120V = APC AP8932, 2.8kW, zeroU, Input is NEMA L5-30P, Outputs are 24x NEMA 5-20R, $1,450 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-2G-switched-0U-30A-100V-to-120V-24-NEMA-5-20R-sockets/P-AP8932
- 208V = APC APDU9965, 8.6kW, zeroU, Input is NEMA L21-30P 3phase, Outputs are 21x C13/C15 and 3x C19/C21, $2,100 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-8-6kW-208V-21-C13-and-C15-3-C19-and-C21-sockets/P-APDU9965
- 208V = APC APDU9967, 17kW, zeroU, Input is IEC60309 60A 3P+PE 3Phase, Outputs are 42x C13/C15 and 6x C19/C21 https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-17-3kW-208V-42-C13-and-C15-6-C19-and-C21-sockets/P-APDU9967
- Stagger startups on PDU
- UPS: for just the head node and switch? This may depend on the voltage of the PDUs.
- 120V = APC Smart-UPS SMT1500RM2UC, 1500VA, 2U, Input is NEMA 5-15P, Outputs are 6x NEMA 5-15R, $1,050 https://www.apc.com/shop/us/en/categories/power/uninterruptible-power-supply-ups-/network-and-server/smart-ups/N-1h89ykeZ11ai1p3
- Optional UPS Network Card 3 AP9640 $420 https://www.apc.com/shop/us/en/products/UPS-Network-Management-Card-3/P-AP9640
- 208V = APC Smart-UPS X, 2200VA, 2U, Input is IEC-C20, Outputs are 8x C13 and 1x C19, comes with network card, $3,125 https://www.apc.com/shop/us/en/products/APC-Smart-UPS-X-Line-Interactive-2200VA-Rack-tower-2U-208V-230V-8x-C13-1x-C19-IEC-Network-card-Extended-runtime-Rail-kit-included/P-SMX2200R2HVNC
- Switch: Cisco Catalyst C9300X-48TX
- 48 data ports, 48x 10G Multigigabit
- 100M, 1G, 2.5G, 5G, or 10 Gbps
- Switching capacity 2,000 Gbps
- Nine optional Modular Uplinks 100G/40G/25G/10G/1G
- Redundant Power Supply 715 W
- $12K
- Environmental Monitoring: Add-on to the APC PDU
- APC AP9335TH, Temperature and Humidity Sensor, length is 3.9m, $190 https://www.apc.com/shop/us/en/products/APC-Temperature-Humidity-Sensor/P-AP9335TH
- KVM: rackmount, not remote, and patch cables
- StarTech Rackmount KVM console, 1U, $990 https://www.cdw.com/product/startech.com-rackmount-kvm-console-1u-19-lcd-vga-kvm-drawer-w-cables-usb/5103418?pfm=srh
- Ethernet cables:
- Power cables: single or Y cables depending on number and types of PDUs and number of power sources.
- Head Node: lots of disk and RAID. Do the remote institutions have access to this disk space? Maybe not.
- iDRAC: Ask CIS what they recommend
- Memory: at least 32GB of RAM to help cache the OS image. 64GB would be even better.
- Storage: mdRAID, ZFS, Btrfs, or RAID card? Do we want both boot and data arrays to be the same type?
- OS/OSimages, RAID1 with or without spare? (3 disks), about 1TB
- OS: We have been making 40GB partitions for / for over a decade and that looks to still work with RHEL8.
- Swap: 0 or 8GB at most
- /export/home/<hostname>: services and diskless_images
- Working data/software (remote institution and nrao), RAID6 w/spare or RAID7 (9 disks), about 72TB
- An SE imaging input data size is about 10GB per job
- We need maybe 20TB+ of total space or more so maybe 60TB/2
- Carve into two partitions, one for NRAO and one for the remote institution. Each partition has data and software directories.
- Networking: May need more than one port. One for internal networking to nodes and one for external Internet access.
- 30 1U nodes or 15 2U nodes or mix? NVMe drives for nodes. Swap drive?
- Dual ~3GHz, ~12Core CPUs
- 512GB or more RAM
- NVMe for scratch and swap 6TB or more
- GPUs: Do we get GPUs? Do we get 1U nodes with room for 1 or 2 Tesla T4 GPUs or 2U nodes with room for 1 or 2 regular GPU?
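The head node array sizes above can be sanity-checked with simple arithmetic. This sketch assumes 12TB data disks and 1TB OS disks; neither size is stated in these notes, they are chosen only because they reproduce the "about 72TB" and "about 1TB" figures. Filesystem overhead is ignored and decimal units are used.

```python
def usable_tb(disks, disk_tb, redundancy, spares=0):
    """Usable capacity of one array in TB.

    redundancy: disks' worth of capacity lost to parity or mirroring
                (RAID6 = 2, a two-disk RAID1 mirror = 1)
    spares:     hot spares that hold no data
    """
    data_disks = disks - spares - redundancy
    if data_disks < 1:
        raise ValueError("not enough disks for this layout")
    return data_disks * disk_tb


if __name__ == "__main__":
    # Data array: 9 disks, RAID6 with one hot spare, assumed 12TB each.
    data_tb = usable_tb(9, 12, redundancy=2, spares=1)
    # OS array: 3 disks, RAID1 mirror with one hot spare, assumed 1TB each.
    os_tb = usable_tb(3, 1, redundancy=1, spares=1)
    # Split the data array evenly between NRAO and the remote institution.
    half_tb = data_tb / 2
    # Number of ~10GB SE imaging inputs that fit in one half.
    jobs_per_half = int(half_tb * 1000 // 10)
    print(data_tb, os_tb, half_tb, jobs_per_half)
```

Under these assumptions each half of the data array holds the input for a few thousand SE imaging jobs, so the 72TB figure leaves plenty of headroom over the 20TB+ estimate.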
...
- Drop ship everything to the site and assemble on site. This will require an NRAO person on site to assemble with a pre-built OS disk for the head node. I think this is too much work to do on site.
- Install DIMMs in nodes
- Install NVMe drives in nodes
- Rack everything
- Cable everything
- Configure switch
- Install/Configure OS
- Ship everything here and assemble then ship a rack-on-pallet.
- Mix the two. Ship minimal stuff here (head node, switch, couple of compute nodes, etc) and configure and drop ship most of the nodes to the site.
- Re-ship head node, switch, compute nodes to site
- Re-ship memory and drives to site
- A person from the remote site could travel to NM or CV to see the test system and get instruction.
Other
- Keep each rack pod as similar to the other rack pods as possible.
- Test system at NRAO should be one of everything.
- Since we are making our own little OSG, should we try to leverage OSG for this or not? Or do we want to make each POD a pool and flock?
- Should we try to buy as much as we can from one vendor like Dell to simplify things?
- APC sells a packaged rack on a pallet ready for shipping. We could fill this with gear and ship it. Not sure if that is a good idea or not. We will not be able to move the unit into the server room while still on the pallet because no doorway is tall enough. We would have to roll it off the pallet (it comes with a ramp and the rack is on casters), move it into the server room, fill and configure it, roll it out of the server room, roll it back onto the pallet, probably remove the bottom server(s) so we can attach it to the pallet, then re-add the bottom server(s). We could use the double glass doors for this but there is a lip on the transition. We could use the doors in the PRA closet as they have no lip, but that would require moving a lot of shelves and other stuff.
- APC NetShelter SX packaged:
- On Pallet: Height 85.79in (2179mm) Width 43.5in (1105mm)
- On Casters: Height 78.39in (1991mm) Width 23.62in (600mm)
- NRAO Dimensions
- Double Glass doors: Height: 80in (2032mm) (because of the 2in maglock)
- NRAO-NM wide server doors: Height: 83in (2133mm) Width: 48in (1187mm)
- I could start prototyping now using AWS.
- Do we want jobs to flock or do we want to submit jobs on the remote host and have pre-transferred data? Involve SSA and VLASS in this question.
- If jobs are submitted from the remote head node, does that mean SSA will want a container on that remote head node?
Site Questions
- Voltage in server room (120V or 208V or 240V)
- Receptacles in server room (L5-30R or L21-30R or ...)
- Single or dual power feeds?
- Is power from below or from above?
- Door width and height and path to server room.
- Can a rack-on-pallet fit upright? Height: 85.79 inches (2179mm) Width: 43.5 inches (1105mm)
- Can a rack-on-casters fit upright? Height: 78.39 inches (1991mm) Width: 23.62 inches (600mm)
- NRAO-NM wide server door Height: 84inches (2108mm) Width: 46.75inches (1219mm)
- Firewalls
- How are you going to use this?
- Do you care if this is in your DNS zone or ours?
- Is NAT available for the execute hosts?
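The upright-fit questions above reduce to comparing the rack dimensions against each doorway on the path to the server room. A minimal sketch follows, using the APC NetShelter SX dimensions from these notes and the 2032mm clear height of our double glass doors as the example doorway. The function name is made up for illustration, and a real check would also need to allow for ramps, lips, and turning clearance.

```python
def fits_upright(item_h_mm, item_w_mm, door_h_mm, door_w_mm=None):
    """True if the item can pass through the doorway standing upright.

    door_w_mm may be None when the width has not been measured yet,
    in which case only the height is checked.
    """
    if item_h_mm >= door_h_mm:
        return False
    return door_w_mm is None or item_w_mm < door_w_mm


if __name__ == "__main__":
    rack_on_pallet = (2179, 1105)   # height, width in mm
    rack_on_casters = (1991, 600)
    glass_door_h = 2032             # 80in clear height under the maglock
    print(fits_upright(*rack_on_pallet, glass_door_h))   # too tall
    print(fits_upright(*rack_on_casters, glass_door_h))  # clears the height
```

Each site's answers to the door questions could be run through the same check before deciding between shipping a rack-on-pallet and assembling on site.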
...