...
- Cabinet Rack: Doors front and rear locking with mesh. Width: 19". Height: 42U is most common. Depth: 42" is most common. Rack must support at least 2,000 lbs static load
- https://greatcabinets.com/product/es-ms/ This is what we usually get
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Black-1991H-x-600W-x-1070D-mm/P-AR3100 APC NetShelter SX
- https://www.apc.com/shop/us/en/products/APC-NetShelter-SX-Server-Rack-Enclosure-42U-Shock-Packaging-2000-lbs-Black-1991H-x-600W-x-1070D-mm/P-AR3100SP APC NetShelter SX, designed to be re-shipped once equipped.
- PDU: one PDU or two PDUs? What plug? What voltage? This may vary across sites. What if the site has two power sources?
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-2G-switched-0U-30A-100V-to-120V-24-NEMA-5-20R-sockets/P-AP8932 24 NEMA 5-20R, 120V, 2.8kW, NEMA L5-30P 1Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-8-6kW-208V-21-C13-and-C15-3-C19-and-C21-sockets/P-APDU9965 21 C13/C15 and 3 C19/C21, 208V, 8.6kW, NEMA L21-30P 3Phase
- https://www.apc.com/shop/us/en/products/APC-Rack-PDU-9000-switched-0U-17-3kW-208V-42-C13-and-C15-6-C19-and-C21-sockets/P-APDU9967 42 C13/C15 and 6 C19/C21, 208V, 17kW, IEC60309 60A 3P+PE 3Phase
- Stagger startups on PDU
- UPS: for just the head node and switch? This may depend on the voltage of the PDUs.
- Switch: 10Gb/s.
- Environmental Monitoring: Could the PDU do this?
- KVM: rackmount, not remote, and patch cables
- Ethernet cables:
- Power cables: single or Y cables, depending on the number of PDUs and whether there are two power sources.
- Head Node: lots of disk. Do the locals have access to this disk space? Maybe not.
- iDRAC: Ask CIS what they recommend
- Memory: at least 32GB of RAM to help cache the OS image. 64GB would be even better.
- Storage: mdRAID, ZFS, or RAID card?
- OS/OSimages, RAID1 w/spare (3 disks), about 1TB
- OS: We have been making 40GB partitions for / for over a decade and that looks to still work with RHEL8.
- Swap: 0 or 8GB at most
- /export/home/<hostname>: services and diskless_images
- Working data/software (local and NRAO), RAID6 w/spare (9 disks), about 72TB
- An SE imaging input data size is about 10GB per job
- We need maybe 20TB+ of total space or more so maybe 60TB/2
- Carve into two partitions (NRAO data and NRAO software, Local data and Local software) each partition has data and software directories.
- Networking: May need more than one port. One for internal networking to nodes and one for external Internet access.
- 30 1U nodes or 15 2U nodes or mix? NVMe drives for nodes. Swap drive?
- NVMe for scratch and swap
- GPUs: Do we get GPUs? Do we get 1U nodes with room for 1 or 2 Tesla T4 GPUs, or 2U nodes with room for 1 or 2 regular GPUs?
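
The head node storage figures above (RAID1 w/spare on 3 disks for about 1TB, RAID6 w/spare on 9 disks for about 72TB) can be sanity-checked with a short calculation. This is a rough sketch only: the per-disk sizes (1TB and 12TB) are assumptions picked to match the "about" figures in the notes, the function name is made up for illustration, and filesystem overhead is ignored.

```python
def usable_tb(level: str, total_disks: int, spares: int, disk_tb: float) -> float:
    """Usable capacity in TB for a simple RAID set with hot spares."""
    active = total_disks - spares  # disks that actually hold data or parity
    if level == "raid1":
        if active < 2:
            raise ValueError("RAID1 needs at least 2 active disks")
        return disk_tb  # every active disk holds the same copy
    if level == "raid6":
        if active < 4:
            raise ValueError("RAID6 needs at least 4 active disks")
        return (active - 2) * disk_tb  # two disks' worth of capacity goes to parity
    raise ValueError(f"unsupported RAID level: {level}")


# Assumed disk sizes (not stated in the notes): 1TB for OS, 12TB for working data.
os_tb = usable_tb("raid1", total_disks=3, spares=1, disk_tb=1)
data_tb = usable_tb("raid6", total_disks=9, spares=1, disk_tb=12)

print(f"OS/OSimages: {os_tb} TB usable")
print(f"Working data/software: {data_tb} TB usable, {data_tb / 2} TB per partition")
```

If the working-data array is carved into two equal partitions (NRAO and local), each side gets half of the RAID6 figure; changing `disk_tb` shows how the split moves with different drive sizes.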
Networking
NRAO side
- Submit host needs to be able to establish a connection to the remote head node on port 9618 (HTCondor)
- Submit host needs to be able to listen for a connection from the remote head node on port 9618 (HTCondor)
Remote side
- Head node requires external access to nrao.edu on port 9618 (HTCondor)
- Execute nodes require external access to nrao.edu on port 9618. Can be NATed. (HTCondor)
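
The port 9618 requirements above can be checked from either side with a plain TCP connection test before any HTCondor configuration is done. A minimal sketch, assuming Python is available on the hosts; the function name and the example hostname are placeholders, and a successful connection only shows that the firewall path is open, not that HTCondor authentication will succeed.

```python
import socket

HTCONDOR_PORT = 9618  # port named in the requirements above


def can_connect(host: str, port: int = HTCONDOR_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Example (hostname is a placeholder, not a real machine):
#   can_connect("headnode.remote.example.org")
```

Run it from the NRAO submit host toward the remote head node, and from the remote head node and an execute node toward the NRAO side, to cover both directions listed above.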
Using
- Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
- Flocking? What are the networking requirements?
- ClassAd options? I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa. Could CCB help here?
- Other?
- Remote HTCondor concerns
- Do we want our jobs to run as an NRAO user like vlapipe or nobody?
- Do we want local jobs to run as the local user, some dedicated user, or nobody?
- Need to support 50% workload for NRAO and 50% workload for local. How?
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
- Share disk space on head node 50% NRAO and 50% local
- Two partitions: one for NRAO and one for local?
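
For the Group Accounting option above, one possible shape of a 50/50 split is a pair of accounting groups with dynamic quotas. The sketch below just renders such a config fragment; the group names `group_nrao` and `group_local` and the helper function are hypothetical, and whether surplus sharing should be on is an open choice, not a decision made in these notes. `GROUP_NAMES`, `GROUP_QUOTA_DYNAMIC_<name>`, and `GROUP_ACCEPT_SURPLUS` are HTCondor negotiator settings.

```python
def render_group_quotas(shares: dict[str, float], accept_surplus: bool = True) -> str:
    """Render an HTCondor config fragment giving each accounting group a dynamic quota."""
    if not shares:
        raise ValueError("at least one group is required")
    if any(f <= 0 for f in shares.values()) or sum(shares.values()) > 1.0 + 1e-9:
        raise ValueError("fractions must be positive and sum to at most 1.0")
    lines = ["GROUP_NAMES = " + ", ".join(shares)]
    for name, fraction in shares.items():
        # Dynamic quotas are fractions of the pool rather than fixed slot counts.
        lines.append(f"GROUP_QUOTA_DYNAMIC_{name} = {fraction}")
    # With surplus accepted, a group may use slots the other group leaves idle.
    lines.append(f"GROUP_ACCEPT_SURPLUS = {accept_surplus}")
    return "\n".join(lines)


# Hypothetical group names for the NRAO and local halves of the pool.
print(render_group_quotas({"group_nrao": 0.5, "group_local": 0.5}))
```

Jobs would then need to name their group in the submit description (for example `accounting_group = group_nrao`), which ties back to the question of which user NRAO and local jobs run as.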
...