Item | Who | Notes |
---|---|---|
HERA hardware | James | herastore01
- herastore01b Needs firmware
- herastore01c Needs firmware
- Done: herastore01d 127013 racked, disked, powered, SASed, firmwared, formatted, mounted.
herastore02 129289 herastore02 135532
- herastore02 racked, powered, OSed, storcli, cards moved. Needs /opt. CIS borrowing for NGAS firmware upgrades. working storcli. probably a new ticket.
- 02a racked, disked, firmwared, powered, SASed. Needs format, mount.
- Done: 02b racked, firmwared. Haven't purchased disks yet.
- Done: 02c racked, firmwared. Haven't purchased disks yet.
- Done: 02d racked, firmwared. Haven't purchased disks yet.
- Done: aoc253k-pdu-1 has critical alarms 132028. During the power outage they replaced the PDU with the spare.
- aocoss13 130466 racked, booted. Needs Lustre. Stolen to repair aocoss04. |
More HERA nodes | jrobnett, krowe | - Done: new herapost-master and make old herapost-master a compute node.
- Done: new IB card/cable for new herapost-master 132576
- Done: Buy an IB switch for HERA racks. $13,300 133166
- Connect switch to fabric. Requires some re-arranging of ports. 133166
- Cards/cables req: 182337, 182338. Install in new nodes.
- Boot three 2U nodes with 24 cores each with GPU kits but no GPUs for now |
nmngas | jrobnett, krowe | - nmngas{01..04}c racked, firmwared, powered, SASed. Needs format, mount.
- nmngas{01..04}c-mirror racked, firmwared, powered, SASed. Needs format, mount.
- Done: Ticket 114896 sadly didn't mention formatting or mounting volumes so it was closed.
- krowe submitted ticket 134766 to format and mount the new volumes.
|
Order test GPUS | jrobnett | Need to order test GPUs against 114412506.6432 - req: 182816 approved by dhalstea Oct. 12, 2021
#krowe Oct 25 2021: PO: 374056, $3,505.00, Tesla T4
#krowe Oct 25 2021: PO: 374060, $2,899.00, RTX A5000
#krowe Oct 25 2021: PO: 374065, $1,382.00, RTX A4000
|
Understand MDS load | jrobnett | Why is the MDS load so high? |
Glideins | krowe | 135553 Port RHEL-7.8.1.5 to CV |
Track down shadow exceptions | krowe | Why do some HTCondor data transfers trigger a shadow exception?
- Done: One reason is MaxStartups in sshd_config. We set it to 10:30:60, which means sshd starts refusing new unauthenticated connections with 30% probability once 10 are pending, rising linearly to 100% at the hard limit of 60. Perhaps set it to 30:30:100.
|
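
To compare the current MaxStartups value with the proposed one, here is a rough Python sketch of the start:rate:full behaviour as the sshd_config man page describes it (no refusals below start, rate% at start, rising linearly to 100% at full). The function names are made up for illustration, and this is a floating-point approximation of the documented rule, not sshd's actual code.

```python
def parse_max_startups(value: str) -> tuple[int, int, int]:
    """Split a 'start:rate:full' string into three integers."""
    start, rate, full = (int(part) for part in value.split(":"))
    return start, rate, full


def drop_probability(pending: int, value: str) -> float:
    """Approximate chance that a new unauthenticated connection is refused.

    'pending' is the number of unauthenticated connections already open.
    Hypothetical helper: it models the man page description only.
    """
    start, rate, full = parse_max_startups(value)
    if pending < start:
        return 0.0  # below 'start' nothing is refused
    if pending >= full:
        return 1.0  # at or above 'full' everything is refused
    base = rate / 100.0
    # Linear ramp from 'rate'% at 'start' up to 100% at 'full'.
    return base + (1.0 - base) * (pending - start) / (full - start)


if __name__ == "__main__":
    for setting in ("10:30:60", "30:30:100"):
        for pending in (5, 10, 35, 60):
            p = drop_probability(pending, setting)
            print(f"{setting} pending={pending}: refuse with p={p:.2f}")
```

Under this model, a burst of 20 simultaneous transfer connections would see some refusals with 10:30:60 and none with 30:30:100, which is consistent with the suggestion to raise the limits.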