Done: aoc253k-pdu-1 has critical alarms 132028. During the power outage they replaced the PDU with the spare.
aocoss13 130466 racked, booted. Needs Lustre. Stolen to repair aocoss04.
More HERA nodes
jrobnett, krowe
Connect IB switch to fabric. Requires some re-arranging of ports. 133166
Cards/cables req: 182337, 182338. Install in new nodes.
Boot three 2U nodes (24 cores each) with GPU kits but no GPUs for now
nmngas
jrobnett, krowe
nmngas{01..04}c racked, firmwared, powered, SASd. Needs format, mount.
nmngas{01..04}c-mirror racked, firmwared, powered, SASd. Needs format, mount.
Ticket 134766 to format and mount the new volumes.
Order test GPUs
jrobnett
Need to order test GPUs against 114412506.6432
req: 182816 approved by dhalstea Oct. 12, 2021
Understand MDS load
jrobnett
Why is the MDS load so high?
Track down shadow exceptions
krowe
Why do some HTCondor data transfers trigger a shadow exception?
Done: One reason is MaxStartups in sshd_config. We set it to 10:30:60, which means sshd starts refusing new unauthenticated connections with 30% probability once 10 are pending, rising linearly to 100% at the hard limit of 60. Perhaps set it to 30:30:100.
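To compare the two settings, here is a rough Python sketch of how a start:rate:full triple for sshd's MaxStartups maps to a refusal probability, following the description in the sshd_config man page (nothing refused below start, rate percent at start, linear rise to 100% at full). The function name drop_probability is made up for illustration, and this is a floating-point approximation, not sshd's actual code.

```python
def drop_probability(pending, start, rate, full):
    """Approximate chance that sshd refuses a new unauthenticated connection.

    pending: unauthenticated connections currently open
    start, rate, full: the three fields of MaxStartups (rate is a percentage)
    Hypothetical helper, based on the sshd_config man page description.
    """
    if pending < start:
        return 0.0          # below the start threshold nothing is refused
    if pending >= full:
        return 1.0          # at or above the hard limit everything is refused
    base = rate / 100.0
    # Linear ramp from `base` at `start` up to 1.0 at `full`.
    return base + (1.0 - base) * (pending - start) / (full - start)


if __name__ == "__main__":
    # Compare the current setting with the proposed one.
    for n in (5, 10, 20, 30, 60, 100):
        current = drop_probability(n, 10, 30, 60)
        proposed = drop_probability(n, 30, 30, 100)
        print(f"{n:3d} pending: 10:30:60 -> {current:.2f}, 30:30:100 -> {proposed:.2f}")
```

Under this sketch, a burst of 20 simultaneous unauthenticated connections already sees refusals with 10:30:60 but none with 30:30:100, which may be relevant if many transfers start at once.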