...
- Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks: if we want them to run specific types of jobs, then ClassAd options may be the solution; if we want them as overflow for jobs run at NRAO, then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually, flocking may be the best method because I think it doesn't require the execute nodes to have external network access (see the flocking sketch after this list).
- Staging and submitting remotely?
- Flocking?
- ClassAd options? I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa. Could CCB help here?
- Other?
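A rough sketch of the flocking setup, assuming the standard FLOCK_TO/FLOCK_FROM knobs; the hostnames here are made-up placeholders, and the exact authorization line on the remote side may differ:

    # On the NRAO submit host (condor_config.local):
    FLOCK_TO = condor-cm.remote-institution.edu

    # On the remote rack's central manager (condor_config.local):
    FLOCK_FROM = nrao-submit.nrao.edu
    # The remote pool also has to trust our schedd, something like:
    ALLOW_WRITE = $(ALLOW_WRITE), nrao-submit.nrao.edu

Jobs submitted at NRAO would then overflow to the remote pool when our own pool is full, which matches the overflow use case above.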
- Remote HTCondor concerns
- Do we want our jobs to run as an NRAO user like vlapipe, or nobody?
- Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
- Need to support a 50% workload for NRAO and a 50% workload for the remote institution. How? (see the group-quota sketch after this list)
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
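A rough sketch of a 50/50 split using HTCondor group accounting with dynamic quotas; the group names and submit-file lines are hypothetical, but the knobs are the standard group-quota config:

    # condor_config.local on the central manager
    GROUP_NAMES = group_nrao, group_remote
    GROUP_QUOTA_DYNAMIC_group_nrao = 0.5
    GROUP_QUOTA_DYNAMIC_group_remote = 0.5
    GROUP_ACCEPT_SURPLUS = True   # let either side use the other's idle slots

    # and in each submit description file:
    #   accounting_group = group_nrao
    #   accounting_group_user = vlapipe

With GROUP_ACCEPT_SURPLUS the split is only enforced under contention, so neither half sits idle while the other is backlogged.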
- Share disk space on the head node 50% NRAO and 50% remote institution. It might be nice to use something like quotas or LVM so that we can change the disk allocation dynamically for times when either NRAO or the local institution needs more space (see the quota sketch after this list).
- Two partitions: one for NRAO and one for the remote institution?
- One partition with quotas? https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/xfsquota
- LVM?
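A rough sketch of the one-partition approach: XFS project quotas to enforce the split, on top of LVM so the volume can be resized later. The device names, paths, and sizes are all made up:

    # /etc/projects maps directory trees to project IDs; /etc/projid names them:
    #   /etc/projects:  1:/export/nrao
    #                   2:/export/remote
    #   /etc/projid:    nrao:1
    #                   remote:2
    mount -o prjquota /dev/vg0/export /export
    xfs_quota -x -c 'project -s nrao' /export
    xfs_quota -x -c 'limit -p bhard=10t nrao' /export   # cap NRAO at 10TB
    lvextend -r -L +2T /dev/vg0/export                  # grow later if either side needs more

Raising or lowering the bhard limits is a one-line change, which gives us the dynamic reallocation wanted above.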
Documentation
- A projectbook like we did for USNO could be appropriate
- Process diagrams (how systems boot, how jobs get started from NRAO and run, how remote institutions start jobs, etc)
...
- DNS
- What DNS domain will these hosts be in? nrao.edu? remote-institution.site? other?
- Will this vary depending on site?
- 2022-10-26 krowe: it is looking like the institution will own the equipment. Either they buy it with their own money, like UPR-M, or AUI gives them a grant and they buy it. Either way, they own it, so I think we can expect the hosts to be in their DNS domain, which is probably for the best. We can make CNAMEs for each head node if needed.
- So what IP range should we use? That may depend on the site, since each site may use non-routable IP ranges differently.
- DHCP
- SMTP
- NTP or chrony
- What timezone should these be in? I think the choices are:
- Mountain - Perhaps the most convenient for NRAO users and consistent between the sites.
- Local - Makes the most sense to the local users but means differences between the sites.
- UTC - Equally annoying for all.
- NFS
- Directory Server
- NIS? Probably not. RHEL9 will not support NIS.
- OpenLDAP
- 389 Directory Server? (previously Fedora Directory Server)
- Identity Management
- FreeIPA
- How do we handle accounts? I think we will want accounts on at least the head node. The execute nodes could run everything as nobody or as real users. If we want real users on the execute hosts, then we should use a directory service, which should probably be LDAP; there is no sense in teaching folks how to use NIS anymore (see the SSSD sketch below).
- remote institution accounts only?
- 2022-10-26 krowe: RHEL8 and later don't come with OpenLDAP anymore. Red Hat wants you to use either their 389DS, IdM, RHDS, or some other thing that gets them money. It's all very confusing.
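If we do go with LDAP, a rough sketch of the client side via sssd.conf; the server URI and base DN are made-up placeholders:

    [sssd]
    domains = remote
    services = nss, pam

    [domain/remote]
    id_provider = ldap
    auth_provider = ldap
    ldap_uri = ldaps://ldap.remote-institution.edu
    ldap_search_base = dc=remote-institution,dc=edu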
- ssh
- rsync (nraorsync_plugin.py)?
- NAT so the nodes can download/upload data?
- TFTP (for OSes and switch)
- condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor (see the firewall sketch after this list)
- nagios
- ganglia
- Ganglia hasn't been updated since 2015 so perhaps it is time to look for something else.
- Prometheus/Grafana
- Zabbix
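A rough sketch of opening the ports above with firewalld, assuming RHEL-family hosts; the exact service list will vary per site:

    firewall-cmd --permanent --add-service=ssh
    firewall-cmd --permanent --add-port=9618/tcp   # HTCondor shared port
    firewall-cmd --reload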
...