...
- Get NRAO jobs on the remote racks. This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options. Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
- Staging and submitting remotely?
- Flocking?
- Classad options? I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice-versa. Could CCB help here?
- Other?
- Remote HTCondor concerns
- Do we want our jobs to run an NRAO user like vlapipe, or nobody?
- Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
- Need to support 50% workload for NRAO and 50% workload for remote institution. How?
- Could have 15 nodes for us and 15 nodes for them
- What if we do nothing? HTCondor's fair-share algorithm may do the work for us if all our jobs are run as user vlapipe or something like that.
- Use RANK, and therefore preemption. https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigPrioritiesForUsers
- Group Accounting
- User Priority https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSetUserPriority
- Share disk space on head node 50% NRAO and 50% remote institution. It might be nice to use something like quotas or LVM so that we can change the disk usage dynamically for times when either NRAO or local needs more space.
- Two partitions: one for NRAO and one for remote institution?
- One partition with quotas?
- LVM?
Documentation
- A projectbook like we did for USNO could be appropriate
- Process diagrams (how systems boot, how jobs get started from NRAO and run, how remote institutions start jobs, etc)
...