Table of Contents

...

To Do

  • Talk with SSA and VLASS about how we actually use these remote clusters.
    • Staging, flocking, etc
  • Find a spot at DSOC for test system

Timeline

  • Buy test system as soon as practical (assuming the project is still a go)
    • Does Jeff Kern know if this is a go or not?
    • Talk to Matthew about where to put this stuff
      • May 3, 2022 krowe: talked to Matthew.  He will consult with Peter and get back to me.  I am thinking 253T.
    • Done: Ask Jeff Long to spec a switch
  • Buy production system by July
  • Receive production system by Aug
  • Install production system by Dec
  • Running in Jan. 2023

...

This is conceptual at this point.  Need to talk with SSA and VLASS about this.

  • We pre-stage data on the remote head node
  • We then either submit a job locally and it flocks to the remote site, or we log in to the remote site and submit from there.
    • Can we use a nifty filesystem to simplify this (Ceph or that LHC fs)?
    • This might be a good phase2 problem to solve.
    • Is this kinda what nraorsync does?
  • The remote execute hosts transfer data from the remote head node
  • The job uploads resulting data to the remote head node
  • We retrieve data from the remote head node
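The stage, submit, and retrieve steps above could be sketched roughly as follows. This is only an illustration of the conceptual flow, not a settled design: the head node name, the staging directory, and the per-job submit file name are all hypothetical placeholders, and the functions only build the commands rather than running them.

```python
import shlex

# Hypothetical placeholders; none of these names come from the plan above.
REMOTE_HEAD = "remote-head.example.org"
STAGE_DIR = "/stage/nrao"


def stage_in_cmd(local_path, job_id):
    """Build the rsync command that pre-stages input data on the remote head node."""
    return ["rsync", "-a", local_path, f"{REMOTE_HEAD}:{STAGE_DIR}/{job_id}/input/"]


def retrieve_cmd(job_id, local_dest):
    """Build the rsync command that pulls results back from the remote head node."""
    return ["rsync", "-a", f"{REMOTE_HEAD}:{STAGE_DIR}/{job_id}/output/", local_dest]


def workflow(local_path, job_id, local_dest):
    """Return the ordered (step, command) pairs for one job.

    The transfers between the remote execute hosts and the remote head node
    happen inside the job itself, so they are not listed here.
    """
    return [
        ("stage", stage_in_cmd(local_path, job_id)),
        ("submit", ["condor_submit", f"{job_id}.sub"]),  # assumed submit file name
        ("retrieve", retrieve_cmd(job_id, local_dest)),
    ]


if __name__ == "__main__":
    for step, cmd in workflow("/data/in/", "job42", "/data/out/"):
        print(f"{step}: {shlex.join(cmd)}")
```

Whether the submit step runs locally (flocking) or on the remote head node is still an open question; the sketch leaves that choice to whoever runs the commands.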

...

  • Get NRAO jobs on the remote racks.  This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options.  Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
    • Staging and submitting remotely?
    • Flocking?  What are the networking requirements?
    • ClassAd options?  I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice-versa.  Could CCB help here?
    • Other?
  • Remote HTCondor concerns
    • Do we want our jobs to run as an NRAO user like vlapipe, or as nobody?
    • Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
  • Need to support 50% workload for NRAO and 50% workload for the remote institution.  How?
  • Share disk space on head node 50% NRAO and 50% remote institution
    • Two partitions: one for NRAO and one for the remote institution?
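One possible answer to the 50% workload question is HTCondor accounting groups with dynamic quotas. The sketch below renders a negotiator configuration fragment using the GROUP_NAMES, GROUP_QUOTA_DYNAMIC_<group>, and GROUP_ACCEPT_SURPLUS settings. The group names group_nrao and group_remote are hypothetical, and whether this approach fits these remote racks has not been tested.

```python
def group_quota_config(shares):
    """Render an HTCondor config fragment giving each group a fraction of the pool.

    `shares` maps accounting group names to fractions that must sum to 1.
    """
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total}")
    names = sorted(shares)
    lines = ["GROUP_NAMES = " + ", ".join(names)]
    for name in names:
        lines.append(f"GROUP_QUOTA_DYNAMIC_{name} = {shares[name]}")
    # Let one group use slots the other group leaves idle.
    lines.append("GROUP_ACCEPT_SURPLUS = True")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical group names for the NRAO and remote institution workloads.
    print(group_quota_config({"group_nrao": 0.5, "group_remote": 0.5}))
```

Jobs would then need to declare their group in the submit description, for example with accounting_group = group_nrao. With surplus sharing enabled, the 50/50 split acts as a guarantee under contention rather than a hard cap.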


Documentation

  • A projectbook like we did for USNO could be appropriate
  • Process diagrams (how systems boot, how jobs get started from NRAO and run, how remote institutions start jobs, etc)


Networking

...

  • DNS
    • What DNS domain will these hosts be in?  nrao.edu? remote-institution.site? other?
  • DHCP
  • SMTP
  • NTP
  • NFS
  • LDAP?  How do we handle accounts?  I think we will want accounts on at least the head node.  The execution nodes could run everything as nobody or as real users.  If we want real users on the execute hosts then we should use a directory service which should probably be LDAP.  No sense in teaching folks how to use NIS anymore.
    • Remote institution accounts only?
  • ssh
  • rsync (nraorsync_plugin.py)
  • NAT so the nodes can download/upload data
  • TFTP (for OSes and switch)
  • condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor
  • ganglia
  • nagios
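Once a test system exists, a quick way to confirm that the firewall lets some of the services above through is a plain TCP connection test. The sketch below checks TCP reachability only; it says nothing about UDP-based services or about whether the service behind the port is healthy. The host name in the usage example is a hypothetical placeholder.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical remote head node; 9618 is the condor port listed above.
    for name, port in [("ssh", 22), ("condor", 9618)]:
        state = "open" if port_open("remote-head.example.org", port) else "closed"
        print(f"{name} ({port}/tcp): {state}")
```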

...

  • Must support CASA
  • Will need a patching/updating mechanism
  • How to boot diskless OS images
  • What Linux distribution to use?
    • Can we use Red Hat with our current license?  I have looked in JDE and I can't find a recent subscription.  Need to ask David.
    • Should we buy Red Hat licenses like we did for USNO?
      • USNO is between $10K and $15K per year for 81 licensed nodes.  This may not be an EDU license.
      • NRAO used to have a 1,000 host license for Red Hat but I don't know what they have now.
    • Do we even want to use Red Hat?
      • Alternatives would be Rocky Linux or AlmaLinux since CentOS is essentially dead
  • What version do we use: RHEL7 or RHEL8?

...

  • CASA
  • HTCondor
  • Will need a way to maintain the software
    • stow, rpm, modules, containers?

Third party software for remote institution

  • Will need a way to maintain software for the remote institution site
  • Will need a way to maintain the software
    • stow, rpm, modules, containers?

...

Maintenance

  • replace disk (remote institution admin)
  • replace/reseat DIMM (remote institution admin)
  • replace power supply (remote institution admin)
  • NRAO may handle replacement hardware. Drop ship. Spare ourselves?
  • Patching OS images (NRAO)
  • Patching third party software like CASA and HTCondor (NRAO)
  • Altering OS images (NRAO)

...

...