Documentation

  • A projectbook like we did for USNO could be appropriate
  • Process diagrams (how systems boot, how jobs get started from NRAO and run, how locals start jobs, etc.)


Networking

NRAO side

  • Submit host needs to be able to establish a connection to the remote head node on port 9618 (HTCondor)
  • Submit host needs to be able to listen for a connection from the remote head node on port 9618 (HTCondor)
    • mcilroy has external IPs (146.88.1.66 for 1Gb/s and 146.88.10.66 for 10Gb/s).  Is the container listening?
  • NRAO needs to be able to establish a connection to the remote head node on port 22 (ssh)

Remote side

  • Head node establishes connections on port 9618 to nrao.edu. (HTCondor)
  • Head node listens on port 9618 for connections from nrao.edu. (HTCondor)
  • Execute node establishes connections on port 9618 to nrao.edu.  Execute host can be NATed. (HTCondor if flocking)
  • Execute node establishes connections on port 22 to gibson.aoc.nrao.edu.  Execute host can be NATed. (nraorsync if flocking)
  • Head node listens on port 22 for connections from nrao.edu. (ssh)
  • Head node establishes connections on port 25 to revere.aoc.nrao.edu. (mail)
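
The port requirements above could be spot-checked with a short script before a rack goes into service. This is only a minimal sketch: the function names and the idea of passing in a list of (host, port) pairs are illustrative assumptions, not an existing NRAO tool. It only tests whether an outbound TCP connection can be opened, so it would need to be run once from the NRAO side and once from the remote side to cover both directions.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def report(checks, timeout=3.0):
    """Map each (host, port) pair to True/False depending on reachability."""
    return {(host, port): can_connect(host, port, timeout) for host, port in checks}
```

From the remote head node, for example, one might call `report([("gibson.aoc.nrao.edu", 22), ("revere.aoc.nrao.edu", 25)])`. A False result does not say whether the block is a firewall, NAT, or a service that is not listening.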


Services

  • DNS
    • What DNS domain will these hosts be in?  nrao.edu? local.site? other?
  • DHCP
  • SMTP
  • NTP
  • NFS
  • LDAP?  How do we handle accounts?  I think we will want accounts on at least the head node.  The execution nodes could run everything as nobody or as real users.  If we want real users on the execute hosts then we should use a directory service which should probably be LDAP.  No sense in teaching folks how to use NIS anymore.
    • Local accounts only?
  • ssh
  • rsync (nraorsync_plugin.py)
  • NAT so the nodes can download/upload data
  • TFTP (for OSes and switch)
  • condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor
  • ganglia
  • nagios


Operating System

  • Must support CASA
  • Will need a patching/updating mechanism
  • How to boot diskless OS images
  • What Linux distribution to use?
    • Can we use Red Hat with our current license?  I have looked in JDE and I can't find a recent subscription.  Need to ask David.
    • Should we buy Red Hat licenses like we did for USNO?
      • USNO is between $10K and $15K per year for 81 licensed nodes.  This may not be an EDU license.
      • NRAO used to have a 1,000 host license for Red Hat but I don't know what they have now.
    • Do we even want to use Red Hat?
      • Alternatives would be Rocky Linux or AlmaLinux since CentOS is essentially dead
  • What version do we use: RHEL7 or RHEL8?

...

  • Will need a way to maintain software for the local site
  • Will need a way to maintain the software
    • stow, rpm, modules, containers?

Management Access

  • PDU
  • UPS
  • BMC/IPMI
  • switch

...

Using

  • Get NRAO jobs on the remote racks.  This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options.  Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
    • Flocking?  What are the networking requirements?
    • ClassAd options?  I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa.  Could CCB help here?
    • Other?
  • Remote HTCondor concerns
    • Do we want our jobs to run as an NRAO user like vlapipe or nobody?
    • Do we want local jobs to run as the local user, some dedicated user, or nobody?
  • Need to support 50% workload for NRAO and 50% workload for local.  How?
  • Share disk space on head node 50% NRAO and 50% local
    • Two partitions: one for NRAO and one for local?
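
One possible answer to the 50/50 workload question is HTCondor accounting groups with dynamic quotas. The sketch below renders a negotiator configuration fragment for that approach. It is an assumption, not a decision: the group names `group_nrao` and `group_local` and the helper function are hypothetical, and setting `GROUP_ACCEPT_SURPLUS` (so that one side can use the other's idle share) is a policy choice that still needs discussion.

```python
def render_group_quotas(shares):
    """Render an HTCondor config fragment giving each group a fractional quota.

    shares maps a short group name (hypothetical, e.g. "nrao") to its
    fraction of the pool; the fractions must sum to 1.0.
    """
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total}")

    groups = [f"group_{name}" for name in shares]
    lines = [f"GROUP_NAMES = {', '.join(groups)}"]
    for name, share in shares.items():
        # Dynamic quotas are expressed as a fraction of the pool.
        lines.append(f"GROUP_QUOTA_DYNAMIC_group_{name} = {share}")
    # Assumption: allow a group to use slots the other group leaves idle.
    lines.append("GROUP_ACCEPT_SURPLUS = True")
    return "\n".join(lines)


print(render_group_quotas({"nrao": 0.5, "local": 0.5}))
```

With this approach each job would also have to name its group in the submit description (for example `accounting_group = group_nrao`). Group quotas only address CPU share; the disk split on the head node would still need its own mechanism, such as the two partitions suggested above.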

Shipping

  • Drop ship everything to the site and assemble on site.  This will require an NRAO person on site to assemble with a pre-built OS disk for the head node.
  • Ship everything here, assemble it, then ship a rack-on-pallet to the site.
  • Mix the two.  Ship a minimal set here (head node, switch, a couple of nodes, etc.) to configure, and drop ship most of the nodes to the site.
  • A person from the remote site could travel to NM or CV to see the test system and get instruction.

...