...

  • Talk with SSA and VLASS about how we actually use these remote clusters (e.g., staging, flocking, etc.)
  • Buy test system
  • Talk with Puerto Rico about their technical situation
  • Think about getting drives from a third party instead of Dell and sparing them ourselves.
  • Prefer 208V for us and the remote site(s)
    • Install a 208V circuit for our reserved space (253T) in the server room.
    • Could use PD1-B breakers 7, 9, 11
  • Think about getting the same drives for the OS that we get for the data array.  It's simpler but perhaps slower.
  • Who is going to put all the RAM and drives in the servers?
  • Work on a price spreadsheet
  • Learn Ansible?

Timeline

  • Buy test system as soon as practical
  • Buy production system about 6 months after test system
  • Receive production system about 7 months after test system
  • Install production system about 10 months after test system
  • Running about 12 months after test system
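
The month offsets above can be turned into calendar targets once a test-system purchase date is known. The following is a minimal sketch; the purchase date used in the demo is hypothetical and only there to show the arithmetic.

```python
from datetime import date

# Offsets in months after the test-system purchase, taken from the timeline above.
MILESTONES = {
    "Buy production system": 6,
    "Receive production system": 7,
    "Install production system": 10,
    "Running": 12,
}


def add_months(start: date, months: int) -> date:
    """Return the first day of the month that falls `months` after `start`."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def schedule(test_purchase: date) -> dict[str, date]:
    """Map each milestone name to its target month."""
    return {name: add_months(test_purchase, offset) for name, offset in MILESTONES.items()}


if __name__ == "__main__":
    # Hypothetical purchase date, for illustration only.
    for name, when in schedule(date(2023, 1, 15)).items():
        print(f"{when:%Y-%m}  {name}")
```

Targets are rounded to the month because the timeline itself is only given as "about N months".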

...

  • Get NRAO jobs on the remote racks.  This may depend on how we want to use these remote racks. If we want them to do specific types of jobs then ClassAd options may be the solution. If we want them as overflow for jobs run at NRAO then flocking may be the solution. Perhaps we want both flocking and ClassAd options.  Actually flocking may be the best method because I think it doesn't require the execute nodes to have external network access.
    • Staging and submitting remotely?
    • Flocking?
    • ClassAd options?  I think this will require the execute hosts to have routable IPs because our submit host will talk directly to them and vice versa.  Could CCB help here?
    • Other?
  • Remote HTCondor concerns
    • Do we want our jobs to run as an NRAO user like vlapipe, or nobody?
    • Do we want remote institution jobs to run as the remote institution user, some dedicated user, or nobody?
  • Need to support 50% workload for NRAO and 50% workload for remote institution.  How?
  • Share disk space on head node 50% NRAO and 50% remote institution
    • Two partitions: one for NRAO and one for remote institution?
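
One way to read the 50%/50% workload requirement is as a quota with surplus sharing: each side is guaranteed half the slots, and slots one side leaves idle can be used by the other. The toy Python sketch below only illustrates that arithmetic. It is not HTCondor configuration, and the group names and slot counts are hypothetical. HTCondor's accounting-group quotas (for example the GROUP_QUOTA_DYNAMIC_<groupname> and GROUP_ACCEPT_SURPLUS settings) might be a mechanism for this, but whether they fit here is untested.

```python
def split_slots(total_slots: int, demand: dict[str, int], share: dict[str, float]) -> dict[str, int]:
    """Grant each group up to its quota, then hand idle slots to groups with unmet demand.

    `share` maps group name to its fraction of the pool (e.g. 0.5 each).
    Surplus is offered to groups in the order they appear in `share`.
    """
    quota = {group: int(total_slots * share[group]) for group in share}
    granted = {group: min(demand.get(group, 0), quota[group]) for group in share}
    spare = total_slots - sum(granted.values())
    for group in share:
        extra = min(spare, demand.get(group, 0) - granted[group])
        granted[group] += extra
        spare -= extra
    return granted


if __name__ == "__main__":
    # Hypothetical pool of 30 slots shared equally between two groups.
    shares = {"nrao": 0.5, "remote": 0.5}
    print(split_slots(30, {"nrao": 40, "remote": 40}, shares))  # both busy
    print(split_slots(30, {"nrao": 40, "remote": 5}, shares))   # remote mostly idle
```

With both sides busy each gets 15 slots; when the remote institution only needs 5, NRAO jobs can use the remaining 25. Whether idle capacity should be lent out at all is a policy decision this sketch does not answer.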

Documentation

  • A projectbook like we did for USNO could be appropriate
  • Process diagrams (how systems boot, how jobs get started from NRAO and run, how remote institutions start jobs, etc)

Networking

HTCondor flocking requires

  • From local schedd to remote collector on condor port 9618
  • From remote negotiator and execute hosts to local schedd.  Here the execute hosts can be NATed.
  • From local shadow to remote starter.  Use CCB.  It allows execute hosts to live behind a firewall and be NATed.

Non-flocking just requires ssh access, probably from mcilroy and to gibson.

NRAO side

  • NRAO -> remote head node on port 22 (ssh)
  • Submit Host -> remote head node (condor_collector) on port 9618 (HTCondor) for flocking
  • Submit Host <- remote head node (condor_negotiator) on port 9618 (HTCondor) for flocking
    • mcilroy has external IPs (146.88.1.66 for 1Gb/s and 146.88.10.66 for 10Gb/s).  Is the container listening?
  • Submit Host <- remote execute hosts (condor_starter) on port 9618 (HTCondor) for flocking
  • Submit Host (condor_shadow) -> remote execute hosts (condor_starter) on port 9618 (HTCondor) for flocking.  CCB might alleviate this.

Remote side

  • Head node <- from nrao.edu on port 22 (ssh)
  • Head node -> revere.aoc.nrao.edu on port 25 (smtp)
  • Head node -> NRAO Submit Host on port 9618 (HTCondor) for flocking
  • Head node <- NRAO Submit Host on port 9618 (HTCondor) for flocking
  • Execute node -> NRAO Submit Host on port 9618 (HTCondor) for flocking.  Execute host may be NATed.
  • Execute node -> gibson.aoc.nrao.edu on port 22 (ssh) for flocking with nraorsync.  Execute host can be NATed.
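
The flows in this list can be written down as data and probed from the remote side once a site is on the network. The following is a minimal sketch under stated assumptions: "head-node", "execute-node" and "submit-host" are placeholders rather than real host names, and a successful TCP connection only shows that a firewall passes the port, not that HTCondor itself is configured correctly.

```python
import socket

# (source, destination, port, purpose) as listed above for the remote side.
# "head-node", "execute-node" and "submit-host" are placeholders, not real hosts.
FLOWS = [
    ("nrao.edu", "head-node", 22, "ssh"),
    ("head-node", "revere.aoc.nrao.edu", 25, "smtp"),
    ("head-node", "submit-host", 9618, "HTCondor flocking"),
    ("submit-host", "head-node", 9618, "HTCondor flocking"),
    ("execute-node", "submit-host", 9618, "HTCondor flocking"),
    ("execute-node", "gibson.aoc.nrao.edu", 22, "ssh for nraorsync"),
]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def outbound_from(source: str) -> list[tuple[str, int, str]]:
    """List the (destination, port, purpose) flows that originate at `source`."""
    return [(dst, port, why) for src, dst, port, why in FLOWS if src == source]
```

Run on the head node, for example, one would loop over `outbound_from("head-node")` after replacing the placeholder names with real ones and call `can_connect(dst, port)` for each entry.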


Services

  • DNS
    • What DNS domain will these hosts be in?  nrao.edu? remote-institution.site? other?
    • Will this vary depending on site?
    • 2022-10-26 krowe: it is looking like the institution will own the equipment.  Either they buy it with their own money like UPR-M or AUI gives them a grant and they buy it.  Either way, they own it.  So, I think we can expect the hosts to be in their DNS domain.  Which is probably for the best.  We can make CNAMEs for each head node if needed.
    • So what IP range should we use?  That may depend on the site as each site may use non-routable IP ranges differently.
  • DHCP
  • SMTP
  • NTP or chrony
    • What timezone should these be in?  I think the choices are
      • Mountain - Perhaps the most convenient for NRAO users and consistent between the sites.
      • Local - Makes the most sense to the local users but means differences between the sites.
      • UTC - equally annoying for all.
  • NFS
  • Directory Server
    • NIS?  Probably not.  RHEL9 will not support NIS.
    • OpenLDAP
    • 389 Directory Server? (previously Fedora Directory Server)
    • Identity Management
    • FreeIPA
    • How do we handle accounts?  I think we will want accounts on at least the head node.  The execution nodes could run everything as nobody or as real users.  If we want real users on the execute hosts then we should use a directory service which should probably be LDAP.  No sense in teaching folks how to use NIS anymore.
      • remote institution accounts only?
    • 2022-10-26 krowe: RHEL8 and later don't come with OpenLDAP anymore.  Red Hat wants you to use either their 389DS or IDM or RHDS or some other thing that gets them money.  It's all very confusing.
  • ssh
  • rsync (nraorsync_plugin.py)?
  • NAT so the nodes can download/upload data?
  • TFTP (for OSes and switch)
  • condor (port 9618) https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToMixFirewallsAndHtCondor
  • nagios
  • ganglia
    • Ganglia hasn't been updated since 2015 so perhaps it is time to look for something else.
    • Prometheus/Grafana
    • Zabbix


Operating System

  • Must support CASA
  • Will need a patching/updating mechanism
  • How to boot diskless OS images
  • What Linux distribution to use?
    • Can we use Red Hat with our current license?  I have looked in JDE and I can't find a recent subscription.  Need to ask David.
      • We have a 1,000 FTE license with up to 20,000 installations allowed.  But since we are either selling the equipment to the institution or asking them to buy it themselves, at least with UPRM, I don't see how we can use our RHEL license legally.  Asking each institution to acquire an RHEL license sounds like a recipe for failure so I think an open source OS is the answer.
    • Should we buy Red Hat licenses like we did for USNO?
      • USNO is between $10K and $15K per year for 81 licensed nodes.  This may not be an EDU license.
      • NRAO used to have a 1,000 host license for Red Hat but I don't know what they have now.
      • I don't want to maintain licenses for up to 10 different installs.  I don't think the institutions will want to purchase and maintain a license.
    • Do we even want to use Red Hat?
      • Alternatives would be Rocky Linux or AlmaLinux or CentOS Stream
    • Some sites, like UPR-M, will own their equipment.  At most sites the equipment will probably be owned by NRAO.
  • What version do we use: RHEL7, RHEL8, or RHEL9?  Remember CASA needs to support it.
  • What OSes is CASA verified against?  I am pretty sure RHEL, but what about CentOS, Rocky, Alma, etc.?
  • What version of CASA does VLASS need?
  • The cost of RHEL is pretty small compared to the hardware.
  • UPRM has their own money but the other institutions will either get a grant or money from us so we can say what OS they use and pay for.
  • Should pull in Matthew or Schlake on this decision.
  • Should we use Ansible for deployments?
  • 2022-10-21 krowe: jkern talked to the business office and they prefer that AUI gives the money to Morgan State and Morgan State buys the equipment.  So Morgan State will own it.  NRAO can stipulate you will only get the money if you buy what we recommend.
  • 2022-10-24 krowe: CASA is verified against RHEL8.  CentOS Stream 8 is a constantly moving target.  There is no CentOS Stream 8.1 or 8.2; it is always the cutting edge of RHEL.  I don't like that.  I would much rather have version numbers that I can compare to RHEL.  So that is a vote against CentOS Stream and for Rocky or Alma.  I just checked Scientific Linux (maintained by Fermilab and CERN) and they are moving to CentOS Stream 8.  So there will not be a Scientific Linux 8.
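
The point about comparable version numbers can be checked mechanically by reading VERSION_ID from /etc/os-release. The sketch below parses os-release text and reports whether the version carries a minor number that can be lined up with a RHEL release. The two sample strings are hand-written for illustration, on the assumption that Rocky reports a major.minor VERSION_ID and CentOS Stream reports only the major number; they were not captured from real hosts.

```python
def parse_os_release(text: str) -> dict[str, str]:
    """Parse KEY=value lines in the os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def version_tuple(fields: dict[str, str]) -> tuple[int, ...]:
    """Turn VERSION_ID such as "8.6" into (8, 6); a missing field gives ()."""
    parts = fields.get("VERSION_ID", "").split(".")
    return tuple(int(part) for part in parts if part.isdigit())


def tracks_rhel_minor(fields: dict[str, str]) -> bool:
    """True when VERSION_ID has a minor number that can be compared to a RHEL release."""
    return len(version_tuple(fields)) >= 2


# Illustrative samples only (assumed shapes, not captured from real systems).
ROCKY_SAMPLE = 'ID="rocky"\nVERSION_ID="8.6"\n'
STREAM_SAMPLE = 'ID="centos"\nVERSION_ID="8"\n'

if __name__ == "__main__":
    for label, sample in (("rocky", ROCKY_SAMPLE), ("stream", STREAM_SAMPLE)):
        fields = parse_os_release(sample)
        print(label, version_tuple(fields), tracks_rhel_minor(fields))
```

On a real host the input would come from reading /etc/os-release instead of the sample strings.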

Third party software for VLASS

  • CASA (what version?)
  • HTCondor
  • Will need a way to maintain the software
    • stow, rpm, modules, containers?

...