...

  • Interactive: The ability to assign all or part of a node to a user with shell-level access (nodescheduler, qsub -I, etc.). The minimum granularity is per NUMA node; finer would be useful. Because Slurm and HTCondor lack the features needed to implement nodescheduler, my current thought is to ditch nodescheduler and just use the interactive commands that come with Slurm and HTCondor.

    • nodescheduler: Was written before I understood what qsub -I did. Had I known, I might have argued for using qsub -I instead of nodescheduler, as it is much simpler, is consistent with other installations of Torque, and may have forced some users to use batch processing, which is much more efficient.
    • nodescheduler: likes
      • It's not tied to any tty, so a user can log in multiple times from multiple places to their reserved node without requiring something like screen, tmux, or vnc. It also means that users aren't all going through nmpost-master.
      • Its creation is asynchronous.  If the cluster is full you don't wait around for your reservation to start, you get an email message when it is ready.
      • It's time limited (e.g. two weeks).  We might be able to do the same with a queue/partition setting but could we then extend that reservation?
      • We get to define the shape of a reservation (whole node, NUMA node, etc). If we just let people use qsub -I they could reserve all sorts of sizes, which may be less efficient. Then again, it may be more efficient. Either way, I think nodescheduler is simpler for our users.
    • nodescheduler: dislikes
      • With Torque/Moab, asking for a NUMA node doesn't work as I would like. Because of bugs and limitations, I still have to ask for a specific amount of memory. The whole point of asking for a NUMA node was that I didn't need to know the resources of a node ahead of time but could just ask for half of a node. Sadly, that doesn't work with Torque/Moab.
      • Because of the way I maintain the cgroup for the user, with /etc/cgrules.conf, I cannot let a user have more than one nodescheduler job on the same node or it will be impossible to know which cgroup an ssh connection should use.  The interactive commands (qsub -I, etc) don't have this problem.
    • Slurm - I think the biggest blocker here is that Slurm doesn't have an equivalent to naccesspolicy=uniqueuser.
      • srun -p interactive --pty bash: This logs the user into an interactive shell on a node with defaults (1 core, 1 GB memory) in the interactive partition.
      • NUMA: I don't see how Slurm can reserve NUMA nodes, so we may have to just reserve X tasks with Y memory.
      • naccesspolicy=uniqueuser: I don't know how to keep Slurm from giving a user multiple portions of the same host. With Moab I used naccesspolicy=uniqueuser, which prevents the ambiguity of which ssh connection goes to which cgroup. I could have nodescheduler check the nodes and assign one that the user isn't currently using, but this starts to turn nodescheduler into a scheduler of its own, which I think may be more complication than we want to maintain.
      • cgrules: Slurm has system-level prolog/epilog functionality that should allow nodescheduler to set /etc/cgrules.conf, but pam_slurm_adopt.so pretty much removes the need for /etc/cgrules.conf.
      • PAM: The pam_slurm.so module can be used without modifying systemd and will block users that don't have a job running from logging in. The pam_slurm_adopt.so module requires removing some pam_systemd modules and does what pam_slurm.so does, plus it puts the user's login shell in the same cgroup as the Slurm job expected to run the longest, which would replace my /etc/cgrules.conf hack. This still doesn't solve the problem of multiple interactive jobs by the same user on the same node. Removing the pam_systemd.so module prevents the creation of things like /run/user/<UID>, XDG_RUNTIME_DIR, and XDG_SESSION_ID, which breaks VNC. So we may want to use just pam_slurm.so and not pam_slurm_adopt.so.
        • But Slurm at CHTC has neither pam_slurm.so nor pam_slurm_adopt.so configured, and their nodes don't create /run/user/<UID> either. So it might just be Slurm itself and not the PAM modules causing the problem.
        • Also, in order to install pam_slurm_adopt.so you have to not only disable systemd-logind but also mask it. This prevents /run/user/<UID> from being created even if you log in with plain ssh (i.e. no Slurm, Torque, or HTCondor involved).
      • nodeextendjob: Can Slurm extend the MaxTime of an interactive job? Yes: scontrol update TimeLimit=7 JobId=489 sets the MaxTime to seven minutes.
    • HTCondor
      • condor_submit -i: This logs the user into an interactive shell on a node with defaults (1 core equivalent, 0.5 GB memory).
      • NUMA: I don't see how HTCondor can reserve NUMA nodes, so we may have to just reserve X tasks with Y memory.
      • naccesspolicy=uniqueuser: I don't think I need to worry about giving a user multiple portions of the same host if we are using condor_ssh_to_job.
      • cgrules: I don't know if HTCondor has the prologue/epilogue functionality to implement my /etc/cgrules.conf hack.
      • PAM: How can we allow a user to log in to a node they have an interactive job running on via nodescheduler? With Torque or Slurm there are PAM modules, but there isn't one for HTCondor.
      • Could run a sleep job just like we do with Torque and use condor_ssh_to_job which seems to do X11 properly.  We would probably want to make gygax part of the nmpost pool.
    • nodevnc
    • Ah FFS!  I can actually successfully launch a VNC session using my nodevnc-pbs script even though there is no /run/user/<UID> on the node.  I have not changed this nodevnc-pbs script in six months.  WTF?!  Perhaps Torque can work around not having /run/user/<UID> but Slurm cannot.
      • Given the limitations of Slurm and HTCondor, and that we already recommend that users use VNC on their interactive nodes, why don't we just provide a nodevnc script that reserves a node (via Torque, Slurm, or HTCondor), starts a VNC server, and then tells the user that it is ready and how to connect to it? If someone still needs/wants just simple terminal access, then qsub -I or srun --pty bash or condor_submit -i might suffice.
      • Slurm
        • /run/user/<UID>: Slurm doesn't actually run /bin/login, so things like /run/user/<UID> are not created, which causes vncserver to produce errors like "Call to lnusertemp failed" upon connection with vncviewer. Both HTCondor and Torque/Moab have similar problems.
          • NO: /bin/bash -l
          • NO: --get-user-env=L
          • NO: --get-user-env=S
          • NO: --export=ALL
          • YES: If I log in to the host (rastan) and then launch VNC via Slurm, it all works.
          • YES: If I create /run/user/5213 on the host, then I can launch VNC via Slurm.
          • Could I just create /run/user/<UID> in a prolog script?
          • I can successfully run Xvfb without /run/user/<UID>.
          • I have successfully run small CASA tests with xvfb-run.
          • A workaround could be something like the following, but other things might be broken because of the missing /run/user/<UID>. Also, ${XDG_RUNTIME_DIR}/gvfs is actually a FUSE mount, and reaper does not know how to umount things.
            • mkdir /tmp/${USER}
            • export XDG_RUNTIME_DIR=/tmp/${USER}
            • Or use loginctl enable-linger krowe and loginctl disable-linger krowe in prolog/epilog scripts?
        • Reading up on how pam_slurm_adopt works, it will probably never cooperate with systemd and therefore it is a hack and not future-proof.  https://github.com/systemd/systemd/issues/13535  I am unsure how wise it is to start using pam_slurm_adopt in the first place.
        • So if I don't install pam_slurm_adopt.so (which I only installed because it seemed better than my /etc/cgrules.conf hack, which I only created for nodescheduler after we started using cgroups), then I think I can get nodevnc working as a pseudo-replacement for nodescheduler. If we do use nodevnc and don't use nodescheduler (which we mostly can't), then we may not want to use the pam_slurm.so module either, so that users can't log in to nodes they have reserved and possibly use resources they aren't scheduled for.
      • HTCondor
        • HTCondor doesn't seem to create /run/user/<UID> either here (8.9.7) or at CHTC (8.9.11). I can get vncserver to run at CHTC by setting HOME=$TMPDIR and transferring ~/.vnc, but I am unable to connect to it via vncviewer. The connection times out. This makes me think that even if I can get vncserver working, which I may have done at CHTC, it will still give me the lnusertemp error because of the missing /run/user/<UID>.
        • Xserver: Since we run an X server on our nmpost nodes (ironically, to allow VNC and remote X from thin clients), starting a vncserver from HTCondor fails. vncserver doesn't see the /tmp/.X11-unix socket of the running X server because HTCondor has bind-mounted a fresh /tmp for us, so vncserver tries to start an X server, which fails because the port is already in use.
    • Other options
      • screen
      • tmux
      • others?
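
The Slurm notes above mention two commands: an interactive shell of a fixed shape ("X tasks with Y memory" in the interactive partition) and extending a running job's time limit. As a rough sketch, they could be wrapped in a small dry-run helper. The function names interactive_cmd and extend_cmd are hypothetical, the 8-task/16G shape and the two-week limit are arbitrary examples, and the functions only print the commands rather than run them. The srun options --ntasks and --mem and the scontrol update syntax are standard Slurm; the interactive partition name comes from the notes above.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and print commands; they do not
# contact Slurm.

# Print an srun command for an interactive shell with a fixed shape.
#   $1: number of tasks (stand-in for "X tasks")
#   $2: memory per node, e.g. 16G (stand-in for "Y memory")
interactive_cmd() {
    printf 'srun -p interactive --ntasks=%s --mem=%s --pty bash\n' "$1" "$2"
}

# Print an scontrol command that changes a job's time limit.
#   $1: job id
#   $2: new limit; a bare number is minutes, days-hh:mm:ss is also accepted
extend_cmd() {
    printf 'scontrol update JobId=%s TimeLimit=%s\n' "$1" "$2"
}

# Example: an 8-task, 16 GB shell, then a two-week limit for job 489.
interactive_cmd 8 16G
extend_cmd 489 14-00:00:00
```

Printing rather than executing keeps the sketch safe to try on a host without Slurm. A real replacement for nodeextendjob would also need to decide who is allowed to raise a limit, since increasing TimeLimit is normally restricted to administrators.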

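The XDG_RUNTIME_DIR workaround from the Slurm notes (mkdir plus export) could be made a little more defensive along the following lines. This is a sketch under assumptions, not a tested fix for vncserver: ensure_runtime_dir is a hypothetical helper, it names the fallback directory after the numeric UID rather than ${USER}, and the second argument exists only so the function can be exercised without a real /run/user. It does nothing about the gvfs FUSE mount mentioned above.

```shell
#!/bin/sh
# Sketch only: ensure_runtime_dir is a hypothetical helper, not an existing tool.
# Usage: ensure_runtime_dir [fallback_base] [run_base]
#   fallback_base: where to create the private directory (assumed default /tmp)
#   run_base:      where systemd normally creates per-user dirs (default /run/user)
ensure_runtime_dir() {
    fallback_base="${1:-/tmp}"
    run_base="${2:-/run/user}"
    uid="$(id -u)"

    if [ -d "${run_base}/${uid}" ]; then
        # The per-user directory already exists; use it as-is.
        XDG_RUNTIME_DIR="${run_base}/${uid}"
    else
        # Fall back to a private directory, as in the mkdir/export lines above.
        XDG_RUNTIME_DIR="${fallback_base}/${uid}"
        mkdir -p "${XDG_RUNTIME_DIR}" || return 1
        # The XDG spec asks for mode 0700. For a non-root user, chmod also
        # fails if someone else already owns the directory.
        chmod 700 "${XDG_RUNTIME_DIR}" || return 1
    fi
    export XDG_RUNTIME_DIR
}
```

In a job script this would be called as ensure_runtime_dir before starting vncserver. Whether that is enough to avoid the lnusertemp error is exactly the open question in the notes.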
...