Open questions:


Node Priority

Is there a way to set the order in which nodes are picked, or a weight system?  We want certain nodes to be chosen first because they are faster, have less memory, or meet other such criteria.

NEGOTIATOR_PRE_JOB_RANK on the negotiator

Reservations

What if you know certain nodes will be unavailable for a window of time, say the second week of next month?  Is there a way to schedule that in advance in HTCondor?  For example, in Slurm:

scontrol create reservation starttime=2021-02-8T08:00:00 duration=7-0:0:0 nodes=nmpost[020-030] user=root reservationname=siw2022

Array Jobs

Does HTCondor support array jobs like Slurm?  For example, in Slurm: #SBATCH --array=0-3%2.  Or is one supposed to use queue options and DAGMan throttling?

https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#throttling-nodes-by-category

queue from seq 10 5 30

queue item in 1, 2, 3
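
A rough Python sketch of how a simple Slurm array spec might map onto submit-file lines.  The function name is made up for illustration.  It assumes max_materialize is an acceptable stand-in for Slurm's %N limit (it caps how many jobs are materialized in the queue at once, and needs late materialization enabled on the schedd), which is not quite the same thing as a running-job limit.

```python
def slurm_array_to_condor(spec):
    """Translate a simple Slurm --array spec ('0-3' or '0-3%2') into submit-file lines."""
    indices, _, limit = spec.partition("%")
    first, _, last = indices.partition("-")
    # A bare index such as '5' has no '-', so treat it as the range 5-5.
    items = range(int(first), int(last or first) + 1)
    lines = []
    if limit:
        # Assumption: capping materialized jobs approximates Slurm's %N throttle.
        lines.append("max_materialize = %d" % int(limit))
    lines.append("queue item in " + ", ".join(str(i) for i in items))
    return lines


print("\n".join(slurm_array_to_condor("0-3%2")))
```

Step sizes (0-15:4) and comma lists are not handled here; the queue from seq form above would cover the stepped case.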

Bug: All on one core

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
66713 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 22
66714 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 24
66715 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.01 sleep 24
66719 krowe 20 0 4364 348 280 S 0.0 0.0 0:00.02 sleep 2
66722 krowe 20 0 4364 352 280 S 0.0 0.0 0:00.02 sleep 22

From jrobnett@nrao.edu Tue Nov 10 16:38:18 2020

As (bad) luck would have it I had some jobs running where I forgot to set the number of cores, so they triggered the behavior.

Sshing into the node I see three processes sharing the same core and the following for the 3 python processes:

bash-4.2$ cat /proc/113531/status | grep Cpus
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Cpus_allowed_list:      0

If I look at another node with 3 processes where they aren't sharing the same core I see:

bash-4.2$ cat /proc/248668/status | grep Cpu
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00555555
Cpus_allowed_list:      0,2,4,6,8,10,12,14,16,18,20,22

Dec. 8, 2020 krowe: I launched five sqrt(rand()) jobs and each one landed on its own CPU. 

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND P
48833 krowe 20 0 12532 1052 884 R 100.0 0.0 9:20.95 a.out 4
49014 krowe 20 0 12532 1052 884 R 100.0 0.0 8:34.91 a.out 5
48960 krowe 20 0 12532 1052 884 R 99.6 0.0 8:54.40 a.out 3
49011 krowe 20 0 12532 1052 884 R 99.6 0.0 8:35.00 a.out 1
49013 krowe 20 0 12532 1048 884 R 99.6 0.0 8:34.84 a.out 0

The masks aren't restricting them to specific CPUs, so I have not yet been able to reproduce James's problem.

st077.aoc.nrao.edu]# grep -i cpus /proc/48960/status
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-447
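
The Cpus_allowed values above are comma-separated 32-bit hex words, most significant word first.  A minimal Python sketch for decoding one into a list of CPU ids, handy when comparing nodes; the function name is made up for illustration.

```python
def cpus_from_mask(mask):
    """Decode a Cpus_allowed hex mask from /proc/<pid>/status into CPU ids."""
    # Words are comma-separated with the most significant first, so dropping
    # the commas yields one big hexadecimal number.
    value = int(mask.replace(",", ""), 16)
    # Bit N set means the process may run on CPU N.
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


print(cpus_from_mask("00000000,00000001"))  # the pinned case: [0]
print(cpus_from_mask("00000000,00555555"))  # even CPUs 0 through 22
```

Fourteen words of ffffffff decode to CPUs 0 through 447, which agrees with the Cpus_allowed_list reported on the unrestricted node.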

We can reproduce this without HTCondor.  So this is either being caused by our mpicasa program or the openmpi libraries it uses.  Even better, I can reproduce this with a simple shell script executed from two shells at the same time on the same host.  Another MPI implementation (mvapich2) didn't show this problem.

#!/bin/sh
export PATH=/usr/lib64/openmpi/bin:${PATH}
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:${LD_LIBRARY_PATH}
mpirun -np 2 /users/krowe/work/doc/all/openmpi/busy/busy
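
To see what affinity mpirun hands to its children, one option is a tiny reporter run in place of the busy program.  This is only a sketch: it is Linux-only (os.sched_getaffinity), and the function name is made up for illustration.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids the given process may run on (0 = this process)."""
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("pid %d may run on %d CPU(s): %s" % (os.getpid(), len(cpus), cpus))
```

If two simultaneous mpirun invocations both report the same single CPU, that would point at the launcher's binding rather than at HTCondor.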



Answered Questions:

10/20/20 08:54:36 From submit: ERROR: on Line 9 of submit file:
10/20/20 08:54:36 From submit: Submit:-1:Error "", Line 0, Include Depth 1: can't open file
10/20/20 08:54:36 From submit:
10/20/20 08:54:36 From submit: ERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 failed while reading from pipe.
10/20/20 08:54:36 Read so far: Submitting job(s)ERROR: on Line 9 of submit file: Submit:-1:Error "", Line 0, Include Depth 1: can't open fileERROR: Failed to parse command file (line 9).
10/20/20 08:54:36 ERROR: submit attempt failed
10/20/20 11:58:58 From submit: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 failed while reading from pipe.
10/20/20 11:58:58 Read so far: Submitting job(s)ERROR on Line 13 of submit file: $CHOICE() macro: myindex is invalid index!
10/20/20 11:58:58 ERROR: submit attempt failed

Nodesfree

How can one see nodes that are entirely unclaimed?

SOLUTION: condor_status -const 'PartitionableSlot && Cpus == TotalCpus'
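
A small Python sketch of what that constraint selects, using made-up slot ads (the machine names and numbers are hypothetical).  It assumes one partitionable slot per machine, where Cpus is the count still unallocated, so Cpus == TotalCpus means nothing has been carved off.

```python
def fully_unclaimed(slots):
    """Return machine names whose partitionable slot still holds every CPU."""
    return [
        ad["Machine"]
        for ad in slots
        if ad.get("PartitionableSlot") and ad["Cpus"] == ad["TotalCpus"]
    ]


# Hypothetical ads: one idle node, one partly used node, one dynamic slot.
sample = [
    {"Machine": "node-a", "PartitionableSlot": True, "Cpus": 16, "TotalCpus": 16},
    {"Machine": "node-b", "PartitionableSlot": True, "Cpus": 12, "TotalCpus": 16},
    {"Machine": "node-b", "PartitionableSlot": False, "Cpus": 4, "TotalCpus": 16},
]
print(fully_unclaimed(sample))
```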


HERA queue

I want a proper subset of machines to be reserved for the HERA project.  These machines will only run HERA jobs, and HERA jobs will only run on these machines.  This seems to work, but is there a better way?

machine config:

HERA = True
STARTD_ATTRS = $(STARTD_ATTRS) HERA
START = ($(START)) && (TARGET.partition =?= "HERA")

submit file:

requirements = (HERA == True)
+partition = "HERA"

SOLUTION: Yes, this is good.  Submit transforms could also be set on herapost-master (the submit host).

https://htcondor.readthedocs.io/en/latest/misc-concepts/transforms.html?highlight=submit%20transform