...
Current Questions
output_destination and stdout/stderr
It used to be that once you set output_destination = someplugin:// then that plugin was responsible for transferring all files, even stdout and stderr. That no longer seems to be the case as of version 23. My nraorsync transfer plugin has code in it looking for _condor_stdout and _condor_stderr as arguments but never sees them with version 23. The stdout and stderr files are copied back to the submit directory instead of being transferred by my plugin.
This is a change in behavior. I am not sure whether it affects us adversely, but can it be reverted?
ANSWER: from Greg "After some archeology, it turns out that the change so that a file transfer plugin requesting to transfer the whole sandbox no longer sees stdout/stderr is intentional, and was asked for by several users. The current workaround is to explicitly list the plugin in the stdout/stderr lines of the submit file, e.g."
output = nraorsync://some_location/stdout
error = nraorsync://some_location/stderr
This seems like it should work but my plugin produces errors. Probably my fault.
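For context, here is a minimal sketch of the upload side of a multifile transfer plugin, assuming the standard -classad/-infile/-outfile/-upload invocation protocol; the argument handling here is illustrative and this is not the actual nraorsync code. With a version 23 AP, _condor_stdout and _condor_stderr simply never appear in the infile unless output/error name the plugin URL as above.
import sys
import classad

def parse_args(argv):
    # Crude parse of: -infile <file> -outfile <file> [-upload]
    opts = {'upload': '-upload' in argv}
    for flag in ('-infile', '-outfile'):
        if flag in argv:
            opts[flag.lstrip('-')] = argv[argv.index(flag) + 1]
    return opts

def main():
    opts = parse_args(sys.argv[1:])
    # Each ad in the infile describes one transfer request.
    with open(opts['infile']) as f:
        for ad in classad.parseAds(f):
            print("would transfer", ad['LocalFileName'], "to", ad['Url'])

main()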
tokens and collector.locate
It seems that if the submit host is HTC23, you need a user token in order for the API (nraorsync specifically) to locate the schedd.
import os
import classad
import htcondor

def upload_file():
    # Read the job ad that the starter drops into the sandbox.
    try:
        ads = classad.parseAds(open(os.environ['_CONDOR_JOB_AD'], 'r'))
        for ad in ads:
            try:
                globaljobid = str(ad['GlobalJobId'])
                print("DEBUG: globaljobid is", globaljobid)
            except Exception:
                return(-1)
    except Exception:
        return(-1)

    print("DEBUG: upload_file(): step 1\n")
    # The first #-delimited field of GlobalJobId is the submit host.
    submithost = globaljobid.split('#')[0]
    print("DEBUG: submithost is", submithost)
    collector = htcondor.Collector()
    print("DEBUG: collector is", collector)
    schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, submithost)
    print("DEBUG: schedd_ad is ", schedd_ad)

upload_file()
...
This code works if both the AP and EP are version 10. But if the AP is version 23 then it fails whether the EP is version 10 or version 23. It works with version 23 only if I have a ~/.condor/tokens.d/nmpost token.
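A workaround, assuming the pool will issue tokens to an authenticated user, is to fetch a user token into the token directory from the AP:
condor_token_fetch -token nmpost
condor_token_fetch saves the token under ~/.condor/tokens.d/ with the given file name, which matches the nmpost token referenced above.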
getenv
Did it change since 10.0? Can we still use getenv in DAGs or regular jobs?
#krowe Nov 5 2024: getenv no longer includes your entire environment as of version 10.7 or so. Instead, it only includes the environment variables you list with the "ENV GET" syntax in the .dag file.
https://git.ligo.org/groups/computing/-/epics/30
Installing version 23
I am looking at upgrading from version 10 to 23 LTS. I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only. It doesn't provide repos to install the LTS.
https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html
Output to two places
Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (which may be a subset) to a different area for the User to look at. Is there a condor way to do this? transfer_output_remaps?
ANSWER: Greg doesn't think there is a Condor way to do this. We could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
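For the record, the copy-and-remap hack might look like this in the submit description, with hypothetical file names and paths (the job itself makes products_analyst.tar as a copy of the subset):
should_transfer_files = YES
transfer_output_files = products.tar, products_analyst.tar
transfer_output_remaps = "products_analyst.tar = /lustre/analysts/products_analyst.tar; products.tar = /users/someuser/products.tar"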
Pelican?
Felipe is playing with it and we will probably want it at NRAO.
ANSWER: Greg will ask around.
Virtual memory vs RSS
It looks like condor is reporting RSS, but according to Felipe's tests the value may actually be virtual memory.
Why do I need a user token to run collector.locate against a schedd?
I was going to test this on CHTC but I can't seem to get an interactive job on CHTC anymore.
DONE: send Greg error output and security config
transfer_output_files change in version 23
My silly nraorsync transfer plugin relies on the user setting transfer_output_files = .job.ad in the submit description file to trigger the transfer of files. Then my nraorsync plugin takes over and looks at +nrao_output_files for the files to copy. But with version 23, this no longer works. I am guessing someone decided that internal files like .job.ad, .machine.ad, _condor_stdout, and _condor_stderr will no longer be transferable via transfer_output_files. Is that right? If so, I think I can work around it. Just wanted to know.
ANSWER: the starter has an exclude list and .job.ad is probably in it, and maybe it is being accessed sooner or later than before. Greg will see if there is a better, first-class way to trigger transfers.
DONE: We will use condor_transfer since it needs to be there anyway.
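A sketch of that workaround, with hypothetical names: have the job touch a file named condor_transfer before it exits and key the transfer off that, instead of an internal file like .job.ad:
transfer_output_files = condor_transfer
+nrao_output_files = "products logs"
output_destination = nraorsync://some_location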
Installing version 23
I am looking at upgrading from version 10 to 23 LTS. I noticed that y'all have a repo RPM to install condor but it installs the Feature Release only. It doesn't provide repos to install the LTS.
https://htcondor.readthedocs.io/en/main/getting-htcondor/from-our-repositories.html
ANSWER: Greg will find it and get back to me.
DONE: https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
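That could be dropped into a repo file by hand on el8 machines; something like the following (only the baseurl is from Greg, the repo id and other settings are guesses):
[htcondor-23.0-lts]
name=HTCondor 23.0 LTS (el8)
baseurl=https://research.cs.wisc.edu/htcondor/repo/23.0/el8/x86_64/release/
enabled=1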
Virtual memory vs RSS
It looks like condor is reporting RSS, but according to Felipe's tests the value may actually be virtual memory.
ANSWER: Access to the cgroup information on the nmpost cluster is good because condor is running as root, so condor reports the RSS accurately. But on systems using glideins, like PATh and OSG, condor may not have appropriate access to the cgroup, so memory reporting on those clusters may differ from memory reporting on the nmpost cluster. On glide-in jobs condor reports the virtual memory summed across all the processes in the job.
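The distinction is easy to see from /proc. A quick Linux-only check of one's own process shows how far apart the two numbers can be:
import re

def mem_kb(field):
    # Pull a field like VmSize or VmRSS (in kB) out of /proc/self/status.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith(field + ':'):
                return int(re.search(r'\d+', line).group())

print("VmSize (virtual):", mem_kb('VmSize'), "kB")
print("VmRSS (resident):", mem_kb('VmRSS'), "kB")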
...
Set up /usr/bin/mail on mcilroy so that it works. Condor will use this to send mail to root when it encounters an error. Need to submit a jira ticket to SSA. (krowe)
...
We have had many NMT VLASS nodes crash since we upgraded to RHEL8. I think the nodes were busy when they crashed. So I changed our SLOT_TYPE_1 from 100% to 95%. Is this a good idea?
ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95%, and set SLOT_TYPE_1=100% again.
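In config terms the suggestion is (e.g. in our /etc/condor/config.d/99-nrao):
RESERVED_MEMORY = 4096
SLOT_TYPE_1 = 100%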
Resubmitting Jobs
I have an example in
...
LOCAL_CONFIG_FILE = /var/run/condor/condor_config.local
Our /etc/condor/config.d/99-nrao file effectively sets the following
STARTD_ATTRS = PoolName NRAO_TRANSFER_HOST HASLUSTRE BATCH
Our /var/run/condor/condor_config.local, which is used by glidein nodes, sets the following
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN
The problem is glidein nodes don't get all the STARTD_ATTRS set by 99-nrao. They just get NRAOGLIDEIN. It is like condor_master reads 99-nrao to set its STARTD_ATTRS, then reads condor_config.local and sets STARTD_ATTRS again without expanding $(STARTD_ATTRS).
ANSWER: The last line in /var/run/condor/condor_config.local is re-writing STARTD_ATTRS. It should have $(STARTD_ATTRS) appended
...
/var/run/condor/condor_config.local, which is used by glidein nodes, sets the following
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN
The problem is glidein nodes don't get all the STARTD_ATTRS set by 99-nrao. They just get NRAOGLIDEIN. It is like condor_master reads 99-nrao to set its STARTD_ATTRS, then reads condor_config.local and sets STARTD_ATTRS again without expanding $(STARTD_ATTRS).
ANSWER: The last line in /var/run/condor/condor_config.local is re-writing STARTD_ATTRS. It should have $(STARTD_ATTRS) appended
STARTD_ATTRS = NRAOGLIDEIN
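The corrected last line, preserving the attributes from 99-nrao, would be:
STARTD_ATTRS = $(STARTD_ATTRS) NRAOGLIDEIN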
Output to two places
Some of our pipeline jobs don't set should_transfer_files=YES because they need to transfer some output to an area for Analysts to look at and some other output (which may be a subset) to a different area for the User to look at. Is there a condor way to do this? transfer_output_remaps?
ANSWER: Greg doesn't think there is a Condor way to do this. We could make a copy of the subset and use transfer_output_remaps on the copy, but that is a bit of a hack.
Pelican?
Felipe is playing with it and we will probably want it at NRAO.
ANSWER: Greg will ask around.
RHEL8 Crashing
We have had many NMT VLASS nodes crash since we upgraded to RHEL8. I think the nodes were busy when they crashed. So I changed our SLOT_TYPE_1 from 100% to 95%. Is this a good idea?
ANSWER: try using RESERVED_MEMORY=4096 (units are in Megabytes) instead of SLOT_TYPE_1=95%, and set SLOT_TYPE_1=100% again.
getenv
Did it change since 10.0? Can we still use getenv in DAGs or regular jobs?
#krowe Nov 5 2024: getenv no longer includes your entire environment as of version 10.7 or so. Instead, it only includes the environment variables you list with the "ENV GET" syntax in the .dag file.
https://git.ligo.org/groups/computing/-/epics/30
ANSWER: Yes this is true. CHTC would like users to stop using getenv=true. There may be a knob to restore the old behavior.
DONE: check out docs and remove getenv=true
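For example, in the .dag file (the variable names here are just placeholders):
ENV GET HOME PATH
which makes only HOME and PATH from the submitting environment available to the jobs.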
...