...
- As we feared, referencing the cache of convolution functions (cfcache) directly from /staging performed poorly. This is due to an fstat() pathology that performs badly on distributed filesystems. Jobs ran 3 to 4 times faster when we copied cfcache from /staging to local disk. I ran a small-data-set test with full parameters at CHTC that copied cfcache from /staging to local disk: step05 took 16.7 hours instead of the 56.8 hours it had taken with cfcache on /staging.
- I had a job killed because it exceeded 72 hours, even though I set +LongJobs = true in the submit file.
- 2385.0 krowe 9/22 20:43 Error from slot1_1@e2008.chtc.wisc.edu: Job failed to complete in 72 hrs
- ANSWER: the knob is singular: +LongJob = true
- What are the options for setting up HTCondor to both flock to CHTC and annex to AWS? Multiple submit hosts? Multiple central managers (CMs)? etc.
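
The cfcache workaround from the first item (copy the cache from /staging to local disk before the job uses it) could be sketched roughly as below. This is only an illustration, not the script we actually ran: the function name stage_cfcache_locally and both of its arguments are hypothetical, and it assumes the job wrapper is free to run a little Python before the imaging step starts.

```python
import shutil
from pathlib import Path


def stage_cfcache_locally(staging_cfcache: Path, scratch_dir: Path) -> Path:
    """Copy a cfcache directory into local scratch and return the local path.

    staging_cfcache: the cfcache directory on the shared filesystem
                     (hypothetical argument; the real path is site-specific).
    scratch_dir:     a directory on the execute node's local disk.
    """
    local_copy = scratch_dir / staging_cfcache.name
    # dirs_exist_ok=True (Python 3.8+) lets a restarted job reuse its scratch dir.
    shutil.copytree(staging_cfcache, local_copy, dirs_exist_ok=True)
    return local_copy


# Possible use inside a job wrapper (paths are placeholders, not our real ones):
#   import os
#   scratch = Path(os.environ["_CONDOR_SCRATCH_DIR"])
#   local_cf = stage_cfcache_locally(Path("<cfcache on /staging>"), scratch)
#   ...then point the imaging step at local_cf instead of the /staging copy.
```

The point of copying once up front is that all subsequent file-status calls hit local disk rather than the distributed filesystem.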
...