...

  • As we feared, referencing the cache of convolution functions (cfcache) directly from /staging performed poorly.  This is due to an fstat() pathology that performs badly on distributed filesystems.  Jobs ran 3 to 4 times faster when we copied cfcache from /staging to local disk.  I ran a small data set test with full parameters at CHTC that copied cfcache from /staging to local disk, and step05 took only 16.7 hours instead of the 56.8 hours it had taken using cfcache on /staging (about 3.4 times faster).
  • I had a job killed because it exceeded 72 hours even though I set +LongJobs = true in the submit file.
    • 2385.0 krowe 9/22 20:43 Error from slot1_1@e2008.chtc.wisc.edu: Job failed to complete in 72 hrs
    • ANSWER: the knob is singular: +LongJob = true
  • What are the options for setting up HTCondor to both flock to CHTC and annex to AWS?  Multiple submit hosts?  Multiple CMs?  etc.
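
The cfcache copy described in the first item above could be done at the start of the job along these lines. This is only a sketch, not the script we actually ran: the function name stage_cfcache is made up for illustration, and it assumes cfcache is a directory that fits on the execute node's local disk. _CONDOR_SCRATCH_DIR is the environment variable HTCondor sets to the job's local scratch directory.

```python
import os
import shutil
from pathlib import Path


def stage_cfcache(staging_path, scratch_dir=None):
    """Copy a cfcache directory to local disk and return the local path.

    stage_cfcache is a hypothetical helper name used only for this sketch.
    """
    # HTCondor sets _CONDOR_SCRATCH_DIR to the job's local scratch directory;
    # fall back to the current directory when running outside a job.
    scratch = Path(scratch_dir or os.environ.get("_CONDOR_SCRATCH_DIR", "."))
    src = Path(staging_path)
    dest = scratch / src.name
    # dirs_exist_ok (Python 3.8+) lets the copy be rerun without failing
    # if the destination directory already exists.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

A job wrapper would call stage_cfcache() with the real cfcache location under /staging before the imaging step starts, then point that step at the returned local path so the repeated fstat() calls hit local disk rather than the distributed filesystem.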

...