...

  • As we feared, referencing the cache of convolution functions (cfcache) directly from /staging performed poorly.  The cause is an fstat() pathology that hits distributed filesystems hard.  Jobs ran 3 to 4 times faster when we copied cfcache from /staging to local disk.  In a small data set test with full parameters at CHTC, copying cfcache from /staging to local disk cut step05 from 56.8 hours to 16.7 hours.
  • I had a job killed because it exceeded 72 hours, even though I set +LongJobs = true in the submit file.
    • 2385.0 krowe 9/22 20:43 Error from slot1_1@e2008.chtc.wisc.edu: Job failed to complete in 72 hrs
    • ANSWER: the knob is singular: +LongJob = true
  • What are the options for setting up HTCondor to both flock to CHTC and annex to AWS? Multiple submit hosts?  Multiple CMs? etc.
    • ANSWER: The philosophy is for everyone to submit in one place and let HTCondor sort out where each job goes.
    • CHTC flocks annex jobs to a different CM that actually starts the annex.
      • Submit the annex job on the SM.  It then flocks to a different CM that can create the annex.
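
The cfcache finding in the first item above could be handled with a small stage-in step at the start of each job. The following is a minimal sketch, not the script we actually ran: the function name `stage_cfcache_locally` and its arguments are hypothetical, and it assumes the job runs under HTCondor, which sets `_CONDOR_SCRATCH_DIR` to the job's local scratch directory (outside a job it falls back to the system temp directory).

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_cfcache_locally(staging_cfcache, scratch_dir=None):
    """Copy a cfcache directory to local disk and return the local path.

    Hypothetical helper: staging_cfcache is the cfcache directory on the
    shared filesystem, scratch_dir optionally overrides the destination.
    """
    src = Path(staging_cfcache)
    if not src.is_dir():
        raise FileNotFoundError(f"cfcache not found: {src}")

    # HTCondor sets _CONDOR_SCRATCH_DIR inside a running job; elsewhere
    # fall back to the system temp directory (an assumption of this sketch).
    base = Path(scratch_dir or os.environ.get("_CONDOR_SCRATCH_DIR",
                                              tempfile.gettempdir()))
    dst = base / src.name

    # If the cfcache already lives in scratch there is nothing to copy.
    if dst.resolve() == src.resolve():
        return dst

    # Remove any stale copy so the local cache matches the staged one.
    if dst.exists():
        shutil.rmtree(dst)

    # One bulk copy up front, so later per-file stat calls hit local disk
    # instead of the distributed filesystem.
    shutil.copytree(src, dst, symlinks=True)
    return dst
```

The job would call this once before the imaging step and then pass the returned local path wherever the cfcache location is configured, instead of the /staging path.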

...