Definitions
cpuset: The set of cores on which the job is allowed to run. On a dual-processor machine running Linux, all the even-numbered cores are on one socket and the odd-numbered cores are on the other socket. E.g.
cpuset=0,2,4,6,8,10,12,14 # all the cores on one 8core socket.
cpuset=0,1,2,3,4,5,6,7 # 4 cores on one socket and four on the other.
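The even/odd convention above can be turned into a small helper for labelling a cpuset by its socket split (e.g. 8-0 or 4-4). This is only a sketch: the function names are made up for illustration, and it assumes the even/odd socket numbering described above, which the notes further down say does not hold for AMD hosts.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as '0-1,3-5,7' into a sorted list of core ids."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def socket_split(spec):
    """Return (even_count, odd_count) for a cpuset string.

    ASSUMPTION: even-numbered cores are on one socket and odd-numbered
    cores on the other, as in the definition above.
    """
    cores = parse_cpuset(spec)
    even = sum(1 for core in cores if core % 2 == 0)
    return even, len(cores) - even


if __name__ == "__main__":
    for spec in ("0,2,4,6,8,10,12,14", "0,1,2,3,4,5,6,7"):
        print(spec, "->", "%d-%d" % socket_split(spec))
```

The helper also accepts the range notation that Torque reports, such as 0-1,3-5,7,9,11,13.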
Conclusions
- casa-5 seems to produce the same image no matter what the cpuset is.
https://docs.google.com/spreadsheets/d/1aKCzeCOj1-50mC7I4fN2eMupPrfR6OH-6LN4jSK9LtQ/edit#gid=670565607
- casa-pipeline-release-5.6.1-8.el7 and casa-6.1.2-7-pipeline-2020.1.0.36 both use the same version of OpenMPI (1.10.4)
/home/casa/packages/RHEL7/release/casa-pipeline-release-5.6.1-8.el7/lib/mpi/bin/mpirun -version
/home/casa/packages/RHEL7/release/casa-6.1.2-7-pipeline-2020.1.0.36/lib/mpi/bin/mpirun -version
- With casa-6, the resulting image is dependent on the cpuset used.
https://docs.google.com/spreadsheets/d/1aKCzeCOj1-50mC7I4fN2eMupPrfR6OH-6LN4jSK9LtQ/edit#gid=2101591390
https://docs.google.com/spreadsheets/d/1aKCzeCOj1-50mC7I4fN2eMupPrfR6OH-6LN4jSK9LtQ/edit#gid=93665106
- When using 8 cores and mpicasa -n 9, casa-6 always produces the same image regardless of the cpuset.
https://docs.google.com/spreadsheets/d/1aKCzeCOj1-50mC7I4fN2eMupPrfR6OH-6LN4jSK9LtQ/edit#gid=1339676938
- jobs jr-batch.9 and jr-nmpost005b.2 show that -n 9 is the same as -n $machinefile when ppn is 9
- Running a manual job with access to all the cores (no cpuset) and -n 9 produces the same result as jr-nmpost005.55 (all 8 even cores).
Though I only have a few data points.
- nmpost005, nmpost006, and nmpost072 produce the same images given the same input and using the same cpuset.
- The cores chosen by Torque don't seem to change for a given host, though I only have a few data points. If they did vary once in a while, it could explain the occasional differences I saw in my end-to-end runs.
https://docs.google.com/spreadsheets/d/1aKCzeCOj1-50mC7I4fN2eMupPrfR6OH-6LN4jSK9LtQ/edit#gid=1234076945
- Torque seems to choose different cpusets for different hosts. E.g. nmpost005 gets 0-1,3-5,7,9,11,13 while nmpost006 gets 0-2,4-6,8,10,12. This cpuset doesn't seem to change after a reboot or after smaller jobs are run on the host. I have no idea where Torque is saving this cpuset between jobs but it seems to be doing just that. This could produce different images if you are using something like ppn:8 and -n 8 or -n machinefile, and you may think the result is host dependent when actually it may just be Torque choosing different cpusets for you.
- It seems that the specific cores chosen don't dictate the image created, but the number of cores on each socket does.
- It is looking like hardware doesn't really matter. It's the cpuset.
Using deconvolver='hogbom' doesn't seem to produce different images based on cpuset but deconvolver='mtmfs' does.
Questions
QUESTION: Does the number of threads per process (ps -T <pid>) change with different cpusets?
QUESTION: Check whether nodescheduler gives a whole NUMA node. Also test whether nodescheduler + mpicasa -n 8 gives the same image as the good -n 8 images (i.e. 8-0, not 6-2 or 5-3).
ANSWER: Except for what I am guessing is a Torque bug on nmpost060, all the other nodes honored the numanode in nodescheduler and gave me or other users 8 cores on the same socket.
- nmpost011/8-15: cpuset.cpus: 1,5,7,11,13,17,19,23 cpuset.mems: 1 (dual 12core sockets)
- nmpost013/0-7: cpuset.cpus: 0,4,6,10,12,16,18,22 cpuset.mems: 0 (dual 12core sockets)
- nmpost021/0-7: cpuset.cpus: 0,2,4,6,8,10,12,14 cpuset.mems: 0 (dual 16core sockets)
- nmpost033/0-7: cpuset.cpus: 0,2,4,6,8,10,12,14 cpuset.mems: 0 (dual 16core sockets)
- nmpost033/8-15: cpuset.cpus: 1,3,5,7,9,11,13,15 cpuset.mems: 1 (dual 16core sockets)
- nmpost036/0-7: cpuset.cpus: 0,2,6,8,10,12,16,18 cpuset.mems: 0 (dual 20core sockets)
- nmpost036/8-15: cpuset.cpus: 1,3,7,9,11,13,17,19 cpuset.mems: 1 (dual 20core sockets)
- nmpost060/0-7: cpuset.cpus: 0,2 cpuset.mems: 0 (dual 16core sockets) Why is this cpuset only 0,2 when Torque's L_Request = -L tasks=1:lprocs=8:memory=92gb:place=numanode, which looks like nodescheduler, but cpuset_string = nmpost060:0,2?
QUESTION: Running jobs with nodescheduler
ANSWER: Using nodescheduler, which provides you with 8 cores, to reserve a node and then manually running casa with either -n 8 or -n 9 produces images that are pixel identical to what you would get with a hand-crafted cpuset of 8 cores on the same socket and using -n 8 or -n 9. In other words, if you have been using nodescheduler to reserve nodes, I don't think your casa images are suspect.
QUESTION: Test with 4-way parallelization whether the 4-0, 0-4, 2-2, 1-3, 3-1 distribution impacts the resulting image using -n 4. Also try -n 5.
ANSWER
- Using 4 cores and -n 4: 4-0 and 0-4 produce a different image than 2-2, 1-3 and 3-1.
- Using 4 cores and -n 5: all permutations tested (4-0, 0-4, 2-2, 1-3, 3-1) produce the same image.
QUESTION: Test an ALMA data set; examine oussid.s12_0.2276_444_53712_sci.spw16_18_20_22.cont.I.iter1.image for comparison
ANSWER: I can't use John Tobin's compare script so I have had to use cmp on individual files.
QUESTION: Can I get different images using batch as well as manual?
ANSWER: Yes. Using ppn=4 and -n 4 and tracking the cores given in the cgroup, I was able to see the same sort of variance in images. For example, a job with 4 even cores produced a different image than a job with 3 even and 1 odd core. So it doesn't look like Torque does anything significantly different than my manual tests, and there is no guarantee as to which cores Torque will give you.
QUESTION: We need to track down where divergences are occurring in the imaging. To date we've been looking at the final image, now we need to see is it in the dirty image, the weights, the cleaned image etc to try and isolate the cause.
- Select data set (jr-template)
- Select a parallelization scheme. I'd suggest 8 cores with -n 8 and then use an 8-0 and 6-2 mix to get a good and bad image
- Set up imaging so that it preserves the residual pre-normalization and post normalization, the weights and sumwt and the cleaned image.
- Limit imaging to one imaging cycle
So I have run two jobs (div-1 and div-2) with all even cores (8-0) and they are binary identical except for casa-*.log, casa.out and imaging.sh. (div is short for divergence. I needed a new name.) This is just a baseline and it works as expected. Next, I ran div-3 which uses 2 even and 6 odd cores (2-6). There, the prenorm residuals in cycle0 are identical between div-1 and div-3. The prenorm residuals in cycle1 are all different between div-1 and div-3, as are the top-level debug-awp.residual.tt0 and debug-awp.image.tt0.
All cycle0 products are binary identical between div-1 and div-3.
In debug-awp.workdir_prenorm_cycle1 the only directories that are different between div-1 and div-3 are debug-awp.n?.residual.tt? and debug-awp.n?.model.tt?
Each debug-awp.n?.residual.tt0 in div-1 is binary different than every debug-awp.n?.residual.tt0 step in div-3
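The binary comparisons above were done with cmp on individual files. A directory-wide version of the same check could look like the sketch below. The function name and the ignore parameter are hypothetical, and it is not the compare script mentioned earlier; it only reports which files differ byte for byte or exist in just one tree.

```python
import filecmp
import fnmatch
import os


def differing_files(dir_a, dir_b, ignore=()):
    """Return relative paths that differ byte-for-byte or exist in only one tree.

    `ignore` is a tuple of shell-style patterns matched against file names,
    e.g. logs that are expected to differ between runs.
    """
    def walk(root):
        found = set()
        for base, _dirs, files in os.walk(root):
            for name in files:
                found.add(os.path.relpath(os.path.join(base, name), root))
        return found

    files_a, files_b = walk(dir_a), walk(dir_b)
    diffs = []
    for rel in sorted(files_a | files_b):
        name = os.path.basename(rel)
        if any(fnmatch.fnmatch(name, pattern) for pattern in ignore):
            continue
        if rel not in files_a or rel not in files_b:
            diffs.append(rel)
        # shallow=False forces a content comparison instead of trusting
        # size and modification time.
        elif not filecmp.cmp(os.path.join(dir_a, rel),
                             os.path.join(dir_b, rel), shallow=False):
            diffs.append(rel)
    return diffs
```

For the runs described here, a call might be differing_files("div-1", "div-3", ignore=("casa-*.log", "casa.out", "imaging.sh")), assuming div-1 and div-3 are the job directories.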
QUESTION: Re-run with CASA5, watch running process and see if it's multithreaded during imaging
ANSWER: I ran two jobs on nmpost067 (a dual 12core machine with no manually defined cpuset).
CASA-5 seems to max out at 8 threads per python process. Here is top output
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND nTH
283497 krowe 20 0 309188 23924 11228 S 0.0 0.0 0:00.06 Xvfb 17
283499 krowe 20 0 309188 23912 11228 S 0.0 0.0 0:00.05 Xvfb 17
283500 krowe 20 0 309188 23924 11228 S 0.0 0.0 0:00.06 Xvfb 17
283502 krowe 20 0 309188 23916 11228 S 0.0 0.0 0:00.06 Xvfb 17
283505 krowe 20 0 309188 23924 11228 S 0.0 0.0 0:00.04 Xvfb 17
283507 krowe 20 0 309188 23916 11228 S 0.0 0.0 0:00.06 Xvfb 17
283508 krowe 20 0 309188 23916 11228 S 0.0 0.0 0:00.06 Xvfb 17
282789 krowe 20 0 1924676 184516 83300 S 0.0 0.0 0:07.27 python 10
282790 krowe 20 0 3862980 2.2g 82424 S 99.7 0.4 0:36.25 python 8
282791 krowe 20 0 3207512 1.6g 82408 S 99.7 0.3 0:35.87 python 8
282792 krowe 20 0 3862964 2.2g 82432 S 99.7 0.4 0:36.07 python 8
282793 krowe 20 0 3207528 1.6g 82412 S 99.3 0.3 0:35.74 python 8
282794 krowe 20 0 3862984 2.2g 82432 S 99.3 0.4 0:36.10 python 8
282795 krowe 20 0 3207548 1.6g 82420 S 99.7 0.3 0:35.73 python 8
282796 krowe 20 0 3886436 2.2g 82084 S 99.7 0.4 0:18.52 python 8
282784 krowe 20 0 105304 3520 2520 S 0.0 0.0 0:00.16 mpirun 2
283407 krowe 20 0 1617236 91660 888 S 0.0 0.0 0:00.00 python 2
283408 krowe 20 0 1617200 91660 888 S 0.0 0.0 0:00.00 python 2
283409 krowe 20 0 1617188 93676 896 S 0.0 0.0 0:00.00 python 2
283411 krowe 20 0 1617240 91652 888 S 0.0 0.0 0:00.00 python 2
283413 krowe 20 0 1617236 91656 888 S 0.0 0.0 0:00.00 python 2
283414 krowe 20 0 1617236 93688 888 S 0.0 0.0 0:00.00 python 2
283415 krowe 20 0 1617232 91648 888 S 0.0 0.0 0:00.00 python 2
CASA-6 seems to max out at 39 threads per python3 process. Here is top output
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND nTH
285759 krowe 20 0 4866784 2.2g 69296 S 93.7 0.4 0:35.54 python3 39
285760 krowe 20 0 4964980 2.2g 69300 S 155.6 0.4 0:38.87 python3 39
285761 krowe 20 0 4932320 2.2g 69304 S 75.2 0.4 0:35.55 python3 39
285762 krowe 20 0 4997844 2.2g 69280 S 159.9 0.4 0:38.60 python3 39
285763 krowe 20 0 4997856 2.2g 69300 S 70.5 0.4 0:35.34 python3 39
285764 krowe 20 0 4866776 2.2g 69300 S 160.9 0.4 0:39.06 python3 39
285765 krowe 20 0 4374072 2.2g 68888 S 82.1 0.4 0:14.55 python3 24
285909 krowe 20 0 311248 23936 11232 S 0.0 0.0 0:00.05 Xvfb 17
285911 krowe 20 0 311248 23932 11232 S 0.0 0.0 0:00.04 Xvfb 17
285913 krowe 20 0 311248 23932 11232 S 0.0 0.0 0:00.04 Xvfb 17
285914 krowe 20 0 311248 23932 11232 S 0.0 0.0 0:00.04 Xvfb 17
285916 krowe 20 0 311248 23936 11232 S 0.0 0.0 0:00.04 Xvfb 17
285917 krowe 20 0 311248 23932 11232 S 0.0 0.0 0:00.03 Xvfb 17
285922 krowe 20 0 311248 23936 11232 S 0.0 0.0 0:00.04 Xvfb 17
285758 krowe 20 0 1983876 205980 71444 S 0.3 0.0 0:04.69 python3 12
285756 krowe 20 0 105276 3468 2480 S 0.0 0.0 0:00.17 mpirun 2
277642 krowe 20 0 178140 2760 1200 S 0.0 0.0 0:00.09 sshd 1
277647 krowe 20 0 121564 4180 1900 S 0.0 0.0 0:00.28 bash 1
285720 krowe 20 0 115244 1492 1268 S 0.0 0.0 0:00.00 imagin+ 1
285729 krowe 20 0 115240 1528 1280 S 0.0 0.0 0:00.00 xvfb-r+ 1
285741 krowe 20 0 101684 8072 3448 S 0.0 0.0 0:00.06 Xvfb 1
QUESTION: run with OMP_NUM_THREADS=1 explicitly set in the script
ANSWER: setting OMP_NUM_THREADS=1 in the imaging.sh script still produces different images with different cpusets. I.e. no change. And I see OMP_NUM_THREADS=1 in /proc/PID/environ on the parent CASA process as well as all the python child processes even when I don't set OMP_NUM_THREADS in the script or in my .bashrc.
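Checking /proc/PID/environ by hand for every python child gets tedious. A sketch of the same lookup in Python follows; the function names are illustrative only. Note that /proc/PID/environ holds the environment the process was started with, so a variable changed later from inside the process would not show up there.

```python
def parse_environ(raw):
    """Turn the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def omp_num_threads(pid):
    """Return OMP_NUM_THREADS as a string for a running process, or None if unset.

    Linux only: reads /proc/<pid>/environ, which requires permission to
    inspect the target process.
    """
    with open("/proc/%d/environ" % pid, "rb") as handle:
        return parse_environ(handle.read()).get("OMP_NUM_THREADS")
```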
QUESTION: run with deconvolver=mtmfs and nterms=1
ANSWER: This exhibits the cpuset issue.
QUESTION: run with deconvolver=multiscale
ANSWER: This exhibits the cpuset issue. It does not show the cpuset issue with the casa-5 fft libraries.
QUESTION: Do AMD processors show the same cpuset issue?
ANSWER: Yes. While AMD numbers its cores differently, an 8-0 job produces a different image than a 2-6 job.
QUESTION: CAS-13313 with the compile options as similar to casa-5 as possible.
ANSWER: This new build of CASA with "as close a practical to duplicating the CASA 5 build flags" exhibits the cpuset issue.
QUESTION: Is libatlas to blame for the cpuset issue?
ANSWER: Ville compiled a version of casa without libatlas. I have installed it here /lustre/aoc/projects/vlass/krowe/casa-CAS-13375-2 and it exhibits the cpuset issue.
QUESTION: Does the number of threads per process change based on the cpuset?
ANSWER: Yes. Perhaps this is actually what is generating different images.
- 8 even and 0 odd cores: max of 23 threads per python3 process
- 2 even and 6 odd cores: max of 19 threads per python3 process
- 3 even and 5 odd cores: max of 17 threads per python3 process
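The per-process thread counts above correspond to the nTH column in top. The same number is exposed in the Threads: field of /proc/PID/status, so the maximum can be sampled with a short script. This is a sketch with made-up function names and arbitrary sampling defaults, not the method used to collect the numbers above.

```python
import time


def parse_thread_count(status_text):
    """Extract the Threads: value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no Threads: field found")


def thread_count(pid):
    """Current number of threads in a process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_thread_count(handle.read())


def max_thread_count(pid, samples=10, interval=1.0):
    """Poll a process and return the highest thread count seen.

    `samples` and `interval` are arbitrary defaults; a short-lived spike
    between polls would be missed.
    """
    peak = 0
    for _ in range(samples):
        peak = max(peak, thread_count(pid))
        time.sleep(interval)
    return peak
```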
QUESTION: Since CASA-5 doesn't suffer this cpuset problem, what if we run CASA-6 with CASA-5's FFT libraries?
ANSWER: With CASA-5's FFT libraries, CASA-6 doesn't suffer the cpuset issue. I copied /home/casa/packages/RHEL7/release/casa-pipeline-release-5.6.1-8.el7/lib/libfftw3* to my installation of casa-6.1.1-10-pipeline-2020.1.0.36/lib, overwriting the libfft files that were there, then tested my installation of casa-6.1.1-10-pipeline-2020.1.0.36 and was unable to reproduce the cpuset issue.
QUESTION: Does CASA-6 honor OMP_NUM_THREADS?
ANSWER: No and yes. If you set OMP_NUM_THREADS=0, the python3 processes spawned by mpicasa (e.g. -n 8 spawns eight python3 processes) will have more than 0 threads. But with cpuset=0,2,4,6,8,10,12,14 and -n 8 there seems to be a pattern forming after OMP_NUM_THREADS=4
- OMP_NUM_THREADS=0 max threads seen for a python3 process is 23
- OMP_NUM_THREADS=1 max threads seen for a python3 process is 23
- OMP_NUM_THREADS=2 max threads seen for a python3 process is 23
- OMP_NUM_THREADS=4 max threads seen for a python3 process is 24
- OMP_NUM_THREADS=8 max threads seen for a python3 process is 28
- OMP_NUM_THREADS=16 max threads seen for a python3 process is 36
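One way to read the pattern in these numbers: from OMP_NUM_THREADS=4 upward, the maximum thread count is the OMP_NUM_THREADS value plus a constant, while smaller values sit at a floor of 23. The short check below only restates the measurements listed above; with three data points this is a guess at the trend, not a confirmed rule.

```python
# Maximum threads per python3 process, keyed by OMP_NUM_THREADS,
# copied from the measurements listed above.
observed = {0: 23, 1: 23, 2: 23, 4: 24, 8: 28, 16: 36}

# Subtract the OMP_NUM_THREADS value from each peak for settings >= 4.
offsets = {n: peak - n for n, peak in observed.items() if n >= 4}
print(offsets)  # {4: 20, 8: 20, 16: 20}
```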
QUESTION: What determines thread count in CASA-5?
ANSWER:
- Not cpuset
- Not the -n argument to mpicasa.
- OMP_NUM_THREADS, but it seems to only increase the thread count if it is greater than 1.
- OMP_NUM_THREADS=1 max threads seen for a python process is 10
- OMP_NUM_THREADS=2 max threads seen for a python process is 11
- OMP_NUM_THREADS=4 max threads seen for a python process is 15
- OMP_NUM_THREADS=8 max threads seen for a python process is 27
- OMP_NUM_THREADS=16 max threads seen for a python process is 51
QUESTION: Does the number of threads per process vary based on cpuset size?
ANSWER: Yes
- cpuset=0,2,4,6 -n 5 max threads for python3 is 15
- cpuset=0,2,4,6,8,10,12,14 -n 5 max threads for python3 is 23
- no cpuset defined (24 cores) -n 5 max threads for python3 is 39
QUESTION: Are the system fft libraries (/usr/lib64/libfft*) used by casa-6 or casa-5?
ANSWER: No. I used ls --time=atime --sort=time -r /usr/lib64/libfft* to see if the access times change after running CASA. They did not. But when running my own installation of CASA6, the fft libraries in that installation did update their access time.
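The access-time check can also be scripted as a before/after snapshot. The sketch below uses hypothetical function names. One caveat: whether access times are updated at all depends on mount options. A noatime mount never updates them, and relatime only updates them in some cases, so an unchanged atime is weaker evidence than a changed one.

```python
import glob
import os


def atime_snapshot(pattern):
    """Map each file matching a glob pattern to its access time in nanoseconds."""
    return {path: os.stat(path).st_atime_ns for path in glob.glob(pattern)}


def accessed_since(before, after):
    """Return the sorted paths whose access time advanced between two snapshots.

    Files that appear in only one snapshot are not reported.
    """
    return sorted(path for path, atime in after.items()
                  if path in before and atime > before[path])
```

A typical use would be to take atime_snapshot("/usr/lib64/libfft*") before a CASA run, take it again afterwards, and pass both to accessed_since.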