CGPACK progress, MAY-2016

Jump directly to: 4-MAY-2016 | 9-MAY-2016 | 12-MAY-2016 | 13-MAY-2016 | 15-MAY-2016 | 17-MAY-2016 | 19-MAY-2016 | 20-MAY-2016 | 23-MAY-2016

4-MAY-2016: Can now get MPI calls with TAU

It turned out I was not linking correctly against the TAU libraries. Fixed in r84 of coarrays, example problem 5pi, see the Makefile.

Anyway, I also rebuilt TAU like this:

./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download

Here are the validation results: results.html. Note that pprof now shows MPI processes.

Some new results from jumpshot. This is for 5pi on 32 images on a single 16-core node. The MPI calls are now visible, but not the arrows. Why?

jumpshot-4 pica program, 32 images, a fragment shortly after start

jumpshot-4 pica program, 32 images, a fragment at the end, low mag.

jumpshot-4 pica program, 32 images, a fragment at the end, high mag.

Here is the pprof output: pprof.text

And here are paraprof visualisations. Note, as before, there is process 0, which just does caflaunch. Clearly the runtime is dominated by the pica calculations. Communications are negligible.

paraprof main display for pi_ca run on 32 images on a single 16-core node

paraprof node 13 display for pi_ca run on 32 images on a single 16-core node

But now I have no hardware counters, because I built without PAPI. Let's try rebuilding once again with PAPI. Configure as:

./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download -papi=/cm/shared/libraries/intel_build/papi-5.3.0

The validation results: results2.html.

Moving on to the Laplace solver, problem 9laplace from the coarrays course.

Let's try the co_back2.f90 program, which partitions the picture (a 2D array) into chunks in two dimensions. I run it on 4 images, i.e. splitting into 4 chunks: 2 along dimension 1 and 2 along dimension 2.

Note that images 1 and 2 spend less time on computation, coback2 (red), and more on MPI_Barrier (orange), than images 3 and 4. I don't know why this is, or whether it is significant.

paraprof main display for 9laplace run on 4 images on a single 16-core node

This is for image 2. Note that image 2 spends less time computing and more time waiting in MPI_Barrier than image 3.

paraprof node 2 display for 9laplace run on 4 images on a single 16-core node

This is for image 3. Note that image 3 spends more time computing and less time waiting in MPI_Barrier than image 2.

paraprof node 3 display for 9laplace run on 4 images on a single 16-core node

paraprof node 3 stats display for 9laplace run on 4 images on a single 16-core node

Finally, here's the plain text output of pprof: 9lap-pprof.text.

9-MAY-2016: Trying to understand very poor speed-up of parallel Laplacian solver.

This is example problem 9laplace from the Coarrays Course.

Let's try a much bigger image, in case the problem is that there is not enough work for each image to do at each iteration, and the global sync overheads dominate the runtime.

12-MAY-2016: Some progress with Intel

I switched to a 600x400 image:

$ identify ref.pgm 
ref.pgm PNM 600x400 600x400+0+0 8-bit Grayscale DirectClass 793kb 

I use full ifort optimisation and collect optimisation reports:

-fast -qopt-report

The serial back.f90 runs in under a minute:

45.68user 0.11system 0:45.92elapsed 99%CPU (0avgtext+0avgdata 19248maxresident)k
64inputs+1600outputs (1major+1274minor)pagefaults 0swaps

I was having some Intel MPI errors. In the Intel Clusters and HPC Technology forum Intel staff asked for the output of:

export confile=nodes
cat $PBS_NODEFILE > $confile

mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1

The output is in 9lap.o4579303. It might be useful for debugging in future.

Now trying a coarray program coback1.f90. This version is not instrumented. coback1-tau.f90 is instrumented. The programs differ only in one line - the instrumented program, coback1-tau.f90, calls this TAU routine:

call TAU_PROFILE_SET_NODE(this_image())

The program splits the 2D picture array along dimension 2, so that the coarrays are long along dimension 1 (fastest changing) and short along dimension 2 (slow changing). The picture size is 400 along dimension 2, which is divisible by 2, 4, 8, 10, 16, 20, 25 and 40. Hence I use these numbers of images in this example. The uninstrumented run times are:
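The divisibility condition above is easy to check mechanically. The following is a minimal sketch, not part of coback1.f90: even_splits is a hypothetical helper that lists every image count up to a limit that divides the extent of dimension 2 evenly, together with the resulting chunk width. It lists all even divisors, including counts that were not used in the runs on this page.

```shell
#!/bin/sh
# Print "IMAGES WIDTH" for every image count in [2, MAX] that divides
# EXTENT evenly. Usage: even_splits EXTENT MAX
even_splits() {
    extent=$1
    max=$2
    n=2
    while [ "$n" -le "$max" ]; do
        if [ $((extent % n)) -eq 0 ]; then
            echo "$n $((extent / n))"
        fi
        n=$((n + 1))
    done
}

# The picture is 400 wide along dimension 2; try up to 40 images.
even_splits 400 40
```

Each output line gives a number of images and the number of columns of the picture each image would own.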

===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1.x
150.53user 12.12system 1:36.55elapsed 168%CPU (0avgtext+0avgdata 37904maxresident)k
1528inputs+1616outputs (2major+8435minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1.x
220.00user 34.39system 1:05.75elapsed 386%CPU (0avgtext+0avgdata 42592maxresident)k
0inputs+1616outputs (0major+13983minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1.x
271.20user 78.84system 0:46.67elapsed 750%CPU (0avgtext+0avgdata 59536maxresident)k
0inputs+1608outputs (0major+30295minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1.x
249.17user 89.83system 0:39.64elapsed 855%CPU (0avgtext+0avgdata 65568maxresident)k
0inputs+5640outputs (0major+33616minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1.x
473.93user 202.38system 0:51.63elapsed 1309%CPU (0avgtext+0avgdata 85680maxresident)k
0inputs+1616outputs (0major+59846minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1.x
484.81user 191.53system 0:53.82elapsed 1256%CPU (0avgtext+0avgdata 89792maxresident)k
0inputs+1608outputs (0major+57515minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1.x
477.85user 194.95system 0:53.65elapsed 1254%CPU (0avgtext+0avgdata 95456maxresident)k
0inputs+1608outputs (0major+63444minor)pagefaults 0swaps
===> coback1.x
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1.x
489.88user 446.08system 1:48.09elapsed 865%CPU (0avgtext+0avgdata 131168maxresident)k
0inputs+5664outputs (0major+96911minor)pagefaults 0swaps

The instrumented run times are:

===> coback1-tau.xtau
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1-tau.xtau
173.92user 16.97system 1:55.10elapsed 165%CPU (0avgtext+0avgdata 64384maxresident)k
1592inputs+540304outputs (2major+13253minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1-tau.xtau
246.27user 33.25system 1:12.43elapsed 385%CPU (0avgtext+0avgdata 69104maxresident)k
0inputs+1419664outputs (0major+22285minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1-tau.xtau
302.14user 71.45system 0:50.69elapsed 736%CPU (0avgtext+0avgdata 85936maxresident)k
0inputs+3586704outputs (0major+45537minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1-tau.xtau
294.58user 72.00system 0:46.36elapsed 790%CPU (0avgtext+0avgdata 91968maxresident)k
0inputs+4711504outputs (0major+57447minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1-tau.xtau
567.49user 163.53system 1:02.69elapsed 1165%CPU (0avgtext+0avgdata 112128maxresident)k
0inputs+7984424outputs (0major+89060minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1-tau.xtau
594.59user 172.27system 1:08.11elapsed 1125%CPU (0avgtext+0avgdata 116208maxresident)k
0inputs+8630880outputs (0major+94834minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1-tau.xtau
601.58user 174.49system 1:07.08elapsed 1156%CPU (0avgtext+0avgdata 121904maxresident)k
0inputs+8450952outputs (0major+92624minor)pagefaults 0swaps
===> coback1-tau.xtau    
-genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1-tau.xtau
693.98user 436.61system 2:04.40elapsed 908%CPU (0avgtext+0avgdata 157520maxresident)k
0inputs+13701584outputs (0major+139719minor)pagefaults 0swaps

This is roughly the expected pattern. The best times are with 10 images in both cases. The instrumented elapsed times are about 9% to 27% higher. However, even the best uninstrumented time, 39.64 s, is only marginally better than the serial time of 45.92 s, a speed-up of about 1.16. Let's examine the profiling results from the run with 10 images.
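The comparison can be tabulated from the elapsed times in the two listings. This is a minimal sketch: to_seconds and scaling_row are hypothetical helper names, the elapsed strings are copied by hand from the listings above, and only the "m:ss.ss" form printed by time is handled (no hours field).

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL   # keep "." as the decimal separator in awk

# Convert an "m:ss.ss" elapsed string to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

# Print: images, speed-up, parallel efficiency, instrumentation overhead in %.
# Usage: scaling_row NIMAGES SERIAL PLAIN INSTRUMENTED   (elapsed strings)
scaling_row() {
    s=$(to_seconds "$2"); p=$(to_seconds "$3"); t=$(to_seconds "$4")
    awk -v n="$1" -v s="$s" -v p="$p" -v t="$t" \
        'BEGIN { printf "%d %.2f %.3f %.1f\n", n, s / p, s / (p * n), 100 * (t - p) / p }'
}

serial=0:45.92
# Columns: images, uninstrumented elapsed, instrumented elapsed.
while read -r n plain instr; do
    scaling_row "$n" "$serial" "$plain" "$instr"
done <<EOF
2 1:36.55 1:55.10
4 1:05.75 1:12.43
8 0:46.67 0:50.69
10 0:39.64 0:46.36
16 0:51.63 1:02.69
20 0:53.82 1:08.11
25 0:53.65 1:07.08
40 1:48.09 2:04.40
EOF
```

For 10 images this prints "10 1.16 0.116 17.0", i.e. a speed-up of 1.16, a parallel efficiency of about 12% and an instrumentation overhead of 17%.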

The text results, sorted by exclusive total time, i.e.

pprof -m > pprof-10images

are in pprof-10images

On most images MPI_Win_unlock takes most time. This is clearly bad. On the first and the last images MPI_Recv takes most time, closely followed by MPI_Win_unlock.

The main paraprof window. Note the differences between images (nodes) 1 and 10, and the other images.

Paraprof, coback1-tau.f90 program, 10 images, main view

View on node (image) 3:

Paraprof, coback1-tau.f90 program, 10 images, node 3 view

View on node (image) 1:

Paraprof, coback1-tau.f90 program, 10 images, node 1 view

Conclusion - something is not very well optimised in MPI comms. It would be interesting to know exactly how sync images is mapped onto MPI calls. Perhaps jumpshot can help. I'll try this next.

13-MAY-2016: Jumpshot4 analysis of the Laplacian solver with Intel coarrays.

I concentrate on the behaviour towards the end of the program. It seems that all images except image 1 have completed their computations at about 49 s. After that, for another 10 s, they seem to be doing mostly MPI_Win_unlock. Why? Also note that the lighter green is coback1tau, the main program. Somehow it is not shown on node 10, even though the profiling results, pprof.text, show it's there:

NODE 10;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
 35.9       21,261       21,261      180000           0        118 .TAU application => coback1tau => MPI_Recv()
 35.9       21,261       21,261      180000           0        118 MPI_Recv()
 33.6       19,880       19,880 1.65962E+06           0         12 MPI_Win_unlock()
 29.1       17,183       17,183      900093           0         19 .TAU application => coback1tau => MPI_Win_unlock()
 83.5        9,819       49,407           1 1.48008E+06   49407127 .TAU application => coback1tau
 83.5        9,819       49,407           1 1.48008E+06   49407127 coback1tau
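A listing like the one above can be reduced to the single most expensive routine per node. The following is a minimal sketch, assuming the column layout shown in the excerpt (exclusive msec in column 2, routine name from column 7 onwards, callpath rows marked with "=>"). top_exclusive is a hypothetical helper, not a TAU tool, and the sample rows are copied from the excerpt.

```shell
#!/bin/sh
# For every node in pprof text output read from stdin, print the flat
# (non-callpath) entry with the largest exclusive time: node, name, msec.
top_exclusive() {
    awk '
    /^NODE /  { node = $2; sub(/;.*/, "", node); next }
    /=>/      { next }                      # skip callpath rows
    $1 ~ /^[0-9]+\.[0-9]+$/ && NF >= 7 {
        excl = $2; gsub(/,/, "", excl)      # "21,261" -> 21261
        name = $7
        for (i = 8; i <= NF; i++) name = name " " $i
        if (!(node in best) || excl + 0 > best[node]) {
            best[node] = excl + 0; who[node] = name
        }
    }
    END { for (n in best) print n, who[n], best[n] }
    ' | sort -n
}

top_exclusive <<'EOF'
NODE 10;CONTEXT 0;THREAD 0:
 35.9       21,261       21,261      180000           0        118 MPI_Recv()
 33.6       19,880       19,880 1.65962E+06           0         12 MPI_Win_unlock()
 29.1       17,183       17,183      900093           0         19 .TAU application => coback1tau => MPI_Win_unlock()
 83.5        9,819       49,407           1 1.48008E+06   49407127 coback1tau
EOF
```

On the sample rows this reports MPI_Recv() with 21261 msec for node 10. Piping a complete pprof text file into top_exclusive should give one line per node, provided the file follows the same layout.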

jumpshot4, coback1-tau.f90 program, 10 images, low mag

This is a high mag. fragment around 54.9s. The dark green bars on images 1-10 are MPI_Win_unlock. The lighter green on image 1 is coback1tau, the main program. The lavender on node 0 is caflaunch.

jumpshot4, coback1-tau.f90 program, 10 images, high mag

Here is the top of the legend, sorted by inclusive time.

jumpshot4, coback1-tau.f90 program, 10 images, legend

I still can't get paraprof 3D views working from FreeBSD. I get these errors:

 paraprof
javax.media.opengl.GLException: Error making context current
        at com.sun.opengl.impl.x11.X11GLContext.makeCurrentImpl(X11GLContext.java:141)
        at com.sun.opengl.impl.x11.X11OnscreenGLContext.makeCurrentImpl(X11OnscreenGLContext.java:69)
        at com.sun.opengl.impl.GLContextImpl.makeCurrent(GLContextImpl.java:127)
        at com.sun.opengl.impl.GLDrawableHelper.invokeGL(GLDrawableHelper.java:182)
        at javax.media.opengl.GLCanvas.maybeDoSingleThreadedWorkaround(GLCanvas.java:258)
        at javax.media.opengl.GLCanvas.display(GLCanvas.java:130)
        at javax.media.opengl.GLCanvas.paint(GLCanvas.java:142)
        at sun.awt.RepaintArea.paintComponent(RepaintArea.java:264)
        at sun.awt.X11.XRepaintArea.paintComponent(XRepaintArea.java:73)
        at sun.awt.RepaintArea.paint(RepaintArea.java:240)
        at sun.awt.X11.XComponentPeer.handleEvent(XComponentPeer.java:694)
        at java.awt.Component.dispatchEventImpl(Component.java:4725)
        at java.awt.Component.dispatchEvent(Component.java:4475)
        at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:675)
        at java.awt.EventQueue.access$300(EventQueue.java:96)
        at java.awt.EventQueue$2.run(EventQueue.java:634)
        at java.awt.EventQueue$2.run(EventQueue.java:632)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:119)
        at java.awt.EventQueue$3.run(EventQueue.java:648)
        at java.awt.EventQueue$3.run(EventQueue.java:646)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:645)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:275)
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:200)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:190)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:185)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:177)
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:138)

So I tried from MS Windows 7(?), and it worked:

Paraprof 3D profiling bar chart, 10 images.

Note that there is an inconsistency in node/image numbering in TAU, resulting in a non-symmetric comms matrix.

Paraprof communication matrix, 10 images.

This is likely caused by the fact that coarrays are implemented over MPI. So in some places the coarray Fortran 1 to N notation is used, and in other places the C 0 to N-1 notation is used.

If I disable the TAU call to explicitly set the node number:

!call TAU_PROFILE_SET_NODE(this_image())

then the comms matrix looks like this:

Paraprof communication matrix, *without* TAU_PROFILE_SET_NODE, 10 images.

However, now the pprof results are weird. Only 10 nodes are shown (previously there was also an 11th node running caflaunch only). And of the 10 nodes, only 9 are shown to do any useful work. Node 1 now only runs caflaunch, and only nodes 2-10 do useful work:

pprof-10img-2

On the other hand I now finally get remote calls shown with arrows in jumpshot4.

Preview mode. Olive colour is TAU itself. Lavender on node 0 is caflaunch. Note that node 0 runs nothing else but caflaunch. I'm not sure if this is correct, because then it would mean that only 9 images are doing useful work, not 10.

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

Individual remote comms are becoming visible at this magnification. The arrows show from which node to which node. The pins give the timestamp and the exact event.

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

A pattern of communications is now visible. Although sync images allows for a random comms pattern, it seems the comms start from node 1, i.e. node 1 swaps halos with node 2. Then node 2 swaps halos with node 3, etc., until node 9 swaps halos with node 8.
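The nearest-neighbour structure behind this pattern can be written down explicitly. This is a minimal sketch, assuming a 1D chain of images in which each image swaps halos only with its immediate neighbours and there is no wrap-around. halo_partners is a hypothetical helper and uses 1-based image numbers, not TAU node numbers.

```shell
#!/bin/sh
# Print the halo-exchange partners of image $1 out of $2 images (1-based),
# assuming a 1D chain of images with no wrap-around.
halo_partners() {
    img=$1
    nimg=$2
    partners=""
    if [ "$img" -gt 1 ]; then
        partners="$((img - 1))"
    fi
    if [ "$img" -lt "$nimg" ]; then
        partners="${partners:+$partners }$((img + 1))"
    fi
    echo "$partners"
}

i=1
while [ "$i" -le 10 ]; do
    echo "image $i: $(halo_partners "$i" 10)"
    i=$((i + 1))
done
```

Under this assumption the first and the last image have a single partner, while all interior images have two.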

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

Some randomness in the duration of remote calls is now visible.

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

The pin names are: pink is "Message size received from all nodes"; blue is "Message size sent to node ..."; light green is "Message size is sent to all nodes". The other colours are the same messages shown from .TAU.

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

Here I removed the coback1tau process, the main program, and the TAU process. The green blocks are MPI_Win_unlock, and the olive blocks are MPI_Recv. Note that nothing is left on node 0 but caflaunch.

jumpshot4, *without* TAU_PROFILE_SET_NODE, 10 images, different magnifications.

15-MAY-2016: Trying export TAU_PROFILE_FORMAT=merged

Following advice from Sameer Shende via the tau-users mailing list, I added

export TAU_PROFILE_FORMAT=merged

to the PBS job script. The comms matrix is finally symmetric, as it should be:

Paraprof comms matrix, finally symmetric, 10 images, 9laplace program

17-MAY-2016: Figured out how to run paraprof 3D animation over ssh

paraprof is a shell script that sets up some variables and then calls java. One of the variables it sets is:

# If 3D window has problems, please uncomment
#if [ $MACHINE = bgq -o $MACHINE = arm_linux ]; then
export LIBGL_ALWAYS_INDIRECT=1
#fi

I commented LIBGL_ALWAYS_INDIRECT=1 out and now I can run paraprof 3D GL animation over ssh.

This post explains what setting of LIBGL_ALWAYS_INDIRECT=1 does:

http://unix.stackexchange.com/questions/1437/what-does-libgl-always-indirect-1-actually-do

The exact value is not important: any value is interpreted by GL as the variable being set. Indirect rendering means that the remote GL program sends GL commands over the net to be interpreted by the local GL library. Apparently, this doesn't work in my case, probably because of some version clash. Unsetting this variable enables direct rendering, where GL on the remote host communicates directly with the graphics hardware on the remote host, and only the resulting frames are sent over the net. This is slower, and at some frame rates might be unsuitable, but it completely avoids compatibility problems between the local and remote GL versions.
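The "set, whatever the value" rule can be illustrated in the shell itself. This is a minimal sketch that only mirrors the rule as described above; it does not query libGL, and gl_rendering_request is a hypothetical helper name.

```shell
#!/bin/sh
# Report which rendering path would be requested, following the rule
# described above: the variable counts as set whatever its value.
gl_rendering_request() {
    if [ -n "${LIBGL_ALWAYS_INDIRECT+set}" ]; then
        echo indirect
    else
        echo direct
    fi
}

LIBGL_ALWAYS_INDIRECT=0; export LIBGL_ALWAYS_INDIRECT
gl_rendering_request        # prints "indirect", even though the value is 0
unset LIBGL_ALWAYS_INDIRECT
gl_rendering_request        # prints "direct"
```

This is why commenting out the export line in the paraprof script has an effect that changing the value would not, at least if the GL library in use follows this rule.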

In my particular case the remote host doesn't seem to have DRI:

libGL error: failed to open drm device: No such file or directory
libGL error: failed to load driver: i965

so Mesa is used instead. Hitting the GL Info button in the paraprof 3D animation window shows this:

Paraprof GL info when running over ssh

Here's the 3D paraprof bar chart for 5pi example, compiled for shared memory execution, and run on 16 images, i.e. a single node.

3D paraprof bar chart for 5pi example on 16 images, shared memory

Node 0 (image 1) is the busiest, because it's doing the reduction operation. This node also does not call MPI_Win_unlock. Not sure if this is correct, or a TAU misunderstanding. Node 6 is slightly different from the other nodes, but otherwise the load is even.

Below is paraprof 3D bar chart for 9laplace example, run on 10 images with distributed memory on a single node.

3D paraprof bar chart for 9lap example on 10 images, distr. memory

MPI_Win_unlock dominates the run time here. There is a slight imbalance - nodes 0 and 9 (images 1 and 10) spend more time doing comms and less time doing calculations. It is interesting to note that MPI_Recv takes a lot more time than MPI_Isend.

19-MAY-2016: Moving on to CGPACK examples

This is program testABW.f90. I instrument the library and the test programs with the same tau-instrument.sh driver script:

export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt

make clean -i -f Makefile-bc3-mpiifort-tau
make all -i -f Makefile-bc3-mpiifort-tau

If TAU option -optCompInst is not used, then I get CGPACK build errors:

cgca_m3clvg.inst.f90(143): warning #6878: Within an interface block, this statement is ignored.
save profiler
------^
cgca_m3clvg.inst.f90(145): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_TIMER(profiler, '                                &
------^
cgca_m3clvg.inst.f90(148): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_START(profiler)
--------^
cgca_m3clvg.inst.f90(149): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_STOP(profiler)
--------^
cgca_m3clvg.inst.f90(142): warning #6168: Data initializations are ignored in interface blocks.
integer profiler(2) / 0, 0 /
^
compilation aborted for cgca_m3clvg.inst.f90 (code 1)
Error: Compilation Failed
Error: Command(Executable) is -- mpiifort
Error: Full Command attempted is -- mpiifort   -c -qopt-report -assume realloc_lhs -O2 -debug full -g -traceback -free -fPIC -warn -coarray=distributed -coarray-config-file=xx14.conf cgca_m3clvg.inst.f90   -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include  -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include     -o cgca_m3clvg.o

Here's the full build log: cgpack-no-optCompInst.log

Apparently TAU calls are inserted by mistake into an interface block, where no calls can appear. I reported this to the tau-users mailing list. It would be strange if this happened only in coarray Fortran programs. I suspect the use of submodules.
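A quick way to locate such misplaced calls in an instrumented source file is to scan it for TAU calls between "interface" and "end interface". This is a minimal sketch: tau_in_interface is a hypothetical helper, the matching is line-based and deliberately naive (no continuation lines, no comment handling), and the Fortran fragment is made up to imitate the shape of the failing .inst.f90 file.

```shell
#!/bin/sh
# Read Fortran source on stdin and print "LINE: text" for every TAU profiling
# call that sits between "interface" and "end interface".
tau_in_interface() {
    awk '
    { line = tolower($0) }
    line ~ /^[ \t]*end[ \t]*interface/          { inside = 0; next }
    line ~ /^[ \t]*(abstract[ \t]+)?interface/  { inside = 1; next }
    inside && line ~ /call[ \t]+tau_profile_/ {
        sub(/^[ \t]+/, "")          # drop leading indentation for display
        print NR ": " $0
    }
    '
}

# Hypothetical fragment, not taken from CGPACK.
tau_in_interface <<'EOF'
module demo
interface
  module subroutine work( a )
    call TAU_PROFILE_START(profiler)
    integer, intent(in) :: a
  end subroutine work
end interface
contains
  subroutine other()
    call TAU_PROFILE_START(profiler)
  end subroutine other
end module demo
EOF
```

Only the call on line 4 is reported. The call on line 10 is outside the interface block and is left alone. On a real file the function would be used as tau_in_interface < cgca_m3clvg.inst.f90.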

So I have to instrument with -optCompInst, which, as described earlier on this page, leads to wrong TAU profiling data on image 1 (node 0).

The 3 plots below tell the same story. Run time is dominated by MPI_Win_unlock, with MPI_Barrier in the second place. Only 2 CGPACK routines break the 1% time threshold to appear in the 3D bar chart - cgca_hxi.f90 and cgca_clvg_nocosum.

3D paraprof bar chart for CGPACK test ABW, 32 images, distr. memory

paraprof bar chart for CGPACK test ABW, 32 images, distr. memory

paraprof bar chart for CGPACK test ABW, 32 images, distr. memory, single image

The trace is a bit boring, just an endless stream of MPI_Win_unlock (olive), in this case inside cgca_hxi (yellow).

Jumpshot trace fragment for CGPACK test ABW, 32 images, distr. memory

20-MAY-2016: CGPACK+ParaFEM example

Managed to build ParaFEM with TAU by adding

export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt

to make-parafem, and changing only these 2 variables in build/bc3.inc:

FC=                   tau_f90.sh
BC3_FFLAGS=           -O2 -r8 -mt_mpi -warn all -stand f08 -traceback $(TAU_FLAGS)

I also disabled all other tests in make-parafem, to speed the build up and to avoid unnecessary build errors:

BUILD_GROUP_XX="xx14"

Finally, I put tau.conf files in both parafem/parafem dir, and in parafem/parafem/src/programs/dev/xx14 dir. Not sure if both copies are required, but this seems to work.

I tried only the xx14std.f90 program, where "std" stands for standard conforming. At this stage coarray collectives are not yet in the standard and, more importantly, they are not yet supported in Intel Fortran v. 16, so standard conforming basically means no collectives. xx14std.f90 does use MPI/IO though to dump CA output.

The instrumented program works fine, and here are the TAU results. Profiles show good load balancing, but a total domination of MPI_Win_unlock over run time. MPI_Barrier is in second place, and only cgca_hxi, the halo exchange routine, exceeds the threshold of 1% of total time.

Paraprof xx14std 3D bar chart, 32 images, distr. memory

The orange bar on node 0 is caflaunch process, which overwrites the profiling data from the program. TAU (Sameer Shende) have sent me a patch to try to fix this.

Paraprof xx14std 2D plot, main display, 32 images, distr. memory

Paraprof xx14std 2D plot, node 17, 32 images, distr. memory

Finally, below are the jumpshot trace fragments. The ParaFEM part shows a nice pattern. These are the MPI calls which are already present in ParaFEM routines, nothing (hopefully) is added by the compiler.

xx14std jumpshot trace fragment, ParaFEM part, nice

And here is the CGPACK part - what a mess! Cannot see any pattern, and all images are dominated by MPI_Win_unlock.

xx14std jumpshot trace fragment, CGPACK part, terrible!

23-MAY-2016: Testing TAU fix to stop Intel's caflaunch overwriting profile/trace data from node 0.

Sameer posted updated versions of ~/tau-2.25.1/src/Profile/Profiler.cpp and ~/tau-2.25.1/src/Profile/Tracer.cpp. I rebuilt/reinstalled TAU with these modified files. There are 10 failures of validation tests, see results3.html. However, the MPI Fortran test is ok, so I can proceed to instrument my programs.

Re-testing CGPACK test testABW.f90. I rebuilt CGPACK library and tests with

export TAU_OPTIONS="-optShared -optVerbose -optCompInst"

It seems caflaunch has gone away:

Paraprof profile:

Paraprof CGPACK test ABW, 32 images, 2x16-core nodes, caflaunch process has gone away!

Jumpshot traces:

Jumpshot CGPACK test ABW, 32 images, 2x16-core nodes, caflaunch process has gone away!

The same for CGPACK+ParaFEM xx14std.f90 - there is no more caflaunch process, and the data from process 0 seems as expected.

Parafem profiles:

Paraprof CGPACK+ParaFEM program xx14std, 32 images, 2x16-core nodes, caflaunch process has gone away!

Jumpshot traces: (1) CGPACK part - MPI_Win_unlock:

Jumpshot CGPACK+ParaFEM program xx14std, CGPACK part, 32 images, 2x16-core nodes, caflaunch process has gone away!

(2) ParaFEM part - structured comms pattern:

Jumpshot CGPACK+ParaFEM program xx14std, ParaFEM part, 32 images, 2x16-core nodes, caflaunch process has gone away!
