CGPACK > MAY-2016
4-MAY-2016: Can now get MPI calls with TAU
It turned out I was not linking correctly against TAU libraries. Fixed in r84 of coarrays, example problem 5pi, see Makefile.
Anyway, I also rebuilt TAU like this:
./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download
Here are the validation results: results.html.
Note that pprof now shows MPI processes.
Some new results from jumpshot.
This is for 5pi on 32 images on a single 16-core node.
The MPI calls are now visible, but not the arrows.
Why?
Here is the pprof output: pprof.text
And here are paraprof visualisations.
Note, as before, there is process 0, which just does caflaunch.
Clearly the runtime is dominated by the pica calculations.
Communications are negligible.
But now I have no hardware counters, because I built without PAPI. Let's try rebuilding once again with PAPI. Configure as:
./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download -papi=/cm/shared/libraries/intel_build/papi-5.3.0
The validation results: results2.html.
Moving on to the Laplace solver, problem 9laplace from the coarrays course. Let's try the co_back2.f90 program, which partitions the picture (a 2D array) into chunks in two dimensions. I run it on 4 images, i.e. splitting the picture into 4 chunks: 2 along dimension 1 and 2 along dimension 2.
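For reference, here is a minimal sketch of this kind of 2D decomposition. This is my own illustration, not the actual co_back2.f90 source, and the picture size and the 2x2 image grid are assumed:

program chunks2d
  implicit none
  integer, parameter :: width = 600, height = 400   ! picture size assumed for illustration
  integer, parameter :: nc1 = 2, nc2 = 2            ! chunks along dimensions 1 and 2
  real, allocatable :: chunk(:,:)[:,:]              ! local chunk with a 2D codimension
  integer :: me(2)
  ! run on nc1*nc2 = 4 images; each image holds one chunk of the picture
  allocate( chunk( width/nc1, height/nc2 ) [ nc1, * ] )
  me = this_image( chunk )                          ! my (row, column) position in the image grid
  write (*,*) "image", this_image(), "holds chunk", me(1), me(2)
  sync all
end program chunks2d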
Note that images 1 and 2 spend less time on computation, coback2 (red), and more on MPI_Barrier (orange), compared to images 3 and 4. I don't know why this is, or whether it is significant.
This is for image 2. Note that image 2 spends less time computing and more time waiting in MPI_Barrier, compared to image 3.
This is for image 3. Note that image 3 spends more time computing and less time waiting in MPI_Barrier, compared to image 2.
Finally, here's the plain text output of pprof: 9lap-pprof.text.
9-MAY-2016: Trying to understand very poor speed-up of parallel Laplacian solver.
This is example problem 9laplace from the Coarrays Course.
Let's try a much bigger image, in case the problem is that there is not enough work for each image to do at each iteration, and the global sync overheads dominate the runtime.
12-MAY-2016: Some progress with Intel
I switched to a 600x400 image:
$ identify ref.pgm
ref.pgm PNM 600x400 600x400+0+0 8-bit Grayscale DirectClass 793kb
I use full ifort optimisation and collect opt. reports:
-fast -qopt-report
The serial back.f90 runs under a minute:
45.68user 0.11system 0:45.92elapsed 99%CPU (0avgtext+0avgdata 19248maxresident)k 64inputs+1600outputs (1major+1274minor)pagefaults 0swaps
I was having some Intel MPI errors. In the Intel Clusters and HPC Technology forum Intel staff asked for the output of:
export confile=nodes
cat $PBS_NODEFILE > $confile
mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1
The output is under 9lap.o4579303. It might be useful for debugging in future.
Now trying a coarray program, coback1.f90. This version is not instrumented. coback1-tau.f90 is instrumented. The programs differ only in one line: the instrumented program, coback1-tau.f90, calls this TAU routine:
call TAU_PROFILE_SET_NODE(this_image())
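For context, here is a minimal hypothetical sketch, not the actual coback1-tau.f90 source, of where such a call might sit and of the dimension-2 split described in the next paragraph:

program coback1_sketch
  implicit none
  integer, parameter :: width = 600, height = 400   ! picture size used in this test
  real, allocatable :: pic(:,:)[:]                  ! local slab of the picture, 1D codimension
  integer :: nimg, cols
  call TAU_PROFILE_SET_NODE( this_image() )         ! the one extra line in the instrumented version
  nimg = num_images()
  cols = height / nimg                              ! columns per image; height assumed divisible
  ! long along dimension 1 (fastest changing), short along dimension 2,
  ! with one halo column added on each side
  allocate( pic( width, 0:cols+1 ) [ * ] )
  pic = 0.0
  sync all
end program coback1_sketch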
The program splits the 2D picture array along dimension 2, so that the coarrays are long along dimension 1 (fastest changing) and short along dimension 2 (slow changing). The picture size is 400 along dimension 2, which is divisible by 2, 4, 8, 10, 16, 20, 25 and 40. Hence I use these numbers of images in this example. The uninstrumented run times are:
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1.x
150.53user 12.12system 1:36.55elapsed 168%CPU (0avgtext+0avgdata 37904maxresident)k
1528inputs+1616outputs (2major+8435minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1.x
220.00user 34.39system 1:05.75elapsed 386%CPU (0avgtext+0avgdata 42592maxresident)k
0inputs+1616outputs (0major+13983minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1.x
271.20user 78.84system 0:46.67elapsed 750%CPU (0avgtext+0avgdata 59536maxresident)k
0inputs+1608outputs (0major+30295minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1.x
249.17user 89.83system 0:39.64elapsed 855%CPU (0avgtext+0avgdata 65568maxresident)k
0inputs+5640outputs (0major+33616minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1.x
473.93user 202.38system 0:51.63elapsed 1309%CPU (0avgtext+0avgdata 85680maxresident)k
0inputs+1616outputs (0major+59846minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1.x
484.81user 191.53system 0:53.82elapsed 1256%CPU (0avgtext+0avgdata 89792maxresident)k
0inputs+1608outputs (0major+57515minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1.x
477.85user 194.95system 0:53.65elapsed 1254%CPU (0avgtext+0avgdata 95456maxresident)k
0inputs+1608outputs (0major+63444minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1.x
489.88user 446.08system 1:48.09elapsed 865%CPU (0avgtext+0avgdata 131168maxresident)k
0inputs+5664outputs (0major+96911minor)pagefaults 0swaps
The instrumented run times are:
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1-tau.xtau
173.92user 16.97system 1:55.10elapsed 165%CPU (0avgtext+0avgdata 64384maxresident)k
1592inputs+540304outputs (2major+13253minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1-tau.xtau
246.27user 33.25system 1:12.43elapsed 385%CPU (0avgtext+0avgdata 69104maxresident)k
0inputs+1419664outputs (0major+22285minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1-tau.xtau
302.14user 71.45system 0:50.69elapsed 736%CPU (0avgtext+0avgdata 85936maxresident)k
0inputs+3586704outputs (0major+45537minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1-tau.xtau
294.58user 72.00system 0:46.36elapsed 790%CPU (0avgtext+0avgdata 91968maxresident)k
0inputs+4711504outputs (0major+57447minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1-tau.xtau
567.49user 163.53system 1:02.69elapsed 1165%CPU (0avgtext+0avgdata 112128maxresident)k
0inputs+7984424outputs (0major+89060minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1-tau.xtau
594.59user 172.27system 1:08.11elapsed 1125%CPU (0avgtext+0avgdata 116208maxresident)k
0inputs+8630880outputs (0major+94834minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1-tau.xtau
601.58user 174.49system 1:07.08elapsed 1156%CPU (0avgtext+0avgdata 121904maxresident)k
0inputs+8450952outputs (0major+92624minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1-tau.xtau
693.98user 436.61system 2:04.40elapsed 908%CPU (0avgtext+0avgdata 157520maxresident)k
0inputs+13701584outputs (0major+139719minor)pagefaults 0swaps
This is roughly the expected pattern. The best times are with 10 images in both cases. The instrumented times are slightly higher. However, even the best uninstrumented time is only marginally better than the serial time. Let's examine the profiling results from the run with 10 images.
The text results, sorted by exclusive total time (pprof -m > pprof-10images), are in pprof-10images.
On most images MPI_Win_unlock takes most time. This is clearly bad. On the first and the last images MPI_Recv takes most time, closely followed by MPI_Win_unlock.
The main paraprof window. Note the differences between images (nodes) 1 and 10, and the other images.
View on node (image) 3:
View on node (image) 1:
Conclusion - something is not very well optimised in MPI comms.
It would be interesting to know exactly how sync images is mapped onto MPI calls. Perhaps jumpshot can help. I'll try this next.
13-MAY-2016: Jumpshot4 analysis of the Laplacian solver with Intel coarrays.
I concentrate on the behaviour towards the end of the program.
Here it seems that all images but image 1 have completed their computations at about 49 s. After that, for another 10 s(!), they seem to be doing mostly MPI_Win_unlock. Why??
Also note - lighter green is coback1tau, the main program. Somehow it is not shown on node 10, even though the profiling results, pprof.text, show it's there:
NODE 10;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs   Inclusive Name
              msec   total msec                           usec/call
---------------------------------------------------------------------------------------
 35.9       21,261       21,261      180000           0         118 .TAU application => coback1tau => MPI_Recv()
 35.9       21,261       21,261      180000           0         118 MPI_Recv()
 33.6       19,880       19,880 1.65962E+06           0          12 MPI_Win_unlock()
 29.1       17,183       17,183      900093           0          19 .TAU application => coback1tau => MPI_Win_unlock()
 83.5        9,819       49,407           1 1.48008E+06    49407127 .TAU application => coback1tau
 83.5        9,819       49,407           1 1.48008E+06    49407127 coback1tau
This is a high-magnification fragment around 54.9 s. The dark green bars on images 1-10 are MPI_Win_unlock. The lighter green on image 1 is coback1tau, the main program. The lavender on node 0 is caflaunch.
Here is the top of the legend, sorted by inclusive time.
I still can't get paraprof 3D views working from FreeBSD; I get these errors:
paraprof
javax.media.opengl.GLException: Error making context current
        at com.sun.opengl.impl.x11.X11GLContext.makeCurrentImpl(X11GLContext.java:141)
        at com.sun.opengl.impl.x11.X11OnscreenGLContext.makeCurrentImpl(X11OnscreenGLContext.java:69)
        at com.sun.opengl.impl.GLContextImpl.makeCurrent(GLContextImpl.java:127)
        at com.sun.opengl.impl.GLDrawableHelper.invokeGL(GLDrawableHelper.java:182)
        at javax.media.opengl.GLCanvas.maybeDoSingleThreadedWorkaround(GLCanvas.java:258)
        at javax.media.opengl.GLCanvas.display(GLCanvas.java:130)
        at javax.media.opengl.GLCanvas.paint(GLCanvas.java:142)
        at sun.awt.RepaintArea.paintComponent(RepaintArea.java:264)
        at sun.awt.X11.XRepaintArea.paintComponent(XRepaintArea.java:73)
        at sun.awt.RepaintArea.paint(RepaintArea.java:240)
        at sun.awt.X11.XComponentPeer.handleEvent(XComponentPeer.java:694)
        at java.awt.Component.dispatchEventImpl(Component.java:4725)
        at java.awt.Component.dispatchEvent(Component.java:4475)
        at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:675)
        at java.awt.EventQueue.access$300(EventQueue.java:96)
        at java.awt.EventQueue$2.run(EventQueue.java:634)
        at java.awt.EventQueue$2.run(EventQueue.java:632)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:119)
        at java.awt.EventQueue$3.run(EventQueue.java:648)
        at java.awt.EventQueue$3.run(EventQueue.java:646)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:645)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:275)
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:200)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:190)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:185)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:177)
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:138)
So I tried from MS Windows 7(?), and it worked:
Note that there is an inconsistency in node/image numbering in TAU, resulting in a non-symmetric comms matrix.
This is likely caused by the fact that the coarrays are implemented over MPI: in some places the coarray Fortran 1 to N numbering is used, and in other places the C-style 0 to N-1 numbering.
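A trivial sketch of the two numbering conventions, which is where I assume the off-by-one mismatch comes from:

program numbering
  implicit none
  ! Fortran images are numbered 1 to num_images(), while MPI ranks run from 0 to N-1.
  ! If TAU labels nodes by MPI rank, image n would show up as node n-1, hence the skewed matrix.
  write (*,*) "Fortran image", this_image(), "of", num_images(), &
              "presumably maps to MPI rank", this_image() - 1
end program numbering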
If I disable the TAU call to explicitly set the node number:
!call TAU_PROFILE_SET_NODE(this_image())
then the comms matrix looks like this:
However, now the pprof results are weird. Only 10 nodes are shown (previously there was also an 11th node, running only caflaunch). And of the 10 nodes, only 9 are shown to do any useful work. Node 1 now runs only caflaunch, and only nodes 2-10 do useful work:
On the other hand I now finally get remote calls shown with arrows in jumpshot4.
Preview mode. Olive colour is TAU itself. Lavender on node 0 is caflaunch. Note that node 0 runs nothing else but caflaunch. I'm not sure if this is correct, because then it would mean that only 9 images are doing useful work, not 10.
Individual remote comms are becoming visible at this magnification. The arrows show from which node to which node. The pins give the timestamp and the exact event.
A pattern of communications is now visible. Although sync images allows for a random comms pattern, it seems the comms start from node 1, i.e. node 1 swaps halos with node 2, then node 2 swaps halos with node 3, etc., until node 9 swaps halos with node 8.
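For reference, this is the kind of pairwise synchronisation that would produce such a chain. This is my own sketch, not the actual coback1 source:

program halo_sync
  implicit none
  integer :: img, nimg
  img  = this_image()
  nimg = num_images()
  ! the halo swap itself (coarray assignments) would go here
  if ( img > 1    ) sync images ( img - 1 )    ! pair up with the left neighbour
  if ( img < nimg ) sync images ( img + 1 )    ! pair up with the right neighbour
  ! the standard lets these pairwise syncs complete in any order,
  ! so the apparent node 1 -> node 2 -> ... chain is not guaranteed
end program halo_sync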
Some randomness in the duration of remote calls is now visible.
The pin names are: pink is "Message size received from all nodes"; blue is "Message size sent to node ..."; light green is "Message size is sent to all nodes". The other colours are the same messages shown from .TAU.
Here I removed the coback1tau process, the main program, and the TAU process. The green blocks are MPI_Win_unlock, and the olive blocks are MPI_Recv. Note that nothing is left on node 0 but caflaunch.
15-MAY-2016: Trying export TAU_PROFILE_FORMAT=merged
Following advice from Sameer Shende via TAU-users mailing list, I added
export TAU_PROFILE_FORMAT=merged
to the PBS job script. The comms matrix is finally symmetric, as it should be:
17-MAY-2016: Figured out how to run paraprof 3D animation over ssh
paraprof is a shell script that sets up some variables and then calls java. One of the variables it sets is:
# If 3D window has problems, please uncomment
#if [ $MACHINE = bgq -o $MACHINE = arm_linux ]; then
export LIBGL_ALWAYS_INDIRECT=1
#fi
I commented out LIBGL_ALWAYS_INDIRECT=1, and now I can run the paraprof 3D GL animation over ssh.
This post explains what setting LIBGL_ALWAYS_INDIRECT=1 does: http://unix.stackexchange.com/questions/1437/what-does-libgl-always-indirect-1-actually-do.
The exact value is not important; GL treats the variable as set if it has any value. Indirect rendering means that the remote GL program sends GL commands over the net to be interpreted by the local GL library. Apparently this doesn't work in my case, probably because of some version clash. Unsetting this variable enables direct rendering, where GL on the remote host talks directly to the graphics hardware on the remote host and only the resulting frames are sent over the net. This is slower, and at some frame rates might be unsuitable, but it completely avoids compatibility problems between the local and remote GL versions.
In my particular case the remote host doesn't seem to have dri:
libGL error: failed to open drm device: No such file or directory
libGL error: failed to load driver: i965
so Mesa is used instead.
Hitting the GL Info button in the paraprof 3D animation window shows this:
Here's the 3D paraprof bar chart for the 5pi example, compiled for shared-memory execution and run on 16 images, i.e. a single node. Node 0 (image 1) is the busiest, because it does the reduction operation. This node also does not call MPI_Win_unlock. I am not sure if this is correct, or a TAU misunderstanding. Node 6 is slightly different from the other nodes, but otherwise the load is even.
Below is the paraprof 3D bar chart for the 9laplace example, run on 10 images with distributed memory on a single node. MPI_Win_unlock dominates the run time here. There is a slight imbalance: nodes 0 and 9 (images 1 and 10) spend more time doing comms and less time doing calculations. It is interesting to note that MPI_Recv takes a lot more time than MPI_Isend.
19-MAY-2016: Moving onto CGPACK examples
This is program testABW.f90. I instrument the library and the test programs with the same tau-instrument.sh driver script:
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt
make clean -i -f Makefile-bc3-mpiifort-tau
make all -i -f Makefile-bc3-mpiifort-tau
If TAU option -optCompInst is not used, then I get CGPACK build errors:
cgca_m3clvg.inst.f90(143): warning #6878: Within an interface block, this statement is ignored.
save profiler
------^
cgca_m3clvg.inst.f90(145): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_TIMER(profiler, ' &
------^
cgca_m3clvg.inst.f90(148): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_START(profiler)
--------^
cgca_m3clvg.inst.f90(149): error #6622: This statement is invalid in an INTERFACE block.
call TAU_PROFILE_STOP(profiler)
--------^
cgca_m3clvg.inst.f90(142): warning #6168: Data initializations are ignored in interface blocks.
integer profiler(2) / 0, 0 /
^
compilation aborted for cgca_m3clvg.inst.f90 (code 1)
Error: Compilation Failed
Error: Command(Executable) is -- mpiifort
Error: Full Command attempted is -- mpiifort -c -qopt-report -assume realloc_lhs -O2 -debug full -g -traceback -free -fPIC -warn -coarray=distributed -coarray-config-file=xx14.conf cgca_m3clvg.inst.f90 -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include -o cgca_m3clvg.o
Here's the full build log:
cgpack-no-optCompInst.log
Apparently the TAU calls are inserted by mistake into an interface block, where no calls can appear. I reported this to the tau-users mailing list. It would be strange if this happened only in coarray Fortran programs. I suspect the use of submodules.
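For illustration, here is a cut-down reconstruction, not the actual cgca_m3clvg source, of the kind of code the instrumenter trips over. An interface block only describes a procedure, so executable statements like the TAU calls cannot legally appear inside it:

module m3clvg_sketch
  implicit none
  interface
    subroutine cleave( frac )        ! hypothetical procedure name and argument
      real, intent(in) :: frac
      ! Only specification statements are allowed in an interface body.
      ! Inserting, say,
      !   call TAU_PROFILE_START( profiler )
      ! at this point is exactly what ifort rejects with error #6622.
    end subroutine cleave
  end interface
end module m3clvg_sketch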
So I have to instrument with -optCompInst, which, as described earlier on this page, leads to wrong TAU profiling data on image 1 (node 0).
The 3 plots below tell the same story. Run time is dominated by MPI_Win_unlock, with MPI_Barrier in second place. Only 2 CGPACK routines break the 1% time threshold to appear in the 3D bar chart: cgca_hxi.f90 and cgca_clvg_nocosum.
The trace is a bit boring, just an endless stream of MPI_Win_unlock (olive), in this case inside cgca_hxi (yellow).
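To put the MPI_Win_unlock wall in context, here is a guess, not checked against the cgca_hxi source, at the kind of remote access a halo exchange performs and how it presumably maps onto MPI in Intel's MPI-based coarray runtime:

program remote_get
  implicit none
  integer, parameter :: n = 10
  real :: halo(n)
  real, save :: edge(n)[*]             ! one boundary slice per image
  integer :: left
  edge = real( this_image() )
  sync all
  left = this_image() - 1
  if ( left >= 1 ) then
     ! A remote read like this presumably becomes a passive-target RMA epoch,
     ! i.e. MPI_Win_lock / MPI_Get / MPI_Win_unlock, which would explain
     ! why MPI_Win_unlock dominates the traces.
     halo = edge(:) [ left ]
  else
     halo = 0.0
  end if
  sync all
end program remote_get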
20-MAY-2016: CGPACK+ParaFEM example
Managed to build ParaFEM with TAU by adding
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt
to make-parafem, and changing only these 2 variables in build/bc3.inc:
FC= tau_f90.sh
BC3_FFLAGS= -O2 -r8 -mt_mpi -warn all -stand f08 -traceback $(TAU_FLAGS)
I also disabled all other tests in make-parafem, to speed the build up and to avoid unnecessary build errors:
BUILD_GROUP_XX="xx14"
Finally, I put tau.conf files in both the parafem/parafem dir and the parafem/parafem/src/programs/dev/xx14 dir. I am not sure if both copies are required, but this seems to work.
I tried only the xx14std.f90 program, where "std" stands for standard conforming. At this stage coarray collectives are not yet in the standard and, more importantly, they are not yet supported in Intel Fortran v. 16, so standard conforming basically means no collectives.
xx14std.f90 does use MPI/IO though, to dump CA output.
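As a concrete illustration of what avoiding collectives means, here is a minimal sketch of my own, not taken from xx14std.f90: where one would like to call co_sum, a standard-conforming program has to gather the partial values explicitly, for example on image 1:

program manual_reduce
  implicit none
  real :: part[*]                      ! each image's partial result
  real :: total
  integer :: i
  part = real( this_image() )          ! stand-in for a real partial sum
  sync all
  if ( this_image() == 1 ) then
     total = 0.0
     do i = 1, num_images()
        total = total + part[i]        ! remote reads instead of co_sum( part )
     end do
     write (*,*) "total =", total
  end if
  sync all
end program manual_reduce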
The instrumented program works fine, and here are the TAU results. The profiles show good load balancing, but a total domination of MPI_Win_unlock over the run time.
MPI_Barrier is in second place, and only cgca_hxi, the halo exchange routine, exceeds the threshold of 1% of total time.
The orange bar on node 0 is the caflaunch process, which overwrites the profiling data from the program. TAU (Sameer Shende) has sent me a patch to try to fix this.
Finally, below are the jumpshot trace fragments. The ParaFEM part shows a nice pattern. These are the MPI calls which are already present in ParaFEM routines, nothing (hopefully) is added by the compiler.
And here is the CGPACK part - what a mess! I cannot see any pattern, and all images are dominated by MPI_Win_unlock.
23-MAY-2016: Testing TAU fix to stop Intel's caflaunch overwriting profile/trace data from node 0
Sameer posted updated versions of ~/tau-2.25.1/src/Profile/Profiler.cpp and ~/tau-2.25.1/src/Profile/Tracer.cpp.
I rebuilt/reinstalled TAU with these modified files.
There are 10 failures of validation tests, see results3.html.
However, the MPI Fortran test is OK, so I can proceed to instrument my programs.
Re-testing CGPACK test testABW.f90.
I rebuilt the CGPACK library and tests with
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
It seems caflaunch has gone away:
Paraprof profile:
Jumpshot traces:
The same for CGPACK+ParaFEM xx14std.f90 - there is no more caflaunch process, and the data from process 0 seems as expected.
ParaFEM profiles:
Jumpshot traces: (1) CGPACK part - MPI_Win_unlock:
(2) ParaFEM part - structured comms pattern: