CGPACK > MAY-2016
4-MAY-2016: Can now get MPI calls with TAU
It turned out I was not linking correctly against TAU libraries. Fixed in r84 of coarrays, example problem 5pi, see Makefile.
Anyway, I also rebuilt TAU like this:
./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download
Here are the validation
pprof now shows
Some new results from
This is for 5pi on 32 images on a single 16-core node.
The MPI calls are now visible, but not the arrows.
Here is the
And here are
Note, as before, there is process 0, which just does
Clearly the runtime is dominated by the
Communications are negligible.
But now I have no hardware counters, because I built without PAPI. Let's try rebuilding once again with PAPI. Configure as:
./configure -mpi -c++=mpiicpc -cc=mpiicc -fortran=mpiifort -pdt=$HOME/pdtoolkit-3.22 -bfd=download -papi=/cm/shared/libraries/intel_build/papi-5.3.0
The validation results:
Moving on to the Laplace solver, problem 9laplace from the coarrays course.
program, which implements
partitioning of the picture
(2D array) into chunks in two
I run it on 4 images, i.e.
splitting into 4 chunks, -
2 along dimension 1 and
2 chunks along dimension 2.
Note that images 1 and 2 spend less on computation,
and more on
compared to images 3 and 4.
I don't know why this is or whether this is
This is for image 2.
Note image 2 spends less time computing and more time waiting,
MPI_barrier, compared to image 3.
This is for image 3.
Note image 3 spends more time computing and less time waiting,
MPI_barrier, compared to image 2.
Finally, here's the plain text output of
9-MAY-2016: Trying to understand very poor speed-up of parallel Laplacian solver.
This is example problem 9laplace from the Coarrays Course.
Let's try a much bigger image, in case the problem is that there is not enough work for each image to do at each iteration, and the global sync overheads dominate the runtime.
12-MAY-2016: Some progress with Intel
I switched to a 600x400 image:
$ identify ref.pgm
ref.pgm PNM 600x400 600x400+0+0 8-bit Grayscale DirectClass 793kb
I use full ifort optimisation and collect opt. reports:
back.f90 runs under a minute:
45.68user 0.11system 0:45.92elapsed 99%CPU (0avgtext+0avgdata 19248maxresident)k
64inputs+1600outputs (1major+1274minor)pagefaults 0swaps
I was having some Intel MPI errors. In the Intel Clusters and HPC Technology forum Intel staff asked for the output of:
export confile=nodes
cat $PBS_NODEFILE > $confile
mpirun -genvall -genv I_MPI_FABRICS shm:dapl -genv I_MPI_HYDRA_DEBUG 1 -n 16 -machinefile ./nodes IMB-MPI1
which is under
Might be useful in future for debug.
Now trying a coarray program
This version is not instrumented.
coback1-tau.f90 is instrumented.
The programs differ in only one line: the instrumented program,
coback1-tau.f90, calls this TAU routine:
The program splits the 2D picture array along dimension 2, so that the coarrays are long along dimension 1 (fastest changing) and short along dimension 2 (slowest changing). The picture size is 400 along dimension 2, which is divisible by 2, 4, 8, 10, 16, 20, 25 and 40, among other values. Hence I use these numbers of images in this example. The uninstrumented run times are:
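As a quick check of the partitioning arithmetic, here is a minimal Python sketch (not part of the coarray program). It lists the image counts up to 40 that divide the 400 columns exactly, and the column range each image would own under an even block split. The names `even_image_counts` and `chunk_bounds` are hypothetical helpers, and the even block split is an assumption about how the program distributes dimension 2.

```python
# Sketch only: which image counts divide the picture evenly along dimension 2.
NCOLS = 400  # picture extent along dimension 2, taken from the text


def even_image_counts(extent, max_images):
    """Return all image counts up to max_images that divide extent exactly."""
    return [n for n in range(1, max_images + 1) if extent % n == 0]


def chunk_bounds(extent, nimages, image):
    """1-based inclusive column range owned by a 1-based image number.

    Hypothetical helper: assumes an even block split along dimension 2.
    """
    width = extent // nimages
    lo = (image - 1) * width + 1
    return lo, lo + width - 1


counts = even_image_counts(NCOLS, 40)
print(counts)
print(chunk_bounds(NCOLS, 10, 1), chunk_bounds(NCOLS, 10, 10))
```

The image counts used in the runs are a subset of this list; 5 images would also give an even split.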
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1.x
150.53user 12.12system 1:36.55elapsed 168%CPU (0avgtext+0avgdata 37904maxresident)k
1528inputs+1616outputs (2major+8435minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1.x
220.00user 34.39system 1:05.75elapsed 386%CPU (0avgtext+0avgdata 42592maxresident)k
0inputs+1616outputs (0major+13983minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1.x
271.20user 78.84system 0:46.67elapsed 750%CPU (0avgtext+0avgdata 59536maxresident)k
0inputs+1608outputs (0major+30295minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1.x
249.17user 89.83system 0:39.64elapsed 855%CPU (0avgtext+0avgdata 65568maxresident)k
0inputs+5640outputs (0major+33616minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1.x
473.93user 202.38system 0:51.63elapsed 1309%CPU (0avgtext+0avgdata 85680maxresident)k
0inputs+1616outputs (0major+59846minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1.x
484.81user 191.53system 0:53.82elapsed 1256%CPU (0avgtext+0avgdata 89792maxresident)k
0inputs+1608outputs (0major+57515minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1.x
477.85user 194.95system 0:53.65elapsed 1254%CPU (0avgtext+0avgdata 95456maxresident)k
0inputs+1608outputs (0major+63444minor)pagefaults 0swaps
===> coback1.x -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1.x
489.88user 446.08system 1:48.09elapsed 865%CPU (0avgtext+0avgdata 131168maxresident)k
0inputs+5664outputs (0major+96911minor)pagefaults 0swaps
The instrumented run times are:
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 2 ./coback1-tau.xtau
173.92user 16.97system 1:55.10elapsed 165%CPU (0avgtext+0avgdata 64384maxresident)k
1592inputs+540304outputs (2major+13253minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 4 ./coback1-tau.xtau
246.27user 33.25system 1:12.43elapsed 385%CPU (0avgtext+0avgdata 69104maxresident)k
0inputs+1419664outputs (0major+22285minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 8 ./coback1-tau.xtau
302.14user 71.45system 0:50.69elapsed 736%CPU (0avgtext+0avgdata 85936maxresident)k
0inputs+3586704outputs (0major+45537minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 10 ./coback1-tau.xtau
294.58user 72.00system 0:46.36elapsed 790%CPU (0avgtext+0avgdata 91968maxresident)k
0inputs+4711504outputs (0major+57447minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 16 ./coback1-tau.xtau
567.49user 163.53system 1:02.69elapsed 1165%CPU (0avgtext+0avgdata 112128maxresident)k
0inputs+7984424outputs (0major+89060minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 20 ./coback1-tau.xtau
594.59user 172.27system 1:08.11elapsed 1125%CPU (0avgtext+0avgdata 116208maxresident)k
0inputs+8630880outputs (0major+94834minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 25 ./coback1-tau.xtau
601.58user 174.49system 1:07.08elapsed 1156%CPU (0avgtext+0avgdata 121904maxresident)k
0inputs+8450952outputs (0major+92624minor)pagefaults 0swaps
===> coback1-tau.xtau -genvall -genv I_MPI_FABRICS=shm:dapl -machinefile ./nodes -n 40 ./coback1-tau.xtau
693.98user 436.61system 2:04.40elapsed 908%CPU (0avgtext+0avgdata 157520maxresident)k
0inputs+13701584outputs (0major+139719minor)pagefaults 0swaps
This is roughly the expected pattern. The best times are with 10 images in both cases: 39.64 s uninstrumented and 46.36 s instrumented. The instrumented runs take about 9% to 27% longer than the uninstrumented ones. However, even the best uninstrumented time is only marginally better than the 45.92 s serial time, a speed-up of about 1.16 on 10 images. Let's examine the profiling results from the run with 10 images.
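To put numbers on the scaling, here is a small Python sketch that converts the elapsed times reported above into speed-up, parallel efficiency and TAU overhead. The elapsed times are copied from the run logs on this page. The helper names `to_seconds` and `summarise` are made up for this illustration, and wall-clock time from `time` is only a rough measure.

```python
def to_seconds(elapsed):
    """Convert an 'm:ss.ss' elapsed field from time(1) output to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


SERIAL = to_seconds("0:45.92")  # serial back.f90 run

# images -> (uninstrumented, instrumented) elapsed times from the logs above
ELAPSED = {
    2: ("1:36.55", "1:55.10"),
    4: ("1:05.75", "1:12.43"),
    8: ("0:46.67", "0:50.69"),
    10: ("0:39.64", "0:46.36"),
    16: ("0:51.63", "1:02.69"),
    20: ("0:53.82", "1:08.11"),
    25: ("0:53.65", "1:07.08"),
    40: ("1:48.09", "2:04.40"),
}


def summarise(elapsed, serial):
    """Return (images, speed-up, efficiency, TAU overhead) for each run."""
    rows = []
    for n, (plain, tau) in sorted(elapsed.items()):
        t_plain, t_tau = to_seconds(plain), to_seconds(tau)
        speedup = serial / t_plain
        rows.append((n, speedup, speedup / n, t_tau / t_plain - 1.0))
    return rows


rows = summarise(ELAPSED, SERIAL)
for n, s, e, o in rows:
    print(f"{n:3d} images: speed-up {s:4.2f}, efficiency {e:5.3f}, "
          f"TAU overhead {o:5.1%}")
```

Only the 10-image run beats the serial time by any margin, and every run from 16 images upwards is slower than serial.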
The text results, sorted by exclusive total time, i.e.
pprof -m > pprof-10images
are in pprof-10images
On most images
MPI_Win_unlock takes most time.
This is clearly bad.
On the first and the last images
takes most time, closely followed by
The main paraprof window. Note the differences between images (nodes) 1 and 10, and the other images.
View on node (image) 3:
View on node (image) 1:
Conclusion - something is not very well optimised in MPI comms.
It would be interesting to know exactly how
is mapped onto MPI calls.
jumpshot can help.
I'll try this next.
13-MAY-2016: Jumpshot4 analysis of the Laplacian solver with Intel coarrays.
I concentrate on the behaviour towards the end of the program.
Here it seems all images, but image 1, have completed their
computations at about 49 s.
After that, for another 10s! they seem to be doing
Also note - lighter green is
Somehow it is not shown on node 10, even though the
show it's there:
NODE 10;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
 35.9       21,261       21,261      180000           0        118 .TAU application => coback1tau => MPI_Recv()
 35.9       21,261       21,261      180000           0        118 MPI_Recv()
 33.6       19,880       19,880 1.65962E+06           0         12 MPI_Win_unlock()
 29.1       17,183       17,183      900093           0         19 .TAU application => coback1tau => MPI_Win_unlock()
 83.5        9,819       49,407           1 1.48008E+06   49407127 .TAU application => coback1tau
 83.5        9,819       49,407           1 1.48008E+06   49407127 coback1tau
This is a high mag. fragment around 54.9s.
The dark green bars on images 1-10 are
The lighter green on image 1 is
the main program.
The lavender on node 0 is
Here is the top of the legend, sorted by inclusive time.
I still can't get paraprof 3D views working from FreeBSD, get these errors:
paraprof
javax.media.opengl.GLException: Error making context current
    at com.sun.opengl.impl.x11.X11GLContext.makeCurrentImpl(X11GLContext.java:141)
    at com.sun.opengl.impl.x11.X11OnscreenGLContext.makeCurrentImpl(X11OnscreenGLContext.java:69)
    at com.sun.opengl.impl.GLContextImpl.makeCurrent(GLContextImpl.java:127)
    at com.sun.opengl.impl.GLDrawableHelper.invokeGL(GLDrawableHelper.java:182)
    at javax.media.opengl.GLCanvas.maybeDoSingleThreadedWorkaround(GLCanvas.java:258)
    at javax.media.opengl.GLCanvas.display(GLCanvas.java:130)
    at javax.media.opengl.GLCanvas.paint(GLCanvas.java:142)
    at sun.awt.RepaintArea.paintComponent(RepaintArea.java:264)
    at sun.awt.X11.XRepaintArea.paintComponent(XRepaintArea.java:73)
    at sun.awt.RepaintArea.paint(RepaintArea.java:240)
    at sun.awt.X11.XComponentPeer.handleEvent(XComponentPeer.java:694)
    at java.awt.Component.dispatchEventImpl(Component.java:4725)
    at java.awt.Component.dispatchEvent(Component.java:4475)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:675)
    at java.awt.EventQueue.access$300(EventQueue.java:96)
    at java.awt.EventQueue$2.run(EventQueue.java:634)
    at java.awt.EventQueue$2.run(EventQueue.java:632)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
    at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:119)
    at java.awt.EventQueue$3.run(EventQueue.java:648)
    at java.awt.EventQueue$3.run(EventQueue.java:646)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:108)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:645)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:275)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:200)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:190)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:185)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:177)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:138)
So I tried from MS Windows 7(?), and it worked:
Note that there is an inconsistency in node/image numbering in TAU, resulting in a non-symmetric comms matrix.
This is likely caused by the fact that the coarrays are implemented over MPI. So in some places the coarray Fortran 1 to N notation is used, and in other places the C 0 to N-1 notation is used.
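The effect of mixing the two numbering conventions can be illustrated with a toy model in Python. This is only a sketch of the suspected mechanism, not TAU's actual bookkeeping: it assumes a 1D chain of images swapping halos with their neighbours, and that the sender is recorded by image number (1 to N) while the receiver is recorded by MPI rank (0 to N-1). All function names are hypothetical.

```python
def halo_messages(nimages):
    """Messages (src_image, dst_image), 1-based, for a 1D chain halo swap."""
    msgs = []
    for img in range(1, nimages):
        msgs.append((img, img + 1))  # send to the right neighbour
        msgs.append((img + 1, img))  # send to the left neighbour
    return msgs


def comms_matrix(msgs, size, src_shift=0, dst_shift=0):
    """Count messages per (sender, receiver) label after shifting the labels."""
    m = [[0] * size for _ in range(size)]
    for src, dst in msgs:
        m[src + src_shift][dst + dst_shift] += 1
    return m


def is_symmetric(m):
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))


N = 4
msgs = halo_messages(N)
# Both ends labelled consistently as 0-based ranks.
consistent = comms_matrix(msgs, N + 1, src_shift=-1, dst_shift=-1)
# Assumed mix-up: sender kept as image number, receiver as 0-based rank.
mixed = comms_matrix(msgs, N + 1, src_shift=0, dst_shift=-1)

print("consistent labels symmetric:", is_symmetric(consistent))
print("mixed labels symmetric:     ", is_symmetric(mixed))
```

With consistent labels the matrix of a pairwise halo swap is symmetric; an off-by-one on one side alone is enough to break the symmetry.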
If I disable the TAU call to explicitly set the node number:
then the comms matrix looks like this:
However, now the
pprof results are
Only 10 nodes are shown (previously there was also 11th
And of the 10 nodes, only 9 are shown to do any useful work.
Node 1 now only runs
and only nodes 2-10 do useful work:
On the other hand I now finally get remote
calls shown with arrows in
Olive colour is TAU itself.
Lavender on node 0 is
Note that node 0 runs nothing else but
I'm not sure if this is correct, because
then it would mean that only 9 images are doing
useful work, not 10.
Individual remote comms are becoming visible at this magnification. The arrows show from which node to which node. The pins give the timestamp and the exact event.
A pattern of communications is now visible.
Although sync images allows for a
random comms pattern, it seems the comms start
from node 1, i.e. node 1 swaps halos with node 2.
Then node 2 swaps halos with node 3, etc., until
node 9 swaps halos with node 8.
Some randomness in the duration of remote calls is now visible.
The pin names are: pink is "Message size received from all nodes"; blue is "Message size sent to node ..."; light green is "Message size is sent to all nodes". The other colours are the same messages shown from .TAU.
Here I removed the
coback1tau process, the
main program, and the TAU process.
The green blocks are
and the olive blocks are
Note that nothing is left on node 0, but
Following advice from Sameer Shende via TAU-users mailing list, I added
to the PBS job script. The comms matrix is finally symmetric, as it should be:
17-MAY-2016: Figured out how to run
animation over ssh
paraprof is a shell script that sets up
some variables and then calls
One of the variables it sets is:
# If 3D window has problems, please uncomment
#if [ $MACHINE = bgq -o $MACHINE = arm_linux ]; then
export LIBGL_ALWAYS_INDIRECT=1
#fi
and now I can run
paraprof 3D GL animation
This post explains what setting of
The exact value is not important: if the variable is set at all, GL treats it as enabled. Indirect rendering means that the remote GL program sends GL commands over the network to be interpreted by the local GL library. Apparently, this doesn't work in my case, probably because of some version clash. Unsetting this variable enables direct rendering, where GL on the remote host communicates directly with the graphics hardware on the remote host, and only the resulting frames are sent over the network. This is slower, and at some frame rates might be unsuitable, but it completely avoids compatibility problems between the local and remote GL versions.
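For launching a GL program from a script with the variable removed, a minimal Python sketch could look like the following. The helper name `direct_rendering_env` is hypothetical. Note the caveat: the paraprof wrapper script exports LIBGL_ALWAYS_INDIRECT itself, so cleaning the environment of the caller only helps if that export line in the script is also commented out.

```python
import os


def direct_rendering_env(env=None):
    """Return a copy of the environment without LIBGL_ALWAYS_INDIRECT.

    The variable is removed, not set to 0, because its mere presence
    is what requests indirect rendering.
    """
    env = dict(os.environ if env is None else env)
    env.pop("LIBGL_ALWAYS_INDIRECT", None)
    return env


# Example with a made-up environment; DISPLAY value is illustrative only.
sample = {"LIBGL_ALWAYS_INDIRECT": "1", "DISPLAY": "localhost:10.0"}
cleaned = direct_rendering_env(sample)
print(cleaned)

# To launch a program with the cleaned environment one could use, e.g.:
#   import subprocess
#   subprocess.run(["paraprof"], env=direct_rendering_env())
```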
In my particular case the remote host doesn't seem to have dri:
libGL error: failed to open drm device: No such file or directory
libGL error: failed to load driver: i965
so Mesa is used instead.
GL Info button
paraprof 3D animation window
Here's the 3D
paraprof bar chart
example, compiled for shared memory execution,
and run on 16 images, i.e. a single node.
Node 0 (image 1) is most busy,
because it's doing the reduction operation.
This node also does not call
Not sure if this is correct, or a TAU misunderstanding.
Node 6 is slightly different from other nodes, but otherwise
the load is even.
paraprof 3D bar chart for
example, run on 10 images with distributed memory
on a single node.
MPI_Win_unlock dominates the run time
There is slight imbalance - nodes 0 and 9 (images 1 and 10)
spend more time doing comms and less time doing calculations.
It is interesting to note that
MPI_Recv take a lot more time than
19-MAY-2016: Moving on to CGPACK examples
This is program
I instrument the library and the test programs
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt
make clean -i -f Makefile-bc3-mpiifort-tau
make all -i -f Makefile-bc3-mpiifort-tau
If TAU option
-optCompInst is not used,
then I get CGPACK build errors:
cgca_m3clvg.inst.f90(143): warning #6878: Within an interface block, this statement is ignored.
      save profiler
------^
cgca_m3clvg.inst.f90(145): error #6622: This statement is invalid in an INTERFACE block.
      call TAU_PROFILE_TIMER(profiler, ' &
------^
cgca_m3clvg.inst.f90(148): error #6622: This statement is invalid in an INTERFACE block.
        call TAU_PROFILE_START(profiler)
--------^
cgca_m3clvg.inst.f90(149): error #6622: This statement is invalid in an INTERFACE block.
        call TAU_PROFILE_STOP(profiler)
--------^
cgca_m3clvg.inst.f90(142): warning #6168: Data initializations are ignored in interface blocks.
integer profiler(2) / 0, 0 /
^
compilation aborted for cgca_m3clvg.inst.f90 (code 1)
Error: Compilation Failed
Error: Command(Executable) is -- mpiifort
Error: Full Command attempted is -- mpiifort -c -qopt-report -assume realloc_lhs -O2 -debug full -g -traceback -free -fPIC -warn -coarray=distributed -coarray-config-file=xx14.conf cgca_m3clvg.inst.f90 -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include -I/panfs/panasas01/mech/mexas/tau-2.25.1/include -I/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include -o cgca_m3clvg.o
Here's the full build log:
Apparently TAU calls are inserted by mistake into an
interface block, where no executable statements can appear.
I reported this to
It would be strange if this happened only
in coarray Fortran programs.
I suspect the use of submodules.
So, have to instrument with
which, as described earlier on this page, leads to wrong TAU
profiling data on image 1 (node 0).
The 3 plots below tell the same story.
Run time is dominated by MPI_Win_unlock,
with MPI_Barrier in the second place.
Only 2 CGPACK routines break the 1% time threshold
to appear in the 3D bar chart -
The trace is a bit boring, just an endless stream
MPI_Win_unlock (olive), in this case
20-MAY-2016: CGPACK+ParaFEM example
Managed to build ParaFEM with TAU by adding
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25.1/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt
and changing only these 2 variables
FC= tau_f90.sh
BC3_FFLAGS= -O2 -r8 -mt_mpi -warn all -stand f08 -traceback $(TAU_FLAGS)
I also disabled all other tests in
to speed the build up and to avoid unnecessary build errors:
Finally, I put
tau.conf files in both
parafem/parafem dir, and in
Not sure if both copies are required, but this seems to work.
I tried only
program, where "std" stands for standard
At this stage coarray collectives are not yet in
the standard, and more importantly, they are not yet
supported in Intel Fortran v. 16, so standard
conforming basically means no collectives.
xx14std.f90 does use MPI/IO though
to dump CA output.
The instrumented program works fine, and here are the
Profiles show good load balancing, but
a total domination of
MPI_Barrier is in the second place, and
cgca_hxi, the halo exchange routine,
exceeds the threshold of 1% of total time.
The orange bar on node 0 is
which overwrites the profiling data from the program.
TAU (Sameer Shende) have sent me a patch to try to fix this.
Finally, below are the jumpshot trace fragments. The ParaFEM part shows a nice pattern. These are the MPI calls which are already present in ParaFEM routines, nothing (hopefully) is added by the compiler.
And here is the CGPACK part - what a mess!
Cannot see any pattern, and all images
are dominated by
23-MAY-2016: Testing TAU fix to stop Intel's
overwriting profile/trace data from node 0.
updated versions of
I rebuilt/reinstalled TAU with these modified files.
There are 10 failures of validation tests, see
However, the MPI Fortran test is ok, so can proceed
to instrument my programs.
Re-testing CGPACK test
I rebuilt CGPACK library and tests with
export TAU_OPTIONS="-optShared -optVerbose -optCompInst"
caflaunch has gone away:
The same for CGPACK+ParaFEM
there is no more
and the data from process 0 seems as expected.
Jumpshot traces: (1) CGPACK part -
(2) ParaFEM part - structured comms pattern: