CGPACK diary, JAN-2013


15-JAN-2013: Few simple observations.

At this time the Cray Fortran compiler seems to be the most advanced. Our Intel compiler allows us to run only shared memory coarray programs. A distributed memory version of the Intel compiler exists, but we do not have it. It might become available on phase 3, possibly in Spring 2013. Right now we have ifort 12.1.2 20111128 on phase 1 and ifort 12.0.2 20110112 on phase 2. I know of no other compiler ready to use with Fortran coarrays. Let us know if you know of any (mexas@bris.ac.uk).

For more details on Intel coarray support see: Using Coarray Fortran and Distributed Memory Coarray Fortran with the Intel Fortran Compiler for Linux: Essential Guide.

Intel mpdboot and mpdallexit are not needed for shared memory execution.

A simple example from phase 1. The code:

babyblue2% cat coarray1.f90 
real :: z[*]

if (this_image()==1) write (*,*) "from image 1: there are", num_images(), "images"
sync all

z=this_image()
write (*,'(a,i0,a,f0.0)') 'image ', this_image(), ', value: ',z

end
babyblue2%

Compiling:

babyblue2% ifort -coarray=shared -coarray-config-file=coarr.conf coarray1.f90
babyblue2%

The -coarray option is required by the compiler. Without it the coarray language elements cannot be processed (which is stupid, since coarrays are a standard feature and should be supported by default).

-coarray-config-file is optional, but useful. It associates a coarray config file with the executable, so that the run-time options can be changed by simply editing the config file, with no recompilation needed. In this simple example:

babyblue2% cat coarr.conf 
-envall -n 16 ./a.out
babyblue2%

where -envall "copies your current environment variables to the environment of your CAF processes. HIGHLY RECOMMENDED." (from Intel manual).

-n specifies how many images will be created. This is required.

Finally the name of the executable is given. This obviously must match the name given at the compilation stage.

Anyway, that's it:

babyblue2% ./a.out 
 from image 1: there are          16 images
image 2, value: 2.
image 5, value: 5.
image 1, value: 1.
image 4, value: 4.
image 13, value: 13.
image 6, value: 6.
image 12, value: 12.
image 3, value: 3.
image 7, value: 7.
image 8, value: 8.
image 16, value: 16.
image 9, value: 9.
image 14, value: 14.
image 10, value: 10.
image 15, value: 15.
image 11, value: 11.
babyblue2%

Or the job can be submitted to the queue as usual:

babyblue2% cat z.sh
#!/bin/sh
#PBS -l walltime=00:01:00,nodes=1:ppn=1
#PBS -j oe
#PBS -m abe

cd $HOME
./a.out
babyblue2% 
babyblue2% qsub -qveryshort z.sh
263042.bluecrystal1.cm.cluster
babyblue2% qstat -u $USER

bluecrystal1.cm.cluster: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
263042.bluecryst     mexas    veryshor z.sh                407     1   1    --  00:01 R   -- 
babyblue2%

and when the job is finished:

babyblue2% cat z.sh.o263042 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 from image 1: there are          16 images
image 1, value: 1.
image 9, value: 9.
image 14, value: 14.
image 10, value: 10.
image 6, value: 6.
image 2, value: 2.
image 13, value: 13.
image 7, value: 7.
image 4, value: 4.
image 3, value: 3.
image 15, value: 15.
image 5, value: 5.
image 11, value: 11.
image 12, value: 12.
image 8, value: 8.
image 16, value: 16.
babyblue2%

The config file way of doing things (-coarray-config-file) is one option. It is better (in my opinion) to use the environment variable FOR_COARRAY_NUM_IMAGES to specify the number of images at run time. For this, -coarray-config-file should not be used at compilation, or at least the -n option should not be specified in the config file. For example:

bigblue2> ifort -coarray=shared try1.f90
bigblue2>

Then this can be run with e.g. 8 images:

bigblue2> cat bc.sh 
#!/bin/sh
#PBS -l walltime=00:01:00,nodes=1:ppn=1
#PBS -j oe
#PBS -m abe

export FOR_COARRAY_NUM_IMAGES=8

cd $HOME/nobackup/cgpack/branches/coarray
./a.out
bigblue2>
bigblue2> cat bc.sh.o1746803 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 from image 1: there are           8 images
image 4, value: 4.
image 2, value: 2.
image 6, value: 6.
image 7, value: 7.
image 8, value: 8.
image 3, value: 3.
image 1, value: 1.
image 5, value: 5.
bigblue2>

or changed to e.g. 10 images:

bigblue2> cat bc.sh
#!/bin/sh
#PBS -l walltime=00:01:00,nodes=1:ppn=1
#PBS -j oe
#PBS -m abe

export FOR_COARRAY_NUM_IMAGES=10

cd $HOME/nobackup/cgpack/branches/coarray
./a.out
bigblue2> 
bigblue2> cat bc.sh.o1746804
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
 from image 1: there are          10 images
image 1, value: 1.
image 2, value: 2.
image 10, value: 10.
image 6, value: 6.
image 4, value: 4.
image 8, value: 8.
image 9, value: 9.
image 7, value: 7.
image 5, value: 5.
image 3, value: 3.
bigblue2>
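
A quick way to check that the environment variable is what sets the image count is to have image 1 report both values. This is only a minimal sketch, not part of CGPACK. The program name check_images is made up for illustration, and FOR_COARRAY_NUM_IMAGES is Intel-specific, so with another compiler the variable will most likely be unset.

```fortran
program check_images
  implicit none
  character(len=32) :: val
  integer :: length, stat

  if (this_image() == 1) then
    ! Standard Fortran 2003 intrinsic; stat is 0 only if the variable
    ! exists and its value fits into val.
    call get_environment_variable('FOR_COARRAY_NUM_IMAGES', val, length, stat)
    if (stat == 0) then
      write (*,'(3a,i0)') 'FOR_COARRAY_NUM_IMAGES=', trim(val), &
        ', num_images()=', num_images()
    else
      write (*,'(a,i0)') 'FOR_COARRAY_NUM_IMAGES not set or unreadable, num_images()=', &
        num_images()
    end if
  end if
end program check_images
```

Compiled with ifort -coarray=shared and run from a job script like bc.sh above, the two numbers should agree.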

23-JAN-2013: Some timings.

This example is from the EPCC coarray training course. An edge image is given and the program reconstructs the original image by double integration. The image size is 672 x 1024 pixels. The integration is carried out for a fixed 100k iterations. No convergence check is made. The results are shown below.

Timings (speedup) for EPCC example ex3b - image reconstruction from the given edges

These results are from BlueCrystal phase 2. Our Intel license does not allow distributed memory for coarrays, so these results are from a single node with 8 cores (shared memory). The time is the wallclock time (elapsed or real time).

Things to note:

The general trend is as expected, i.e. there is some speedup when using more images. Speedup of about 6 is obtained when using 8 cores.
The [:,*] coarray model works even if one or both of the codimensions is 1, i.e. [1,1], [3,1], [1,7]. This makes writing coarray programs even easier, because a 1D array is processed correctly as a special case of a 2D array.
The single image time is from [1,1] coarray, *not* from a purely serial code, i.e. the time is probably higher than would've been for a serial code with no coarray overheads.
As expected, for a given number of images, a 2D arrangement of images is faster than 1D, i.e. [2,3] is faster than [6,1], and [4,2] is faster than [1,8].
Somewhat unexpectedly, for 1D arrangement of images, the [*,1] layout is faster than [1,*]. Perhaps this is to do with column-major array memory storage in Fortran? If you know why, let me know (mexas@bris.ac.uk).
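
The layouts in these notes can be inspected with a short sketch like the one below. It is not the EPCC code, just an illustration under my own assumptions: the program name cogrid and the choice of layouts are made up. Note that in real Fortran the asterisk must be the last codimension, so the "[*,1]" layout of n images has to be written as [n,*].

```fortran
program cogrid
  implicit none
  real, allocatable :: z[:,:]
  integer :: rows(3), k, cosub(2)

  ! Three layouts of the same set of n images:
  ! [1,*] (1 x n), [n,*] (n x 1) and [2,*] (2 x n/2).
  rows = [1, num_images(), 2]

  do k = 1, 3
    ! Skip grids that the images do not fill completely.
    ! The test gives the same result on every image.
    if (mod(num_images(), rows(k)) /= 0) cycle
    allocate (z[rows(k),*])
    cosub = this_image(z)   ! cosubscripts of this image in the grid
    write (*,'(4(a,i0),a)') 'layout with ', rows(k), ' rows: image ', &
      this_image(), ' is at [', cosub(1), ',', cosub(2), ']'
    deallocate (z)
  end do
end program cogrid
```

The first cosubscript varies fastest with image number, in the same way as the first subscript of an ordinary array.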

31-JAN-2013: making a large array of coarrays.

The idea is to have a coarray declared as

allocatable :: coarray(:,:,:)[:,:,:]

then allocate according to needs, e.g.:

allocate(coarray(dim1,dim2,dim3)[codim1,codim2,*])

Here dim1, dim2, dim3 are the array dimensions on each image, and codim1 and codim2 are the codimensions, i.e. dimensions 1 and 2 of the grid of images.

The following example was run on 8 images with

allocate(coarray(10,10,10)[2,2,*])

which means the third (final) codimension is also 2. On each image an array of 10*10*10=1000 elements is then created. My logic is then to think of those arrays arranged in a 2x2x2 grid. The dimensions of this super array are then 20x20x20, i.e. 8000 elements.
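
The arithmetic above can be checked at run time with the standard lcobound and ucobound intrinsics. This is a minimal sketch under my own assumptions (the program name codims and the integer element type are chosen just for illustration), not the CGPACK code.

```fortran
program codims
  implicit none
  integer, allocatable :: coarray(:,:,:)[:,:,:]
  integer :: coext(3)

  allocate (coarray(10,10,10)[2,2,*])

  ! Extent of each codimension. The final one is not given in the
  ! allocate statement, it follows from the number of images.
  coext = ucobound(coarray) - lcobound(coarray) + 1

  if (this_image() == 1) then
    write (*,'(a,3(1x,i0))') 'codimension extents:', coext
    write (*,'(a,3(1x,i0))') 'super array shape:  ', shape(coarray) * coext
  end if

  deallocate (coarray)
end program codims
```

On 8 images this should report extents 2 2 2 and a super array shape of 20 20 20. With a different number of images only the third extent changes.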

The picture below was obtained with ParaView. Each image assigns this_image() to all elements of its local coarray. Then a routine is called from image 1 that writes the coarrays from all images in order. The resulting binary file describes the super array.

Note from the picture that the super array is arranged in the natural Fortran "array element order. It is obtained by counting most rapidly in the early dimensions" (from Metcalf et al (2011) Modern Fortran Explained, Oxford). In the picture "X" corresponds to dimension 1 of the array, "Y" corresponds to dimension 2 of the array, and "Z" corresponds to dimension 3 of the array. One can trace this logic by following the colours and the values, which are simply image numbers. The array from image 1 is completely hidden from view.

A super array obtained by joining 8 coarrays from 8 images
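
One possible way of writing such a file is sketched below. This is not the actual CGPACK routine, only my guess at a simple scheme that produces the super array in array element order: image 1 walks through the super array one line along dimension 1 at a time and pulls each line from the image that owns it. The program name, the file name super.raw and the integer element type are all assumptions made for illustration.

```fortran
program superarray
  implicit none
  integer, parameter :: n = 10     ! local extent, as in the example above
  integer, allocatable :: coarray(:,:,:)[:,:,:]
  integer :: buf(n), coext(3), c1, c2, c3, i2, i3, funit

  ! A [2,2,*] grid is filled completely only if the number of images
  ! is a multiple of 4.
  if (mod(num_images(), 4) /= 0) error stop 'need a multiple of 4 images'

  allocate (coarray(n,n,n)[2,2,*])
  coarray = this_image()
  sync all                         ! all data must be set before image 1 reads it

  if (this_image() == 1) then
    coext = ucobound(coarray) - lcobound(coarray) + 1
    open (newunit=funit, file='super.raw', access='stream', &
          form='unformatted', status='replace')
    ! Loop order, outermost to innermost: super array dimension 3,
    ! then 2, then 1. Each super array dimension is split into
    ! a cosubscript (c) and a local subscript (i).
    do c3 = 1, coext(3)
      do i3 = 1, n
        do c2 = 1, coext(2)
          do i2 = 1, n
            do c1 = 1, coext(1)
              buf = coarray(:, i2, i3)[c1, c2, c3]   ! remote read
              write (funit) buf
            end do
          end do
        end do
      end do
    end do
    close (funit)
  end if

  sync all                         ! keep the data alive until image 1 is done
  deallocate (coarray)
end program superarray
```

With stream access the file contains no record markers, so it should be readable as a raw array of (2n)^3 default integers, provided the reader is told the dimensions.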
