CGPACK progress, MAY-2017


Feedback and support

With any questions, bug reports, feedback or other comments, please submit a ticket.

1-MAY-2017: Hazel Hen MPI/IO issues

Trying to run testABM: serial IO seems to work, but MPI/IO fails, leaving an incomplete file:

-rw------- 1 ipransht s31806 2407881600 Apr 26 15:52 serial.raw
-rw------- 1 ipransht s31806 1488453632 Apr 26 16:01 mpiio.raw
running on 240 images in a 3D grid
img: 1 nimgs: 240 (93,155,174)[5,6,8] 5999  0.987       464.     (  1.00    ,  2.00    ,  3.00    )
Each image has 2508210 cells
The model has 601970400 cells
 Serial IO:  18.5411568 s, rate:  0.120947935 GB/s.
 DEBUG: fname:mpiio.raw
aprun: Apid 7335749: Caught signal Terminated, sending to application
=>> PBS: job killed: walltime 617 exceeded limit 600
running on 240 images in a 3D grid
img: 1 nimgs: 240 (93,155,174)[5,6,8] 5999  0.987       464.     (  1.00    ,  2.00    ,  3.00    )
Each image has 2508210 cells
The model has 601970400 cells
 Serial IO:  16.0410023 s, rate:  0.139798909 GB/s.
 DEBUG: fname:mpiio.raw
=>> PBS: job killed: walltime 632 exceeded limit 600

So let's try a model derived from testABV, with only solidification left in and no IO. It seems to work on 10 nodes. The XC40, just like the XC30, has 24 cores per node, so 10 nodes give 240 cores:

running on 240 images in a 3D grid
img: 1 nimgs: 240 (58,77,93)[8,6,5] 999  0.987       464.     (   1.00    ,  0.995    ,   1.00    )
Each image has 415338 cells
The model has 99681120 cells
...
Application 7360772 resources: utime ~607s, stime ~48s, Rss ~7268, inblocks ~326607, outblocks ~439919

real    0m7.123s
user    0m0.388s
sys     0m0.248s

For future reference: the HLRS ticket submission page.

3-MAY-2017: Hazel Hen workspace

Finally found the manual page explaining how to use the work filesystems on Hazel Hen. It's quite different from ARCHER (XC30). Hazel Hen calls it the Workspace mechanism.
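The gist, as far as I can tell from the manual (command names are from the HLRS pages; check there for the exact options):

ws_allocate scaling 30   # allocate a workspace called "scaling" for 30 days
ws_find scaling          # print the path of workspace "scaling"
ws_list                  # list my workspaces and their expiry dates
ws_release scaling       # release the workspace when finished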

I'm going to use testABV for scaling tests on Hazel Hen, set up with

! physical dimensions of the box, assume mm
bsz0 = (/ 3.0, 5.0, 7.0 /)

! mean grain size, linear dimension, e.g. mean grain diameter, also mm
dm = 1.0e-1

! resolution
res = 1.0e5

This gives a model with about 10 bn cells (1.0e10).
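A quick sanity check of that figure (assuming, as elsewhere in CGPACK, that res is the number of cells per grain):

box volume = 3.0 * 5.0 * 7.0               = 105 mm^3
grains     = volume / dm**3 = 105 / 1.0e-3 = 1.05e5
cells      = grains * res = 1.05e5 * 1.0e5 = 1.05e10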

4-MAY-2017: Choosing a model for scaling study

No, the 3-MAY box size is too big: I get OOM (out of memory) errors. So let's reduce it a bit:

! physical dimensions of the box, assume mm
bsz0 = (/ 4.0, 5.0, 5.0 /)

This gives 9.998 bn cells, so still about 10 bn. This runs successfully on a single node with

export XT_SYMMETRIC_HEAP_SIZE=3g

The default module craype-hugepages16M is used, i.e. 16 MB huge pages.
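For the record, a minimal PBS script along these lines (the resource request syntax and the executable name are illustrative, not verbatim from my runs):

#!/bin/bash --login
#PBS -l nodes=1:ppn=24
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
export XT_SYMMETRIC_HEAP_SIZE=3g
aprun -n 24 ./testABV.x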

17-MAY-2017: NetCDF with the Intel compiler

The NetCDF Intel modules installed on BlueCrystal phase 3:

libraries/intel_builds/netcdf-4.3-intel-16.par
libraries/intel_builds/netcdf-4.3-par

seem to be deficient: some named constants, e.g. NF90_NETCDF4 and NF90_MPIIO, cannot be found in the Fortran NetCDF module files:

newblue1> mpiifort -coarray=distributed -c \
 -I/cm/shared/libraries/intel_build/netcdf-4-intel-16.par/include \
 cgca_m2netcdf.f90
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must
have an explicit type. [NF90_NETCDF4]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-----------------------------------^
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must
have an explicit type. [NF90_MPIIO]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
------------------------------------------------^
cgca_m2netcdf.f90(157): warning #7319: This argument's data type is
incompatible with this intrinsic procedure; procedure assumed EXTERNAL. [IOR]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-----------------------------------^
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must
have an explicit type. [IOR]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-------------------------------^
cgca_m2netcdf.f90(158): error #6627: This is an actual argument keyword name,
and not a dummy argument name. [COMM]
comm=comm, info=MPI_INFO_NULL))
-------^
cgca_m2netcdf.f90(158): error #6627: This is an actual argument keyword name,
and not a dummy argument name. [INFO]
comm=comm, info=MPI_INFO_NULL))
------------------^
cgca_m2netcdf.f90(176): error #6404: This name does not have a type, and must
have an explicit type. [NF90_DEF_VAR_FILL]
call check ( nf90_def_var_fill(ncid, varid, 1, 1) )
-------------^
cgca_m2netcdf.f90(184): error #6404: This name does not have a type, and must
have an explicit type. [NF90_COLLECTIVE]
call check( nf90_var_par_access(ncid, varid, nf90_collective) )
---------------------------------------------^
cgca_m2netcdf.f90(184): error #6404: This name does not have a type, and must
have an explicit type. [NF90_VAR_PAR_ACCESS]
call check( nf90_var_par_access(ncid, varid, nf90_collective) )
------------^
compilation aborted for cgca_m2netcdf.f90 (code 1)
newblue1> 

See ticket 9 for more details. So I'm going to try building parallel NetCDF with the Intel compiler myself.

NetCDF sits on top of HDF5, so HDF5 needs to be built first.

The HDF5 parallel build instructions are clear (Sec. 2.1), but not sufficient. I had to read a lot of other material online, on the HDF5 pages and elsewhere. Extra flags and extra environment variables are needed:

[mexas@newblue1 soft]$ pwd
/panfs/panasas01/mech/mexas/soft
[mexas@newblue1 soft]$ ls | grep hdf5
hdf5-1.10.1
hdf5-1.10.1-ifort16u2-install
hdf5-1.10.1.tar.bz2
[mexas@newblue1 soft]$ cd hdf5-1.10.1
[mexas@newblue1 hdf5-1.10.1]$ CC=mpiicc FC=mpiifort F9X=mpiifort ./configure \
 --enable-fortran --enable-fortran2003 --enable-parallel \
 --prefix=/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install
[mexas@newblue1 hdf5-1.10.1]$ make >& hdf5-1.10.1-make.log

The build log: hdf5-1.10.1-make.log.

Then, to check the build, I run make check from PBS, which is easier for the MPI (parallel HDF5) checks. I use this simple script (the -i flag tells make to ignore errors and carry on):

#!/bin/bash --login
cd $PBS_O_WORKDIR
make -i check

All tests seem to pass: hdf5-1.10.1-make-check.log!

Finally make install populates the install tree:

[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install
bin  include  lib  share
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/include
H5ACpublic.h  H5FDdirect.h         H5Gpublic.h     H5PLextern.h    H5version.h
h5a.mod       H5FDfamily.h         h5im.mod        H5PLpublic.h    h5z.mod
H5api_adpt.h  H5FDlog.h            h5i.mod         h5p.mod         H5Zpublic.h
H5Apublic.h   H5FDmpi.h            H5IMpublic.h    H5Ppublic.h     hdf5.h
H5Cpublic.h   H5FDmpio.h           H5Ipublic.h     H5PTpublic.h    hdf5_hl.h
h5d.mod       H5FDmulti.h          H5LDpublic.h    H5pubconf.h     hdf5.mod
H5DOpublic.h  H5FDpublic.h         h5lib.mod       H5public.h      tstds.mod
H5Dpublic.h   H5FDsec2.h           h5l.mod         h5r.mod         tstds_tests.mod
h5ds.mod      H5FDstdio.h          H5Lpublic.h     H5Rpublic.h     tstimage.mod
H5DSpublic.h  h5f.mod              h5lt_const.mod  h5s.mod         tstimage_tests.mod
h5e.mod       h5fortkit.mod        h5lt.mod        H5Spublic.h     tstlite.mod
H5Epubgen.h   h5fortran_types.mod  H5LTpublic.h    h5tb_const.mod  tstlite_tests.mod
H5Epublic.h   H5Fpublic.h          H5MMpublic.h    h5tb.mod        tsttable.mod
H5f90i_gen.h  h5_gen.mod           h5o.mod         H5TBpublic.h    tsttable_tests.mod
H5f90i.h      h5global.mod         H5Opublic.h     h5t.mod
H5FDcore.h    h5g.mod              H5overflow.h    H5Tpublic.h
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/lib/
libdynlib1.la    libdynlibdiff.la   libhdf5_fortran.la            libhdf5_hl.la
libdynlib1.so    libdynlibdiff.so   libhdf5_fortran.so            libhdf5_hl.so
libdynlib2.la    libdynlibdump.la   libhdf5_fortran.so.100        libhdf5_hl.so.100
libdynlib2.so    libdynlibdump.so   libhdf5_fortran.so.100.1.0    libhdf5_hl.so.100.0.1
libdynlib3.la    libdynlibls.la     libhdf5_hl.a                  libhdf5.la
libdynlib3.so    libdynlibls.so     libhdf5hl_fortran.a           libhdf5.settings
libdynlib4.la    libdynlibvers.la   libhdf5hl_fortran.la          libhdf5.so
libdynlib4.so    libdynlibvers.so   libhdf5hl_fortran.so          libhdf5.so.101
libdynlibadd.la  libhdf5.a          libhdf5hl_fortran.so.100      libhdf5.so.101.0.0
libdynlibadd.so  libhdf5_fortran.a  libhdf5hl_fortran.so.100.0.1
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/bin/
gif2h5   h5copy   h5dump            h5jam    h5pcc          h5pfc       h5repart  h5watch
h52gif   h5debug  h5format_convert  h5ls     h5perf         h5redeploy  h5stat    ph5diff
h5clear  h5diff   h5import          h5mkgrp  h5perf_serial  h5repack    h5unjam

So now I can use this include flag when building CGPACK:

-I$(HOME)/soft/hdf5-1.10.1-ifort16u2-install/include

and these link flags when linking CGPACK programs:

-L$(HOME)/soft/hdf5-1.10.1-ifort16u2-install/lib -lhdf5 -lhdf5_fortran

However, this is not enough. I also need to make sure the dynamic libraries are found at run time, so this line has to be added to .bashrc (exported, so that child processes inherit it):

export LD_LIBRARY_PATH=$HOME/soft/hdf5-1.10.1-ifort16u2-install/lib:$LD_LIBRARY_PATH
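Putting it all together, the relevant fragment of the CGPACK build settings looks like this (a sketch; the variable names are mine):

HDF5_DIR = $(HOME)/soft/hdf5-1.10.1-ifort16u2-install
FFLAGS  += -I$(HDF5_DIR)/include
LDFLAGS += -L$(HDF5_DIR)/lib -lhdf5 -lhdf5_fortran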

Finally I can build a CGPACK executable, test_hdf5, with all libs found:

[mexas@newblue4 tests]$ ldd test_hdf5.x
        linux-vdso.so.1 =>  (0x00002aaaaaacb000)
        libhdf5.so.101 => /panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib/libhdf5.so.101 (0x00002aaaaaccd000)
        libhdf5_fortran.so.100 => /panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib/libhdf5_fortran.so.100 (0x00002aaaab38b000)
        libmpifort.so.12 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002aaaab5df000)
        libmpi.so.12 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/lib/libmpi.so.12 (0x00002aaaab97d000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaac167000)
        librt.so.1 => /lib64/librt.so.1 (0x00002aaaac36c000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaac574000)
        libicaf.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libicaf.so (0x00002aaaac791000)
        libm.so.6 => /lib64/libm.so.6 (0x00002aaaac9e8000)
        libc.so.6 => /lib64/libc.so.6 (0x00002aaaacc6c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaad000000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003c7d200000)
        libimf.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libimf.so (0x00002aaaad217000)
        libsvml.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libsvml.so (0x00002aaaad713000)
        libirng.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libirng.so (0x00002aaaae5d1000)
        libintlc.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libintlc.so.5 (0x00002aaaae931000)
        libifport.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libifport.so.5 (0x00002aaaaeb9e000)
        libifcore.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libifcore.so.5 (0x00002aaaaedcd000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

18-MAY-2017: Continuing HDF5, NetCDF

I made a new test just for HDF5: test_hdf5.f90 (or, in the robodoc HTML documentation: test_hdf5). The test dumps the model with the serial writer, subroutine cgca_swci from module cgca_m2out, and with the HDF5 writer, subroutine cgca_pswci4 from module cgca_m2hdf5.
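For reference, the heart of any parallel HDF5 writer is the MPI-IO file access property list. A minimal sketch, not the actual cgca_pswci4 code (the file name is arbitrary):

program hdf5_par_min
! Minimal parallel HDF5 file creation: all ranks open the same file
! collectively through the MPI-IO driver.
use mpi
use hdf5
implicit none
integer :: ierr
integer( kind=hid_t ) :: fapl, file_id

call MPI_Init( ierr )
call h5open_f( ierr )                             ! initialise HDF5 Fortran interface
call h5pcreate_f( H5P_FILE_ACCESS_F, fapl, ierr ) ! file access property list
call h5pset_fapl_mpio_f( fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr )
call h5fcreate_f( "min.h5", H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl )
! ... define dataspaces and write hyperslabs collectively here ...
call h5pclose_f( fapl, ierr )
call h5fclose_f( file_id, ierr )
call h5close_f( ierr )
call MPI_Finalize( ierr )
end program hdf5_par_min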

Then, following the diary entry from 13-NOV-2016, I can visualise the model.

Now back to building NetCDF. I tried before but gave up; see the diary entries starting 9-MAR-2017. Let's try again.

Some guidance, though not very clear, is available from NetCDF:

http://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html#build_parallel

https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-install/Configure.html

After some experimentation I configured like this:

CC=mpiicc CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include \
 LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib \
 ./configure --prefix=/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install \
 --enable-shared --enable-parallel-tests --enable-fortran --enable-netcdf-4

This resulted in:

# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:         4.4.1.1
Configured On:          Thu May 18 15:32:00 BST 2017
Host System:            x86_64-unknown-linux-gnu
Build Directory:        /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1
Install Prefix:         /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install

# Compiling Options
-----------------
C Compiler:             /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpiicc
CFLAGS:
CPPFLAGS:               -I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include
LDFLAGS:                -L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib
AM_CFLAGS:
AM_CPPFLAGS:
AM_LDFLAGS:
Shared Library:         yes
Static Library:         yes
Extra libraries:        -lhdf5_hl -lhdf5 -ldl -lm -lcurl 

# Features
--------
NetCDF-2 API:           yes
HDF4 Support:           no
NetCDF-4 API:           yes
NC-4 Parallel Support:  yes
PNetCDF Support:        no
DAP Support:            yes
Diskless Support:       yes
MMap Support:           no
JNA Support:            no

Seems fine. Then make ran without problems.

However, make check failed with:

make[2]: `tst_h_dimscales4' is up to date.
depbase=`echo tst_h_par.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
        mpiicc -DHAVE_CONFIG_H -I. -I..  -I../include -I../oc2  
-I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include   -MT 
tst_h_par.o -MD -MP -MF $depbase.Tpo -c -o tst_h_par.o tst_h_par.c &&\
        mv -f $depbase.Tpo $depbase.Po
tst_h_par.c(89): error: identifier "ERR" is undefined
        if ((fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0) ERR;
                                                        ^

tst_h_par.c(226): error: identifier "SUMMARIZE_ERR" is undefined
        SUMMARIZE_ERR;
        ^

tst_h_par.c(231): error: identifier "FINAL_RESULTS" is undefined
        FINAL_RESULTS;
        ^

compilation aborted for tst_h_par.c (code 2)
make[2]: *** [tst_h_par.o] Error 2
make[2]: Leaving directory 
`/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1/h5_test'

Apparently this is caused by a bug with --enable-parallel-tests:

http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2017/msg00019.html

So let's configure without this option:

CC=mpiicc \
CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include \
LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib \
./configure \
--prefix=/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install \
--enable-shared --enable-fortran --enable-netcdf-4

And this seems to pass all tests: netcdf-c-4.4.1.1-make-check.log.

So after make install I get:

[mexas@newblue1 ~]$ nc-config --all

This netCDF 4.4.1.1 has been built with the following features: 

  --cc        -> mpiicc
  --cflags    -> -I/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include -I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include
  --libs      -> -L/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib -lnetcdf

  --has-c++   -> no
  --cxx       -> 

  --has-c++4  -> no
  --cxx4      -> 

  --has-fortran-> no
  --has-dap   -> yes
  --has-nc2   -> yes
  --has-nc4   -> yes
  --has-hdf5  -> yes
  --has-hdf4  -> no
  --has-logging-> no
  --has-pnetcdf-> no
  --has-szlib -> 

  --prefix    -> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install
  --includedir-> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include
  --libdir    -> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib
  --version   -> netCDF 4.4.1.1

Now onto NetCDF Fortran. (Note that nc-config reports --has-fortran as no: since NetCDF 4.2 the Fortran bindings are distributed as a separate package, netcdf-fortran.) The instructions are:

http://www.unidata.ucar.edu/software/netcdf/docs/building_netcdf_fortran.html

I configure with:

CC=mpiicc FC=mpiifort \
CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include \
LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib \
./configure \
--prefix=/panfs/panasas01/mech/mexas/soft/netcdf-fortran-4.4.4-ifort-install

Then make completes with no errors, and make check seems to pass all tests: netcdf-fortran-4.4.4-make-check.log.

19-MAY-2017: NetCDF with Intel seems to work

I made a new test just for NetCDF: test_netcdf.f90 (or, in the robodoc HTML documentation: test_netcdf). It works fine on a single node, with a speedup of nearly 80 times over the serial version:

running on 16 images in a 3D grid
img: 1 nimgs: 16 (77,116,77)[2,2,4] 111  0.867       155.     ( 0.995    ,  1.50    ,  1.99    )
Each image has 687764 cells
The model has 11004224 cells
 Serial time:    17.5471229553223      s. Rate:   2.336219391274498E-003 GB/s.
 NetCDF time:  0.227143049240112      s, rate:   0.180476263951036      GB/s.

However, test_netcdf doesn't work across multiple nodes: it fails with MPI errors. I'm investigating with the BlueCrystal people.
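For reference, the parallel NetCDF sequence the test exercises is essentially the one from cgca_m2netcdf quoted on 17-MAY. A minimal standalone sketch (dimension names and sizes are made up):

program netcdf_par_min
! Minimal parallel NetCDF-4 output: collective write of a 3D integer array.
! In a real code each rank writes only its own chunk via start= and count=.
use mpi
use netcdf
implicit none
integer :: ierr, ncid, varid, dimids(3)
integer :: arr(10,10,10)

call MPI_Init( ierr )
arr = 1
call check( nf90_create( "min.nc", ior( nf90_netcdf4, nf90_mpiio ), ncid,     &
  comm=MPI_COMM_WORLD, info=MPI_INFO_NULL ) )
call check( nf90_def_dim( ncid, "x", 10, dimids(1) ) )
call check( nf90_def_dim( ncid, "y", 10, dimids(2) ) )
call check( nf90_def_dim( ncid, "z", 10, dimids(3) ) )
call check( nf90_def_var( ncid, "space", nf90_int, dimids, varid ) )
call check( nf90_var_par_access( ncid, varid, nf90_collective ) )
call check( nf90_enddef( ncid ) )
call check( nf90_put_var( ncid, varid, arr ) )
call check( nf90_close( ncid ) )
call MPI_Finalize( ierr )

contains

subroutine check( status )
  integer, intent( in ) :: status
  if ( status .ne. nf90_noerr ) then
    write (*,*) trim( nf90_strerror( status ) )
    error stop
  end if
end subroutine check

end program netcdf_par_min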
