CGPACK > MAY-2017
With any questions, bug reports, feedback or other comments, please submit a ticket.
1-MAY-2017: Hazel Hen MPI/IO issues
Trying to run testABM: serial IO seems to work, but MPI/IO fails, leaving an incomplete file:
-rw------- 1 ipransht s31806 2407881600 Apr 26 15:52 serial.raw
-rw------- 1 ipransht s31806 1488453632 Apr 26 16:01 mpiio.raw
running on 240 images in a 3D grid
img: 1 nimgs: 240 (93,155,174)[5,6,8] 5999 0.987 464. ( 1.00 , 2.00 , 3.00 )
Each image has 2508210 cells
The model has 601970400 cells
Serial IO: 18.5411568 s, rate: 0.120947935 GB/s.
DEBUG: fname:mpiio.raw
aprun: Apid 7335749: Caught signal Terminated, sending to application
=>> PBS: job killed: walltime 617 exceeded limit 600

running on 240 images in a 3D grid
img: 1 nimgs: 240 (93,155,174)[5,6,8] 5999 0.987 464. ( 1.00 , 2.00 , 3.00 )
Each image has 2508210 cells
The model has 601970400 cells
Serial IO: 16.0410023 s, rate: 0.139798909 GB/s.
DEBUG: fname:mpiio.raw
=>> PBS: job killed: walltime 632 exceeded limit 600
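A quick check of the numbers, assuming 4 bytes per cell: 601970400 cells * 4 = 2407881600 bytes, which is exactly the size of serial.raw, so the serial file is complete. mpiio.raw, at 1488453632 bytes, holds only about 62% of the model.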
So let's try a model derived from testABV, with only solidification left in and no IO. It seems to work on 10 nodes. XC40, just like XC30, has 24 cores per node, so I am using 240 cores:
running on 240 images in a 3D grid
img: 1 nimgs: 240 (58,77,93)[8,6,5] 999 0.987 464. ( 1.00 , 0.995 , 1.00 )
Each image has 415338 cells
The model has 99681120 cells
...
Application 7360772 resources: utime ~607s, stime ~48s, Rss ~7268, inblocks ~326607, outblocks ~439919

real 0m7.123s
user 0m0.388s
sys 0m0.248s
For future reference: the HLRS ticket submission page.
3-MAY-2017: Hazel Hen workspace
Finally I got to a manual page explaining how to use the work filesystems on Hazel Hen. It's quite different from ARCHER (XC30). HLRS call it the Workspace mechanism.
I'm going to use testABV for scaling tests on Hazel Hen, set up with:
! physical dimensions of the box, assume mm
bsz0 = (/ 3.0, 5.0, 7.0 /)
! mean grain size, linear dimension, e.g. mean grain diameter, also mm
dm = 1.0e-1
! resolution
res = 1.0e5
This gives a model with about 10 bn cells (1.0e10).
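A quick sanity check, on my reading of these inputs (res is the number of cells per grain, and the number of grains is taken as the box volume over dm**3): the box volume is 3.0 * 5.0 * 7.0 = 105 mm^3, giving 105 / 0.1**3 = 1.05e5 grains, and 1.05e5 * 1.0e5 = 1.05e10 cells, i.e. about 10 bn.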
4-MAY-2017: Choosing a model for scaling study
No, the 3-MAY box size is too big: I get OOM. So let's reduce it a bit:
! physical dimensions of the box, assume mm
bsz0 = (/ 4.0, 5.0, 5.0 /)

This gives 9.998 bn cells, so still about 10 bn. This runs successfully on a single node with:
export XT_SYMMETRIC_HEAP_SIZE=3g
The default module craype-hugepages16M, i.e. 16 MB huge pages, is used.
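With 24 images on the node this asks for up to 24 * 3 = 72 GB of symmetric heap, which should fit in the 128 GB of a Hazel Hen compute node, if I have the node spec right, while leaving room for non-coarray data.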
17-MAY-2017: NetCDF with the Intel compiler
The NetCDF Intel modules installed on BlueCrystal phase 3:
libraries/intel_builds/netcdf-4.3-intel-16.par
libraries/intel_builds/netcdf-4.3-par
seem to be deficient: some names cannot be found in the Fortran NetCDF module files, e.g.:
newblue1> mpiifort -coarray=distributed -c -I/cm/shared/libraries/intel_build/netcdf-4-intel-16.par/include cgca_m2netcdf.f90
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must have an explicit type.   [NF90_NETCDF4]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-----------------------------------^
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must have an explicit type.   [NF90_MPIIO]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
------------------------------------------------^
cgca_m2netcdf.f90(157): warning #7319: This argument's data type is incompatible with this intrinsic procedure; procedure assumed EXTERNAL.   [IOR]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-----------------------------------^
cgca_m2netcdf.f90(157): error #6404: This name does not have a type, and must have an explicit type.   [IOR]
call check( nf90_create(fname, ior(nf90_netcdf4,nf90_mpiio), ncid, &
-------------------------------^
cgca_m2netcdf.f90(158): error #6627: This is an actual argument keyword name, and not a dummy argument name.   [COMM]
comm=comm, info=MPI_INFO_NULL))
-------^
cgca_m2netcdf.f90(158): error #6627: This is an actual argument keyword name, and not a dummy argument name.   [INFO]
comm=comm, info=MPI_INFO_NULL))
------------------^
cgca_m2netcdf.f90(176): error #6404: This name does not have a type, and must have an explicit type.   [NF90_DEF_VAR_FILL]
call check ( nf90_def_var_fill(ncid, varid, 1, 1) )
-------------^
cgca_m2netcdf.f90(184): error #6404: This name does not have a type, and must have an explicit type.   [NF90_COLLECTIVE]
call check( nf90_var_par_access(ncid, varid, nf90_collective) )
---------------------------------------------^
cgca_m2netcdf.f90(184): error #6404: This name does not have a type, and must have an explicit type.   [NF90_VAR_PAR_ACCESS]
call check( nf90_var_par_access(ncid, varid, nf90_collective) )
------------^
compilation aborted for cgca_m2netcdf.f90 (code 1)
newblue1>
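For what it's worth, a minimal program like this (my own sketch, not CGPACK code, with a made-up file name) can serve as a quick check of whether an installed netcdf-fortran is a parallel build. It uses only the parallel parts of the API - nf90_mpiio and the comm/info arguments of nf90_create - so against a serial build it fails to compile in exactly the same way as above:

program chk_par_nc
  use netcdf
  use mpi
  implicit none
  integer :: ierr, ncid, stat
  call MPI_Init( ierr )
  ! nf90_mpiio and the comm/info arguments exist only in a parallel build
  stat = nf90_create( "chk.nc", ior( nf90_netcdf4, nf90_mpiio ), ncid, &
         comm=MPI_COMM_WORLD, info=MPI_INFO_NULL )
  if ( stat .eq. nf90_noerr ) stat = nf90_close( ncid )
  write (*,*) trim( nf90_strerror( stat ) )
  call MPI_Finalize( ierr )
end program chk_par_nc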
See ticket 9 for more details. So I'm trying to build parallel NetCDF with the Intel compiler myself.
NetCDF sits on top of HDF5, so I need to build HDF5 first.
The HDF5 parallel build instructions are clear (Sec. 2.1), but not sufficient. I had to read lots of other material online, on the HDF5 pages and elsewhere. Extra flags and extra environment variables are needed:
[mexas@newblue1 soft]$ pwd
/panfs/panasas01/mech/mexas/soft
[mexas@newblue1 soft]$ ls | grep hdf5
hdf5-1.10.1
hdf5-1.10.1-ifort16u2-install
hdf5-1.10.1.tar.bz2
[mexas@newblue1 soft]$ cd hdf5-1.10.1
[mexas@newblue1 hdf5-1.10.1]$ CC=mpiicc FC=mpiifort F9X=mpiifort ./configure \
  --enable-fortran --enable-fortran2003 --enable-parallel \
  --prefix=/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install
[mexas@newblue1 hdf5-1.10.1]$ make >& hdf5-1.10.1-make.log
The build log: hdf5-1.10.1-make.log.

Then, to check the build, I run make check from PBS, which is easier for the MPI (parallel HDF5) checks. I use this simple script (the -i flag means ignore failures and continue):
#!/bin/bash --login
cd $PBS_O_WORKDIR
make -i check
All tests seem to pass: hdf5-1.10.1-make-check.log!
Finally, make install populates the install tree:
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install
bin  include  lib  share
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/include
H5ACpublic.h  H5FDdirect.h  H5Gpublic.h  H5PLextern.h  H5version.h
h5a.mod  H5FDfamily.h  h5im.mod  H5PLpublic.h  h5z.mod
H5api_adpt.h  H5FDlog.h  h5i.mod  h5p.mod  H5Zpublic.h
H5Apublic.h  H5FDmpi.h  H5IMpublic.h  H5Ppublic.h  hdf5.h
H5Cpublic.h  H5FDmpio.h  H5Ipublic.h  H5PTpublic.h  hdf5_hl.h
h5d.mod  H5FDmulti.h  H5LDpublic.h  H5pubconf.h  hdf5.mod
H5DOpublic.h  H5FDpublic.h  h5lib.mod  H5public.h  tstds.mod
H5Dpublic.h  H5FDsec2.h  h5l.mod  h5r.mod  tstds_tests.mod
h5ds.mod  H5FDstdio.h  H5Lpublic.h  H5Rpublic.h  tstimage.mod
H5DSpublic.h  h5f.mod  h5lt_const.mod  h5s.mod  tstimage_tests.mod
h5e.mod  h5fortkit.mod  h5lt.mod  H5Spublic.h  tstlite.mod
H5Epubgen.h  h5fortran_types.mod  H5LTpublic.h  h5tb_const.mod  tstlite_tests.mod
H5Epublic.h  H5Fpublic.h  H5MMpublic.h  h5tb.mod  tsttable.mod
H5f90i_gen.h  h5_gen.mod  h5o.mod  H5TBpublic.h  tsttable_tests.mod
H5f90i.h  h5global.mod  H5Opublic.h  h5t.mod
H5FDcore.h  h5g.mod  H5overflow.h  H5Tpublic.h
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/lib/
libdynlib1.la  libdynlibdiff.la  libhdf5_fortran.la  libhdf5_hl.la
libdynlib1.so  libdynlibdiff.so  libhdf5_fortran.so  libhdf5_hl.so
libdynlib2.la  libdynlibdump.la  libhdf5_fortran.so.100  libhdf5_hl.so.100
libdynlib2.so  libdynlibdump.so  libhdf5_fortran.so.100.1.0  libhdf5_hl.so.100.0.1
libdynlib3.la  libdynlibls.la  libhdf5_hl.a  libhdf5.la
libdynlib3.so  libdynlibls.so  libhdf5hl_fortran.a  libhdf5.settings
libdynlib4.la  libdynlibvers.la  libhdf5hl_fortran.la  libhdf5.so
libdynlib4.so  libdynlibvers.so  libhdf5hl_fortran.so  libhdf5.so.101
libdynlibadd.la  libhdf5.a  libhdf5hl_fortran.so.100  libhdf5.so.101.0.0
libdynlibadd.so  libhdf5_fortran.a  libhdf5hl_fortran.so.100.0.1
[mexas@newblue4 soft]$ ls hdf5-1.10.1-ifort16u2-install/bin/
gif2h5  h5copy  h5dump  h5jam  h5pcc  h5pfc  h5repart  h5watch
h52gif  h5debug  h5format_convert  h5ls  h5perf  h5redeploy  h5stat  ph5diff
h5clear  h5diff  h5import  h5mkgrp  h5perf_serial  h5repack  h5unjam
So now I can use this include flag when building CGPACK:
-I$(HOME)/soft/hdf5-1.10.1-ifort16u2-install/include
and these link flags when linking CGPACK programs:
-L$(HOME)/soft/hdf5-1.10.1-ifort16u2-install/lib -lhdf5 -lhdf5_fortran
However, this is not enough. I also need to make sure the dynamic libs are found at run time, so I add this line to .bashrc:
export LD_LIBRARY_PATH=$HOME/soft/hdf5-1.10.1-ifort16u2-install/lib:$LD_LIBRARY_PATH
Finally I can build a CGPACK executable, test_hdf5, with all libs found:
[mexas@newblue4 tests]$ ldd test_hdf5.x
  linux-vdso.so.1 =>  (0x00002aaaaaacb000)
  libhdf5.so.101 => /panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib/libhdf5.so.101 (0x00002aaaaaccd000)
  libhdf5_fortran.so.100 => /panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib/libhdf5_fortran.so.100 (0x00002aaaab38b000)
  libmpifort.so.12 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/lib/libmpifort.so.12 (0x00002aaaab5df000)
  libmpi.so.12 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/lib/libmpi.so.12 (0x00002aaaab97d000)
  libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaac167000)
  librt.so.1 => /lib64/librt.so.1 (0x00002aaaac36c000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaac574000)
  libicaf.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libicaf.so (0x00002aaaac791000)
  libm.so.6 => /lib64/libm.so.6 (0x00002aaaac9e8000)
  libc.so.6 => /lib64/libc.so.6 (0x00002aaaacc6c000)
  libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaad000000)
  libz.so.1 => /lib64/libz.so.1 (0x0000003c7d200000)
  libimf.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libimf.so (0x00002aaaad217000)
  libsvml.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libsvml.so (0x00002aaaad713000)
  libirng.so => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libirng.so (0x00002aaaae5d1000)
  libintlc.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libintlc.so.5 (0x00002aaaae931000)
  libifport.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libifport.so.5 (0x00002aaaaeb9e000)
  libifcore.so.5 => /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/compiler/lib/intel64/libifcore.so.5 (0x00002aaaaedcd000)
  /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
18-MAY-2017: Continuing hdf5, netcdf
I made a new test, just for HDF5: test_hdf5.f90 (or in the robodoc html documentation: test_hdf5).
The test dumps the model using the serial writer, subroutine cgca_swci in module cgca_m2out, and the HDF5 writer, subroutine cgca_pswci4 in module cgca_m2hdf5.
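For context, this is the general shape of a collective HDF5 write from many MPI processes, as I understand the HDF5 Fortran API. It is a minimal sketch of my own, with made-up file and dataset names and a toy 1D decomposition - not the actual cgca_pswci4 code, which decomposes the 3D space coarray across images:

program hdf5_par_sketch
  use hdf5
  use mpi
  implicit none
  integer :: ierr, img, nimgs
  integer(hid_t) :: fapl, dxpl, file_id, filespace, memspace, dset
  integer(hsize_t) :: gdims(1), ldims(1), offset(1)
  integer, allocatable :: buf(:)

  call MPI_Init( ierr )
  call MPI_Comm_rank( MPI_COMM_WORLD, img, ierr )
  call MPI_Comm_size( MPI_COMM_WORLD, nimgs, ierr )

  ! each rank contributes 10 integers to one global 1D dataset
  ldims = 10 ; gdims = ldims * nimgs ; offset = ldims * img
  allocate( buf( ldims(1) ) ) ; buf = img

  call h5open_f( ierr )

  ! file access property list: MPI-IO driver on the world communicator
  call h5pcreate_f( H5P_FILE_ACCESS_F, fapl, ierr )
  call h5pset_fapl_mpio_f( fapl, MPI_COMM_WORLD, MPI_INFO_NULL, ierr )
  call h5fcreate_f( "sketch.h5", H5F_ACC_TRUNC_F, file_id, ierr, access_prp=fapl )

  ! dataset over the global extent; each rank selects its own hyperslab
  call h5screate_simple_f( 1, gdims, filespace, ierr )
  call h5dcreate_f( file_id, "space", H5T_NATIVE_INTEGER, filespace, dset, ierr )
  call h5sselect_hyperslab_f( filespace, H5S_SELECT_SET_F, offset, ldims, ierr )
  call h5screate_simple_f( 1, ldims, memspace, ierr )

  ! collective transfer of all slabs in one call
  call h5pcreate_f( H5P_DATASET_XFER_F, dxpl, ierr )
  call h5pset_dxpl_mpio_f( dxpl, H5FD_MPIO_COLLECTIVE_F, ierr )
  call h5dwrite_f( dset, H5T_NATIVE_INTEGER, buf, ldims, ierr, &
    file_space_id=filespace, mem_space_id=memspace, xfer_prp=dxpl )

  call h5pclose_f( dxpl, ierr ); call h5sclose_f( memspace, ierr )
  call h5sclose_f( filespace, ierr ); call h5dclose_f( dset, ierr )
  call h5pclose_f( fapl, ierr ); call h5fclose_f( file_id, ierr )
  call h5close_f( ierr )
  call MPI_Finalize( ierr )
end program hdf5_par_sketch

The two essential pieces are the MPI-IO file access property list, so that all ranks open one shared file, and the collective dataset transfer property list, so that the slab writes are combined into collective MPI-IO operations.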
Then following the diary entry from 13-NOV-2016 I can visualise the model.
Now back to building NetCDF. I tried before, but gave up. See the diary entries starting 9-MAR-2017. Let's try again.
Some guidance, though not very clear, is available from NetCDF:
http://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html#build_parallel
https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf-install/Configure.html
After some experimentation I configured like this:
CC=mpiicc CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include \
LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib \
./configure --prefix=/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install \
--enable-shared --enable-parallel-tests --enable-fortran --enable-netcdf-4
This resulted in:
# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:        4.4.1.1
Configured On:         Thu May 18 15:32:00 BST 2017
Host System:           x86_64-unknown-linux-gnu
Build Directory:       /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1
Install Prefix:        /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install

# Compiling Options
-----------------
C Compiler:            /cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpiicc
CFLAGS:
CPPFLAGS:              -I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include
LDFLAGS:               -L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib
AM_CFLAGS:
AM_CPPFLAGS:
AM_LDFLAGS:
Shared Library:        yes
Static Library:        yes
Extra libraries:       -lhdf5_hl -lhdf5 -ldl -lm -lcurl

# Features
--------
NetCDF-2 API:          yes
HDF4 Support:          no
NetCDF-4 API:          yes
NC-4 Parallel Support: yes
PNetCDF Support:       no
DAP Support:           yes
Diskless Support:      yes
MMap Support:          no
JNA Support:           no
Seems fine. Then make ran without problems. However, make check failed with:
make[2]: `tst_h_dimscales4' is up to date.
depbase=`echo tst_h_par.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
mpiicc -DHAVE_CONFIG_H -I. -I.. -I../include -I../oc2 -I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include -MT tst_h_par.o -MD -MP -MF $depbase.Tpo -c -o tst_h_par.o tst_h_par.c &&\
mv -f $depbase.Tpo $depbase.Po
tst_h_par.c(89): error: identifier "ERR" is undefined
  if ((fapl_id = H5Pcreate(H5P_FILE_ACCESS)) < 0) ERR;
                                                  ^
tst_h_par.c(226): error: identifier "SUMMARIZE_ERR" is undefined
  SUMMARIZE_ERR;
  ^
tst_h_par.c(231): error: identifier "FINAL_RESULTS" is undefined
  FINAL_RESULTS;
  ^
compilation aborted for tst_h_par.c (code 2)
make[2]: *** [tst_h_par.o] Error 2
make[2]: Leaving directory `/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1/h5_test'
Apparently this is caused by a bug with --enable-parallel-tests:
http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2017/msg00019.html
So let's configure without this option:
CC=mpiicc \
CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include \
LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/lib \
./configure \
--prefix=/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install \
--enable-shared --enable-fortran --enable-netcdf-4
And this seems to pass all tests: netcdf-c-4.4.1.1-make-check.log.
So after make install I get:
[mexas@newblue1 ~]$ nc-config --all

This netCDF 4.4.1.1 has been built with the following features:

  --cc          -> mpiicc
  --cflags      -> -I/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include -I/panfs/panasas01/mech/mexas/soft/hdf5-1.10.1-ifort16u2-install/include
  --libs        -> -L/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib -lnetcdf
  --has-c++     -> no
  --cxx         ->
  --has-c++4    -> no
  --cxx4        ->
  --has-fortran -> no
  --has-dap     -> yes
  --has-nc2     -> yes
  --has-nc4     -> yes
  --has-hdf5    -> yes
  --has-hdf4    -> no
  --has-logging -> no
  --has-pnetcdf -> no
  --has-szlib   ->
  --prefix      -> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install
  --includedir  -> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include
  --libdir      -> /panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib
  --version     -> netCDF 4.4.1.1
Now on to NetCDF Fortran. The instructions are:
http://www.unidata.ucar.edu/software/netcdf/docs/building_netcdf_fortran.html
I configure with:
CC=mpiicc FC=mpiifort \
CPPFLAGS=-I/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/include \
LDFLAGS=-L/panfs/panasas01/mech/mexas/soft/netcdf-c-4.4.1.1-ifort-install/lib \
./configure \
--prefix=/panfs/panasas01/mech/mexas/soft/netcdf-fortran-4.4.4-ifort-install
Then make completes with no errors, and make check seems to pass all tests: netcdf-fortran-4.4.4-make-check.log.
19-MAY-2017: NetCDF with Intel seems to work
I made a new test, just for NetCDF: test_netcdf.f90 (or in the robodoc html documentation: test_netcdf).
It works fine on a single node, with a speedup of about 77 times over the serial writer (17.5 s vs 0.23 s):
running on 16 images in a 3D grid
img: 1 nimgs: 16 (77,116,77)[2,2,4] 111 0.867 155. ( 0.995 , 1.50 , 1.99 )
Each image has 687764 cells
The model has 11004224 cells
Serial time: 17.5471229553223 s. Rate: 2.336219391274498E-003 GB/s.
NetCDF time: 0.227143049240112 s, rate: 0.180476263951036 GB/s.
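For context, this is the shape of a collective parallel write with the netcdf-fortran API, as I understand it - again a minimal sketch of my own, with made-up file and variable names and a toy 1D decomposition, not the actual test_netcdf code:

program netcdf_par_sketch
  use netcdf
  use mpi
  implicit none
  integer :: ierr, img, nimgs, ncid, dimid, varid
  integer :: buf(10)

  call MPI_Init( ierr )
  call MPI_Comm_rank( MPI_COMM_WORLD, img, ierr )
  call MPI_Comm_size( MPI_COMM_WORLD, nimgs, ierr )
  buf = img

  ! parallel create: needs netcdf-fortran built on parallel HDF5
  call chk( nf90_create( "sketch.nc", ior( nf90_netcdf4, nf90_mpiio ), ncid, &
            comm=MPI_COMM_WORLD, info=MPI_INFO_NULL ) )

  ! one global dimension; each rank owns a contiguous chunk of 10 values
  call chk( nf90_def_dim( ncid, "x", 10*nimgs, dimid ) )
  call chk( nf90_def_var( ncid, "space", nf90_int, (/ dimid /), varid ) )
  call chk( nf90_var_par_access( ncid, varid, nf90_collective ) )
  call chk( nf90_enddef( ncid ) )

  ! every rank writes its own slice in a collective call
  call chk( nf90_put_var( ncid, varid, buf, start=(/ 10*img+1 /), count=(/ 10 /) ) )
  call chk( nf90_close( ncid ) )
  call MPI_Finalize( ierr )

contains

  subroutine chk( stat )
    integer, intent(in) :: stat
    if ( stat .ne. nf90_noerr ) then
      write (*,*) trim( nf90_strerror( stat ) )
      error stop
    end if
  end subroutine chk

end program netcdf_par_sketch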
However, test_netcdf doesn't work across multiple nodes - I get some MPI errors. I'm investigating with the BlueCrystal people.