CGPACK > MAR-2016
24-MAR-2016: Trying to set up TAU
First want to install PAPI.
Using papi-5.4.3. Configured with
./configure MPICC=mpiicc
to make sure that the Intel MPI C wrapper is used.
Note the double "i" above: mpicc, with a single "i", is an MPI wrapper for the GCC C compiler.
newblue4> which mpiicc
/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpiicc
newblue4> mpiicc --version
icc (ICC) 16.0.2 20160204
Copyright (C) 1985-2016 Intel Corporation. All rights reserved.
newblue4> which mpicc
/cm/shared/languages/Intel-Compiler-XE-16-U2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/bin/mpicc
newblue4> mpicc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
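Since the two wrappers differ by a single letter, a small check on the version banner helps avoid picking the wrong one. A minimal sketch: wrapper_backend is a made-up helper name, and it only looks at the first word of the banner, which is an assumption based on the two banners in the session above.

```bash
#!/bin/bash
# Classify the compiler behind an MPI wrapper from the first line of its
# --version banner. wrapper_backend is a hypothetical helper, not part of
# Intel MPI; the patterns are guesses based on the banners seen above.
wrapper_backend() {
    case "$1" in
        icc*|ifort*)          echo intel ;;
        gcc*|gfortran*|GNU*)  echo gnu ;;
        *)                    echo unknown ;;
    esac
}

# Banners copied from the session above.
wrapper_backend "icc (ICC) 16.0.2 20160204"                   # prints: intel
wrapper_backend "gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)"  # prints: gnu
```

On the cluster this would be called as wrapper_backend "$(mpiicc --version | head -n 1)".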
Here's the PAPI config.log.
Then
make
Here's the PAPI make.log.
Then
make test
and got errors:
ctests/zero
0x8000003b PAPI_TOT_CYC is not available.
0x80000034 PAPI_FP_INS is not available.
0x8000003b PAPI_TOT_CYC is not available.
0x80000066 PAPI_FP_OPS is not available.
0x8000003b PAPI_TOT_CYC is not available.
0x80000032 PAPI_TOT_INS is not available.
test_utils.c FAILED
Line # 717
Error: Not enough room to add an event!
Here's the full make test log: papi-make-test.log.
No luck for now. So let's try using PAPI 5.3 provided via modules:
libraries/intel_builds/papi-5.3.0
The path to papi-5.3.0 is:
/cm/shared/libraries/intel_build/papi-5.3.0/bin
Let's see which events it supports:
$ papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 5.3.0.0
Vendor string and code   : GenuineIntel (1)
Model string and code    : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
CPU Revision             : 7.000000
CPUID Info               : Family: 6 Model: 45 Stepping: 7
CPU Max Megahertz        : 2599
CPU Min Megahertz        : 2599
Hdw Threads per core     : 1
Cores per Socket         : 8
Sockets                  : 2
NUMA Nodes               : 2
CPUs per Node            : 8
Total CPUs               : 16
Running in a VM          : no
Number Hardware Counters : 11
Max Multiplex Counters   : 32
--------------------------------------------------------------------------------
Name          Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
PAPI_L2_ICM   0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  No     No     Level 3 data cache misses
PAPI_L3_ICM   0x80000005  No     No     Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  Yes    Yes    Level 1 cache misses
PAPI_L2_TCM   0x80000007  Yes    No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  Yes    No     Level 3 cache misses
PAPI_CA_SNP   0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR   0x8000000a  No     No     Requests for exclusive access to shared cache line
PAPI_CA_CLN   0x8000000b  No     No     Requests for exclusive access to clean cache line
PAPI_CA_INV   0x8000000c  No     No     Requests for cache line invalidation
PAPI_CA_ITV   0x8000000d  No     No     Requests for cache line intervention
PAPI_L3_LDM   0x8000000e  No     No     Level 3 load misses
PAPI_L3_STM   0x8000000f  No     No     Level 3 store misses
PAPI_BRU_IDL  0x80000010  No     No     Cycles branch units are idle
PAPI_FXU_IDL  0x80000011  No     No     Cycles integer units are idle
PAPI_FPU_IDL  0x80000012  No     No     Cycles floating point units are idle
PAPI_LSU_IDL  0x80000013  No     No     Cycles load/store units are idle
PAPI_TLB_DM   0x80000014  Yes    Yes    Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  Yes    No     Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  No     No     Total translation lookaside buffer misses
PAPI_L1_LDM   0x80000017  Yes    No     Level 1 load misses
PAPI_L1_STM   0x80000018  Yes    No     Level 1 store misses
PAPI_L2_LDM   0x80000019  No     No     Level 2 load misses
PAPI_L2_STM   0x8000001a  Yes    No     Level 2 store misses
PAPI_BTAC_M   0x8000001b  No     No     Branch target address cache misses
PAPI_PRF_DM   0x8000001c  No     No     Data prefetch cache misses
PAPI_L3_DCH   0x8000001d  No     No     Level 3 data cache hits
PAPI_TLB_SD   0x8000001e  No     No     Translation lookaside buffer shootdowns
PAPI_CSR_FAL  0x8000001f  No     No     Failed store conditional instructions
PAPI_CSR_SUC  0x80000020  No     No     Successful store conditional instructions
PAPI_CSR_TOT  0x80000021  No     No     Total store conditional instructions
PAPI_MEM_SCY  0x80000022  No     No     Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY  0x80000023  No     No     Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY  0x80000024  No     No     Cycles Stalled Waiting for memory writes
PAPI_STL_ICY  0x80000025  Yes    No     Cycles with no instruction issue
PAPI_FUL_ICY  0x80000026  No     No     Cycles with maximum instruction issue
PAPI_STL_CCY  0x80000027  No     No     Cycles with no instructions completed
PAPI_FUL_CCY  0x80000028  No     No     Cycles with maximum instructions completed
PAPI_HW_INT   0x80000029  No     No     Hardware interrupts
PAPI_BR_UCN   0x8000002a  Yes    Yes    Unconditional branch instructions
PAPI_BR_CN    0x8000002b  Yes    No     Conditional branch instructions
PAPI_BR_TKN   0x8000002c  Yes    Yes    Conditional branch instructions taken
PAPI_BR_NTK   0x8000002d  Yes    No     Conditional branch instructions not taken
PAPI_BR_MSP   0x8000002e  Yes    No     Conditional branch instructions mispredicted
PAPI_BR_PRC   0x8000002f  Yes    Yes    Conditional branch instructions correctly predicted
PAPI_FMA_INS  0x80000030  No     No     FMA instructions completed
PAPI_TOT_IIS  0x80000031  No     No     Instructions issued
PAPI_TOT_INS  0x80000032  Yes    No     Instructions completed
PAPI_INT_INS  0x80000033  No     No     Integer instructions
PAPI_FP_INS   0x80000034  Yes    Yes    Floating point instructions
PAPI_LD_INS   0x80000035  Yes    No     Load instructions
PAPI_SR_INS   0x80000036  Yes    No     Store instructions
PAPI_BR_INS   0x80000037  Yes    No     Branch instructions
PAPI_VEC_INS  0x80000038  No     No     Vector/SIMD instructions (could include integer)
PAPI_RES_STL  0x80000039  No     No     Cycles stalled on any resource
PAPI_FP_STAL  0x8000003a  No     No     Cycles the FP unit(s) are stalled
PAPI_TOT_CYC  0x8000003b  Yes    No     Total cycles
PAPI_LST_INS  0x8000003c  No     No     Load/store instructions completed
PAPI_SYC_INS  0x8000003d  No     No     Synchronization instructions completed
PAPI_L1_DCH   0x8000003e  No     No     Level 1 data cache hits
PAPI_L2_DCH   0x8000003f  Yes    Yes    Level 2 data cache hits
PAPI_L1_DCA   0x80000040  No     No     Level 1 data cache accesses
PAPI_L2_DCA   0x80000041  Yes    No     Level 2 data cache accesses
PAPI_L3_DCA   0x80000042  Yes    Yes    Level 3 data cache accesses
PAPI_L1_DCR   0x80000043  No     No     Level 1 data cache reads
PAPI_L2_DCR   0x80000044  Yes    No     Level 2 data cache reads
PAPI_L3_DCR   0x80000045  Yes    No     Level 3 data cache reads
PAPI_L1_DCW   0x80000046  No     No     Level 1 data cache writes
PAPI_L2_DCW   0x80000047  Yes    No     Level 2 data cache writes
PAPI_L3_DCW   0x80000048  Yes    No     Level 3 data cache writes
PAPI_L1_ICH   0x80000049  No     No     Level 1 instruction cache hits
PAPI_L2_ICH   0x8000004a  Yes    No     Level 2 instruction cache hits
PAPI_L3_ICH   0x8000004b  No     No     Level 3 instruction cache hits
PAPI_L1_ICA   0x8000004c  No     No     Level 1 instruction cache accesses
PAPI_L2_ICA   0x8000004d  Yes    No     Level 2 instruction cache accesses
PAPI_L3_ICA   0x8000004e  Yes    No     Level 3 instruction cache accesses
PAPI_L1_ICR   0x8000004f  No     No     Level 1 instruction cache reads
PAPI_L2_ICR   0x80000050  Yes    No     Level 2 instruction cache reads
PAPI_L3_ICR   0x80000051  Yes    No     Level 3 instruction cache reads
PAPI_L1_ICW   0x80000052  No     No     Level 1 instruction cache writes
PAPI_L2_ICW   0x80000053  No     No     Level 2 instruction cache writes
PAPI_L3_ICW   0x80000054  No     No     Level 3 instruction cache writes
PAPI_L1_TCH   0x80000055  No     No     Level 1 total cache hits
PAPI_L2_TCH   0x80000056  No     No     Level 2 total cache hits
PAPI_L3_TCH   0x80000057  No     No     Level 3 total cache hits
PAPI_L1_TCA   0x80000058  No     No     Level 1 total cache accesses
PAPI_L2_TCA   0x80000059  Yes    Yes    Level 2 total cache accesses
PAPI_L3_TCA   0x8000005a  Yes    No     Level 3 total cache accesses
PAPI_L1_TCR   0x8000005b  No     No     Level 1 total cache reads
PAPI_L2_TCR   0x8000005c  Yes    Yes    Level 2 total cache reads
PAPI_L3_TCR   0x8000005d  Yes    Yes    Level 3 total cache reads
PAPI_L1_TCW   0x8000005e  No     No     Level 1 total cache writes
PAPI_L2_TCW   0x8000005f  Yes    No     Level 2 total cache writes
PAPI_L3_TCW   0x80000060  Yes    No     Level 3 total cache writes
PAPI_FML_INS  0x80000061  No     No     Floating point multiply instructions
PAPI_FAD_INS  0x80000062  No     No     Floating point add instructions
PAPI_FDV_INS  0x80000063  Yes    No     Floating point divide instructions
PAPI_FSQ_INS  0x80000064  No     No     Floating point square root instructions
PAPI_FNV_INS  0x80000065  No     No     Floating point inverse instructions
PAPI_FP_OPS   0x80000066  Yes    Yes    Floating point operations
PAPI_SP_OPS   0x80000067  Yes    Yes    Floating point operations; optimized to count scaled single precision vector operations
PAPI_DP_OPS   0x80000068  Yes    Yes    Floating point operations; optimized to count scaled double precision vector operations
PAPI_VEC_SP   0x80000069  Yes    Yes    Single precision vector/SIMD instructions
PAPI_VEC_DP   0x8000006a  Yes    Yes    Double precision vector/SIMD instructions
PAPI_REF_CYC  0x8000006b  Yes    No     Reference clock cycles
-------------------------------------------------------------------------
Of 108 possible events, 50 are available, of which 17 are derived.
avail.c PASSED
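The listing is long, so it helps to pull out only the events marked as available. A minimal sketch, assuming the column order Name, Code, Avail, Deriv, Description seen above; papi_available_events is a made-up helper name, not a PAPI tool.

```bash
#!/bin/bash
# Print the names of preset events whose Avail column is "Yes".
# Reads papi_avail-style rows on stdin; assumes the column order
# Name Code Avail Deriv Description. Hypothetical helper, not part of PAPI.
papi_available_events() {
    awk '$1 ~ /^PAPI_/ && $3 == "Yes" { print $1 }'
}

# A few rows copied from the listing above.
papi_available_events <<'EOF'
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_BR_NTK 0x8000002d Yes No Conditional branch instructions not taken
PAPI_FMA_INS 0x80000030 No No FMA instructions completed
PAPI_TOT_CYC 0x8000003b Yes No Total cycles
EOF
```

On the cluster the input would come from the real tool: papi_avail | papi_available_events.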
So now let's try to build TAU 2.25 with papi 5.3. TAU uses PDT. So let's build PDT 3.21. Seems to build fine.
Let's build TAU:
./configure -c++=icpc -cc=icc -fortran=intel -mpi -pdt=$HOME/pdtoolkit-3.21/ -papi=/cm/shared/libraries/intel_build/papi-5.3.0 -PROFILE -TRACE -slog2
make install
Seems fine. Let's validate TAU:
newblue4> cat parallel.sh
#!/bin/bash
mpirun -np 4 ./simple
newblue4> setenv TAU_VALIDATE_PARALLEL `pwd`/parallel.sh ; ./tau_validate -v --html --table table.heml --timeout 180 x86_64 > & results.html
Here are the TAU validation results. No errors.
A simple demo that TAU + PAPI is working: for a small coarray program, looking at the PAPI_BR_NTK counter (conditional branch instructions not taken):
25-MAR-2016: Checking TAU tracing
Using this TAU conf file:
TAU_COMM_MATRIX=1
TAU_METRICS=PAPI_BR_CN,PAPI_BR_TKN,PAPI_BR_NTK,PAPI_BR_MSP,PAPI_BR_PRC
TAU_TRACE=1
TAU_PROFILE=1
#TAU_CALLPATH=1
#TAU_CALLPATH_DEPTH=100
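Before running, it is worth checking that every event named in TAU_METRICS is one that PAPI reports as available on this node. A minimal sketch: missing_metrics is a made-up helper name, and the available list here is typed in by hand from the papi_avail output rather than queried.

```bash
#!/bin/bash
# Print every event in a comma-separated TAU_METRICS value that is absent
# from a whitespace-separated list of available PAPI events.
# missing_metrics is a hypothetical helper, not part of TAU or PAPI.
missing_metrics() {
    local metrics=$1 available m
    # Normalise the list to single spaces with a space at both ends.
    available=" $(printf '%s ' $2)"
    local IFS=,
    for m in $metrics; do
        case "$available" in
            *" $m "*) ;;           # event is available, nothing to report
            *) echo "$m" ;;        # event is not in the available list
        esac
    done
}

# Hand-copied subset of the events marked "Yes" by papi_avail (assumption:
# the list is kept in sync with the node by the user).
avail="PAPI_BR_CN PAPI_BR_TKN PAPI_BR_NTK PAPI_BR_MSP PAPI_BR_PRC PAPI_TOT_CYC"

# The TAU_METRICS value from the conf file: prints nothing.
missing_metrics "PAPI_BR_CN,PAPI_BR_TKN,PAPI_BR_NTK,PAPI_BR_MSP,PAPI_BR_PRC" "$avail"
# A value containing an unavailable event: prints PAPI_FMA_INS.
missing_metrics "PAPI_BR_NTK,PAPI_FMA_INS" "$avail"
```

This only checks availability by name. It does not check whether all the events fit on the hardware counters at the same time.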
Using this shell script to build an instrumented coarray executable. Note that TAU_OPTIONS selects compiler-based instrumentation (-optCompInst), not PDT:
#!/bin/sh
export TAU_OPTIONS="-optVerbose -optCompInst"
export TAU_MAKEFILE=$HOME/tau-2.25/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt-profile-trace
make clean
make all
Using this Makefile for shared coarray execution with Intel MPI:
TAU_MAKEFILE= $(HOME)/tau-2.25/x86_64/lib/Makefile.tau-icpc-papi-mpi-pdt-profile-trace
include $(TAU_MAKEFILE)

RM= /bin/rm

MOD_SRC= m.f90
MOD_OBJ= $(MOD_SRC:.f90=.o)
MOD_MOD= $(MOD_SRC:.f90=.mod)
MOD_CLEAN= $(MOD_OBJ) $(MOD_MOD)

PROG_SRC= simple.f90
PROG_OBJ= $(PROG_SRC:.f90=.o)
PROG_EXE= $(PROG_SRC:.f90=.x)
PROG_CLEAN= $(PROG_OBJ) $(PROG_EXE)

ALL_CLEAN= $(MOD_CLEAN) $(PROG_CLEAN)

.SUFFIXES:
.SUFFIXES: .f90 .o .mod .x

# Comment to disable TAU
USE_TAU=1

F90= tau_f90.sh # $(TAU_F90)
FFLAGS= -coarray=shared -warn $(TAU_INCLUDE) $(TAU_MPI_INCLUDE) $(TAU_F90_SUFFIX)
LINKER= $(TAU_F90)
#LINKER= $(TAU_LINKER)
LDFLAGS= -coarray=shared $(USER_OPT) $(TAU_LDFLAGS)
LIBS= $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
PDTF90PARSE= $(PDTDIR)/$(PDTARCHDIR)/bin/f95parse
TAUINSTR= $(TAUROOTDIR)/$(CONFIG_ARCH)/bin/tau_instrumentor
CFLAGS= $(TAU_INCLUDE) $(TAU_DEFS) $(TAU_MPI_INCLUDE)

all: $(MOD_MOD) $(PROG_EXE)

.f90.o:
	$(F90) $(FFLAGS) -c $<

.f90.mod:
	$(F90) $(FFLAGS) -c $<

$(PROG_EXE): $(MOD_OBJ) $(PROG_OBJ)
	$(LINKER) $(LDFLAGS) $(MOD_OBJ) $(PROG_OBJ) -o $@ $(LIBS)

clean:
	$(RM) $(ALL_CLEAN)
Run the instrumented program, as a normal Intel shared memory coarray program:
./simple.x
This generates events and traces:
newblue2> ls events* tautrace*
events.0.edf   events.1.edf  events.8.edf         tautrace.13.0.0.trc  tautrace.6.0.0.trc
events.10.edf  events.2.edf  events.9.edf         tautrace.14.0.0.trc  tautrace.7.0.0.trc
events.11.edf  events.3.edf  tautrace.0.0.0.trc   tautrace.15.0.0.trc  tautrace.8.0.0.trc
events.12.edf  events.4.edf  tautrace.10.0.0.trc  tautrace.2.0.0.trc   tautrace.9.0.0.trc
events.13.edf  events.5.edf  tautrace.1.0.0.trc   tautrace.3.0.0.trc
events.14.edf  events.6.edf  tautrace.11.0.0.trc  tautrace.4.0.0.trc
events.15.edf  events.7.edf  tautrace.12.0.0.trc  tautrace.5.0.0.trc
newblue2>
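A quick sanity check before merging is that no image failed to write its files. A minimal sketch, assuming each image N writes exactly one events.N.edf and one tautrace.N.0.0.trc, as the listing above suggests; check_trace_files is a made-up helper name.

```bash
#!/bin/bash
# Check that a run directory holds one events.N.edf and one
# tautrace.N.0.0.trc for every image N in 0..nimages-1.
# check_trace_files is a hypothetical helper, not part of TAU.
# Returns 0 if all files exist, 1 otherwise, and names each missing file.
check_trace_files() {
    local dir=$1 nimages=$2 n f status=0
    for ((n = 0; n < nimages; n++)); do
        for f in "$dir/events.$n.edf" "$dir/tautrace.$n.0.0.trc"; do
            if [ ! -e "$f" ]; then
                echo "missing: $f"
                status=1
            fi
        done
    done
    return $status
}
```

For the 16-image run above it would be called in the run directory as check_trace_files . 16.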
Need to prepare tracing output for Jumpshot-4. Using these instructions from the TAU User Guide, Sec. 4.3.
newblue2> tau_treemerge.pl
/panfs/panasas01/mech/mexas/tau-2.25/x86_64/bin/tau_merge -m tau.edf -e events.0.edf events.1.edf events.10.edf events.11.edf events.12.edf events.13.edf events.14.edf events.15.edf events.2.edf events.3.edf events.4.edf events.5.edf events.6.edf events.7.edf events.8.edf events.9.edf tautrace.0.0.0.trc tautrace.1.0.0.trc tautrace.10.0.0.trc tautrace.11.0.0.trc tautrace.12.0.0.trc tautrace.13.0.0.trc tautrace.14.0.0.trc tautrace.15.0.0.trc tautrace.2.0.0.trc tautrace.3.0.0.trc tautrace.4.0.0.trc tautrace.5.0.0.trc tautrace.6.0.0.trc tautrace.7.0.0.trc tautrace.8.0.0.trc tautrace.9.0.0.trc tau.trc
tau.trc exists; override [y]? y
tautrace.0.0.0.trc: 24 records read.
tautrace.1.0.0.trc: 975574 records read.
tautrace.10.0.0.trc: 975124 records read.
tautrace.11.0.0.trc: 975124 records read.
tautrace.12.0.0.trc: 975124 records read.
tautrace.13.0.0.trc: 975124 records read.
tautrace.14.0.0.trc: 975124 records read.
tautrace.15.0.0.trc: 975392 records read.
tautrace.2.0.0.trc: 975214 records read.
tautrace.3.0.0.trc: 975124 records read.
tautrace.4.0.0.trc: 975124 records read.
tautrace.5.0.0.trc: 975124 records read.
tautrace.6.0.0.trc: 975214 records read.
tautrace.7.0.0.trc: 975124 records read.
tautrace.8.0.0.trc: 975124 records read.
tautrace.9.0.0.trc: 975124 records read.
newblue2>
newblue2> tau2slog2 tau.trc tau.edf -o tau.slog2
14627782 records initialized. Processing.
1521 enters: 0 exits: 0
292555 Records read. 1% converted
585110 Records read. 3% converted
877665 Records read. 5% converted
1170220 Records read. 7% converted
1462775 Records read. 9% converted
1755330 Records read. 11% converted
2047885 Records read. 13% converted
2340440 Records read. 15% converted
2632995 Records read. 17% converted
2925550 Records read. 19% converted
3218105 Records read. 21% converted
3510660 Records read. 23% converted
3803215 Records read. 25% converted
4095770 Records read. 27% converted
4388325 Records read. 29% converted
4680880 Records read. 31% converted
4973435 Records read. 33% converted
5265990 Records read. 35% converted
5558545 Records read. 37% converted
5851100 Records read. 39% converted
6143655 Records read. 41% converted
6436210 Records read. 43% converted
6728765 Records read. 45% converted
7021320 Records read. 47% converted
7313875 Records read. 49% converted
7606430 Records read. 51% converted
7898985 Records read. 53% converted
8191540 Records read. 55% converted
8484095 Records read. 57% converted
8776650 Records read. 59% converted
9069205 Records read. 61% converted
9361760 Records read. 63% converted
9654315 Records read. 65% converted
9946870 Records read. 67% converted
10239425 Records read. 69% converted
10531980 Records read. 71% converted
10824535 Records read. 73% converted
11117090 Records read. 75% converted
11409645 Records read. 77% converted
11702200 Records read. 79% converted
11994755 Records read. 81% converted
12287310 Records read. 83% converted
12579865 Records read. 85% converted
12872420 Records read. 87% converted
13164975 Records read. 89% converted
13457530 Records read. 91% converted
13750085 Records read. 93% converted
1521 enters: 0 exits: 0
14042640 Records read. 95% converted
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
14335195 Records read. 97% converted
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
1521 enters: 0 exits: 0
14627750 Records read. 99% converted
1521 enters: 0 exits: 0
Reached end of trace file.
SLOG-2 Header:
version = SLOG 2.0.6
NumOfChildrenPerNode = 2
TreeLeafByteSize = 65536
MaxTreeDepth = 10
MaxBufferByteSize = 609542
Categories is FBinfo(2111 @ 50851451)
MethodDefs is FBinfo(0 @ 0)
LineIDMaps is FBinfo(164 @ 50853562)
TreeRoot is FBinfo(380932 @ 50470519)
TreeDir is FBinfo(42368 @ 50853726)
Annotations is FBinfo(0 @ 0)
Postamble is FBinfo(0 @ 0)
1521 enters: 0 exits: 0
Number of Drawables = 1462957
timeElapsed between 1 & 2 = 36 msec
timeElapsed between 2 & 3 = 8337 msec
newblue2>
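A cross-check on the merge: the per-file record counts printed by tau_merge should add up to the "records initialized" figure reported by tau2slog2. A minimal sketch: sum_merged_records is a made-up helper name, and it assumes the "FILE: COUNT records read." line format seen above.

```bash
#!/bin/bash
# Sum the per-file counts from tau_merge lines of the form
#   tautrace.N.0.0.trc: COUNT records read.
# sum_merged_records is a hypothetical helper, not part of TAU.
# Lines with capital "Records" (tau2slog2 progress) are deliberately ignored.
sum_merged_records() {
    awk '/records read\./ { total += $2 } END { print total + 0 }'
}

# The 16 per-file lines copied from the merge output above.
sum_merged_records <<'EOF'
tautrace.0.0.0.trc: 24 records read.
tautrace.1.0.0.trc: 975574 records read.
tautrace.10.0.0.trc: 975124 records read.
tautrace.11.0.0.trc: 975124 records read.
tautrace.12.0.0.trc: 975124 records read.
tautrace.13.0.0.trc: 975124 records read.
tautrace.14.0.0.trc: 975124 records read.
tautrace.15.0.0.trc: 975392 records read.
tautrace.2.0.0.trc: 975214 records read.
tautrace.3.0.0.trc: 975124 records read.
tautrace.4.0.0.trc: 975124 records read.
tautrace.5.0.0.trc: 975124 records read.
tautrace.6.0.0.trc: 975214 records read.
tautrace.7.0.0.trc: 975124 records read.
tautrace.8.0.0.trc: 975124 records read.
tautrace.9.0.0.trc: 975124 records read.
EOF
# prints: 14627782
```

The sum, 14627782, matches the "14627782 records initialized" line, so no records were dropped in the merge.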
Then launch jumpshot as
jumpshot tau.slog2
Jumpshot seems to hang a lot; perhaps my Java or graphics card is too old. Also cannot see any comms shown, and cannot figure out how to save images. So just an xwd screen dump for now: