This folder contains examples of running ph5concat
, including a set of small
input files, run commands, concatenated output file, and metadata of input and
out files.
-
There are four input HDF5 files, each containing a small number of dataset extracted from neutrino simulation data for illustrative purpose.
- sample_r11981_s06.gz
- sample_r11981_s07.gz
- sample_r11981_s08.gz
- sample_r11981_s09.gz
-
These input files are compressed. Run command 'make' will uncompress them into HDF5 files. They can also be uncompressed manually by commands below.
% gzip -dc sample_r11981_s06.gz > sample_r11981_s06.h5 % gzip -dc sample_r11981_s07.gz > sample_r11981_s07.h5 % gzip -dc sample_r11981_s08.gz > sample_r11981_s08.h5 % gzip -dc sample_r11981_s09.gz > sample_r11981_s09.h5
-
Metadata of input files
- Metadata of individual files can be retrieved by running command 'h5ls -r' or 'h5dump -H'.
- The metadata of files 'sample_r11981_s06.h5' and 'sample_r11981_s07.h5' is shown below.
- In these examples, each file contains 4 groups at the root level, namely 'neutrino', 'rec.me.trkkalman', 'rec.training.cvnmaps', and 'spill'. The number of groups and their names must be identical among all input files to be concatenated.
- In an input file, the number of datasets in a group can be different from another group.
- Given a group, the number of datasets and their names it contains must be identical among all input files. However, the size of first dimension of datasets can be different.
-
Run commands
% mpiexec -n 2 ../ph5_concat -i sample_list.txt -o sample_output.h5 % mpiexec -n 4 ../ph5_concat -i sample_list.txt -o sample_output.h5 -k evt
When completed, the output shown on screen is available in sample_stdout.txt.
-
Concatenated output file
- The concatenated output file is provided in
sample_output.h5.gz
. - The metadata retrieved from 'h5dump' command is also available in
sample_output.metadata
.% gzip -dc sample_output.h5.gz > sample_output.h5 % h5dump -Hp sample_output.h5
- The concatenated output file is provided in
Below is an example timing output from a larger run on Cori using 128 MPI processes.
% srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt
Number of input HDF5 files: 128
Input directory name: /global/cscratch1/sd/wkliao/FS_1M_8
Output file name: /global/cscratch1/sd/wkliao/FS_1M_128/nd_out.h5
Output datasets are compressed with level 6
Read metadata from input files takes 1.2776 seconds
Create output file + datasets takes 25.7466 seconds
Concatenating 1D datasets takes 158.8101 seconds
Write partition key datasets takes 14.0372 seconds
Concatenating 2D datasets takes 114.4464 seconds
Close input files takes 0.0037 seconds
Close output files takes 0.4797 seconds
-------------------------------------------------------------
Input directory name: /scratch/FS_1M_8
Number of input HDF5 files: 128
Output HDF5 file name: /scratch1/FS_1M_128/nd_out.h5
Parallel I/O strategy: 2
Use POSIX I/O to open file: ON
POSIX In-memory I/O: ON
1-process-create-followed-by-all-open: OFF
Chunk caching for raw data: ON
GZIP level: 6
Internal I/O buffer size: 512.0 MiB
Dataset used to produce partition key: evt
Name of partition key datasets: evt.seq
-------------------------------------------------------------
Number of groups: 999
Number of non-zero-sized groups: 108
Number of groups have partition key: 108
Total number of datasets: 17971
Total number of non-zero datasets: 2795
-------------------------------------------------------------
Number of MPI processes: 128
Number calls to MPI_Allreduce: 3
Number calls to MPI_Exscan: 2
-------------------------------------------------------------
H5Dcreate: 25.6772
H5Dread for 1D datasets: 1.8583
H5Dwrite for 1D datasets: 170.2729
H5Dread for 2D datasets: 19.9304
H5Dwrite for 2D datasets: 93.9265
H5Dclose for input datasets: 0.0737
H5Dclose for output datasets: 0.0516
-------------------------------------------------------------
Read metadata from input files: 1.2782
Create output file + datasets: 25.7466
Concatenate small datasets: 158.8102
Write to partition key datasets: 14.0372
Concatenate large datasets: 114.4520
Close input files: 0.0124
Close output files: 0.4799
End-to-end: 314.8095