
Parallel Benchmarking

{{toc}}

Notes from 7/26/2011 telecon (Entered by Ruth 8/3/2011)

Mark, Ruth, and Albert attended.
Ruth circulated a 2-page drawing that outlined a use case in advance.
Page 1
Page 2

Based on that drawing, we discussed how Silo would handle it.

This is a transcription of Ruth’s notes… additions/corrections welcome.

DBPutQuadmesh would be used to put the coordinate arrays / xy values.
DBPutQuadvar would be used to put the temp and pressure variables.

Raw data and metadata (name, units, etc.). This is Silo metadata.
A small # of calls would be made to HDF5 – an uberstruct w/ all the metadata would be written.
This would be Dataset create, write, close.
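
For illustration, here is a hedged sketch of how the use case might map onto these Silo calls (using the single-component form, DBPutQuadvar1, for each field). The names ("quadmesh", "temperature", "pressure") and the 6×5-node / 5×4-zone dimensions are assumptions taken from the use-case drawing, and dbfile is assumed to be an already-open DBfile*:

  #include <silo.h>

  /* Sketch only: dbfile assumed open; names and dims assumed from drawing. */
  double x[6], y[5];                      /* nodal coordinate arrays         */
  double temp[20], pres[20];              /* 5x4 = 20 zone-centered doubles  */
  int    ndims    = 2;
  int    dims[2]  = {6, 5};               /* node counts per dimension       */
  int    zdims[2] = {5, 4};               /* zone counts per dimension       */
  double *coords[2] = {x, y};
  char   *coordnames[2] = {"x", "y"};

  DBPutQuadmesh(dbfile, "quadmesh", coordnames, coords, dims, ndims,
                DB_DOUBLE, DB_COLLINEAR, NULL);
  DBPutQuadvar1(dbfile, "temperature", "quadmesh", temp, zdims, ndims,
                NULL, 0, DB_DOUBLE, DB_ZONECENT, NULL);
  DBPutQuadvar1(dbfile, "pressure", "quadmesh", pres, zdims, ndims,
                NULL, 0, DB_DOUBLE, DB_ZONECENT, NULL);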

From the application perspective, they care about mesh and variable data at a minimum, but don’t really think about the metadata. The metadata is not counted in terms of I/O request size. The application “thinks” in terms of # of nodes or zones of field (variable). For example, in the use case drawing for D0 there are 5×4 = 20 zones of 64 bit doubles for the temperature field = 160 bytes.

The metric at the HDF5 level is multi-dimensional array.

At the filesystem level it’s # of bytes.

Looking at the code snippet on page 2-172 of Silo UG:

  PMPIO_baton_t *bat = PMPIO_Init(...);
  dbFile = (DBfile *) PMPIO_WaitForBaton(bat, ...);
  /* local work (e.g. DBPutXXX() calls) for this processor */
  .
  .
  .
  PMPIO_HandOffBaton(bat, ...);
  PMPIO_Finish(bat);
  • This snippet is only about I/O.
  • The computation is done for all the domains on a given processor before any I/O is done.
  • Processor memory is sufficient to hold all the data for all the domains on the processor until it’s time to dump.
  • Typically thousands of domains written per file.
  • HDF5 groups are used to organize the domains and the data written for them in the file. For example, in the use case, D0 would be a group, with quad, temperature, and pressure under that group (a sketch follows this list).
  • One file only contains data for a single timestep in the computation. Data for different timesteps goes to files in different directories.
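
For illustration, a hedged sketch (not from the Silo UG) of what the per-domain organization inside the baton-held region might look like. The helper, its arguments, and the domain-numbering scheme are assumptions; DBMkDir/DBSetDir are the Silo calls whose directories become HDF5 groups when the HDF5 driver is used:

  #include <stdio.h>
  #include <silo.h>

  /* Write every domain assigned to this processor into its own directory  */
  /* (an HDF5 group), e.g. "D0" holding quad, temperature, pressure.       */
  static void write_my_domains(DBfile *dbFile, int firstDomain, int nDomains)
  {
      for (int d = 0; d < nDomains; d++)
      {
          char dirname[32];
          snprintf(dirname, sizeof dirname, "D%d", firstDomain + d);
          DBMkDir(dbFile, dirname);
          DBSetDir(dbFile, dirname);
          /* DBPutQuadmesh()/DBPutQuadvar1() calls for this domain go here. */
          DBSetDir(dbFile, "/");
      }
  }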

In general, HPC codes don’t do partial I/O — they write or read all the data in a dataset at once. This is changing somewhat with wavelet compression such as what FastBit does, and in some cases VisIt does partial reads if it helps improve visualization speed. But, in general, that’s the case.

The mock-up drawing does not accurately reflect the way applications and Silo work. The domains would be assigned “linearly” to processors. So, where the drawing shows D0, D2 to P0; D1, D3 to P1; D4 to P2, in fact it would be D0, D1 to P0; D2, D3 to P1; D4 to P2. And, in the timeline, it shows compute and I/O interspersed. In fact, all the compute would occur first, then all the I/O.

Currently there is no assignment of domains or division of processors into GROUPS for writing files based on I/O capabilities of the processors. And, currently no overlap of compute and I/O (as shown in the timing diagrams).

Parallel Benchmark Architecture (Initially discussed on 9/14/11 telecon)

  • Ruth put together some notes on this: page1, page2, page4
  • Input args fall into a few categories
    • Over-arching test driver args: filename, paths to particular filesystems, name (moniker) of test, etc.
    • I/O interface agnostic: size of request, number of requests, noise, parallel mode, etc.
    • I/O interface specific: options to HDF5 or Silo
    • Performance results reporting

A finer decomposition of pieces

  • Interface-Independent I/O Pattern Emitter (driver)
    • Generate an I/O pattern according to some statistical model.
    • Repeat an I/O pattern (e.g. a trace) captured from some other source.
  • Interface-Independent to Interface-Specific I/O request translator
    • Note: see the remarks below regarding the notion of equivalence in I/O requests across interfaces
    • Support multiple different, real I/O interfaces (sec2, stdio, hdf5, aio, etc.) as plugins (a strawman sketch follows this list)
  • Performance gathering and reporting
    • Gather information about the test context just prior to a run, such as OS type, filesystem type, etc.
    • Back-end database to gather data from run to run, and from different platforms, etc.
    • Maybe a buildbot-like server where data is gathered into a database
    • Timing mechanism
      • Mark M. mentions this can be tricky; it varies across platforms, OSes, the granularity of timing provided, etc. It might also benefit from some sort of plugin approach.
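
As a strawman for the plugin idea in the list above, here is a purely hypothetical interface the pattern emitter could program against. None of these names exist in any library; they only illustrate the shape such a plugin might take, including a pluggable timer:

  #include <stddef.h>

  /* Hypothetical plugin vtable; sec2, stdio, hdf5, aio, ... each supply one. */
  typedef struct ioiface_t
  {
      const char *name;                                  /* e.g. "sec2"      */
      int    (*open)  (const char *path, void **handle);
      int    (*write) (void *handle, const void *buf, size_t nbytes);
      int    (*read)  (void *handle, void *buf, size_t nbytes);
      int    (*close) (void *handle);
      double (*now)   (void);                            /* pluggable timer  */
  } ioiface_t;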

Equivalence in I/O requests across I/O library interfaces

Be sure to see some discussion of this topic in the forum

A typical issue we face in winning adoption for higher-level I/O libraries like HDF5 and Silo is I/O performance. We always hear application developers complain about how much it is costing them to write data with HDF5 vs. using a lower-level interface like section 2 (sec2), stdio, or MPI-IO.

  • Without a doubt the comparison is unfair. Why? Because you can do an awful lot more with an HDF5 file than you can with a file of raw bytes. Nonetheless, it’s very difficult to argue this point with a newbie application developer considering a higher-level library for the first time, due to a lack of appreciation of this subtle “quality of the data that results” issue. So, IMHO, it’s important to be able to speak to the performance issue directly and accurately and, more importantly, to be aware of and correct performance issues that arise in higher-level libraries BEFORE we move on to the quality-of-the-data-that-results argument.
  • I honestly believe that in any reasonable I/O context it should be possible to amortize the cost of using a higher-level library by ensuring that library is being used as close to optimally as possible, which typically means using as large an I/O request size as possible.
  • A pet peeve I have with many comparisons that have been done between various I/O libraries and previously developed benchmark tools is that they do not compare apples to apples. Sometimes this can be as simple as an oversight in how measurements were taken. On the other hand, it can be downright intentional, in an attempt to skew results in favor of one I/O library over another. That is usually the case when I look at published results somewhere. So, IMHO, we need to take great care in drawing appropriate comparisons.
  • In ioperf, the equivalent operation being timed in HDF5 was actually a sequence of H5Dcreate, H5Dwrite, H5Dclose calls (as sketched below). Why? Because you cannot write an HDF5 dataset unless you’ve created it, and you are not done writing it unless you’ve closed it. The only way to take the H5Dcreate and H5Dclose calls out of the picture is to issue all the writes ioperf generates to the same HDF5 dataset. But that seems like an awfully funny way to use HDF5. Each written thing is basically a separate object, and putting them all in the same HDF5 dataset means you are losing that semantic part of the exercise.
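
For concreteness, a hedged sketch of the kind of sequence described in the last bullet (not ioperf’s actual code); file_id, dsetName, buf, and nDoubles are assumed to exist in the surrounding benchmark:

  #include <hdf5.h>

  /* The "one write" being timed: create the dataset, write it, close it.   */
  hsize_t dims[1]  = {nDoubles};
  hid_t   space_id = H5Screate_simple(1, dims, NULL);
  hid_t   dset_id  = H5Dcreate2(file_id, dsetName, H5T_NATIVE_DOUBLE, space_id,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
  H5Dclose(dset_id);
  H5Sclose(space_id);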

This notion of equivalence in I/O requests across interfaces is something that requires further dialog. At a minimum, we need to be thinking of it in terms of what an application code needs to do with its data, irrespective of the idiosyncrasies with which that might be achieved through various I/O interfaces, even if said interfaces are used optimally.

An application has a bunch of data it would like to store to persistent storage. Generally, that data is scattered all over in pieces in various places in memory. Some of those pieces represent different parts of some larger single semantic data object and others of those pieces may be either a single data object unto themselves or even a whole set of smaller data objects. Ultimately, all that data is intended to wind up in one or more files. The application really has only a few choices…

  1. Gather all the pieces for one or more data object(s) into one place and pass that aggregated whole object on to a library below.
    • When the data objects are small (think a single integer), it is common to gather many together to amortize the cost of I/O.
  2. Build some kind of map-like thing that indicates where all the pieces of one or more data objects are and pass that thing on to a library below.
    • Either that library will issue separate writes for each piece (probably not a good idea) or it will gather things together, just as in the first case, on the caller’s behalf to issue larger writes (a raw-POSIX illustration follows this list).
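
A hedged raw-POSIX illustration of the two choices; pieceA/pieceB, their lengths, the scratch buffer big, and the descriptor fd are assumed. A library sitting below the application faces the same gather-or-map decision:

  #include <string.h>
  #include <sys/uio.h>
  #include <unistd.h>

  /* Choice 1: gather the pieces into one buffer, then one large write.     */
  memcpy(big,        pieceA, lenA);
  memcpy(big + lenA, pieceB, lenB);
  write(fd, big, lenA + lenB);

  /* Choice 2: hand the layer below a map of where the pieces live; it may  */
  /* write them separately or gather them on the caller's behalf (writev).  */
  struct iovec map[2] = { { pieceA, lenA }, { pieceB, lenB } };
  writev(fd, map, 2);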

Possibility to play back a trace captured from another code

  • Various I/O libraries have the ability to turn on a tracing feature and/or we can re-engineer interfaces to gather tracking information.
    • We have used CPP before to map calls to an interface through an intermediary tracing routine before passing control on to the real routine, and this has been pretty successful, though I don’t know how much the extra function call for the intermediary can skew results (a sketch appears after this list). Given we are talking about I/O, which is so slow compared to the CPU anyway, I am not sure the extra function call is even relevant.
  • We can use tracing feature of a library to collect what a real application code is doing during I/O
  • It is most useful if the I/O trace is interface agnostic, meaning that it captures the I/O intent of the application without regard for the particular I/O interface chosen to do the I/O.
    • In the case of HDF5, there may be ways to reverse engineer a trace containing a sequence of H5Dopen-H5Dwrite-H5Dclose calls such as Silo produces. But, in general, this may be difficult.
    • Might need to take care that the collection of this information, which often also must be written to files during the process, does not somehow skew the test. Buffering tracing data in the application until I/O is complete is probably best, if it’s practical. Might need the ability to adapt what gets collected during tracing to affect buffer size management.
  • Richard has had successes using strace for this purpose in the past.
    • Knowing the application (the IOR benchmark, for example) has made it possible to reverse engineer the strace output and map calls in the strace stream to operations from IOR.
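
A minimal sketch of the CPP interposition trick mentioned in the list above, wrapping POSIX write(); the trace_log() helper and its buffering behavior are hypothetical:

  #include <sys/types.h>
  #include <unistd.h>

  void trace_log(const char *op, int fd, size_t nbytes);   /* hypothetical  */

  static ssize_t traced_write(int fd, const void *buf, size_t nbytes)
  {
      trace_log("write", fd, nbytes);    /* record intent, ideally buffered */
      return (write)(fd, buf, nbytes);   /* parentheses defeat the macro    */
  }

  /* CPP maps calls to the interface through the tracing intermediary.      */
  #define write(fd, buf, nbytes)  traced_write((fd), (buf), (nbytes))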

More on notion of I/O request equivalence across interfaces

We have identified a need to develop a notion of equivalence of (a set of) I/O operation(s) across a variety of I/O interfaces.

I believe it is best to ask this question from the context of the application needing to write/read data without regard for qualitative differences in how that data winds up being stored/handled by any given I/O interface. For example, when data is written to HDF5, it’s possible to give the data a name, associate a datatype with it, convert from one datatype to another, checksum it, compress it, etc. The data is stored such that it can subsequently be randomly accessed. These are all useful features in HDF5.

But, given the basic action of an application writing/reading data to/from persistent storage, I claim all of these useful features represent something that is qualitatively different from raw I/O performance. Therefore, when developing performance and benchmarking metrics, we have two problems. One is quantifying the overhead that higher-level interfaces impose on raw I/O performance. The other is developing a way to equate I/O operations across interfaces operating at very different levels of abstraction.

There is no doubt that such qualitative differences are very important. And, if all of these other features had no impact on performance, we wouldn’t really even need to be talking about them. But, I claim we need a way to factor these issues out of raw I/O performance measurements and comparisons so that we can normalize measurements across interfaces where the lowest common denominator is something like stdio or sec2 which in and of themselves support none of these features. In so doing, we in fact wind up with a good idea of the cost that applications pay in using a library like HDF5 as well as how to optimize use of such a library to minimize that cost.

But, we also have to be careful. We can envision I/O libraries like Silo which operate on meshes and fields (a level of abstraction above HDF5’s structs and arrays) and which include ever more sophisticated operations on the data such that the operations themselves have a profound impact on the basic action of moving data from memory to persistent storage. For example, if we think of some really advanced scientific database that maybe includes very high level operations to detect vortices in fluid flow or high gradients in fields defined on a mesh and then only takes a snapshot of the data when the conditions are right, such operations will have such a profound impact on I/O performance that it does not make sense to exclude them when measuring I/O performance. In this context, the application’s need isn’t so much to store data to storage as it is to store snapshots of the data around the time(s) of important events in the evolution of the simulation.

So, in general, there is a spectrum; at one end is simply raw I/O and no operations on the data. At the other end is highly sophisticated database-like processing that can change entirely the nature of the data being stored. Then, there are in-between operations that maintain the same data semantics but represent it in perhaps different ways. Stdio and sec2 are examples of the extreme raw-I/O end of the spectrum. HDF5 and Silo are examples of the in-between type of library. ITAPS, together with some specialized service software for feature detection, is an example of the other extreme end of the spectrum.

Note, for restart dumps, there is an implied requirement that the complete internal memory state of the simulation can be reconstructed from whatever is stored to persistent storage. The purpose of a restart dump is to store the state of the application so that the simulation can be restarted from that point forward. For plot dumps, there is no such implied requirement, and so it’s conceivable that there can be many operations applied to the data that may change its characteristics dramatically from what is actually stored in the application’s memory.

The notion of application data objects

Conceptually, we can think of all of the data the application wants to store as a collection of one or more data objects.

A data object is a whole, coherent entity of data that is treated, semantically, as an independent, single thing. For example, in a simulation of airflow over a wing, one of the data objects may be the velocity vector field of the air. There are many, many ways an application could choose to store this data object in memory, to suit the needs of the implementation of the numerical models the simulation uses. Below, we characterize examples by way of code showing how the memory for the data is allocated…

  • Single, composite, 3D array

    double *vel = (double *) malloc(3*Nx*Ny*Nz*sizeof(double));
  • 3 component, 3D arrays

    double *vx = (double *) malloc(Nx*Ny*Nz*sizeof(double));
    double *vy = (double *) malloc(Nx*Ny*Nz*sizeof(double));
    double *vz = (double *) malloc(Nx*Ny*Nz*sizeof(double));
  • Array of arrays

    double ***vel = (double ***) malloc(Nz*sizeof(double **));
    for (int i = 0; i < Nz; i++)
    {
        vel[i] = (double **) malloc(Ny*sizeof(double *));
        for (int j = 0; j < Ny; j++)
        {
            vel[i][j] = (double *) malloc(3*Nx*sizeof(double));
        }
    }

Ensuring benchmark writes and reads are verifiable

It’s sometimes convenient, and I myself have written simple I/O test code this way, to write data to disk in such a way that it cannot later be verified that the data in the file is actually what the writer handed off. For example, it’s common to allocate a buffer of bytes to write but not set those bytes to some pre-defined values. That is because we’re often thinking only about timing how long it takes for a given I/O library to push the bytes to disk. However, the more I think about this, the more I think it is probably prudent to design, and indeed require, a useful I/O benchmark in such a way that the data it uses in testing is verifiable. That is, it should be possible to independently check that the data from the application was properly written to the file in such a way that it can later be read back. I think this represents the absolute minimum requirement of any I/O interface that might be a candidate for inclusion in a benchmarking study. I mean, would we really want to include a library for which it is not possible to ensure this?

In practical terms, this means something like Silo’s ioperf needs to do a tiny bit more work constructing the data buffers it writes. In addition, I think there is value in designing things in such a way that, given any bucket of data in the file, based on its contents we can identify which processor and/or which number in the sequence of I/O requests originated it (a sketch follows). Perhaps there is more we might like to be able to deduce from the contents of a given bucket of data in the file, but those are at least two useful features.
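
One hedged way to do that extra bit of work: stamp every 64-bit word with the writing rank, the request sequence number, and the word’s index, so any bucket of data pulled back out of the file identifies its origin. The packing layout below is just an illustration:

  #include <stdint.h>
  #include <stddef.h>

  /* High 16 bits = rank, next 16 = request number, low 32 = word index.    */
  static void fill_verifiable(uint64_t *buf, size_t nwords, int rank, int req)
  {
      for (size_t i = 0; i < nwords; i++)
          buf[i] = ((uint64_t)(uint16_t)rank << 48) |
                   ((uint64_t)(uint16_t)req  << 32) |
                   (uint64_t)(uint32_t)i;
  }

  /* A reader can recompute the expected word for any (rank, req, i) and    */
  /* compare, verifying both the data and which processor/request wrote it. */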

I am also thinking it makes sense to include in the benchmark a few different fundamental data types, such as character data, integer data, and floating point data. We would then define a useful benchmark to be one that handles all these types of data; not so much that they have to be handled portably across different machines, but that they have to be handled in whatever way that means for the underlying I/O interface being tested.


Focused Benchmarking and Auto-Tuning Activities (07Dec11 Telecon)

Attendees: Ruth, Quincey, Mark H., Mark M., Prabhat

One stop shopping for I/O relevant tuneables

  • The HDF5 library has various parameters that affect I/O performance
  • The MPI-IO library has various parameters that affect I/O performance
  • GPFS, Panasas, and Lustre have various parameters that affect I/O performance
  • C/C++ and Fortran I/O routines and the associated system call interfaces have various parameters that affect I/O performance
  • HPC platform parameters (e.g. things like BG/P’s virtual node vs. SMP modes)
  • Others we are forgetting?

The goal here is to outline development of a common interface to control all these parameters: one-stop shopping for I/O tuneables. This could be a useful capability apart from any specific product such as HDF5, so we might want to consider software engineering issues to make it packageable as such. This is an essential piece of many of the other benchmarking and auto-tuning activities we’d like to consider. Without the ability to vary/control parameters in a common way, it’s difficult to develop software to do the other things we want. We need to be aware of situations where there are no well-defined interfaces to specific parts of the system with which to control parameters (e.g. the only way to affect parameter X is via some environment variable). A sketch of what such an interface might look like appears below.
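
A hedged sketch of what the caller’s side of such a common interface could look like. The io_settings_t struct and apply_settings() are invented for illustration only; the calls they forward to (H5Pset_alignment, MPI_Info_set) are real knobs from two different layers of the stack:

  #include <hdf5.h>
  #include <mpi.h>

  /* Hypothetical one-stop settings object spanning several layers.         */
  typedef struct io_settings_t
  {
      hsize_t hdf5_alignment;           /* HDF5 file-space alignment        */
      char    cb_buffer_size[32];       /* MPI-IO collective buffering hint */
  } io_settings_t;

  /* Push each setting to the layer that actually owns it.                  */
  static void apply_settings(const io_settings_t *s, hid_t fapl, MPI_Info info)
  {
      H5Pset_alignment(fapl, 1, s->hdf5_alignment);
      MPI_Info_set(info, "cb_buffer_size", (char *)s->cb_buffer_size);
  }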

We’re considering the notion of an HPC-specific (high-level) API in HDF5 for this purpose1?. There are still problems with environment variables, since they are out of band.

Include ability to read parameter sets from human readable/editable settings files

Having the ability to set parameters via a common interface is good. Being able to vary them for different runs of a benchmark or application is also useful. But to do that, some part of the application has to take responsibility for implementing the ability to accept user-specified settings. The current proposed solution is to drive the interface defined above from the contents of human readable/editable text files, probably XML. So, part of defining the interface above will include the ability to write out and read back (XML) settings files.

Mark M. proposed the notion of giving HDF5 library properties this ability, so that literally any HDF5 properties could be stored persistently as XML strings, either within an HDF5 file or as a raw, standalone XML file. The HDF5 Group is nearly finished with a mechanism to serialize/deserialize property lists (into binary, not text, though).

Add the interface defined above to some I/O benchmarks (Silo’s ioperf, h5perf, h5part kernel)

Mark M. offered manpower to incorporate the one-stop-shopping I/O tuneables interface into Silo’s ioperf. The timeframe would be sometime before the end of March, 2012. Prabhat and Mark H. could adjust h5part and use it as a kernel.

Inventory I/O relevant tuneables

Do we even know what all the knobs are and what they are (intended to) do? Can we collect in one place (and maintain this information as things change) all the tuneables that exist? Ruth suggested a world-readable wiki for this. We agreed that having it be HPC-specific would be best.

Triage tuneables for what’s important and what’s not (rationale too)?

Having a list of all possible tuneables is useful, but we really ought to have some idea of what’s important and what’s not, as well as our rationale for these choices. So, we need to triage/classify the tuneables list into things we think are

  • really important
  • somewhat important
  • not very important

as well as the assumptions/conditions under which such judgements are valid (e.g. kinds of I/O application scenarios).

Do we know the answers already?

Mark M. argues that the single biggest factor in I/O performance is finding a way to make I/O requests as large as possible. Focusing on maximizing that achieves the biggest and best gains in performance. So, why worry about all the other, less significant knobs? If we have a good handle on the other knobs, we may be able to vary performance by several times on top of what we can gain with I/O request size. So, the other knobs are still useful?

An initial seat-of-the-pants triage of the various parameters, where we all sit around the table and say things like “I think X is useless” or “I think Y has a 10% effect”, would be a useful thing to do at a future meeting once we’ve compiled a complete list of tuneables.

Also, even if we know the answers today, they are very likely to change over time, or on other systems. And, we may add/retire tuneables from an auto-tuning framework over time, allowing it to stay relevant.

That said, in situations where it is possible to affect I/O request size without undue impact on the application or other parts of the software stack, I believe very strongly that I/O request size will always be relevant and always have a significant impact. On the other hand, should I/O request size really be considered a tuneable, as it is something that is controlled almost entirely by the application and/or I/O layers above HDF5?

Identifying driving applications (kernels)

We should spend some time characterizing the I/O scenarios used by our relevant applications. Prabhat mentions h5part can serve as a useful I/O kernel. Quincey suggested the following 3 kinds of I/O scenarios: restart/plot dumps (i.e. write once, read never), LSST-like transactions (write many, read frequently), and visualization/post-processing analyses (write once, read many).

We need to spend some time defining what the driving applications are and putting word descriptions to them.

Mark H. explained that there is some literature already available on automatically detecting I/O patterns and classifying them. We should look at what’s out there already and see if we can use it.

Granularity of timing and statistics?

Is a single number, total execution time, sufficient for characterizing results from benchmarking runs? Do we need to be able to take finer granularity measures?

Measuring I/O performance can be really, really hard. There are many pitfalls and gotchas. One thing is simply sanity checking to ensure a given test was indeed run with the settings you thought you specified. I can’t tell you how often passing things like MPI-IO hints has silently failed due to misspellings in the hint names. This kind of thing is fraught with peril, and we need to ensure we do plenty of error checking to confirm the settings the test is supposed to run with are indeed being set (a sketch of one such check follows).
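
A hedged sketch of the kind of sanity check meant here, using MPI-IO as the example: after opening the file, read the hints back off the file handle and confirm the value you asked for is actually in effect; a misspelled hint name simply won’t show up:

  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  /* Confirm that a hint passed at file-open time was actually accepted.    */
  static void check_hint(MPI_File fh, const char *key, const char *expected)
  {
      MPI_Info info;
      char     val[MPI_MAX_INFO_VAL + 1];
      int      flag = 0;

      MPI_File_get_info(fh, &info);            /* hints actually in effect  */
      MPI_Info_get(info, (char *)key, MPI_MAX_INFO_VAL, val, &flag);
      if (!flag || strcmp(val, expected) != 0)
          fprintf(stderr, "hint \"%s\" is not set to \"%s\"\n", key, expected);
      MPI_Info_free(&info);
  }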

Fine granularity data is often helpful in diagnosing why a test went wrong, in concluding that a test is an outlier, or in identifying what part of the software stack went wrong. So, there is value in fine-grained timing data. Mark H. mentions that such data can also be gainfully exploited, along with some statistical analysis, to better understand performance and results (a sketch of gathering per-rank timings follows).
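
For example, rather than reporting only a single total time, each rank’s phase timings can be gathered for later statistical analysis. A minimal sketch; what exactly gets timed is up to the benchmark:

  #include <mpi.h>
  #include <stdlib.h>

  /* Gather one fine-grained measurement (e.g. this rank's write time) to   */
  /* rank 0 so min/max/mean and outliers can be examined, not just a total. */
  static void gather_times(double my_time, MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      double *all = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
      MPI_Gather(&my_time, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);
      /* rank 0 now has every rank's time and can log or analyze them.      */
      free(all);
  }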

Key next steps

  • Start developing design for one-stop-shopping for tuneables interface
  • Outline missing functionality in HDF5 necessary to obtain I/O tracing data for record and playback
    • HDF5 already has some mechanism for doing some of this via its HDF5_DEBUG=trace,all functionality
    • Ruth mentioned that this sort of mechanism may not capture real data values, but what if we kept the file that was produced by the recorded trace and used that as input to the playback player? In fact, if the original recording was from a real application, then we could assume the entire contents of the file could be read back into memory prior to playback, so that write testing would not be corrupted by attempts to read data from the source file. Just a thought.
  • Estimate resources (cost) to complete above
  • Start adjusting existing benchmark codes (h5part, ioperf) to do some more interesting I/O cases
  • Do some initial small runs with above simply to exercise processes
  • Start assembling and triaging list of tuneables

Footnotes

1 Note that this problem is somewhat similar in spirit to the problem of multiple run-time type interfaces (HDF5, MPI, the C/C++ programming language, etc.). Presently, a user winds up having to learn multiple interfaces for defining run-time types and then re-specify the same type to the different interfaces to obtain consistent behavior across interfaces.
