Compression

miller86 edited this page Mar 24, 2019 · 1 revision

Some compression algorithms, such as gzip, can be applied to any arbitrary byte stream; they are agnostic to the kind of Silo object being written. Other compression algorithms, such as Peter Lindstrom’s HZIP, can be applied only to meshes consisting of hexahedra (3D case) or quadrilaterals (2D case).

Currently, there is a single, global method, DBSetCompression(), that allows the Silo client to affect compression on all subsequent Silo calls until the next call to DBSetCompression(). On the one hand, this minimizes the amount of work a Silo client needs to do to enable compression in their writer code. On the other hand, it is global to the whole library and affects all Silo writes to all files currently open or opened in the future.

We could add DBOPT_COMPRESSION options to each and every Silo object, allowing the client to control compression on an object-by-object basis at the expense of more significant modifications to their writer code. Alternatively, we could add options to the DBCreate() and DBOpen() calls to allow the client to specify compression controls on a file-by-file basis. This would involve expanding the notion of file options sets to include compression options. Ideally, we would support all three of these approaches to controlling compression.

In this context, a problem with the way compression controls are currently implemented (i.e. a single library-global control) is that there is an all-or-nothing effect. If a caller has various types of objects to write and HZIP is the selected compression algorithm, the objects that do not conform to the limitations HZIP imposes are not compressed at all. Currently, all that Silo’s compression controls permit is to specify the algorithm (by name) and then various parameters of the algorithm.

During writes, the HDF5 library provides two decision points regarding the application of a compression algorithm to a given write. The first is the can_apply method of the compression filter: if can_apply returns false, the algorithm is not applied. The second is the actual application of the filter itself (which, for custom compression algorithms like HZIP, the Silo library itself controls), where we can decide to have the filter fail. In that case, the HDF5 library performs the write uncompressed, provided the filter flags argument in the dataset creation property list marked the filter as optional.
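
The two decision points can be illustrated with a small, self-contained C model. This is only a sketch of the logic as described above, not HDF5 or Silo code: the types model_filter and write_request, the function model_write, and the pretend compressor halving_apply are all hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* What happened to one write in this toy model. */
typedef enum { WRITE_COMPRESSED, WRITE_UNCOMPRESSED, WRITE_FAILED } write_outcome;

/* Hypothetical description of the data being written. */
typedef struct {
    int    ndims;        /* spatial dimension of the mesh */
    bool   quad_or_hex;  /* true if zones are quads (2D) or hexes (3D) */
    size_t nbytes;       /* uncompressed size in bytes */
} write_request;

/* Hypothetical stand-in for a compression filter (NOT an HDF5 struct). */
typedef struct {
    bool   (*can_apply)(const write_request *req);
    size_t (*apply)(const write_request *req); /* compressed size, 0 on failure */
    bool   optional;                           /* models the "optional" filter flag */
} model_filter;

/* HZIP-like rule from the text: quads in 2D or hexes in 3D only. */
bool hzip_like_can_apply(const write_request *req)
{
    return (req->ndims == 2 || req->ndims == 3) && req->quad_or_hex;
}

/* Pretend compressor (assumption): fails on tiny inputs, else halves the size. */
size_t halving_apply(const write_request *req)
{
    return req->nbytes < 16 ? 0 : req->nbytes / 2;
}

write_outcome model_write(const model_filter *f, const write_request *req)
{
    /* Decision point 1: the filter declares itself inapplicable. */
    if (!f->can_apply(req))
        return WRITE_UNCOMPRESSED;

    /* Decision point 2: the filter runs and may choose to fail. */
    if (f->apply(req) == 0)
        return f->optional ? WRITE_UNCOMPRESSED : WRITE_FAILED;

    return WRITE_COMPRESSED;
}
```

In this model, a tetrahedral mesh is rejected at the first decision point, while a tiny quadrilateral mesh passes the first and falls back to an uncompressed write at the second only because the filter was marked optional.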

Proposed file-level and object-level compression controls

  • Interpret the DBSetCompression() method as controlling the library-wide default compression settings. Whenever a file is created, or opened in any mode other than read-only, the file inherits the library’s current default compression settings.
    • If this method is called multiple times, each call affects only files created or opened (other than read-only) after that call.
  • Add a new DBSetFileCompression(DBfile *dbfile, const char *compression) method to override a given file’s compression settings.
    • This method can be called multiple times on a currently open file, and each time it will affect only future new object writes.
    • Alternatively, the functionality to override the default compression settings could be confined to file creation/open calls. This might be useful in preventing variation in the compression method applied in a file over the course of writes to it. On the other hand, the Silo library would then have to store compression information in the file so that subsequent opens and writes could also use the same settings.
    • A risk in permitting the compression method(s) used in a file to vary over the life of writes to the file might be that a given file could wind up with data requiring a multitude of different compression algorithms to read entirely.
  • Add a DBOPT_COMPRESSION object-level option that overrides the file-level compression settings.
    • What about Silo write methods that do not have a DBoptlist argument?
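
The inheritance rules in this proposal (library default, then file, then object) can be sketched as a small C model. Nothing here is Silo code: model_file, model_set_compression, model_open, model_set_file_compression, and model_effective are hypothetical names standing in for the proposed behavior, and the setting strings are treated as opaque placeholders.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAXLEN 64

/* Library-wide default, as set by the (modeled) global call. */
static char lib_default[MODEL_MAXLEN] = "";

/* Hypothetical per-file state: each file carries its own settings. */
typedef struct {
    char compression[MODEL_MAXLEN];
    int  read_only;
} model_file;

static void copy_setting(char *dst, const char *src)
{
    strncpy(dst, src ? src : "", MODEL_MAXLEN - 1);
    dst[MODEL_MAXLEN - 1] = '\0';
}

/* Models DBSetCompression(): changes the default for FUTURE opens only. */
void model_set_compression(const char *setting)
{
    copy_setting(lib_default, setting);
}

/* Models create/open: a writable file snapshots the current default. */
void model_open(model_file *f, int read_only)
{
    f->read_only = read_only;
    copy_setting(f->compression, read_only ? "" : lib_default);
}

/* Models the proposed DBSetFileCompression(): affects future writes to f. */
void model_set_file_compression(model_file *f, const char *setting)
{
    copy_setting(f->compression, setting);
}

/* Settings used for one object write: an object-level option (modeling
 * the proposed DBOPT_COMPRESSION) wins over the file's settings. */
const char *model_effective(const model_file *f, const char *object_opt)
{
    return object_opt ? object_opt : f->compression;
}
```

Under this model, changing the library default after a file has been opened leaves that file's settings untouched, which is the key difference from the current global behavior. Write methods without an option list would simply pass no object-level option and get the file's settings.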

To some extent, this approach delegates to the Silo client the job of deciding which compression to use and where. In this context, if a caller sets HZIP as the library-wide compression method and then discovers that nothing in the file winds up getting compressed, the onus is on the user to understand why that is the case and fix it by adjusting the compression settings on individual file(s) or object(s).

Proposed improvement to global-level compression controls

  • Interpret the DBSetCompression method as controlling a compression strategy that tells the Silo library what compression method(s) to use and what to do in case one method cannot be applied (e.g. fails the can_apply test). For example, the setting might be ‘use HZIP where possible and GZIP-BEST otherwise’.
    • Note that it is the failure of the HDF5 filter’s can_apply method that is the distinguishing factor in selecting the compression algorithm to use, not a failure of the compression algorithm to actually compress (i.e. reduce) the amount of data. In HDF5, compression filter specification is done during dataset creation and not during write operations. Therefore, in order to fall back from one kind of compression to another, the Silo library would have to rely upon HDF5’s can_apply method returning false for the compression filter that should be turned off (skipped). Otherwise, the Silo library would have to defer the fallback to the H5Dwrite call, which would mean aborting the write, deleting the already created dataset, and creating another dataset with the proper compression properties.
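
A strategy such as ‘use HZIP where possible and GZIP-BEST otherwise’ amounts to an ordered list of methods, each with an applicability test evaluated before any data is written. The C sketch below models only that selection step. The names strategy_entry, dataset_info, and choose_method are hypothetical, and the applicability predicates are stand-ins for what a filter's can_apply method would decide.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what is known at dataset-creation time. */
typedef struct {
    int  ndims;        /* spatial dimension of the mesh */
    bool quad_or_hex;  /* true if zones are quads (2D) or hexes (3D) */
} dataset_info;

/* One entry of a strategy: a method name plus its applicability test. */
typedef struct {
    const char *name;
    bool (*can_apply)(const dataset_info *info);
} strategy_entry;

/* HZIP-like rule from the text: quads in 2D or hexes in 3D only. */
bool strategy_hzip_ok(const dataset_info *info)
{
    return (info->ndims == 2 || info->ndims == 3) && info->quad_or_hex;
}

/* A byte-stream compressor such as gzip applies to anything. */
bool strategy_always_ok(const dataset_info *info)
{
    (void)info;
    return true;
}

/* Return the first method in the strategy whose test passes, or NULL
 * if none applies (meaning: write uncompressed). The choice depends
 * only on applicability, never on how well the data would compress. */
const char *choose_method(const strategy_entry *strategy, size_t n,
                          const dataset_info *info)
{
    for (size_t i = 0; i < n; i++) {
        if (strategy[i].can_apply(info))
            return strategy[i].name;
    }
    return NULL;
}
```

Because the selection uses only information available before the write, it fits HDF5's model of fixing the filter at dataset creation and avoids the abort, delete, and recreate sequence described above.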

About the only advantage to this approach is that it is the simplest to implement. On the other hand, it leaves a number of problems with compression still in the library. If multiple files are open simultaneously, the compression settings are global to the whole library and wind up getting applied to each and every open file.