open-meteo/om-file-format

OM-File-Format library

The Open-Meteo OM-File format is designed for efficient storage and distribution of multi-dimensional environmental data. By chunking, compressing, and indexing the data, OM-Files enable cloud-native random reads while minimizing file sizes. The format supports hierarchical data structures similar to NetCDF or HDF5.

This library implements the format in C, with a high-level Swift abstraction integrated directly into the Open-Meteo weather API. Future bindings for Python, TypeScript, and Rust are planned.

Note: This library is in a highly experimental stage. While Open-Meteo has used the format for years, this standalone library was initiated in October 2024 to provide Python bindings. We aim to provide a robust Python library to access the Open-Meteo weather database provided on S3 through an AWS open-data sponsorship.

Features:

  • Chunked, compressed multi-dimensional arrays
  • High-speed integer compression: Fast compression speed at high compression ratios
  • Lossless and lossy compression: Adjustable accuracy via scale factors to further reduce data size
  • Optimized for cloud-native random IO access: Supports IO merging and splitting
  • Sequential file writing: Enables streaming write to cloud storage; metadata is stored at the file’s end
  • Sans-IO C implementation: Designed for async support and concurrency in higher-level libraries
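
The lossy mode in the feature list can be illustrated with a small sketch. It assumes that a scale factor multiplies each value before rounding to a 16-bit integer, so a scale factor of 20 keeps a resolution of 0.05. The function names and the choice of Int16 are illustrative assumptions, not the library's API.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def quantize(values, scale_factor):
    """Scale and round floats to integers (the lossy step)."""
    out = []
    for v in values:
        q = round(v * scale_factor)
        # Clamp so the result always fits the assumed Int16 storage type.
        out.append(max(INT16_MIN, min(INT16_MAX, q)))
    return out

def dequantize(ints, scale_factor):
    """Recover approximate floats from the stored integers."""
    return [q / scale_factor for q in ints]

temperatures = [12.34, 12.41, 12.47, 12.52]
stored = quantize(temperatures, scale_factor=20)
restored = dequantize(stored, scale_factor=20)

# The round-trip error is bounded by half a quantization step (0.5 / 20).
max_error = max(abs(a - b) for a, b in zip(temperatures, restored))

# Neighbouring values differ by small integers, which integer
# compressors can store in very few bits.
deltas = [b - a for a, b in zip(stored, stored[1:])]
```

A larger scale factor keeps more precision but produces larger integers, so the scale factor trades accuracy against file size.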

Core Principles:

  • Chunked Data Storage: OM-Files partition large data arrays into individually compressed chunks, with a lookup table tracking chunk positions. This allows reading and decompressing only the required chunks—ideal for use cases like meteorological datasets, where users often query specific regions rather than global data.
  • Optimized for Meteorological Use Cases: In weather reanalysis (e.g., Copernicus ERA5-Land), global datasets at 0.1° spatial resolution can reach massive scales. A single timestep with 3600 x 1800 pixels (~25 MB using 32-bit floats) grows to 211.5 GiB (about 227 GB) for one year of hourly data (8760 hours). Over decades, and across thousands of variables, datasets easily reach petabyte scales. Traditional GRIB files, while efficient for compression, require decompressing the entire file to access specific subsets. OM-Files, on the other hand, allow direct access to localized data (e.g., a single country or city) by leveraging small chunk sizes (e.g., 3 x 3 x 120).
  • High-Speed Data Access: OM-Files minimize data transfer and decompression overhead, enabling extremely fast reads while maintaining strong compression ratios. Compression is based on FastPFOR with SIMD instructions and reaches throughput in the GB/s range. This lets the Open-Meteo weather API deliver forecasts in under a millisecond and enables large-scale data analysis without requiring users to download hundreds of gigabytes of GRIB files.
  • Improved Compression Efficiency: Chunking exploits spatial and temporal data correlations to enhance compression. Weather data, for instance, shows gradual changes across locations and time. Optimal chunking dimensions (compressing 1,000–2,000 values per chunk with a last dimension >100) strike a balance between compression efficiency and performance. Too many chunks reduce both.
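
The arithmetic behind the ERA5-Land example, and the benefit of small chunks, can be checked with a short sketch. It assumes the dimension order (x, y, time) and the example chunk shape 3 x 3 x 120; `chunks_touched` is a hypothetical helper, not a library function. The 211.5 figure corresponds to binary gigabytes (GiB).

```python
import math

# Grid from the ERA5-Land example: 3600 x 1800 cells, hourly for one year.
NX, NY, HOURS = 3600, 1800, 8760
BYTES_PER_VALUE = 4  # 32-bit float

bytes_per_timestep = NX * NY * BYTES_PER_VALUE
bytes_per_year = bytes_per_timestep * HOURS
gib_per_year = bytes_per_year / 2**30

def chunks_touched(offset, count, chunk_shape):
    """Number of chunks overlapped by a hyper-rectangular read."""
    total = 1
    for start, length, chunk in zip(offset, count, chunk_shape):
        first = start // chunk
        last = (start + length - 1) // chunk
        total *= last - first + 1
    return total

dims = (NX, NY, HOURS)
chunk_shape = (3, 3, 120)  # assumed order: x, y, time
total_chunks = math.prod(math.ceil(d / c) for d, c in zip(dims, chunk_shape))

# One grid cell for the full year: only chunks along the time axis are read.
needed = chunks_touched(offset=(1000, 500, 0),
                        count=(1, 1, HOURS),
                        chunk_shape=chunk_shape)

print(f"{bytes_per_timestep / 1e6:.2f} MB per timestep")
print(f"{gib_per_year:.1f} GiB per year")
print(f"{needed} of {total_chunks} chunks needed for one location")
```

Under these assumptions, a one-year time series for a single location touches only a tiny fraction of the chunks in the file, which is why localized reads stay cheap.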

ToDo:

  • Document Swift functions
  • Document C functions
  • Support for string attributes and string-arrays
  • Build Python library
  • Examples of how to use Python fsspec with a cache to access OM-Files on S3
  • Build web-interface to make the entire Open-Meteo weather database accessible with automatic Python code generation

Swift Library Interface

Swift code can be found in ./Swift with tests in ./Tests

TODO: Document functions + example

C Library Interface

The C code is available in /c

TODO document C functions

Data Hierarchy Model:

  • The file trailer contains the position of the root Variable
  • Each Variable has a data type and a payload. For example, an Int16 stores its number as a 2-byte payload, while an array stores the lookup-table position and array dimension information. The actual compressed array data is stored at the beginning of the file.
  • Each Variable has a name
  • Each Variable has 0...N child Variables, so Variables resemble a key-value store where each value can have N children.

A Variable can be one of several types:

  • None: Does not contain any value. Useful to define a group
  • Scalar of type Int8, Int16, Int32, Int64, Float, Double, etc.
  • Array of type Int8, Int16, etc., with dimensions, chunks and compression type information
  • String (to be implemented)
  • String Array (to be implemented)

Examples

The following examples show how data with attributes can be encoded in an OM-File:

Example 1: Plain array inside an OM-File:

Root: Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]

Example 2: Array with attributes

Root: Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
|- Name="dimension_names" Type=String-Array Dimensions=[3]
|- Name="long_name" Type=String Value="Temperature 2 metres above ground"
|- Name="unit" Type=String Value="Celsius"
|- Name="height" Type=Int32 Value=2

Example 3: Multiple Arrays with attributes

Root: Type=None
|- Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
  |- Name="dimension_names" Type=String-Array Dimensions=[3]
  |- Name="long_name" Type=String Value="Temperature 2 metres above ground"
  |- Name="unit" Type=String Value="Celsius"
  |- Name="height" Type=Int32 Value=2
|- Name="relative_humidity_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
  |- Name="dimension_names" Type=String-Array Dimensions=[3]
  |- Name="long_name" Type=String Value="Relative Humidity 2 metres above ground"
  |- Name="unit" Type=String Value="Percentage"
  |- Name="height" Type=Int32 Value=2
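
The tree in Example 3 can be mirrored in memory with a small sketch. The `Variable` dataclass, its field names, and lookup of children by name are illustrative assumptions, not the library's types. The type names are copied from the examples as plain strings.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Variable:
    """Hypothetical in-memory mirror of an OM-File Variable."""
    name: str
    data_type: str
    value: Any = None
    dimensions: Optional[List[int]] = None
    chunks: Optional[List[int]] = None
    children: List["Variable"] = field(default_factory=list)

    def get_child(self, name: str) -> Optional["Variable"]:
        """Return the first child with the given name, or None."""
        for child in self.children:
            if child.name == name:
                return child
        return None

def make_array(name, long_name, unit):
    """Build one array Variable with the attributes used in Example 3."""
    return Variable(
        name, "Float32-Array",
        dimensions=[720, 1400, 24], chunks=[1, 50, 24],
        children=[
            Variable("dimension_names", "String-Array", dimensions=[3]),
            Variable("long_name", "String", value=long_name),
            Variable("unit", "String", value=unit),
            Variable("height", "Int32", value=2),
        ],
    )

# A root of type None acts purely as a group for the two arrays.
root = Variable("", "None", children=[
    make_array("temperature_2m", "Temperature 2 metres above ground", "Celsius"),
    make_array("relative_humidity_2m", "Relative Humidity 2 metres above ground", "Percentage"),
])

unit = root.get_child("relative_humidity_2m").get_child("unit").value
print(unit)
```

Attributes are ordinary child Variables, so the same lookup works for groups, arrays, and scalar metadata.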

Model

classDiagram
    Variable <|-- Variable
    Variable --|> Int8
    Variable --|> Int16
    Variable --|> String
    Variable --|> Array
    Trailer --|> Variable
    Variable : +String_name
    Variable : +Variable[]_children
    Variable : +Enum_data_type
    Variable : +Enum_compression_type
    Variable: +number_of_children()
    Variable: +get_child(int n)
    Variable: +get_name()
    class Trailer {
        +version
        +root_variable
    }
    class Int8{
      +Int8 value
      +read()
    }
    class Int16{
      +Int16 value
      +read()
    }
    class String{
      +String_value
      +read()
    }
    class Array{
      +Int64[]_dimensions
      +Int64[]_chunks
      +Int64_look_up_table_offset
      +Int64_look_up_table_size
      +read(offset:Int64[],count:Int64[])
    }

Legacy Binary Format:

  • Int16: magic number "OM"
  • Int8: version
  • Int8: compression type with filter
  • Float32: scalefactor
  • Int64: dim0 (slow)
  • Int64: dim1 (fast)
  • Int64: chunk dim0
  • Int64: chunk dim1
  • Array of 64-bit Integer: Offset lookup table
  • Blob: Data for each chunk, located via the offsets in the lookup table
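
The legacy field list can be turned into a parser sketch. It assumes little-endian byte order, no padding between fields, and one 64-bit lookup-table entry per chunk; none of these are confirmed by the list above. The version and compression values in the sample are placeholders.

```python
import struct

# Assumed packing: magic (2 bytes), version, compression type,
# scale factor (float32), dim0, dim1, chunk dim0, chunk dim1 (int64 each).
LEGACY_HEADER = struct.Struct("<2sBBfqqqq")

def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b

def parse_legacy_header(buf):
    """Decode the fixed-size header and derive where the chunk data starts."""
    (magic, version, compression, scale_factor,
     dim0, dim1, chunk0, chunk1) = LEGACY_HEADER.unpack_from(buf, 0)
    if magic != b"OM":
        raise ValueError("missing OM magic number")
    n_chunks = ceil_div(dim0, chunk0) * ceil_div(dim1, chunk1)
    lut_offset = LEGACY_HEADER.size
    return {
        "version": version,
        "compression": compression,
        "scale_factor": scale_factor,
        "dims": (dim0, dim1),
        "chunks": (chunk0, chunk1),
        "n_chunks": n_chunks,
        "lut_offset": lut_offset,
        # Assumption: the LUT holds one Int64 per chunk, followed by the data.
        "data_offset": lut_offset + 8 * n_chunks,
    }

# Placeholder values: version 2, compression type 0, scale factor 20.
sample = LEGACY_HEADER.pack(b"OM", 2, 0, 20.0, 720, 1400, 1, 50)
info = parse_legacy_header(sample)
print(info["n_chunks"], info["data_offset"])
```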

New Binary Format:

  • 3 byte: header (magic number "OM" + version)
  • Blob: Compressed data and lookup table (LUT)
  • Blob: Binary-encoded metadata
  • 24 byte: Trailer with address to root variable

Binary representation:

  • File header with magic number and version
  • File trailer with the offset and size of the root variable
  • Variable has attributes: data type (8 bit), compression type (8 bit), size_of_name (16 bit), count_of_attributes (32 bit)
  • Depending on data type followed by payload for a given data type
  • Followed by the name as string, and for each attribute the offset and size
  • Typically all compressed data is in the beginning of the file, followed by all meta data and attributes (streaming write without ever seeking back!)
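
The append-only layout described above can be sketched as follows. The sketch assumes little-endian byte order, a trailer packed as magic, version, 5 reserved bytes, size, and offset (24 bytes in total), and a placeholder version number. The lookup table and the real metadata encoding are omitted, so `metadata` is an opaque stand-in. `write_file` and `read_root` are hypothetical names.

```python
import io
import struct

VERSION = 3                                # placeholder version number
HEADER = b"OM" + bytes([VERSION])          # 3 bytes: magic number + version
TRAILER = struct.Struct("<2sB5xQQ")        # magic, version, reserved, size, offset

def write_file(out, chunks, metadata):
    """Append header, data, metadata and trailer without ever seeking back."""
    pos = out.write(HEADER)
    for blob in chunks:
        pos += out.write(blob)             # compressed chunk data comes first
    root_offset = pos                      # metadata position is known only now
    pos += out.write(metadata)
    pos += out.write(TRAILER.pack(b"OM", VERSION, len(metadata), root_offset))
    return pos

def read_root(buf):
    """Locate the root variable bytes via the trailer at the end of the file."""
    magic, version, size, offset = TRAILER.unpack_from(buf, len(buf) - TRAILER.size)
    if magic != b"OM":
        raise ValueError("missing OM magic number in trailer")
    return buf[offset:offset + size]

out = io.BytesIO()
total = write_file(out, chunks=[b"aaaa", b"bb"], metadata=b"meta")
root_bytes = read_root(out.getvalue())
```

Because the trailer has a fixed size, a reader can fetch the last 24 bytes first and then follow the offset to the root variable, while a writer only ever appends.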

Header message:

  • Bytes 1-2: Magic number "OM"
  • Byte 3: Version

Trailer message:

  • Bytes 1-2: Magic number "OM"
  • Byte 3: Version
  • Bytes 4-8: Reserved (two reserved fields)
  • Bytes 9-16: Size of Root Variable
  • Bytes 17-24: Offset of Root Variable

Variable message:

  • Byte 1: Data Type
  • Byte 2: Compression Type
  • Bytes 3-4: Size of name
  • Bytes 5-8: Number of Children
  • 8 bytes: Size of Value / LUT (only arrays and strings)
  • 8 bytes: Offset of Value / LUT (only arrays)
  • 8 bytes: Number of Dimensions (only arrays)
  • 4 bytes: Scale Factor (float, only arrays)
  • 4 bytes: Add Offset (float, only arrays)
  • N * Size of Child
  • N * Offset of Child
  • N * Dimension Length (only arrays)
  • N * Chunk Dimension Length (only arrays)
  • Bytes of value (scalar, string, not arrays)
  • Bytes of name
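
The fixed 8-byte prefix of a Variable message can be decoded with a short sketch. It assumes little-endian byte order and covers only the first four fields, because the remaining layout depends on the data type. The numeric type code in the sample is a made-up placeholder, and `read_prefix` is a hypothetical helper.

```python
import struct
from typing import NamedTuple

# Data type (8 bit), compression type (8 bit), size of name (16 bit),
# number of children (32 bit). Little-endian is an assumption.
PREFIX = struct.Struct("<BBHI")

class VariablePrefix(NamedTuple):
    data_type: int
    compression_type: int
    size_of_name: int
    number_of_children: int

def read_prefix(buf, offset=0):
    """Decode the first 8 bytes of a Variable message."""
    return VariablePrefix(*PREFIX.unpack_from(buf, offset))

# Placeholder codes: data type 5, compression type 0, no children.
name = b"height"
raw = PREFIX.pack(5, 0, len(name), 0)
prefix = read_prefix(raw)
print(prefix)
```

The name length and child count in the prefix tell a reader how many bytes of child tables and name follow the type-specific fields.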