The Open-Meteo OM-File format is designed for efficient storage and distribution of multi-dimensional environmental data. By chunking, compressing, and indexing the data, OM-Files enable cloud-native random reads while minimizing file sizes. The format supports hierarchical data structures similar to NetCDF or HDF5.
This library implements the format in C, with a high-level Swift abstraction integrated directly into the Open-Meteo weather API. Future bindings for Python, TypeScript, and Rust are planned.
Note: This library is in a highly experimental stage. While Open-Meteo has used the format for years, this standalone library was initiated in October 2024 to provide Python bindings. We aim to provide a robust Python library to access the Open-Meteo weather database provided on S3 through an AWS open-data sponsorship.
- Chunked, compressed multi-dimensional arrays
- High-speed integer compression: Fast compression speed at high compression ratios
- Lossless and lossy compression: Adjustable accuracy via scale factors to further reduce data size
- Optimized for cloud-native random IO access: Supports IO merging and splitting
- Sequential file writing: Enables streaming write to cloud storage; metadata is stored at the file’s end
- Sans-IO C implementation: Designed for async support and concurrency in higher-level libraries
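The lossy option above can be illustrated with a minimal sketch. It assumes values are multiplied by the scale factor and rounded to integers before integer compression; the function names are hypothetical and not part of the library API.

```python
def quantize(values, scale_factor):
    """Map floats to integers; the resolution is 1 / scale_factor (assumed convention)."""
    return [round(v * scale_factor) for v in values]


def dequantize(stored, scale_factor):
    """Recover approximate floats from the stored integers."""
    return [s / scale_factor for s in stored]


temperatures = [21.37, 21.41, 21.46, 21.52]  # made-up sample values in degrees Celsius
scale_factor = 20                            # 0.05 degree resolution

stored = quantize(temperatures, scale_factor)
restored = dequantize(stored, scale_factor)

# The rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(temperatures, restored))
print(stored, max_error)
```

A larger scale factor keeps more precision but produces larger integers, which usually compress less well.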
- Chunked Data Storage: OM-Files partition large data arrays into individually compressed chunks, with a lookup table tracking chunk positions. This allows reading and decompressing only the required chunks—ideal for use cases like meteorological datasets, where users often query specific regions rather than global data.
- Optimized for Meteorological Use Cases: In weather reanalysis (e.g., Copernicus ERA5-Land), global datasets at 0.1° spatial resolution reach massive scales. A single timestep with 3600 x 1800 pixels occupies about 25 MB as 32-bit floats and grows to roughly 227 GB (211.5 GiB) for one year of hourly data (8760 hours). Over decades, and across thousands of variables, datasets easily reach petabyte scales. Traditional GRIB files, while efficient for compression, require decompressing the entire file to access specific subsets. OM-Files, on the other hand, allow direct access to localized data (e.g., a single country or city) by using small chunk sizes (e.g., 3 x 3 x 120).
- High-Speed Data Access: OM-Files minimize data transfer and decompression overhead, enabling fast reads while maintaining strong compression ratios. Compression is based on FastPFOR with SIMD instructions and reaches throughput in the GB/s range. This allows the Open-Meteo weather API to deliver forecasts in under a millisecond and enables large-scale data analysis without requiring users to download hundreds of gigabytes of GRIB files.
- Improved Compression Efficiency: Chunking exploits spatial and temporal data correlations to enhance compression. Weather data, for instance, shows gradual changes across locations and time. Optimal chunking dimensions (compressing 1,000–2,000 values per chunk with a last dimension >100) strike a balance between compression efficiency and performance. Too many chunks reduce both.
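The arithmetic and the chunk selection described above can be sketched as follows. This is an illustration only: the dimension order, the row-major chunk numbering, and the function name `chunks_for_region` are assumptions, not the library's actual implementation.

```python
from itertools import product
from math import ceil


def chunks_for_region(dims, chunks, offset, count):
    """Return linear indices of all chunks touched by a hyper-rectangular read."""
    # Number of chunks along each dimension.
    n_chunks = [ceil(d / c) for d, c in zip(dims, chunks)]
    # Chunk coordinates touched along each dimension.
    ranges = [range(o // c, (o + n - 1) // c + 1)
              for o, n, c in zip(offset, count, chunks)]
    indices = []
    for coord in product(*ranges):
        linear = 0
        for i, n in zip(coord, n_chunks):
            linear = linear * n + i  # row-major numbering (assumption)
        indices.append(linear)
    return indices


dims = (3600, 1800, 8760)  # 0.1 degree grid, one year of hourly steps
chunks = (3, 3, 120)

raw_gib = dims[0] * dims[1] * dims[2] * 4 / 2**30       # 32-bit floats
values_per_chunk = chunks[0] * chunks[1] * chunks[2]
total_chunks = 1
for d, c in zip(dims, chunks):
    total_chunks *= ceil(d / c)

# Ten days of hourly data for a single grid cell.
needed = chunks_for_region(dims, chunks, offset=(1800, 900, 0), count=(1, 1, 240))
print(round(raw_gib, 1), values_per_chunk, total_chunks, len(needed))
```

In this example a point query touches 2 of roughly 52 million chunks, each holding 1080 values, which falls in the 1,000 to 2,000 range recommended above.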
- Document Swift functions
- Document C functions
- Support for string attributes and string-arrays
- Build Python library
- Examples of how to use Python fsspec with caching to access OM-Files on S3
- Build web-interface to make the entire Open-Meteo weather database accessible with automatic Python code generation
Swift code can be found in ./Swift with tests in ./Tests
TODO: Document functions + example
The C code is available in /c
TODO document C functions
- The file trailer contains the position of the root Variable
- Each Variable has a datatype and payload. E.g. Int16 has the number as a 2-byte payload. An array stores the look-up-table position and array dimension information. The actual compressed array data is stored at the beginning of the file.
- Each Variable has a name
- Each Variable has 0...N child variables. Variables resemble a key-value store in which each value can have N children.
A Variable can be of different types:
- None: Does not contain any value. Useful to define a group
- Scalar of types Int8, Int16, Int32, Int64, Float, Double, etc
- Array of type Int8, Int16, etc with dimensions, chunks and compression type information
- String: to be implemented
- String Array: to be implemented
The following examples show how data with attributes can be encoded in the OM-File format:
Example 1: Plain array inside an OM-File:
Root: Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
Example 2: Array with attributes
Root: Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
|- Name="dimension_names" Type=String-Array Dimensions=[3]
|- Name="long_name" Type=String Value="Temperature 2 metres above ground"
|- Name="unit" Type=String Value="Celsius"
|- Name="height" Type=Int32 Value=2
Example 3: Multiple Arrays with attributes
Root: Type=None
|- Name="temperature_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
|  |- Name="dimension_names" Type=String-Array Dimensions=[3]
|  |- Name="long_name" Type=String Value="Temperature 2 metres above ground"
|  |- Name="unit" Type=String Value="Celsius"
|  |- Name="height" Type=Int32 Value=2
|- Name="relative_humidity_2m" Type=Float32-Array Dimensions=[720,1400,24] Chunks=[1,50,24]
|  |- Name="dimension_names" Type=String-Array Dimensions=[3]
|  |- Name="long_name" Type=String Value="Relative Humidity 2 metres above ground"
|  |- Name="unit" Type=String Value="Percentage"
|  |- Name="height" Type=Int32 Value=2
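The tree structure of these examples can be modelled in a few lines. The sketch below is only an in-memory illustration of Example 3; the class layout and the `find` helper are hypothetical and do not mirror the C or Swift API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Variable:
    name: str
    data_type: str                          # e.g. "None", "Int32", "String", "Float32-Array"
    value: Any = None                       # scalar or string payload
    dimensions: Optional[List[int]] = None  # arrays only
    chunks: Optional[List[int]] = None      # arrays only
    children: List["Variable"] = field(default_factory=list)

    def number_of_children(self) -> int:
        return len(self.children)

    def get_child(self, n: int) -> "Variable":
        return self.children[n]

    def find(self, name: str) -> Optional["Variable"]:
        """Hypothetical helper: look up a direct child by name."""
        return next((c for c in self.children if c.name == name), None)


def array_with_attributes(name: str, long_name: str, unit: str) -> Variable:
    return Variable(
        name, "Float32-Array", dimensions=[720, 1400, 24], chunks=[1, 50, 24],
        children=[
            Variable("dimension_names", "String-Array", dimensions=[3]),
            Variable("long_name", "String", value=long_name),
            Variable("unit", "String", value=unit),
            Variable("height", "Int32", value=2),
        ],
    )


# Root of type None acts as a group holding two arrays.
root = Variable("", "None", children=[
    array_with_attributes("temperature_2m", "Temperature 2 metres above ground", "Celsius"),
    array_with_attributes("relative_humidity_2m", "Relative Humidity 2 metres above ground", "Percentage"),
])

print(root.find("relative_humidity_2m").find("unit").value)
```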
classDiagram
Variable <|-- Variable
Variable --|> Int8
Variable --|> Int16
Variable --|> String
Variable --|> Array
Trailer --|> Variable
Variable : +String_name
Variable : +Variable[]_children
Variable : +Enum_data_type
Variable : +Enum_compression_type
Variable: +number_of_children()
Variable: +get_child(int n)
Variable: +get_name()
class Trailer {
+version
+root_variable
}
class Int8{
+Int8 value
+read()
}
class Int16{
+Int16 value
+read()
}
class String{
+String_value
+read()
}
class Array{
+Int64[]_dimensions
+Int64[]_chunks
+Int64_look_up_table_offset
+Int64_look_up_table_size
+read(offset:Int64[],count:Int64[])
}
Legacy Binary Format:
- Int16: magic number "OM"
- Int8: version
- Int8: compression type with filter
- Float32: scalefactor
- Int64: dim0 (slow)
- Int64: dim1 (fast)
- Int64: chunk dim0
- Int64: chunk dim1
- Array of 64-bit Integer: Offset lookup table
- Blob: Data for each chunk, located via the offsets in the lookup table
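The fixed part of the legacy layout adds up to 40 bytes (2 + 1 + 1 + 4 + 4 x 8). A minimal sketch of parsing it with Python's `struct` module follows. Little-endian byte order, the magic number being the two ASCII bytes "OM", and the sample field values are assumptions made for illustration.

```python
import struct
from math import ceil

# magic (2 bytes), version, compression type, scalefactor, dim0, dim1, chunk0, chunk1
LEGACY_HEADER = struct.Struct("<2sBBf4q")  # "<" = little-endian, an assumption


def parse_legacy_header(buf: bytes) -> dict:
    magic, version, compression, scalefactor, dim0, dim1, chunk0, chunk1 = \
        LEGACY_HEADER.unpack_from(buf, 0)
    if magic != b"OM":
        raise ValueError("not an OM file")
    return {
        "version": version,
        "compression": compression,
        "scalefactor": scalefactor,
        "dims": (dim0, dim1),
        "chunks": (chunk0, chunk1),
        # Number of chunks the data blob is split into.
        "chunk_count": ceil(dim0 / chunk0) * ceil(dim1 / chunk1),
    }


# Made-up header values, only to exercise the parser.
sample = LEGACY_HEADER.pack(b"OM", 2, 0, 20.0, 720, 1400, 1, 50)
header = parse_legacy_header(sample)
print(LEGACY_HEADER.size, header)
```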
New Binary Format:
- 3 bytes: Header (magic number "OM" + version)
- Blob: Compressed data and lookup table (LUT)
- Blob: Binary encoded metadata
- 24 bytes: Trailer with address of the root variable
Binary representation:
- File header with magic number and version
- File trailer with offsets and size of the root variable
- Variable has attributes: data type (8 bit), compression type (8 bit), size_of_name (16 bit), count_of_attributes (32 bit)
- Followed by a payload that depends on the data type
- Followed by the name as a string, and for each attribute the offset and size
- Typically all compressed data is in the beginning of the file, followed by all meta data and attributes (streaming write without ever seeking back!)
Header message:
Bytes | Content |
---|---|
1-2 | Magic number "OM" |
3 | Version |
Trailer message:
Bytes | Content |
---|---|
1-2 | Magic number "OM" |
3 | Version |
4-8 | Reserved |
9-16 | Size of Root Variable |
17-24 | Offset of Root Variable |
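The sequential write and the trailer lookup can be sketched as below. This is not the library's writer: little-endian byte order, unsigned 64-bit size and offset fields, the version value, and the function names are assumptions for illustration.

```python
import io
import struct

# magic, version, 5 reserved bytes, size of root variable, offset of root variable
TRAILER = struct.Struct("<2sB5xQQ")  # 24 bytes; byte order is an assumption
VERSION = 3                          # hypothetical version number


def write_file(stream, data_blob: bytes, root_variable: bytes) -> None:
    """Write header, data, metadata and trailer strictly in order, never seeking back."""
    position = 0
    for part in (b"OM" + bytes([VERSION]), data_blob):
        stream.write(part)
        position += len(part)
    root_offset = position  # metadata starts after all compressed data
    stream.write(root_variable)
    stream.write(TRAILER.pack(b"OM", VERSION, len(root_variable), root_offset))


def read_root(stream) -> bytes:
    """Locate the root variable through the trailer at the end of the file."""
    stream.seek(-TRAILER.size, io.SEEK_END)
    magic, version, root_size, root_offset = TRAILER.unpack(stream.read(TRAILER.size))
    if magic != b"OM":
        raise ValueError("not an OM file")
    stream.seek(root_offset)
    return stream.read(root_size)


f = io.BytesIO()
write_file(f, data_blob=b"\x00" * 100, root_variable=b"root-metadata")
print(read_root(f))
```

For remote storage, the same lookup takes two ranged reads: one for the last 24 bytes and one for the root variable.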
Variable message:
Field | Size |
---|---|
Data Type | 1 byte |
Compression Type | 1 byte |
Size of name | 2 bytes |
Number of Children | 4 bytes |
Size of Value / LUT (only arrays and strings) | 8 bytes |
Offset of Value / LUT (only arrays) | 8 bytes |
Number of Dimensions (only arrays) | 8 bytes |
Scale Factor (float, only arrays) | 4 bytes |
Add Offset (float, only arrays) | 4 bytes |
N * Size of Child | N * 8 bytes |
N * Offset of Child | N * 8 bytes |
N * Dimension Length (only arrays) | N * 8 bytes |
N * Chunk Dimension Length (only arrays) | N * 8 bytes |
Bytes of value (scalar, string, not arrays) | variable |
Bytes of name | Size of name |
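To make the field order concrete, here is a sketch that encodes and decodes a scalar Int32 variable. It assumes little-endian byte order, 8-byte child size and offset entries, and that the array-only fields are absent for scalars. The numeric type and compression codes are hypothetical placeholders, not the real enum values.

```python
import struct

HEAD = struct.Struct("<BBHI")  # data type, compression type, size of name, number of children
DATA_TYPE_INT32 = 5            # hypothetical code
COMPRESSION_NONE = 0           # hypothetical code


def encode_int32(name: str, value: int, children=()) -> bytes:
    """children is a sequence of (size, offset) pairs pointing at child variables."""
    name_bytes = name.encode("utf-8")
    n = len(children)
    out = HEAD.pack(DATA_TYPE_INT32, COMPRESSION_NONE, len(name_bytes), n)
    out += struct.pack(f"<{n}Q", *[size for size, _ in children])
    out += struct.pack(f"<{n}Q", *[offset for _, offset in children])
    out += struct.pack("<i", value)  # bytes of value
    out += name_bytes                # the name comes last
    return out


def decode_int32(buf: bytes):
    data_type, _compression, name_size, n = HEAD.unpack_from(buf, 0)
    if data_type != DATA_TYPE_INT32:
        raise ValueError("unexpected data type")
    pos = HEAD.size
    sizes = struct.unpack_from(f"<{n}Q", buf, pos)
    pos += 8 * n
    offsets = struct.unpack_from(f"<{n}Q", buf, pos)
    pos += 8 * n
    (value,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    name = buf[pos:pos + name_size].decode("utf-8")
    return name, value, list(zip(sizes, offsets))


encoded = encode_int32("height", 2, children=[(16, 3)])
print(len(encoded), decode_int32(encoded))
```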