-
casa-arrow is a package that translates CASA Tables (CTDS) into Arrow Tables at the C++ layer. We believe it will be useful to the greater Radio Astronomy community. The discussion above proposes implementing the MSv2.0/3.0 specification (via Arrow Tables) as a Parquet Dataset: a hierarchical directory structure of parquet files. Given the care needed when defining formats, this discussion exists to invite feedback. The listing by organisation below includes parties that I believe may be interested. It is not exhaustive: please forward this to anyone else who might benefit from it:
- ASTRON: @aroffringa
- JIVE
- NRAO: @Jan-Willem
- Harvard: @lindyblackburn
- BASP Group, EPFL: @etolley
- SARAO: @o-smirnov
- SKAO
-
/cc @mserylak
-
/cc @pkgw
-
/cc @IanHeywood and @david-macmahon
-
/cc @mreineck
-
Hi Simon,
-
This discussion proposes representing the Measurement Set v2.0 specification as a collection of Apache Arrow Datasets.
Other formats
This discussion does not concern itself with other cloud-native formats, most notably zarr.
Both SARAO and NRAO have investigated the use of zarr as an alternate format for MSv2.0/3.0. zarr can be faster than parquet, as demonstrated in other domains. However, CASA Tables are tabular by definition, and this format lends itself to use with SQL and, by implication, downstream query and data analytics engines which operate on Apache parquet files. zarr does not yet support this integration. This does not mean use of zarr should be discounted -- we believe both formats have benefits in different scenarios.
Background: The CASA Table Data System
The CASA Table Data System (CTDS) is a bespoke Radio Astronomy format for storing arrays (columns) of data in a relational, tabular on-disk database. It is a Columnar Database complete with its own Table Query Language (TAQL) SQL dialect, and is primarily used for storing and querying raw visibilities of Radio Interferometry data in the Measurement Set v2.0 specification. CTDS was implemented during the late 1990s/early 2000s, before:
CTDS is highly configurable and can store multiple columns in a single file or across multiple files, but in practice a number of concerns arise:
All of the above factors make the CTDS difficult to use in a modern, distributed and cloud computing paradigm.
Various strategies have been developed to ameliorate the above concerns, including:
While the CTDS, with the above modifications, can still process Radio Interferometric data and may be extended to do so in future, we believe that significant future effort can be avoided by representing Radio Astronomy data with widely used Data Engineering formats, particularly those within the Apache Arrow ecosystem.
Apache Arrow
At the heart of the Apache Arrow project is a specification for in-memory columnar data.
Similarly to CASA Tables, Arrow Tables are composed of a set of columns. This allows data to be produced and consumed by multiple languages including C++, Rust, Python and Julia, and therefore to be consumed by data processing and scientific software within those ecosystems.
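As a minimal sketch of this columnar model (the column names and values below are illustrative, loosely modelled on MSv2.0 MAIN table columns, and are not part of any proposed schema), an Arrow Table can be built from named columns with pyarrow:

```python
import pyarrow as pa

# Illustrative columns loosely modelled on MSv2.0 MAIN table columns
table = pa.table({
    "TIME": pa.array([4.86e9, 4.86e9, 4.86e9], type=pa.float64()),
    "ANTENNA1": pa.array([0, 0, 1], type=pa.int32()),
    "ANTENNA2": pa.array([1, 2, 2], type=pa.int32()),
})

print(table.schema)
print(table.num_rows)
```

The same in-memory layout can then be handed to any other Arrow implementation (C++, Rust, Julia, etc.) without copying or re-encoding.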
Aside from cross-language support, formats in the Apache Arrow ecosystem are understood by both Storage and Execution Engines, which enable large scale, distributed data analytics. Examples of Storage Engines include:
while examples of Execution Engines include:
Additionally, Arrow Tables are convertible to Dataframes for consumption by Dataframe frameworks such as:
Such frameworks allow data scientists to easily manipulate data.
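As a small illustration of the Dataframe conversion (the table construction is repeated only to keep the snippet self-contained; pandas is assumed to be installed):

```python
import pyarrow as pa

table = pa.table({
    "ANTENNA1": pa.array([0, 0, 1], type=pa.int32()),
    "ANTENNA2": pa.array([1, 2, 2], type=pa.int32()),
})

# Hand the columnar data to a Dataframe framework (pandas here)
df = table.to_pandas()
print(df.groupby("ANTENNA1").size())
```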
Flexible Data Types
Apache Arrow supports many data types, which can be subdivided into two classes: primitive and nested. Primitive types include booleans, ints and floats, while nested types include lists, structs and maps. Extension types can be defined by flexibly combining the above type classes. This is important in the case of Radio Astronomy data as Arrow does not natively support complex numbers; however, they can be expressed as an extension type defined as a list of two floats. Similarly, tensors can be represented as a series of nested lists corresponding to the rank of the tensor.
At this time a PR for a canonical Tensor type is being added to Arrow:
and adding Complex Numbers as a canonical type is also a requested feature:
While libraries can freely implement their own extension types, canonical extension types are useful as other applications in the Arrow ecosystem are more likely to recognise them.
This flexible type system allows the expression of most CTDS data. Some edge cases are mentioned below.
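As a sketch of the list-of-two-floats approach described above (the type name example.complex64 and the ComplexType class are hypothetical illustrations, not arcae's actual implementation or a canonical Arrow type), a complex extension type could be defined in pyarrow as follows:

```python
import numpy as np
import pyarrow as pa

class ComplexType(pa.ExtensionType):
    """Hypothetical complex64 extension type stored as fixed_size_list<float32>[2]."""

    def __init__(self):
        super().__init__(pa.list_(pa.float32(), 2), "example.complex64")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registering the type lets other pyarrow-aware readers recognise it on round-trip
pa.register_extension_type(ComplexType())

# Interleave (real, imag) pairs into the fixed-size-list storage array
values = np.array([1 + 2j, 3 - 4j, 5 + 6j], dtype=np.complex64)
storage = pa.FixedSizeListArray.from_arrays(pa.array(values.view(np.float32)), 2)
complex_array = pa.ExtensionArray.from_storage(ComplexType(), storage)
print(complex_array.type)
```

Applications that do not recognise the extension type simply see the underlying fixed-size-list storage, which is one reason canonical extension types are preferable to ad-hoc ones.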
Disk formats
Additionally, Arrow Tables, composed of many Arrow columns, can be represented on disk by multiple formats. Here we focus on Datasets: a hierarchical directory structure of Parquet files.
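As a brief sketch of this on-disk layout (directory and file names here are arbitrary), a Table can be written out as Parquet files and the containing directory opened as a single logical Dataset with pyarrow:

```python
import os
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({
    "FIELD_ID": pa.array([0, 0, 1], type=pa.int32()),
    "TIME": pa.array([1.0, 2.0, 3.0], type=pa.float64()),
})

# A Dataset is just a directory of Parquet files
os.makedirs("demo_dataset", exist_ok=True)
pq.write_table(table, "demo_dataset/part-0.parquet")

# Open the whole directory as one Dataset and push a filter down to the files
dataset = ds.dataset("demo_dataset", format="parquet")
print(dataset.to_table(filter=ds.field("FIELD_ID") == 0))
```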
arcae
The purpose of this library is to represent the Measurement Set v2.0 (and v3.0) specification in the Apache Arrow Table format through a C++ conversion layer with additional Python bindings.
This re-expresses some Python code originally developed in dask-ms but also achieves some other important goals:
Parquet Datasets and therefore in a variety of Storage Engines. These engines can provide data for ingestion by Execution Engines which can be used by data scientists for data analysis.
Implementation Notes
Edge cases
The CTDS is very flexible and some edge cases are not yet handled:
- The `SOURCE::SOURCE_MODEL` column. A simple solution might be to represent this data type as a JSON-encoded string (a sketch follows this list).
- `SPECTRAL_WINDOW::ASSOC_SPW_ID` (optional)
- `SPECTRAL_WINDOW::NATURE` (optional)
- `HISTORY::CLI_COMMAND`
- `HISTORY::APP_PARAMS`
- `OBSERVATION::LOG`
- `OBSERVATION::SCHEDULE`
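As a rough sketch of the JSON-encoded string idea for record-valued data (the record contents below are invented for illustration and do not reflect the actual SOURCE_MODEL schema):

```python
import json
import pyarrow as pa

# Invented example records standing in for arbitrary CTDS record data
records = [
    {"type": "point", "flux": [1.0, 0.0, 0.0, 0.0]},
    None,  # missing rows simply become nulls
    {"type": "gaussian", "flux": [0.5, 0.0, 0.0, 0.0], "major": 2.3},
]

# Encode each record as a JSON string; the column is then an ordinary Arrow string column
source_model = pa.array(
    [json.dumps(r) if r is not None else None for r in records],
    type=pa.string(),
)

# Decoding on read is the reverse operation
decoded = [json.loads(s) if s is not None else None for s in source_model.to_pylist()]
```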
Proposed Arrow Measurement Set Dataset Structure
A typical Arrow Dataset is a collection of parquet files stored in a hierarchical directory structure. The following directory structure follows a Hive partitioning scheme where data is grouped in directories by a combination of unique field and data descriptor:
Based on this, we propose the following general directory structure for storing a Measurement Set as a directory of Arrow Datasets:
It is therefore simply a directory of Arrow Datasets, one for the MAIN table and each sub-table. A `metadata.parquet` in the root directory could be useful for storing metadata describing the entire dataset -- a list of subtables, for example.
The MAIN table and sub-tables are free to specify their own individual partitioning schemes, although this is probably only useful for the MAIN table.
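As a sketch of how such a Hive-partitioned MAIN table could be written with pyarrow (the output path, the column values and the choice of FIELD_ID/DATA_DESC_ID as partition keys follow the description above but are otherwise illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# A tiny stand-in for MAIN table data
main = pa.table({
    "FIELD_ID": pa.array([0, 0, 1], type=pa.int32()),
    "DATA_DESC_ID": pa.array([0, 1, 0], type=pa.int32()),
    "TIME": pa.array([1.0, 2.0, 3.0], type=pa.float64()),
})

# Hive partitioning produces FIELD_ID=<n>/DATA_DESC_ID=<m>/ directories of parquet files
partitioning = ds.partitioning(
    pa.schema([("FIELD_ID", pa.int32()), ("DATA_DESC_ID", pa.int32())]),
    flavor="hive",
)
ds.write_dataset(main, "example.arrow/MAIN", format="parquet", partitioning=partitioning)

# Reading the partitioned dataset back recovers the partition columns from the paths
dataset = ds.dataset("example.arrow/MAIN", format="parquet", partitioning="hive")
print(dataset.to_table())
```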