-
casa-arrow is a package that translates CASA Tables (CTDS) into Arrow Tables at the C++ layer. We believe it will be useful to the greater Radio Astronomy community. The discussion above proposes implementing the MSv2.0/3.0 specification (via Arrow Tables) as a Parquet Dataset: a hierarchical directory structure of parquet files. Given the care needed when defining formats, this discussion exists to invite feedback. The listing by organisation below includes parties that I believe may be interested. It is not exhaustive: please forward this to anyone else who might benefit from it:
- ASTRON: @aroffringa
- JIVE
- NRAO: @Jan-Willem
- Harvard: @lindyblackburn
- BASP Group, EPFL: @etolley
- SARAO: @o-smirnov
- SKAO
-
/cc @mserylak
-
/cc @pkgw
-
/cc @IanHeywood and @david-macmahon
-
/cc @mreineck
-
Hi Simon,
-
This discussion proposes representing the Measurement Set v2.0 specification as a collection of Apache Arrow Datasets.
Other formats
This discussion does not concern itself with other cloud-native formats, most notably zarr.
Both SARAO and NRAO have investigated the use of zarr as an alternate format for MSv2.0/3.0. zarr can be faster than parquet, as demonstrated in other domains. However, CASA Tables are tabular by definition, and this format lends itself to use with SQL and, by implication, downstream query and data analytics engines which operate on Apache parquet files. zarr does not yet support this integration. This does not mean use of zarr should be discounted -- we believe both formats have benefits in different scenarios.
Background: The CASA Table Data System
The CASA Table Data System (CTDS) is a bespoke Radio Astronomy format for storing arrays (columns) of data in a relational, tabular on-disk database. It is a Columnar Database complete with its own Table Query Language (TAQL) SQL dialect, and is primarily used for storing and querying raw visibilities of Radio Interferometry data in the Measurement Set v2.0 specification. CTDS was implemented during the late 1990s/early 2000s, before:
CTDS is highly configurable and can store multiple columns in a single file or across multiple files, but in practice a number of concerns arise:
All of the above factors make the CTDS difficult to use in a modern, distributed and cloud computing paradigm.
Various strategies have been developed to ameliorate the above concerns, including:
While the CTDS, with the above modifications, can still process Radio Interferometric data and may be extended to do so in future, we believe that significant future effort can be avoided by representing Radio Astronomy data with widely used Data Engineering formats, particularly those within the Apache Arrow ecosystem.
Apache Arrow
At the heart of the Apache Arrow project is a specification for in-memory columnar data.
Similarly to CASA Tables, Arrow Tables are composed of a set of columns. This allows data to be produced and consumed by multiple languages including C++, Rust, Python and Julia, and therefore to be consumed by data processing and scientific software within those ecosystems.
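As a minimal sketch of this columnar model (the column names and values below are illustrative, loosely modelled on MSv2.0 MAIN table columns, and are not part of any proposed schema), an Arrow Table can be built from named columns with pyarrow:

```python
import pyarrow as pa

# Illustrative columns loosely modelled on MSv2.0 MAIN table columns
table = pa.table({
    "TIME": pa.array([4.86e9, 4.86e9, 4.86e9], type=pa.float64()),
    "ANTENNA1": pa.array([0, 0, 1], type=pa.int32()),
    "ANTENNA2": pa.array([1, 2, 2], type=pa.int32()),
})

print(table.schema)
print(table.num_rows)
```

The same in-memory layout can then be handed to any other Arrow implementation (C++, Rust, Julia, etc.) without copying or re-encoding.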
Aside from cross-language support, formats in the Apache Arrow ecosystem are understood by both Storage and Execution Engines, which enable large scale, distributed data analytics. Examples of Storage Engines include:
while examples of Execution Engines include:
Additionally, Arrow Tables are convertible to Dataframes for consumption by Dataframe frameworks such as:
Such frameworks allow data scientists to easily manipulate data.
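As a small illustration of the Dataframe conversion (the table construction is repeated only to keep the snippet self-contained; pandas is assumed to be installed):

```python
import pyarrow as pa

table = pa.table({
    "ANTENNA1": pa.array([0, 0, 1], type=pa.int32()),
    "ANTENNA2": pa.array([1, 2, 2], type=pa.int32()),
})

# Hand the columnar data to a Dataframe framework (pandas here)
df = table.to_pandas()
print(df.groupby("ANTENNA1").size())
```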
Flexible Data Types
Apache Arrow supports many data types, which can be subdivided into two classes: primitive and nested. Primitive types include booleans, ints and floats, while nested types include lists, structs and maps. Extension types can be defined by flexibly combining the above type classes. This is important in the case of Radio Astronomy data as Arrow does not natively support complex numbers; however, they can be expressed as an extension type defined as a list of two floats. Similarly, tensors can be represented as a series of nested lists corresponding to the rank of the tensor.
At this time a PR for a canonical Tensor type is being added to Arrow:
and adding Complex Numbers as a canonical type is also a requested feature:
While libraries can freely implement their own extension types, canonical extension types are useful as other applications in the Arrow ecosystem are more likely to recognise them.
This flexible type system allows the expression of most CTDS data. Some edge cases are mentioned below.
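As a sketch of the list-of-two-floats approach described above (the type name example.complex64 and the ComplexType class are hypothetical illustrations, not arcae's actual implementation or a canonical Arrow type), a complex extension type could be defined in pyarrow as follows:

```python
import numpy as np
import pyarrow as pa

class ComplexType(pa.ExtensionType):
    """Hypothetical complex64 extension type stored as fixed_size_list<float32>[2]."""

    def __init__(self):
        super().__init__(pa.list_(pa.float32(), 2), "example.complex64")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registering the type lets other pyarrow-aware readers recognise it on round-trip
pa.register_extension_type(ComplexType())

# Interleave (real, imag) pairs into the fixed-size-list storage array
values = np.array([1 + 2j, 3 - 4j, 5 + 6j], dtype=np.complex64)
storage = pa.FixedSizeListArray.from_arrays(pa.array(values.view(np.float32)), 2)
complex_array = pa.ExtensionArray.from_storage(ComplexType(), storage)
print(complex_array.type)
```

Applications that do not recognise the extension type simply see the underlying fixed-size-list storage, which is one reason canonical extension types are preferable to ad-hoc ones.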
Disk formats
Additionally, Arrow Tables, composed of many Arrow columns, can be represented on disk by multiple formats. Here we focus on Datasets: a hierarchical directory structure of Parquet files.
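As a brief sketch of this on-disk layout (directory and file names here are arbitrary), a Table can be written out as Parquet files and the containing directory opened as a single logical Dataset with pyarrow:

```python
import os
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({
    "FIELD_ID": pa.array([0, 0, 1], type=pa.int32()),
    "TIME": pa.array([1.0, 2.0, 3.0], type=pa.float64()),
})

# A Dataset is just a directory of Parquet files
os.makedirs("demo_dataset", exist_ok=True)
pq.write_table(table, "demo_dataset/part-0.parquet")

# Open the whole directory as one Dataset and push a filter down to the files
dataset = ds.dataset("demo_dataset", format="parquet")
print(dataset.to_table(filter=ds.field("FIELD_ID") == 0))
```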
arcae
The purpose of this library is to represent the Measurement Set v2.0 (and v3.0) specification in the Apache Arrow Table format through a C++ conversion layer with additional Python bindings.
This re-expresses some Python code originally developed in dask-ms but also achieves some other important goals:
Parquet Datasets and therefore in a variety of Storage Engines. These engines can provide data for ingestion by Execution Engines which can be used by data scientists for data analysis.
Implementation Notes
Edge cases
The CTDS is very flexible and some edge cases are not yet handled:
- The `SOURCE::SOURCE_MODEL` column. A simple solution might be to represent this data type as a JSON-encoded string (a sketch follows this list).
- `SPECTRAL_WINDOW::ASSOC_SPW_ID` (optional)
- `SPECTRAL_WINDOW::NATURE` (optional)
- `HISTORY::CLI_COMMAND`
- `HISTORY::APP_PARAMS`
- `OBSERVATION::LOG`
- `OBSERVATION::SCHEDULE`
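As a rough sketch of the JSON-encoded string idea for record-valued data (the record contents below are invented for illustration and do not reflect the actual SOURCE_MODEL schema):

```python
import json
import pyarrow as pa

# Invented example records standing in for arbitrary CTDS record data
records = [
    {"type": "point", "flux": [1.0, 0.0, 0.0, 0.0]},
    None,  # missing rows simply become nulls
    {"type": "gaussian", "flux": [0.5, 0.0, 0.0, 0.0], "major": 2.3},
]

# Encode each record as a JSON string; the column is then an ordinary Arrow string column
source_model = pa.array(
    [json.dumps(r) if r is not None else None for r in records],
    type=pa.string(),
)

# Decoding on read is the reverse operation
decoded = [json.loads(s) if s is not None else None for s in source_model.to_pylist()]
```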
Proposed Arrow Measurement Set Dataset Structure
A typical Arrow Dataset is a collection of parquet files stored in a hierarchical directory structure. The following directory structure follows a Hive partitioning scheme where data is grouped in directories by a combination of unique field and data descriptor:
Based on this, we propose the following general directory structure for storing a Measurement Set as a directory of Arrow Datasets:
It is therefore simply a directory of Arrow Datasets, one for the MAIN table and each sub-table. A `metadata.parquet` in the root directory could be useful for storing metadata describing the entire dataset -- a list of subtables, for example.
The MAIN table and sub-tables are free to specify their own individual partitioning schemes, although this is probably only useful for the MAIN table.
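As a sketch of how such a Hive-partitioned MAIN table could be written with pyarrow (the output path, the column values and the choice of FIELD_ID/DATA_DESC_ID as partition keys follow the description above but are otherwise illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# A tiny stand-in for MAIN table data
main = pa.table({
    "FIELD_ID": pa.array([0, 0, 1], type=pa.int32()),
    "DATA_DESC_ID": pa.array([0, 1, 0], type=pa.int32()),
    "TIME": pa.array([1.0, 2.0, 3.0], type=pa.float64()),
})

# Hive partitioning produces FIELD_ID=<n>/DATA_DESC_ID=<m>/ directories of parquet files
partitioning = ds.partitioning(
    pa.schema([("FIELD_ID", pa.int32()), ("DATA_DESC_ID", pa.int32())]),
    flavor="hive",
)
ds.write_dataset(main, "example.arrow/MAIN", format="parquet", partitioning=partitioning)

# Reading the partitioned dataset back recovers the partition columns from the paths
dataset = ds.dataset("example.arrow/MAIN", format="parquet", partitioning="hive")
print(dataset.to_table())
```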