
[PROPOSAL]: Support indexes on nested arrays of struct #372

andrei-ionescu opened this issue Mar 3, 2021 · 0 comments

Problem Statement

Hyperspace does NOT support indexing over columns/fields of type array of struct.

Background and Motivation

There are use cases where queries like the ones below can benefit greatly, performance-wise, from Hyperspace's indexing.

df1.filter(array_contains(df1("nested.cnt"), 11)).show
df1.join(df2, array_contains(df1("nested.cnt"), df2("rank")))

Proposed Solution

This design builds on the work in proposals #347 and #365.

Given the following dataset

root
 |-- id: string (nullable = true)
 |-- v: integer (nullable = true)
 |-- nested: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- cnt: integer (nullable = true)
+---+---+--------------------+
|id |v  |nested              |
+---+---+--------------------+
|id2|100|[[b1, 1]]           |
|id1|100|[[a1, 1], [a2, 2]]  |
|id3|110|[[c1, 11], [c2, 12]]|
+---+---+--------------------+
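
For reference, a dataset with this shape could be constructed as follows (a minimal sketch; the case class names and an active SparkSession named spark are assumptions, not part of the proposal):

import spark.implicits._

// Illustrative helper types matching the schema above.
case class Nested(id: String, cnt: Int)
case class Record(id: String, v: Int, nested: Array[Nested])

val df1 = Seq(
  Record("id2", 100, Array(Nested("b1", 1))),
  Record("id1", 100, Array(Nested("a1", 1), Nested("a2", 2))),
  Record("id3", 110, Array(Nested("c1", 11), Nested("c2", 12)))
).toDF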

Anyone should be able to create an index with:

hs.createIndex(
  nestedDataset, 
  IndexConfig(
    "idx_arr_nested", 
    indexedColumns = Seq("nested.cnt"), 
    includedColumns = Seq("id")))

Alternatives

N/A

Known/Potential Compatibility Issues

N/A

Design

Adding support for nested array-of-struct fields impacts the following areas:

  • Validate nested column names
  • Modify the create index action
  • Modify the filter and rule index functions

Creating the index

As shown above, the index can be created with the existing Hyperspace APIs.

⚠️ Important ⚠️

It will NOT be possible to support multiple fields from a nested array due to the way Spark currently works. This limitation exists because arrays need to be exploded to build the proper index, and Spark allows only one generator per select clause.
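
To illustrate, a query like the one below (hypothetical, reusing the sample schema above to index both nested.cnt and nested.id) would be rejected by Spark at analysis time:

import org.apache.spark.sql.functions.explode

// Sketch only: two generators in one select clause are not allowed,
// so Spark fails this with an AnalysisException.
df1.select(
  explode(df1("nested.cnt")).as("nested__cnt"),
  explode(df1("nested.id")).as("nested__id"))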

Under the hood the index may be created with something like this:

df1.select(explode(df1("nested.cnt")).as("nested__cnt"), df1("id"))

The resulting DataFrame will be:

root
 |-- nested__cnt: integer (nullable = true)
 |-- id: string (nullable = true)
+-----------+---+
|nested__cnt| id|
+-----------+---+
|          1|id2|
|          1|id1|
|          2|id1|
|         11|id3|
|         12|id3|
+-----------+---+

It is important to understand that the indexed column is stored in the index as a non-nested (top-level) column. Because Parquet does not handle a . (dot) in a field name well, the field has to be properly renamed when the index is written, and projected back under its original nested name at query time. This will be based on the renaming pattern already implemented in #365.
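
As a rough illustration of that pattern (the actual helpers live in #365; these names are hypothetical), the mapping could look like this:

// Hypothetical sketch of the renaming pattern:
// "nested.cnt" -> "nested__cnt" when writing the index,
// "nested__cnt" -> "nested.cnt" when projecting back at query time.
def toIndexFieldName(nestedName: String): String = nestedName.replace(".", "__")
def toNestedFieldName(indexFieldName: String): String = indexFieldName.replace("__", ".")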

Search query

Given the following search/filter query

df1.filter(array_contains(df1("nested.cnt"), 11)).select("id")
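
As a rough illustration (not the finalized rewrite), this filter could be answered from the flattened index, assuming an index DataFrame indexDF with the (nested__cnt, id) layout shown above:

// Sketch only: read the index data instead of exploding df1 at query time.
indexDF
  .filter(indexDF("nested__cnt") === 11)
  .select("id")
  .distinct()  // exploding can duplicate rows per array element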

More to come...

Join Queries

TBD
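
Purely as an illustration while the join design is TBD, the join from the motivation section might be rewritten against the same flattened index, provided the query only touches indexed and included columns:

// Sketch only, assuming indexDF has the (nested__cnt, id) layout from above.
indexDF.join(df2, indexDF("nested__cnt") === df2("rank"))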

Implementation

  1. Add support for creating indexes over arrays of struct.
  2. Support search/filter queries + hybrid scans.
  3. Support join queries + hybrid scans.

Performance Implications (if applicable)

N/A
