
[PROPOSAL]: Support indexes on nested arrays of struct #372

andrei-ionescu opened this issue Mar 3, 2021 · 0 comments

Problem Statement

Hyperspace does NOT support indexing over columns/fields of type array of struct.

Background and Motivation

There are use cases where queries like the ones below can benefit greatly, performance-wise, from Hyperspace's indexing.

df1.filter(array_contains(df1("nested.cnt"), 11)).show
df1.join(df2, array_contains(df1("nested.cnt"), df2("rank")))

Proposed Solution

This design builds on the work in proposals #347 and #365.

Given the following dataset

root
 |-- id: string (nullable = true)
 |-- v: integer (nullable = true)
 |-- nested: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- cnt: integer (nullable = true)
+---+---+--------------------+
|id |v  |nested              |
+---+---+--------------------+
|id2|100|[[b1, 1]]           |
|id1|100|[[a1, 1], [a2, 2]]  |
|id3|110|[[c1, 11], [c2, 12]]|
+---+---+--------------------+
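
For reference, a dataset with this shape could be constructed as follows (a minimal sketch; the case class names and an active SparkSession named spark are assumptions, not part of the proposal):

import spark.implicits._

// Illustrative helper types matching the schema above.
case class Nested(id: String, cnt: Int)
case class Record(id: String, v: Int, nested: Array[Nested])

val df1 = Seq(
  Record("id2", 100, Array(Nested("b1", 1))),
  Record("id1", 100, Array(Nested("a1", 1), Nested("a2", 2))),
  Record("id3", 110, Array(Nested("c1", 11), Nested("c2", 12)))
).toDF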

Anyone should be able to create an index with:

hs.createIndex(
  nestedDataset, 
  IndexConfig(
    "idx_arr_nested", 
    indexedColumns = Seq("nested.cnt"), 
    includedColumns = Seq("id")))

Alternatives

N/A

Known/Potential Compatibility Issues

N/A

Design

Adding support for nested array-of-struct fields impacts the following areas:

  • Validate nested column names
  • Modify the create index action
  • Modify the filter and rule index functions

Creating the index

As shown above, the index can be created with the existing Hyperspace APIs.

⚠️ Important ⚠️

It will NOT be possible to support multiple fields from a nested array due to the way Spark currently works. This limitation exists because arrays need to be exploded to build the proper index, and Spark allows only one generator per select clause.
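
To illustrate, a query like the one below (hypothetical, reusing the sample schema above to index both nested.cnt and nested.id) would be rejected by Spark at analysis time:

import org.apache.spark.sql.functions.explode

// Sketch only: two generators in one select clause are not allowed,
// so Spark fails this with an AnalysisException.
df1.select(
  explode(df1("nested.cnt")).as("nested__cnt"),
  explode(df1("nested.id")).as("nested__id"))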

Under the hood the index may be created with something like this:

df1.select(explode(df1("nested.cnt")).as("nested__cnt"), df1("id"))

The resulting DataFrame will be:

root
 |-- nested__cnt: integer (nullable = true)
 |-- id: string (nullable = true)
+-----------+---+
|nested__cnt| id|
+-----------+---+
|          1|id2|
|          1|id1|
|          2|id1|
|         11|id3|
|         12|id3|
+-----------+---+

It is important to understand that the indexed column is stored in the index as a non-nested (top-level) column. Because Parquet does not handle a . (dot) in a field name well, the field has to be properly renamed when the index is written, and projected back under its original nested name at query time. This will be based on the renaming pattern already implemented in #365.
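
As a rough illustration of that pattern (the actual helpers live in #365; these names are hypothetical), the mapping could look like this:

// Hypothetical sketch of the renaming pattern:
// "nested.cnt" -> "nested__cnt" when writing the index,
// "nested__cnt" -> "nested.cnt" when projecting back at query time.
def toIndexFieldName(nestedName: String): String = nestedName.replace(".", "__")
def toNestedFieldName(indexFieldName: String): String = indexFieldName.replace("__", ".")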

Search query

Given the following search/filter query

df1.filter(array_contains(df1("nested.cnt"), 11)).select("id")
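
As a rough illustration (not the finalized rewrite), this filter could be answered from the flattened index, assuming an index DataFrame indexDF with the (nested__cnt, id) layout shown above:

// Sketch only: read the index data instead of exploding df1 at query time.
indexDF
  .filter(indexDF("nested__cnt") === 11)
  .select("id")
  .distinct()  // exploding can duplicate rows per array element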

More to come...

Join Queries

TBD
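
Purely as an illustration while the join design is TBD, the join from the motivation section might be rewritten against the same flattened index, provided the query only touches indexed and included columns:

// Sketch only, assuming indexDF has the (nested__cnt, id) layout from above.
indexDF.join(df2, indexDF("nested__cnt") === df2("rank"))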

Implementation

  1. Add support for creating indexes over arrays of struct.
  2. Support search/filter queries + hybrid scans.
  3. Support join queries + hybrid scans.

Performance Implications (if applicable)

N/A
