Enhancements to user interface when using QL with row MultiIndex #152

ilumsden · 2024-11-08T03:09:13Z

In #76, I added a new multi_index_mode parameter to GraphFrame.filter and the query language to allow us to apply queries to GraphFrames that have a row MultiIndex. However, since then, there's been a lot of confusion about the parameter, what it does, and how to use it, especially in Thicket.

This PR improves the naming, simplifies the default use, and enhances the functionality of this feature. More specifically, this PR does four things:

Renames multi_index_mode to predicate_row_aggregator, which more clearly indicates that the argument is used to aggregate per-row outputs from predicates
Expands the set of values accepted by predicate_row_aggregator
Adds a new mechanism that allows the query classes (i.e., Query, ObjectQuery, StringQuery) to define a default aggregator
Moves logic for applying aggregators to QueryEngine, which allows us to bypass all of this if we don't have a row MultiIndex

With this PR, the predicate_row_aggregator argument now accepts the following:

None: tells Hatchet to use the default aggregator for the type of query
"off": tells Hatchet to not use any aggregators (note: this will result in errors if there is a row MultiIndex)
"all": applies an aggregator that returns true if and only if the predicate returned true for all rows associated with a node
"any": applies an aggregator that returns true if the predicate returned true for any row associated with a node
Callable that takes a pandas.Series of booleans as input and returns a boolean as output: applies the user-provided function as an aggregator
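As a rough illustration of the accepted values listed above, the sketch below maps each one to a plain function over a pandas.Series of booleans. The helper name resolve_aggregator and its default parameter are hypothetical and exist only for this example; this is not Hatchet's actual implementation.

```python
import pandas as pd


def resolve_aggregator(predicate_row_aggregator, default="all"):
    """Hypothetical helper (not part of Hatchet): turn an accepted value
    of predicate_row_aggregator into a callable, or None for "off"."""
    if predicate_row_aggregator is None:
        # None means "use the default for this type of query";
        # the default is passed in explicitly in this sketch.
        predicate_row_aggregator = default
    if predicate_row_aggregator == "off":
        return None
    if predicate_row_aggregator == "all":
        return lambda rows: bool(rows.all())
    if predicate_row_aggregator == "any":
        return lambda rows: bool(rows.any())
    if callable(predicate_row_aggregator):
        return predicate_row_aggregator
    raise ValueError(
        "Unsupported aggregator: {!r}".format(predicate_row_aggregator)
    )


# One boolean per row associated with a single node.
rows = pd.Series([True, False, True])

all_result = resolve_aggregator("all")(rows)
any_result = resolve_aggregator("any")(rows)
# A user-provided callable: true if more than half of the rows matched.
majority_result = resolve_aggregator(lambda s: bool(s.mean() > 0.5))(rows)
```

A user-provided callable, like the majority rule here, only has to reduce the Series to one boolean.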

When using predicate_row_aggregator=None, the aggregators used will be:

"off" if using a base syntax query (corresponds to the Query class)
"all" if using an object or string dialect query (corresponds to the ObjectQuery and StringQuery classes)
the default aggregators for each subquery if using a compound query
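The default selection described above can be sketched as a simple lookup. The dictionary, the kind labels ("base", "object", "string"), and the function name are all hypothetical. The sketch also assumes that an explicit value applies to every subquery of a compound query, which this description does not state.

```python
# Hypothetical lookup, for illustration only: default aggregator
# per kind of query, as listed in this PR description.
DEFAULT_AGGREGATORS = {
    "base": "off",    # Query
    "object": "all",  # ObjectQuery
    "string": "all",  # StringQuery
}


def effective_aggregators(subquery_kinds, predicate_row_aggregator=None):
    """Return the aggregator used by each subquery of a compound query.

    Assumption (not confirmed by the PR): an explicit value is applied
    to every subquery, while None lets each subquery use its own default.
    """
    if predicate_row_aggregator is not None:
        return [predicate_row_aggregator for _ in subquery_kinds]
    return [DEFAULT_AGGREGATORS[kind] for kind in subquery_kinds]


# A compound query made of a base-syntax query and a string-dialect query.
defaults = effective_aggregators(["base", "string"])
forced = effective_aggregators(["base", "string"], "any")
```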


ilumsden · 2024-11-13T18:08:02Z

To clarify, we need multi_index_mode/predicate_row_aggregator because the graph-algorithm part of the query language needs predicates to provide a single boolean for each node. When we do not have a row MultiIndex (i.e., the standard case for Hatchet), this requirement is always satisfied. However, when we do have a row MultiIndex (i.e., the standard case for Thicket), this requirement is never satisfied because the DataFrame has multiple rows per node. As a result, predicates return a pandas.Series of booleans when we have a row MultiIndex. The multi_index_mode/predicate_row_aggregator argument provides a mechanism to aggregate that Series of booleans into a single boolean.
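The problem can be reproduced with pandas alone. The sketch below uses a made-up DataFrame with a (node, profile) row MultiIndex; the column name, index names, and values are invented for illustration, and node names stand in for real node objects. It only mimics the aggregation step and does not call Hatchet or Thicket.

```python
import pandas as pd

# Two rows (one per profile) for each node, as in a Thicket-style DataFrame.
index = pd.MultiIndex.from_tuples(
    [
        ("main", "profile_a"),
        ("main", "profile_b"),
        ("solve", "profile_a"),
        ("solve", "profile_b"),
    ],
    names=["node", "profile"],
)
df = pd.DataFrame({"time": [1.0, 6.0, 7.0, 9.0]}, index=index)


def predicate(rows):
    # With several rows per node, this yields a Series of booleans,
    # not the single boolean that the graph traversal needs.
    return rows["time"] > 5.0


per_node_all = {}
per_node_any = {}
for node, rows in df.groupby(level="node"):
    matches = predicate(rows)
    # Aggregating the per-row results gives one boolean per node.
    per_node_all[node] = bool(matches.all())
    per_node_any[node] = bool(matches.any())
```

Here "main" matches under "any" but not under "all", because only one of its two rows satisfies the predicate.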

michaelmckinsey1 · 2024-11-13T23:03:28Z

Here is an example of where this aggregation argument is relevant. The following base-syntax query matches nodes named "my_node"; aggregation does not need to be specified because the predicate calls .all() itself:

query = th.query.Query().match(
    "*",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

Equivalent string syntax query, where specifying aggregation is necessary:

query = """
MATCH ("*")->(n) WHERE n."name"="my_node"
"""
filt = tk.query(query, predicate_row_aggregator="all")

michaelmckinsey1 · 2024-11-15T17:36:55Z

Matching a single node with name my_node

query = th.query.Query().match(
    1,
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

or

query = th.query.Query().match(
    ".",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

Replaces multi_index_mode in QL with a more customizable and easier-to-understand predicate_row_aggregator argument

12ed4d1

ilumsden self-assigned this Nov 8, 2024

ilumsden added 9 commits November 7, 2024 22:17

Fixes unit tests

f65fb5b

Formatting

8e07660

Removes MultiIndexModeMismatch

86da99c

Fixes logic for handling string values of predicate_row_aggregator

7ad292e

Formatting

9159da0

Fixes a condition to properly parse the default aggregators

3be6276

Fixes a few testing bugs

0c5c52b

Formatting

cae50b5

Restores special logic for multi-index in the string dialect

bad0410
