Enhancements to user interface when using QL with row MultiIndex #152

Draft: wants to merge 10 commits into base: develop

Conversation

ilumsden
Collaborator

@ilumsden ilumsden commented Nov 8, 2024

In #76, I added a new multi_index_mode parameter to GraphFrame.filter and the query language so that queries can be applied to GraphFrames with a row MultiIndex. Since then, however, there has been a lot of confusion about the parameter, what it does, and how to use it, especially in Thicket.

This PR improves the naming, simplifies default use, and extends the functionality of this feature. More specifically, this PR does 4 things:

  1. Renames multi_index_mode to predicate_row_aggregator, which more clearly indicates that the argument is used to aggregate per-row outputs from predicates
  2. Expands the set of values accepted by predicate_row_aggregator
  3. Adds a new mechanism that allows the query classes (i.e., Query, ObjectQuery, StringQuery) to define a default aggregator
  4. Moves logic for applying aggregators to QueryEngine, which allows us to bypass all of this if we don't have a row MultiIndex
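The fourth item can be pictured with a small pandas-only sketch. The `evaluate` function below is a hypothetical stand-in for the engine-side step, not Hatchet's actual QueryEngine code, and the string node labels are placeholders for real graph nodes. It only shows that the aggregation path can be skipped whenever the index is not a MultiIndex.

```python
import pandas as pd


def evaluate(df, node, predicate, aggregator=None):
    """Hypothetical engine-side step: return one boolean for `node`."""
    if not isinstance(df.index, pd.MultiIndex):
        # Single-level index: exactly one row per node, so no aggregation.
        return bool(predicate(df.loc[node]))
    # Row MultiIndex: several rows per node, so the per-row results
    # must be reduced to a single boolean.
    rows = df.xs(node, level=0)
    return bool(aggregator(predicate(rows)))


slow = lambda r: r["time"] > 4.0

flat = pd.DataFrame({"time": [5.0, 1.0]}, index=["main", "solve"])
multi = pd.DataFrame(
    {"time": [5.0, 7.0, 1.0, 9.0]},
    index=pd.MultiIndex.from_tuples(
        [("main", 0), ("main", 1), ("solve", 0), ("solve", 1)],
        names=["node", "profile"],
    ),
)

flat_main = evaluate(flat, "main", slow)
multi_solve_all = evaluate(multi, "solve", slow, lambda s: s.all())
multi_solve_any = evaluate(multi, "solve", slow, lambda s: s.any())
```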

With this PR, the predicate_row_aggregator argument now accepts the following:

  • None: tells Hatchet to use the default aggregator for the type of query
  • "off": tells Hatchet to not use any aggregators (note: this will result in errors if there is a row MultiIndex)
  • "all": applies an aggregator that returns true if and only if the predicate returned true for all rows associated with a node
  • "any": applies an aggregator that returns true if the predicate returned true for any row associated with a node
  • Callable that takes a pandas.Series of booleans as input and returns a boolean as output: applies the user-provided function as an aggregator
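As a rough illustration of the values listed above, the sketch below maps each one to a callable over a pandas.Series of booleans. `resolve_aggregator` is a hypothetical helper written for this comment, not a function from Hatchet, and it assumes that None has already been replaced by the query's default before it is called.

```python
import pandas as pd


def resolve_aggregator(spec):
    """Hypothetical helper: turn an aggregator value into a callable.

    Returns None for "off", meaning no aggregation is applied.
    """
    if spec == "off":
        return None
    if spec == "all":
        return lambda rows: bool(rows.all())
    if spec == "any":
        return lambda rows: bool(rows.any())
    if callable(spec):
        return spec
    raise ValueError(f"unsupported aggregator: {spec!r}")


# Per-row predicate results for one node (for example, one row per profile).
rows = pd.Series([True, False, True])

all_result = resolve_aggregator("all")(rows)
any_result = resolve_aggregator("any")(rows)

# A user-provided aggregator: true when more than half of the rows matched.
majority = lambda s: bool(s.mean() > 0.5)
majority_result = resolve_aggregator(majority)(rows)
```

A custom callable such as `majority` is useful when neither "all" nor "any" expresses the intended rule.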

When using predicate_row_aggregator=None, the aggregators used will be:

  • "off" if using a base syntax query (corresponds to the Query class)
  • "all" if using an object or string dialect query (corresponds to the ObjectQuery and StringQuery classes)
  • the default aggregators for each subquery if using a compound query
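The default rules above can be summarized as a lookup. This is a sketch under assumptions: the table and the `effective_aggregator` and `compound_defaults` helpers are hypothetical and keyed by class name purely for illustration; the PR's classes may expose their defaults differently.

```python
# Hypothetical table of defaults, keyed by the query class names used above.
DEFAULT_AGGREGATOR = {
    "Query": "off",        # base syntax
    "ObjectQuery": "all",  # object dialect
    "StringQuery": "all",  # string dialect
}


def effective_aggregator(query_kind, predicate_row_aggregator=None):
    """Return the aggregator to use for one (non-compound) query."""
    if predicate_row_aggregator is None:
        return DEFAULT_AGGREGATOR[query_kind]
    return predicate_row_aggregator


def compound_defaults(subquery_kinds):
    """For a compound query, each subquery keeps its own default."""
    return [DEFAULT_AGGREGATOR[kind] for kind in subquery_kinds]


base_default = effective_aggregator("Query")
string_default = effective_aggregator("StringQuery")
explicit = effective_aggregator("StringQuery", "any")
mixed = compound_defaults(["Query", "StringQuery"])
```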

…o-understand predicate_row_aggregator argument
@ilumsden ilumsden added area-query-lang Issues and PRs related to Hatchet's query language priority-normal Normal priority issues and PRs status-work-in-progress PR is currently being worked on type-feature Requests for new features or PRs which implement new features type-internal-cleanup PR or issues related to the structure of the codebase, directories and refactors labels Nov 8, 2024
@ilumsden ilumsden self-assigned this Nov 8, 2024
@ilumsden
Collaborator Author

To clarify, we need multi_index_mode/predicate_row_aggregator because the graph-algorithm part of the query language requires each predicate to return a single boolean per node. Without a row MultiIndex (i.e., the standard case for Hatchet), this requirement is always satisfied. With a row MultiIndex (i.e., the standard case for Thicket), it is never satisfied, because the DataFrame has multiple rows per node; as a result, predicates return a pandas.Series of booleans. The multi_index_mode/predicate_row_aggregator argument provides a mechanism to aggregate that Series of booleans into a single boolean.
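A small pandas-only sketch of the situation described above. The column name, the threshold, and the string node labels are made up for illustration; in a real GraphFrame or Thicket the first index level holds graph nodes.

```python
import pandas as pd

# Two nodes, two profiles each: a row MultiIndex as in the Thicket case.
index = pd.MultiIndex.from_tuples(
    [("main", 0), ("main", 1), ("solve", 0), ("solve", 1)],
    names=["node", "profile"],
)
df = pd.DataFrame({"time": [5.0, 7.0, 1.0, 9.0]}, index=index)

# A predicate that compares a metric against a threshold.
predicate = lambda rows: rows["time"] > 4.0

results = {}
for node in df.index.get_level_values("node").unique():
    rows = df.xs(node, level="node")
    per_row = predicate(rows)  # a Series of booleans, one entry per profile
    # Reduce the Series to the single boolean the graph algorithm needs.
    results[node] = (bool(per_row.all()), bool(per_row.any()))

print(results)
```

For "solve", only one of the two profiles exceeds the threshold, so "all" rejects the node while "any" accepts it.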

@michaelmckinsey1
Collaborator

Here is an example of where this aggregation argument is relevant. The following base-syntax query matches nodes named "my_node"; aggregation does not need to be specified because the predicate calls .all() itself:

query = th.query.Query().match(
    "*",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)
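To see why the explicit .all() is enough here, the predicate from the query above can be run on its own against a stand-in DataFrame. This sketch assumes that, with a row MultiIndex, a base-syntax predicate receives all rows belonging to one node; the `rows` frame and its "profile" index are invented for illustration.

```python
import pandas as pd

# Stand-in for the rows of a single node across two profiles.
rows = pd.DataFrame(
    {"name": ["my_node", "my_node"]},
    index=pd.Index([0, 1], name="profile"),
)

# Same predicate as in the query above.
predicate = lambda row: row["name"].apply(lambda tn: tn == "my_node").all()

# Every profile reports the expected name, so the predicate holds.
matched = bool(predicate(rows))

# If any profile reports a different name, .all() makes the predicate fail.
rows.loc[1, "name"] = "other"
matched_after_change = bool(predicate(rows))
```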

The equivalent string-dialect query, where specifying aggregation is necessary:

query = """
MATCH ("*")->(n) WHERE n."name"="my_node"
"""
filt = tk.query(query, predicate_row_aggregator="all")

@michaelmckinsey1
Collaborator

Matching a single node with the name "my_node":

query = th.query.Query().match(
    1,
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)

or

query = th.query.Query().match(
    ".",
    lambda row: row["name"].apply(
        lambda tn: tn == "my_node"
    ).all()
)
tkq = tk.query(query)
