Query API #16

park-brian · 2024-06-25T18:47:15Z

This is where we'll discuss adding a Query API to allow users to efficiently retrieve the data they're looking for. Users can perform column chunk/page-level predicate pushdown manually using column and offset indexes, but this is quite labor-intensive and requires some knowledge of how parquet files work.

Ideally, users should be able to write simple queries that are analyzed and used to construct efficient query plans/predicate functions which evaluate column chunk and page statistics. We could also try to optimize data fetching, especially over the network, by making multirange queries (when supported or specified via a flag) and concatenating requests for (nearly) adjacent byte ranges.

User-defined predicate functions are another option (potentially just a lower-level API), which can be implemented with fairly little effort. This would allow users to define arbitrary predicate logic and would also be a valid way of implementing different query frontends (eg: the simple structured queries discussed earlier).

platypii · 2024-08-19T01:27:24Z

Sorry for the delay here @park-brian. I'm ready to start re-visiting pushdown predicates in hyparquet, and so I have merged the parse-indicies branch into master. Will include with the next published version.

I know you had some thoughts and maybe some additional experiments that you did. I would be very interested in ideas and pull requests related to more sophisticated query support.

park-brian · 2024-08-20T02:33:55Z

Thanks @platypii, this is great news! I've also been quite busy these days, so I apologize for not getting to follow up on this effort.

I can start by creating some utility functions which use predicate functions to select row groups based on their statistics. Since row group statistics and page statistics are similar, we should be able to use the same predicate to filter both.

Of course, I'll also want to fetch adjacent row groups/pages as well, depending on which columns the user requests. Sometimes, row groups/pages are not aligned so I'll want to take this into account as well.

park-brian · 2024-12-14T06:36:32Z

Hi @platypii, I've added a few examples of page-level predicate pushdown (the code is not the best, but outlines a working approach):
https://github.com/park-brian/hyparquet/blob/predicate-pushdown/test/indexes.pushdown.test.js

These are fairly primitive examples and require users to provide predicate functions, as well as assemble/read pages from over the network (which is why we added sliceAll to make efficient multi-range requests).

I've also made a small change to readColumn to allow reading all rows without needing to know how many rows there are in advance (this is especially useful when we just have raw headers + page data, and don't know how many records are in each page).

platypii · 2024-12-14T07:42:47Z

Very cool @park-brian!!

Since we last talked I've added some basic support for more efficient sorting via a parquetQuery function. I would like to add predicate support as well so that users can add thresholds for displaying data. This would be a great start for implementing that.

I'm not sure what the best js api to represent predicates is. I'm open to suggestions. Eventually I would like to be able to make SQL-like queries against parquet efficiently.

I am happy to merge in your query work, especially if you need some functions changed to support it (minor nit: I would prefer rowLimit = undefined or Infinity to mean no limit, not -1.)

park-brian · 2024-12-14T21:10:52Z

Thanks @platypii, I've created a small PR (#53) to support rowLimit == Infinity/null/undefined. I am also interested in adding predicate support to parquetQuery. A query interface could look like this:

const query = {id: 10};
// or potentially: const query = parseSql(`select * from data where id = 10`);  // returns: {id: 10}
const results = await parquetQuery({file, query, orderBy, offset, limit});

Here's a quick rundown of the relevant query syntax from the mongo docs:

// Basic Query
{ key: value }                      // Exact match

// Comparison
{ key: { $gt: value } }             // Greater than
{ key: { $gte: value } }            // Greater than or equal
{ key: { $lt: value } }             // Less than
{ key: { $lte: value } }            // Less than or equal
{ key: { $ne: value } }             // Not equal
{ key: { $in: [value1, value2] } }  // Matches any value in array
{ key: { $nin: [value1, value2] } } // Does not match values in array
{ key: { $gte: minValue, $lte: maxValue } } // Between (inclusive)

// Logical Operators
{ $and: [{ key1: "value1" }, { key2: "value2" }] }
{ $or: [{ key1: "value1" }, { key2: "value2" }] }
{ key: { $not: { $gt: value } } }   // Negates a query condition

platypii · 2024-12-15T18:51:04Z

Did some searching for existing json-based DSLs for querying, and I agree that the mongo syntax is the most well-known and is probably the way to go 👍

park-brian · 2024-12-18T21:20:29Z

The cool thing about this DSL is that it makes it easy to implement a post-processing filter (eg: we can just use: array.filter(doc => matchQuery(doc, query));. This lets us mock out an interface for users to get familiar with the query api.

function matchQuery(doc, query) {
  // Handle $and operator
  if (query.$and) {
    return query.$and.every(subQuery => matchQuery(doc, subQuery));
  }
  
  // Handle $or operator
  if (query.$or) {
    return query.$or.some(subQuery => matchQuery(doc, subQuery));
  }

  // Handle regular field queries
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    
    // Simple equality match
    if (condition !== null && typeof condition !== 'object') {
      return value === condition;
    }

    // Handle comparison operators
    return Object.entries(condition).every(([operator, target]) => {
      switch (operator) {
        case '$gt':
          return value > target;
        case '$gte':
          return value >= target;
        case '$lt':
          return value < target;
        case '$lte':
          return value <= target;
        case '$ne':
          return value !== target;
        case '$in':
          return Array.isArray(target) && target.includes(value);
        case '$nin':
          return Array.isArray(target) && !target.includes(value);
        case '$not':
          return !matchQuery({ [field]: value }, { [field]: target });
        default:
          return true;
      }
    });
  });
}

park-brian · 2024-12-20T21:20:38Z

Hey @platypii, I'll create a preliminary pull request adding the query options api to parquetQuery as a post-processing filter on the results set (this will eventually evolve into predicate pushdown).

platypii · 2024-12-22T00:03:47Z

Merged the filtering api and published v1.8.0. I didn't document it in the README in case we need to make changes. Consider it un-official launch of the query engine. Publishing should make it easier to test and iterate.

I would imagine next is a function that matches a query against each rowgroup and/or page, using statistics, so that they can be filtered out before fetching?

A couple other random thoughts:

Haven't tested this, but I'm guessing that Date types don't work with matchQuery right now, at least for $gte etc?
At some point I want a dryrun or plan option so that we can determine in advance if a query will require a huge amount of data. Not sure if this should be a flag or what the api should look like. It would be helpful for showing progress and advanced query planning.
I'm playing with a sqlToMongo function I wrote to make it easier to leverage this query engine.

park-brian · 2024-12-24T05:32:44Z

In theory, dates should work as long as they're read from the statistics as Dates (which are compatible with regular operators). I haven't had a chance to test this either, sorry!

I agree that we should emit the query plan (eg: the list of statistics objects we're applying the filters to at each level, and the decisions we make based on those statistics). There's a lot of room for optimization if we know certain columns are already semi/sorted in the data.

sqlToMongo sounds really interesting! Maybe down the line, this would be supported:
parquetQuery({sql: "select * from 'https://example.com/file.parquet' where id = 1" })

park-brian · 2024-12-24T05:37:19Z

We can even use tagged template literals for formatting parameters as objects

parquetQuery(sql`select * from '${url}' where id = ${id}`) // equivalent to the above ({sql: '...'})

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query API #16

Query API #16

park-brian commented Jun 25, 2024

platypii commented Aug 19, 2024

park-brian commented Aug 20, 2024

park-brian commented Dec 14, 2024

platypii commented Dec 14, 2024

park-brian commented Dec 14, 2024

platypii commented Dec 15, 2024

park-brian commented Dec 18, 2024

park-brian commented Dec 20, 2024

platypii commented Dec 22, 2024

park-brian commented Dec 24, 2024

park-brian commented Dec 24, 2024 •

edited

Loading

Query API #16

Query API #16

Comments

park-brian commented Jun 25, 2024

platypii commented Aug 19, 2024

park-brian commented Aug 20, 2024

park-brian commented Dec 14, 2024

platypii commented Dec 14, 2024

park-brian commented Dec 14, 2024

platypii commented Dec 15, 2024

park-brian commented Dec 18, 2024

park-brian commented Dec 20, 2024

platypii commented Dec 22, 2024

park-brian commented Dec 24, 2024

park-brian commented Dec 24, 2024 • edited Loading

park-brian commented Dec 24, 2024 •

edited

Loading