Optimise parquet read parameters #3928

Open
gaffer01 opened this issue Dec 13, 2024 · 1 comment · May be fixed by #4124

gaffer01 commented Dec 13, 2024

Background

Experimentation has suggested better defaults for parameters used when running queries, specifically the use of column indexes and the S3A readahead range. These optimisations reduce both the number of GETs and the time to return results. The reduction in the number of GETs is dramatic if there are a lot of columns in a table.

Description

When running queries we want to turn off the column indexes. We also want to set the readahead range to the size of the row group.

We can also turn off the use of column indexes when reading parquet files in Java compactions (this has already been done in the DataFusion-based compaction code).

We want to continue to write column indexes when we write parquet files, as external programs may want to read Sleeper's parquet files and use them.

Analysis

To turn off column indexes when running queries we can use the useColumnIndexFilter(false) option on new ParquetRecordReader.Builder(path, schema). We can have a table option to determine whether column indexes are used when performing queries. This should default to false.

We want to default the readahead range used for queries to the row group size. We already have a table property for the readahead range. Suggest we simply set this to the same default as the row group size.

We can explicitly set the use of column indexes to false when performing Java compactions. There seems to be no need to have this as an option.
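
The defaults proposed above can be sketched as plain configuration values. This is only an illustrative sketch, not Sleeper's actual code: the class name, method names and the 8 MiB row group size are hypothetical. The key `fs.s3a.readahead.range` is the Hadoop S3A setting, and `parquet.filter.columnindex.enabled` is assumed here to be the configuration-level equivalent of useColumnIndexFilter in parquet-java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryReadDefaults {
    // Hypothetical row group size for illustration only; not taken from Sleeper's source.
    static final long ASSUMED_ROW_GROUP_SIZE_BYTES = 8L * 1024 * 1024;

    // Settings for reading Parquet files during queries.
    // useColumnIndexes would come from a table option that defaults to false.
    static Map<String, String> queryReadConfig(boolean useColumnIndexes, long rowGroupSizeBytes) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.filter.columnindex.enabled", Boolean.toString(useColumnIndexes));
        // Readahead range given as a plain byte count, equal to the row group size.
        conf.put("fs.s3a.readahead.range", Long.toString(rowGroupSizeBytes));
        return conf;
    }

    // Settings for reading Parquet files during Java compactions.
    // Column indexes are always off here, with no option to change it.
    static Map<String, String> javaCompactionReadConfig() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("parquet.filter.columnindex.enabled", "false");
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(queryReadConfig(false, ASSUMED_ROW_GROUP_SIZE_BYTES));
        System.out.println(javaCompactionReadConfig());
    }
}
```

Writing column indexes is unaffected by any of these settings, since they only apply on the read path.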

@gaffer01 gaffer01 added the enhancement New feature or request label Dec 13, 2024
@gaffer01 gaffer01 added this to the 0.28.0 milestone Dec 13, 2024
@rtjd6554 rtjd6554 self-assigned this Jan 9, 2025
@rtjd6554 rtjd6554 linked a pull request Jan 23, 2025 that will close this issue
4 tasks
@rtjd6554

This solution assumes that DEFAULT_ROW_GROUP_SIZE is an acceptable default value for "S3A_READAHEAD_RANGE": the previous default explicitly stated KB as its unit, whereas this value is purely in bytes.

On reading, the Hadoop documentation appears to support this:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
[Search: fs.s3a.readahead.range]
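
To illustrate the units question: the Hadoop documentation describes fs.s3a.readahead.range as a byte count that may carry a suffix from the set K, M, G, T, P. The sketch below shows how a suffixed value and a plain byte count can describe the same size. It is a simplified illustration, not Hadoop's own parser, and the class and method names are hypothetical.

```java
import java.util.Locale;

public class ByteSizeSketch {
    // Converts a size such as "64K" or "8388608" to a number of bytes.
    // Suffixes are treated as binary multiples (K = 1024, M = 1024^2, ...).
    static long toBytes(String value) {
        String v = value.trim().toUpperCase(Locale.ROOT);
        if (v.isEmpty()) {
            throw new IllegalArgumentException("Empty size value");
        }
        int suffixIndex = "KMGTP".indexOf(v.charAt(v.length() - 1));
        if (suffixIndex < 0) {
            // No suffix: the value is already a plain byte count.
            return Long.parseLong(v);
        }
        long number = Long.parseLong(v.substring(0, v.length() - 1).trim());
        // Shift by 10 bits per suffix step; multiplyExact guards against overflow.
        return Math.multiplyExact(number, 1L << (10 * (suffixIndex + 1)));
    }

    public static void main(String[] args) {
        System.out.println(toBytes("64K"));
        System.out.println(toBytes("8M"));
        System.out.println(toBytes("8388608"));
    }
}
```

Under this interpretation a row group size expressed purely in bytes is a valid readahead range without any conversion.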
