Background
Experimentation has suggested better defaults for parameters used when running queries, specifically the use of column indexes and the S3A readahead range. These optimisations reduce both the number of GETs and the time to return results. The reduction in the number of GETs is dramatic if there are a lot of columns in a table.
Description
When running queries we want to turn off the use of column indexes. We also want to set the readahead range to the size of the row group.
We can also turn off the use of column indexes when reading Parquet files in Java compactions (this has already been done in the DataFusion-based compaction code).
We want to continue to write column indexes when we write Parquet files, as external programs may want to read Sleeper's Parquet files and use them.
Analysis
To turn off column indexes when running queries we can use the useColumnIndexFilter(false) option on new ParquetRecordReader.Builder(path, schema). We can have a table option to determine whether column indexes are used when performing queries. This should default to false.
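As a rough illustration of the table option described above, the sketch below resolves a boolean flag that defaults to false when the property is absent. The property name and the use of java.util.Properties are assumptions made for the example, not Sleeper's actual table property API. The resolved flag would then be passed to useColumnIndexFilter(...) on the reader builder named in the issue.

```java
import java.util.Properties;

public class ColumnIndexOption {
    // Hypothetical property name, chosen only for this sketch.
    static final String PROP = "example.table.query.column.index.enabled";

    // Returns whether queries should use column indexes; false unless explicitly enabled.
    static boolean useColumnIndexFilter(Properties tableProperties) {
        return Boolean.parseBoolean(tableProperties.getProperty(PROP, "false"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing set, so the default applies.
        System.out.println(useColumnIndexFilter(props));
        // A table can opt back in explicitly.
        props.setProperty(PROP, "true");
        System.out.println(useColumnIndexFilter(props));
        // The flag would then feed the builder call from the issue, e.g.
        // new ParquetRecordReader.Builder(path, schema).useColumnIndexFilter(flag)
    }
}
```

Defaulting to false in the lookup itself means tables without the property pick up the new behaviour without any migration.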
We want to default the readahead range used for queries to the row group size. We already have a table property for the readahead range. Suggest we simply set this to the same default as the row group size.
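The defaulting suggested here could look roughly like the sketch below: if no readahead range is set, the lookup falls back to the same constant that provides the default row group size. The property name and the numeric value of the constant are placeholders for illustration and are not taken from Sleeper.

```java
import java.util.Properties;

public class ReadaheadDefault {
    // Hypothetical property name, chosen only for this sketch.
    static final String READAHEAD_RANGE = "example.table.s3a.readahead.range";
    // Placeholder value in bytes; the real default row group size may differ.
    static final String DEFAULT_ROW_GROUP_SIZE = "8388608";

    // Returns the readahead range in bytes, sharing its default with the row group size.
    static long readaheadRange(Properties tableProperties) {
        return Long.parseLong(tableProperties.getProperty(READAHEAD_RANGE, DEFAULT_ROW_GROUP_SIZE));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Unset: falls back to the default row group size.
        System.out.println(readaheadRange(props));
        // Explicitly set: the table's own value wins.
        props.setProperty(READAHEAD_RANGE, "65536");
        System.out.println(readaheadRange(props));
    }
}
```

Note that sharing the default constant only aligns the two defaults. A table that overrides its row group size would still need its readahead range set separately under this approach.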
We can explicitly set the use of column indexes to false when performing Java compactions. There seems to be no need to have this as an option.
The solution assumes that defaulting "S3A_READAHEAD_RANGE" to DEFAULT_ROW_GROUP_SIZE is acceptable. The previous default explicitly stated Kb as the unit, whereas this value is purely in bytes.
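To make the unit question concrete, here is a minimal sketch of a parser that accepts both forms: a value with a binary suffix such as 64K, and a plain byte count. It assumes suffixes are binary multiples (K = 1024), which is how Hadoop-style size strings are generally interpreted. The class and method names are hypothetical, and this is not the parsing code Sleeper or Hadoop actually uses.

```java
import java.util.Locale;

public class SizeInBytes {
    // Converts "64K", "1M", "2G" or a plain number of bytes into a byte count.
    static long toBytes(String value) {
        String v = value.trim().toUpperCase(Locale.ROOT);
        if (v.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier;
        switch (v.charAt(v.length() - 1)) {
            case 'K': multiplier = 1024L; break;
            case 'M': multiplier = 1024L * 1024L; break;
            case 'G': multiplier = 1024L * 1024L * 1024L; break;
            default: return Long.parseLong(v); // no suffix: already in bytes
        }
        return Long.parseLong(v.substring(0, v.length() - 1)) * multiplier;
    }

    public static void main(String[] args) {
        System.out.println(toBytes("64K"));     // prints 65536
        System.out.println(toBytes("8388608")); // prints 8388608
    }
}
```

If the code reading the property accepts both forms, a default given purely in bytes should behave the same as one given with a suffix. That is worth confirming against the code that actually reads the property.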