Name	Name	Last commit message	Last commit date
parent directory ..
db-benchmark	db-benchmark
expected-plans	expected-plans
queries	queries
src/bin	src/bin
.dockerignore	.dockerignore
.gitignore	.gitignore
Cargo.toml	Cargo.toml
README.md	README.md
entrypoint.sh	entrypoint.sh
tpch-gen.sh	tpch-gen.sh
tpchgen.dockerfile	tpchgen.dockerfile

DataFusion Benchmarks

This crate contains benchmarks based on popular public data sets and open source benchmark suites, making it easy to run real-world benchmarks to help with performance and scalability testing and for comparing performance with other Arrow implementations as well as other query engines.

Benchmark derived from TPC-H

These benchmarks are derived from the TPC-H benchmark. And we use this repo as the source of tpch-gen and answers: https://github.com/databricks/tpch-dbgen.git, based on 2.17.1 version of TPC-H.

Generating Test Data

TPC-H data can be generated using the tpch-gen.sh script, which creates a Docker image containing the TPC-DS data generator.

# scale_factor: scale of the database population. scale 1.0 represents ~1 GB of data
./tpch-gen.sh <scale_factor>

Data will be generated into the data subdirectory and will not be checked in because this directory has been added to the .gitignore file.

Running the DataFusion Benchmarks

The benchmark can then be run (assuming the data created from dbgen is in ./data) with a command such as:

cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096

You can enable the features simd (to use SIMD instructions, cargo nightly is required.) and/or mimalloc or snmalloc (to use either the mimalloc or snmalloc allocator) as features by passing them in as --features:

cargo run --release --features "simd mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096

If you want to disable collection of statistics (and thus cost based optimizers), you can pass --disable-statistics flag. ``bash cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path /mnt/tpch-parquet --format parquet --query 17 --disable-statistics


The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from `tbl`
(generated by the `dbgen` utility) to CSV and Parquet.

```bash
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-parquet --format parquet

Or if you want to verify and run all the queries in the benchmark, you can just run cargo test.

Expected output

The result of query 1 should produce the following output when executed against the SF=1 dataset.

+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| l_returnflag | l_linestatus | sum_qty  | sum_base_price     | sum_disc_price     | sum_charge         | avg_qty            | avg_price          | avg_disc             | count_order |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
| A            | F            | 37734107 | 56586554400.73001  | 53758257134.870026 | 55909065222.82768  | 25.522005853257337 | 38273.12973462168  | 0.049985295838396455 | 1478493     |
| N            | F            | 991417   | 1487504710.3799996 | 1413082168.0541    | 1469649223.1943746 | 25.516471920522985 | 38284.467760848296 | 0.05009342667421622  | 38854       |
| N            | O            | 74476023 | 111701708529.50996 | 106118209986.10472 | 110367023144.56622 | 25.502229680934594 | 38249.1238377803   | 0.049996589476752576 | 2920373     |
| R            | F            | 37719753 | 56568041380.90001  | 53741292684.60399  | 55889619119.83194  | 25.50579361269077  | 38250.854626099666 | 0.05000940583012587  | 1478870     |
+--------------+--------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------+-------------+
Query 1 iteration 0 took 1956.1 ms
Query 1 avg time: 1956.11 ms

NYC Taxi Benchmark

These benchmarks are based on the New York Taxi and Limousine Commission data set.

cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size 4096

Example output:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv", file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 7138 ms
Query 'fare_amt_by_passenger' iteration 1 took 7599 ms
Query 'fare_amt_by_passenger' iteration 2 took 7969 ms

h2o benchmarks

cargo run --release --bin h2o group-by --query 1 --path /mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv --mem-table --debug

Example run:

Running benchmarks with the following options: GroupBy(GroupBy { query: 1, path: "/mnt/bigdata/h2oai/N_1e7_K_1e2_single.csv", debug: false })
Executing select id1, sum(v1) as v1 from x group by id1
+-------+--------+
| id1   | v1     |
+-------+--------+
| id063 | 199420 |
| id094 | 200127 |
| id044 | 198886 |
...
| id093 | 200132 |
| id003 | 199047 |
+-------+--------+

h2o groupby query 1 took 1669 ms

Parquet filter pushdown benchmarks

This is a set of benchmarks for testing and verifying performance of parquet filter pushdown. The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.

cargo run --release --bin parquet_filter_pushdown --  --path ./data --scale-factor 1.0

This will generate the synthetic dataset at ./data/logs.parquet. The size of the dataset can be controlled through the size_factor (with the default value of 1.0 generating a ~1GB parquet file).

For each filter we will run the query using different ParquetScanOption settings.

Example run:

Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

benchmarks

benchmarks

README.md

DataFusion Benchmarks

Benchmark derived from TPC-H

Generating Test Data

Running the DataFusion Benchmarks

Expected output

NYC Taxi Benchmark

h2o benchmarks

Parquet filter pushdown benchmarks

Files

benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

benchmarks

Folders and files

parent directory

README.md

DataFusion Benchmarks

Benchmark derived from TPC-H

Generating Test Data

Running the DataFusion Benchmarks

Expected output

NYC Taxi Benchmark

h2o benchmarks

Parquet filter pushdown benchmarks