
[Parquet] Fix slow dictionary encoding of NaN float values #6953

Open · wants to merge 2 commits into main
Conversation

adamreeve (Contributor)

Which issue does this PR close?

Closes #6952

Rationale for this change

This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.

What changes are included in this PR?

  • Introduces a new Intern trait to define equality behaviour for interning, replacing the use of PartialEq.
  • Adds a benchmark for writing floating point values with NaNs to Parquet.

Are there any user-facing changes?

  • Users should see improved performance when writing floating point data with many NaNs to Parquet

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Jan 8, 2025
adamreeve (Contributor, Author)

Benchmark results from the new benchmarks before changing the interning behaviour:

write_batch primitive/4096 values float with NaNs
                        time:   [5.6968 ms 5.7060 ms 5.7141 ms]
                        thrpt:  [9.6186 MiB/s 9.6324 MiB/s 9.6479 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
write_batch primitive/4096 values float with no NaNs
                        time:   [383.44 µs 383.65 µs 383.85 µs]
                        thrpt:  [143.18 MiB/s 143.26 MiB/s 143.34 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

This shows that writing with 50% NaN values is much slower than with no NaNs.

After the change, performance with NaNs is very similar to without NaNs:

write_batch primitive/4096 values float with NaNs
                        time:   [406.40 µs 406.63 µs 406.88 µs]
                        thrpt:  [135.08 MiB/s 135.16 MiB/s 135.24 MiB/s]
                 change:
                        time:   [-92.875% -92.861% -92.845%] (p = 0.00 < 0.05)
                        thrpt:  [+1297.6% +1300.7% +1303.5%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values float with no NaNs
                        time:   [382.52 µs 384.16 µs 385.50 µs]
                        thrpt:  [142.58 MiB/s 143.07 MiB/s 143.68 MiB/s]
                 change:
                        time:   [+0.1803% +0.3520% +0.5192%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5165% -0.3507% -0.1799%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

(I removed the 4096 values float with no NaNs benchmark from this PR after running these benchmarks as I don't think there's a lot of value in keeping it)

@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
            .dedup
            .entry(
                hash,
-               |index| value == self.storage.get(*index),
+               |index| value.eq(self.storage.get(*index)),
adamreeve (Contributor, Author)

Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:

|index| value.as_bytes() == self.storage.get(*index).as_bytes()

I will check what effect that has on performance.

alamb (Contributor) commented Jan 8, 2025

Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

etseidl (Contributor) left a comment

Nice catch! This seems like an elegant solution. Thanks!

adamreeve (Contributor, Author)

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

Sure, I've made #6955 to just add the benchmark and test, I'll make this PR a draft and rebase it once that is merged.

@adamreeve adamreeve marked this pull request as draft January 8, 2025 21:03
alamb (Contributor) commented Jan 8, 2025

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
>
> Sure, I've made #6955 to just add the benchmark and test, I'll make this PR a draft and rebase it once that is merged.

Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased.

Thanks again @adamreeve

@github-actions github-actions bot removed the arrow Changes to the arrow crate label Jan 8, 2025
adamreeve (Contributor, Author)

OK I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.

@adamreeve adamreeve marked this pull request as ready for review January 8, 2025 23:42
Successfully merging this pull request may close these issues.

Writing floating point values containing NaN to Parquet is slow when using dictionary encoding
3 participants