
[Parquet] Fix slow dictionary encoding of NaN float values #6953

Open · wants to merge 2 commits into main
Conversation

adamreeve (Contributor)

Which issue does this PR close?

Closes #6952

Rationale for this change

This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.

What changes are included in this PR?

  • Introduces a new Intern trait to define equality behaviour for interning, replacing the use of PartialEq.
  • Adds a benchmark for writing floating point values with NaNs to Parquet.

Are there any user-facing changes?

  • Users should see improved performance when writing floating point data with many NaNs to Parquet

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Jan 8, 2025
adamreeve (Contributor, Author)

Benchmark results from the new benchmarks before changing the interning behaviour:

write_batch primitive/4096 values float with NaNs
                        time:   [5.6968 ms 5.7060 ms 5.7141 ms]
                        thrpt:  [9.6186 MiB/s 9.6324 MiB/s 9.6479 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
write_batch primitive/4096 values float with no NaNs
                        time:   [383.44 µs 383.65 µs 383.85 µs]
                        thrpt:  [143.18 MiB/s 143.26 MiB/s 143.34 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

This shows that writing with 50% NaN values is much slower than with no NaNs.

After the change, performance with NaNs is very similar to without NaNs:

write_batch primitive/4096 values float with NaNs
                        time:   [406.40 µs 406.63 µs 406.88 µs]
                        thrpt:  [135.08 MiB/s 135.16 MiB/s 135.24 MiB/s]
                 change:
                        time:   [-92.875% -92.861% -92.845%] (p = 0.00 < 0.05)
                        thrpt:  [+1297.6% +1300.7% +1303.5%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values float with no NaNs
                        time:   [382.52 µs 384.16 µs 385.50 µs]
                        thrpt:  [142.58 MiB/s 143.07 MiB/s 143.68 MiB/s]
                 change:
                        time:   [+0.1803% +0.3520% +0.5192%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5165% -0.3507% -0.1799%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

(I removed the 4096 values float with no NaNs benchmark from this PR after running these benchmarks as I don't think there's a lot of value in keeping it)

@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
            .dedup
            .entry(
                hash,
-               |index| value == self.storage.get(*index),
+               |index| value.eq(self.storage.get(*index)),
adamreeve (Contributor, Author)

Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:

|index| value.as_bytes() == self.storage.get(*index).as_bytes()

I will check what effect that has on performance.

alamb (Contributor) commented Jan 8, 2025

Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

etseidl (Contributor) left a comment

Nice catch! This seems like an elegant solution. Thanks!

adamreeve (Contributor, Author)

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?

Sure, I've made #6955 to just add the benchmark and test, I'll make this PR a draft and rebase it once that is merged.

@adamreeve adamreeve marked this pull request as draft January 8, 2025 21:03
alamb (Contributor) commented Jan 8, 2025

> Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
>
> Sure, I've made #6955 to just add the benchmark and test, I'll make this PR a draft and rebase it once that is merged.

Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased.

Thanks again @adamreeve

@github-actions github-actions bot removed the arrow Changes to the arrow crate label Jan 8, 2025
adamreeve (Contributor, Author)

OK I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.

@adamreeve adamreeve marked this pull request as ready for review January 8, 2025 23:42
Successfully merging this pull request may close these issues.

Writing floating point values containing NaN to Parquet is slow when using dictionary encoding
3 participants