[Parquet] Fix slow dictionary encoding of NaN float values #6953
Benchmark results from the new benchmarks before changing the interning behaviour:
This shows that writing with 50% NaN values is much slower than with no NaNs. After the change, performance with NaNs is very similar to without NaNs:
(I removed the |
parquet/src/util/interner.rs
Outdated
@@ -66,7 +70,7 @@ impl<S: Storage> Interner<S> {
             .dedup
             .entry(
                 hash,
-                |index| value == self.storage.get(*index),
+                |index| value.eq(self.storage.get(*index)),
Hmm, after opening this PR I realised a simpler approach would be to just compare the values by their byte representation here:
|index| value.as_bytes() == self.storage.get(*index).as_bytes()
I will check what effect that has on performance.
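To make the difference between the two comparisons concrete, here is a minimal sketch using plain `f64` values. `bytes_eq` is a hypothetical helper that stands in for the `as_bytes()` comparison suggested above, and it uses the standard library's `to_ne_bytes` rather than anything from the parquet crate.

```rust
/// Compare two f64 values by their in-memory byte representation.
/// Hypothetical helper for illustration; not part of the parquet crate.
fn bytes_eq(a: f64, b: f64) -> bool {
    a.to_ne_bytes() == b.to_ne_bytes()
}

fn main() {
    let nan = f64::NAN;

    // IEEE 754 comparison: NaN is never equal to anything, including itself.
    assert!(nan != nan);
    // Byte comparison: identical bit patterns compare equal.
    assert!(bytes_eq(nan, nan));

    // The two comparisons also disagree for signed zeros.
    assert!(0.0_f64 == -0.0_f64);
    assert!(!bytes_eq(0.0, -0.0));

    // A NaN with a different payload has a different bit pattern,
    // so a byte comparison treats it as a distinct value.
    let other_nan = f64::from_bits(f64::NAN.to_bits() ^ 1);
    assert!(other_nan.is_nan());
    assert!(!bytes_eq(nan, other_nan));
}
```

Two side effects of comparing bytes are visible here: `0.0` and `-0.0` no longer compare equal, and NaNs with different payloads stay distinct.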
Thank you @adamreeve -- is there any chance you could break out the benchmark into its own PR so it is easier to compare the before/after performance of this change?
Nice catch! This seems like an elegant solution. Thanks!
Sure, I've made #6955 to just add the benchmark and test. I'll make this PR a draft and rebase it once that is merged.
Thanks! I have merged #6955 -- I'll run the benchmarks when this one is rebased. Thanks again @adamreeve
Force-pushed from 38d8581 to ef980a7
OK I've rebased this now and switched to comparing byte representations for all types rather than needing a new trait, which is a simpler solution and is consistent with how values are hashed. The performance was very similar between the two approaches on my machine.
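As a rough illustration of why the equality used during interning matters, here is a toy interner that hashes values by their bytes and takes the equality check as a parameter. `ToyInterner`, `partial_eq`, `bytes_eq` and `dictionary_size` are hypothetical names for this sketch; it uses only the standard library and is much simpler than the crate's `Interner`, so treat it as a model of the behaviour rather than the actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy interner: maps each distinct value to an index into `storage`.
/// Simplified stand-in, not the parquet crate's `Interner`.
struct ToyInterner {
    storage: Vec<f64>,
    // Hash of the value's bytes -> indices in `storage` sharing that hash.
    dedup: HashMap<u64, Vec<usize>>,
    // Equality used to decide whether a stored value matches a new one.
    eq: fn(f64, f64) -> bool,
}

impl ToyInterner {
    fn new(eq: fn(f64, f64) -> bool) -> Self {
        ToyInterner { storage: Vec::new(), dedup: HashMap::new(), eq }
    }

    fn intern(&mut self, value: f64) -> usize {
        // Values are hashed by their byte representation.
        let mut hasher = DefaultHasher::new();
        value.to_ne_bytes().hash(&mut hasher);
        let hash = hasher.finish();

        let eq = self.eq;
        let storage = &mut self.storage;
        let bucket = self.dedup.entry(hash).or_default();
        // Scan every stored value with the same hash for a match.
        for &index in bucket.iter() {
            if eq(value, storage[index]) {
                return index;
            }
        }
        // No match: store the value as a new dictionary entry.
        storage.push(value);
        bucket.push(storage.len() - 1);
        storage.len() - 1
    }
}

fn partial_eq(a: f64, b: f64) -> bool {
    a == b
}

fn bytes_eq(a: f64, b: f64) -> bool {
    a.to_ne_bytes() == b.to_ne_bytes()
}

/// Intern all values and report how many dictionary entries were created.
fn dictionary_size(eq: fn(f64, f64) -> bool, values: &[f64]) -> usize {
    let mut interner = ToyInterner::new(eq);
    for &v in values {
        interner.intern(v);
    }
    interner.storage.len()
}

fn main() {
    let values = [1.0, f64::NAN, 2.0, f64::NAN, f64::NAN, 1.0];
    println!("PartialEq: {} entries", dictionary_size(partial_eq, &values));
    println!("bytes:     {} entries", dictionary_size(bytes_eq, &values));
}
```

In this toy model, `==` never matches a stored NaN, so every NaN becomes a new entry with the same hash and each later NaN has to scan all of the earlier ones. Comparing bytes agrees with the hash, so identical NaNs collapse into a single entry.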
Which issue does this PR close?
Closes #6952
Rationale for this change
This treats NaNs as equal to other NaNs of the same type for the purpose of dictionary encoding them when writing f32 or f64 Parquet physical values.
What changes are included in this PR?
Intern trait to define equality behaviour for interning, replacing the use of PartialEq.
Are there any user-facing changes?