Improve arrow-json deserialization performance by 30% #7157

Open. 3 commits to be merged into base: main

mwylde (Contributor) commented Feb 20, 2025

Which issue does this PR close?

Closes #7156

Rationale for this change

Described in the issue

What changes are included in this PR?

This PR is made up of three changes, as separate commits. The performance impact of each change is summarized here:

| Benchmark | BufIter Change (%) | memchr2 Change (%) | simdutf8 Change (%) | BufIter Time (µs) | memchr2 Time (µs) | simdutf8 Time (µs) |
|---|---|---|---|---|---|---|
| logs_json | -13.49% | -12.61% | -0.83% | 582.91 | 516.77 | 510.75 |
| logs_pretty_json | -20.63% | -10.77% | -1.22% | 604.53 | 535.28 | 533.00 |
| nexmark_json | -30.01% | -16.01% | -0.42% | 730.18 | 613.23 | 600.53 |
| nexmark_pretty_json | -21.35% | -15.76% | -1.09% | 754.84 | 636.66 | 623.56 |
| nexmark_bids_json | -26.64% | -22.20% | -3.01% | 508.01 | 396.75 | 382.77 |
| nexmark_bids_pretty_json | -26.26% | -22.16% | -2.64% | 531.44 | 415.03 | 402.55 |
| tweets_json | -20.36% | -18.97% | -14.85% | 688.13 | 594.75 | 515.75 |
| tweets_pretty_json | -22.61% | -15.10% | -11.04% | 749.34 | 642.19 | 563.66 |
| Average | -22.04% | -16.57% | -4.64% | 643.17 | 543.21 | 516.57 |

  • 80cd0b9 (BufIter): replaces the wrapped iterator in BufIter with a slice and an offset, allowing more efficient operations; this is a straightforward change that significantly improves performance
  • 3245fff (memchr2): uses the memchr library to speed up searches for string ends; this is also a significant improvement but adds an additional dependency (although one that is already used in arrow-string)
  • 9789442 (simdutf8): a more modest improvement that reduces the cost of UTF-8 validation via the simdutf8 library; this also adds a dependency, although it is already used in the parquet crate and has been suggested for other uses in "Support using simdutf8 for validate_string_view and other utf8 validation" (#7014)
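
To illustrate the shape of these changes, here is a minimal sketch using only the standard library. It is not the code from the commits: `SliceCursor` and `read_simple_string` are hypothetical names, and the real `BufIter` may differ. The sketch assumes the cursor holds a byte slice plus an offset, scans for the next `"` or `\` in one pass (the step where `memchr::memchr2` would be used), and validates the result as UTF-8 (the step where `simdutf8::basic::from_utf8` would stand in for `std::str::from_utf8`).

```rust
/// Hypothetical stand-in for a slice-and-offset cursor (not the PR's `BufIter`).
struct SliceCursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceCursor<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Look at the next byte without consuming it.
    fn peek(&self) -> Option<u8> {
        self.buf.get(self.pos).copied()
    }

    /// Move forward by `n` bytes, clamped to the end of the buffer.
    fn advance(&mut self, n: usize) {
        self.pos = (self.pos + n).min(self.buf.len());
    }

    /// Consume and return the bytes up to (not including) the next `"` or `\`.
    /// If neither byte is found, the rest of the buffer is consumed.
    fn take_until_quote_or_backslash(&mut self) -> &'a [u8] {
        let buf: &'a [u8] = self.buf;
        let rest = &buf[self.pos..];
        // With the memchr crate, this search would be
        // `memchr::memchr2(b'"', b'\\', rest)`; std's `position` is used
        // here so the sketch has no dependencies.
        let len = rest
            .iter()
            .position(|&b| b == b'"' || b == b'\\')
            .unwrap_or(rest.len());
        self.pos += len;
        &rest[..len]
    }
}

/// Read a JSON string that contains no escape sequences.
/// Returns `None` for anything else (escapes, missing quotes, invalid UTF-8);
/// a real parser would handle escapes instead of giving up.
fn read_simple_string<'a>(cur: &mut SliceCursor<'a>) -> Option<&'a str> {
    if cur.peek()? != b'"' {
        return None;
    }
    cur.advance(1);
    let body = cur.take_until_quote_or_backslash();
    if cur.peek()? != b'"' {
        return None;
    }
    cur.advance(1);
    // `simdutf8::basic::from_utf8` takes the same `&[u8]` input and could be
    // swapped in here; std's validator keeps the sketch dependency-free.
    std::str::from_utf8(body).ok()
}
```

Because the cursor exposes the remaining input as a contiguous slice, the whole string body can be located with a single search and returned as a borrowed subslice, rather than being pulled one byte at a time through an iterator.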

If the changes that add dependencies are not desired, they can be backed out of the PR.

Are there any user-facing changes?

No

github-actions bot added the "arrow" (Changes to the arrow crate) label on Feb 20, 2025
Successfully merging this pull request may close these issues.

Inefficiencies in the arrow-json tape implementation