Feature/optimize collections for json #38

Merged
merged 36 commits into from
Jan 10, 2025

Conversation

garethj2
Contributor

@garethj2 garethj2 commented Jan 9, 2025

Refactor Haystack core so it's lazy for decoding values.

Before this change, decoding a grid would eagerly create rows and dicts (HVal) at the point the data was decoded. This adds a lot of unnecessary overhead if only part of the data structure is ever used. In our server-side usage, we create lots of haystack values that may or may not be used at all.

This change makes working with haystack collections extremely lazy for JSON: data is decoded only when it is accessed for the first time. The refactoring could potentially also be applied to other encodings, but since JSON is by far the fastest to parse, it makes sense to start there.

Each collection (HList, HGrid and HDict) now has a backing store. Different stores abstract how the data is loaded, which is what allows the loading to be lazy.
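To illustrate the backing store idea, here is a minimal sketch of a dict that delegates to either an eager or a lazy store. All names here (`Val`, `DictStore`, `EagerDictStore`, `LazyDictStore`, `Dict`, `decodeCount`) are hypothetical, simplified stand-ins and not the actual classes in this PR.

```typescript
// Hypothetical, simplified stand-ins; not the actual haystack-core classes.
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue }
type JsonObject = { [key: string]: JsonValue }

// A decoded value. A plain wrapper is enough for this sketch.
class Val {
  constructor(readonly raw: JsonValue) {}
}

// Counts decodes so the difference between the stores is observable.
let decodeCount = 0
function decode(raw: JsonValue): Val {
  decodeCount++
  return new Val(raw)
}

// The abstraction a dict delegates to.
interface DictStore {
  get(name: string): Val | undefined
  keys(): string[]
}

// Eager store: every entry is decoded up front in the constructor.
class EagerDictStore implements DictStore {
  private readonly values = new Map<string, Val>()
  constructor(json: JsonObject) {
    for (const key of Object.keys(json)) {
      this.values.set(key, decode(json[key] as JsonValue))
    }
  }
  get(name: string): Val | undefined { return this.values.get(name) }
  keys(): string[] { return Array.from(this.values.keys()) }
}

// Lazy store: keeps the raw JSON and decodes an entry on first access only.
class LazyDictStore implements DictStore {
  private readonly cache = new Map<string, Val>()
  constructor(private readonly json: JsonObject) {}
  get(name: string): Val | undefined {
    const cached = this.cache.get(name)
    if (cached !== undefined) return cached
    if (!Object.prototype.hasOwnProperty.call(this.json, name)) return undefined
    const val = decode(this.json[name] as JsonValue)
    this.cache.set(name, val)
    return val
  }
  // Listing keys needs no decoding at all.
  keys(): string[] { return Object.keys(this.json) }
}

// The collection itself does not know how its data is loaded.
class Dict {
  constructor(private readonly store: DictStore) {}
  get(name: string): Val | undefined { return this.store.get(name) }
  keys(): string[] { return this.store.keys() }
}

const sample: JsonObject = { id: 'a', dis: 'Site', area: 100 }
const eagerDict = new Dict(new EagerDictStore(sample)) // decodes all three entries now
const lazyDict = new Dict(new LazyDictStore(sample))   // decodes nothing yet
lazyDict.get('dis')                                     // decodes only 'dis'
```

The point of the split is that callers only ever see `Dict`; swapping the store changes when the decoding cost is paid, not the API.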

I've also added support for creating haystack data structures from JSON strings and from JSON strings encoded in byte buffers. This way a haystack value can be created up front, and only if something is done with it will the byte buffer be read, decoded to a string and then parsed as JSON.
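The deferred chain from byte buffer to string to JSON could look roughly like the sketch below. `LazyJsonSource` and `parseCount` are hypothetical names for illustration, and the sketch assumes UTF-8 encoded bytes; the actual stores in this PR may be structured differently.

```typescript
// Hypothetical sketch; not the actual haystack-core API.
// Holds a JSON string or UTF-8 bytes and does no work until `value` is read.
class LazyJsonSource {
  private parsed: unknown
  private done = false
  // Number of times the source was actually parsed (0 or 1).
  parseCount = 0

  constructor(private readonly source: Uint8Array | string) {}

  get value(): unknown {
    if (!this.done) {
      // Step 1: bytes to string (skipped when a string was supplied).
      const text =
        typeof this.source === 'string'
          ? this.source
          : new TextDecoder().decode(this.source)
      // Step 2: string to JSON.
      this.parsed = JSON.parse(text)
      this.parseCount++
      this.done = true
    }
    return this.parsed
  }
}

const bytes = new TextEncoder().encode('{"dis":"Site 1","area":100}')
const lazySource = new LazyJsonSource(bytes) // nothing read or parsed yet
const siteDis = (lazySource.value as { dis: string }).dis // first access triggers the decode
```

If `value` is never read, the bytes are never touched, which is the case the description above is optimizing for.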

I appreciate there's a lot of code here. To zoom in on the core pieces of code that do the work, please take a look at DictJsonStore, GridJsonStore and ListJsonStore (the new JSON-specific stores) versus DictObjStore, GridObjStore and ListObjStore (the old way that decodes everything up front). The old stores are still used when creating dicts on the fly from code, which is fine.

Contributor

@jaxgzz jaxgzz left a comment

Nice improvement Gareth! Just left a couple of very minor comments, but this whole thing looks very nice to me. Thanks! I'm approving this ahead of time.

Contributor

@rracariu rracariu left a comment

Interesting approach, what are the % perf gains?

@garethj2
Contributor Author

garethj2 commented Jan 10, 2025

@rracariu in answer to your question regarding performance gains. Roughly speaking...

If you read a large grid and then exhaustively read all information, there's no performance gain.

If you read a large grid and use only some of the grid's meta, it's 1000% faster. If you read a grid and read only a few of the tags on each dict, it's about 30% faster.

If you read a large grid and then immediately transfer it back to JSON, then again it's 1000s of times faster.
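One way a round trip can be this cheap is if a grid that was never decoded simply hands back the JSON it was created from. The sketch below assumes that mechanism; `LazyGrid`, `getRows`, `toJsonString` and `decodeCount` are hypothetical names, and the actual implementation in this PR may differ.

```typescript
// Hypothetical sketch; not the actual haystack-core API.
type Row = Record<string, unknown>

class LazyGrid {
  private rows?: Row[]
  // Number of times the JSON text was actually parsed (0 or 1).
  decodeCount = 0

  constructor(private readonly json: string) {}

  // Parses the JSON on first access only.
  getRows(): Row[] {
    if (this.rows === undefined) {
      this.rows = (JSON.parse(this.json) as { rows: Row[] }).rows
      this.decodeCount++
    }
    return this.rows
  }

  toJsonString(): string {
    // Fast path: nothing was decoded, so the original text is still accurate.
    if (this.rows === undefined) return this.json
    // Slow path: rows may have been modified after decoding, so re-encode them.
    return JSON.stringify({ rows: this.rows })
  }
}

const passThrough = new LazyGrid('{"rows":[{"dis":"a"},{"dis":"b"}]}')
const sameJson = passThrough.toJsonString() // no parse, no re-encode
```

Once `getRows()` has been called the grid falls back to re-encoding, since the caller may have changed the rows.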

There are quite a few situations in our server-side usage where this happens, so we should see some very large performance gains. Note the added support for JSON strings and byte buffers with encoded JSON strings.

@garethj2 garethj2 merged commit 564c01e into master Jan 10, 2025
1 check passed
@garethj2 garethj2 deleted the feature/optimize-grid-and-dict-for-json branch January 10, 2025 12:39
4 participants