Improve performance, especially in data with many CR-LF #137
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The code changes here started as an attempt to improve performance with data loaded with many CRLF pairs (#67). In the current code these cause performance issues because CRLF matches the start of a boundary pattern. This makes advancing through the data very slow due to the number of call-backs.
By far the greatest optmization was to remove the partial Boyer-Moore-Horspool (BMH) implementation and replace it with
bytes.find
and a bit of logic to ensure partial matches at the end weren't missed.bytes.find
appears to be significantly faster (around 3 times) on representative data. I can make a pull request with just this change if you are interested.In the update I also attempted to reduce the number of times the on_part_data callback was called to a minimum. Whereas the old code would call it every time a partial boundary match was found (i.e. CRLF), now it only calls it when necessary. The conditions for calling the
on_part_data
are now:The significant complication is what happens when a partial boundary match overlaps the end of the loaded data. This was addressed with a look-behind buffer before, but the buffer is mostly unnecessary: since we are always matching boundary bytes, the look-behind buffer is always just a copy of the boundary. Only the last few bytes may vary (CRLF vs -- depending on whether it is a part or end boundary). However hitting this condition should be very very rare, and is addressed in the code.