Improve performance, especially in data with many CR-LF #137

jhnstrk · 2024-03-31T13:48:34Z

The code changes here started as an attempt to improve performance on data containing many CRLF pairs (#67). In the current code these cause performance issues because CRLF matches the start of the boundary pattern, so every CRLF triggers a partial match. This makes advancing through the data very slow due to the number of callbacks.

By far the greatest optimization was to remove the partial Boyer-Moore-Horspool (BMH) implementation and replace it with bytes.find plus a bit of logic to ensure partial matches at the end of the data aren't missed. bytes.find appears to be significantly faster (around 3 times) on representative data. I can make a pull request with just this change if you are interested.
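
To illustrate the idea, here is a minimal sketch of a search built on bytes.find, with a fallback check for a partial match at the tail of the data. It is not the code from this PR: the function name `find_boundary` and its return convention are assumptions made for the example.

```python
from __future__ import annotations


def find_boundary(data: bytes, boundary: bytes, start: int = 0) -> tuple[int, int]:
    """Hypothetical helper: return (index, matched_length).

    matched_length == len(boundary) -> full match starting at index.
    0 < matched_length < len(boundary) -> data ends with a prefix of the
    boundary, starting at index (the rest may arrive in the next chunk).
    (-1, 0) -> no full or partial match.
    """
    # Fast path: let the built-in search look for the complete boundary.
    idx = data.find(boundary, start)
    if idx != -1:
        return idx, len(boundary)

    # Slow path: check whether the tail of the data is a proper prefix of
    # the boundary, longest candidate first, so the match is not missed.
    max_len = min(len(boundary) - 1, len(data) - start)
    for n in range(max_len, 0, -1):
        if data.endswith(boundary[:n]):
            return len(data) - n, n
    return -1, 0


boundary = b"\r\n--abc"
print(find_boundary(b"hello\r\n--abcXYZ", boundary))  # full match
print(find_boundary(b"hello\r\n--a", boundary))       # partial match at the tail
print(find_boundary(b"a\r\nb\r\nc", boundary))        # CRLFs in the middle, no match
```

The tail check runs at most once per chunk and over at most `len(boundary) - 1` candidates, so CRLF pairs in the middle of the data cost nothing beyond the bytes.find call.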

In the update I also attempted to reduce the number of times the on_part_data callback is called to a minimum. Whereas the old code called it every time a partial boundary match was found (i.e. at every CRLF), the new code only calls it when necessary. The conditions for calling on_part_data are now:

- A complete boundary match is found, either for an end of part or final end.
- The currently loaded data buffer has been exhausted.
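
The two conditions above can be sketched roughly as follows. This is a simplified illustration, not the parser from this PR: `emit_part_data` is a hypothetical name, the callback is assumed to take `(data, start, end)`, and a partial match at the end of the chunk is deliberately ignored here.

```python
def emit_part_data(chunk: bytes, boundary: bytes, on_part_data) -> int:
    """Scan one chunk and return how many times on_part_data was called.

    The callback fires only when a complete boundary is found or when the
    chunk is exhausted, never on a bare CRLF.
    """
    calls = 0
    pos = 0
    while pos < len(chunk):
        idx = chunk.find(boundary, pos)
        if idx == -1:
            # Condition 2: the loaded buffer is exhausted.
            on_part_data(chunk, pos, len(chunk))
            calls += 1
            break
        if idx > pos:
            # Condition 1: a complete boundary ends the current part.
            on_part_data(chunk, pos, idx)
            calls += 1
        pos = idx + len(boundary)
    return calls


boundary = b"\r\n--abc"
chunk = b"a\r\nb\r\nc" * 1000 + boundary + b"tail"
pieces = []
calls = emit_part_data(chunk, boundary, lambda d, s, e: pieces.append(d[s:e]))
print(calls)  # the 2000 CRLF pairs in the first part do not add callbacks
```

With this structure the number of callbacks depends on the number of boundaries and chunks, not on how many CRLF pairs the part data contains.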

The significant complication is what happens when a partial boundary match overlaps the end of the loaded data. This was previously addressed with a look-behind buffer, but the buffer is mostly unnecessary: since we are always matching boundary bytes, the look-behind buffer is always just a copy of the boundary. Only the last few bytes may vary (CRLF vs -- depending on whether it is a part or end boundary). However, hitting this condition should be very rare, and it is addressed in the code.
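
A rough sketch of why a count can stand in for the look-behind buffer: if the previous chunk ended with the first `matched` bytes of the boundary, those bytes can be rebuilt from the boundary itself when the match fails. The function `resume_partial_match` and its status strings are hypothetical. The sketch ignores the varying trailing bytes (CRLF vs --) mentioned above, and it does not handle a new match starting inside the replayed bytes.

```python
from __future__ import annotations


def resume_partial_match(boundary: bytes, matched: int, next_chunk: bytes) -> tuple[str, bytes]:
    """Continue a boundary match that was cut off by the end of a chunk.

    matched is the number of boundary bytes already seen at the end of the
    previous chunk (0 < matched < len(boundary)). Returns (status, replay):
      "match"    - the boundary completes at the start of next_chunk
      "partial"  - next_chunk ran out while still matching
      "mismatch" - the held-back bytes were ordinary data; replay holds them
    """
    rest = boundary[matched:]
    if next_chunk.startswith(rest):
        return "match", b""
    if rest.startswith(next_chunk):
        return "partial", b""
    # No stored copy is needed: the held-back bytes are boundary[:matched].
    return "mismatch", boundary[:matched]


boundary = b"\r\n--abc"
print(resume_partial_match(boundary, 2, b"--abc\r\nmore"))
print(resume_partial_match(boundary, 2, b"--a"))
print(resume_partial_match(boundary, 2, b"plain text"))
```

On a mismatch the caller would pass the replayed bytes to on_part_data as ordinary part data and then continue scanning the new chunk.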

Drops the look-behind buffer since the content is always the boundary.

The Boyer-Moore-Horspool algorithm was removed and replaced with Python's built-in `find` method. This appears to be faster, sometimes by an order of magnitude.

Kludex

> I can make a pull request with just this change if you are interested.

All good.

Thanks for the great description, helped a lot on the review.

jhnstrk force-pushed the crlf_data_perf branch from 77742e0 to 0d19c08 Compare April 1, 2024 07:45

jhnstrk added 2 commits April 21, 2024 15:42

Improve parsing content with many cr-lf

473f23c

Drops the look-behind buffer since the content is always the boundary.

Improve performance by using built-in bytes.find.

f196d40

The Boyer-Moore-Horspool algorithm was removed and replaced with Python's built-in `find` method. This appears to be faster, sometimes by an order of magnitude.

jhnstrk force-pushed the crlf_data_perf branch from 0d19c08 to f196d40 Compare April 21, 2024 13:42

Merge branch 'master' into crlf_data_perf

bd04b2f

Kludex approved these changes Sep 28, 2024

Delete unused join_bytes

f08e0c3

Kludex merged commit a790e40 into Kludex:master Sep 28, 2024
6 checks passed

Kludex mentioned this pull request Sep 28, 2024

Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

Closed
