Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a regular expression package to be replaced by another package. #87

Open
aka-ao opened this issue May 14, 2024 · 3 comments
Open

Comments

@aka-ao
Copy link

aka-ao commented May 14, 2024

#59 Related to this PR, the regexp package is very slow.
I suggest creating an interface and changing it so that processing related to regular expressions can be injected externally as a dependency.

@dgoldstein0
Copy link
Contributor

Since that pr, we've added lru caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

I'd be open to considering faster regex implementations if it gives a large benefit, though I'd think it'd make more sense if the regex engine we use was not exposed for dependency injection - if any regex engine is missing features we need or implements a slightly different flavor of regex which ends up having observable behavior differences, that would be problematic.

@masklinn
Copy link

masklinn commented Oct 26, 2024

Since that pr, we've added lru caching to the library, which makes frequently seen user agents parse almost instantly. Is the performance with that really a problem?

While that is true, it is obviously only for user agents which have been seen before and are still in the cache by their second hit. FWIW a while back a dailymotion employee kindly provided a sample of their access logs1, which should be realistic2, it has ~75k entries of which ~20k unique, with a very long tail of "one hit wonders", user agents seen only once and never again (this is relevant because as I learned LRU is pretty bad at one hit wonders at low sizes). On that file, when I investigated different cache algorithms I got the following hit rates at cachesize 1000 (which I understand is what you're using):

belady3 (1000): 60% hit rate
lru (1000): 37% hit rate
s3fifo (1000): 50% hit rate
sieve (1000): 48% hit rate

So even with a cache, the regex performance can be quite impactful.

With that said, rather than switching the regex library what I would recommend is looking if somebody has implemented FilteredRE2 in Go: regexes.yaml has a lot of regexes, there are more than 600 device parsers as of 0.18, a faster regex engine will not make that much of a difference in the end I think.

What FilteredRE2 does is prefilter the regexes, and then try only those which have a chance of matching4. Needing to only try 10-15 regexes is a much more significant gain than trying 600 regexes a bit faster.

An other thing you may want to look at — but this one you'll really have to bench for go specifically as regexp may or may not have this issue — is using Regexp.Match first, and only performing capture (retrieving the submatches) on matches: in both re2 and rust-regex the overhead of capture is very large (2-3x the cost of a match check), so it only takes 1-3 failures for "n match checks followed by one capture" to outperform "n captures". For this reason FilteredRE2 does not support capturing at all, after prefiltering it will only PartialMatch the selected regex: https://github.com/google/re2/blob/main/re2/filtered_re2.cc#L98-L122

Footnotes

  1. https://github.com/ua-parser/uap-python/pull/163#issuecomment-1536412054

  2. although obviously it's realistic traffic for dailymotion, a different site can have just as realistic a traffic with a different pattern

  3. Bélády's algorithm is a non-practical cache algorithm which can see in the future and has perfect knowledge of which entries to reject or evict, it is essentially the best that can be done at that cache size, making it a useful point of reference

  4. basically collecting the atoms (literal strings) each regex requires when building the filter, matching atoms to needles using something like aho-corasick, then reverse-mapping the set of matched atoms to regexes

@amfonelic
Copy link

amfonelic commented Dec 30, 2024

The standard library's regexp is already quite fast for capturing groups and is often the fastest choice when working in Go. While faster libraries like Hyperscan exist, Go lacks a direct Hyperscan-like (streaming DFA) library that also supports capturing-group offsets. Libraries like PCRE or Oniguruma can do that, but it's a toss-up whether they'll actually be faster.

I agree with masklinn BTW

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants