forked from katef/libfsm
Upstream sync #25
Merged
Conversation

This syncs some changes from upstream, mostly bugfixes:
There are extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re: `^a|$[^x]b*`. This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" or "\n", but nothing else. The newline match arises because `$` is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to `$\n`. This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a `$` and reject them as an unsupported PCRE behavior.
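Not part of the PR, but to make the corner case concrete, here is a small standalone PCRE2 check (assuming libpcre2-8 is installed) exercising the pattern from tests/pcre/in48.re. Per the behavior described above, only "a<anything...>" and "\n" should match:

```c
/* demo.c -- check what `^a|$[^x]b*` matches under PCRE2.
 * Build: cc demo.c -lpcre2-8 */
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int errorcode;
	PCRE2_SIZE erroroffset;
	pcre2_code *re = pcre2_compile((PCRE2_SPTR)"^a|$[^x]b*",
	    PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
	if (re == NULL) { return 1; }

	pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
	const struct { const char *label, *subject; } tests[] = {
		{ "\"a\"",   "a"   },	/* matches via ^a                */
		{ "\"abc\"", "abc" },	/* matches via ^a<anything...>   */
		{ "\"\\n\"", "\n"  },	/* matches via $[^x]b* (one \n)  */
		{ "\"b\"",   "b"   },	/* no match                      */
		{ "\"x\"",   "x"   },	/* no match                      */
	};

	for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
		int rc = pcre2_match(re, (PCRE2_SPTR)tests[i].subject,
		    strlen(tests[i].subject), 0, 0, md, NULL);
		printf("%-6s -> %s\n", tests[i].label,
		    rc >= 0 ? "match" : "no match");
	}

	pcre2_match_data_free(md);
	pcre2_code_free(re);
	return 0;
}
```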
These are anchoring edge cases found by fuzzing. They have to do with incomplete pruning of repeated nodes that are unsatisfiable due to anchoring, nesting of anchors inside groups and/or ALT subtrees, etc. The fixes will go in their own commit.
`^(?:$^|a)(?:$^|^)*`: This should be equivalent to `^(?:^$|a)`, because the first group is always either fully anchored (making the second group redundant) or must match 'a' (making anchoring at the start impossible), but that information wasn't being passed properly to the second group. `$[^a]*` was returning AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE while inside the repeat node (`[]*`), so it didn't flag the `^` immediately after. If a repeat node that can match 0 times is unsupported, unsatisfiable, etc., just set its max count to 0 and continue. With this change, analysis finds the `^` and treats the whole regex like `$^|$^`, which reduces to `^$`. The fixes for both of these will be integrated in later commits -- I'm integrating changes from a working branch that had several intermediate steps committed so I could push to another computer for fuzzing.
If a repeat node's subtree returns an error but the node has a min count of 0, prune it by setting its max count to 0. If its error were returned directly, the analysis wouldn't see nodes that appear after it.
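A minimal sketch of that rule. The types and the analyse_subtree() helper are illustrative stand-ins, not libfsm's actual AST structures:

```c
enum analysis_res {
	ANALYSIS_OK,
	ANALYSIS_UNSATISFIABLE,
	ANALYSIS_UNSUPPORTED_PCRE,
};

struct repeat_node {
	unsigned min, max;	/* repetition bounds */
	struct node *subtree;
};

enum analysis_res analyse_subtree(struct node *n);	/* assumed helper */

static enum analysis_res
analyse_repeat(struct repeat_node *r)
{
	enum analysis_res res = analyse_subtree(r->subtree);

	if (res != ANALYSIS_OK && r->min == 0) {
		/* The subtree can never match, but the repeat as a whole
		 * can still legally match zero times: prune it and keep
		 * going, so nodes after it (e.g. a trailing `^`) are
		 * still seen by the analysis. */
		r->max = 0;
		res = ANALYSIS_OK;
	}
	return res;
}
```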
Found while fuzzing.
This should normally be set when allocation fails, but when built for fuzzing, analysis can pre-emptively reject inputs that would need excessive memory (due to large repetition counts, for example), so in that case make sure an error code is set to avoid triggering an assertion later on.
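Schematically, the discipline looks like this; every name here is hypothetical, chosen only to show the shape of the fix:

```c
struct analysis_env {
	int errcode;		/* 0 = no error recorded yet */
	/* ... */
};

#define FUZZER_MAX_REPEAT 1000	/* hypothetical cap for fuzz builds */

static int
check_repeat_count(struct analysis_env *env, unsigned count)
{
#ifdef FUZZING_BUILD_MODE	/* assumed build-mode guard */
	if (count > FUZZER_MAX_REPEAT) {
		if (env->errcode == 0) {
			/* Record *some* error code, so a later
			 * "failed but no error set" assertion can't fire. */
			env->errcode = -1;
		}
		return 0;	/* reject the input */
	}
#endif
	(void)env;
	(void)count;
	return 1;		/* accept */
}
```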
Lots of other linking bugs have been indirect consequences of an ancestor ALT node changing its top-down x,y linkage to global or to the global start loop. Avoiding modification here eliminates a lot of noise. While trying to simultaneously satisfy the pcre-anchor tests 92, 99, and 100, I took a step back, thought through the system as a whole, and realized this was where their linkage was getting messed up.
Rather than modifying this directly, add a function, which has its own asserts and logging.
Add probe count logging to `to_set_htab_check`. Switch from adding each PHI_64*(id+1) to XORing, which has better collision behavior. This change makes `time ./re -rpcre 'a[^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz]'` go from taking ~14 seconds to ~6 when I run it locally, because adding leads to much higher probe counts on average.
katef's testing with words.sh found some suspicious timing; profiling with callgrind showed there's still some kind of bad collision behavior when computing PHI64(a) ^ PHI64(b) with exactly two IDs. It's probably still a bad idea to combine multiple Fibonacci hashes, even XORing rather than adding. Changing to xorshift* (another fast, high-quality hash function for 64-bit ints) immediately makes the issue go away, so do that.
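For reference, a sketch of the replacement. The constants are the standard xorshift64* parameters; whether libfsm's hash_id matches this exactly is an assumption:

```c
#include <stdint.h>

/* xorshift64*: three xor-shifts mix all 64 input bits before the final
 * multiply, unlike a bare Fibonacci (golden-ratio) multiply, whose XOR
 * combinations collided badly for pairs of IDs. */
static uint64_t
hash_id(uint64_t x)
{
	x ^= x >> 12;
	x ^= x << 25;
	x ^= x >> 27;
	return x * UINT64_C(0x2545F4914F6CDD1D);
}
```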
My intent was to have all syntax errors go through the same actions driven by parser.sid. But in this case SID's generated parser bails out immediately on TOK_UNKNOWN, and we never reach an ## alt. So we never reach <make-ast>, and thus never <err-syntax> about it. That's what the assertion was about: the generated parser should never return with a NULL AST. My workaround is just to recreate the equivalent syntax error at the top level, which isn't very satisfying, but it does what we need here. Spotted by @cinco-de-mayonnaise, thank you.
- hash_id should take a uint64_t argument rather than unsigned.
- Instead of adding the two IDs, or hashing them separately and combining, pack both into hash_id's uint64_t argument, since each is a 32-bit ID. Further experimentation supports that this has better collision behavior.
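In code, the packing amounts to the following (reusing hash_id() from the xorshift* sketch above):

```c
#include <stdint.h>

/* Both inputs are 32-bit IDs, so one 64-bit hash call covers the pair
 * exactly, rather than adding or combining two separate hashes. */
static uint64_t
hash_id_pair(uint32_t a, uint32_t b)
{
	return hash_id(((uint64_t)a << 32) | (uint64_t)b);
}
```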
- Add a function to hash a series of IDs, rather than doing it with a loop that combines the intermediate hashes in inconsistent ways.
- Add a #define to log the probe count and a limit to assert on, and rework the hashing code's control flow for tracking/asserting on the probe count.
- Remove `hash_fnv1a_64` from a logging statement (normally not compiled); since that was its only remaining use, remove the FNV1a functions from hash.h.
- Remove FSM_PHI_32 and FSM_PHI_64, now unused.
- parser.act still uses 32-bit FNV1a, but that's for character-by-character hashing, so it still makes sense there, and it has its own copy of fnv1a.
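A sketch of the first two items. The folding scheme and the probe limit are assumptions for illustration, not libfsm's exact code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROBES 64	/* illustrative assert limit */

static uint64_t hash_id(uint64_t x);	/* xorshift*, as sketched above */

/* One consistent way to hash a series of IDs: fold each ID into the
 * running hash, then re-mix. */
static uint64_t
hash_ids(size_t count, const uint64_t *ids)
{
	uint64_t h = 0;
	for (size_t i = 0; i < count; i++) {
		h = hash_id(h ^ ids[i]);
	}
	return h;
}

/* Open-addressing lookup with probe-count tracking. bucket_count must
 * be a power of two; 0 marks an empty bucket. */
static int
htab_contains(const uint64_t *buckets, size_t bucket_count, uint64_t key)
{
	const size_t mask = bucket_count - 1;
	size_t probes = 0;

	for (size_t pos = hash_id(key) & mask;; pos = (pos + 1) & mask) {
		probes++;
		assert(probes <= MAX_PROBES);
		if (buckets[pos] == key) { return 1; }
		if (buckets[pos] == 0) { return 0; }
	}
}
```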
Ranges using named endpoints are currently rejected with "Unsupported operator".
Spotted by ASAN.
I'm still experimenting with the limit, but it's common for sort_and_dedup_dst_buf to be called with a fairly small input array. The bitset approach's cost is proportional to the delta between the max and min values, so when there are only a few entries it may be faster to do an insertion sort instead -- 16 entries will probably fit in a single cache line. Changing SMALL_INPUT_LIMIT changes the cutoff; commenting it out disables that approach outright. Setting SMALL_INPUT_CHECK to 1 enables a pass at the end that verifies the array is actually sorted and unique. Also, check upfront whether the input array is already sorted and unique. Experiments so far suggest that this doesn't pay off very often, but it's very cheap since we're already sweeping over the input for the min and max values.
I benchmarked with a range of values from 8 to 4096; somewhere around 32-256 worked particularly well, and it started to become slower beyond that, so set it to 32 for now.
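A sketch of the small-input path; the element type and helper name are illustrative, and the cutoff value here just follows the cache-line remark above (the exact value is the tunable under discussion):

```c
#include <stddef.h>
#include <stdint.h>

#define SMALL_INPUT_LIMIT 16	/* cutoff; ~one cache line of IDs */

/* Sort and dedup in place; returns the new count. For inputs at or
 * under SMALL_INPUT_LIMIT this avoids the bitset sweep, whose cost is
 * proportional to (max - min). */
static size_t
small_sort_and_dedup(uint64_t *buf, size_t count)
{
	/* insertion sort: cheap for tiny arrays */
	for (size_t i = 1; i < count; i++) {
		uint64_t v = buf[i];
		size_t j = i;
		while (j > 0 && buf[j - 1] > v) {
			buf[j] = buf[j - 1];
			j--;
		}
		buf[j] = v;
	}

	/* drop adjacent duplicates */
	if (count == 0) { return 0; }
	size_t w = 1;
	for (size_t r = 1; r < count; r++) {
		if (buf[r] != buf[w - 1]) {
			buf[w++] = buf[r];
		}
	}
	return w;
}
```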
…ons-during-ast-analysis Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one.
This was adapted from the group capture integration branch, but it isn't specific to group capture. Currently, metadata on end states does not prohibit merging, which can lead to false positives when using endids to detect which of the original regexes are matching after several regexes are compiled and unioned into a single DFA. This adds a pass before the main minimisation loop that splits the ECs of end states with distinct sets of end IDs, which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat, since it was previously checking that the end states WERE merged. Unfortunately the other checks are somewhat weakened -- I removed the part checking that `endids[j] == j+1`, because that ordering isn't guaranteed anymore: the patterns aren't anchored, so some examples will match more than one of them, and the endids collected can be nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which checks the minimal case: when the two regexes /^abc$/ and /^ab*c$/ are combined and a distinct endid is set on each, matching "ac", "abc", and "abbc" gets one or both endids as appropriate.

Ideally, we'd have a property test checking that any arbitrary set of regex patterns produces the same mapping of inputs to endids whether the patterns are run individually or combined into a single FSM, determinised, and minimised.
These got missed. Also, rename EXPENSIVE_INTEGRITY_CHECKS in minimise.c to EXPENSIVE_CHECKS and move the <fsm/print.h> header inclusion.
If two end states have distinct sets of endids associated with them, mark them as distinguishable before entering the main fixpoint loop. This brings the results from the test oracle back in line with fsm_minimise.
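A sketch of that pre-pass; get_endids() and mark_distinguishable() are stand-ins for the actual accessors, and the pairwise loop is one simple way to realise the splitting described above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t state_id;
typedef uint32_t end_id;

/* assumed accessors */
const end_id *get_endids(state_id s, size_t *count);
void mark_distinguishable(state_id a, state_id b);

static bool
same_endid_set(state_id a, state_id b)
{
	size_t na, nb;
	const end_id *ea = get_endids(a, &na);
	const end_id *eb = get_endids(b, &nb);
	/* endid arrays assumed sorted, so set equality is a memcmp */
	return na == nb && memcmp(ea, eb, na * sizeof *ea) == 0;
}

/* Before the fixpoint loop: any two end states whose endid sets differ
 * must never end up in the same equivalence class. */
static void
split_end_states(const state_id *ends, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		for (size_t j = i + 1; j < n; j++) {
			if (!same_endid_set(ends[i], ends[j])) {
				mark_distinguishable(ends[i], ends[j]);
			}
		}
	}
}
```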
Here I'd assumed #0 would always be present at the root, but that's actually only the case for dialects that provide capturing groups. Instead of switching on the dialect, I'm just coding this to treat #0 as optional. This does have the semantic effect of removing the "whatever we parsed, it wasn't a literal" part, but I actually can't find a way to reach that path anyway.
Minimisation should not combine end states with distinct sets of endids.
My thinking here is that capturing the entire expression is a property of the runtime, regardless of whether a particular dialect offers syntax for group capture for subexpressions. Practically, it seems like a user might want to know what a pattern matched (i.e. ignoring the unanchored prefix and suffix).
Break a dependency on group #0 for re_is_literal()
This adds an extra parameter to `re_strings_add_str` and `re_strings_add_raw` that (if non-NULL) will associate a single endid with the string being added. When `re_strings_build` constructs the DFA, it will produce a separate end state for each endid. This needs further testing with multiple overlapping patterns. When multiple literal strings appear in the input, only the latest match will be reported.
This uses a `struct state_set`, since sizeof(fsm_state) == sizeof(fsm_end_id_t) and it's probably not worth making a separate ADT just for these. The second test checks that duplicated strings get all their endids set; the previous implementation (a single endid, or ENDID_NONE) dropped all but the last endid defined.
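A usage sketch of the new parameter. The prototypes here are paraphrased from the description above, so treat the exact signatures and header path as assumptions rather than the final API:

```c
#include <re/strings.h>	/* assumed header location */
#include <fsm/fsm.h>

struct fsm *
build_matcher(void)
{
	struct re_strings *g = re_strings_new();
	fsm_end_id_t id_foo = 1, id_bar = 2;

	/* a non-NULL final argument associates an endid with the string */
	re_strings_add_str(g, "foo", &id_foo);
	re_strings_add_str(g, "bar", &id_bar);
	/* a duplicated string accumulates all of its endids */
	re_strings_add_str(g, "foo", &id_bar);

	struct fsm *fsm = re_strings_build(g, NULL, RE_STRINGS_ANCHOR_LEFT);
	re_strings_free(g);
	return fsm;
}
```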
The generated functions for FSM_IO_STR and FSM_IO_PAIR didn't include a `void *opaque` parameter for passing along other state, so add one. For example, this could be used to pass information about which endids matched.
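Schematically, the shape of the change for FSM_IO_STR (the function and struct names are illustrative, and the declaration stands in for code emitted by the generator):

```c
/* before: int fsm_main(const char *s);
 * after:  the caller can thread per-match state through without globals */
int fsm_main(const char *s, void *opaque);	/* generated elsewhere */

struct match_log {
	unsigned seen;	/* e.g. a record of which endids matched */
};

void
example(void)
{
	struct match_log log = { 0 };
	(void)fsm_main("abc", &log);
}
```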
This will also need katef#465.
…odegen-in-IO-STR src/libfsm/print/c.c: Add opaque argument to generated DFA functions.
Add endid support to `re_strings`.
katef approved these changes on May 8, 2024.