Sync from upstream katef/libfsm main #27

silentbicycle · 2024-10-09T21:19:48Z

Sync from upstream. Subsequent changes will depend on interfaces added here.

queue: Fix a read past the end of the queue.

Add fsm_intersect_charset(), fsm -U

These are used for reporting errors, and the only error here is reported from within the <count-range> action. In fact I think all errors involving lexical positions are reported within actions (i.e. during parse), because they are all syntax errors. And so I think we never need to annotate the AST with positions.

This also removes struct ast_pos, with no remaining uses.

Remove lexical position annotations for AST nodes

My thinking here is that we don't need to categorise these independently from anything else we consider unsupported, because to the caller the reason something is unsupported doesn't matter. This way there's only one situation for a caller to keep track of, and in particular to not need to remember to update whenever we introduce more unsupported things.

This also means that many existing tests will exercise fsm_vacuum.

Consolidate RE_EUNSUPPORTED syntax errors

Add fsm_vacuum, which reduces the state array when over-allocated.

Add fsm_new_statealloc()

This is a follow-up to katef#465:
- Also add the extra `void *opaque` to vmc's codegen.
- Add a `(void)` cast to suppress warnings if the extra opaque void pointer isn't used by the generated code.

…nd-warning Add `void* opaque` to vmc codegen too, disable unused warning.

This matches the standard behaviour. I'd disallowed it to avoid confusion, but in rx I do actually want to realloc down to nothing and return NULL when size is 0. And that's what free() does, but it seems cumbersome to have a conditional around that in the caller.
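A minimal sketch of the behaviour described above, assuming a wrapper that exits on allocation failure. The name `xrealloc_sketch` is hypothetical and this is not the library's implementation. C implementations differ in what `realloc(p, 0)` returns, so the sketch makes the size-0 case explicit instead of relying on it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical wrapper (the name is an assumption, not the library's API):
 * a size of 0 frees the block and returns NULL, as a call to free() would;
 * any allocation failure exits.
 */
static void *
xrealloc_sketch(void *p, size_t sz)
{
	void *q;

	if (sz == 0) {
		free(p);   /* free(NULL) is a no-op, so p may be NULL here */
		return NULL;
	}

	q = realloc(p, sz);
	if (q == NULL) {
		perror("realloc");
		exit(EXIT_FAILURE);
	}

	return q;
}
```

With this shape a caller can shrink a buffer to nothing with `buf = xrealloc_sketch(buf, 0);` and needs no conditional around `free()`.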

There's no need to have xstrdup() behave differently here, it's just confusing.

Add xstrndup

This doesn't help for katef#317, but whatever the solution is there, asserting about it is the wrong thing to do. Spotted by @classabbyamp, thank you

I have extremely broken the iprange example. I am very unsure what's going on with this program.

A few small bugfixes

Spotted by Dan Kegel, thank you.

rx, a program for compiling sets of regular expressions

Happier cache lines

Most of these are used in fsm_detect_required_characters. Also add #includes for standard headers being used.

This inspects the DFA to determine which characters must appear in any matching input.

fsm_endid_get's id_buf_count argument is expected to "have enough cells (according to id_buf_count)", but if it has more than enough, stale data can get sorted into the result. Add a test, tests/endids/endids_reused_buffer.c

bugfix: fsm_endid_get should sort with result count, not buffer size.
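The class of bug described here can be sketched in isolation. The helper below is hypothetical (it is not `fsm_endid_get`, and its signature is an assumption); it only illustrates why the sort length must be the number of ids written rather than the capacity of the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_unsigned(const void *pa, const void *pb)
{
	const unsigned a = *(const unsigned *) pa;
	const unsigned b = *(const unsigned *) pb;
	return a < b ? -1 : a > b ? 1 : 0;
}

/*
 * Hypothetical helper: copy `count` ids into a caller-provided buffer
 * with room for `buf_count` cells, then sort them.
 */
static void
copy_ids_sorted(const unsigned *ids, size_t count,
	unsigned *buf, size_t buf_count)
{
	assert(buf != NULL && buf_count >= count);

	if (count == 0) {
		return;
	}
	memcpy(buf, ids, count * sizeof *buf);

	/* Sort only the cells just written. Passing buf_count here instead
	 * would pull stale cells beyond `count` into the sorted prefix. */
	qsort(buf, count, sizeof *buf, cmp_unsigned);
}
```

With a reused buffer holding `{0, 0, 0, 0}` and ids `{9, 7}`, the first two cells become `7, 9` and the rest are untouched. Sorting all four cells would instead have moved the stale zeros to the front.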

Update to Unicode 16.0

Add a regression test showing possible endid false negatives when FSMs were trimmed (called from fsm_minimise) without updating endids.

`struct bm` isn't part of the public API, so use a uint64_t[4] instead, and add an optional parameter for the count. Update the tests.

Move the end_id array and its count into a struct for state metadata, and rename access throughout to end_ids and end_id_count. Upcoming changes for eager output IDs will soon be passing more info to all of these callbacks, but only callers making use of those fields need to care. Instead of making callers add more `(void) param;` declarations all over the place to avoid warnings, just pass in a metadata struct pointer. Also, "count" is a pretty generic name and what it refers to will soon be ambiguous. This should not be a functional change on its own.
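A rough sketch of the design choice, under stated assumptions: only the field names `end_ids` and `end_id_count` come from the description above. The struct name, the callback type, the `unsigned` id type, and the example hook are all hypothetical and are not the print API's actual declarations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct; only the two field names come from the text. */
struct state_metadata_sketch {
	const unsigned *end_ids;
	size_t end_id_count;
	/* later fields can be added here without changing any callback's
	 * parameter list */
};

/* Hypothetical callback type taking the boxed metadata. */
typedef int accept_hook_sketch(const struct state_metadata_sketch *md,
	void *opaque);

/* A callback that only cares about the count ignores the rest of md,
 * with no `(void) param;` lines needed for fields it does not use. */
static int
sum_end_id_count(const struct state_metadata_sketch *md, void *opaque)
{
	size_t *total = opaque;

	assert(md != NULL && total != NULL);
	*total += md->end_id_count;
	return 1;
}
```

Adding a second id and count pair to the struct later would leave `sum_end_id_count` compiling unchanged, which is the point of passing a pointer rather than separate parameters.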

It's only used for the assertion.

It's already in fsm/fsm.h.

The IR struct is about to get another id & count pair. This loses storing the count as a 31-bit bitfield, but if the goal for that is saving memory then the ids array allocation could be replaced with a struct that contains the count, and then each IR without endids will save more space than the current approach.

…-must-appear-in-input-to-match Add fsm_detect_required_characters

…ld-remap-endids fsm_compact_states must remap endids, to avoid dangling references.

…data-args-into-struct print API: Box end_ids and end_id_count in a struct for callbacks.

silentbicycle and others added 30 commits May 15, 2024 15:09

queue: Fix a read past the end of the queue.

1a3e038

Merge pull request katef#467 from katef/sv/fix-queue-memmove-overrun

18816b1

queue: Fix a read past the end of the queue.

First cut at introducing fsm_intersect_charset(), exposed as fsm -U.

2c78478

Merge pull request katef#470 from katef/kate/intersect-charset

073305b

Add fsm_intersect_charset(), fsm -U

Refactoring: No need to carry lexical positions in the ast_expr .rang…

07e8fc1

…e node.

Whitespace.

bf46a2c

Refactoring: No need to mark group start/end for expr in general.

51c42fa

Refactoring: No need for <mark-expr>.

f0da88d

This also removes struct ast_pos, with no remaining uses.

Add fsm_vacuum, which reduces the state array when over-allocated.

94ffdcc

Whitespace.

68c5710

Refactoring: No need to keep dialect-dependent parser state here.

e9d190a

Move FSM_DEFAULT_STATEALLOC to fsm.c.

3435f50

Merge pull request katef#471 from katef/kate/ast-pos

80e9311

Remove lexical position annotations for AST nodes

Fabricate error reporting for RE_EUNSUPPORTED through AST analysis.

8330d57

Whitespace.

61772d0

fsm_vacuum: Change return type to bool.

8feff14

Call fsm_vacuum at the end of fsm_minimise.

6147860

This also means that many existing tests will exercise fsm_vacuum.

Merge pull request katef#473 from katef/kate/consolidate-unsupported

201637d

Consolidate RE_EUNSUPPORTED syntax errors

Merge pull request katef#472 from katef/sv/add_fsm_vacuum

1732de7

Add fsm_vacuum, which reduces the state array when over-allocated.
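The general idea of reducing an over-allocated array can be sketched as follows. This is not `fsm_vacuum` itself: the struct, the function name, and the "at most half in use" threshold are assumptions for illustration. The `bool` return follows the commit that changed the return type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array standing in for a state array. */
struct state_array {
	int *states;
	size_t count;   /* cells in use */
	size_t alloc;   /* cells allocated */
};

/*
 * Shrink the allocation when at most half of it is in use (the threshold
 * is an assumption). Returns false only if realloc fails, in which case
 * the array is left unchanged and still valid.
 */
static bool
state_array_vacuum(struct state_array *a)
{
	size_t want;
	int *p;

	assert(a != NULL && a->count <= a->alloc);

	want = a->count > 0 ? a->count : 1;
	if (want > a->alloc / 2) {
		return true;   /* not over-allocated enough to bother */
	}

	p = realloc(a->states, want * sizeof *p);
	if (p == NULL) {
		return false;
	}
	a->states = p;
	a->alloc = want;
	return true;
}
```

Calling it a second time on an already tight array is a no-op that returns `true`.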

Add fsm_new_statealloc(), to pre-allocate to a known number of states.

748b7af

Merge pull request katef#474 from katef/kate/fsm_new_statealloc

65d8624

Add fsm_new_statealloc()

Add void* opaque to vmc codegen too, disable unused warning.

a3fb7d7

This is a follow-up to katef#465:
- Also add the extra `void *opaque` to vmc's codegen.
- Add a `(void)` cast to suppress warnings if the extra opaque void pointer isn't used by the generated code.
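For readers unfamiliar with the idiom, a hand-written sketch of a matcher with an unused context pointer might look like the following. The function name and the matching logic (accept exactly "ab") are assumptions for illustration and are not actual vmc output.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical shape of a generated matcher over the byte range [b, e).
 * The name and logic are illustrative only.
 */
static int
match_ab(const char *b, const char *e, void *opaque)
{
	/* This matcher has no use for the caller's context pointer;
	 * the cast marks it as deliberately unused, so compilers do not
	 * warn about an unused parameter. */
	(void) opaque;

	assert(b != NULL && e != NULL && b <= e);

	return e - b == 2 && b[0] == 'a' && b[1] == 'b';
}
```

Callers that do not need a context can pass `NULL`, as in `match_ab(s, s + 2, NULL)`.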

Merge pull request katef#475 from katef/sv/vmc-codegen-opaque-param-a…

e6ba81b

…nd-warning Add `void* opaque` to vmc codegen too, disable unused warning.

xstrndup()

88877cd

Normalization; all x*() xalloc.h interfaces exit on error.

4ec219a

There's no need to have xstrdup() behave differently here, it's just confusing.

No need to depend on POSIX here.

825a621

Merge pull request katef#476 from katef/kate/xstrndup

2483417

Add xstrndup
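A minimal sketch of what an exit-on-error `xstrndup` could look like when written with ISO C only, in line with the commits above. The name `xstrndup_sketch` and the body are assumptions, not the code added in this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for an xstrndup(): copy at most n bytes of s,
 * always NUL-terminate, and exit rather than return NULL on failure.
 * Uses memchr/malloc/memcpy instead of POSIX strndup().
 */
static char *
xstrndup_sketch(const char *s, size_t n)
{
	const char *end;
	size_t len;
	char *copy;

	assert(s != NULL);

	/* Stop at the terminator if it occurs within the first n bytes. */
	end = memchr(s, '\0', n);
	len = end != NULL ? (size_t) (end - s) : n;

	copy = malloc(len + 1);
	if (copy == NULL) {
		perror("malloc");
		exit(EXIT_FAILURE);
	}
	memcpy(copy, s, len);
	copy[len] = '\0';
	return copy;
}
```

Because the function never returns `NULL`, callers can use the result directly and only need to `free()` it when done.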

katef and others added 26 commits August 25, 2024 20:43

No need for calloc here.

f1ca8e8

Stray assertion.

5010a40

This doesn't help for katef#317, but whatever the solution is there, asserting about it is the wrong thing to do. Spotted by @classabbyamp, thank you

Update for hooks & options API changes.

ebbcb35

I have extremely broken the iprange example. I am very unsure what's going on with this program.

Merge pull request katef#490 from katef/kate/bugfixin

632ddd3

A few small bugfixes

Spelling.

8f9ea4d

Spotted by Dan Kegel, thank you.

Merge pull request katef#488 from katef/kate/rx

dc9721f

rx, a program for compiling sets of regular expressions

Merge pull request katef#491 from katef/kate/happy-cache-lines

81b14f8

Happier cache lines

bitmap.h: Add operations: copy, intersect, union, any, unset..

981223f

Most of these are used in fsm_detect_required_characters. Also add #includes for standard headers being used.

Add fsm_detect_required_characters.

fa24439

This inspects the DFA to determine which characters must appear in any matching input.
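One simple way to compute this property on a toy DFA is sketched below: a character is required if no accepting state can be reached once every edge labelled with it is ignored. This is a brute-force illustration and not necessarily how `fsm_detect_required_characters` works. The `toy_dfa` struct and all function names are hypothetical and unrelated to libfsm's own types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_STATES = 16, NO_EDGE = -1 };

/* Toy DFA for illustration only; this is not libfsm's struct fsm. */
struct toy_dfa {
	size_t state_count;           /* state 0 is the start state */
	bool accepting[MAX_STATES];
	int next[MAX_STATES][256];    /* NO_EDGE where no transition exists */
};

static void
toy_dfa_init(struct toy_dfa *d, size_t state_count)
{
	size_t s;
	int c;

	assert(d != NULL && state_count > 0 && state_count <= MAX_STATES);
	d->state_count = state_count;
	for (s = 0; s < MAX_STATES; s++) {
		d->accepting[s] = false;
		for (c = 0; c < 256; c++) {
			d->next[s][c] = NO_EDGE;
		}
	}
}

/* Can an accepting state be reached from the start without using `banned`? */
static bool
accepts_without(const struct toy_dfa *d, int banned)
{
	bool seen[MAX_STATES] = { false };
	size_t stack[MAX_STATES];   /* each state is pushed at most once */
	size_t depth = 0;

	seen[0] = true;
	stack[depth++] = 0;

	while (depth > 0) {
		const size_t s = stack[--depth];
		int c;

		if (d->accepting[s]) {
			return true;
		}
		for (c = 0; c < 256; c++) {
			const int to = d->next[s][c];
			if (c == banned || to == NO_EDGE || seen[to]) {
				continue;
			}
			seen[to] = true;
			stack[depth++] = (size_t) to;
		}
	}
	return false;
}

/* Mark required[c] when no accepted input can avoid character c.
 * Returns the number of required characters. */
static size_t
required_characters(const struct toy_dfa *d, bool required[256])
{
	size_t count = 0;
	int c;

	for (c = 0; c < 256; c++) {
		required[c] = !accepts_without(d, c);
		count += required[c];
	}
	return count;
}
```

For a DFA matching `ab|ac`, only `a` is required, since either `b` or `c` can be avoided. A DFA whose start state accepts (matching "") yields a count of 0, and in this sketch a DFA that accepts nothing reports every character as vacuously required.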

endids_reused_buffer.c: Fix memory leak in test. (free fsm.)

dec6ec0

Merge pull request katef#493 from katef/sv/bugfix-endid-qsort-count

521789f

bugfix: fsm_endid_get should sort with result count, not buffer size.

Update to Unicode 16.0

5705d43

Merge pull request katef#495 from data-man/ucd16

0f0dbb6

Update to Unicode 16.0

fsm_compact_states must remap endids, to avoid dangling references.

9aa1b9a

Add a regression test showing possible endid false negatives when FSMs were trimmed (called from fsm_minimise) without updating endids.
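The dangling-reference problem can be illustrated with a small sketch: when states are dropped and the survivors renumbered, any side table that refers to states by index has to be rewritten through the same old-to-new mapping. The struct and function below are hypothetical and do not reflect how libfsm stores endids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_STATE ((size_t) -1)

/* Hypothetical side-table entry: an end id attached to a state by index. */
struct endid_entry {
	size_t state;
	unsigned id;
};

/*
 * Build the old->new mapping for a compaction that drops states where
 * keep[i] is false, and rewrite the side table through the same mapping.
 * Entries for dropped states are removed. Returns the new entry count.
 */
static size_t
compact_and_remap(const bool *keep, size_t state_count, size_t *mapping,
	struct endid_entry *entries, size_t entry_count)
{
	size_t next = 0, kept = 0, i;

	assert(keep != NULL && mapping != NULL);

	for (i = 0; i < state_count; i++) {
		mapping[i] = keep[i] ? next++ : NO_STATE;
	}

	for (i = 0; i < entry_count; i++) {
		assert(entries[i].state < state_count);
		if (mapping[entries[i].state] == NO_STATE) {
			continue;   /* the state is gone, so is its entry */
		}
		entries[kept].state = mapping[entries[i].state];
		entries[kept].id = entries[i].id;
		kept++;
	}
	return kept;
}
```

Skipping the second loop is the failure mode: the entries would keep pointing at old indices, so lookups by the new state numbers would miss ids or find the wrong ones.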

Change fsm_detect_required_characters interface.

3295240

`struct bm` isn't part of the public API, so use a uint64_t[4] instead, and add an optional parameter for the count. Update the tests.
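A `uint64_t[4]` gives 256 bits, one per byte value. The helpers below sketch how a caller might read such a bitmap; the bit layout (bit `c % 64` of word `c / 64`) and the function names are assumptions, not something this PR specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 256 bits, one per byte value, stored as four 64-bit words.
 * The layout is an assumption: bit (c % 64) of word (c / 64). */
static void
charmap_set(uint64_t map[4], unsigned char c)
{
	assert(map != NULL);
	map[c / 64] |= (uint64_t) 1 << (c % 64);
}

static int
charmap_get(const uint64_t map[4], unsigned char c)
{
	assert(map != NULL);
	return (int) ((map[c / 64] >> (c % 64)) & 1);
}

/* Count the set bits, i.e. how many characters are marked. */
static size_t
charmap_count(const uint64_t map[4])
{
	size_t count = 0;
	size_t i;

	assert(map != NULL);
	for (i = 0; i < 4; i++) {
		uint64_t w = map[i];
		while (w != 0) {
			w &= w - 1;   /* clear the lowest set bit */
			count++;
		}
	}
	return count;
}
```

A plain array like this keeps the interface free of internal types, and a separately reported count saves callers from recomputing it as `charmap_count` does here.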

fsm_detect_required_characters: Set count to 0 when matching "".

5dcdeb6

Move detect_required1.c tests to .re/.txt files. Add -q chars to fsm.

208de5f

fsm.1.xml: Add basic info about -S <step_limit> and -q requiredchars.

aab4422

Move call to fsm_countstates inside #ifndef NDEBUG.

8487fcb

It's only used for the assertion.

libfsm.syms: Expose fsm_new_statealloc.

2d61cf9

It's already in fsm/fsm.h.

Merge pull request katef#492 from katef/sv/determine-which-characters…

eb576b8

…-must-appear-in-input-to-match Add fsm_detect_required_characters

Merge pull request katef#496 from katef/sv/bugfix-compact-states-shou…

2474c51

…ld-remap-endids fsm_compact_states must remap endids, to avoid dangling references.

Merge pull request katef#497 from katef/sv/move-codegen-callback-meta…

0961848

…data-args-into-struct print API: Box end_ids and end_id_count in a struct for callbacks.

silentbicycle requested a review from katef October 9, 2024 21:19

katef approved these changes Oct 9, 2024

katef merged commit 8b53634 into main Oct 9, 2024
349 checks passed

katef deleted the sv/upstream-main branch October 9, 2024 21:25
