Automa.CodeGenContextType

Create a CodeGenContext (ctx), a struct that stores options for Automa code generation. Ctxs are used for Automa's various code generator functions. They currently take the following options (more may be added in future versions).

Example

julia> ctx = CodeGenContext(generator=:goto, vars=Variables(buffer=:tbuffer));
 
 julia> generate_code(ctx, compile(re"a+")) isa Expr
true
source
Automa.VariablesType

Struct used to store variable names used in generated code. Contained in a CodeGenContext. Create a custom Variables for your CodeGenContext if you want to customize the variables used in Automa codegen, typically if you have conflicting variables with the same name.

Automa generates code with the following variables, shown below with their default names:

  • p::Int: current position of data
  • p_end::Int: end position of data
  • is_eof::Bool: whether p_end marks the end of the file stream
  • cs::Int: current state
  • data::Any: input data
  • mem::SizedMemory: Memory wrapping data
  • byte::UInt8: current byte being read from data
  • buffer::TranscodingStreams.Buffer: (generate_reader only)

Example

julia> ctx = CodeGenContext(vars=Variables(byte=:u8));
 
 julia> ctx.vars.byte
:u8
source

Finally, when we reach the newline p = 13, the whole header is in the buffer, and so data[@markpos():p-1] will correctly refer to the header (now, 1:12).

content: abcdefghijkl\nA
 mark:    ^
p = 13               ^

Remember to update the mark, or to clear it with @unmark() in order to be able to flush data from the buffer afterwards.
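For example, action code for a streaming reader that extracts a header could combine the pseudomacros like this (a minimal sketch: the action names :mark_header and :emit_header and the surrounding machine are assumptions for illustration):

actions = Dict{Symbol, Expr}(
    # On entering the header, mark its first byte so it stays in the buffer
    :mark_header => :(@mark()),
    # On exiting the header, bytes markpos():p-1 are guaranteed to be buffered
    :emit_header => quote
        header = String(data[@markpos():p-1])
        @unmark()  # allow the buffer to be flushed again
    end,
)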

Reference

Automa.generate_readerFunction
generate_reader(funcname::Symbol, machine::Automa.Machine; kwargs...)

Generate a streaming reader function of the name funcname from machine.

The generated function consumes data from a stream passed as the first argument and executes the machine while filling the data buffer.

This function returns an expression object of the generated function. The user needs to evaluate it in the module where the generated function is needed.

Keyword Arguments

  • arguments: Additional arguments funcname will take (default: ()). The default signature of the generated function is (stream::TranscodingStream,), but it is possible to supply more arguments to the signature with this keyword argument.
  • context: Automa's codegenerator (default: Automa.CodeGenContext()).
  • actions: A dictionary of action code (default: Dict{Symbol,Expr}()).
  • initcode: Initialization code (default: :()).
  • loopcode: Loop code (default: :()).
  • returncode: Return code (default: :(return cs)).
  • errorcode: Executed if cs < 0 after loopcode (default error message)

See the source code of this function to see what the generated code looks like.

source
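To illustrate, here is a minimal sketch that counts newlines in a stream; the regex, the action name :check_newline, and the use of NoopStream are choices made for this example, not requirements of the API:

using Automa
using TranscodingStreams

regex = onall!(re"([a-z]*\n)*", :check_newline)
machine = compile(regex)

eval(generate_reader(
    :count_lines,
    machine;
    initcode = :(lines = 0),
    actions = Dict{Symbol, Expr}(:check_newline => :(byte == UInt8('\n') && (lines += 1))),
    returncode = :(return lines),
))

count_lines(NoopStream(IOBuffer("abc\nde\n")))  # expected: 2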
Automa.@escapeMacro
@escape()

Pseudomacro. When encountered during Machine execution, the machine will stop executing. This is useful to interrupt the parsing process, for example to emit a record during parsing of a larger file. p will be advanced as normal, so if @escape is hit on B during parsing of "ABC", the next byte will be C.

source
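A sketch of a typical use: an action that records a hit, then stops the machine early (the action name :stop and the variable found are illustrative, not part of Automa's API):

actions = Dict{Symbol, Expr}(
    :stop => quote
        found = true
        @escape()  # stop the machine here; p still advances past the current byte
    end,
)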
Automa.@markMacro
@mark()

Pseudomacro, to be used with IO-parsing Automa functions. This macro will "mark" the position of p in the current buffer. The marked position will not be flushed from the buffer after being consumed. For example, Automa code can call @mark() at the beginning of a large string, then when the string is exited at position p, it is guaranteed that the whole string resides in the buffer at positions markpos():p-1.

source
Automa.@unmarkMacro
unmark()

Pseudomacro. Removes the mark from the buffer. This allows all previous data to be cleared from the buffer.

See also: @mark, @markpos

source
Automa.@markposMacro
markpos()

Pseudomacro. Get the position of the mark in the buffer.

See also: @mark, @markpos

source
Automa.@bufferposMacro
bufferpos()

Pseudomacro. Returns the integer position of the current TranscodingStreams buffer (only used with the generate_reader function).

Example

# Inside some Automa action code
 @setbuffer()
 description = sub_parser(stream)
p = @bufferpos()

See also: @setbuffer

source
Automa.@relposMacro
relpos(p)

Automa pseudomacro. Return the position of p relative to @markpos(). Equivalent to p - @markpos() + 1. This can be used to mark additional points in the stream when the mark is set, after which their absolute position can be retrieved using abspos(x).

Behaviour is undefined if mark has not yet been set.

Example usage:

# In one action
 identifier_pos = @relpos(p)
 
 # Later, in a different action
identifier = data[@abspos(identifier_pos):p]

See also: @abspos

source
Automa.@absposMacro
abspos(p)

Automa pseudomacro. Used to obtain the actual position of a relative position obtained from @relpos. See @relpos for more details.

Behaviour is undefined if mark has not yet been set.

source
Automa.@setbufferMacro
setbuffer()

Updates the buffer position to match p. The buffer position is synchronized with p before and after calls to functions generated by generate_reader. @setbuffer() can be used to synchronize the buffer before calling another parser.

Example

# Inside some Automa action code
 @setbuffer()
 description = sub_parser(stream)
p = @bufferpos()

See also: @bufferpos

source
Automa.RegExp.onexit!Function
onexit!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading the first byte no longer part of regex re, or if experiencing an expected end-of-file. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onfinal!

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = onexit!(regex, :exiting_regex);
 
 julia> regex === regex2
true
source
Automa.RegExp.onall!Function
onall!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when reading any byte part of the regex re. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onexit!, onfinal!

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = onall!(regex, :reading_re_byte);
 
 julia> regex === regex2
true
source
Automa.RegExp.onfinal!Function
onfinal!(re::RE, a::Union{Symbol, Vector{Symbol}}) -> re

Set action(s) a to occur when the last byte of regex re is read. If re does not have a definite final byte, e.g. re"a(bc)*", where more "bc" can always be added, compiling the regex will error after setting a final action. If multiple actions are set by passing a vector, execute the actions in order.

See also: onenter!, onall!, onexit!

Example

julia> regex = re"ab?c";
 
 julia> regex2 = onfinal!(regex, :entering_last_byte);
 
 julia> regex === regex2
 true
 
 julia> compile(onfinal!(re"ab?c*", :does_not_work))
ERROR: [...]
source
Automa.RegExp.precond!Function
precond!(re::RE, s::Symbol; [when=:enter], [bool=true]) -> re

Set re's precondition to s. Before any state transitions to re, or inside re, the precondition code s is checked to be bool before the transition is taken.

when controls if the condition is checked when the regex is entered (if :enter), or at every state transition inside the regex (if :all)

Example

julia> regex = re"ab?c*";
 
 julia> regex2 = precond!(regex, :some_condition);
 
 julia> regex === regex2
true
source
Automa.generate_codeFunction
generate_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate init and exec code for machine. This is the default code generator function for creating functions; prefer it over generating init and exec code directly, as it is more convenient. Shorthand for producing the concatenated code of:

  • generate_init_code(ctx, machine)
  • generate_action_code(ctx, machine, actions)
  • generate_input_error_code(ctx, machine) [elided if actions == :debug]

Examples

@eval function foo(data)
     # Initialize variables used in actions
     data_buffer = UInt8[]
     $(generate_code(machine, actions))
     return data_buffer
end

See also: generate_init_code, generate_exec_code

source
Automa.generate_init_codeFunction
generate_init_code([::CodeGenContext], machine::Machine)::Expr

Generate variable initialization code, initializing variables such as p and p_end. The names of these variables are set by the CodeGenContext. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Example

@eval function foo(data)
     $(generate_init_code(machine))
     p = 2 # maybe I want to start from position 2, not 1
     $(generate_exec_code(machine, actions))
     return cs
end

See also: generate_code, generate_exec_code

source
Automa.generate_exec_codeFunction
generate_exec_code([::CodeGenContext], machine::Machine, actions=nothing)::Expr

Generate machine execution code with actions. This code should be run after the machine has been initialized with generate_init_code. If not passed, the context defaults to DefaultCodeGenContext.

Prefer using the more generic generate_code over this function where possible. This function should be used if the initialized data should be modified before the execution code.

Examples

@eval function foo(data)
     $(generate_init_code(machine))
     p = 2 # maybe I want to start from position 2, not 1
     $(generate_exec_code(machine, actions))
     return cs
end

See also: generate_init_code, generate_code

source
julia> compile(regex) isa Automa.Machine
true

See also: @re_str, compile

source
Automa.RegExp.@re_strMacro
@re_str -> RE

Construct an Automa regex of type RE from a string. Note that due to Julia's raw string escaping rules, re"\\" means a single backslash, and so does re"\\\\", while re"\\\\\"" means a backslash, then a quote character.

Examples:

julia> re"ab?c*[def][^ghi]+" isa RE
true

See also: RE

source

Theory of regular expressions

Most programmers are familiar with regular expressions, or regex for short. What many programmers don't know is that regex have a deep theoretical underpinning, which regex engines lean on to produce highly efficient code.

Informally, a regular expression can be thought of as any pattern that can be constructed from the following atoms:

  • The empty string is a valid regular expression, i.e. re""
  • Literal matching of a single symbol from a finite alphabet, such as a character, i.e. re"p"

Atoms can be combined with the following operations, if R and P are two regular expressions:

  • Alternation, i.e. R | P, meaning either match R or P.
  • Concatenation, i.e. R * P, meaning match first R, then P.
  • Repetition, i.e. R*, meaning match R zero or more times consecutively.
Note

In Automa, the alphabet is bytes, i.e. 0x00:0xff, and so each symbol is a single byte. A multi-byte character such as Æ is interpreted as the concatenation of two symbols, re"\xc3" * re"\x86". The fact that Automa considers one input to be one byte, not one character, can become relevant if you instruct Automa to complete an action "on every input".

Popular regex libraries include more operations like ? and +. These can trivially be constructed from the above mentioned primitives, i.e. R? is "" | R, and R+ is RR*.
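In Automa's regex API, this reduction can be written out directly. A small sketch (rep is a combinator in the Automa.RegExp submodule):

using Automa

R = re"a"
maybe_R = re"" | R                  # R? is "" | R
plus_R  = R * Automa.RegExp.rep(R)  # R+ is RR*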

Some implementations of regular expression engines, such as PCRE which is the default in Julia as of Julia 1.8, also support operations like backreferences and lookbehind. These operations can NOT be constructed from the above atoms and axioms, meaning that PCRE expressions are not regular expressions in the theoretical sense.

The practical importance of theoretically sound regular expressions is that there exist algorithms that can match regular expressions in O(N) time and O(1) space, whereas this is not true for PCRE expressions, which are therefore significantly slower.

Note

Automa.jl only supports real regex, and as such does not support e.g. backreferences, in order to guarantee fast runtime performance.

To match regex to strings, the regex are transformed to finite automata, which are then implemented in code.

Nondeterministic finite automata

The programmer Ken Thompson, of Unix fame, devised Thompson's construction, an algorithm to construct a nondeterministic finite automaton (NFA) from a regex. An NFA can be thought of as a flowchart (or a directed graph), where one can move from node to node on directed edges. Edges are either labeled ϵ, in which the machine can freely move through the edge to its destination node, or labeled with one or more input symbols, in which the machine may traverse the edge upon consuming said input.

To illustrate, let's look at one of the simplest regex: re"a", matching the letter a:

State diagram showing state 1, edge transition consuming input 'a', leading to "accept state" 2

You begin at the small dot on the right, then immediately go to state 1, the circle marked by a 1. By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an a). Some states are "accept states", illustrated by a double circle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.

Each of the operations that combine regex can also combine NFAs. For example, given the two regex a and b, which correspond to the NFAs A and B, the regex a * b can be expressed with the following NFA:

State diagram showing ϵ transition from state A to accept state B

Note the ϵ symbol on the edge - this signifies an "epsilon transition", meaning you move directly from A to B without consuming any symbols.

Similarly, a | b corresponds to this NFA structure...

State diagram of the NFA for `a | b`

...and a* to this:

State diagram of the NFA for `a*`

For a larger example, re"(\+|-)?(0|1)*" combines alternation, concatenation and repetition and so looks like this:

State diagram of the NFA for `re"(\\+|-)?(0|1)*"`

ϵ-transitions mean that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 8. That's what makes NFAs nondeterministic.

In order to match a regex to a string then, the movement through the NFA must be emulated. You begin at state 1. When a non-ϵ edge is encountered, you consume a byte of the input data if it matches. If there are no edges that match your input, the string does not match. If an ϵ-edge is encountered from state A that leads to states B and C, the machine goes from state A to state {B, C}, i.e. in both states at once.

For example, if the regex re"(\+|-)?(0|1)*" visualized above is matched to the string -11, this is what happens:

  • NFA starts in state 1
  • NFA immediately moves to all states reachable via ϵ transition. It is now in state {2, 3, 5, 7, 8, 9, 10}.
  • NFA sees input -. States {2, 3, 5, 7, 8, 10} do not have an edge with - leading out, so these states die. Therefore, the machine is in state 9, consumes the input, and moves to state 2.
  • NFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 4, 5, 7}
  • NFA sees input 1, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 4, 5, 7}
  • The above point repeats, NFA is still in state {3, 4, 5, 7}
  • Input ends. Since state 3 is an accept state, the string matches.

Using only a regex-to-NFA converter, you could create a simple regex engine simply by emulating the NFA as above. The existence of ϵ transitions means the NFA can be in multiple states at once, which adds unwelcome complexity to the emulation and makes it slower. Luckily, every NFA has an equivalent deterministic finite automaton, which can be constructed from the NFA using the so-called powerset construction.
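To make the idea concrete, here is a small self-contained sketch of the powerset construction on a toy NFA representation; it is purely illustrative and unrelated to Automa's internal types:

struct ToyNFA
    eps::Dict{Int, Vector{Int}}                  # ϵ-edges: state => states
    edges::Dict{Tuple{Int, UInt8}, Vector{Int}}  # byte-labeled edges
    accept::Set{Int}
end

# All states reachable from `states` through ϵ-edges alone
function eps_closure(nfa::ToyNFA, states::Set{Int})
    closure = copy(states)
    stack = collect(states)
    while !isempty(stack)
        s = pop!(stack)
        for t in get(nfa.eps, s, Int[])
            t in closure || (push!(closure, t); push!(stack, t))
        end
    end
    return closure
end

# Each DFA state is a set of NFA states; the NFA starts in state 1
function powerset_construction(nfa::ToyNFA)
    start = eps_closure(nfa, Set(1))
    ids = Dict(start => 1)
    transitions = Dict{Tuple{Int, UInt8}, Int}()
    queue = [start]
    while !isempty(queue)
        set = popfirst!(queue)
        for byte in 0x00:0xff
            nxt = Set{Int}()
            for s in set
                union!(nxt, get(nfa.edges, (s, byte), Int[]))
            end
            isempty(nxt) && continue
            nxt = eps_closure(nfa, nxt)
            id = get!(ids, nxt) do
                push!(queue, nxt)  # newly discovered DFA state
                length(ids) + 1
            end
            transitions[(ids[set], byte)] = id
        end
    end
    accepting = Set(id for (set, id) in ids if !isdisjoint(set, nfa.accept))
    return transitions, accepting
end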

Deterministic finite automata

Or DFAs, as they are called, are similar to NFAs, but do not contain ϵ-edges. This means that a given input string has either zero paths through the DFA (if it does not match the regex), or exactly one unambiguous path. In other words, every input symbol must trigger one unambiguous state transition from one state to one other state.

Let's visualize the DFA equivalent to the larger NFA above:

State diagram of the DFA for `re"(\\+|-)?(0|1)*"`

It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA. DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action. DFAs can be simulated either using a lookup table of possible state transitions, or by hardcoding GOTO-statements from node to node when the correct input is matched. Code simulating DFAs can be ridiculously fast, with each state transition taking less than 1 nanosecond, if implemented well.

Furthermore, DFAs can be optimised. Two edges between the same nodes with labels A and B can be collapsed to a single edge with labels [AB], and redundant nodes can be collapsed. The optimised DFA equivalent to the one above is simply:

State diagram of the simpler DFA for `re"(\\+|-)?(0|1)*"`

Unfortunately, as the name "powerset construction" hints, converting an NFA with N nodes may result in a DFA with up to 2^N nodes. This inconvenient fact drives important design decisions in regex implementations. There are basically two approaches:

Automa.jl will just construct the DFA directly, and accept a worst-case complexity of O(2^N). This is acceptable (I think) for Automa, because this construction happens in Julia's package precompilation stage (not on package loading or usage), and because the DFAs are assumed to be constants within a package. So, if a developer accidentally writes an NFA which is unacceptably slow to convert to a DFA, it will be caught in development. Luckily, it's pretty rare to have NFAs that result in truly abysmally slow conversions to DFAs: While bad corner cases exist, they are rarely as catastrophic as the O(2^N) bound would suggest. Currently, Automa's regex/NFA/DFA compilation pipeline is very slow and unoptimized, but, since it happens during precompile time, it is insignificant compared to LLVM compile times.

Other implementations, like the popular ripgrep command line tool, use an adaptive approach: the DFA is constructed on the fly as each symbol is matched, and then cached. If the DFA size grows too large, the cache is flushed. If the cache is flushed too often, the engine falls back to simulating the NFA directly. Such an approach is necessary for ripgrep, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.

Automa in a nutshell

Automa simulates the DFA by having the DFA create a Julia Expr, which is then used to generate a Julia function using metaprogramming. Like all other Julia code, this function is then optimized by Julia and then LLVM, making the DFA simulations very fast.

Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot: We can splice arbitrary Julia code into the DFA simulation. Currently, Automa supports two such kinds of code: actions, and preconditions.

Actions are Julia code that is executed during certain state transitions. Preconditions are Julia code, that evaluates to a Bool value, and which are checked before a state transition. If a precondition evaluates to false, the transition is not taken.
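As a minimal sketch of both features (the names :letter and :allowed are chosen for this example):

using Automa

regex = re"[a-z]+"
onall!(regex, :letter)     # action: run Julia code on every byte of the regex
precond!(regex, :allowed)  # precondition: only enter the regex if `allowed` is true
machine = compile(regex)

@eval function count_letters(data, allowed::Bool)
    n = 0
    $(generate_code(machine, Dict{Symbol, Expr}(:letter => :(n += 1))))
    return n
end

count_letters("hello", true)   # expected: 5
count_letters("hello", false)  # expected: throws, since no transition may be taken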


Token disambiguation

It's possible to create a tokenizer where the different token regexes overlap:

julia> make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval

In this case, an input like ab will match all three regex. Which tokens are emitted is determined by two rules:

First, the emitted tokens will be as long as possible. So, the input aa could be emitted as one token of the regex re"[ab]+", two tokens of the same regex, or two tokens of the regex re"ab*". In this case, it will be emitted as a single token of re"[ab]+", since that makes the first token as long as possible (2 bytes), whereas the other options would only make it 1 byte long.

Second, tokens with a higher index in the input array beat previous tokens. So, a will be emitted as re"ab*", since its index of 2 beats the previous regex re"[ab]+" with index 1, and ab will match the third regex.

If you don't want emitted tokens to depend on these priority rules, you can set the optional keyword unambiguous=true in the make_tokenizer function, in which case make_tokenizer will error if any input text could be broken down into different tokens. However, note that this may cause most tokenizers to error when being built, as most tokenization processes are ambiguous.
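To see the rules in action, the tokenizer defined above can be run on a few inputs. A sketch, assuming the default integer tokens (UInt32, with 0 as the error token):

make_tokenizer([re"[ab]+", re"ab*", re"ab"]) |> eval

# Each element is a (start, length, token) triple
collect(tokenize(UInt32, "aa"))  # one 2-byte token of regex 1 (longest match)
collect(tokenize(UInt32, "a"))   # one 1-byte token of regex 2 (highest index wins the tie)
collect(tokenize(UInt32, "ab"))  # one 2-byte token of regex 3 (highest index wins the tie)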

Reference

Automa.TokenizerType
Tokenizer{E, D, C}

Lazy iterator of tokens of type E over data of type D. Tokenizers are usually created with the tokenize function, and their iterator behaviour is defined by make_tokenizer.

Tokenizer works on any buffer-like object that defines pointer and sizeof. When iterated, it will return a Tuple{Integer, Integer, E}:

  • The first value in the tuple is the 1-based starting index of the token in the buffer
  • The second is the length of the token in bytes
  • The third is the token.

Un-tokenizable data will be emitted as the "error token" which must also be of type E.

The Int parameter C allows multiple tokenizers to be created with the otherwise same type parameters.

See also: make_tokenizer

source
Automa.tokenizeFunction
tokenize(::Type{E}, data, version=1) -> Tokenizer

Create a Tokenizer{E, typeof(data), version}, iterating tokens of type E over data.

See also: Tokenizer, make_tokenizer, compile

Examples

julia> tokenize(UInt32, "hello")
 Tokenizer{UInt32, String, 1}("hello")
 
 julia> tokenize(Int8, [1, 2, 3], 3)
Tokenizer{Int8, Vector{Int64}, 3}([1, 2, 3])
source
Automa.make_tokenizerFunction
make_tokenizer(
     machine::TokenizerMachine;
     tokens::Tuple{E, AbstractVector{E}}= [ integers ],
     goto=true, version=1
)
  (2, 1, 0x02)
  (3, 3, 0x00)
  (6, 1, 0x02)
 (7, 1, 0x01)

Any actions inside the input regexes will be ignored.

If goto (default), use the faster, but more complex goto code generator.
The version number will set the last parameter of the Tokenizer, which allows you to create different tokenizers for the same element type.

See also: Tokenizer, tokenize, compile

source
make_tokenizer(
     tokens::Union{
         AbstractVector{RE},
         Tuple{E, AbstractVector{Pair{E, RE}}}
    };
    goto=true, version=1
)
  (1, 1, 1)
  (2, 1, 2)
  (3, 2, 0)
 (5, 1, 2)
source
julia> validate_io(IOBuffer(">hello\nAC"))
(nothing, (2, 2))

Reference

Automa.generate_buffer_validatorFunction
generate_buffer_validator(name::Symbol, regexp::RE; goto=true, docstring=true)

Generate code that, when evaluated, defines a function named name, which takes a single argument data, interpreted as a sequence of bytes. The function returns nothing if data matches Machine, else the index of the first invalid byte. If the machine reached unexpected EOF, returns 0.

If goto, the function uses the faster but more complicated :goto code.
If docstring, automatically create a docstring for the generated function.

See also: generate_io_validator

source
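A usage sketch; the function name is_header and the regex are arbitrary choices for this example:

using Automa

eval(generate_buffer_validator(:is_header, re">[a-z]+\n"))

is_header(">abc\n")  # nothing: the data matches
is_header(">abc")    # 0: unexpected EOF
is_header("abc\n")   # 1: index of the first invalid byte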
Automa.generate_io_validatorFunction
generate_io_validator(funcname::Symbol, regex::RE; goto::Bool=false)

NOTE: This method requires TranscodingStreams to be loaded

Create code that, when evaluated, defines a function named funcname. This function takes an IO, and checks if the data in the input conforms to the regex, without executing any actions. If the input conforms, return nothing. Else, return (byte, (line, col)), where byte is the first invalid byte, and (line, col) the 1-indexed position of that byte. If the invalid byte is a \n byte, col is 0 and the line number is incremented. If the input errors due to unexpected EOF, byte is nothing, and the line and column given are those of the last byte in the file.

If goto, the function uses the faster but more complicated :goto code.

See also: generate_buffer_validator

source
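A matching sketch for the IO validator, again with arbitrary names (note that TranscodingStreams must be loaded):

using Automa, TranscodingStreams

eval(generate_io_validator(:validate_header, re">[a-z]+\n"))

validate_header(IOBuffer(">abc\n"))   # nothing: the input conforms
validate_header(IOBuffer(">abc!\n"))  # (0x21, (1, 5)): first invalid byte and its position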
Automa.compileFunction
compile(re::RE; optimize::Bool=true, unambiguous::Bool=true)::Machine

Compile a finite state machine (FSM) from re. If optimize, attempt to minimize the number of states in the FSM. If unambiguous, disallow creation of FSM where the actions are not deterministic.

Examples

machine = let
     name = re"[A-Z][a-z]+"
     first_last = name * re" " * name
     last_first = name * re", " * name
     compile(first_last | last_first)
-end
source
compile(tokens::Vector{RE}; unambiguous=false)::TokenizerMachine

Compile the regex tokens to a tokenizer machine. The machine can be passed to make_tokenizer.

The keyword unambiguous decides which of multiple matching tokens is emitted: If false (default), the longest token is emitted. If multiple tokens have the same length, the one with the highest index is returned. If true, make_tokenizer will error if any possible input text can be ambiguously broken down into tokens.

See also: Tokenizer, make_tokenizer, tokenize

source