retree-java


Introduction

retree can parse and combine many regular expressions into a single regular-expression tree, which is very similar to a trie.

First, retree parses each regex into a node chain, for example:

  • \d+a\W will be parsed as CurlyNode(\d+) -> CharNode(a) -> CharNode(\W),
  • \d+\s\W will be parsed as CurlyNode(\d+) -> CharNode(\s) -> CharNode(\W).

After that, retree merges those two node chains into one node tree, much as a trie merges keys (an image illustrating the resulting tree is still to be added).

When matching multiple regexes at once, retree can reduce redundant scanning and looping, and avoid a lot of backtracking.
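The merge step can be pictured with a minimal trie-style sketch. This is not retree's actual implementation: the `NodeTreeSketch` class, its `Node` type, and the use of plain string labels in place of real regex nodes are all assumptions made for illustration. It only shows how two chains that start with the same node end up sharing it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; retree's real node classes are not shown here.
public class NodeTreeSketch {

    /** A tree node identified by a label, with children keyed by their labels. */
    static final class Node {
        final String label;
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(String label) {
            this.label = label;
        }
    }

    /** Inserts one node chain, reusing existing nodes along any shared prefix. */
    static void insert(Node root, List<String> chain) {
        Node current = root;
        for (String label : chain) {
            current = current.children.computeIfAbsent(label, Node::new);
        }
    }

    /** Counts all nodes below the given node (the node itself is excluded). */
    static int size(Node node) {
        int count = 0;
        for (Node child : node.children.values()) {
            count += 1 + size(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Node root = new Node("ROOT");
        // Labels stand in for the parsed nodes of \d+a\W and \d+\s\W.
        insert(root, Arrays.asList("CurlyNode(\\d+)", "CharNode(a)", "CharNode(\\W)"));
        insert(root, Arrays.asList("CurlyNode(\\d+)", "CharNode(\\s)", "CharNode(\\W)"));
        // Six chain nodes collapse to five tree nodes because CurlyNode(\d+) is shared.
        System.out.println(size(root));
    }
}
```

In this sketch the shared `CurlyNode(\d+)` is stored once, so work done for that prefix would not need to be repeated for each expression.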

Maven Dependency

Add the Maven dependency:

<dependency>
    <groupId>com.github.sisyphsu</groupId>
    <artifactId>retree</artifactId>
    <version>1.0.4</version>
</dependency>

Usage

This example shows how retree works. Its API is very similar to java.util.regex:

String[] res = {"(\\d{4}-\\d{1,2}-\\d{1,2})", "<b>(?<name>.*)</b>", "(\\w+@\\w+\\.[a-z]+(\\.[a-z]+)?)"};
String input = "Today is 2019-09-05, from <b>sulin</b> ([email protected]).";
ReMatcher matcher = new ReMatcher(new ReTree(res), input);

assert matcher.find();
assert "2019-09-05".contentEquals(matcher.group());

assert matcher.find();
assert "<b>sulin</b>".contentEquals(matcher.group());
assert "sulin".contentEquals(matcher.group("name"));

assert matcher.find();
assert "[email protected]".contentEquals(matcher.group());

In this example, the input only needs to be scanned once to match three different regular expressions:

  • (\d{4}-\d{1,2}-\d{1,2})
  • <b>(?<name>.*)</b>
  • (\w+@\w+\.[a-z]+(\.[a-z]+)?)
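For contrast, here is a rough sketch of the conventional approach using only java.util.regex, where each expression is compiled and run over the input independently. The `SeparateScansBaseline` class and its `firstMatches` helper are hypothetical names, and the address `user@example.com` is a placeholder chosen for this sketch rather than the one used in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical baseline: one independent scan of the input per regular expression.
public class SeparateScansBaseline {

    /** Returns the first match of each regex in the input, or null where a regex does not match. */
    static List<String> firstMatches(String[] regexes, String input) {
        List<String> results = new ArrayList<>();
        for (String regex : regexes) {
            // Each Matcher starts again from the beginning of the input.
            Matcher matcher = Pattern.compile(regex).matcher(input);
            results.add(matcher.find() ? matcher.group() : null);
        }
        return results;
    }

    public static void main(String[] args) {
        String[] res = {
            "(\\d{4}-\\d{1,2}-\\d{1,2})",
            "<b>(?<name>.*)</b>",
            "(\\w+@\\w+\\.[a-z]+(\\.[a-z]+)?)"
        };
        // Placeholder address, assumed for illustration only.
        String input = "Today is 2019-09-05, from <b>sulin</b> (user@example.com).";
        System.out.println(firstMatches(res, input));
    }
}
```

With this baseline the number of passes over the input grows with the number of expressions, which is the cost that a single combined scan is meant to avoid.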

Showcase

dateparser

dateparser is a smart, high-performance date parser library. It supports hundreds of different formats, covering nearly all of the formats in common use.

dateparser uses retree to match input against its many different date patterns.

Even though dateparser has hundreds of predefined regular expressions, it can still parse a date very quickly (1000~1500 ns).

Performance & Benchmark

TODO

Multi-Language Support

I will port this library to Go and JavaScript in the near future.

License

Apache-2.0