This package provides an extractor pipeline plugin that is able to execute a regular expression on a field and extract the results to separate fields. It can be used to extract values from a single log line.
For full details on the supported configuration format, please see the configuration section of this document.
The extractor plugin works by parsing incoming messages and executing a regular expression on a specified field. The capture groups from the regular expression can then optionally be converted and written to a new JSON field.
For example, given the following (partial) configuration:
fields:
- field: "message"
match: "\\[.+?\\] INFO [^:]+:(\\d+) - (.+)"
extract:
- to: "location.line"
type: "number"
- to: "log_message"
type: "string"
The regular expression given matches text that looks like [Apr 10] INFO com.foo.Bar:10 - Message
, and will capture the line and message respectively.
If this plugin then receives the following as input:
{
"message": "[Apr 10] INFO com.foo.Bar:10 - Message"
}
It will extract and convert accordingly, resulting in the following:
{
"message": "[Apr 10] INFO com.foo.Bar:10 - Message",
"log_message": "Message",
"location": {
"line": 10
}
}
From this result, it should be obvious that this plugin simply uses the captures generated by the regex configured in match
, potentially converts them to a different format, and finally writes them to the specified output field. The original input field is left untouched.
To build the library, run gradle pipeline:extractor:shadowJar
. The result will be located in build/extractor-all.jar
.
To execute the tests and linting, run gradle pipeline:extractor:check
.
To run a compiled version of the extractor plugin, simply launch it using Java:
java -jar build/extractor-all.jar [path to config]
The extractor plugin can be easily built and run using Docker.
A pre-built image is available at the Docker hub repository.
The plugin image is tagged as sergdelft/hyperion:pipeline-plugins-extractor-<version>
. Please consult the root README for the latest published version.
To run this image with extractor_config.yml
as its configuration execute:
docker run -it -rm -v ${PWD}/extractor_config.yml:/root/config.yml sergdelft/hyperion:pipeline-plugins-extractor-0.1.0
The included Dockerfile compiles and bundles the plugin. To build it, navigate to the repository root and run the following command:
docker build . -f pipeline/plugins/extractor/Dockerfile -t hyperion-extractor:latest
Once building completes, the plugin can be ran using the following command,
assuming that the configuration file is located at extractor_config.yml
:
docker run -it -rm -v ${PWD}/extractor_config.yml:/root/config.yml hyperion-extractor:latest
This plugin accepts configuration in a YAML file supplied as a command line argument. The following options are accepted:
# A list of extraction patterns that need to be applied to incoming messages. If
# a field is not present in a message, it is skipped in the processing. You can have
# any number of these field extractions configured.
fields:
-
# The name of the field to execute the regular expression on.
field: "message"
# The regular expression to execute on the field. Please note that
# you will need to escape backslashes as you would if this was a
# normal string literal.
match: "\\[.+?\\] INFO [^:]+:(\\d+) - (.+)"
# The list of targets for the matched groups in the regex. This is
# processed in the same order of the groups as they appear in the
# regular expression. In case there is a mismatch in size between
# the groups matched and the extract list, the smaller of the two
# is chosen.
extract:
-
# The target field for this extraction. This can be a nested
# expression delimited by dots, in which case it will automatically
# create child objects as needed.
to: "location.line"
# The type of the value to write to the field. The extractor plugin
# supports converting the extracted string value from the regular
# expression to a different format. The following types are supported:
# "string", "number", "double". If not given, defaults to "string".
type: "number"
# Add extract entries as necessary...
-
to: "message_text"
type: "string"
# Add more fields as necessary
-
field: "other_message"
# etc.
# Various settings needed for the plugin to interact with the pipeline,
# such as it's unique ID and the hostname and port of the Hyperion plugin manager.
#
# Please note that the plugin must also be able to talk to any of its previous
# and next steps in the pipeline. As such, it is recommended that all of the
# plugins are contained on a single networking setup.
pipeline:
# The host and port pair that can be used to contact the Hyperion plugin manager.
# Please note that this machine must be able to talk over TCP to the manager and
# that the manager must be aware of this plugin/aggregator.
manager-host: "manager:8000"
# The unique ID of this pipeline step that matches the configuration of the plugin
# manager. Used to identify which plugins are inputs/outputs of this step. Please
# note that the plugin will crash at launch if the plugin manager does not recognize
# this plugin ID.
plugin-id: "Extractor"
# The size of the internal buffer used for storing data that has not yet been processed
# locally. Increasing this will allow for more messages to be buffered, at the cost of
# more memory usage. Messages incoming while the buffer is full will be thrown away. If
# this happens often, consider using the load balancer plugin to shard this plugin across
# multiple instances. Defaults to 20,000.
buffer-size: 20000
This plugin accepts any type of JSON value as input. If the input is not valid JSON, or if it is not a JSON object, it will be passed to the next stage of the pipeline unaffected. Other than that, any type of JSON object is accepted.
{
"message": "[Apr 10] INFO com.foo.Bar:10 - Message"
}
This plugin will transform the incoming JSON message according to the configuration and output a new JSON object that strictly contains more values than the input. For the example given in the usage section, the output is as follows:
{
"message": "[Apr 10] INFO com.foo.Bar:10 - Message",
"log_message": "Message",
"location": {
"line": 10
}
}