diff --git a/README.md b/README.md
new file mode 100644
index 00000000..a47339b5
--- /dev/null
+++ b/README.md
@@ -0,0 +1,79 @@
+# Geneve
+
+Geneve is a data generation tool; its name stands for GENerate EVEnts.
+
+To better understand its basics, consider Elastic Security's
+[detection engine](https://www.elastic.co/guide/en/security/current/detection-engine-overview.html).
+It regularly searches one or more indices for suspicious events; when a
+match is found it creates an alert. To do so it needs detection rules
+which define what a _suspicious event_ looks like.
+
+The original goal of Geneve can be summarized as:
+
+> Given a detection rule, generate source events that would trigger the creation of an alert.
+
+It does so by analyzing the rule, building an abstract syntax tree of the
+enclosed query and translating it to an intermediate language that is used
+for generating documents (= events) over and over.
+
+What became obvious over time is that the query at the heart of each rule
+is actually a powerful way to drive document generation, one that goes
+well beyond triggering alerts.
+
+Additionally, generating garbage data that satisfies a rule is one thing;
+generating realistic data that can be analyzed with Kibana is another,
+and the latter is an implicit goal of the tool.
+
+This is a much harder nut to crack than the original goal and is
+currently under development.
+
+If you want to try it, read [Getting started](docs/getting_started.md).
+
+# Status
+
+## Data modeling
+
+Rule/query parsing, AST creation and IR generation are well
+developed and rigorously tested by the CI/CD pipelines. The generated
+events are good enough to trigger many of the expected alerts on various
+versions of the stack, from 8.2.0 to 8.6.0, but the work is necessarily
+incomplete, albeit as correct as possible.
+
+The detection rule set used for the tests is loaded separately into
+Geneve and is currently locked to version 8.2.0 (718 rules in total).
The next
+step is to use the rules preloaded in the Kibana instance under test
+(https://github.com/elastic/geneve/issues/125).
+
+Kinds of issues observed in this area:
+
+1. skipped rules due to unimplemented rule type (i.e. threshold) or query
+   language (i.e. lucene).
+   73 rules.
+2. generation errors due to unimplemented query language features or
+   improvements needed in what is already implemented.
+   80 rules.
+3. incorrect generation: the expected alerts are not actually created.
+   5 rules.
+
+The first two points are detailed in the
+[Documents generation from detection rules](/tests/reports/documents_from_rules.md)
+test report; the last is in the
+[Alerts generation from detection rules](tests/reports/alerts_from_rules.md) one.
+
+Number of rules for which correct data is generated and alerts are created: 560.
+
+## Data realism
+
+Allowing the user to "click through" requires that generated data exploits
+the relations that Kibana is made to observe. Having relations implies
+also having the entities that such relations connect, entities
+that need to be consistent across the whole generation batch.
+
+The problem is increasingly well understood; parts of its solution are
+already implemented, others are still only sketched.
+
+## User interface
+
+Geneve is composed of a Python module and a REST API server that exposes
+it. The Python API is quite simple and stable; the REST API instead has
+rough edges and needs proper simplification.
diff --git a/docs/data_model.md b/docs/data_model.md
new file mode 100644
index 00000000..baf17602
--- /dev/null
+++ b/docs/data_model.md
@@ -0,0 +1,105 @@
+# Data model
+
+The Geneve data model describes what data Geneve is expected to generate;
+it guides and constrains the data generation process so that the output
+satisfies your criteria.
+
+Think of it this way: data generation is a random process; at its root it
+just produces a long random string made of 0s and 1s.
What you actually
+want is to shape the result and channel the randomness so that the
+generated data looks sensible in your context and, at the same time, is never
+quite the same.
+
+In essence, you tell Geneve what you are searching for and it will return
+a JSON document that is a plausible answer to your search; every time the
+answer is different. If this sounds like "queries" to you, you're right:
+Geneve's input is queries.
+
+## Queries
+
+You have to provide at least one query to Geneve; if you give it several,
+Geneve will randomly choose which one to generate the document for in
+each round.
+
+Suppose you have this query:
+
+```
+process.name: "*.exe"
+```
+
+What it actually tells Geneve is: you want the documents to have a field
+named `process.name` and its content needs to match the wildcard `*.exe`.
+
+Generated documents could be:
+
+```json
+{"process.name": "excel.exe"}
+```
+
+```json
+{"process.name": "winword.exe"}
+```
+
+but also, more likely, random letters in the name such as
+
+```json
+{"process.name": "LDow.exe"}
+```
+
+or
+
+```json
+{"process.name": "OjiRlQMX.exe"}
+```
+
+If you really want to control the options, then you can enumerate them
+
+```
+process.name: ("excel.exe" or "winword.exe" or "regedit.exe")
+```
+
+the generated documents can only be one of the three possibilities: you
+have restricted the choices Geneve can make.
+
+Let's do another one
+
+```
+process.name: "10.0.0.0/8"
+```
+
+you get
+
+```json
+{"process.name": "10.0.0.0/8"}
+```
+
+as surprising as it may be, it's the only answer Geneve can give if you
+don't tell it to treat `process.name` as being of type `ip address`.
+
+This is where the schema comes into play: it defines the fields and their types.
We'll assume
+[ECS](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html)
+is in use, but Geneve does not; if you want ECS you need to load it (see
+[Loading the schema](https://github.com/cavokz/geneve/blob/add-some-docs3/docs/getting_started.md#loading-the-schema)).
+If you use fields that are not in the schema, Geneve will treat them as type `plain text` (`keyword`, actually).
+
+Now try again with a more appropriate field
+
+```
+source.ip: "10.0.0.0/8"
+```
+
+you get, for example
+
+```json
+{"source.ip": "10.23.84.86"}
+```
+
+## Query languages
+
+All the queries in the examples above are expressed in the
+[Kibana Query Language](https://www.elastic.co/guide/en/kibana/current/kuery-query.html) (Kuery)
+but you can also use the
+[Event Query Language](https://www.elastic.co/guide/en/elasticsearch/reference/current/eql.html) (EQL).
+These are the only two languages supported at the moment, but it's entirely possible to add others.
+
+Regardless of the query language used, the fields remain those defined by the schema.
diff --git a/docs/getting_started.md b/docs/getting_started.md
new file mode 100644
index 00000000..e64f4280
--- /dev/null
+++ b/docs/getting_started.md
@@ -0,0 +1,325 @@
+# Getting started
+
+## Data generation process
+
+The data generation process uses this analogy: generated data flows from source to sink.
+
+To generate data, it is therefore necessary to define:
+
+* `source`: what data is generated, e.g. data model
+* `sink`: where data is sent to, e.g. ES index
+* `flow`: how data is transmitted, e.g. how fast or how much?
+* `schema`: fields definition, e.g. ECS 8.2.0
+
+Each of the above is handled by its own REST API endpoint. An arbitrary
+number of sources, sinks, flows and schemas can be defined on the same
+server.
+
+## Install
+
+Currently Geneve is packaged only for [Homebrew](https://brew.sh); you
+first need to install the Geneve tap
+
+```shell
+$ brew tap elastic/geneve
+```
+
+then the tool itself
+
+```shell
+$ brew install geneve
+```
+
+## REST API server
+
+Data is generated by the Geneve server; you start it with
+
+```shell
+$ geneve serve
+2023/01/31 16:40:23 Control: http://localhost:9256
+```
+
+The server keeps the terminal busy with its logs; to stop it, just press `^C`.
+The first line in the log shows where to reach it: this is the base URL of
+the server, and all the API endpoints are reachable (but not browsable) under
+`api/`.
+
+For the rest of this document we'll assume that the following shell
+variables are set:
+
+* `$GENEVE` points to the Geneve server, URL `http://localhost:9256`
+* `$TARGET_ES` is the URL of the target Elasticsearch instance
+* `$TARGET_KIBANA` is the URL of the corresponding Kibana instance
+
+Now open a separate terminal to operate on the server with curl.
+
+## Loading the schema
+
+The schema describes the fields that can be present in a generated
+document. At the moment it needs to be explicitly loaded into the server.
+
+Download the latest version (or any other, if you prefer) from
+https://github.com/elastic/ecs/releases and look for the file `ecs_flat.yml`
+in the folder `ecs-X.Y.Z/generated/ecs/`.
+
+Supposing that the path of that file is in the shell variable `$SCHEMA_YAML`, you
+load it with
+
+```shell
+$ curl -s -XPUT -H "Content-Type: application/yaml" "$GENEVE/api/schema/ecs" --data-binary "@$SCHEMA_YAML"
+```
+
+The `ecs` in the endpoint `api/schema/ecs` is an arbitrary name; it's how
+the loaded schema is addressed by the server.
+
+## Define the data model
+
+In the data model you describe the data that shall be generated. It can
+be as simple as a list of fields that need to be present, or more complex,
+also defining the relations among them.
+
+How to write a data model is a separate subject (see [Data model](data_model.md));
+here we focus on how to configure one on the server. You use the `api/source` endpoint.
+
+```shell
+$ curl -s -XPUT -H "Content-Type: application/yaml" "$GENEVE/api/source/mydata" --data-binary @- </_mappings`
+endpoint returns the mappings of all the possible fields that can be
+encountered in the documents generated by that source.
+
+Use the Elasticsearch index API to create the index
+
+```shell
+$ curl -s -XPUT -H "Content-Type: application/json" $TARGET_ES/myindex --data @- <