elastic · cavokz · Feb 7, 2023 · Jan 30, 2023 · Jan 31, 2023 · Feb 2, 2023
diff --git a/README.md b/README.md
@@ -0,0 +1,79 @@
+# Geneve
+
+Geneve is a data generation tool; its name stands for GENerate EVEnts.
+
+To better understand its basics, consider Elastic Security's
+[detection engine](https://www.elastic.co/guide/en/security/current/detection-engine-overview.html).
+It regularly searches one or more indices for suspicious events and, when a
+match is found, creates an alert. To do so it needs detection rules,
+which define what a _suspicious event_ looks like.
+
+The original goal of Geneve can then be summarized as:
+
+> Given a detection rule, generate source events that would trigger the creation of an alert.
+
+It does so by analyzing the rule, building an abstract syntax tree of the
+enclosed query and translating it to an intermediate language that is used
+for generating documents (= events) over and over.
+
+What became obvious over time is that the query at the heart of each rule
+is actually a powerful way to drive document generation, one that goes
+well beyond triggering alerts.
+
+Additionally, generating garbage data that satisfies a rule is one thing;
+generating realistic data that can be analyzed with Kibana, which is an
+implicit goal of the tool, is quite another.
+
+The latter is a much harder nut to crack than the original goal and is
+currently under development.
+
+If you want to try it, read [Getting started](docs/getting_started.md).
+
+# Status
+
+## Data modeling
+
+Rule and query parsing, AST creation and IR generation are well
+developed and rigorously tested by the CI/CD pipelines. The generated
+events are good enough to trigger many of the expected alerts on various
+versions of the stack, from 8.2.0 to 8.6.0. The work is necessarily
+incomplete, but what is implemented is as correct as possible.
+
+The detection rule set used for the tests is loaded into Geneve
+separately and is currently locked to version 8.2.0 (718 rules in total).
+The next step is to use the rules preloaded in the Kibana under test
+(https://github.com/elastic/geneve/issues/125).
+
+Kinds of issues observed in this area:
+
+1. skipped rules due to an unimplemented rule type (e.g. threshold) or query
+   language (e.g. lucene).
+   <ins>73 rules</ins>.
+2. generation errors due to unimplemented query language features or
+   improvements needed in what is already implemented.
+   <ins>80 rules</ins>.
+3. incorrect generation: the expected alerts are not actually created.
+   <ins>5 rules</ins>.
+
+The first two points are detailed in the
+[Documents generation from detection rules](/tests/reports/documents_from_rules.md)
+test report, the last is in the
+[Alerts generation from detection rules](tests/reports/alerts_from_rules.md) one.
+
+Number of rules for which correct data is generated and alerts are created: <ins>560</ins>.
+
+## Data realism
+
+Allowing the user to "click through" requires that the generated data
+exploit the relations that Kibana is made to observe. Having relations
+implies also having the entities that those relations connect, and these
+entities need to be consistent across the whole generation batch.
+
+The problem is increasingly well understood: parts of its solution are
+already implemented, others are still only sketched.
+
+## User interface
+
+Geneve is composed of a Python module and a REST API server that exposes
+it. The Python API is quite simple and stable; the REST API, instead, has
+rough edges and needs proper simplification.
diff --git a/docs/data_model.md b/docs/data_model.md
@@ -0,0 +1,105 @@
+# Data model
+
+The Geneve data model describes what data Geneve is expected to generate.
+It guides and constrains the data generation process so that the output
+satisfies your criteria.
+
+Think of it this way: data generation is a random process that, at its
+root, just produces a long random string of 0s and 1s. What you actually
+want is to shape the result and channel the randomness so that the
+generated data looks sensible in your context and, at the same time, is
+never quite the same.
+
+In essence, you tell Geneve what you are searching for and it returns
+a JSON document that is a plausible answer to your search; every time the
+answer is different. If this sounds like "queries" to you, you're right:
+Geneve's input is queries.
+
+## Queries
+
+You have to provide Geneve with at least one query. If you give it several,
+Geneve randomly chooses, at each round, the one it generates the document
+for.
+
+Suppose you have this query:
+
+```
+process.name: "*.exe"
+```
+
+What it actually tells Geneve is that you want the documents to have a field
+named `process.name` whose content matches the wildcard `*.exe`.
+
+Generated documents could be:
+
+```json
+{"process.name": "excel.exe"}
+```
+
+```json
+{"process.name": "winword.exe"}
+```
+
+but also, more likely, names made of random letters such as
+
+```json
+{"process.name": "LDow.exe"}
+```
+
+or
+
+```json
+{"process.name": "OjiRlQMX.exe"}
+```
+
+If you really want to control the options, you can enumerate them:
+
+```
+process.name: ("excel.exe" or "winword.exe" or "regedit.exe")
+```
+
+Each generated document can then only be one of the three possible: you
+have restricted the choices Geneve can make.
+
+Let's do another one
+
+```
+process.name: "10.0.0.0/8"
+```
+
+you get
+
+```json
+{"process.name": "10.0.0.0/8"}
+```
+
+As surprising as it may be, this is the only answer Geneve can give unless
+you tell it that `process.name` is of type `ip address`.
+
+This is where the schema comes into play: it defines the fields and their types. We'll assume
+[ECS](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html)
+is in use, but Geneve does not; if you want ECS you need to load it (see
+[Loading the schema](https://github.com/cavokz/geneve/blob/add-some-docs3/docs/getting_started.md#loading-the-schema)).
+If you use fields that are not in the schema, Geneve treats them as plain text (type `keyword`, to be precise).
+
+Now try again with a more appropriate field:
+
+```
+source.ip: "10.0.0.0/8"
+```
+
+you get, for example
+
+```json
+{"source.ip": "10.23.84.86"}
+```
+
+## Query languages
+
+All the queries in the examples above are expressed in the
+[Kibana Query Language](https://www.elastic.co/guide/en/kibana/current/kuery-query.html) (Kuery)
+but you can also use the
+[Event Query Language](https://www.elastic.co/guide/en/elasticsearch/reference/current/eql.html) (EQL).
+These are the only two languages supported at the moment, but others can be added.
+
+Regardless of the query language used, the fields remain those defined by the schema.
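
The wildcard and CIDR examples in `docs/data_model.md` can be sketched in a few lines of plain Python. This only illustrates how a field type can steer generation; it is not Geneve's actual implementation or API. The names `generate_keyword`, `generate_ip`, `generate_field` and the `schema` dictionary are hypothetical, and the sketch assumes that `*` is the only wildcard and that `ip` and `keyword` are the only types.

```python
import ipaddress
import random
import string


def generate_keyword(pattern, rng):
    """Return a string matching `pattern`, where `*` is the only wildcard."""
    parts = []
    for ch in pattern:
        if ch == "*":
            # Replace each wildcard with 1 to 8 random ASCII letters.
            length = rng.randint(1, 8)
            parts.append("".join(rng.choice(string.ascii_letters) for _ in range(length)))
        else:
            parts.append(ch)
    return "".join(parts)


def generate_ip(cidr, rng):
    """Return a random address belonging to the network `cidr`."""
    net = ipaddress.ip_network(cidr)
    return str(net[rng.randrange(net.num_addresses)])


def generate_field(field, value, schema, rng):
    """Generate a one-field document; fields not in `schema` are `keyword`."""
    if schema.get(field, "keyword") == "ip":
        return {field: generate_ip(value, rng)}
    return {field: generate_keyword(value, rng)}


if __name__ == "__main__":
    rng = random.Random()
    schema = {"source.ip": "ip"}  # hypothetical, minimal stand-in for ECS
    print(generate_field("process.name", "*.exe", schema, rng))
    print(generate_field("process.name", "10.0.0.0/8", schema, rng))
    print(generate_field("source.ip", "10.0.0.0/8", schema, rng))
```

Because `process.name` is not declared as `ip` in the hypothetical schema, its CIDR string contains no wildcard and comes back verbatim, while `source.ip` yields an address inside the network.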