Add some documentation #128

Merged
merged 4 commits into from
Feb 7, 2023
79 changes: 79 additions & 0 deletions README.md
@@ -0,0 +1,79 @@
# Geneve

Geneve is a data generation tool; its name stands for GENerate EVEnts.

To better understand its basics, consider Elastic Security's
[detection engine](https://www.elastic.co/guide/en/security/current/detection-engine-overview.html).
It regularly searches one or more indices for suspicious events and, when
a match is found, creates an alert. To do so it needs detection rules,
which define what a _suspicious event_ looks like.

The original goal of Geneve can then be summarized as:

> Given a detection rule, generate source events that would trigger an alert creation.

It does so by analyzing the rule, building an abstract syntax tree of the
enclosed query and translating it to an intermediate language that is used
for generating documents (= events) over and over.

What became obvious over time is that the query at the heart of each rule
is actually a powerful way to drive document generation, one that goes
well beyond triggering alerts.

Additionally, generating garbage data that satisfies a rule is one thing;
generating realistic data that can be analyzed with Kibana is another, and
it is an implicit goal of the tool.

The latter is a much harder nut to crack than the original goal and is
currently under development.

If you want to try it, read [Getting started](docs/getting_started.md).

# Status

## Data modeling

Rule and query parsing, AST creation and IR generation are well developed
and rigorously tested by the CI/CD pipelines. The generated events are
good enough to trigger many of the expected alerts on various versions of
the stack, from 8.2.0 to 8.6.0. The work is necessarily incomplete, but
what is implemented aims to be as correct as possible.

The detection rule set used for the tests is loaded into Geneve separately
and is currently locked to version 8.2.0 (718 rules in total). The next
step is to use the rules preloaded in the Kibana under test
(https://github.com/elastic/geneve/issues/125).

Kinds of issues observed in this area:

1. Rules skipped due to an unimplemented rule type (e.g. threshold) or query
language (e.g. lucene).
<ins>73 rules</ins>.
2. Generation errors due to unimplemented query language features or
improvements needed in what is already implemented.
<ins>80 rules</ins>.
3. Incorrect generation: documents are generated but the expected alerts are not created.
<ins>5 rules</ins>.

The first two points are detailed in the
[Documents generation from detection rules](/tests/reports/documents_from_rules.md)
test report; the last one is detailed in the
[Alerts generation from detection rules](tests/reports/alerts_from_rules.md) test report.

Number of rules for which correct data is generated and alerts are created: <ins>560</ins>.
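
As a quick consistency check, the four figures above should add up to the 718 rules of the test set. A minimal sketch of that arithmetic follows; the dictionary keys are labels chosen here for illustration and are not names used by Geneve.

```python
# Rule counts quoted in this section; the keys are illustrative labels only.
rule_counts = {
    "skipped": 73,
    "generation_errors": 80,
    "incorrect_generation": 5,
    "correct": 560,
}

total = sum(rule_counts.values())
success_rate = rule_counts["correct"] / total

print(f"{total} rules in total, {success_rate:.1%} with correct data and alerts")
```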

## Data realism

Allowing the user to "click through" requires that the generated data
exploits the relations that Kibana is made to observe. Having relations
also implies having the entities that such relations connect, and these
entities need to be consistent across the whole generation batch.

The problem is increasingly well understood: parts of its solution are
already implemented, others are still only sketched.

## User interface

Geneve is composed of a Python module and a REST API server that exposes
it. The Python API is quite simple and stable; the REST API, instead, has
rough edges and needs proper simplification.
105 changes: 105 additions & 0 deletions docs/data_model.md
@@ -0,0 +1,105 @@
# Data model

The Geneve data model describes what data Geneve is expected to generate.
It guides and constrains the data generation process so that the output
satisfies your criteria.

Think of it this way: data generation is a random process; at its root it
just produces a long random string of 0s and 1s. What you actually want is
to shape the result and channel the randomness so that the generated data
looks sensible in your context and, at the same time, is never quite the
same.
Comment on lines +7 to +11
Member

Love this -- appreciate the first principles level-set here 🙂. The output (data generated) just needs to meet the minimum interface contract defined by the input query, and can and should vary indiscriminately outside of that.

As a further abstraction (which could be an input to Geneve), have you had any thoughts around some sort of input definition spec or anything? This is probably most applicable to the "click through" example mentioned above, but could be useful when needing to generate multiple documents and managing the references and chained dependencies between each (e.g. capturing parent/child relationships, generating sub-documents that are in the correct subnet, etc). Supporting sequences in EQL is a sorta similar situation, albeit still defined and verifiable by the query spec.

Member

Ahhh, reading further, it looks like the Data Model defined below pretty much handles this. 👍 Would be nice to package these up à la Rally Tracks so it's easy to run pre-defined configurations.


In essence, you tell Geneve what you are searching for and it returns a
JSON document that is a plausible answer to your search; every time the
answer is different. If this sounds like "queries" to you, you're right:
Geneve's input is queries.

## Queries

You have to provide at least one query to Geneve. If you give it more than
one, Geneve randomly chooses, at each round, the query it will generate
the document for.
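
The selection step can be pictured with a minimal sketch. It assumes a uniform random choice, which the text does not specify, and `pick_query` is a hypothetical helper, not part of the Geneve API.

```python
import random

def pick_query(queries, rng=random):
    """Return one query, chosen at random, to drive the next document."""
    if not queries:
        raise ValueError("at least one query is required")
    return rng.choice(queries)

queries = [
    'process.name: "*.exe"',
    'source.ip: "10.0.0.0/8"',
]

# Each round may pick a different query.
for _ in range(3):
    print(pick_query(queries))
```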

Suppose you have this query:

```
process.name: "*.exe"
```

What it actually tells Geneve is that you want the documents to have a
field named `process.name` whose content matches the wildcard `*.exe`.

Generated documents could be:

```json
{"process.name": "excel.exe"}
```

```json
{"process.name": "winword.exe"}
```

but also, more likely, names made of random letters such as

```json
{"process.name": "LDow.exe"}
```

or

```json
{"process.name": "OjiRlQMX.exe"}
```
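
The behaviour shown above can be approximated in a few lines. This is only a sketch of the idea, not Geneve's implementation: `expand_wildcard` is a hypothetical helper that replaces each `*` with a short run of random letters, and the standard library's `fnmatchcase` confirms that the result still matches the pattern.

```python
import random
import string
from fnmatch import fnmatchcase

def expand_wildcard(pattern, rng=random, max_len=8):
    """Replace every '*' in pattern with 1..max_len random ASCII letters."""
    parts = []
    for ch in pattern:
        if ch == "*":
            length = rng.randint(1, max_len)
            parts.append("".join(rng.choice(string.ascii_letters) for _ in range(length)))
        else:
            parts.append(ch)
    return "".join(parts)

doc = {"process.name": expand_wildcard("*.exe")}

# Whatever letters were drawn, the value still satisfies the query.
assert fnmatchcase(doc["process.name"], "*.exe")
print(doc)
```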

If you really want to control the options, you can enumerate them:

```
process.name: ("excel.exe" or "winword.exe" or "regedit.exe")
```

The generated documents can then only be one of the three listed; you have
restricted the choices Geneve can make.

Let's try another one:

```
process.name: "10.0.0.0/8"
```

you get

```json
{"process.name": "10.0.0.0/8"}
```

Surprising as it may be, this is the only answer Geneve can give unless you
instruct it to consider `process.name` to be of type `ip address`.

This is where the schema comes into play: it defines which fields exist and what their types are. We'll assume
[ECS](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html)
is in use, but Geneve does not assume so by default; if you want ECS you need to load it (see
[Loading the schema](https://github.com/cavokz/geneve/blob/add-some-docs3/docs/getting_started.md#loading-the-schema)).
If you use fields that are not in the schema, Geneve considers them to be of type `plain text` (`keyword`, actually).

Now try again with a more appropriate field:

```
source.ip: "10.0.0.0/8"
```

you get, for example

```json
{"source.ip": "10.23.84.86"}
```
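
Why the field type matters can be sketched as follows. This is an illustration under stated assumptions, not Geneve's code: `SCHEMA` is a toy one-entry mapping standing in for a loaded schema, and `generate_value` is a hypothetical helper. Fields missing from the schema fall back to `keyword`, as described above.

```python
import ipaddress
import random

# Toy schema: field name -> type. A real schema such as ECS is far larger.
SCHEMA = {"source.ip": "ip"}

def generate_value(field, value, rng=random):
    """Generate a value for `field` that satisfies the query value."""
    field_type = SCHEMA.get(field, "keyword")  # unknown fields are keywords
    if field_type == "ip":
        # Interpret the value as a CIDR block and draw one address from it.
        network = ipaddress.ip_network(value)
        offset = rng.randrange(network.num_addresses)
        return str(network[offset])
    # keyword without wildcards: the only matching value is the literal itself
    return value

print({"process.name": generate_value("process.name", "10.0.0.0/8")})
print({"source.ip": generate_value("source.ip", "10.0.0.0/8")})
```

The same query value yields a literal string for the keyword field and a concrete address inside the block for the IP field.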

## Query languages

All the queries in the examples above are expressed in the
[Kibana Query Language](https://www.elastic.co/guide/en/kibana/current/kuery-query.html) (Kuery),
but you can also use the
[Event Query Language](https://www.elastic.co/guide/en/elasticsearch/reference/current/eql.html) (EQL).
These are the only two languages supported at the moment, but it is entirely possible to add others.

Regardless of the query language used, the fields remain those defined by the schema.