Add new s3 sink documentation for Data Prepper 2.8
Signed-off-by: Taylor Gray <[email protected]>
graytaylor0 committed May 15, 2024
1 parent aae9fc6 commit 1c94268
57 changes: 46 additions & 11 deletions _data-prepper/pipelines/configuration/sinks/s3.md
@@ -70,20 +70,48 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM
}
```

## Cross-account S3 access<a name="s3_bucket_ownership"></a>

When Data Prepper writes objects to an S3 bucket, it verifies ownership of the bucket using the
[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of its mapped configurations, `default_bucket_owner` defaults to the account ID from `aws.sts_role_arn`.

If you plan to write data to multiple S3 buckets, but each bucket belongs to a different account, configure Data Prepper to check for cross-account S3 access according to the following conditions:

- If all of the S3 buckets that you write to belong to the same account, set `default_bucket_owner` to the account ID of the bucket owner.
- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.

In the following example, `my-bucket-01` is owned by account `123456789012` and `my-bucket-02` is owned by account `999999999999`. The `bucket_owners` map pairs each bucket with its owner's account ID, as shown in the following configuration:

```
sink:
  - s3:
      default_bucket_owner: 111111111111
      bucket_owners:
        my-bucket-01: 123456789012
        my-bucket-02: 999999999999
```

You can use both `bucket_owners` and `default_bucket_owner` together.

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports dynamic bucket names using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket cannot be accessed, objects are sent to the `default_bucket`, if one is configured; otherwise, the object data is dropped.
`default_bucket` | No | String | A static bucket to which objects are written when a dynamic bucket specified in `bucket` cannot be accessed.
`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`default_bucket_owner` | No | String | The AWS account ID for the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to start flushing object groups when a dynamic `path_prefix` creates many groups in memory.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
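
To show how these options fit together, the following is a minimal pipeline sketch. The bucket names, account ID, and role ARN are placeholders, and the `ndjson` codec and the `event_count` threshold option are assumptions rather than options shown in this excerpt:

```
sink:
  - s3:
      bucket: test-${/bucket_id}            # dynamic bucket name resolved from each event
      default_bucket: my-fallback-bucket    # hypothetical fallback when the dynamic bucket is inaccessible
      compression: gzip
      codec:
        ndjson:                             # assumed codec; see the codec section below
      threshold:
        event_count: 10000                  # assumed threshold option not shown in this excerpt
        maximum_size: 50mb
        event_collect_timeout: 60s
      aws:
        region: us-east-1
        sts_role_arn: arn:aws:iam::111111111111:role/my-data-prepper-role
```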

## aws

@@ -106,6 +134,13 @@ Option | Required | Type | Description
`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
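
As a small sketch of the options above, the following threshold flushes an object after roughly 50 MB accumulates or after two and a half minutes, whichever comes first. The `event_count` line is an assumed additional threshold option that is not shown in this excerpt:

```
threshold:
  event_count: 2000                # assumed option; flush after this many events
  maximum_size: 50mb               # flush after ~50 MB accumulates
  event_collect_timeout: PT2M30S   # ISO-8601 duration; simple notation such as 150s also works
```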

## Aggregate threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`flush_capacity_ratio` | No | Float | The proportion of groups to force flush when the `aggregate_threshold` `maximum_size` is reached. Default is `0.5`.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate across all groups before objects are force flushed. For example, `128mb`.
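
For example, the following sketch pairs a dynamic `path_prefix` with an aggregate threshold. The `customer_id` field is hypothetical; once the in-memory groups reach `128mb` in total, half of the groups (`0.5`) are force flushed:

```
sink:
  - s3:
      object_key:
        path_prefix: ${/customer_id}/%{yyyy}/%{MM}/%{dd}/    # one group per customer_id per day
      aggregate_threshold:
        maximum_size: 128mb          # total bytes across all groups before a forced flush
        flush_capacity_ratio: 0.5    # flush half of the groups when the limit is reached
```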


## Buffer type

@@ -119,7 +154,7 @@ Option | Required | Type | Description

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
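
As an illustration, the following `object_key` sketch uses a hypothetical `my_partition_key` field together with hourly date-time folders. The resulting object keys combine this prefix with the fixed `events-%{yyyy-MM-dd'T'hh-mm-ss}` file pattern described above:

```
object_key:
  # Hypothetical dynamic prefix plus hourly date-time folders; should end with /
  path_prefix: ${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/
```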

## codec
