Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dead letter index: align error field to ECS and do not forward retryable errors #793

Merged
merged 27 commits into from
Oct 8, 2024

Conversation

kaiyan-sheng
Copy link
Contributor

@kaiyan-sheng kaiyan-sheng commented Sep 12, 2024

What does this PR do?

This PR brings the following changes:

  • Align the error field in documents sent to the dead letter index (DLI) to the ECS format; the field now provides error.message and error.type.
  • Add the http.response.status_code field
  • Limit the error type sent to the DLI; do not send Elasticsearch client errors that:
    • are connection errors (do not have an http.response.status_code)
    • have a retriable status code (for example, 429)

Here is a sample error document from a mapping conflict:

{
  "@timestamp": "2024-10-07T05:57:59.448925Z",
  "message": "{\"hey\":{\"message\":\"hey there\"},\"_id\":\"e6542822-4583-438d-9b4d-1a3023b5eeb9\",\"_op_type\":\"create\",\"_index\":\"logs-succeed.pr793-default\"}",
  "error": {
    "message": "[1:30] failed to parse field [hey] of type [keyword] in document with id 'e6542822-4583-438d-9b4d-1a3023b5eeb9'. Preview of field's value: '{message=hey there}'",
    "type": "document_parsing_exception"
  },
  "http": {
    "response": {
      "status_code": 400
    }
  }
}

Why is it important?

  • The error field must always match the definition of the ECS error field.
  • Avoid sending connection-related errors to DLI.
  • Avoid sending errors with retryable status codes to DLI.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md

Related issues

@constanca-m
Copy link
Contributor

It looks good, is there a reason why you left it as draft? I see #799 that should be affected by these changes.

I'm also wondering if changing this to ECS format should require a version increase?

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

@constanca-m, I created #799 as an experiment, I guess we can focus on this one.

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

I'm running the ElasticsearchShipper in different scenarios to see what errors we get from the Elasticsearch client.

Invalid elasticsearch_url

Errors when the elasticsearch_url points to a server:port where no server is running:

CleanShot 2024-09-17 at 17 43 49@2x

Mapping exception

The document failed due to a mapping exception:

CleanShot 2024-09-17 at 17 12 08@2x

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

We have different fields depending on the error type, and some fields can have various types (status can be an str or an int, and error can be an str or a dict).

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

Quick recap:

Field 5xx N/A
status int str
error dict str
exception class

@constanca-m
Copy link
Contributor

We have different fields depending on the error type, and some fields can have various types (status can be an str or an int, and error can be an str or a dict).

Nice research @zmoog ! In that case we should include a test in the code with the possible different types.

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

We are targeting the following mappings:

{
  "mappings": {
    "dynamic": "false",
    "_data_stream_timestamp": {
      "enabled": true
    },
    "date_detection": false,
    "numeric_detection": false,
    "properties": {
      "error": {
        "properties": {
          "code": {
            "type": "keyword",
            "ignore_above": 1024
          },
          "id": {
            "type": "keyword",
            "ignore_above": 1024
          },
          "message": {
            "type": "match_only_text"
          },
          "stack_trace": {
            "type": "wildcard",
            "fields": {
              "text": {
                "type": "match_only_text"
              }
            }
          },
          "type": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}

@zmoog
Copy link
Contributor

zmoog commented Sep 17, 2024

We should probably at least provide a message:

Here's a possible mapping between error conditions and the error field to post to the dead letter index:

Dead letter field 5xx N/A
error.message From error.reason? From error?
error.type From error.type? From type(exception) or similar?

When available, we may also consider adding a http.response.status_code with the status code.

Dumping all the options now, don't consider this a plan (yet).

@kaiyan-sheng
Copy link
Contributor Author

@zmoog Thank you for the comparisons!! What's the exception.info struct in N/A case? Seems like that one has some information we can use for the N/A case.

@zmoog zmoog self-assigned this Sep 26, 2024
@zmoog zmoog added the enhancement New feature or request label Sep 26, 2024
@zmoog
Copy link
Contributor

zmoog commented Sep 26, 2024

What's the exception.info struct in N/A case? Seems like that one has some information we can use for the N/A case.

The errors[0]['create']['exception'] is a elasticsearch.exceptions.ConnectionError instance:

CleanShot 2024-09-26 at 17 45 28@2x

@zmoog zmoog changed the title Move error field to error.message and error.type Align dead letter index error field to ECS with error.message and error.type Sep 26, 2024
@zmoog zmoog marked this pull request as ready for review September 26, 2024 16:13
shippers/es.py Outdated Show resolved Hide resolved
@emilioalvap
Copy link
Contributor

@zmoog LGTM, thanks for tackling it! I'll defer the approval to @kaiyan-sheng since she had some comments

constanca-m
constanca-m previously approved these changes Oct 2, 2024
@zmoog
Copy link
Contributor

zmoog commented Oct 2, 2024

For the records, I tried sending the following test document:

{
  "hey": {
      "message": "hey there",
  }
}

On a data stream where the field mapping is:

{
  "hey": {
    "type": "keyword",
    "ignore_above": 1024
  }
}

Which causes a document_parsing_exception:

CleanShot 2024-10-02 at 10 25 54@2x

Here is the document in Elasticsearch:

CleanShot 2024-10-02 at 11 20 58@2x

@zmoog
Copy link
Contributor

zmoog commented Oct 2, 2024

Hey reviewers! Sorry for the noise; I just pushed the last commit with the docs update 🙇

@@ -230,6 +245,10 @@ For `elasticsearch` the following arguments are supported:
* `args.api_key`: API key of elasticsearch endpoint in the format `base64encode(api_key_id:api_key_secret)`. Mandatory when `args.username` and `args.password` are not provided. Will be ignored if `args.username`/`args.password` are defined.
* `args.es_datastream_name`: Name of data stream or index where logs should be forwarded to. Lambda supports automatic routing of various {aws} service logs to the corresponding data streams for further processing and storage in the {es} cluster. It supports automatic routing of `aws.cloudtrail`, `aws.cloudwatch_logs`, `aws.elb_logs`, `aws.firewall_logs`, `aws.vpcflow`, and `aws.waf` logs. For other log types, if using data streams, you can optionally set its value in the configuration file according to the naming convention for data streams and available integrations. If the `es_datastream_name` is not specified and it cannot be matched with any of the above {aws} services, then the value will be set to `logs-generic-default`. In versions **v0.29.1** and below, this configuration parameter was named `es_index_or_datastream_name`. Rename the configuration parameter to `es_datastream_name` in your `config.yaml` file on the S3 bucket to continue using it in the future version. The older name `es_index_or_datastream_name` is deprecated as of version **v0.30.0**. The related backward compatibility code is removed from version **v1.0.0**.
* `args.es_dead_letter_index`: Name of data stream or index where logs should be redirected to, in case indexing to `args.es_datastream_name` returned an error.
* `es_dead_letter_forward_errors`: List of errors that should be forwarded to the dead letter index. The default value is an empty list (forward all errors). If the list is not empty, only the errors in the list will be forwarded to the dead letter index.
In general, you can use values from the `error.type` field in the Elasticseach API response. Here is a few examples of errors that can be forwarded to the dead letter index:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, you can use values from the error.type

What does this mean exactly? I see the condition to not be sent is:

(  # no http status: connection error
                self._es_dead_letter_forward_errors
                and action["error"]["type"] not in self._es_dead_letter_forward_errors
            ):

I don't really understand the first part of the condition. It means you have to write the whole error to exclude it? Is there another way to exclude an error other than its error.type? @zmoog

Copy link
Contributor

@zmoog zmoog Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had the feeling this part wasn't clear enough.

In general, you can use values from the error.type

The error.type in this sentence refers to the Elasticsearch API response.

For example, here's the Elasticsearch API response for

{
  "error": {
    "root_cause": [
      {
        "type": "fail_processor_exception",
        "reason": "Fail message"
      }
    ],
    "type": "fail_processor_exception",
    "reason": "Fail message"
  },
  "status": 500
}

_parse_error() picks error.type from the ES API response, so the shipper can later compare it with the values in the _es_dead_letter_forward_errors list.

I don't really understand the first part of the condition. It means you have to write the whole error to exclude it?

I don't get what you mean with "you have to write the whole error to exclude it".

Is there another way to exclude an error other than its error.type?

Users reported problems with the DLI forwarding all errors to ES.

This PR focuses on adding essential filtering capability:

  • do not forward connection-related errors (we assumed errors without http.response.status_code are related o connection failures).
  • filter out errors based on their error.type

There are probably more ways to filter the errors based on other criteria.

However, the Failure Store will ship in one upcoming ES release. So, it's better to give ESF what it needs today without re-implementing Failure Store features.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@constanca-m I tried to clarify the es_dead_letter_forward_errors option docs: does it flow better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely! Thanks, what was causing me confusion was the term in general because it was giving me the idea there were other options to filter errors other than the error type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will never ignore the feeling that a sentence isn't working again.

Thanks for raising the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmoog Can we add the clarification that error.type is the one returned by bulk() api? AFAIK, there's no exhaustive list of error types but that should be a good starting point

constanca-m
constanca-m previously approved these changes Oct 4, 2024
Copy link

github-actions bot commented Oct 4, 2024

Minimum allowed coverage is 100%

Generated by 🐒 cobertura-action against f41a5f9

ES shipper should only forward to DLI persistent errors like mapping
exceptions.
@zmoog
Copy link
Contributor

zmoog commented Oct 7, 2024

@kaiyan-sheng, @constanca-m, @emilioalvap, I'm sorry for pushing yet another commit to this PR 🙇

Our tech leads asked to simplify the criteria to forward to DLI along these lines.

The ES output should not forward the error to DLI when:

  • there is no response from ES (no status code, probably due to connection errors)
  • the status code from ES is retriable; for example, 429

@constanca-m
Copy link
Contributor

there is no response from ES (not status code, probably due to connection errors)
the status code from ES is retriable; for example, 429

Sounds good!

for action in actions:
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

shippers/es.py Outdated Show resolved Hide resolved
@constanca-m
Copy link
Contributor

@zmoog The documentation is no longer in this PR, did you remove it on purpose?

@zmoog
Copy link
Contributor

zmoog commented Oct 7, 2024

The documentation is no longer in this PR, did you remove it on purpose?

I removed it on purpose because the error types list no longer exists. But I should update the docs for the es_dead_letter_index, on it!

@zmoog zmoog changed the title Align dead letter index error field to ECS with error.message and error.type dead letter index: align error field to ECS and do not forward retryable errors Oct 7, 2024
@zmoog zmoog merged commit 96fb54e into elastic:main Oct 8, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dead letter index: store errors using the ECS field format
4 participants