Retry individual messages/requests when failing with 429/Data too large. #21313

Draft: dennisoelkers wants to merge 5 commits into master

@dennisoelkers (Member) commented Jan 10, 2025

Note: This needs a backport to 6.0 & 6.1.

Description

Motivation and Context

Before this PR, when OpenSearch nodes run out of heap space or trip a circuit breaker during indexing, one of two things can happen:

  • The request for the entire batch is rejected with a 429 Too Many Requests: currently the batch is halved and retried. Once the batch size reaches zero, retrying is abandoned.
  • Some bulk items succeed while others fail with a Data too large error: these failures are treated as permanent errors (like mapping exceptions) and are written to the failure processing collection if available, or otherwise dropped.
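For reviewers unfamiliar with the old path, the halving behavior in the first bullet can be sketched roughly as below. This is an illustration only, not the actual indexer code: `TooManyRequests`, `send_bulk` and `index_with_halving` are hypothetical names introduced for the example.

```python
from typing import Callable, List, Sequence


class TooManyRequests(Exception):
    """Hypothetical stand-in for a whole-request 429 response."""


def index_with_halving(
    messages: Sequence[str],
    send_bulk: Callable[[Sequence[str]], None],
    initial_batch_size: int,
) -> List[str]:
    """Send messages in chunks, halving the chunk size on every 429.

    Returns the messages that were given up on (empty list if all were sent).
    """
    batch_size = initial_batch_size
    offset = 0
    while offset < len(messages):
        if batch_size == 0:
            # Nothing smaller left to try: the remaining messages are lost.
            return list(messages[offset:])
        chunk = messages[offset:offset + batch_size]
        try:
            send_bulk(chunk)
            offset += len(chunk)
        except TooManyRequests:
            # Integer halving eventually reaches zero (4 -> 2 -> 1 -> 0).
            batch_size //= 2
    return []
```

The point of the sketch is the early `return`: once the batch size hits zero, whatever has not been indexed yet is abandoned, which is the data-loss case this PR is about.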

To improve this and avoid potential data loss, this PR changes the behavior as follows:

  • When the overall request fails with a circuit_breaking_exception, we now use the same retryer that handles cases where the target index does not exist or the indexer master is not discovered yet, retrying indefinitely with an exponential backoff.
  • If individual items of a bulk request fail with a Data too large exception, we retry those items indefinitely with an exponential backoff as well, just as for blocked indices.
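The per-item handling in the second bullet could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the retryer this PR wires in: `send_bulk` is a hypothetical callable returning one error string (or `None`) per item, `is_retryable` matches on the error text, and the delay values are made up for the example.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def is_retryable(error: Optional[str]) -> bool:
    """Assumption: retryable item failures are recognised by their message."""
    return error is not None and "Data too large" in error


def index_with_retries(
    messages: Sequence[str],
    send_bulk: Callable[[Sequence[str]], List[Optional[str]]],
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 0.1,
    max_delay: float = 30.0,
) -> List[Tuple[str, str]]:
    """Retry 'Data too large' items until they succeed.

    Returns only the permanent failures (e.g. mapping errors).
    """
    permanent: List[Tuple[str, str]] = []
    pending = list(messages)
    attempt = 0
    while pending:
        results = send_bulk(pending)
        retry: List[str] = []
        for message, error in zip(pending, results):
            if error is None:
                continue  # indexed successfully
            if is_retryable(error):
                retry.append(message)
            else:
                permanent.append((message, error))
        if retry:
            # Exponential backoff, capped at max_delay; no attempt limit.
            sleep(min(max_delay, base_delay * 2 ** attempt))
            attempt += 1
        pending = retry
    return permanent
```

Only the failed items are resent on each round, so items that already succeeded are not indexed twice, and permanent failures still go down the existing failure path.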

Fixes #21282.

How Has This Been Tested?

I wrote an integration test that simulates this condition by setting indices.breaker.total.limit to a very low value. The same procedure was used to test it locally as well.
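For anyone wanting to reproduce this locally, one way to lower the limit is through the cluster settings API. The sketch below only builds the request body; the `1%` value and the use of a transient setting are assumptions for illustration, since the exact value used in the test is not stated here, and `breaker_limit_settings` is a hypothetical helper.

```python
import json


def breaker_limit_settings(limit: str) -> str:
    """Build a JSON body for PUT /_cluster/settings lowering the parent breaker."""
    body = {"transient": {"indices.breaker.total.limit": limit}}
    return json.dumps(body)


# Example body; "1%" is an illustrative value, not the one used in the test.
print(breaker_limit_settings("1%"))
```

Sending this body to `PUT /_cluster/settings` on a test node should make bulk requests trip the parent circuit breaker quickly.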

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@dennisoelkers force-pushed the feat/retry-data-too-large-errors branch from 8217c3d to 8727f5a on January 10, 2025 10:55

Successfully merging this pull request may close these issues.

OpenSearch indexing error causes lost messages