Do not retry on 413 response codes #1199
base: main
Conversation
Previously, when Elasticsearch responded with a 413 (Payload Too Large) status, the manticore adapter raised an error before the response could be processed by the bulk_send error handling. This commit updates the adapter to pass the response code through without raising an exception so that it can be handled properly. NOTE: this only applies to code 413, even though there are arguably other error codes that should not be retried. The scope of this work is limited to 413, where we have explicit handling.
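A minimal sketch of the behaviour being described, with illustrative class and method names rather than the plugin's actual adapter code:

```ruby
# Illustrative only: the error type the adapter would otherwise raise.
class BadResponseCodeError < StandardError
  attr_reader :response_code

  def initialize(code)
    @response_code = code
    super("Elasticsearch returned status #{code}")
  end
end

# Pass a 413 back to the caller so bulk_send can inspect it; any other
# non-2xx status still raises, as before.
def check_response!(response)
  code = response.code
  return response if (200..299).cover?(code) || code == 413
  raise BadResponseCodeError.new(code)
end
```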
Still manually testing this. I'm a bit confused by the results currently. I configured ES with a very low max message size and sent it a payload that exceeds it. Without the change it appears the pipeline just fails:
With this change it appears to attempt the retries:
It seems like preventing the adapter from raising an error results in exponential backoff, which, from your description, is not desirable. Do we want to update the retry logic so that it does not attempt to retry 413 errors? I'm not sure what the expectations are around the DLQ (at least now there is an event that will carry the error details). Essentially my question is: do we want to DLQ 413 errors in logstash-output-elasticsearch/lib/logstash/plugin_mixins/elasticsearch/common.rb Lines 287 to 302 in de61518
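For illustration only (assumed names and codes, not the plugin's actual retry code), the kind of routing I have in mind would look roughly like this:

```ruby
require 'set'

# 413 added alongside the codes assumed to already be treated as non-retriable.
NON_RETRIABLE_CODES = Set[400, 404, 413]

def action_for_status(status, event, dlq_writer)
  if NON_RETRIABLE_CODES.include?(status)
    dlq_writer.call(event, "rejected with status #{status}") # hand the event to the DLQ
    :dropped
  else
    :retry # everything else keeps the existing retry behaviour
  end
end
```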
AFAIK, we only route events to the DLQ in two situations:
When we fail to make a successful HTTP request, we are supposed to retry indefinitely (but in your unpatched test, since the pipeline is being shut down because all inputs closed, it bails on the retries). By not raising the 413, the request is considered a "success" and doc-level handling gets invoked (we expect a valid bulk-API response with one entry per sent document). BUT: what happens when we have a batch of two events, one of which exceeds the limit? Do we get an HTTP 413, or do we get an HTTP 2XX with doc-level 413s? I've seen situations where ES batch-of-one responses carry the HTTP status of the doc-level response.
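For reference, these are the two response shapes the question contrasts (values are illustrative, not captured from a real Elasticsearch response):

```ruby
# Whole-request rejection: the HTTP status is 413 and there is nothing
# doc-level to inspect.
http_413 = { code: 413, body: "" }

# Per-document rejection inside an otherwise successful bulk response: the
# request is 2xx, but an individual item carries its own error status.
http_200_with_doc_error = {
  code: 200,
  body: {
    "errors" => true,
    "items"  => [
      { "index" => { "status" => 201 } },                                  # small doc, accepted
      { "index" => { "status" => 413, "error" => { "reason" => "..." } } } # oversized doc, rejected
    ]
  }
}
```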
From my testing it certainly looks like message size within a batch is not handled differently. If the batch is too large, it is wholesale rejected with a 413, even if the batch contains events that would have been under the size limit. From this comment logstash-output-elasticsearch/lib/logstash/outputs/elasticsearch/http_client.rb Lines 10 to 23 in de61518
Assuming we are sticking with the 20 MiB batch size (and assuming it is reasonable that any ES deployment would accept this size), the only time you would be dealing with 413 errors is when you are generating a single message that is too big. I don't think we would want to keep retrying that large message, but putting it in the DLQ would probably be useful for users to find the problematic messages.
We definitely have some issues with the BadResponseCodeError abstraction. A suggestion is to move BadResponseCodeError a layer up so that perform_request no longer raises BadResponseCodeError. This leaves the caller to decide what counts as a bad response code depending on the type of request:
This way we can restore the previous behaviour of allowing the http client to make the choice of reinterpreting the bulk-level 413 as a document-level error, since we know there's only a single document in the bulk. Not sure if this change carries other consequences.
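A rough sketch of what caller-level interpretation could look like (method names are hypothetical, not the plugin's actual code), assuming perform_request simply returns the response:

```ruby
# Template API caller: 404 just means "not installed"; other non-2xx is an error.
def interpret_template_response(code)
  return :absent  if code == 404
  return :present if (200..299).cover?(code)
  :error
end

# Bulk API caller: a 413 on a single-document bulk is treated as a
# document-level failure (DLQ candidate) rather than a retriable transport error.
def interpret_bulk_response(code, doc_count)
  return :doc_too_large if code == 413 && doc_count == 1
  return :ok            if (200..299).cover?(code)
  :error
end
```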
I've been experimenting with refactoring the BadResponseCodeError handling. I think the surface area in the code itself won't be too large; the majority of the changes will be in ensuring the test suite mocking works with the refactor. As I'm working on getting the tests together, I'll push a commit showing what I'm thinking for the refactor. Take a look at 8427d29 and let me know if that is way off base before I go too far down the test refactor :)
Instead of trying to handle this at the manticore adapter layer, let callers decide how to handle response codes and raise or generate errors. This will allow us to DLQ 413 errors without retrying, and to handle 404 for template API interaction, without having to rely on catching generic errors.
Force-pushed from 8427d29 to 82872d0.
Current State
The current change (82872d0) has:
Open questions
Is it desirable to not retry 413s and send them to the DLQ where possible? Is the refactor moving the BadResponseCodeError generation out of the adapter worthwhile?
With a config like:
And ES configured to have a very low message size (1kb)
We see
I would probably need to add some conditional logging to ensure that we are not claiming to retry when we are in fact not going to. Before I continue, though, I want to make sure this approach is desirable.
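For what it's worth, the conditional logging could be as simple as something like this (a hypothetical sketch, not existing plugin code):

```ruby
# Only claim a retry in the log when the failure will actually be retried;
# 413 is the one status we plan to DLQ/drop instead.
def log_bulk_failure(logger, code, url)
  if code == 413
    logger.warn("Bulk request to #{url} failed with status 413; payload too large, not retrying")
  else
    logger.warn("Bulk request to #{url} failed with status #{code}; retrying with exponential backoff")
  end
end
```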