Rework gorouter error classifiers and retry logic #321

maxmoehl · 2023-05-09T11:01:32Z

Is this a security vulnerability?

No.

Issue

The current classifiers are misleading because they seem to represent relations between different sets of errors which should be completely distinct and, to some extent, abuse the error type.

Affected Versions

All.

Context

These two changes are closely related which is why I opened this as one issue. If we feel like this is not a good idea I can split it into two separate issues.

Suggested Changes

There are two (relatively) independent changes that I would like to discuss.

Make All Classifier Groups Distinct

Just because the groups happen to share some errors / classifiers they shouldn't depend on each other. This is a possible source of confusion and has caused mistakes in the past.

diff --git a/proxy/fails/classifier_group.go b/proxy/fails/classifier_group.go
index d0a1da5..b790c01 100644
--- a/proxy/fails/classifier_group.go
+++ b/proxy/fails/classifier_group.go
@@ -1,41 +1,57 @@
 package fails
 
 type ClassifierGroup []Classifier
 
 // RetriableClassifiers include backend errors that are safe to retry
 //
 // Backend errors are only safe to retry if we can be certain that they have
 // occurred before any http request data has been sent from gorouter to the
 // backend application.
 //
 // Otherwise, there’s risk of a mutating non-idempotent request (e.g. send
 // payment) being silently retried without the client knowing.
 var RetriableClassifiers = ClassifierGroup{
 	Dial,
 	AttemptedTLSWithNonTLSBackend,
 	HostnameMismatch,
 	RemoteFailedCertCheck,
 	RemoteHandshakeFailure,
 	RemoteHandshakeTimeout,
 	UntrustedCert,
 	ExpiredOrNotYetValidCertFailure,
 	IdempotentRequestEOF,
 	IncompleteRequest,
 }
 
 var FailableClassifiers = ClassifierGroup{
-	RetriableClassifiers,
+	Dial,
+	AttemptedTLSWithNonTLSBackend,
+	HostnameMismatch,
+	RemoteFailedCertCheck,
+	RemoteHandshakeFailure,
+	RemoteHandshakeTimeout,
+	UntrustedCert,
+	ExpiredOrNotYetValidCertFailure,
 	ConnectionResetOnRead,
 }
 
-var PrunableClassifiers = RetriableClassifiers
+var PrunableClassifiers = ClassifierGroup{
+	Dial,
+	AttemptedTLSWithNonTLSBackend,
+	HostnameMismatch,
+	RemoteFailedCertCheck,
+	RemoteHandshakeFailure,
+	RemoteHandshakeTimeout,
+	UntrustedCert,
+	ExpiredOrNotYetValidCertFailure,
+}
 
 // Classify returns true on errors that are retryable
 func (cg ClassifierGroup) Classify(err error) bool {
 	for _, classifier := range cg {
 		if classifier.Classify(err) {
 			return true
 		}
 	}
 	return false
 }

This change includes a minor tweak to the groups as well: IdempotentRequestEOF and IncompleteRequest are no longer part of the FailableClassifiers and PruneableClassifiers because those two errors are only "annotated" versions of an underlying error. Their sole purpose is to be able to match them using the RetriableClassifiers because we checked some pre-condition that allows us to retry the wrapped error even though we usually wouldn't be able to retry in that case.

Get Rid of "annotated" Errors

We initially introduced IdempotentRequestEOF and IncompleteRequest because we needed a way to tell the classifiers that those errors are retry-able without unconditionally retrying all of the errors that might get wrapped inside them. This included an additional check which is done in isRetriable.

The main issue is that we wrap errors without providing details that are particularly relevant in the sense that they enrich the error message. They purely exist because we need to pass a value of type error to the classifier groups. Instead I propose to split the logic for performing retries:

diff --git a/proxy/round_tripper/proxy_round_tripper.go b/proxy/round_tripper/proxy_round_tripper.go
index 5d2c9d0..09fd093 100644
--- a/proxy/round_tripper/proxy_round_tripper.go
+++ b/proxy/round_tripper/proxy_round_tripper.go
@@ -440,24 +439,23 @@ func isIdempotent(request *http.Request) bool {
 	return false
 }
 
-func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) (bool, error) {
+func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) bool {
 	// if the context has been cancelled we do not perform further retries
 	if request.Context().Err() != nil {
-		return false, fmt.Errorf("%w (%w)", request.Context().Err(), err)
+		return false
 	}
 
 	// io.EOF errors are considered safe to retry for certain requests
 	// Replace the error here to track this state when classifying later.
 	if err == io.EOF && isIdempotent(request) {
-		err = fails.IdempotentRequestEOFError
+		return true
 	}
 	// We can retry for sure if we never obtained a connection
 	// since there is no way any data was transmitted. If headers could not
 	// be written in full, the request should also be safe to retry.
 	if !trace.GotConn() || !trace.WroteHeaders() {
-		err = fmt.Errorf("%w (%w)", fails.IncompleteRequestError, err)
+		return true
 	}
 
-	retriable := rt.retriableClassifier.Classify(err)
-	return retriable, err
+	return rt.retriableClassifier.Classify(err)
 }

and completely remove IdempotentRequestEOF and IncompleteRequest. This way we get the benefits of our additional retries either from the checks that allow us to retry in special cases or from the classifiers while not tampering with the errors that are passed around and even displayed to the user (and I'm pretty sure that end-users would be confused if they see incomplete request (EOF) in response to their request).

Next Steps

Collect Feedback and discuss potential issues
Propose a PR
Review & Merge
Release

The text was updated successfully, but these errors were encountered:

maxmoehl · 2023-05-09T11:01:49Z

ping: @geofffranks @domdom82 @ameowlia

geofffranks · 2023-05-09T13:54:27Z

On first read through this seems pretty reasonable and thorough. I like the notion of splitting the classifiers out explicitly to reduce confusion and accidental side effects of changing any one of these individually.

domdom82 · 2023-05-10T08:41:59Z

Good idea. I'd say we tackle both issues in one go:

Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)
Remove types for IdempotentEOF and IncompleteRequest errors and replace them with simple boolean checks. For this we have to keep in mind: If we remove the types we can only use isRetriable to actually determine if an error is retriable, not the Classify function of the RetriableClassifiers group. So far, this is not a problem because they are used only in one place (round tripper). But we won't be able to take the error anywhere else and then call Classify on it do determine if it was / is retriable, for example in other handlers that want to add logs or logic. That's the only downside I see, but it is a rather theoretical one.
As for the access log, we are losing a little bit of information, where the x-cf-routererror field would change from
EOF (via idempotent request) and incompleteRequest (EOF)
to just
EOF
But we might be able to fix that later on with a fix to access log.

geofffranks · 2023-05-10T14:04:31Z

Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)

I would clarify this also include dial and handshake timeouts in in this. There could be cases where future requests would eventually be possible once the condition causing a timeout goes away, but we shouldn't keep the backend in the pool while its known to time out. Otherwise a lot of requests will incur extra latency trying it before retrying requests. If the backend eventually recovers, route-emitter will re-add it.

This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from the fail-able and prune-able classifier group. This prevents errors that should not affect the endpoint from being marked as failed or being pruned. To do so all classifiers groups are split into distinct groups and any cross references between them are removed. The main motivation for this change is to avoid confusion and bugs due to artificial dependencies between the groups. Resolves: cloudfoundry/routing-release#321 Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`

With this commit `isRetriable` no longer overwrites / wraps the error that is passed to it. This was done to accommodate context from the circumstances in which the error occurred into the error itself to be able to match on those later on. This mechanism has proven to cause bugs and increase overall complexity by abusing the error type. Instead `isRetriable` now only returns whether a certain combination of parameters is considered retry-able, either because the circumstances allow for it or because the error matches one of the retry-able error classifiers. Resovles: cloudfoundry/routing-release#321

maxmoehl · 2023-07-22T18:53:37Z

Sorry that his has been so quiet for the last months. I've split out the simple change into a dedicated PR that can probably be merged more easily and addresses the first issue: cloudfoundry/gorouter#355.

The combined changes are still in cloudfoundry/gorouter#349 but since I need to figure out the testing and the change is more involved I think it makes sense to separate them. I will look into it in the coming days and (hopefully) provide a full PR soon.

This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from the fail-able and prune-able classifier group. This prevents errors that should not affect the endpoint from being marked as failed or being pruned. To do so all classifiers groups are split into distinct groups and any cross references between them are removed. The main motivation for this change is to avoid confusion and bugs due to artificial dependencies between the groups. Partially-Resolves: cloudfoundry/routing-release#321 Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`

geofffranks · 2023-07-28T13:16:24Z

ah, sorry. gh auto-closed this when i merged :D

maxmoehl · 2023-07-28T15:10:46Z

ah, sorry. gh auto-closed this when i merged :D

My fault, I tried to prevent it by changing the commit trailer to Partially-Resolves instead of Resolves but GitHub is pretty liberal with those trailers 😄

MarcPaquette · 2024-05-02T14:36:06Z

Hi @domdom82 & @geofffranks ,
Any further activity on this issue needed?

maxmoehl · 2024-05-02T17:20:23Z

There is still one PR from me open which I unfortunately never got around to finish up :/

So it's about half-way done 😄

maxmoehl mentioned this issue Jun 22, 2023

Add @maxmoehl as Networking Approver cloudfoundry/community#628

Merged

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Jul 26, 2023

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Jul 26, 2023

geofffranks closed this as completed in cloudfoundry/gorouter@b820e88 Jul 27, 2023

github-project-automation bot moved this from Inbox to Done in Application Runtime Platform Working Group Jul 27, 2023

maxmoehl reopened this Jul 28, 2023

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Sep 19, 2023

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Sep 19, 2023

ctlong removed this from Application Runtime Platform Working Group Feb 22, 2024

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Feb 22, 2024

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Feb 22, 2024

MarcPaquette moved this from Inbox to Waiting for Changes | Open for Contribution in Application Runtime Platform Working Group May 2, 2024

MarcPaquette assigned domdom82 and geofffranks May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework gorouter error classifiers and retry logic #321

Rework gorouter error classifiers and retry logic #321

maxmoehl commented May 9, 2023 •

edited

Loading

maxmoehl commented May 9, 2023

geofffranks commented May 9, 2023

domdom82 commented May 10, 2023 •

edited

Loading

geofffranks commented May 10, 2023

maxmoehl commented Jul 22, 2023

geofffranks commented Jul 28, 2023

maxmoehl commented Jul 28, 2023

MarcPaquette commented May 2, 2024

maxmoehl commented May 2, 2024

Rework gorouter error classifiers and retry logic #321

Rework gorouter error classifiers and retry logic #321

Comments

maxmoehl commented May 9, 2023 • edited Loading

Is this a security vulnerability?

Issue

Affected Versions

Context

Suggested Changes

Make All Classifier Groups Distinct

Get Rid of "annotated" Errors

Next Steps

maxmoehl commented May 9, 2023

geofffranks commented May 9, 2023

domdom82 commented May 10, 2023 • edited Loading

geofffranks commented May 10, 2023

maxmoehl commented Jul 22, 2023

geofffranks commented Jul 28, 2023

maxmoehl commented Jul 28, 2023

MarcPaquette commented May 2, 2024

maxmoehl commented May 2, 2024

maxmoehl commented May 9, 2023 •

edited

Loading

domdom82 commented May 10, 2023 •

edited

Loading