Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework gorouter error classifiers and retry logic #321

Open
4 tasks
maxmoehl opened this issue May 9, 2023 · 9 comments
Open
4 tasks

Rework gorouter error classifiers and retry logic #321

maxmoehl opened this issue May 9, 2023 · 9 comments
Assignees

Comments

@maxmoehl
Copy link
Member

maxmoehl commented May 9, 2023

Is this a security vulnerability?

No.

Issue

The current classifiers are misleading because they seem to represent relations between different sets of errors which should be completely distinct and, to some extent, abuse the error type.

Affected Versions

All.

Context

These two changes are closely related which is why I opened this as one issue. If we feel like this is not a good idea I can split it into two separate issues.

Suggested Changes

There are two (relatively) independent changes that I would like to discuss.

Make All Classifier Groups Distinct

Just because the groups happen to share some errors / classifiers they shouldn't depend on each other. This is a possible source of confusion and has caused mistakes in the past.

diff --git a/proxy/fails/classifier_group.go b/proxy/fails/classifier_group.go
index d0a1da5..b790c01 100644
--- a/proxy/fails/classifier_group.go
+++ b/proxy/fails/classifier_group.go
@@ -1,41 +1,57 @@
 package fails
 
 type ClassifierGroup []Classifier
 
 // RetriableClassifiers include backend errors that are safe to retry
 //
 // Backend errors are only safe to retry if we can be certain that they have
 // occurred before any http request data has been sent from gorouter to the
 // backend application.
 //
 // Otherwise, there’s risk of a mutating non-idempotent request (e.g. send
 // payment) being silently retried without the client knowing.
 var RetriableClassifiers = ClassifierGroup{
 	Dial,
 	AttemptedTLSWithNonTLSBackend,
 	HostnameMismatch,
 	RemoteFailedCertCheck,
 	RemoteHandshakeFailure,
 	RemoteHandshakeTimeout,
 	UntrustedCert,
 	ExpiredOrNotYetValidCertFailure,
 	IdempotentRequestEOF,
 	IncompleteRequest,
 }
 
 var FailableClassifiers = ClassifierGroup{
-	RetriableClassifiers,
+	Dial,
+	AttemptedTLSWithNonTLSBackend,
+	HostnameMismatch,
+	RemoteFailedCertCheck,
+	RemoteHandshakeFailure,
+	RemoteHandshakeTimeout,
+	UntrustedCert,
+	ExpiredOrNotYetValidCertFailure,
 	ConnectionResetOnRead,
 }
 
-var PrunableClassifiers = RetriableClassifiers
+var PrunableClassifiers = ClassifierGroup{
+	Dial,
+	AttemptedTLSWithNonTLSBackend,
+	HostnameMismatch,
+	RemoteFailedCertCheck,
+	RemoteHandshakeFailure,
+	RemoteHandshakeTimeout,
+	UntrustedCert,
+	ExpiredOrNotYetValidCertFailure,
+}
 
 // Classify returns true on errors that are retryable
 func (cg ClassifierGroup) Classify(err error) bool {
 	for _, classifier := range cg {
 		if classifier.Classify(err) {
 			return true
 		}
 	}
 	return false
 }

This change includes a minor tweak to the groups as well: IdempotentRequestEOF and IncompleteRequest are no longer part of the FailableClassifiers and PruneableClassifiers because those two errors are only "annotated" versions of an underlying error. Their sole purpose is to be able to match them using the RetriableClassifiers because we checked some pre-condition that allows us to retry the wrapped error even though we usually wouldn't be able to retry in that case.

Get Rid of "annotated" Errors

We initially introduced IdempotentRequestEOF and IncompleteRequest because we needed a way to tell the classifiers that those errors are retry-able without unconditionally retrying all of the errors that might get wrapped inside them. This included an additional check which is done in isRetriable.

The main issue is that we wrap errors without providing details that are particularly relevant in the sense that they enrich the error message. They purely exist because we need to pass a value of type error to the classifier groups. Instead I propose to split the logic for performing retries:

diff --git a/proxy/round_tripper/proxy_round_tripper.go b/proxy/round_tripper/proxy_round_tripper.go
index 5d2c9d0..09fd093 100644
--- a/proxy/round_tripper/proxy_round_tripper.go
+++ b/proxy/round_tripper/proxy_round_tripper.go
@@ -440,24 +439,23 @@ func isIdempotent(request *http.Request) bool {
 	return false
 }
 
-func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) (bool, error) {
+func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) bool {
 	// if the context has been cancelled we do not perform further retries
 	if request.Context().Err() != nil {
-		return false, fmt.Errorf("%w (%w)", request.Context().Err(), err)
+		return false
 	}
 
 	// io.EOF errors are considered safe to retry for certain requests
 	// Replace the error here to track this state when classifying later.
 	if err == io.EOF && isIdempotent(request) {
-		err = fails.IdempotentRequestEOFError
+		return true
 	}
 	// We can retry for sure if we never obtained a connection
 	// since there is no way any data was transmitted. If headers could not
 	// be written in full, the request should also be safe to retry.
 	if !trace.GotConn() || !trace.WroteHeaders() {
-		err = fmt.Errorf("%w (%w)", fails.IncompleteRequestError, err)
+		return true
 	}
 
-	retriable := rt.retriableClassifier.Classify(err)
-	return retriable, err
+	return rt.retriableClassifier.Classify(err)
 }

and completely remove IdempotentRequestEOF and IncompleteRequest. This way we get the benefits of our additional retries either from the checks that allow us to retry in special cases or from the classifiers while not tampering with the errors that are passed around and even displayed to the user (and I'm pretty sure that end-users would be confused if they see incomplete request (EOF) in response to their request).

Next Steps

  • Collect Feedback and discuss potential issues
  • Propose a PR
  • Review & Merge
  • Release
@maxmoehl
Copy link
Member Author

maxmoehl commented May 9, 2023

ping: @geofffranks @domdom82 @ameowlia

@geofffranks
Copy link
Contributor

On first read through this seems pretty reasonable and thorough. I like the notion of splitting the classifiers out explicitly to reduce confusion and accidental side effects of changing any one of these individually.

@domdom82
Copy link
Contributor

domdom82 commented May 10, 2023

Good idea. I'd say we tackle both issues in one go:

  • Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)
  • Remove types for IdempotentEOF and IncompleteRequest errors and replace them with simple boolean checks. For this we have to keep in mind: If we remove the types we can only use isRetriable to actually determine if an error is retriable, not the Classify function of the RetriableClassifiers group. So far, this is not a problem because they are used only in one place (round tripper). But we won't be able to take the error anywhere else and then call Classify on it do determine if it was / is retriable, for example in other handlers that want to add logs or logic. That's the only downside I see, but it is a rather theoretical one.
  • As for the access log, we are losing a little bit of information, where the x-cf-routererror field would change from
    EOF (via idempotent request) and incompleteRequest (EOF)
    to just
    EOF
    But we might be able to fix that later on with a fix to access log.

@geofffranks
Copy link
Contributor

Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)

I would clarify this also include dial and handshake timeouts in in this. There could be cases where future requests would eventually be possible once the condition causing a timeout goes away, but we shouldn't keep the backend in the pool while its known to time out. Otherwise a lot of requests will incur extra latency trying it before retrying requests. If the backend eventually recovers, route-emitter will re-add it.

maxmoehl added a commit to sap-contributions/gorouter that referenced this issue May 23, 2023
This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from
the fail-able and prune-able classifier group. This prevents errors that
should not affect the endpoint from being marked as failed or being
pruned. To do so all classifiers groups are split into distinct groups
and any cross references between them are removed.

The main motivation for this change is to avoid confusion and bugs due to
artificial dependencies between the groups.

Resolves: cloudfoundry/routing-release#321
Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`
maxmoehl added a commit to sap-contributions/gorouter that referenced this issue May 23, 2023
With this commit `isRetriable` no longer overwrites / wraps the error
that is passed to it. This was done to accommodate context from the
circumstances in which the error occurred into the error itself to be
able to match on those later on. This mechanism has proven to cause bugs
and increase overall complexity by abusing the error type.

Instead `isRetriable` now only returns whether a certain combination of
parameters is considered retry-able, either because the circumstances
allow for it or because the error matches one of the retry-able error
classifiers.

Resovles: cloudfoundry/routing-release#321
@maxmoehl
Copy link
Member Author

Sorry that his has been so quiet for the last months. I've split out the simple change into a dedicated PR that can probably be merged more easily and addresses the first issue: cloudfoundry/gorouter#355.

The combined changes are still in cloudfoundry/gorouter#349 but since I need to figure out the testing and the change is more involved I think it makes sense to separate them. I will look into it in the coming days and (hopefully) provide a full PR soon.

maxmoehl added a commit to sap-contributions/gorouter that referenced this issue Jul 22, 2023
This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from
the fail-able and prune-able classifier group. This prevents errors that
should not affect the endpoint from being marked as failed or being
pruned. To do so all classifiers groups are split into distinct groups
and any cross references between them are removed.

The main motivation for this change is to avoid confusion and bugs due to
artificial dependencies between the groups.

Partially-Resolves: cloudfoundry/routing-release#321
Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`
@maxmoehl maxmoehl reopened this Jul 28, 2023
@geofffranks
Copy link
Contributor

ah, sorry. gh auto-closed this when i merged :D

@maxmoehl
Copy link
Member Author

ah, sorry. gh auto-closed this when i merged :D

My fault, I tried to prevent it by changing the commit trailer to Partially-Resolves instead of Resolves but GitHub is pretty liberal with those trailers 😄

@MarcPaquette
Copy link
Contributor

Hi @domdom82 & @geofffranks ,
Any further activity on this issue needed?

@MarcPaquette MarcPaquette moved this from Inbox to Waiting for Changes | Open for Contribution in Application Runtime Platform Working Group May 2, 2024
@maxmoehl
Copy link
Member Author

maxmoehl commented May 2, 2024

There is still one PR from me open which I unfortunately never got around to finish up :/

So it's about half-way done 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Waiting for Changes | Open for Contribution
Development

No branches or pull requests

4 participants