
Retry on configurable exception #6991

Open

YuriyHolinko wants to merge 5 commits into main

Conversation


@YuriyHolinko YuriyHolinko commented Jan 6, 2025

The set of retryable exceptions is very limited in the current logic, so we lose data whenever any other IO exception (one not mentioned in the current Java code) happens.
Since networks differ, we can run into different exceptions. In my environment I caught a few exceptions that very likely need to be retried, and they are not listed as retryable in the current code.
Since each environment is different, I suggest adding the ability to configure retryable exceptions.

The change is fully backward compatible and does not change the default behaviour of the library.
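For illustration, a minimal usage sketch of the proposed option (the setRetryExceptionPredicate name mirrors the getRetryExceptionPredicate accessor in the diff below; the exception types and attempt count are just examples, the rest is existing RetryPolicy builder API):

    import io.opentelemetry.sdk.common.export.RetryPolicy;

    import java.io.InterruptedIOException;
    import java.net.UnknownHostException;

    class ConfigurableRetryExample {
      // Sketch: expand the set of retryable exceptions without touching the built-in rules.
      static RetryPolicy retryPolicy() {
        return RetryPolicy.builder()
            .setMaxAttempts(5)
            // New option proposed in this PR; the exceptions listed here are illustrative.
            .setRetryExceptionPredicate(
                e -> e instanceof UnknownHostException || e instanceof InterruptedIOException)
            .build();
      }
    }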

@YuriyHolinko YuriyHolinko requested a review from a team as a code owner January 6, 2025 21:07

linux-foundation-easycla bot commented Jan 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@YuriyHolinko
Author

Resolves #6962


codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.01%. Comparing base (ccccd1b) to head (aac988d).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6991      +/-   ##
============================================
+ Coverage     89.97%   90.01%   +0.04%     
- Complexity     6591     6599       +8     
============================================
  Files           729      729              
  Lines         19852    19856       +4     
  Branches       1953     1954       +1     
============================================
+ Hits          17861    17873      +12     
+ Misses         1396     1387       -9     
- Partials        595      596       +1     


-        RetryInterceptor::isRetryableException,
+        e ->
+            retryPolicy.getRetryExceptionPredicate().test(e)
+                || RetryInterceptor.isRetryableException(e),
Member

The OR here is interesting. It means a user can choose to expand the definition of what is retryable but not reduce it. I wonder if there are any cases when you would not want to retry when the default would retry. 🤔

Author

@YuriyHolinko YuriyHolinko Jan 7, 2025

> a user can choose to expand the definition of what is retryable but not reduce it

That's exactly the idea.

> I wonder if there are any cases when you would not want to retry when the default would retry

I would say no 🤔

Member

> I would say no 🤔

I think we might...

Suppose we want to expose options giving the user full control over what's retryable (as I alluded to at the end of this comment), we'd probably want to do something like:

  • Expose a single configurable predicate option of the form setRetryPredicate(Predicate<Throwable>)
  • Funnel all failed requests through this predicate, whether they resolved a response with a status code or ended with an exception
  • This means we'd need to translate requests with a non-200 status code to an equivalent exception to pass to the predicate
  • If the user doesn't define their own predicate, default to one that retries when the status is retryable (429, 502, 503, 504) or when the exception is retryable (like one of the SocketTimeoutExceptions we've discussed).

In this case, it's possible that a user doesn't want to retry on a particular response status code like 502, even when the default behavior is to retry on it.
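A rough sketch of that shape, purely hypothetical (none of these types exist in the SDK today; the status codes and the SocketTimeoutException fallback are the defaults mentioned above):

    import java.io.IOException;
    import java.net.SocketTimeoutException;
    import java.util.function.Predicate;

    // Hypothetical: a failed response is wrapped in an exception so that a single
    // Predicate<Throwable> decides retries for both status codes and I/O errors.
    class RetryableStatusException extends IOException {
      final int statusCode;

      RetryableStatusException(int statusCode) {
        super("HTTP status " + statusCode);
        this.statusCode = statusCode;
      }
    }

    final class DefaultRetryPredicate implements Predicate<Throwable> {
      @Override
      public boolean test(Throwable t) {
        if (t instanceof RetryableStatusException) {
          int code = ((RetryableStatusException) t).statusCode;
          return code == 429 || code == 502 || code == 503 || code == 504;
        }
        // Exception path: defer to the existing exception rules, e.g. connect timeouts.
        return t instanceof SocketTimeoutException;
      }
    }

A user supplying their own predicate could then, for example, return false for 502 while keeping the rest of the defaults.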

@jack-berg
Member

Thanks for the PR!

> In my environment I caught a few exceptions that very likely need to be retried, and they are not listed as retryable in the current code. Since each environment is different, I suggest adding the ability to configure retryable exceptions.

Wondering if you could elaborate on these, since it's possible that the errors aren't actually environment-specific and everyone could benefit from them. My initial inclination was that we should just update the static definition of what constitutes a retryable exception, but I'm open to being wrong.

@YuriyHolinko
Author

YuriyHolinko commented Jan 7, 2025

hey @jack-berg

> Wondering if you could elaborate on these, since it's possible that the errors aren't actually environment-specific and everyone could benefit from them. My #6962 (comment) was that we should just update the static definition of what constitutes a retryable exception, but I'm open to being wrong.

3 exceptions from me:

  1. DNS issues. My services run on popular cloud providers and use their DNS services, but sporadically I encounter issues like this:
java.net.UnknownHostException: xxxxxx.com
	at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:801)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1533)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1385)
	at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1306)
	at okhttp3.Dns$Companion$DnsSystem.lookup(Dns.kt:49)
	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.kt:164)
  2. Interrupted I/O timeouts:
java.io.InterruptedIOException: timeout
    at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
  3. Socket read timeouts:
java.net.SocketTimeoutException: timeout
    at okio.SocketAsyncTimeout.newTimeoutException(JvmOkio.kt:143)
    at okio.AsyncTimeout.access$newTimeoutException(AsyncTimeout.kt:162)
    at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:340)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.kt:449)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:333)
    at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)

Also, recently we had network issues (retryable) when using SSL, but it was luckily solved by a Java upgrade, so I can neglect it. Still, it might be useful for some users on some Java versions.

I don't mind putting all of that into the "static" definition, but the reason I want it to be configurable is the ability to apply a quick fix when a new retryable exception is discovered. Also, I don't know all the exceptions that other people encounter in their networks, so the list of exceptions is not complete.

So I can combine it all in the static definition, in addition to the current retryable exceptions, but I want to preserve the dynamic config as well.

Let me know your thoughts.

@YuriyHolinko YuriyHolinko marked this pull request as draft January 7, 2025 16:24
@YuriyHolinko YuriyHolinko marked this pull request as ready for review January 7, 2025 16:24
@YuriyHolinko YuriyHolinko requested a review from jack-berg January 7, 2025 16:24
@YuriyHolinko YuriyHolinko changed the title Retry on configurable exception Retry on configurable exception Jan 7, 2025
@YuriyHolinko
Author

There is a flaky test in one of the checks, not related to my change because it's in the metrics product 🙈
Could anyone tell me if I can rerun it somehow without pushing new commits?

@jack-berg
Member

> java.net.SocketTimeoutException: timeout
> java.io.InterruptedIOException: timeout

Hmm.. let's think about these. They are clearly the result of some sort of timeout occurring. We take the arguments for setTimeout and setConnectTimeout and apply them to the OkHttpClient here:

        new OkHttpClient.Builder()
            .dispatcher(OkHttpUtil.newDispatcher())
            .connectTimeout(Duration.ofNanos(connectionTimeoutNanos))
            .callTimeout(Duration.ofNanos(timeoutNanos));

The callTimeout represents the total allotted time for everything to resolve with the call, including resolving DNS, connecting, writing request body, reading response body, and any additional retry requests. So if this is exceeded, it won't do any good to retry because there's no allotted time left for the additional attempts to resolve.

The connectTimeout is a little different. It represents the max amount of time to connect a TCP socket to the target host. If this is less than callTimeout, then there is still time remaining in the allotted callTimeout for a retry to succeed. And so I think it's correct to retry if this occurs, and if we look at RetryInterceptor, we see there's an attempt to retry in this type of situation.

So I think it's appropriate to extend the condition to include the exception you're seeing:

    if (e instanceof SocketTimeoutException) {
      String message = e.getMessage();
      // Connect timeouts can produce SocketTimeoutExceptions with no message, or with "connect
      // timed out", or "timeout"
      return message == null
          || message.toLowerCase(Locale.ROOT).contains("connect timed out")
          || message.toLowerCase(Locale.ROOT).contains("timeout");
    }
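A quick standalone check of how that message matching would behave (sketch only, not the actual RetryInterceptor test suite):

    import java.net.SocketTimeoutException;
    import java.util.Locale;

    final class SocketTimeoutMessageCheck {
      // Same condition as the snippet above, extracted for a quick sanity check.
      static boolean retryable(SocketTimeoutException e) {
        String message = e.getMessage();
        return message == null
            || message.toLowerCase(Locale.ROOT).contains("connect timed out")
            || message.toLowerCase(Locale.ROOT).contains("timeout");
      }

      public static void main(String[] args) {
        System.out.println(retryable(new SocketTimeoutException("timeout")));           // true
        System.out.println(retryable(new SocketTimeoutException("connect timed out"))); // true
        System.out.println(retryable(new SocketTimeoutException()));                    // true (no message)
      }
    }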

There are two additional OkHttpClient settings that we don't configure: readTimeout and writeTimeout. Both of these default to 10s, and presumably produce SocketTimeoutExceptions similar to the ones we already (attempt to) retry on.
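If we ever wanted to surface those, the wiring would presumably sit next to the existing builder calls; a sketch, not current exporter code (the timeout values are just placeholders):

    import java.time.Duration;
    import okhttp3.OkHttpClient;

    final class OkHttpTimeoutSketch {
      static OkHttpClient client() {
        return new OkHttpClient.Builder()
            .connectTimeout(Duration.ofSeconds(10))
            .callTimeout(Duration.ofSeconds(30))
            // Not currently configured by the exporter; OkHttp defaults both to 10s.
            .readTimeout(Duration.ofSeconds(10))
            .writeTimeout(Duration.ofSeconds(10))
            .build();
      }
    }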

This still doesn't address the java.io.InterruptedIOException: timeout exception you've included. I wonder if you have any more of that stack trace to include? But either way, I'm inclined to again expand the general RetryInterceptor#isRetryableException to include this as well since it seems like another variation of the types of timeout exceptions we're trying to retry on.

> java.net.UnknownHostException: xxxxxx.com

This is the last one that's unaddressed. This exception is thrown when DNS lookup fails for the given host. I know that the Java runtime caches DNS results, but I wasn't sure what it does with DNS lookup failures. I did some searching and found that the negative DNS cache TTL is controlled by a property called networkaddress.cache.negative.ttl, which defaults to 10 seconds.

This indicates that we should in fact retry when UnknownHostException occurs, because there's a chance that the error is transient and the next attempt succeeds.
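For reference, the negative cache TTL is a JDK security property; a small sketch for inspecting it (or, commented out, overriding it). This is orthogonal to the exporter change and only illustrates why a later attempt can succeed:

    import java.security.Security;

    final class NegativeDnsTtl {
      public static void main(String[] args) {
        // null/absent means the JDK default of 10 seconds.
        String ttl = Security.getProperty("networkaddress.cache.negative.ttl");
        System.out.println("networkaddress.cache.negative.ttl = " + ttl);
        // Security.setProperty("networkaddress.cache.negative.ttl", "0"); // disable negative caching
      }
    }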

> I don't mind putting all of that into the "static" definition, but the reason I want it to be configurable is the ability to apply a quick fix when a new retryable exception is discovered.

This is a good point. One downside I can think of to adding the proposed RetryPolicyBuilder#setRetryExceptionPredicate method is that it may lead users to think that they have control over other aspects of retry, like which HTTP/gRPC status codes are retryable. There's a spec issue open about this, so it's possible that we make this configurable in the future. I wonder if we would want a holistic approach to configuring retry conditions, giving the user the ability to choose based on both exception and status code, or whether we'd stick with distinct configuration options.

Development

Successfully merging this pull request may close these issues.

Data loss if issues on TCP protocol layer or failures on network link. Retry policy is ignored