Retry Policies in Istio

The default retry configuration on Istio is not safe, and can retry requests you may not want retrying!

A short and sweet post to share my thoughts around Retry Policies in Istio. I'm going to keep it short: In my opinion, the default retry policy is not safe.

You can see here that the mesh default retry policy is connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes. The contentious part for me is retriable-status-codes, which combined with the status code configuration of http.StatusServiceUnavailable means that, by default, Istio will retry any 503 error. Even one you decide to return from your service.

Why isn't it safe?

If you have read my post on 503s, you will know that they're a bit of a catch-all response. There are countless threads of Istio users talking about 503 errors on the net. My belief is that the above configuration is there to put a bit of a plaster on that problem.

However, as a result, requests that might have been processed by your destination workload could get retried by this policy. A 503 does not deterministically describe "a request that never got to the destination"; in a whole bunch of places it describes "a request that Envoy never got a response for". As an example, the connection might have died for some reason whilst the destination was processing the request, and Istio maps that to a 503. The source sidecar will then, by default, send it again. What if that was a payment request? You've charged a customer twice.

My personal belief is that the only retrying Istio should do by default is when it knows the request was not sent: TCP-level errors, connection resets before the request was sent, and so on. There is a thread here with a proposal to implement a reset-before-request retry condition, which would be a lot safer than the existing retry. The decision to retry a request that received a status code response needs to live with the source & destination application owners, because only they know the domain and whether it is safe to retry.

What should you do?

Change your default mesh configuration to remove retriable-status-codes. This is done in your Istio configuration:

      attempts: 3
      retryOn: connect-failure,refused-stream,unavailable,cancelled,resource-exhausted
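
For context, here is a minimal sketch of where that fragment might sit, assuming you install Istio with an IstioOperator resource. The resource name is hypothetical. If you install with Helm or edit the mesh config directly, the same defaultHttpRetryPolicy block should live under your mesh config there instead:

    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    metadata:
      name: example-install   # hypothetical name
    spec:
      meshConfig:
        # Mesh-wide default, used by routes that do not define their own retries
        defaultHttpRetryPolicy:
          attempts: 3
          # Same list as above: retriable-status-codes is deliberately absent
          retryOn: connect-failure,refused-stream,unavailable,cancelled,resource-exhausted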

Then implement retries for 503s either in your source service, or per destination using the retries block of a VirtualService route (Istio configures retries there, not on a DestinationRule). That decision of "is it safe" needs to be per-service, not per-cluster.
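
As a rough illustration of opting a single destination back in, Istio lets you set retries per route on a VirtualService. The sketch below assumes a hypothetical, idempotent service called catalog in the default namespace. It also assumes your Istio version accepts a numeric status code in retryOn, so check the VirtualService reference for your release before relying on it:

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata:
      name: catalog-retries                    # hypothetical
      namespace: default
    spec:
      hosts:
      - catalog.default.svc.cluster.local      # hypothetical
      http:
      - route:
        - destination:
            host: catalog.default.svc.cluster.local
        retries:
          attempts: 2
          perTryTimeout: 2s
          # 503 is retried here only because the service owners decided it is safe
          retryOn: connect-failure,refused-stream,503

A destination where a duplicate request is harmful, such as the payment example earlier, would simply not get a block like this and would fall back to the stricter mesh default.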

Even if the Istio defaults change in the future, I'm a big believer in being Explicit rather than Implicit. Pin down the retry behaviour you want now to ensure it doesn't change under your feet in a future upgrade.

If you do this, you'll likely see more 503 errors. Read my other post, as 99% of the time these are simply mismatched keepAlive configuration between Envoy and your app. If these annoy you, +1 my issue here, which talks about implementing a default solution to that in a safe way.