Time-outs in a chain
Setting your timeout values in a chain is a Goldilocks problem: you need to find just the right settings that work for your chain. I will go through the 3 situations that I can think of. The first 2 are not good and the third is the best solution.
The Chain
For this explanation, I will be using the following chain: a consumer and 4 APIs, the last of which calls a database.
The consumer calls API 1. API 1 calls API 2 and, after API 2 has answered, API 1 calls API 3, which in turn calls API 4. After API 4 gets its info from the database, it answers API 3, API 3 answers API 1, and API 1 answers the consumer of the chain. In a normally working chain this should take, depending on what the APIs do, 500-700 milliseconds.
The consumer has said it wants an answer within 1 second and will retry 3 times before giving an error back to its users if by then it still has not worked.
It is also important to know that we design our chain for 1 failure: we assume that 2 APIs never fail at the same time.
Problem Situation 1
We want our APIs to retry as well, and we want them to retry 3 times. This would mean the following. API 1 has 1 second to answer the consumer. It has 2 calls to make, so we could give each of them 500 ms. Because we want to retry, we then have to divide the 500 ms by 3, which gives a timeout value of 166 ms. API 2 and API 3 then each have to answer within 166 ms. If API 3 also wants to retry 3 times, the timeout for its call to API 4 becomes 55 ms. API 4 then has 55 ms to answer API 3, and if it still has to do a database call with retries, this becomes impossibly fast: a timeout of 18 ms for the call from API 4 to the database.
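The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation: the helper name split_budget is made up for this illustration, and it rounds down to whole milliseconds the same way the numbers in the text do.

```python
def split_budget(budget_ms: int, sequential_calls: int, attempts: int) -> int:
    """Per-attempt timeout when a time budget is shared by a number of
    sequential calls and every call may be attempted `attempts` times."""
    return budget_ms // sequential_calls // attempts


consumer_timeout_ms = 1000
attempts = 3

# API 1 makes 2 calls (to API 2 and to API 3) and tries each one 3 times.
api1_call_timeout = split_budget(consumer_timeout_ms, 2, attempts)
# API 3 makes 1 call (to API 4) inside that window, again with 3 attempts.
api3_call_timeout = split_budget(api1_call_timeout, 1, attempts)
# API 4 makes 1 database call inside its window, again with 3 attempts.
api4_db_timeout = split_budget(api3_call_timeout, 1, attempts)

print(api1_call_timeout, api3_call_timeout, api4_db_timeout)  # 166 55 18
```

Every extra layer that retries divides the budget by 3 again, which is why the timeout at the bottom of the chain ends up so small.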
This brings an interesting situation to light: what if the database sometimes answers in 50 ms and we set up the chain as described above? API 4 will start retrying against the database, and the load on the database will become 3 times as high as expected under normal load.
We also ignored our rule of designing for just 1 failure.
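To put a number on the extra database load, here is a small sketch. The function name worst_case_db_calls is hypothetical, and the second case rests on an assumption that goes one step further than the text: that API 3 also times out and retries its call to API 4, so the retries multiply.

```python
def worst_case_db_calls(attempts_per_level: int, retrying_levels: int) -> int:
    """Database calls caused by one request when every attempt at every
    retrying level times out and is tried again."""
    return attempts_per_level ** retrying_levels


# Only API 4 retries against the database: the 3 times load from the text.
print(worst_case_db_calls(3, 1))  # 3

# Assumption: API 3 also gives up on API 4 and retries it 3 times.
print(worst_case_db_calls(3, 2))  # 9
```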
Problem Situation 2
Let’s say we have seen that in normal operation the P99 of all of our APIs is 400 ms. It then seems logical to set all timeouts to 400 ms, because then we would answer our consumer on time 99% of the time. That is not correct: in a chain of 3 calls you will only answer the consumer within the timeframe about 97% of the time. But we will ignore that for the moment. If we now also include the retries, you will get a chain like this.
What happens when an API starts failing? Let’s say API 2 starts failing and only answers after 2 retries. That can take up to 1.2 seconds (3 attempts of up to 400 ms each), which means that the consumer will already be retrying while we still continue with the original request. So, the complete chain will get 3 times the load that was expected.
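The two numbers in this situation, the roughly 97% and the 1.2 seconds, can be checked quickly. The sketch assumes that the latencies of the 3 calls are independent of each other, which the text does not state, and that a failing attempt uses its full 400 ms timeout.

```python
p99_timeout_ms = 400
calls_in_chain = 3
attempts = 3  # the first try plus 2 retries
consumer_timeout_ms = 1000

# Chance that all 3 calls stay under their P99 at the same time,
# assuming the calls are independent of each other.
all_within_p99 = 0.99 ** calls_in_chain
print(f"{all_within_p99:.1%}")  # 97.0%

# Upper bound for a call that only succeeds on the third attempt.
slowest_answer_ms = attempts * p99_timeout_ms
print(slowest_answer_ms)  # 1200

# The consumer gives up (and retries) before that answer arrives.
print(slowest_answer_ms > consumer_timeout_ms)  # True
```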
So, setting the timeouts too high or too low can cause problems in the chain. How can you protect yourself from this? We will need to do it completely differently.
We are not going to retry in our chain. Then, if the P99 is 400 ms for both APIs, we could set the timeout in API 1 to 600 ms, because we only have to account for 1 slow API at a time. But if both calls now take 590 ms, we do not make the 1 second that the consumer wants. So, we need to introduce a new control: we need to program our APIs to stop processing when the waiting/timeout time of the consumer is over. API 1 will have to stop processing within 1 second and give the consumer an error message back. We have to feed this through the complete chain, and we will get something like this:
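A minimal sketch of that control in Python, under some loud assumptions: the clock is simulated so the example runs instantly, each downstream call is reduced to a single latency number, and all names (Clock, CallAborted, call, api1) are invented for this illustration. In a real chain the deadline, or the time that is left, would have to travel with the request to API 3 and API 4 as well, so that they can stop in the same way.

```python
class CallAborted(Exception):
    """A downstream call was cut off by a timeout or by the consumer deadline."""


class Clock:
    """Simulated clock in milliseconds, so the sketch is deterministic."""

    def __init__(self) -> None:
        self.now_ms = 0


def call(clock: Clock, deadline_ms: int, timeout_ms: int,
         latency_ms: int, name: str) -> None:
    """Simulate one call that needs latency_ms, limited by the caller's own
    timeout and by the time left before the consumer stops waiting."""
    remaining_ms = deadline_ms - clock.now_ms
    if remaining_ms <= 0:
        raise CallAborted(f"no time left to call {name}")
    wait_ms = min(timeout_ms, remaining_ms)
    if latency_ms > wait_ms:
        clock.now_ms += wait_ms  # we waited as long as we were allowed to
        raise CallAborted(f"{name} did not answer within {wait_ms} ms")
    clock.now_ms += latency_ms


def api1(clock: Clock, api2_latency_ms: int, api3_latency_ms: int,
         consumer_timeout_ms: int = 1000, call_timeout_ms: int = 600) -> str:
    """API 1: no retries, 2 sequential calls, one shared consumer deadline."""
    deadline_ms = clock.now_ms + consumer_timeout_ms
    try:
        call(clock, deadline_ms, call_timeout_ms, api2_latency_ms, "API 2")
        call(clock, deadline_ms, call_timeout_ms, api3_latency_ms, "API 3")
    except CallAborted as exc:
        return f"error: {exc}"
    return "ok"


clock = Clock()
# Both calls take 590 ms: without the deadline API 1 would answer after
# 1180 ms, with it API 1 gives up when the consumer's 1000 ms are over.
print(api1(clock, 590, 590), "/ stopped at", clock.now_ms, "ms")
```

Taking the smaller of the own timeout and the remaining time is the design choice that matters here: the second call only gets the 410 ms that are still left, not its full 600 ms.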
Conclusion
In a chain, getting the timeouts correct takes a lot of thinking and a lot of compromising. There are always good and bad things about every solution. But here are some pointers:
• Think about how many failures / slow APIs you have to design for.
• Try not to retry your own applications in the chain; this will make it a lot easier.
• Stopping / breaking off a transaction is always better than breaking the timeout value of your consumer.