Idempotency
Making Systems Correct Under Retries and Concurrency

When you read about idempotency the first time, it sounds reasonably precise and simple. The definition is short: performing the same operation multiple times should have the same effect as performing it once. On paper, it feels like a solid guarantee. If something can be retried safely, then retries stop being scary.
But the more you try to reason through real scenarios, the harder it becomes to hold that definition true. Idempotency only describes a property a system should have — not the conditions, constraints, or consequences required to actually maintain it.
To understand why, it helps to start with a…
Very ordinary situation
Assume a user added some pizzas in their food delivery app and on the checkout screen, they clicked pay. Now the client, the user’s mobile app, waits for a response — 30 seconds, 1 minute, 2 minutes — but nothing comes back. Maybe the server was slow. Maybe the response was lost. From the client’s point of view, there’s no reliable way to know what happened. The request might have failed, or it might have succeeded and the response just never arrived.
At this point, retrying is not a design choice but a necessity. Waiting forever isn’t acceptable, and assuming failure is risky. So the client retries the request.
Now imagine you’re responsible for handling that request on the server. The obvious concern is duplication. You don’t want the same user to be charged twice. A natural first idea is to check whether the action has already been performed. If it has, do nothing. If it hasn’t, proceed.
That logic feels careful. It also aligns nicely with the intuitive definition of idempotency. If the operation already happened, don’t repeat it.
You will see if two requests have same content like same pizzas from the same customer — if so, mark it as a duplicate and ignore it. But what if the customer deliberately ordered the same pizza two times, they may have ordered lesser amount the first and ordered again - or even maybe, one of the user’s friends friends also uses the same account and ordered from their device again.
So checking the request content can’t be a reliable way. What if you check the device IP or some other property to check if it’s a duplicate. You can hash the user id, device IP and send it in the headers and check on your server. But sending the IP and storing it on your servers to check may not be a very best practice.
So you decide to just generate a UUID and send that in the headers and on your server you store it in a cache and whenever a request arrives, you will first check your cache to see if the request was already received. And you call it…
An Idempotency-Key
If requests were processed strictly one at a time, this approach would be enough. The first request would cache the key and create the payment. Any retry arriving afterward would see that the the key exists and exit early.
But you are clever enough to know that it’s not practical. You do horizontal scaling and storing state in a server locally is a bad idea so you use a Redis cluster. Now even if the request first lands on an unfortunately slow server and the retry falls on a faster one, the second one will first check the Redis and know that the request was already received and return a generic response, say, “your request is being processed”.
But then you recall that that your Redis cluster prioritizes availability and performance and only offers asynchronous consistency. And you immediately recognize that it is a problem here.
What if the second server checked for the key in a Redis node that hasn’t been updated yet due to a lag in replication?
Forget Redis, even in the case of a strongly consistent database, if a request and its retry was processed concurrently at the same time, if you first check and then make an entry for idempotency key, without it being an atomic transaction, you are risking duplication.
Both your servers think the system received the request for the first time, both assume it’s safe to proceed and both start processing the request.
Check-then-act no longer works
Our “check then act” logic no longer works when both the request are processed parallelly.
Even if the cache was replicated almost instantly, just in case the retry arrives an hour later and by that time the cache was already cleared, the retry succeeds the idempotency key check and can be processed again.
Nothing exotic had to happen for this to happen. There was no bug in the code. There was no unexpected input. The failure came, purely, from assuming that the state observed during the check would be consistent enough or even long enough for the action to be safe.
The logic was correct in isolation, but unsafe in the presence of concurrency and network delays.
In this setup, ‘check before I act’ stopped being enough. Idempotency has to mean something stronger. So…
What exactly is idempotency?
A more useful way to think about idempotency is in terms of final state. An operation is idempotent if executing it multiple times — whether sequentially or concurrently — leads to the same final state as executing it once. The focus shifts away from how many times the code runs and towards what the system looks like afterward.
Once you think in terms of state, another complication becomes obvious. The system’s state is not limited to database rows. Sending an email changes the world. Sending an SMS changes the world. Calling an external payment provider changes the world. These effects often live outside transactional boundaries and cannot be undone.
A system might successfully prevent duplicate database records and still send two emails. From the database’s perspective, everything is consistent. From a user’s perspective, something is clearly wrong. This makes it clear that idempotency can’t stop at data storage. It has to account for side effects as well.
Now this doesn’t negate the use of idempotency keys. While idempotency keys can recognize duplicates, they aren’t the complete solutions either.
An idempotency key can only protect what is gated behind it. If some side effects happen outside that protection, retries can still cause duplication in places that matter.
At this point, a pattern becomes hard to ignore. A fragile solution relies on checking state and hoping it doesn’t change at the wrong moment. A robust solution seems to require something stronger — a way to ensure that certain states cannot exist at the same time, no matter how requests interleave. So, is there…
A potential solution?
One pattern I could find, that seemed to handle this better, is gating every sensitive action to prevent duplicates. We can do our check-then-act at the first boundary, but then even in the subsequent parts of the system, we ensure that duplication is not possible.
For example, in this case, generating an order id even before the request hits the server or generating an order id consistently for a request and its retries somehow — so a request and its retry must have the same order id and when both try to create a record in a strongly consistent database, one of them will inevitably fail due to a collision in the order id.
Depending on the system, this may or may not be worth the complexity. You may need your own tailor-made solution.
But wait…
What if the instance that is processing the request at a later part of the process crashes? The retries won’t be triggering the request again and the request has already failed. So instead what we can do is, store the progress and position along with the key.
A retry should recheck; not re-execute
Instead of treating a request as a single, all-or-nothing action, we should start treating it as a process with memory.
The idempotency key can no longer represent just “this request has been seen before.” It should represent where the system currently is with respect to that request.
When the request arrives for the first time, the system creates an entry associated with the idempotency key. But instead of only recording that the request exists, it also records progress. A status. A position. A checkpoint.
Something like:
received
payment initiated
payment confirmed
order created
notification sent
The exact steps don’t matter. What matters is that the system now has a way to answer a much more useful question than “have I seen this request?”
It can answer: “How far did I get last time?”
The server successfully initiates the payment, but crashes before sending a response. The client retries. The retry carries the same idempotency key. This time, instead of blindly starting from the beginning or blindly rejecting the request, the system looks up the key and sees that the process is already partway through.
If payment was already initiated, the system doesn’t initiate it again.
If the order record was already created, it doesn’t create another one.
If a notification was already sent, it doesn’t send it again.
Retries should not be treated as duplicates that must be blocked at the gate but as re-entries into a workflow. The same key should lead to the same execution path, starting from wherever the system last stopped.
Yes, this is difficult. This means different parts of your system, potentially handled by different teams, must be in sync about how they are handling idempotency. But this may be needed for the problem you are solving.
This is where a subtle but important distinction appears.
Idempotency is not about preventing work from happening more than once. It’s about preventing observable effects from happening more than once.
The system may execute code multiple times. It may retry steps. It may re-run logic after crashes. None of that violates idempotency as long as the outside world cannot observe duplicated outcomes.
“Performing the same operation multiple times has the same effect as performing it once” is technically correct — but only if we’re clear about what effect means. It doesn’t mean “the code ran once.” It means “the world ended up in the same state.”
An important observation
At this point, something uncomfortable becomes hard to ignore.
If idempotency requires memory, progress tracking, guarded side effects, and careful coordination across boundaries, then it’s no longer just a retry optimization. It’s a systemic property — one that depends on how work is claimed, how progress is recorded, and how effects are exposed to the outside world.
And that raises a deeper question.
If we have to tolerate retries, crashes, partial execution, and re-entries at every step — how can you be sure that something happened exactly once?
That’s a discussion for another article.



