Skip to main content

Command Palette

Search for a command to run...

Why Distributed Systems Can't Know What's Happening

Uncertainty, Timeouts, Retries and Side Effects

Updated
9 min read
Why Distributed Systems Can't Know What's Happening

Background

When we write programs, we have a quiet assumption that goes unnoticed until it gets explicitly questioned — when you send a request, it either succeeds or fails; even when you query a database, it either succeeds or fails. It always holds true when you are developing programs locally — because a local function either returns a value or throws an error, your local database either connects and queries or throws an error, your local API call reaches your program or fails.

Nobody can notice this on local runs because everything runs on same machine — your program, your database, your cache etc. There is no ambiguity about whether an operation ran or not. The program is the authority on its own actions. You get immediate feedback.

The moment network enters the picture, it no longer holds true — your program runs somewhere, database runs somewhere, cache runs somewhere. They communicate through network calls.

Two Generals Problem

There is a well-known thought experiment in distributed systems called the Two Generals Problem. It describes two armies positioned on opposite hills. They want to coordinate an attack, but the only way they can communicate is by sending messengers through a valley where their common enemy’s establishment is positioned. The only way they can communicate is by sending messages, that have to travel through the valley. If first general sends a message proposing an attack time, the message can be captured in the valley by the enemy — so the second general does not receive the message. Or if the second general receives the message and send an acknowledgement, which again has to pass through the valley — it can again get captured by the enemy — So the first general does not know if their message was delivered at all. So the first general sends another message for confirmation. Now the problem repeats — that confirmation might also be lost. And so on. Either ways, there is no way both the generals can know whether their messages were delivered or not.

The core insight of the problem is not tactical or military. It is about certainty. No matter how many messages are exchanged, neither general can ever be completely certain that the other side knows what it knows. At some point, a decision must be made without certainty.

Now, the generals are two systems and the valley is the network. This is not a hypothetical edge case or a missing protocol. It is a fundamental limitation of systems that communicate through unreliable channels. Once messages can be delayed or dropped, certainty becomes impossible.

Every networked system lives inside this constraint.

The Two Generals Problem does not tell us how to fix this situation. It tells us that it cannot be fixed.

Systems that communicate over networks must operate under permanent uncertainty.

Latency is uncertainty

Different components of a distributed system communicate through messages; but when a program sends a message to another program asking it to do something, certainty disappears. The sender no longer has a direct visibility into what happened on the other side. It can only wait for a response, and waiting introduces latency.

Latency, in distributed systems, is not just delay. It is uncertainty.

The issue is not that responses are slow; but that client does not know whether it is “slow” or “never.” When a request is sent across a network, silence does not mean failure, but also not success.

Silence simply means the system does not know.

Timeout is a guess

Timeouts are how systems cope with this uncertainty — you mark a request failed, if it didn’t respond in time — but timeouts are not facts, they are guesses. A timeout does not tell you that an operation failed; it tells you that you waited long enough.

From the client’s perspective, why it didn’t receive response doesn’t matter — it just didn’t receive the response in time, so it thinks the request might have failed; that’s the only thing that matters to it.

Logs are leading and misleading

If a system can sometimes not know what happened, partial failures stop being edge cases and starts looking like default conditions.

A system can be half broken in ways that are invisible from any single point of view. A request can reach the server, be processed fully, and commit changes to the database, while the response packet is dropped on the way back to the client. Or processing happened in time but the response might arrive just a millisecond later. But that doesn’t matter to either of the components.

From the server’s logs, the request succeeded. From the client’s logs, it failed. Both logs are accurate. Neither tells the whole story.

This is why logs feel leading and misleading at the same time. Logs from a single process describe what it believes happened. They do not describe what actually happened globally, because they don’t have global visibility. In a distributed system, every component tells a local story, and those stories can contradict each other without any of them being wrong.

Important Note:

This is where observability and tracing come into the picture; to observe the lifecycle of a request and get the whole picture — but that’s a story for another time.

Also, while global logs might give you the visibility of what happened to a request — it is — after the fact; you notice an issue only when the problem has already started or a user has experienced it; not during decision time. It cannot help prevent problems, just diagnose what happened.

Forced to decide in uncertainty

Can’t we just wait longer? No. How do you decide how long a program should wait?

Systems cannot wait forever. Users are on the other side of these systems, and users have expectations. A user clicking a button expects something to happen within seconds, not minutes.

These expectations force systems to make decisions while living in uncertainty.

So distributed systems are not allowed to wait until they know the truth. They must act while still uncertain. They must decide based on incomplete information.

It is a fundamental constraint of systems that communicate over networks. Once messages can be delayed, reordered, or dropped, certainty disappears. The system must move forward anyway.

Retries are not optional

Users expect progress. Products promise availability. When uncertainty appears, the system must choose an action, and the most common action it chooses is to try again.

Retries exist not because engineers love complexity, but because users hate waiting. A slow system feels broken even when it is technically correct. From a user’s perspective, a retry that succeeds is indistinguishable from a system that worked the first time. From a technical perspective, retries smooth over temporary failures.

What makes retries especially dangerous is that they rarely belong to a single place in the system. Even if you decide not to retry in your applications, something else almost certainly will. Browsers retry requests when connections are dropped. SDKs retry when you switch networks. Load balancers retry when upstream appears slow. Message queues redeliver messages when acknowledgements are delayed.

So you cannot fully control retries. You should design for them whether you like it or not.

A retry is not a repeat; it is a correctness problem

One would often think of a retry as a repeat of the same action, as if the system tried again under identical conditions. That is not what a retry is. A retry is not a replay. It is a new attempt after some time, under different load, possibly on a different machine, with different activity surrounding it.

Between the original request and the retry, the state may have changed. Caches may have warmed or expired. Locks may have been acquired or released. Other requests may have modified shared state.

This turns retries into a correctness problem.

Once retries exist, the same logical request can arrive more than once. It can arrive twice in sequence because the client timed out. It can arrive concurrently because a load balancer retried while the original request was still executing. It can arrive after partial success because the server crashed after committing data but before responding. It can even arrive out of order, with a later retry processed before the earlier attempt finishes.

From the system’s point of view, these are not edge cases. They are normal conditions.

Side effects

Up to this point, it is easy to think of a request as something that “runs code” and produces a result. But systems don’t just compute values. They change the reality around them. These changes are what we call side effects.

The moment an action changes something outside the process that executed it, you’ve created a side effect. Writing to a database is a side effect because it changes shared state. Publishing a message to another service is a side effect because it causes the other program to do some work. Sending an email is a side effect. Charging a credit card is a side effect.

Effect of retries on state and side effects

The moment a system processes the same request more than once, state becomes vulnerable. Data that was meant to be written once may be written twice. Counters may increment incorrectly. Emails may be sent twice. Money may be charged twice. Workflows may be triggered twice.

The code often reads as if a request has a clear beginning, a clear end, and a single effect on the world. Retries break that narrative. They turn one logical action into multiple physical executions, each capable of producing side effects.

What makes side effects different from state is that they are not reversible by default. State can sometimes be corrected. Side effects often cannot.

State lives inside the system and can sometimes be corrected.

Side effects leak outside the system and require compensation, not correction.

If a database row is written incorrectly, it can be updated later. But many side effects cannot be undone cleanly. An email cannot be unread. An API cannot be uncalled. A payment cannot simply be “taken back” without a compensating action that introduces even more complexity.

The cost of duplication is not evenly distributed across actions.

A retry does not just risk doing the same work twice. It risks affecting the world twice.

So, it becomes impossible to dismiss duplicates as rare edge cases. If retries are inevitable, and retries create multiple execution attempts, then duplicate processing is not an accident. It becomes a property of the system.

Correctness must be a property of the system

Correctness is no longer about success in a single execution. It’s about not producing redundant side effects under repeated execution. That is a much harder guarantee to provide, and it cannot be achieved by hoping retries won’t happen.

Once duplicates are guaranteed, correctness must be designed. It cannot be assumed and cannot be bolted on with logging or monitoring. It must be a property of the system, just like latency or throughput.

Now the real question is no longer whether an operation ran but whether running it more than once is safe.

This is where idempotency becomes unavoidable.