State Management in a System

Suppose you are building a very simple login system. A user sends a request with a username and password. Let us list what the system needs to guarantee.

If the credentials are valid, a session should be created.
After login, all authorized actions by that user should be allowed until the session expires.
If the credentials are not valid, the failed-attempts counter should be incremented.
A user must not appear logged in unless they actually are.
A user must be logged in through only one device at a time.
Logout must reliably revoke access.
No user should gain access due to race conditions.
Login should be fast.
Auth checks should be cheap (they happen on every request).
System must support many concurrent users.
Requests may be retried.
State must not be corrupted by retries or races.

What is State?

When a user authenticates with their credentials, they expect all further requests to be allowed. To make that possible, the system must remember that the user has already logged in, in the form of some data stored somewhere. That remembered data is called state.

What exactly needs to be remembered depends on the problem you are solving.

The most obvious piece of state in this system is the session itself: some representation that the user is authenticated. This can be a session ID or a token, stored in a cache or a database record. But once you look closely, you will find more pieces of state. This system also needs to remember failed login attempts, expiry times, user permissions or roles for RBAC, and so on.

A useful way to recognize state is to ask a simple question: does the correctness of this request depend on something that happened earlier? If it does, you are dealing with state.

Once you see it this way, state appears everywhere.

Local State

Initially, imagine all of this running on a single server. When a user logs in, the server stores the session in memory. When the user makes another request, that same server checks its memory and validates the session. Everything works because there is only one place where truth lives.

Problems begin when we introduce horizontal scaling.

If there are multiple servers, requests can land on any of them. If session state remains stored in the memory of the server that handled the login request, other servers will not see it. This shows the major limitation of local state.

Local, in-memory state only works when there is exactly one server handling all requests. Once traffic can land on multiple servers, state must move somewhere shared.
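To make the limitation concrete, here is a minimal sketch of two servers that each keep sessions in their own memory. The `Server` class, the session ID "s1", and the round-robin stand-in for a load balancer are illustrative assumptions, not part of any real framework.

```python
import itertools


class Server:
    """One application server that keeps sessions in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session_id -> username, visible only to this server

    def login(self, session_id, username):
        self.sessions[session_id] = username

    def is_authenticated(self, session_id):
        return session_id in self.sessions


servers = [Server("a"), Server("b")]
round_robin = itertools.cycle(servers)  # hypothetical stand-in for a load balancer

login_server = next(round_robin)  # the login request lands on server "a"
login_server.login("s1", "alice")

next_server = next(round_robin)  # the following request lands on server "b"
seen_by_login_server = login_server.is_authenticated("s1")
seen_by_next_server = next_server.is_authenticated("s1")
```

The server that handled the login recognizes the session, while the other server does not, even though the user did nothing wrong.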

Shared State

So we move session state into a database or a distributed cache. Now every server can read and write the same session data. This fixes the immediate correctness issue, but it introduces a new set of tradeoffs.

Accessing shared state requires network calls instead of memory access. These calls are slower, can fail, and can return outdated information depending on how the storage system works. The system is now relying on an external component to provide a consistent view of state.

Statelessness

At this point, application servers are often described as stateless. What this really means is that servers do not own authoritative state in their own memory or disk. Any server can handle any request because all necessary state lives elsewhere.

Statelessness does not mean that state does not exist. It means the state lives somewhere else: in databases, caches, or other external systems.
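The move from local to shared state can be sketched as follows. `SessionStore` is a hypothetical in-process stand-in for a database or distributed cache; in a real deployment every call to it would be a network call that can be slow or fail.

```python
class SessionStore:
    """Stand-in for a shared database or cache (assumed, for illustration)."""

    def __init__(self):
        self._sessions = {}

    def put(self, session_id, username):
        self._sessions[session_id] = username

    def get(self, session_id):
        return self._sessions.get(session_id)

    def delete(self, session_id):
        self._sessions.pop(session_id, None)


class StatelessServer:
    """Keeps no session data of its own; every check goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, username):
        self.store.put(session_id, username)

    def logout(self, session_id):
        self.store.delete(session_id)

    def is_authenticated(self, session_id):
        return self.store.get(session_id) is not None


store = SessionStore()
server_a = StatelessServer("a", store)
server_b = StatelessServer("b", store)

server_a.login("s1", "alice")
visible_on_b = server_b.is_authenticated("s1")  # any server can validate

server_b.logout("s1")
visible_on_a_after_logout = server_a.is_authenticated("s1")  # logout is seen everywhere
```

Because neither server owns the session, either one can be restarted or replaced without logging anyone out.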

Correctness: Consistency and Order

An important distinction that emerges is reads versus writes. Most requests simply check (that is, read) the state: Is this session valid? Does this user have permission?

Writes happen less frequently: a session is created once when you log in and deleted once when you log out.

This distinction matters because reads happen more often and are also easier to scale. With writes, the question of correctness appears.

Now assume that the state store itself is also spread across multiple servers to maintain availability, just like your login application.

Strong vs Eventual Consistency

Consider a user logging in. A server validates credentials and writes a new session to the database. The database confirms the write. Immediately after, the user sends another request that lands on a different server, which reads the session from the database.

The system must answer a precise question: after a successful login, should other servers be able to observe that session almost immediately? Or is it acceptable if updates are slightly delayed?

If the database guarantees that once it acknowledges a write, all subsequent reads will observe it, then the system provides strong consistency. In this case, the database ensures that the write is fully committed before responding, and that any server reading afterward sees the same result.

In many replicated setups, however, the database acknowledges a write while some replicas may still return stale data for a short period, and the replicas converge only after some time. This is called eventual consistency. The difference between the two lies in how much coordination the database enforces before acknowledging a write.

The guarantee of strong consistency is not free. Internally, the database must decide when a write is considered complete. If data is replicated, the database must decide whether to wait for all replicas, some replicas, or just one before acknowledging success. If one replica is slow or unreachable, the database must decide whether to wait, reject the write, or proceed anyway. These decisions affect both correctness and latency.

Whether eventual consistency is acceptable depends on the system’s requirements. For login and authentication, immediate visibility is often expected. For other types of data, short delays may be acceptable.
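The difference can be illustrated with a toy primary/replica pair. This is only a sketch under simplifying assumptions: `ReplicatedStore`, the `wait_for_replica` flag, and the explicit `replicate()` call are made up for illustration, and real databases expose these choices through their own configuration.

```python
class ReplicatedStore:
    """Toy primary/replica pair. Data reaches the replica only when replicate() runs."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes acknowledged but not yet copied to the replica

    def write(self, key, value, wait_for_replica):
        self.primary[key] = value
        self._pending.append((key, value))
        if wait_for_replica:
            # Strong: acknowledge only after the replica also has the write.
            self.replicate()
        return "ack"

    def replicate(self):
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def read_from_replica(self, key):
        return self.replica.get(key)


# Eventual consistency: the write is acknowledged before the replica has it.
eventual = ReplicatedStore()
eventual.write("s1", "alice", wait_for_replica=False)
stale_read = eventual.read_from_replica("s1")  # session not visible yet
eventual.replicate()  # replication catches up later
later_read = eventual.read_from_replica("s1")

# Strong consistency: the acknowledgement waits for the replica.
strong = ReplicatedStore()
strong.write("s1", "alice", wait_for_replica=True)
fresh_read = strong.read_from_replica("s1")
```

In the eventual case the user could log in successfully and still be rejected by a server that reads from the lagging replica. In the strong case that cannot happen, but the write takes as long as the slowest replica it waits for.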

Order

Now imagine a user clicks login on one device and, almost immediately, clicks “log out of all devices” on another device. These two operations may be handled by different servers and reach the database close together in time. The final state depends entirely on the order in which the database applies these updates.

If the system applies the login write first and the logout write second, the user ends up logged out. If it applies them in the opposite order, the user ends up logged in. Both requests are valid, but the outcome depends on ordering.

In this case, the application servers may not be the ones deciding this order. The database may be. Whichever component is in charge, it must impose a single, consistent sequence on concurrent writes to the same piece of state so that all readers observe the same result.
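A small sketch shows how the final state follows from the applied order. `SessionRecord` is a hypothetical store that assigns each write a sequence number as it arrives; the operation names "login" and "logout" are illustrative.

```python
import itertools


class SessionRecord:
    """Toy store that numbers every write and applies writes in that order."""

    def __init__(self):
        self._sequence = itertools.count(1)
        self.log = []  # (sequence number, operation) in applied order
        self.logged_in = False

    def apply(self, operation):
        sequence_number = next(self._sequence)
        self.log.append((sequence_number, operation))
        self.logged_in = operation == "login"
        return sequence_number


def final_state(operations):
    """Apply the operations in the given order and report whether the user is logged in."""
    record = SessionRecord()
    for operation in operations:
        record.apply(operation)
    return record.logged_in


login_then_logout = final_state(["login", "logout"])
logout_then_login = final_state(["logout", "login"])
```

Both inputs contain the same two valid requests. Only the sequence chosen by the store differs, and because every reader sees the same log, every reader agrees on the outcome.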

Locking

This ordering problem becomes even more visible when multiple requests attempt to modify the same data concurrently. Consider multiple failed login attempts incrementing the same counter at the same time.

If these updates are applied without control, one update may overwrite another, leaving the counter too low. Locking is one way databases prevent this, and optimistic concurrency with version checks is another. With locking, while one update to a record is in progress, any other update to that record must wait or fail, so updates are applied one at a time and in a well-defined order. Regardless of the mechanism, the goal is the same: ensure conflicting updates do not silently overwrite each other.

Here, locking is not something the application servers coordinate among themselves. It is enforced by the database as part of managing shared state correctly.
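The lost update and its fix can be sketched in a few lines. This example uses an in-process `threading.Lock` as a stand-in for the lock a database would take on the record; the `FailedAttempts` class is an assumption made for illustration.

```python
import threading


class FailedAttempts:
    """Failed-login counter for one user."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment_locked(self):
        # Only one updater may read-modify-write at a time.
        with self._lock:
            current = self.count
            self.count = current + 1


# Lost update, written out step by step: both requests read before either writes.
unsafe = FailedAttempts()
read_by_request_a = unsafe.count
read_by_request_b = unsafe.count
unsafe.count = read_by_request_a + 1
unsafe.count = read_by_request_b + 1  # overwrites the first increment
lost_update_result = unsafe.count

# With the lock, concurrent increments are applied one at a time.
safe = FailedAttempts()
threads = [threading.Thread(target=safe.increment_locked) for _ in range(50)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
locked_result = safe.count
```

In the first half, two failed attempts leave the counter at 1. In the second half, all 50 increments are counted because each one waits for the previous one to finish.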

Transactions

Often, a login operation involves multiple related updates. A session is created, failed login counters are reset, and timestamps are updated. If one of these changes succeeds and another fails, the system ends up in an inconsistent state. To prevent this, databases provide transactions, which apply a group of changes together: either all of them take effect, or, if any one fails, the changes already applied are rolled back.
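As a minimal sketch of this all-or-nothing behavior, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, and the `login` function are assumptions chosen for illustration; the failure is forced by reusing a session ID that already exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (username TEXT PRIMARY KEY, "
    "failed_attempts INTEGER NOT NULL, last_login TEXT)"
)
conn.execute(
    "CREATE TABLE sessions (session_id TEXT PRIMARY KEY, username TEXT NOT NULL)"
)
conn.execute("INSERT INTO users VALUES ('alice', 3, NULL)")
conn.commit()


def login(conn, session_id, username, now):
    """Reset the counter, update the timestamp, and create the session, or do none of it."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE users SET failed_attempts = 0, last_login = ? WHERE username = ?",
                (now, username),
            )
            conn.execute(
                "INSERT INTO sessions (session_id, username) VALUES (?, ?)",
                (session_id, username),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def failed_attempts(conn, username):
    row = conn.execute(
        "SELECT failed_attempts FROM users WHERE username = ?", (username,)
    ).fetchone()
    return row[0]


first_login_ok = login(conn, "s1", "alice", "2024-01-01T00:00:00")
attempts_after_first = failed_attempts(conn, "alice")

# Simulate two later failed attempts, then a login whose session insert fails.
conn.execute("UPDATE users SET failed_attempts = 2 WHERE username = 'alice'")
conn.commit()
second_login_ok = login(conn, "s1", "alice", "2024-01-02T00:00:00")  # duplicate session_id
attempts_after_second = failed_attempts(conn, "alice")
```

In the second call the counter reset runs first and the session insert fails afterward, yet the counter still reads 2: the rollback undid the reset along with everything else in the group.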

Failure Depends on Point of View

Suppose a login request reaches the server, the credentials are validated, and a session is created, but the response is lost due to a network issue. The system has no way of knowing what happens to the response once it leaves, so it regards the request as succeeded.

From the system’s point of view:

Session created → Succeeded
Session not created → Failed

From the client’s point of view:

Received success response → Succeeded
Received failed response or didn’t receive one at all → Failed

Here, the same request succeeded from the system’s view and failed from the client’s view at the same time.

Retries

Now the user sends another login request:

If the request is valid and the system blindly creates another session in the database, duplicate sessions will exist.
If the retried request fails, the failed-attempts counter will be incremented even though a valid session already exists.

To avoid this, operations must be designed so that retrying them does not cause incorrect state. This property is called idempotency.

An idempotent operation can be executed multiple times safely. The result is the same as if it were executed once.

In this use case, that means the system must recognize a repeated request when the user retries, and either return the existing session or invalidate it and create a new one.
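One common way to recognize a repeated request is to have the client attach the same identifier to every retry. The sketch below assumes such a client-generated `request_id` (an idempotency key); that parameter, the `LoginService` class, and the in-memory dictionaries are illustrative assumptions, and credential validation is omitted.

```python
import uuid


class LoginService:
    """Login handler that is safe to retry, assuming the client resends the same request_id."""

    def __init__(self):
        self.sessions_by_user = {}  # username -> current session_id
        self.results_by_request = {}  # request_id -> session_id already returned

    def login(self, request_id, username):
        # A retry carries the same request_id, so return the earlier result
        # instead of creating a second session.
        if request_id in self.results_by_request:
            return self.results_by_request[request_id]

        # A genuinely new login replaces any existing session for this user,
        # which also keeps the user on one device at a time.
        session_id = uuid.uuid4().hex
        self.sessions_by_user[username] = session_id
        self.results_by_request[request_id] = session_id
        return session_id


service = LoginService()
first = service.login("req-1", "alice")
retry = service.login("req-1", "alice")  # response was lost, client retries
new_login = service.login("req-2", "alice")  # a separate, later login
```

The retry returns the same session as the first call, so executing it twice has the same effect as executing it once. A login with a different `request_id` is treated as a new request and replaces the old session.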

Considering Transactions

So far, transactions have helped us keep related changes consistent. But transactions have limits.

They work well when:

All data lives in one database
The database can enforce locks
Failures are rare and short-lived

They struggle when:

Data spans multiple systems
Systems fail independently

Imagine your login flow now touches:

A database like Postgres
A cache like Redis
A rate-limiting service
An audit log

If any one of these fails midway, a transaction that spans all of them becomes slow, fragile, or impossible.

So you are forced to choose: either accept that the system can be inconsistent for a while before eventually reaching consistency, or take on the extra design work needed to keep it strongly consistent despite these failures.
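The first option can be sketched with two independent stores. Everything here is a simplified assumption: `db` is a plain dictionary standing in for the database, `Cache` stands in for a cache such as Redis with a flag that simulates an outage, and `repair` represents some background reconciliation step.

```python
class Cache:
    """Stand-in for an external cache; `available` simulates an outage."""

    def __init__(self):
        self.data = {}
        self.available = True

    def set(self, key, value):
        if not self.available:
            raise ConnectionError("cache unreachable")
        self.data[key] = value


def login(db, cache, session_id, username):
    """Write to both stores. No single transaction covers the two writes."""
    db[session_id] = username  # step 1: the database write succeeds
    try:
        cache.set(session_id, username)  # step 2: may fail independently
        return "consistent"
    except ConnectionError:
        return "db_only"  # the two stores now disagree


def repair(db, cache):
    """Reconciliation: copy sessions the cache is missing from the database."""
    for session_id, username in db.items():
        if session_id not in cache.data:
            cache.set(session_id, username)


db, cache = {}, Cache()
cache.available = False  # the cache fails midway through the login flow
outcome = login(db, cache, "s1", "alice")
inconsistent = "s1" in db and "s1" not in cache.data

cache.available = True  # the cache recovers
repair(db, cache)
converged = cache.data.get("s1") == "alice"
```

Between the failed cache write and the repair, the system is inconsistent: the database says the session exists and the cache does not. The design accepts that window instead of blocking the login until both stores agree.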

Performance vs Correctness

Every decision we have discussed so far has a cost:

If you want correctness, meaning:

immediate visibility of writes
strict ordering
transactional guarantees

Then your system must:

Wait for confirmations
Reject or delay requests

This means a login request may block longer, fail more often during outages within the system, or require retries more often — all of which affect user experience.

If you want better performance with:

Faster reads
Faster writes

You have to accept:

Data may sometimes be outdated
Some clients may receive new data while others receive older data

So you are trading off consistency.

What Has Been Hidden So Far

If you look at all the decisions so far (ordering writes, checking the availability of replicas, replicating data across them, locking, transactions), the database has been making decisions on your behalf to keep state correct. This is what we call coordination.

But as systems grow and you stop depending only on a database and start including other components like caches or even other types of databases, your system’s state gets distributed — different parts of state reside in different places.

That means you can no longer depend on any single external component to ensure the correctness of operations.

State is unavoidable, and when state is distributed, you must coordinate between its parts. This is the moment when coordination stops being implicit and becomes explicit.

The question now is not “how do we scale?” anymore but “how do we handle coordination among different components in the system?”
