Caching: Performance vs Consistency

Every computing system, no matter how low level or high level, is constrained by the same fundamental problem: fetching data takes time. Computation itself is fast, but fetching data is slow.

A modern CPU is extraordinarily fast. But what slows a system down is not only processing speed but also access to data. How fast can a system get the data it requires to perform a computation?

Any computation that involves state must fetch that state from somewhere: from registers or the L1/L2/L3 caches at the CPU level, from memory (RAM) at the code level, from disk (SSD/HDD) at the OS level, or from the network at higher levels.

The farther the data is from the place it is needed, the longer you wait: latency goes up and performance goes down.

A way around this is to store copies of the most frequently accessed data, or of data predicted to be accessed soon, as near as possible to where it is used. This is what is called a cache.

To see why caching is so important, look at the physical reality of how long it takes to access data:

Storage Layer	Approximate Latency
CPU Register	~0.3 ns
L1 Cache	~1 ns
L2 Cache	~3–5 ns
L3 Cache	~10–20 ns
RAM	~80–120 ns
SSD	~100 µs
HDD	~5–10 ms
Network	1–100+ ms

Accessing RAM is roughly a hundred times slower than accessing L1 cache. Accessing an SSD is roughly a thousand times slower than accessing RAM. A network call is millions of times slower than a CPU instruction.
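These ratios can be checked against the table with a little arithmetic. The sketch below uses rough midpoints of the listed ranges as representative values; those midpoints, and the helper name `slowdown`, are assumptions made for illustration.

```python
# Representative latencies in nanoseconds, taken as rough midpoints of the
# ranges in the table above (the exact values are assumed for illustration).
LATENCY_NS = {
    "register": 0.3,
    "l1": 1.0,
    "ram": 100.0,
    "ssd": 100_000.0,         # 100 microseconds
    "network": 10_000_000.0,  # 10 milliseconds
}

def slowdown(slower: str, faster: str) -> float:
    """Return how many times slower one layer is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]

print(f"RAM vs L1:           {slowdown('ram', 'l1'):,.0f}x")
print(f"SSD vs RAM:          {slowdown('ssd', 'ram'):,.0f}x")
print(f"Network vs register: {slowdown('network', 'register'):,.0f}x")
```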

For most real-world workloads, memory and I/O latency dominate computation time. A modern CPU can execute millions of instructions in the time it takes to fetch data from across the network.

That is why caches exist.

Why do CPUs cache at all?

If a CPU had to fetch every piece of data directly from RAM, most of its time would be spent idle. To avoid this, CPUs keep caches. These caches work because real programs exhibit patterns.

Programs tend to reuse the same data repeatedly, which is known as temporal locality. For example, the same variable is often accessed repeatedly within a piece of code.

They also tend to access data near other recently accessed data, which is known as spatial locality. When a CPU reads memory, it does not fetch a single byte. It fetches a whole block (a cache line, commonly 64 bytes), on the prediction that nearby data will be needed soon.
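The effect of fetching whole blocks can be illustrated with a toy simulation. This is a sketch of a direct-mapped cache, not a model of any real CPU: the 64-byte line size, the 8-line capacity, and the function name `count_misses` are all assumptions.

```python
LINE_SIZE = 64  # bytes per cache line (a common size, assumed here)

def count_misses(addresses, num_lines=8):
    """Count misses in a toy direct-mapped cache with `num_lines` lines."""
    lines = [None] * num_lines  # each slot remembers which block it holds
    misses = 0
    for addr in addresses:
        block = addr // LINE_SIZE  # the block this byte address belongs to
        slot = block % num_lines   # direct-mapped: one possible slot per block
        if lines[slot] != block:
            misses += 1
            lines[slot] = block    # a miss fetches the whole block
    return misses

# 1024 consecutive byte addresses: neighbours share a block.
sequential = list(range(1024))
# 1024 addresses spaced one block apart: every access needs a new block.
strided = [i * LINE_SIZE for i in range(1024)]

print(count_misses(sequential))  # 16
print(count_misses(strided))     # 1024
```

Both access patterns touch the same number of addresses, but the sequential one misses only once per block because the remaining 63 bytes of each block are already in the cache.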

As programs grew more complex, a single cache was not enough. Multiple layers emerged: L1 closest to the core, L2 slightly farther, and L3 shared across cores. Each level is larger and slower than the previous one. This hierarchy exists for one reason only: to keep frequently used data as close to computation as possible.

This idea does not stop at the CPU.

The same problem appears in software systems

Once you move beyond a single machine, the same pattern appears at a larger scale. Instead of a CPU trying to read from RAM, you now have applications trying to read from databases. Instead of memory access, you have disk reads and network calls.

A request that crosses a network, or a database query that scans a disk, is orders of magnitude slower than a computation done in memory.

So the same idea reappears: keep frequently used data closer to where it is needed.

Databases cache index pages in memory so they don’t have to reread them from disk. Operating systems cache file blocks. Applications cache computed results. Reverse proxies cache HTTP responses. CDNs cache content close to users.

Each of these is the same idea expressed at a different level.

Caching is inherently risky for consistency

Caching literally means creating a copy of data, which means you are inherently taking the risk that the copy may be stale: the original source might get updated while the cache is not, unless that is done explicitly.

A cache, ideally, should not be the source of truth.

This distinction is crucial. To understand this better, consider what happens when data changes.

A request comes in and reads data from a database. The result is cached somewhere. Later, another request modifies the data in the database, but the cache is not updated at the same time. The cached copy is now wrong. Nothing breaks immediately, but the system is now inconsistent.
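The sequence can be reproduced in a few lines. In this sketch, two plain dictionaries stand in for the database and the cache; the key `user:1` and the function `read` are hypothetical names used only for illustration.

```python
db = {"user:1": "Alice"}  # stands in for the database (source of truth)
cache = {}                # stands in for the cache

def read(key):
    """Return the cached value if present, otherwise load it from the db."""
    if key not in cache:
        cache[key] = db[key]
    return cache[key]

first = read("user:1")    # miss: loads "Alice" and caches it
db["user:1"] = "Alicia"   # a later write updates only the database
second = read("user:1")   # hit: still returns the old cached value

print(first, second, db["user:1"])  # Alice Alice Alicia
```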

Caches are often eventually consistent unless designed to provide stronger guarantees. That is, given enough time and no further updates, the stale data will eventually be replaced or discarded.

A quick note: depending on the strategy used, a cache can be eventually or strongly consistent, as you will see soon.

Now a cache doesn’t magically fix itself with time. Somehow, the old data has to be discarded so that new requests will hit the source of truth directly, or it has to be replaced with the newer results. How that happens depends on the caching strategy you choose.

How a cache eventually becomes consistent

TTL

The simplest mechanism is time. Cached data is stored with a time limit, called a Time-To-Live (TTL). Once that time expires, the entry is discarded. The next request fetches fresh data from the source and repopulates the cache.

This is simple but effective. The system accepts a known window of inconsistency in exchange for speed. The TTL varies with the use case, for example 60 seconds, 15 minutes, or 1 hour.

With a finite TTL, the cache is eventually consistent, because stale data always expires sooner or later.
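A minimal sketch of a TTL cache follows. The class name `TTLCache`, its `loader` argument, and the injectable `clock` are assumptions for illustration; the clock is injectable only so that expiry can be shown without actually waiting.

```python
import time

class TTLCache:
    """Toy TTL cache: entries are reloaded from the source once they expire."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now < entry[1]:
            return entry[0]       # still fresh: serve from the cache
        value = loader(key)       # missing or expired: go to the source
        self._store[key] = (value, now + self.ttl)
        return value

# Usage with a fake clock so the staleness window is visible.
source = {"price": 10}
now = [0.0]
prices = TTLCache(ttl_seconds=60, clock=lambda: now[0])

prices.get("price", source.get)          # loads 10 at t=0
source["price"] = 12                     # the source changes
now[0] = 30.0
stale = prices.get("price", source.get)  # within the TTL: still the old value
now[0] = 61.0
fresh = prices.get("price", source.get)  # TTL expired: reloaded from the source
print(stale, fresh)  # 10 12
```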

Explicit Invalidation

More sophisticated systems try to be smarter. When data is updated, the cache entry may be explicitly removed or replaced. This requires the system to know exactly which cache entries are affected by each write.

For example, when a user updates their profile, the service writes to the database and also deletes or updates the corresponding cache entry.

This reduces staleness, but introduces complexity.

The system must now know which cache keys are affected, update or invalidate them reliably, and handle any failures during invalidation. If invalidation fails, stale data persists.
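The profile example can be sketched as follows. Dictionaries stand in for the database and the cache, and the function names and the `profile:<id>` key format are hypothetical choices for illustration.

```python
db = {}     # stands in for the database
cache = {}  # stands in for the cache

def cache_key(user_id):
    return f"profile:{user_id}"

def get_profile(user_id):
    """Read path: serve from the cache, loading from the db on a miss."""
    key = cache_key(user_id)
    if key not in cache:
        cache[key] = db[user_id]
    return cache[key]

def update_profile(user_id, profile):
    """Write path: update the source of truth, then invalidate the cached copy."""
    db[user_id] = profile
    # If this step fails or is skipped, readers keep seeing the old profile.
    cache.pop(cache_key(user_id), None)

db[1] = {"name": "Alice"}
get_profile(1)                         # caches the original profile
update_profile(1, {"name": "Alicia"})  # write plus invalidation
print(get_profile(1))                  # {'name': 'Alicia'}
```

Deleting the entry rather than overwriting it keeps the write path simple: the next read repopulates the cache from the database.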

Eviction Policies

When the cache is full, you have to decide what to keep and what to discard. This is where eviction policies come into the picture. While TTL and invalidation decide when data becomes invalid, eviction policies decide what to remove when the cache is full.

LRU

Least Recently Used (LRU) is a popular strategy that evicts the items that have gone the longest without being accessed.
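A minimal LRU sketch can be built on `collections.OrderedDict`, which remembers insertion order and can move a key to the end. The class name `LRUCache` and its methods are illustrative assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: the least recently accessed key is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # ordered from least to most recently used

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)         # a write counts as a use
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")     # "a" is now more recently used than "b"
lru.put("c", 3)  # over capacity: "b" is evicted
print(lru.get("a"), lru.get("b"), lru.get("c"))  # 1 None 3
```

For caching the results of a pure function, Python's standard library also provides `functools.lru_cache`.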

LFU

Least Frequently Used (LFU) evicts the items that have been accessed the fewest times, rather than the ones accessed least recently.
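A minimal LFU sketch keeps an access counter per key. The class name `LFUCache` is an assumption, and the linear scan used to find the victim is chosen for brevity, not efficiency.

```python
class LFUCache:
    """Toy LFU cache: the key with the fewest accesses is evicted first.

    Assumes capacity >= 1. Ties are broken in favour of evicting the key
    that was inserted earliest.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._values = {}
        self._hits = {}  # key -> number of accesses (reads and writes)

    def get(self, key, default=None):
        if key not in self._values:
            return default
        self._hits[key] += 1
        return self._values[key]

    def put(self, key, value):
        if key not in self._values and len(self._values) >= self.capacity:
            # Linear scan for the least frequently used key.
            victim = min(self._hits, key=self._hits.get)
            del self._values[victim]
            del self._hits[victim]
        self._values[key] = value
        self._hits[key] = self._hits.get(key, 0) + 1

lfu = LFUCache(capacity=2)
lfu.put("a", 1)
lfu.put("b", 2)
lfu.get("a")
lfu.get("a")     # "a" has been accessed more often than "b"
lfu.put("c", 3)  # cache is full: "b" is evicted
print(lfu.get("a"), lfu.get("b"), lfu.get("c"))  # 1 None 3
```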

FIFO

First-In-First-Out (FIFO) evicts whichever entry was cached earliest, regardless of how often or how recently it has been accessed.
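A minimal FIFO sketch differs from LRU in one detail: reads do not change the eviction order. The class name `FIFOCache` is an illustrative assumption.

```python
from collections import OrderedDict

class FIFOCache:
    """Toy FIFO cache: the earliest inserted key is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # ordered by insertion time

    def get(self, key, default=None):
        return self._items.get(key, default)  # reads do not affect ordering

    def put(self, key, value):
        if key not in self._items and len(self._items) >= self.capacity:
            self._items.popitem(last=False)   # evict the oldest insertion
        self._items[key] = value              # updating a key keeps its position

fifo = FIFOCache(capacity=2)
fifo.put("a", 1)
fifo.put("b", 2)
fifo.get("a")     # recently read, but that does not protect it
fifo.put("c", 3)  # cache is full: "a" is evicted because it is the oldest
print(fifo.get("a"), fifo.get("b"), fifo.get("c"))  # None 2 3
```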

Caching Strategies

There are different ways one can choose to cache.

Cache-Aside

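This is the most commonly used strategy. The application checks the cache first. On a miss, it reads from the database, stores the result in the cache, and returns it. On a write, the application updates the database and invalidates or updates the cache entry itself.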
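In cache-aside, the application code itself talks to both the cache and the database: it checks the cache, falls back to the database on a miss, and fills the cache afterwards. A minimal sketch, where `CacheAsideRepo` is a hypothetical name and plain dictionaries stand in for the database and the cache:

```python
class CacheAsideRepo:
    """Toy cache-aside repository: the application manages the cache itself."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache
        self.db_reads = 0  # counts how often the database is actually hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # hit: the database is not touched
        self.db_reads += 1
        value = self.db[key]        # miss: read from the database
        self.cache[key] = value     # populate the cache for later reads
        return value

    def set(self, key, value):
        self.db[key] = value        # write the source of truth
        self.cache.pop(key, None)   # invalidate the cached copy

repo = CacheAsideRepo(db={"a": 1}, cache={})
repo.get("a")
repo.get("a")
print(repo.db_reads)  # 1
repo.set("a", 2)
print(repo.get("a"), repo.db_reads)  # 2 2
```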

Read-Through Cache

Here, the service talks only to the cache, never directly to the DB, and the cache itself loads missing data from the DB. Reads go through the cache, hence the name. The downside is that the cache becomes a critical part of the request path.

A real-world example is Amazon DynamoDB Accelerator (DAX), a cache that sits in front of DynamoDB. It serves results from the cache; if an item is not found, it fetches it from DynamoDB, returns the result, and stores it in the cache for future use.
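The difference from cache-aside is where the loading logic lives. In the sketch below the cache owns a `loader` function and the caller only ever calls `get`. The class name `ReadThroughCache` is hypothetical, and this is not how DAX is implemented or what its API looks like.

```python
class ReadThroughCache:
    """Toy read-through cache: the cache, not the caller, loads missing data."""

    def __init__(self, loader):
        self._loader = loader  # knows how to fetch a value from the backing store
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._loader(key)  # the miss is handled inside the cache
        return self._store[key]

# Usage: the caller never touches the backing store directly.
backing_store = {"item:1": "keyboard"}
loads = []

def load_from_store(key):
    loads.append(key)  # record each trip to the backing store
    return backing_store[key]

items = ReadThroughCache(load_from_store)
items.get("item:1")
items.get("item:1")
print(loads)  # ['item:1']
```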

Write-Through Cache

Here, writes go to the cache first, and the cache is responsible for synchronously forwarding each write to the DB. Since the cache is now part of the write path, its availability directly affects system correctness.
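A minimal write-through sketch follows. The class name `WriteThroughCache` is hypothetical and a dictionary stands in for the DB. One design choice here is an assumption: the DB write happens before the in-memory update, so a failed DB write leaves the cache unchanged.

```python
class WriteThroughCache:
    """Toy write-through cache: every write is forwarded to the db synchronously."""

    def __init__(self, db):
        self._db = db
        self._store = {}

    def put(self, key, value):
        self._db[key] = value     # forwarded before the call returns
        self._store[key] = value  # cache and db now agree

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._db[key]
        return self._store[key]

db = {}
settings = WriteThroughCache(db)
settings.put("theme", "dark")
print(db["theme"], settings.get("theme"))  # dark dark
```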

Write-Behind Cache

Here, writes initially go to the cache only. The DB is updated later, i.e., asynchronously. This makes writes extremely fast but comes with a serious tradeoff: if the process crashes before the database update completes, data can be lost. This trades durability for throughput.
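The durability risk is easy to see in a sketch. Here the asynchronous update is modelled as an explicit `flush()` call so the example stays deterministic; in a real system a background worker or timer would do this. The class name `WriteBehindCache` is hypothetical.

```python
from collections import deque

class WriteBehindCache:
    """Toy write-behind cache: writes are queued and reach the db only on flush."""

    def __init__(self, db):
        self._db = db
        self._store = {}
        self._pending = deque()  # writes not yet persisted

    def put(self, key, value):
        self._store[key] = value            # fast: memory only
        self._pending.append((key, value))  # remembered for later

    def get(self, key):
        if key in self._store:
            return self._store[key]
        return self._db.get(key)

    def flush(self):
        """Persist queued writes in order (normally done in the background)."""
        while self._pending:
            key, value = self._pending.popleft()
            self._db[key] = value

db = {}
counters = WriteBehindCache(db)
counters.put("views", 41)
counters.put("views", 42)
print(counters.get("views"), db)  # 42 {}
counters.flush()
print(db)  # {'views': 42}
```

Until `flush()` runs, the database knows nothing about the writes. A crash at that point loses them.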

Refresh-Ahead

The cache refreshes an entry in the background shortly before its TTL expires, and keeps serving the existing copy while the refresh is in progress. CDNs commonly use this kind of approach. It helps avoid cache stampedes, where a popular item's TTL expires and many requests simultaneously hit the DB, making it run the same query repeatedly until the response is cached again.
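A minimal refresh-ahead sketch, under several assumptions: the class name `RefreshAheadCache`, the `refresh_after` threshold (which must be smaller than `ttl`), and the injectable `clock` and `spawn` hooks are all hypothetical. By default `spawn` runs the refresh on a daemon thread; the usage example swaps in a list so the behaviour is deterministic.

```python
import threading
import time

class RefreshAheadCache:
    """Toy refresh-ahead cache: refreshes entries in the background before expiry."""

    def __init__(self, loader, ttl, refresh_after, clock=time.monotonic, spawn=None):
        self.loader = loader
        self.ttl = ttl
        self.refresh_after = refresh_after  # age at which a refresh is triggered
        self.clock = clock
        self.spawn = spawn or (
            lambda fn: threading.Thread(target=fn, daemon=True).start()
        )
        self._store = {}          # key -> (value, loaded_at)
        self._refreshing = set()  # keys with a refresh already in flight
        self._lock = threading.Lock()

    def _load(self, key):
        try:
            value = self.loader(key)
            with self._lock:
                self._store[key] = (value, self.clock())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key):
        start_refresh = False
        with self._lock:
            entry = self._store.get(key)
            age = None if entry is None else self.clock() - entry[1]
            expired = entry is None or age >= self.ttl
            if not expired and age >= self.refresh_after:
                if key not in self._refreshing:
                    self._refreshing.add(key)  # only one refresh per key
                    start_refresh = True
        if expired:
            return self._load(key)             # blocking load on a true miss
        if start_refresh:
            self.spawn(lambda: self._load(key))
        return entry[0]                        # serve the existing copy

# Deterministic usage: a fake clock, and refresh tasks collected in a list.
source = {"k": 1}
now = [0.0]
tasks = []
cache = RefreshAheadCache(source.get, ttl=60, refresh_after=45,
                          clock=lambda: now[0], spawn=tasks.append)
cache.get("k")          # blocking load at t=0
source["k"] = 2
now[0] = 50.0
early = cache.get("k")  # old value served, one refresh scheduled
tasks.pop()()           # run the scheduled refresh
late = cache.get("k")
print(early, late)  # 1 2
```

Because only one refresh per key is ever in flight, a burst of requests near expiry results in a single trip to the source rather than one per request.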

Tradeoffs

Choice	Benefit	Cost
Long TTL	Very fast reads	Stale data
Short TTL	Fresher data	More DB load
Cache-Aside	Simple and mostly fault-tolerant	Can cause cache stampedes and stale reads if invalidation fails
Read-Through	Low latency	Cache becomes a critical component and consistency depends on write strategy
Write-Through	Strong consistency	Higher latency and cache becomes a critical component
Write-Behind	High throughput and low latency	Risk of data loss
Refresh-Ahead	Helps avoid cache stampedes	Increased complexity with adding background workers and risk of refreshing unnecessary data

Why systems slow down after restarts

When a cache is empty, it is called cold. Every request must go all the way to the database or backend service. Latency is high, and load spikes sharply.

As requests flow through the system, popular data accumulates in the cache. Over time, the cache becomes warm. Requests are served quickly, load drops, and the system stabilizes.

This is why systems often feel slow immediately after deployment or restart. The cache has no memory yet. Large systems often pre-fill caches with known hot data to avoid this cold-start problem. Others accept the temporary slowdown.
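Pre-filling can be as simple as loading a known list of hot keys before the service starts taking traffic. In this sketch, `warm_cache` and the list of hot keys are hypothetical; how a real system decides which keys are hot is outside the scope of the example.

```python
def warm_cache(cache, db, hot_keys):
    """Pre-fill `cache` with the db entries for `hot_keys`; return how many loaded."""
    loaded = 0
    for key in hot_keys:
        if key in db:  # skip keys that no longer exist in the source
            cache[key] = db[key]
            loaded += 1
    return loaded

db = {"home": "<html>...</html>", "top-products": [1, 2, 3], "rare-page": "..."}
cache = {}
count = warm_cache(cache, db, hot_keys=["home", "top-products", "deleted-page"])
print(count, sorted(cache))  # 2 ['home', 'top-products']
```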

How to use a cache effectively?

If a cache refreshes too aggressively, performance collapses. If it holds data too long or invalidation fails, you see outdated results. If many requests miss the cache at once, a cache stampede occurs. And because the cache responds so quickly, you might not even notice poorly performing database queries, which then become a real problem during a stampede.

This is why you should design the system to work as efficiently as possible without a cache in mind, and add the cache only as an additional layer for better performance.

Anything that requires strong correctness, such as financial balances, authorization, or inventory counts, must not be cached. You can sacrifice performance for consistency here. Anything that can tolerate inconsistency, such as a home page feed or user profiles, can be cached safely.

A piece of advice I found: do not start by thinking about what to cache; start by identifying what not to cache.

If you add a cache, your system should be faster but still correct where it matters: don’t cache what must not be cached. And if you remove the cache, your system may get slower but should still be correct: don’t rely on the cache too much.

CDN: Caching at the Edge

Until now, we’ve talked about caching within a system. But the same latency problem exists at a global scale. When users are geographically far from servers, network latency dominates everything else.

For example, a simple website might be hosted in the US but have many users in India. If every request travels across continents to the servers in the US, latency will be high. So Content Delivery Networks (CDNs) cache static assets such as HTML, CSS, JS, and media files at the edge, i.e., near the users, to deliver them faster.

A CDN reduces latency by reducing the physical distance the data has to travel. But you will need to actively invalidate the cache if you want changes to appear immediately.