Scaling & Load Balancing
Horizontal vs vertical scaling and the role of load balancing

If you have a single road, the 100 cars might take 5 minutes on average to cross the road. The throughput here is 20 cars/minute and latency is 5 minutes.
But if you have 5 roads, they might take only 2 minutes on average. So the throughput increased to 50 cars/minute and latency reduced to 2 minutes.
Adding roads both increases throughput and reduces latency because it increases capacity.
We ended the Latency vs Throughput article with this example. It is a bit simplified, as you will soon notice. For now, you can interpret it in one of two ways:
Adding more hardware, such as CPU cores or memory, to a single server, so you get more and faster queues to process the requests. Or,
Adding more identical servers, so if a server is already under load, another server will take up new requests.
The bottom line: scaling increases system capacity, so under load you get lower latency.
This is the core idea of scaling. Whichever scaling method you choose, you are trying to widen the bottleneck to reduce contention. In vertical scaling, you make a single server bigger so it can handle more requests; in horizontal scaling, you add more servers and distribute the requests among them.
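To make the idea of widening the bottleneck concrete, here is a toy queue model. It is a sketch under simplifying assumptions that are not part of the road example: all jobs arrive at once, every job takes the same time, and jobs are dealt out evenly to identical workers. The name `mean_latency` and its parameters are made up for illustration.

```python
def mean_latency(jobs: int, workers: int, service_time: float) -> float:
    """Average completion time for a burst of `jobs` that all arrive at t=0.

    Jobs are dealt out one by one to `workers` identical FIFO queues, so
    job i (0-based) is number i // workers in its queue and finishes at
    (i // workers + 1) * service_time.
    """
    if jobs <= 0 or workers <= 0:
        raise ValueError("jobs and workers must be positive")
    total = sum((i // workers + 1) * service_time for i in range(jobs))
    return total / jobs


if __name__ == "__main__":
    for workers in (1, 5):
        latency = mean_latency(jobs=100, workers=workers, service_time=0.1)
        print(f"{workers} worker(s): mean latency {latency:.2f}")
```

In this idealized model, going from one worker to five cuts the mean latency roughly fivefold. Real systems rarely scale this cleanly.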
What Vertical Scaling Is
You generally improve your server in one of the following ways, with the core idea being to reduce latency and increase throughput:
If your application is CPU heavy, you add CPU cores so that more incoming requests can execute in parallel.
You might increase the RAM if processing more requests needs more memory than what’s available.
If it is disk I/O heavy, you upgrade to faster SSDs to reduce request processing time.
Benefits
Since only one server will be communicating with your databases, you will likely have fewer data inconsistency issues.
Since this is the only server serving the requests, you can often simplify state management by keeping state local.
By using an in-memory cache and local disk for storage, you avoid the comparatively slower network calls to cache and database servers.
Tradeoffs and Limits
Obviously, you have a single point of failure. If this system goes down, all the incoming requests will fail.
As you add more processes and threads, you have to be careful with in-memory locking and database atomicity to prevent race conditions.
You can scale hardware only so much before hitting either a physical limit or a budget limit. As you add more RAM, more disk, or a better CPU, your costs start increasing non-linearly.
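The race-condition point can be illustrated with a shared counter. This is a minimal sketch using Python's standard `threading.Lock`; the `Counter` class and `count_in_parallel` helper are hypothetical names chosen for the example, and a real service would be protecting something more interesting than an integer.

```python
import threading


class Counter:
    """A shared counter whose read-modify-write is protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value and
        # one of the increments would be lost.
        with self._lock:
            self.value += 1


def count_in_parallel(n_threads: int, per_thread: int) -> int:
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, `count_in_parallel(4, 1000)` always returns 4000. The cost is that threads now queue for the lock, which is itself a small serial path.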
Essentially, a bigger machine does not eliminate contention. Instead, you take on the unsolved risk of a single point of failure, the need to test the code robustly to prevent race conditions, and, above all, the ever-growing cost of raising physical limits to keep up with more and more requests. Beyond a point, vertical scaling hits diminishing returns due to shared resources and serial execution paths.
This makes horizontal scaling not a choice, but a necessity.
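The diminishing returns from serial execution paths are commonly modeled with Amdahl's law, which this article does not name but which fits the point above. The sketch assumes that a fixed fraction of the work cannot be parallelized; `serial_fraction` is an illustrative input, not a measured value.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup from `cores` when `serial_fraction` of the
    work cannot run in parallel (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= serial_fraction <= 1 and cores >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    for cores in (2, 8, 64):
        print(cores, round(amdahl_speedup(0.1, cores), 2))
```

With 10% serial work, the speedup can never exceed 10x no matter how many cores you add, so each extra core buys less than the one before.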
What Horizontal Scaling Is
Instead of continuing to bulk up one server, you add more regular servers to handle more requests. So instead of one big queue, you have many smaller queues. This lowers the latency per node and increases overall throughput.
Now, in order to distribute the requests between these servers, you will use a load balancer.
Load Balancer
Since you have multiple servers and each of them has its own IP address, how will you choose which IP to map your DNS domain to?
Simple. Add another server in front of them and map its IP. When requests hit this mediator server, it routes them to the servers that actually serve the requests, gets the response, and sends it back. This mediator is called the load balancer because it balances the load among the servers by routing the requests.
But how will you route the requests?
Will you pick one of the servers at random? Or maybe you send the first request to server 1, the next one to server 2, the next one to server 3, and so on. This strategy is called Round Robin. Or maybe you keep track of how many requests each server is currently serving and route new requests to the one with the fewest. This strategy is called Least Connections. Each strategy balances the load in a different way.
Apart from this, the load balancer also keeps track of the health status of each server, typically by calling a health API endpoint that you expose, so that it routes requests only to the healthy ones.
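The two strategies and the health filter can be sketched in a few lines. This is an illustrative model, not how any particular load balancer is implemented: `Server`, `RoundRobin`, and `LeastConnections` are hypothetical names, and the `healthy` flag is assumed to be updated elsewhere by a periodic health check.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be set by a separate health checker
    active: int = 0       # number of in-flight requests


class RoundRobin:
    """Hand out servers in a fixed rotation, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def pick(self):
        # Look at each server at most once per call.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server.healthy:
                return server
        raise RuntimeError("no healthy servers")


class LeastConnections:
    """Pick the healthy server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.servers = servers

    def pick(self):
        healthy = [s for s in self.servers if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy servers")
        return min(healthy, key=lambda s: s.active)
```

A caller would increment `active` when it forwards a request to the picked server and decrement it when the response comes back.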
Benefits
If one of your servers fails, your load balancer will just route requests to another server, giving your service little to no downtime.
Theoretically, you don’t have the same physical limits as vertical scaling. You can add as many nodes as needed to handle more and more requests.
Beyond a certain scale, horizontal scaling becomes more cost-effective than vertical scaling.
New Problems with Horizontal Scaling
Horizontal scaling means multiple machines working together. The moment you do that, you enter the world of distributed systems, where new classes of problems appear.
I/O → Network
In a single machine, different services communicate with function calls. They share memory and storage. But once you scale out, you often introduce a central cache and database in order to maintain consistency.
These are no longer internal calls to RAM or disk but network calls to cache and database servers, increasing latency from microseconds to milliseconds. On top of that, there will be retries with exponential backoff when network calls to those services fail.
Now the latency comes not just from computation but from communication as well.
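Retries with exponential backoff can be sketched as a small wrapper. This is a minimal illustration, assuming the network call signals failure by raising `ConnectionError`; the function name `call_with_retries` and its default values are hypothetical.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.05,
                      max_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with exponential backoff.

    The wait doubles after every failed attempt (base_delay, 2x, 4x, ...)
    and is capped at `max_delay`. `sleep` is injectable so tests can skip
    the real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Every retry adds its wait to the latency the caller sees. Production clients usually also add random jitter to the delay so that many callers do not retry in lockstep; that is left out here to keep the sketch short.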
Partial Failures
When you have multiple nodes, one can be slow, one might be down, one might be fine, and one might already be overloaded. You can’t assume that all requests will succeed, so you need to account for retries and timeouts.
And when multiple services depend on each other, you must also account for cascading failures and be careful not to have a single point of failure.
Inconsistency in data
It is not just you: databases also scale horizontally. Look at these scenarios with that in mind.
Since data lives outside the application server, updates can take time to propagate and replicate across the database machines, so one person may see updated data while another sees outdated data.
If some of your nodes are slower, updates to data might be applied out of order.
If one person updates data while another requests it, the update might be applied after the read returns, resulting in an outdated view.
So, in practice, many systems relax strong consistency to improve availability and scalability. But some systems might prefer strong consistency rather than availability — for example, banks. The choice depends on your use case and industry.
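The stale-read scenario can be reproduced with a toy primary/replica pair. This is a deliberately simplified sketch: `ReplicatedStore` is a hypothetical class, replication is triggered by hand through `replicate()` to stand in for lag, and real databases are far more involved.

```python
class ReplicatedStore:
    """Toy primary/replica pair with asynchronous replication."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes accepted by the primary, not yet on the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Apply pending writes to the replica in order; until this runs,
        # the replica lags behind the primary.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()

    def read(self, key, from_replica=True):
        source = self.replica if from_replica else self.primary
        return source.get(key)


store = ReplicatedStore()
store.write("balance", 100)
store.replicate()
store.write("balance", 50)                           # not replicated yet
stale = store.read("balance")                        # replica still has 100
fresh = store.read("balance", from_replica=False)    # primary has 50
```

Reading from the primary gives the fresh value but concentrates load on one machine, which is one face of the consistency versus availability choice described above.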
Load Balancing Tradeoffs
Load balancing doesn’t guarantee that load is split evenly.
By default, the Round Robin strategy assumes that all servers are equally fast and spreads requests evenly by count only, so each server gets an equal number of requests.
But counting requests is not the same as measuring load. Slow nodes become bottlenecks. If the event loop on a slow node is busy, the remaining requests wait in a queue until it frees up. This is called head-of-line blocking. In this case, latency explodes for subsequent requests on that node.
The Least Connections strategy is slightly better, but even it doesn’t know how much resource consumption each request leads to.
For example, the load balancer gets no such context from a /health request versus a /orders/123 request. /health is a light call that responds immediately, while /orders/123 can result in a DB query along with auth checks on the DB side, which might be slower and more resource intensive. At the same time, /orders/456 might be served straight from a DB cache, resulting in a faster response. So if node A gets 2 cheap requests and node B gets 1 expensive request, node B looks less loaded in the eyes of the load balancer because it has only one active connection. It will therefore route the next connection to node B, and that request might turn out to be an expensive one too.
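The node A versus node B situation can be put into numbers. In this sketch the per-request costs are invented for illustration, and `least_pending_work` is a hypothetical oracle that sees the real cost of each request, which an actual load balancer cannot do.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Remaining cost in milliseconds of each in-flight request (made-up numbers).
    in_flight_ms: list = field(default_factory=list)

    @property
    def connections(self) -> int:
        return len(self.in_flight_ms)

    @property
    def pending_work_ms(self) -> int:
        return sum(self.in_flight_ms)


def least_connections(nodes):
    """What the load balancer sees: only the number of open connections."""
    return min(nodes, key=lambda n: n.connections)


def least_pending_work(nodes):
    """Hypothetical oracle that knows how expensive each request is."""
    return min(nodes, key=lambda n: n.pending_work_ms)


node_a = Node("A", [5, 5])   # two cheap requests, e.g. health checks
node_b = Node("B", [400])    # one expensive request, e.g. an uncached order lookup

by_count = least_connections([node_a, node_b])
by_work = least_pending_work([node_a, node_b])
print(by_count.name, by_work.name)
```

Least Connections picks node B because it has one connection, even though node A has far less work left to do.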
A load balancer doesn’t know about downstream services either. In the same example, if the DB is not reachable, the node will keep retrying until it hits the retry limit. Add exponential backoff, and the connection lives even longer on that node. Yet the load balancer will keep sending it more requests, which just wait in a queue until the previous requests are freed from the event loop, adding latency.
Conclusion
Now what would one want to use in a real system?
Since nobody likes a single point of failure, you would generally go with horizontal scaling, but you don’t start with very minimal nodes and scale them out recklessly. Instead, you find a sweet spot, a hybrid between horizontal and vertical scaling, and decide on the hardware specs of each node in a node group based on your budget and latency requirements.
State - A Subtle Shift Horizontal Scaling Forces
Horizontal scaling looks simple as long as each server can handle requests independently — an assumption that rarely holds in real systems.
Once traffic can land on any node, state (things like sessions, counters, or cached data) that once lived comfortably on a single machine, in memory or on disk, can no longer be relied on. A request handled by one server may need information that was created or updated by another.
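A small sketch shows the problem. `AppServer` is a hypothetical server that keeps sessions in its own memory, and `itertools.cycle` stands in for a round-robin load balancer; the user name is made up.

```python
from itertools import cycle


class AppServer:
    """Hypothetical app server that keeps sessions in its own memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # local, per-process state

    def login(self, user):
        self.sessions[user] = {"cart": []}

    def has_session(self, user):
        return user in self.sessions


servers = [AppServer("node-1"), AppServer("node-2")]
round_robin = cycle(servers)

first = next(round_robin)   # the login request lands on node-1
first.login("alice")

second = next(round_robin)  # the follow-up request lands on node-2
print(first.name, first.has_session("alice"))
print(second.name, second.has_session("alice"))
```

The second server has never seen the login, so from its point of view the user has no session at all.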
At this point new questions arise. Do we maintain the same state in all the machines somehow? Or do we completely avoid state? Or should we store the state somewhere externally and retrieve it when needed? But then how confident can we be about consistency?
This is where horizontal scaling in systems stops being just about adding servers and becomes a question of how state is managed across them.
I will explore these questions in my next article.




