Latency and Throughput Compared

Background

Suppose your browser requests a webpage. The request flows like this:

When an application wants to communicate using TCP, it asks the OS to create a socket. The OS creates a socket data structure that holds state and buffers. When the application writes data, the data goes into the socket’s send buffer. The OS networking stack breaks this data into packets and places them into kernel network buffers.
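To make the socket and send buffer idea concrete, here is a minimal sketch. It assumes a local socket pair as a stand-in for a real TCP connection (so no server is needed), and the function name send_through_socket is made up for this illustration.

```python
import socket


def send_through_socket(message: bytes) -> tuple[int, bytes]:
    """Write message into one socket and read it from the other end."""
    # socketpair() returns two already-connected sockets on this machine.
    # It is only a stand-in for a TCP connection to a remote server.
    client, server = socket.socketpair()
    try:
        # Size of the kernel send buffer for this socket, in bytes.
        send_buffer_bytes = client.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        client.sendall(message)  # hands the data to the OS send buffer
        received = b""
        while len(received) < len(message):
            chunk = server.recv(4096)
            if not chunk:  # the other side closed early
                break
            received += chunk
        return send_buffer_bytes, received
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    size, data = send_through_socket(b"GET / HTTP/1.1\r\n\r\n")
    print(f"send buffer: {size} bytes, received: {data!r}")
```

The buffer size you see depends on your OS and its settings, so treat the printed number as informational only.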

The network hardware (for example, a Wi-Fi card) pulls packets from these buffers and transmits their bits sequentially over the physical link (Wi-Fi radio channel). Because the link can only transmit one stream of bits at a time, packets from different applications and sockets are serialized and wait in queues.
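Because bits go out one after another, the time to put a packet on the link is its size divided by the link speed. The sketch below works that out; the packet size (1500 bytes) and link speed (100 Mbit/s) are assumed values, and the function names are made up for this illustration.

```python
def transmission_delay_seconds(packet_bytes: int, link_bits_per_second: float) -> float:
    """Time to push every bit of one packet onto the link."""
    return packet_bytes * 8 / link_bits_per_second


def queue_wait_seconds(packets_ahead: int, packet_bytes: int,
                       link_bits_per_second: float) -> float:
    """Time a packet waits while the equally sized packets ahead of it are sent."""
    return packets_ahead * transmission_delay_seconds(packet_bytes, link_bits_per_second)


if __name__ == "__main__":
    # Assumed values: 1500-byte packets on a 100 Mbit/s link.
    one = transmission_delay_seconds(1500, 100e6)
    wait = queue_wait_seconds(50, 1500, 100e6)
    print(f"one packet: {one * 1e6:.0f} microseconds")   # 120 microseconds
    print(f"behind 50 packets: {wait * 1e3:.1f} ms")     # 6.0 ms
```

A single packet is quick, but a packet stuck behind 50 others already waits about 6 ms on this assumed link before it is even transmitted.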

Each packet is received by the Wi-Fi router, queued, and forwarded over the ISP’s fiber optic link. The packets pass through multiple routers on the internet, where they may again be queued and delayed, until they reach the server.

As packets arrive, the server’s OS acknowledges them and delivers the data to the server application. Acknowledgments travel back toward the client, though not necessarily through the same intermediate routers. While acknowledgments are returning, the client continues sending more packets.

The server may begin processing the request before all the data arrives. When it generates a response, that response is broken into packets again and sent back through a similar sequence of links, queues, and buffers in the reverse direction, until the client receives it.

Meanwhile, many other people are making requests just like you. The server accommodates all of them by spreading the work across event loops, threads, and processes (to understand these topics, please check the previous articles).

Bottlenecks

Looking at the flow above, there are several potential bottlenecks:

The link from your system’s Wi-Fi card to your Wi-Fi router.
The link from your Wi-Fi router to your ISP’s fiber optic cable.
The internet path to the server’s network hardware.
Inside the server, how it manages multiple requests.

Definitions

Latency is the end-to-end time between sending a request and receiving the response, including network delay, queueing, and server processing.

Network Throughput is the amount of data that can be sent per unit of time. To measure a server’s efficiency, we use a related idea: the number of requests the server can handle per unit of time. Call it Server Throughput.
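Both definitions can be measured with a timer. The sketch below is only an illustration: handle_request is a hypothetical stand-in for a real server handler, and it runs in-process, so the latency here includes processing time but no network delay.

```python
import time


def handle_request(payload: str) -> str:
    """Hypothetical stand-in for a real server handler."""
    return payload.upper()


def measure(requests):
    """Return (per-request latencies in seconds, requests handled per second)."""
    latencies = []
    start = time.perf_counter()
    for payload in requests:
        t0 = time.perf_counter()
        handle_request(payload)
        latencies.append(time.perf_counter() - t0)  # latency of this request
    elapsed = time.perf_counter() - start
    # Server throughput: completed requests divided by total elapsed time.
    throughput = len(latencies) / elapsed if elapsed > 0 else float("inf")
    return latencies, throughput


if __name__ == "__main__":
    latencies, throughput = measure(["hello"] * 10_000)
    print(f"average latency: {sum(latencies) / len(latencies) * 1e6:.2f} microseconds")
    print(f"throughput: {throughput:.0f} requests/second")
```

Latency is measured per request, while throughput is measured over the whole run. That difference in viewpoint is why the two numbers can move independently.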

How latency occurs

From the client’s point of view, latency comes mostly from hardware limits. Even if your OS and applications can handle thousands of requests simultaneously, in the end everything has to go through your system’s network hardware, which can send only a finite amount of data at a time. So after your OS packages the data, the network layer serializes the packets and sends them one by one. When there is more data than the link can send right away, the extra data is queued in buffers. Queued data has to wait, and that waiting time is latency.

Even if you send just one unit of data, it still has to travel over the network to reach the server. The server then has to compute a response, which travels back over the network to your system. All of this takes time, typically at least a few milliseconds. So some latency will always be there.
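The floor on latency can be estimated from physics alone. The sketch below assumes signals in optical fiber travel at roughly two thirds of the speed of light in vacuum (an approximation) and uses an assumed cable distance of 6000 km; it ignores queueing, routing detours, and server processing, so real latency will be higher.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum


def min_round_trip_ms(distance_km: float, speed_fraction: float = 2 / 3) -> float:
    """Lower bound on round-trip time over fiber, in milliseconds.

    speed_fraction is the assumed signal speed relative to light in vacuum.
    """
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * speed_fraction)
    return 2 * one_way_seconds * 1000


if __name__ == "__main__":
    # Assumed distance: 6000 km of fiber between client and server.
    print(f"at least {min_round_trip_ms(6000):.0f} ms round trip")  # about 60 ms
```

No amount of faster hardware gets below this bound; the only way to lower it is to shorten the distance, for example by serving from a location closer to the user.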

So we don’t try to remove latency, because that is impossible. Instead, we try to reduce it.

What about throughput

If you want your request to be sent as quickly as possible, you want the buffer queue to be empty so that the request is handled immediately. But if the queue is always nearly empty, you are sending very little data per second, which means very low throughput. Since you also want your hardware to be used to its full capacity, you want the queue to stay filled. That means some requests have to wait in the queue before being sent, which means higher latency. So you notice the pattern: an empty queue gives low latency but low throughput, and a full queue gives high throughput but high latency.

Think of a network as a road. The number of cars that can pass through per unit of time is the network throughput. How long a car takes to go from start to end is the latency.

From this it might seem that latency and throughput always rise and fall together, but they don’t. It depends on the perspective. From your system’s perspective, that holds when it has the network to itself. Now look at it from the perspective of a shared network.

If your Wi-Fi is shared by 5 devices, all 5 devices can’t send data over Wi-Fi at the same time, because they share the same physical medium even though they have 5 different logical links. While one device sends data, the others have to wait for their turn. The same happens when downloading: while your device is receiving its packets, the other devices have to wait to receive theirs.

So as more and more devices share a network, the throughput per device decreases and the latency increases. This is why your YouTube video’s quality drops when someone on your Wi-Fi network downloads a movie.

Similarly, no matter how much data can flow between your Wi-Fi router and your devices, the throughput of the ISP’s fiber optic link to your router still limits how much data can flow in total.
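The two points above (sharing and the slowest link) can be put into a small calculation. This is an idealized sketch: it assumes the devices split capacity equally, which real networks only approximate, and the link speeds are made-up example values.

```python
def path_throughput_mbps(link_capacities_mbps):
    """End-to-end throughput is capped by the slowest link on the path."""
    return min(link_capacities_mbps)


def per_device_share_mbps(link_capacities_mbps, devices: int) -> float:
    """Idealized equal split of the path throughput among active devices."""
    return path_throughput_mbps(link_capacities_mbps) / devices


if __name__ == "__main__":
    # Assumed capacities in Mbit/s: Wi-Fi, ISP link, internet path.
    links = [300, 100, 1000]
    print(path_throughput_mbps(links))       # 100
    print(per_device_share_mbps(links, 5))   # 20.0
```

In this example, upgrading the Wi-Fi from 300 to 600 Mbit/s would change nothing, because the 100 Mbit/s ISP link is the bottleneck.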

On top of that, as utilization approaches the capacity of a link or server, queueing delay increases non-linearly. This is why latency can suddenly spike even when throughput increases only slightly.
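One way to see the non-linear growth is a standard textbook model, the M/M/1 queue (random arrivals, random service times, a single server). This model is an assumption brought in for illustration and real systems will differ, but it gives a simple formula: average time in the system = 1 / (service rate - arrival rate). The capacity of 100 requests/second below is an assumed value.

```python
def mean_latency_seconds(arrival_rate: float, service_rate: float) -> float:
    """Average time in an M/M/1 system: 1 / (service_rate - arrival_rate).

    Rates are in requests per second. The queue is only stable when
    arrival_rate is strictly below service_rate.
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrival rate must be below service rate")
    return 1 / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 100  # assumed capacity: 100 requests/second
    for arrival_rate in (50, 90, 99):
        latency_ms = mean_latency_seconds(arrival_rate, service_rate) * 1000
        print(f"{arrival_rate}% utilization: {latency_ms:.0f} ms")
    # 50% utilization: 20 ms
    # 90% utilization: 100 ms
    # 99% utilization: 1000 ms
```

Going from 90% to 99% utilization raises throughput by only 10%, yet under this model the average latency grows tenfold.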

Role of a server in latency and throughput

If a server receives just a single request, it processes it immediately and sends a response. But if it gets multiple requests at once, then depending on how the code was written, it distributes them among its event loops, threads, and processes to handle them concurrently and, where possible, in parallel. This concurrency makes it seem as if the requests are being handled simultaneously, and it moderately reduces the response time of each request, which directly affects latency.

But if the server gets overwhelmed by requests and reaches the limits of what its hardware can run concurrently, requests are queued, so wait time increases and latency goes up. Here, the number of requests the server can handle per unit of time is its throughput.
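The effect of a worker limit can be shown with a small deterministic simulation. It is a simplified sketch with assumed numbers: all requests arrive at the same instant, every request takes the same time to process, and "workers" stands for however many requests the server can truly run at once.

```python
import heapq


def simulate_burst(num_requests: int, workers: int, service_seconds: float):
    """All requests arrive at t=0. Return (average latency, throughput)."""
    free_at = [0.0] * workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    finish_times = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # wait for the earliest free worker
        finish = start + service_seconds
        finish_times.append(finish)
        heapq.heappush(free_at, finish)
    # Latency of a request = its finish time, since it arrived at t=0.
    average_latency = sum(finish_times) / num_requests
    # Throughput = requests completed per second until the last one finishes.
    throughput = num_requests / max(finish_times)
    return average_latency, throughput


if __name__ == "__main__":
    # Assumed workload: 8 requests, 0.5 seconds of work each.
    print(simulate_burst(8, workers=1, service_seconds=0.5))  # (2.25, 2.0)
    print(simulate_burst(8, workers=4, service_seconds=0.5))  # (0.75, 8.0)
```

With one worker, later requests spend most of their time waiting in the queue. With four workers, the queue drains faster, so average latency falls and throughput rises at the same time.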

How to maximize throughput and reduce latency

Going back to the traffic example, say you have 100 cars.

If you have a single road, the 100 cars might take 5 minutes to get across. The throughput here is 100 / 5 = 20 cars/minute and the latency is 5 minutes.

But if you have 5 roads, the 100 cars might take only 2 minutes to get across. So the throughput increases to 100 / 2 = 50 cars/minute and the latency drops to 2 minutes.

Adding roads both increases throughput and reduces latency, because it increases capacity.
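The arithmetic in the road example is just cars divided by time. A tiny sketch, using the numbers from the example (the function name is made up for this illustration):

```python
def throughput_cars_per_minute(cars: int, minutes_to_clear: float) -> float:
    """Cars that get across per minute, given the time for all of them to cross."""
    return cars / minutes_to_clear


if __name__ == "__main__":
    print(throughput_cars_per_minute(100, 5))  # 20.0 with a single road
    print(throughput_cars_per_minute(100, 2))  # 50.0 with 5 roads
```

Note that 5 roads did not give 5 times the throughput in this example (50 instead of 100 cars/minute), so added capacity does not have to scale results perfectly.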

You can increase the throughput of a single server by upgrading its hardware, which lets it handle more requests simultaneously and so reduces latency. But there’s only so much you can do that way.

This is where vertical scaling vs horizontal scaling comes into the picture, which I will cover in the next article.
