TCP Windows and Window Scaling

I remember a conversation I once had with a coworker while I was working as a network administrator in Iraq. He wanted to know whether upgrading our meager (and extremely expensive) 2 Mbps satellite Internet connection would be sufficient to play XBox Live. I replied that it wouldn't matter if we had a 1 Gbps connection. Why? Because it was still hindered by one-way latency of around 250 msec, not counting the delay from the upstream satellite hub in the UK to servers in the US. With that kind of round-trip delay, you'd get your head blown off in Call of Duty before you even realize the game has started.

"Long, fat networks" such as ours, so called because of their relatively high delay and high bandwidth, posed an interesting problem for early TCP implementations. To understand this issue, first we must familiarize ourselves with TCP windowing.

TCP Windowing

As we know, TCP is a connection-oriented protocol; both ends of a connection keep strict track of all data transmitted, so that any lost or jumbled segments can be retransmitted or reordered as necessary to maintain reliable transport. To compensate for limited buffer space (where received data is temporarily stored until the appropriate application can process it), TCP hosts agree to limit the amount of unacknowledged data that can be in transit at any given time. This is referred to as the window size, and is communicated via a 16-bit field in the TCP header.

Suppose we have two hosts, A and B, that form a TCP connection. At the start of the connection, both hosts allocate 32 KB of buffer space for incoming data, so the initial window size for each is 32,768.

Host A needs to send data to host B. It can tell from host B's advertised window size that it can transmit up to 32,768 bytes of data (in intervals of the maximum segment size, or MSS) before it must pause and wait for an acknowledgment. Assuming an MSS of 1460 bytes, host A can transmit 22 segments before exhausting host B's receive window.

When acknowledging receipt of the data sent by host A, host B can adjust its window size. For example, if the upper-layer application has only processed half of the buffer, host B would lower its window size to 16 KB. If the buffer was still entirely full, host B would set its window size to zero, indicating that it cannot yet accept more data.

On a LAN with high bandwidth and extremely low delay, windows are rarely stressed as there are typically very few segments in transit between two endpoints at any given time. On a high-bandwidth, high-delay network, however, an interesting phenomenon occurs: it is possible to max out the receive window of the destination host before receiving an acknowledgment.

As an example, let's assume a TCP connection is established between two hosts connected by a dedicated 10 Mbps path with a one-way delay of 80ms. Both hosts advertise the maximum window size of 65,535 bytes (the maximum value of a 16-bit unsigned integer). We can calculate the potential amount of data in transit in one direction at one point in time as bandwidth * delay: 10,000,000 bps divided by 8 bits per byte, multiplied by 0.08 seconds equals 100,000 bytes. In other words, if host A begins transmitting to host B continuously, it will have sent 100,000 bytes before host B receives the first byte transmitted. However, because our maximum receive window is only 65,535 bytes, host A must stop transmitting once this number has been reached and wait for an acknowledgment from host B. (For the sake of simplicity, our example calculations do not factor in overhead from TCP and lower-layer headers.) This delay wastes potential throughput, unnecessarily inflating the time it takes to reliably transfer data across the network. TCP window scaling was created to address this problem.

Window Scaling

Window scaling was introduced in RFC 1072 and refined in RFC 1323. Essentially, window scaling simply extends the 16-bit window field to 32 bits in length. Of course, the engineers could not simply insert an extra 16 bits into the TCP header, which would have rendered it completely incompatible with existing implementations. The solution was to define a TCP option to specify a count by which the TCP header field should be bitwise shifted to produce a larger value.

A count of one shifts the binary value of the field to left by one bit, doubling it. A count of two shifts the value two places to the left, quadrupling it. A count of seven (as shown in the example above) multiplies the value by 128. In this manner, we can multiply the 16-bit header field along an exponential scale to achieve more than sufficiently high values. Of course, this causes us to lose granularity as we scale (we can only increase or decrease the window size in intervals of 2ⁿ where _n_ is our scale), but that isn't much of a concern when dealing with such large windows.

The window scaling option may be sent only once during a connection by each host, in its SYN packet. The window size can be dynamically adjusted by modifying the value of the window field in the TCP header, but the scale multiplier remains static for the duration of the TCP connection. Scaling is only in effect if both ends include the option; if only one end of the connection supports window scaling, it will not be enabled in either direction. The maximum valid scale value is 14 (section 2.3 of RFC 1323 provides some background on this caveat for those interested).

Revisiting our earlier example, we can observe how window scaling allows us to make much more efficient use of long fat networks. To calculate our ideal window, we double the end-to-end delay to find the round trip time, and multiple it by the available bandwidth: 2 * 0.08 seconds * 10,000,000 bps / 8 = 200,000 bytes. To support a window of this size, host B could set its window size to 3,125 with a scale value of 6 (3,125 left shifted by 6 equals 200,000). Fortunately, these calculations are all handled automatically by modern TCP/IP stack implementations.