Network data is invaluable for understanding how the tradeoff of speed against reliability plays out in the choice of TCP or UDP for application messaging.
I recently had the privilege of a ringside seat at a duke-out between TCP and UDP to be the transport of choice for a high-performance application. Message transport is the backbone of this application, and its performance depends critically on the speed of that transport. Despite the received wisdom that UDP is faster than TCP, the team building the application plumped for TCP. I described the rationale for this in a previous installment of our blog; the short version is that the speed advantage of UDP was marginal enough to be outweighed by the better reliability that TCP offers. Here I’d like to take a deeper look at a related topic: what that reliability actually covers.
The received wisdom is that TCP is reliable, in contrast to UDP, whose name is sometimes claimed to have been crafted as an acronym for “Unreliable Datagram Protocol.” This story may be apocryphal, as RFC 768, which standardized UDP, defines the U to stand for User. However, it is not unknown for an original, more honest, acronym to be replaced by one crafted with more PR awareness – witness the rebranding of the router queue-management scheme, RED, from Random Early Discard to Random Early Detection. It sounds much better for your router to be anticipating congestion problems than for it to be deliberately discarding packets!
I digress, however, from my point about the accepted wisdom that TCP is reliable. Just as with other such nuggets of knowledge, this is true but is subject to an important caveat – one that has surfaced a few times in recent conversations with our clients, and one we’ll dig into in more depth.
Reliability is universally acknowledged to be a good thing, but in hard engineering terms it is meaningless unless you quantify the adverse conditions under which you want your system to keep operating. To take a rather extreme example, nobody bothers building systems that are reliable in the event of an asteroid strike.
As we saw last time, a client of ours was building a high-performance distributed application, and a central concern of theirs was the behaviour of the application under load and its propensity to drop packets when connected via UDP. The failure mode in question was essentially one of the processing capacity of a receiver node being temporarily overwhelmed by the rate of messages directed at it. This problem was addressed by the ability of TCP to provide flow control and significant buffering at the sending nodes.
Another important failure mode that TCP was originally designed to protect against is that of packet-loss: the networking protocol that both TCP and UDP ride on top of, namely IP, provides no guarantees that packet delivery will be reliable. Its unreliability is a trade-off for a flexibility that allows IP to run on top of pretty much any link-layer technology – a trade-off that has enabled IP to be utterly ubiquitous. TCP’s sequence-number and acknowledgement mechanism protects against sporadic packet-loss at the IP layer and below, but it is important not to mistake this mechanism for an unconditional guarantee of reliability. Obviously TCP cannot protect against a complete network outage, and even moderate packet loss can stall a TCP connection badly enough to make it unusable.
A failure mode that is much more important for application architects to consider is that of TCP disconnects, caused by an application closing the socket or crashing. When this happens, data that has been sent to it by its peer is lost. That data may be in its local receive buffer waiting to be read, still in-flight on the network, or still on the peer waiting to be transmitted – no matter, it will be discarded. The reason this is such a problem for application designers is that, although TCP uses a rock-solid acknowledgement scheme to ensure reliable transmission of data over live connections, this mechanism is completely hidden from the application. Although the sending side of the TCP stack knows exactly what data is in flight and what data has been successfully acknowledged by the receiving side, the socket interface provides no way for the application to share in that knowledge.
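A minimal Python sketch may help make this concrete. It assumes a loopback connection and a hypothetical helper name, `send_to_peer_that_never_reads`; the point is only that a successful send means the bytes were handed to the local TCP stack, not that the peer application ever read them.

```python
import socket


def send_to_peer_that_never_reads(payload: bytes) -> bool:
    """Return True if sendall() reported success although the peer read nothing."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    try:
        # sendall() returns once the bytes have been copied into the kernel's
        # send buffer; it says nothing about what the peer application did.
        sender.sendall(payload)
        send_reported_success = True
    finally:
        # The peer "crashes": it closes without ever calling recv(), so the
        # payload sitting in its receive buffer is discarded.
        receiver.close()
        sender.close()
        listener.close()
    return send_reported_success
```

Calling `send_to_peer_that_never_reads(b"x" * 1024)` returns `True` even though the payload was never consumed, and nothing in the return value tells the sender so.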
Although the sending application can ultimately find out if the TCP connection has failed, it will never know just how much data was lost. As a result, any application that requires reliable transfer of data cannot depend on TCP alone.
FIX, the Financial Information eXchange protocol, provides a great example of a protocol which can leverage, but does not depend solely on, the guarantees offered by TCP. FIX implements the concept of a session, which is a persistent context negotiated and shared by a pair of endpoints. It is an application-level analogue of a TCP connection, with source and destination identifiers (in the form of SenderCompID and TargetCompID), sequence numbers and, of particular interest in the context of TCP connection failure, a replay mechanism to allow it to recover from broken network connections.
Specifically, when a pair of endpoints restart a FIX session, the first thing each does after authenticating with the other via Logon messages is to check the sequence number sent by its peer. If there is a gap between this and the highest number it previously received, it knows messages were lost and requests a replay of those messages from its peer.
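The gap check itself is simple arithmetic. The sketch below is illustrative only and is not taken from any FIX engine: `missing_range` is a hypothetical helper that, given the highest sequence number previously received and the number carried by the peer's incoming message, returns the inclusive range a receiver would ask to have replayed (the values that would populate BeginSeqNo and EndSeqNo in a ResendRequest).

```python
from typing import Optional, Tuple


def missing_range(highest_received: int, incoming_seq: int) -> Optional[Tuple[int, int]]:
    """Return the inclusive (begin, end) range of missed sequence numbers, or None."""
    expected = highest_received + 1
    if incoming_seq == expected:
        return None  # no gap: the peer continued exactly where it left off
    if incoming_seq < expected:
        # A lower-than-expected number is not a gap; it needs separate
        # handling (for example, a possible duplicate), omitted here.
        raise ValueError("sequence number lower than expected")
    # Everything from `expected` up to the message before this one was missed.
    return (expected, incoming_seq - 1)
```

For example, an endpoint that last saw message 10 and then receives message 15 would request messages 11 through 14.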
Figure 1. FIX session resending old messages after reconnecting over TCP: the three time-series along the top show the number of replayed messages (flagged with FIX tag 43 PossDup) versus the total number of messages sent, and the number of TCP connection state-changes. The table at the bottom shows that the first thing that happens after the exchange of FIX Logon messages is a request for missed data to be resent.
Another example of the use of application-level acknowledgements on top of TCP is in Kafka, the distributed streaming platform initially developed at LinkedIn. While Kafka uses TCP as a transport mechanism to connect data producers and consumers to the brokers, it also uses application-level acknowledgements to allow producers to ensure that data has been properly published and replicated. Consumer offsets serve as a sequencing mechanism enabling consumers to ensure they can recover messages published but not consumed.
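As a rough illustration of the idea, and emphatically not the Kafka client API, the toy model below keeps a log in memory. `ToyPartition`, `publish`, `fetch`, `commit` and `demo` are all hypothetical names: publishing returns an offset that acts as the acknowledgement, and a committed offset lets a replacement consumer pick up whatever its predecessor fetched but never finished processing.

```python
class ToyPartition:
    """In-memory stand-in for one partition of a log (assumption, not Kafka)."""

    def __init__(self):
        self.records = []
        self.committed_offset = 0  # next offset the consumer should read

    def publish(self, record):
        """Append a record; the returned offset doubles as the acknowledgement."""
        self.records.append(record)
        return len(self.records) - 1

    def fetch(self, offset, max_records):
        """Return up to max_records records starting at offset."""
        return self.records[offset:offset + max_records]

    def commit(self, next_offset):
        """Record how far the consumer has safely processed."""
        self.committed_offset = next_offset


def demo():
    partition = ToyPartition()
    acks = [partition.publish(f"msg-{i}") for i in range(5)]

    # The first consumer fetches three records, processes and commits only
    # two of them, and then "crashes".
    batch = partition.fetch(partition.committed_offset, 3)
    processed = batch[:2]
    partition.commit(partition.committed_offset + len(processed))

    # A replacement consumer resumes from the committed offset, so the record
    # that was fetched but never processed is delivered again.
    recovered = partition.fetch(partition.committed_offset, 10)
    return acks, recovered
```

The design point is that recovery is driven by state the application controls (the committed offset), not by anything the TCP connection remembers.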
The implications of these reliability considerations bear on the trade-offs between TCP and UDP. As we saw the last time, UDP can be a little faster in raw terms but the reliability of TCP against packet-loss is often attractive enough to warrant the overhead. TCP achieves this reliability not just by automatically retransmitting lost packets but, more importantly, by rate-limiting transmission to avoid overloading not just the network but the receiver too.
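This receiver-driven rate-limiting can be observed directly from the sending side. The following is a sketch under stated assumptions: a loopback connection, a non-blocking sender, and a hypothetical function name, `bytes_accepted_before_stall`. The receiver never reads, so once the buffers at both ends fill, the kernel refuses further data rather than dropping it.

```python
import socket


def bytes_accepted_before_stall(chunk_size: int = 65536, cap: int = 256 * 1024 * 1024):
    """Send to a peer that never reads; return (bytes_accepted, stalled)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but recv() is never called
    sender.setblocking(False)
    chunk = b"\0" * chunk_size
    accepted = 0
    stalled = False
    try:
        while accepted < cap:  # cap is only a safety net for this sketch
            try:
                accepted += sender.send(chunk)
            except BlockingIOError:
                # The kernel will not take more data: the buffers are full
                # because the receiving application is not draining them.
                stalled = True
                break
    finally:
        receiver.close()
        sender.close()
        listener.close()
    return accepted, stalled
```

The exact number of bytes accepted depends on the operating system's buffer sizes, so it is not shown here; what matters is that the sender is held back instead of overrunning the receiver.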
However, another client of ours used their Corvil deployment to help diagnose stability problems with the backend of a large web-based application. They were able to trace poor web-tier response times to the communication between two back-end components and, interestingly, there were similar factors at play here: speed mismatches between different components drove TCP rate-limiting and buffering into action, and were occasionally causing TCP disconnects.
Software logging was not helpful in diagnosing this problem: precisely because the software was intermittently locked up, it was not able to log its state reliably or in a timely fashion. The view from the network, as afforded by Corvil Analytics, showed a clear pattern: although TCP disconnects were rare, they were consistently preceded by the receiver stalling and becoming unable to process data. This was evidenced by zero-window advertisements, which are visible only on the network. Once the session between the components was disconnected, it took the application a fairly complex procedure to re-establish the necessary shared state to allow transaction processing to start up again.
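To show what a zero-window advertisement looks like on the wire, here is a minimal sketch that inspects the fixed 20-byte TCP header of a captured segment. It is not how Corvil Analytics works internally; `is_zero_window` and `make_header` are hypothetical helpers, and the choice to ignore SYN, FIN and RST segments is an assumption borrowed from common packet-analysis practice, since those segments can legitimately carry a zero window.

```python
import struct

# Fixed part of the TCP header: source port, destination port, sequence number,
# acknowledgement number, data offset/reserved, flags, window, checksum,
# urgent pointer (20 bytes, network byte order).
TCP_HEADER = struct.Struct("!HHIIBBHHH")

FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10


def is_zero_window(tcp_header: bytes) -> bool:
    """True if the segment advertises a zero receive window on a live connection."""
    fields = TCP_HEADER.unpack_from(tcp_header)
    flags, window = fields[5], fields[6]
    # Window scaling does not matter here: zero times any scale factor is zero.
    return window == 0 and not (flags & (FIN | SYN | RST))


def make_header(window: int, flags: int = ACK) -> bytes:
    """Build a synthetic header for experimentation (ports and numbers are arbitrary)."""
    return TCP_HEADER.pack(40000, 9000, 1, 1, 5 << 4, flags, window, 0, 0)
```

A stalled receiver shows up as a run of segments for which `is_zero_window` is true, well before any reset appears.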
Figure 2. Analysis of a TCP connection stalling, displaying TCP zero-window, and ultimately failing, forcing a reconnection. Top-left charts the goodput of the application, top-right the state changes (reset and reconnect) of TCP connections, and bottom-right renders TCP zero-window events, evidence of a stalled receiver.
Once the operations team were able to pinpoint exactly when and how these disconnects were happening, and most importantly identify that they were triggered by capacity mismatches in the application, they were in a good position to start to remedy them by scaling out the bottlenecked application tier.
In summary, although TCP’s rate-limiting, buffering, and retransmission mechanisms provide greater reliability than UDP, these do not constitute an absolute guarantee against data-loss. As in the case of our client’s web-based application back-end, the loss of a TCP connection can be much more costly than simple packet loss, and can take significantly longer to recover from. In both cases, visibility into the network behaviour of the application is critical to diagnosing user-impacting problems and providing effective remedies.