I'm starting a series of posts that I'll call "War Stories" to share some of the more interesting from-the-trenches stories I come across.
Working effectively to troubleshoot problems in IT takes a number of skills, both technical and social. It's not enough for a Network or Application Administrator to simply demonstrate technical acumen -- they must also possess the patience of a saint and be a model of customer service. In the heat of battle, it's easy to forget that solving a problem isn't about "getting lucky": it's about having a repeatable, scalable process that can get you there. Of course, the more prepared you are, the luckier you get.
We often see firefights where fingers are pointed at the most esoteric part of the infrastructure. Application-aware load balancers, WAN optimizers, forward caches, reverse caches, compression-offload devices, encryption terminators, softphone PBXs, filesystem virtualizers -- the vast array of enterprise network devices that play in the application space is overwhelming. And when there are problems, they are expensive and frustrating.
Today we're going to take a look at an expensive problem that a customer was forced to put up with for months, and how a simple change to their load balancer's TCP stack settings solved it without requiring extensive validation or patching of the hundreds of devices affected. As a bonus, it turned the Network Administrator into a superhero (without a cape).
In one field evaluation, we saw a steady trickle of LPR (Line Printer Remote) traffic throughout the day to a certain group of print servers on a customer's network. It was easy to see on the application usage charts: a thin, constant band of data running to the print servers all day long. Worse still, when workstations crashed, the jobs had to be restarted from scratch. The issue was perplexing. There was no reason for a 30MB print job to take almost an hour to print, especially on a gigabit Ethernet network. The icing on the cake was that the network was being blamed: server load was low, and the workstations were otherwise responsive, so by process of elimination it had to be the network, right? The switch did occasionally show a few buffer overruns, after all. Calls to vendors were met with unhelpful suggestions to buy new products, and the sheer volume of traffic made packet-sniffing an expensive path of last resort.
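For perspective, a quick back-of-envelope calculation shows just how far below line rate those jobs were crawling (the ~50-minute duration below is my approximation of "almost an hour"):

```python
# Rough effective throughput of a problematic print job (approximate figures).
job_size_bits = 30 * 1024 * 1024 * 8   # 30 MB print job
duration_s = 50 * 60                   # "almost an hour" -- call it ~50 minutes (assumed)

throughput_kbps = job_size_bits / duration_s / 1000
print(f"Effective throughput: ~{throughput_kbps:.0f} Kbps")  # ~84 Kbps on a 1 Gbps link
```

That's roughly four orders of magnitude below the nominal speed of the link.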
When debugging problems in a packet dump, there's a natural tendency to focus on the easily quantifiable characteristics of the traffic. (There is, of course, the question of finding the haystack in which to search for your needle.) Drops, bad checksums, frame jabbers -- these are all valuable statistics, but distinguishing cause from effect can be hard. An occasional checksum error is entirely possible, but it would not cause a print job to run at 75 Kbps on a 1Gbps network. To understand what is happening on a problematic network, we need to understand how traffic would flow on a properly functioning one. That's one of the things we've put a lot of work into at ExtraHop: rather than just count explicit TCP statistics, we simulate the TCP stacks on both endpoints to surface hard-to-find issues. The concept is straightforward: TCP has been around for nearly 30 years, and it employs a feedback-control system whose behavior is well understood. If a stack is behaving strangely relative to those well-understood basics, further investigation is warranted. It's essentially fine-grained parametric and behavioral analysis of the protocol stack.
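To make the idea concrete, here is a minimal sketch of one such behavioral check: measuring how long a receiver takes to acknowledge data and flagging delays clustered near the classic 200ms delayed-ACK timer. This is not how the ExtraHop system is implemented (it analyzes live traffic in real time); it's an offline illustration that assumes scapy is installed, and the capture file name and server address are hypothetical.

```python
# Sketch: flag ACKs that arrive suspiciously late on a LAN where pings are sub-millisecond.
# Ignores retransmissions, sequence wraparound, and Ethernet padding for brevity.
from scapy.all import rdpcap, IP, TCP

def ack_delays(pcap_path, server_ip):
    pkts = rdpcap(pcap_path)
    pending = {}   # expected ack number -> time the data segment left the server
    delays = []
    for p in pkts:
        if IP not in p or TCP not in p:
            continue
        tcp, payload_len = p[TCP], len(p[TCP].payload)
        if p[IP].src == server_ip and payload_len > 0:
            # Data from the print server: start waiting for its acknowledgement.
            pending.setdefault(tcp.seq + payload_len, p.time)
        elif p[IP].dst == server_ip and tcp.flags & 0x10:   # ACK flag set
            for expected in [e for e in pending if e <= tcp.ack]:
                delays.append(float(p.time) - float(pending.pop(expected)))
    return delays

delays = ack_delays("print_traffic.pcap", "10.0.0.5")   # hypothetical capture and server address
suspicious = [d for d in delays if d > 0.1]              # anything near 200 ms stands out on a LAN
print(f"{len(suspicious)} of {len(delays)} ACKs took longer than 100 ms")
```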
What we observed using the ExtraHop system was an excessive amount of Send Window Throttling. Based on our analysis of the traffic and the fits-and-starts nature of the transfers, we determined that the throughput of the connections was being limited by a small internal send buffer. Moreover, the median round-trip time we observed for TCP connections between the print server and clients on the local network was around 200ms, indicating a problem with delayed ACKs. This was interesting because ICMP pings between the same devices had sub-1ms response times. A quick search of Microsoft's knowledge base for delayed-acknowledgement problems and slow printing yielded suggestions involving hotfixes and registry changes, some of which would be extremely ill-advised (like disabling delayed acknowledgements altogether). However, since our device can be used to tune load balancers and other application-aware network infrastructure, we simply recommended that the customer create a custom TCP profile for the virtual server handling the problematic application, increase the size of the send buffer, and enable the "ack_on_push" setting on the load balancer's TCP stack.
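To see why those two small settings matter, consider the ceiling a small send buffer imposes when every flush has to sit out the receiver's 200ms delayed-ACK timer. The 2KB buffer in the sketch below is an assumed figure purely for illustration; the 200ms stall is what we actually observed.

```python
# Why a small send buffer plus a 200 ms delayed-ACK stall crawls, even on gigabit Ethernet:
# the sender can only push one buffer's worth of data per stall cycle.
def max_throughput_kbps(send_buffer_bytes, stall_seconds):
    return send_buffer_bytes * 8 / stall_seconds / 1000

print(max_throughput_kbps(2048, 0.200))    # ~82 Kbps -- right where the print jobs were stuck
print(max_throughput_kbps(65535, 0.0005))  # ~1,048,560 Kbps -- bigger buffer, sub-ms ACKs
```

A larger send buffer keeps data in flight, and acknowledging on PSH keeps the receiver from sitting on its ACKs, so the connections can actually use the gigabit link.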
These steps took about five minutes, and the problem was resolved immediately. If the change had caused an issue, it could have been backed out as quickly as it was applied. The Network Engineer saved the day by using a system that analyzes the differences between expected TCP behavior and the actual traffic flow, rather than spending hours poring through hundreds of thousands of network flows hoping to find a pattern.
The Network Engineer had good news to share with the rest of the organization. Armed with the appropriate visibility and knowledge, he was able to roll out a change on the network that fixed problems across hundreds of workstations and servers in a matter of minutes. It goes to show that it's not about the raw data you collect: it's about the actionable information you have on hand.
- Raja