One of the services that ExtraHop provides for customers is hands-on, in-person training. While on-site, trainers often address actual customer challenges as a way to help teams learn how to use different elements of the product.
Recently, one of our senior trainers was on-site with a fast-growing software company in the southern United States. During the training, a member of the IT team responsible for storage interrupted to tell the room that they were experiencing a problem with their G Suite applications, including mail, calendar, and drive. He immediately pointed his finger at the network.
One person in the room then pulled up Riverbed. The Riverbed UI was showing five-minute samples of interface utilization. These numbers quickly proved useless.
The ExtraHop trainer suggested they dig into the ExtraHop UI, as it showed much more granular, one-second resolution.
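To see why five-minute samples can hide a problem that one-second data exposes, consider the minimal sketch below. The traffic numbers are invented for illustration and are not taken from the customer's network, and `summarize` is a hypothetical helper, not part of any ExtraHop or Riverbed API.

```python
def summarize(samples_mbps):
    """Return (average, peak) for a list of one-second throughput samples."""
    average = sum(samples_mbps) / len(samples_mbps)
    peak = max(samples_mbps)
    return average, peak


# Illustrative data only: 300 one-second samples (five minutes) in Mb/s.
# The link idles at 50 Mb/s except for a 20-second burst at 950 Mb/s.
samples = [50.0] * 140 + [950.0] * 20 + [50.0] * 140

five_minute_average, one_second_peak = summarize(samples)

print(f"five-minute average: {five_minute_average:.0f} Mb/s")
print(f"one-second peak:     {one_second_peak:.0f} Mb/s")
```

In this made-up case, a single five-minute data point reports a link that looks nearly idle, while the one-second view shows a burst close to the capacity of a 1Gb link.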
The finger had been pointed at the network, so the IT team decided to start there, taking a look at L2 traffic. There was no significant surge in network traffic, so they moved on to L4. At L4, they immediately saw a spike in retransmission timeouts (RTOs), suggesting that the company's devices weren't responding in a timely fashion. The vast majority of those RTOs were transiting a single network device.
The team then dug into the network device, looking up the stack at L7 protocols. Right away, they noticed a lot of Network File System (NFS) traffic moving through the network device, starting right around the time that the IT team started to get complaints about both internet service and G Suite applications. This suggested that the issue was arising at the storage tier, but the storage admin insisted this was impossible. After all, their storage array used 10Gb interfaces.
Despite assurances from the storage admin, the NFS traffic continued to be suspicious. All of the problematic NFS traffic on the network device was coming from a single interface on a storage cluster, so they pivoted and drilled into that particular storage interface. As it turned out, a single Linux desktop at a remote office was hammering that specific storage interface, information the IT team had due to ExtraHop's out-of-the-box auto-discovery and classification capability.
Despite mounting evidence, the storage admin continued to insist that every storage interface was 10Gb and so it was impossible that the problem could be originating at the storage tier. So the IT team went back to L2 traffic for the storage cluster. The traffic between the storage interface and the Linux desktop was ebbing and flowing but never got above 1Gb/s, strongly suggesting an interface limited to 1Gb/s.
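The reasoning here, that traffic repeatedly approaching but never exceeding a round number points to a link running at that speed, can be sketched in a few lines. This is only an illustration under stated assumptions: the sample values are invented, the list of standard rates is a simplification, and `likely_bottleneck_gbps` and `saturation_fraction` are hypothetical helpers rather than anything ExtraHop provides.

```python
# Common Ethernet link speeds in Gb/s (a simplified, assumed list).
STANDARD_RATES_GBPS = [0.1, 1.0, 10.0, 25.0, 40.0, 100.0]


def likely_bottleneck_gbps(samples_gbps, rates=STANDARD_RATES_GBPS):
    """Return the smallest standard rate that is >= the observed peak."""
    peak = max(samples_gbps)
    for rate in sorted(rates):
        if peak <= rate:
            return rate
    return None  # peak exceeds every rate in the list


def saturation_fraction(samples_gbps, rate_gbps, threshold=0.9):
    """Fraction of samples at or above `threshold` of the given rate."""
    near_cap = [s for s in samples_gbps if s >= threshold * rate_gbps]
    return len(near_cap) / len(samples_gbps)


# Illustrative one-second throughput samples in Gb/s (made-up values).
samples = [0.42, 0.88, 0.97, 0.61, 0.95, 0.99, 0.73, 0.96]

cap = likely_bottleneck_gbps(samples)
near = saturation_fraction(samples, cap)

print(f"likely bottleneck: {cap} Gb/s")
print(f"samples within 10% of that rate: {near:.0%}")
```

A peak that sits just under 1Gb/s, with many samples pressed up against it, is consistent with a 1Gb hop somewhere in the path even when the interface being monitored is nominally 10Gb. It is a hint to go check the link, not proof on its own.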
As it turned out, the storage admin was correct. The storage cluster did indeed have 10Gb interfaces. With ExtraHop, they were able to query the specific interface at issue. With this information, the IT team was able to determine that, while it was, in fact, a 10Gb storage interface, it was connected on the other end to a 1Gb interface. When one user decided to move a large volume of data over that one poorly configured storage link, it took everything down. Armed with this information, the storage admin was able to quickly resolve the issue, restoring internet and G Suite service across the organization.
For the fast-growing software company, correlated visibility across multiple tiers, including L2, L3, L4, and L7, armed them with the information they needed to stop the finger-pointing and blamestorming and find the actual root cause of the issue. Rather than spending hours trying to prove innocence while the organization's productivity suffered, the team troubleshot and resolved the issue quickly, keeping business running with minimal disruption.
To put it in perspective, according to the trainer, the writing of this blog post took longer than troubleshooting the issue!