Storage Performance Monitoring: 3 Metrics Needed for Holistic Visibility


The Service Level Agreement is always the first casualty in the war to assign blame.

Most IT organizations monitor their applications and infrastructure in a disjointed manner, with each specialist team relying on tools that provide visibility into a single technology silo: network tools for the network engineers, database profilers for the DBAs, agent-based APM tools for developers, and so forth. This fractured approach to monitoring contributes to high IT costs, poor user experience, wasted capacity, and an IT organization that responds to issues reactively instead of proactively.

SAN and NAS Performance Visibility

The fractured approach to monitoring breaks down most visibly when trying to determine the role of networked storage in application performance problems. The teams responsible for storage-area network (SAN) and network-attached storage (NAS) systems frequently have minimal visibility into how those systems interact with applications, the network, or other infrastructure. Conversely, monitoring tools built for other technology silos provide no visibility into real-time storage performance. Agent-based application performance monitoring (APM) products, for instance, fold storage latency into database response-time metrics, masking the real source of transaction latency.

ExtraHop offers a much-needed new approach that provides holistic visibility across the entire application delivery chain. This cross-tier view enables IT teams to easily understand how applications are impacting the database, network, and storage tiers. With shared operational intelligence, IT teams can collaborate to solve problems faster and identify interrelated issues that would otherwise go undetected.

This month's Performance Metric of the Month highlights the importance of CIFS, NFS, and iSCSI transaction metrics in the context of other application and infrastructure performance. The three real-world examples below demonstrate the value of this correlated visibility.

Case #1 – Tiered Storage vs. the Rogue Application

The first case features a customer who saw unexplained poor performance with their tiered storage setup. They had NetApp as their primary storage system and DataDomain as their second-tier backup storage. The second-tier storage was performing slowly during some backups, with constrained I/O and TCP connection stalls manifesting as well.

This customer used the ExtraHop system to inspect a list of all transactions hitting the DataDomain system during the periods of slow performance and identified a single system that was aggressively reading from the storage system. Because the backup storage system was optimized for writes rather than reads, this activity had a serious impact on overall performance. The ExtraHop system made this diagnosis easy by showing all the read and write transactions on a per-client basis. This capability can also be applied to monitoring OLAP database applications, or data warehouses, which are optimized for reads.
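The per-client breakdown described above can be illustrated with a minimal sketch. It assumes the storage transactions have already been exported as simple records; the record layout, IP addresses, and byte counts are hypothetical and are not the ExtraHop system's data format or API.

```python
from collections import defaultdict

# Hypothetical transaction records: (client_ip, operation, bytes_transferred).
# These values are invented for illustration only.
transactions = [
    ("10.0.0.11", "write", 4_000_000),
    ("10.0.0.12", "write", 6_000_000),
    ("10.0.0.42", "read", 50_000_000),
    ("10.0.0.42", "read", 45_000_000),
    ("10.0.0.11", "write", 5_000_000),
]


def summarize_by_client(records):
    """Return {client: {"read": bytes, "write": bytes}} totals."""
    totals = defaultdict(lambda: {"read": 0, "write": 0})
    for client, operation, nbytes in records:
        totals[client][operation] += nbytes
    return dict(totals)


def heaviest_readers(totals, top_n=3):
    """Rank clients by total bytes read, heaviest first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["read"], reverse=True)
    return [(client, t["read"]) for client, t in ranked[:top_n]]


summary = summarize_by_client(transactions)
for client, read_bytes in heaviest_readers(summary):
    written = summary[client]["write"]
    print(f"{client}: {read_bytes} bytes read, {written} bytes written")
```

On a write-optimized backup target, a client whose read total dwarfs everyone else's, like 10.0.0.42 in this made-up data, is the first place to look.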

Case #2 – iSCSI Connectivity Issues and the Confused SAN

This second case demonstrates the importance of contextual visibility when troubleshooting storage performance issues. In this case, a prospective customer had tried for months to isolate the root cause of iSCSI connectivity issues between its Compellent SAN, Citrix Xen, and VMware virtual servers.


Figure 1. Mapping iSCSI connections helped identify misconfigured servers.

During a proof-of-concept demonstration, the IT manager at the company and an ExtraHop systems engineer confirmed the iSCSI connectivity issues and then pinpointed the specific servers experiencing these problems out of the entire pool of Xen and VMware servers. By generating an application activity map that visually mapped all devices using the iSCSI protocol (see Figure 1), the IT manager confirmed that two suspect servers were connecting to the SAN in different ways. These servers were using the Microsoft iSCSI Software Initiator in Windows in addition to host-bus adapters (HBAs). As the SAN tried to load-balance requests across all available interfaces and controllers, it would sometimes send the response to a request from the HBA back to the Microsoft iSCSI Software Initiator on that same server, which would then drop the response.

The ExtraHop system helped to solve this obscure issue by providing the necessary context. With the problem identified, the IT manager turned off the Microsoft iSCSI Software Initiator on those servers, and the iSCSI connectivity issues disappeared.
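The check at the heart of this case, finding servers that reach the SAN through more than one kind of initiator, can be sketched in a few lines. The session records, server names, and initiator labels below are hypothetical stand-ins for whatever inventory of iSCSI connections is available; this is not how the ExtraHop activity map is built.

```python
from collections import defaultdict

# Hypothetical iSCSI session records: (server, initiator_type, san_controller).
# All names are invented for illustration only.
sessions = [
    ("xen-01", "hba", "san-ctrl-a"),
    ("xen-02", "hba", "san-ctrl-b"),
    ("vm-07", "hba", "san-ctrl-a"),
    ("vm-07", "software", "san-ctrl-b"),
    ("vm-09", "software", "san-ctrl-a"),
    ("vm-09", "hba", "san-ctrl-a"),
]


def servers_with_mixed_initiators(records):
    """Return the sorted names of servers seen with more than one initiator type."""
    initiator_types = defaultdict(set)
    for server, initiator_type, _controller in records:
        initiator_types[server].add(initiator_type)
    return sorted(s for s, kinds in initiator_types.items() if len(kinds) > 1)


print(servers_with_mixed_initiators(sessions))
```

Servers that appear in the result are candidates for the kind of misconfiguration described above, where responses can arrive at an initiator that did not issue the request.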

Case #3 – The Bandwidth-Hog Logging System

This final example demonstrates the importance of correlated storage and network visibility. At one company that uses the ExtraHop system, an Operations team member was investigating database activity with the aim of finding SQL queries that were good candidates for caching. In the course of his investigation, he saw that CIFS traffic accounted for 70 percent of network bandwidth. This number seemed odd to him, so he drilled into CIFS transaction details and found some familiar file names in the list: files associated with the company's homegrown logging system!


Figure 2. The ExtraHop system analyzes L7 application protocols.

A bug in the log archive script caused large files to be copied across the network repeatedly. Five million files were unnecessarily rewritten. The network team was unfamiliar with the logging system and had assumed that this growth was organic. In fact, they were preparing a forklift upgrade of the network infrastructure to handle the increased traffic, at a cost of hundreds of thousands of dollars. With the archive script fixed, however, network utilization dropped by an astounding 70 percent, which helped the company avoid a significant and unnecessary capital expense.

Legacy network-monitoring tools would not have helped in this case. Only the ExtraHop system, with its ability to analyze L7 application-level details, is able to distinguish CIFS traffic (see Figure 2) and list the filenames for each transaction.
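The two observations in this case, one protocol dominating bandwidth and the same files crossing the network again and again, can be sketched as follows. The flow records, byte counts, and file paths are hypothetical; the sketch only illustrates the arithmetic and does not reflect the ExtraHop system's data model.

```python
from collections import Counter

# Hypothetical flow records: (l7_protocol, bytes). Values are invented.
flows = [("CIFS", 700), ("HTTP", 150), ("SQL", 100), ("DNS", 50)]

# Hypothetical file paths seen in CIFS write transactions. Values are invented.
cifs_writes = [
    "logs/app-01.log",
    "logs/app-02.log",
    "logs/app-01.log",
    "logs/app-01.log",
    "reports/q1.xls",
]


def protocol_share(records):
    """Return each protocol's fraction of total bytes."""
    totals = Counter()
    for protocol, nbytes in records:
        totals[protocol] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {protocol: nbytes / grand_total for protocol, nbytes in totals.items()}


def repeated_writes(paths, threshold=2):
    """Return {path: count} for files written at least `threshold` times."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n >= threshold}


for protocol, share in protocol_share(flows).items():
    print(f"{protocol}: {share:.0%} of bytes")
print(repeated_writes(cifs_writes))
```

The second function depends on filename-level detail for each transaction, which is why protocol-level byte counts alone would not have exposed the archive script.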

What's Needed: An Operational Intelligence Solution

The three examples above demonstrate the need for correlated, cross-tier visibility. This holistic view has been the goal of IT monitoring for decades but was never realized until now. Cobbling together various specialist tools into a solution portfolio or suite will not provide the same visibility. Only a system that is built from the ground up with the goal of operational intelligence fits the bill.

If you have your own networked-storage tales to tell, please leave a comment below. Or, if you're interested in finding out how the capabilities of the ExtraHop system can help you, try the free, interactive ExtraHop demo.
