CIFS Errors: The ExtraHop system provides storage performance details not contained in event logs.
September's Performance Metric of the Month (PMM) focuses on CIFS errors for networked storage, but the discussion applies equally to other storage networking protocols such as NFS and iSCSI. Previous PMMs covered TCP retransmission timeouts and PAWS dropped SYNs.
Storage is a hot topic for enterprise IT these days. The Big Data phenomenon, no longer confined to research and industry-specific applications, has taken firm hold in enterprises. As confirmation, IDC recently reported continued strong growth among storage vendors in the second quarter of 2011. Read more about ExtraHop's take on APM in the era of Big Data.
The exponential growth of data in the enterprise can cause serious trouble for IT Operations teams. Increasingly large amounts of data pose a challenge to traditional methods of troubleshooting storage performance.
One ExtraHop customer found some corrupt files in the multi-terabyte share used for software quality assurance (QA) and testing. The corrupt files prevented QA team members from running important regression testing reports. Normally, the team responsible for the share would simply run the chkdsk.exe command-line tool to locate and fix the bad volumes. However, because of the large size of the share, chkdsk.exe would have required more than a week to complete. This amount of downtime would be hard to accept at any organization, but even more so with this customer, who risked missing important software development deadlines without a quick fix.
Depending on the size of the file system, chkdsk.exe can take days to complete.
Fortunately, the customer had the ExtraHop system deployed in their environment. The ExtraHop system analyzes storage transactions as they pass over the wire, extracting L7 application-level details by client IP, method, user, and filename. In this case, the customer examined CIFS metrics, but similar metrics are available for NFS. For IP-based SANs, the ExtraHop system also provides health and performance metrics based on analysis of the iSCSI protocol.
By simply logging in to the ExtraHop web-based user interface, the customer examined CIFS server error messages to find the location of the corrupt files. With that critical information, the customer restored those files from backup, and voilà, problem solved!
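To make the idea concrete, here is a minimal sketch of that kind of lookup: given per-transaction records carrying the dimensions mentioned above (client IP, method, user, and filename), count corruption-related server errors per file. This is not the ExtraHop API. The CifsTransaction record, the sample data, and the choice of status names to treat as corruption indicators are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CifsTransaction:
    """Hypothetical record of one CIFS transaction seen on the wire."""
    client_ip: str
    method: str
    user: str
    filename: str
    status: str  # NT status name returned by the server


# Assumption: these two NT status names are the ones we treat as corruption.
CORRUPTION_STATUSES = {"STATUS_FILE_CORRUPT_ERROR", "STATUS_DISK_CORRUPT_ERROR"}


def corrupt_file_counts(transactions):
    """Return (filename, error_count) pairs, most frequently failing first."""
    counts = Counter(
        t.filename for t in transactions if t.status in CORRUPTION_STATUSES
    )
    return counts.most_common()


# Made-up sample data; paths, users, and addresses are illustrative only.
SAMPLE = [
    CifsTransaction("10.0.0.5", "READ", "alice", "regression/report_a.dat",
                    "STATUS_FILE_CORRUPT_ERROR"),
    CifsTransaction("10.0.0.6", "READ", "bob", "regression/report_a.dat",
                    "STATUS_FILE_CORRUPT_ERROR"),
    CifsTransaction("10.0.0.5", "READ", "alice", "regression/report_b.dat",
                    "STATUS_DISK_CORRUPT_ERROR"),
    CifsTransaction("10.0.0.7", "CREATE", "carol", "regression/missing.dat",
                    "STATUS_OBJECT_NAME_NOT_FOUND"),
    CifsTransaction("10.0.0.7", "READ", "carol", "regression/report_c.dat",
                    "STATUS_SUCCESS"),
]

if __name__ == "__main__":
    for filename, errors in corrupt_file_counts(SAMPLE):
        print(f"{filename}: {errors} corruption error(s)")
```

The resulting list is a restore-from-backup candidate list: only files that actually returned corruption errors appear, so nothing has to scan the whole share.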
Were there other ways to troubleshoot this problem? Sure. But none of them offered the immediacy and simplicity of the ExtraHop system, a network-based application performance management solution that provides real-time L7 analysis of web, database, and storage transactions.
We suspect many other organizations face similar problems when troubleshooting their file services and network storage systems. For example, a recent post to the NetApp Community forums asked for advice on dealing with corrupt files stored in the poster's SAN-based file system. In this case, the corrupted files lay among approximately 22 million others, and used disk space totaled 946 GB out of 1.89 TB. Running CHKDSK would likely take several days, depending on hardware specifications and setup. By uploading a recent packet capture from the appropriate network segment, this poster could test the CIFS and iSCSI analysis capabilities of the ExtraHop system with our free www.networktimeout.com tool.
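For a sense of scale, the figures quoted from the forum post can be turned into two quick numbers: the volume is roughly half full, and the average file is only about 43 KB, which points to a volume dominated by many small files. The short sketch below does the arithmetic. It assumes decimal units (1 TB = 1000 GB, 1 GB = 1,000,000 KB) and treats the 22 million file count as approximate.

```python
# Figures quoted in the forum post. Decimal units are an assumption.
USED_GB = 946
CAPACITY_GB = 1890          # 1.89 TB expressed in GB
FILE_COUNT = 22_000_000     # approximate


def utilization(used_gb, capacity_gb):
    """Fraction of the volume's capacity that is in use."""
    return used_gb / capacity_gb


def average_file_size_kb(used_gb, file_count):
    """Mean file size in KB, assuming 1 GB = 1,000,000 KB."""
    return used_gb * 1_000_000 / file_count


if __name__ == "__main__":
    print(f"Utilization: {utilization(USED_GB, CAPACITY_GB):.1%}")
    print(f"Average file size: {average_file_size_kb(USED_GB, FILE_COUNT):.0f} KB")
```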
CIFS Error Example: Dealing with corrupt files stored in a SAN-based file system. Dear Tom: Give us a call. We'd love to help!
To get a better idea of how the ExtraHop system can help solve storage performance problems, watch a three-minute demonstration of how to troubleshoot a CIFS spike on the network.