Next-Gen Network Performance Monitoring

How do I get correlated network, infrastructure and performance visibility safely in a highly secure environment?

The Problem

The network managers at a large multi-billion-dollar financial services company knew that correlated L2-L7 analysis of all communication on their network was the key to identifying network problems quickly and proactively and to shortening time-to-resolution. The challenge was obtaining a complete view of the network in a large, highly secure environment where over 75 percent of applications were encrypted and strict compliance policies applied. The network managers needed a solution that could perform decryption at scale, that did not perturb the environment as previous probes had, and that provided visibility without the hassles and dangers of collecting and sifting through captures with protocol analyzers. Taking these traces required advance approval, and the resulting files had to be properly guarded and disposed of.

The department had three major challenges:

  • Getting access to detailed information that could be auto-discovered, categorized, sorted, and combined for analysis.
  • Reducing the cost in time and storage of collecting and sifting through packet captures during outages.
  • Keeping historical data without risk of violating Sarbanes-Oxley, PCI standards, or SEC regulations.

Their existing solutions and data sets were suboptimal. NetFlow and SNMP data gave some basic insight but did not scale well, showing only which systems were talking to one another, the bytes exchanged, drops, and frame metrics, as well as basic resource utilization. NetFlow offered no opportunity to go back for other details and was missing all Layer 4-7 information while still consuming large amounts of storage. While TCP dumps proved invaluable for forensic analysis, packet capture analyzers such as Wireshark couldn't scale to the necessary data volume and required approval from the management chain.

When the network team had no option other than packet analysis, the process was costly: obtain written authorization to install a packet analyzer on a system, take the packet capture, uninstall the packet analyzer, and securely destroy the packet trace files after analysis.

With any IT performance issue, the application developers, storage teams, and DBAs each had more sophisticated but siloed tools and typically assumed the network was the problem.

Desired Outcome

  • Collect and visualize all cross-tier communication on the network
  • Customizable dashboards that reflect their tiered architecture
  • Trending and time comparison so they could spot patterns
  • Ability to share dashboards widely within the organization
  • Show how application changes affect the network and other infrastructure

NetFlow, SNMP, and reporting gave some level of information, but that information was difficult to sort through, offered no opportunity to go back for other details, and was quite a storage hog. At about 45 megabytes of NetFlow information per interface, the company needed to store 8 to 10 gigabytes per day.
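As a rough back-of-the-envelope check on those storage figures, the sketch below derives how many monitored interfaces the two numbers would imply. It assumes, which the source does not state, that 45 megabytes is a per-interface daily volume and that decimal units apply (1 GB = 1000 MB). The function name is illustrative only.

```python
# Assumption: 45 MB is NetFlow data per interface per day, in decimal units.
MB_PER_INTERFACE_PER_DAY = 45


def implied_interfaces(total_gb_per_day: float) -> float:
    """Return the interface count implied by a daily NetFlow storage total."""
    return total_gb_per_day * 1000 / MB_PER_INTERFACE_PER_DAY


low = implied_interfaces(8)    # lower bound from 8 GB per day
high = implied_interfaces(10)  # upper bound from 10 GB per day
print(f"{low:.0f} to {high:.0f} interfaces")  # prints "178 to 222 interfaces"
```

Under those assumptions the figures correspond to roughly 180 to 220 interfaces, all of it flow metadata without any Layer 4-7 detail.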

The Solution

The company had originally brought ExtraHop in for Citrix analysis but the network team realized that ExtraHop had powerful network analytic capabilities as well. In fact, because ExtraHop is a real-time stream processor and not a packet capture system, it was an ideal solution to meet compliance and regulatory needs with their secure and complex environment.

The network team set about creating dashboards, starting from a top-level view with all active systems auto-categorized, and then measured the actual conversations between the systems communicating on the network. For the first time they could go beyond NetFlow and correlate not only who an FTP server was talking to, but also which methods were being used, the rate of those specific messages, the response time, and the size of each transaction. They did the same for nearly all tiers and protocols within the application and network architecture. With the time-over-time comparisons, the team could immediately understand how the same network had performed, and what had been communicating, at the same time the previous day, week, or month. This type of situational awareness improved their time to incident identification and response by more than 60 percent.
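The time-over-time comparison described above can be pictured with a minimal sketch. This is not ExtraHop's implementation; the function name, the sample values, and the choice of an FTP response-time metric are all hypothetical, chosen only to show the idea of comparing a current measurement against the same moment one day or one week earlier.

```python
from datetime import datetime, timedelta


def time_over_time(samples, now, lookback):
    """Percent change between the sample at `now` and the one `lookback` earlier.

    `samples` maps datetime -> metric value (for example, mean response time in ms).
    Returns None when either sample is missing or the baseline is zero.
    """
    current = samples.get(now)
    baseline = samples.get(now - lookback)
    if current is None or not baseline:
        return None
    return (current - baseline) * 100.0 / baseline


# Hypothetical FTP response times in milliseconds, sampled at 09:00.
now = datetime(2024, 1, 8, 9, 0)
samples = {
    now: 180.0,
    now - timedelta(days=1): 120.0,
    now - timedelta(weeks=1): 150.0,
}

day_change = time_over_time(samples, now, timedelta(days=1))    # vs. yesterday
week_change = time_over_time(samples, now, timedelta(weeks=1))  # vs. last week
print(day_change, week_change)
```

A large positive change against both baselines suggests a genuine regression rather than a normal daily or weekly pattern.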

Once this baselining was complete, they began to get more proactive. Within ExtraHop, the team quickly wrote an Application Inspection Trigger to fire a packet capture when a particular database error was observed on the wire. ExtraHop's Precision Packet Capture has a continuous ring buffer, so it can go back in time from when an event occurred and extract only those packets associated with the creation of the event. When the error event occurred, the network team analyzed the precision packet capture and took it to the DB team, along with ExtraHop's database analysis. It turned out that the error was being caused by an application code update that was in test. The packet capture confirmed that the application was occasionally making the same database call twice within the same transaction. Had this not been identified, it would have had significant repercussions for this customer-facing application.
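The general idea of a ring buffer that lets a trigger "go back in time" can be sketched as follows. This is a conceptual illustration only: it does not use ExtraHop's trigger API or reflect how Precision Packet Capture is built, and the class, function, flow identifiers, and payload strings are all hypothetical.

```python
from collections import deque


class RingCapture:
    """Fixed-size buffer of recent packets; the oldest entries are overwritten."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def observe(self, flow_id, payload):
        """Record one packet, silently evicting the oldest when full."""
        self._buffer.append((flow_id, payload))

    def extract(self, flow_id):
        """Return the buffered payloads for one flow, oldest first."""
        return [p for f, p in self._buffer if f == flow_id]


def on_packet(capture, flow_id, payload, is_error):
    """Hypothetical trigger: buffer every packet, extract the flow on an error."""
    capture.observe(flow_id, payload)
    return capture.extract(flow_id) if is_error(payload) else None


capture = RingCapture(capacity=4)
stream = [
    ("db-1", "SELECT a"),
    ("web-7", "GET /"),
    ("db-1", "SELECT a"),
    ("db-1", "ERROR duplicate call"),
]

captured = None
for flow_id, payload in stream:
    result = on_packet(capture, flow_id, payload, lambda p: p.startswith("ERROR"))
    if result is not None:
        captured = result

print(captured)  # only the db-1 packets leading up to, and including, the error
```

Because only the packets tied to the triggering flow are extracted, the resulting capture stays small, which matters in an environment where every trace file has to be guarded and destroyed.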

User Impact

Not only did their time to incident response improve by over 60 percent, but the Network Manager also discovered 43 DNS servers that the team thought had been decommissioned but were still active. These servers were not only costing the company in terms of reuse and electric power, but also left gaping security holes. The team also discovered storage backup processes that were running every hour instead of once a day, consuming an average of 30 percent of bandwidth on their backhaul links. The issue was identified with Layer 7 visibility into the specific traffic type and the file names themselves. The biggest win for the Network Manager is that he and his team can do more, and better-quality, work with the same staff as they have moved from a reactive to a proactive state.