Application Transaction Tracing Across All Tiers

How can my development team trace individual user transactions through every tier, including proxies, across our application delivery chain?

The Problem

Application developers at an online retailer lacked visibility into how their applications connected to other web services in the organization. They couldn't see the impact that network, security, and other infrastructure elements had on end-to-end transaction performance. When outages occurred, the developers could report on and troubleshoot only their own application stack, using a combination of agents and logs, yet they were highly dependent on the firewall, load-balancing, and other proxy tiers, which they didn't control.

Because the company's environment was large and complex, finding the source of a problem tied up considerable personnel in an inefficient process of elimination. They estimated that the average outage, no matter how brief, cost a minimum of $75,000 in lost revenue and productivity. In the last year alone, there were more than fifteen such events.

A typical incident would start with an alert showing that a particular service was down or degraded. Developers, network administrators, and system administrators would gather in physical and virtual war rooms, where representatives from every group would attempt to solve the problem. After investigation, they would often discover that a suspect application was fine but that a downstream or upstream dependency was causing the issue. The operations team would then bring in representatives from other teams, eventually growing the war room from five to ten or more people and adding to the chaos.

Desired Outcome

  • Reduce or eliminate outages through early warning
  • Non-invasively measure every user's transaction across all tiers
  • Automatically map dependencies across the entire service chain
  • Improve coordination among teams when triaging

The Solution

The retailer had already deployed an ExtraHop platform but was using it exclusively for network monitoring. A developer at the company who knew of ExtraHop's stream processing and programmability wondered whether his organization could use the platform's wire data analysis to pinpoint issues throughout the entire service chain. Because the ExtraHop platform can parse, visualize, trend, and alert on any transactional data from the physical layer up to and including encrypted bi-directional application payloads, it could trace applications across all of their tiers, including their proxy architecture, which he referred to as the "black hole."

The first step was to use a unique and trackable identifier that could span all tiers. Their applications already used such an identifier, the JSession ID, which could easily be carried through the complete transaction flow in their network. The first time a request traversed their environment, the application delivery controller would load-balance it to one of the application servers, which would then assign a JSession ID to the user.

The developers passed this session ID in the header of their downstream API and other web service calls, and injected it into a SQL comment on every database call.
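As a rough illustration of this propagation step, the sketch below tags an outgoing request's headers and a SQL statement with the session ID. The header name `X-Session-Id`, the `jsessionid=` comment format, and the helper names are assumptions made for the example; the case study does not say which names the retailer used. The ID is checked against a conservative character whitelist so that it cannot terminate the SQL comment early.

```python
import re

# Hypothetical header name; the case study only says the ID went "in the header".
SESSION_HEADER = "X-Session-Id"

# Conservative whitelist so the ID cannot break out of a SQL comment.
_SAFE_ID = re.compile(r"[A-Za-z0-9._-]+")


def validate_session_id(session_id: str) -> str:
    """Return the ID unchanged if it is safe to embed, else raise ValueError."""
    if not _SAFE_ID.fullmatch(session_id):
        raise ValueError(f"unsafe session id: {session_id!r}")
    return session_id


def tag_headers(headers: dict, session_id: str) -> dict:
    """Return a copy of the outgoing request headers carrying the session ID."""
    tagged = dict(headers)
    tagged[SESSION_HEADER] = validate_session_id(session_id)
    return tagged


def tag_sql(sql: str, session_id: str) -> str:
    """Prefix a SQL statement with a comment carrying the session ID."""
    return f"/* jsessionid={validate_session_id(session_id)} */ {sql}"
```

Because the ID travels in the request header and in the SQL text itself, a passive observer of the traffic can match it at each tier without any change to the proxies in between.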

The second step was to create an Application Inspection Trigger (AI Trigger) to track the identifier as it traversed the different components in the delivery chain. Because the ExtraHop platform in the retailer's environment received a SPAN of all traffic, the AI Trigger was coded to record time-stamped measurements as the JSession ID traversed every element in the chain. The metrics included TCP behavior and anomalies (Nagle delays, stalls, resets, and failures); L7 response times for each tier, with client-side and server-side response times measured independently; individual request and response sizes; and any client-side or server-side errors, including the descriptive error response found in the application payload.
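The correlation idea behind this step can be sketched in a few lines. The code below is plain Python for illustration, not ExtraHop trigger code, and the `TierRecord` fields and function names are hypothetical. It groups time-stamped per-tier measurements by session ID, orders each group by time, and then estimates how much of the response time each tier contributed, under the simplifying assumption that the chain is strictly linear (each tier makes exactly one downstream call).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierRecord:
    """One measurement taken when a tagged request was seen at a tier."""
    session_id: str
    tier: str
    timestamp: float      # when the request was observed, seconds since epoch
    response_ms: float    # response time measured at this tier
    error: Optional[str] = None


def build_traces(records):
    """Group records by session ID and order each trace by timestamp."""
    traces = defaultdict(list)
    for rec in records:
        traces[rec.session_id].append(rec)
    for hops in traces.values():
        hops.sort(key=lambda r: r.timestamp)
    return dict(traces)


def self_times(trace):
    """Estimate time spent in each tier itself.

    Assumes a linear chain: each tier's response time includes the
    response time of the next tier downstream, so the difference is
    the tier's own contribution.
    """
    result = []
    for i, hop in enumerate(trace):
        downstream_ms = trace[i + 1].response_ms if i + 1 < len(trace) else 0.0
        result.append((hop.tier, hop.response_ms - downstream_ms))
    return result
```

With per-tier contributions separated this way, a slow database call shows up as time spent in the database tier rather than as a slow load balancer or application server.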

The third step was to create a dashboard to show the individual transaction performance and impact within each element across the service chain for their applications. With the tracker and AI Trigger information on one screen, the dashboard showed the transactions as they entered the perimeter firewall, traversed the routers and load balancers, travelled through the web servers and application servers, and finally passed through the APIs, web service tiers, and databases.

For the application developers, the end result was the ability to trace outages and slowdowns both upstream and downstream from their own application servers. If an API server downstream experienced issues, the dashboard clearly showed which clients were affected, and to what degree. Based on this information they set up trend-based thresholds and alerts to create a proactive early warning system.
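The case study does not describe how the trend-based thresholds were computed. One common approach, shown here purely as an assumed example, is to flag a measurement when it exceeds the recent mean by some multiple of the standard deviation. The function names, the multiplier `k`, and the `min_samples` guard are all illustrative choices.

```python
from statistics import mean, pstdev


def trend_threshold(history, k=3.0):
    """Return the recent mean plus k population standard deviations."""
    return mean(history) + k * pstdev(history)


def is_anomalous(history, latest, k=3.0, min_samples=5):
    """Flag `latest` if it exceeds the threshold derived from `history`.

    With fewer than `min_samples` points the baseline is considered
    too thin to judge, so nothing is flagged.
    """
    if len(history) < min_samples:
        return False
    return latest > trend_threshold(history, k)
```

For example, with recent response times of 100, 102, 98, 101, and 99 ms, a new reading of 103 ms stays under the threshold while a reading of 150 ms is flagged. Because the threshold follows recent behavior, it adapts to each tier's normal range instead of relying on one fixed limit for every service.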

User Impact

After six months of use, the DevOps team estimated that they had prevented seven outages, protecting at least $525,000 in revenue (seven outages at the $75,000 minimum each) and saving hundreds of personnel hours. Had they not had ExtraHop in place, they would not have immediately understood the impact that several firewall and load-balancing rule changes had on user transaction performance. An entirely different DevOps process was established in which all teams used ExtraHop as the platform for observing all L2-L7 transaction behavior across all of their tiers, AppDynamics' code-level instrumentation to understand application internals in real time, and Splunk for post-hoc log analysis and forensics. Over time, they estimated that more than 95 percent of all potential incidents were addressed in under 5 minutes, before they resulted in outages, and that the remaining incidents (fewer than 5 percent) were resolved in under 15 minutes.