A large technology company that had virtualized nearly 70 percent of its applications was migrating a highly interactive, performance-sensitive application to its virtualized environment. Almost immediately, customers began posting on the company's public forums that the application seemed to pause and stutter, creating a poor user experience.
Nobody among the network, virtualization, storage, application, and database teams could explain the degradation. Their virtual-infrastructure and application performance monitoring products indicated that CPU, memory, and disk I/O were all green; the application stack trace showed no problems at the code level; and there was no unusual spike in user sessions. Logs and machine data indicated nothing amiss: the application produced no exception or error output, and there were no identified memory leaks or excessive thread counts. The network performance monitoring tool showed no packet loss, latency, or congestion, and the database and storage teams were able to show that those systems' processing, queries, disk access, and I/O were performing as expected. Ultimately, someone suggested that "virtualized applications just run slower" and that maybe they should go back to bare metal.
The CIO did not want to migrate back to bare metal. The majority of the other applications, virtualized for years, were performing well, and the private cloud had delivered numerous benefits: cost savings from server consolidation, and a better ability to scale production workloads up and down, roll out product updates, and recover from disasters. However, she knew that the remaining 30 percent of the applications in her portfolio were complex multi-tier applications with sometimes unexpected, even unknown, dependencies that threatened the private cloud initiative. She set four goals:
- Quickly identify the virtualized application and end-user performance issue
- Return the application's performance to its previous level or better
- Gain visibility across all virtualized and non-virtualized tiers, and continually monitor the dependencies and behaviors of all components in the application delivery chain
- Leverage the resulting data to improve processes and workflows across her IT teams
The company tried many traditional approaches, including web server logs, synthetic monitoring tools, and probes that checked the health of the web servers. These tools were good at providing an up/down picture but not always helpful in diagnosing the actual issue, and they could not keep pace with the deployment, change, and retirement of the web tiers.
The first critical step toward a solution came when the CIO asked her leads from each technical domain to sit down and map out why they thought they were unable to identify the issue, despite already owning and using nearly a dozen monitoring tools.
The consensus was that they lacked the ability to objectively observe all transactional behavior between the systems connected in their private cloud, and that they lacked correlation with end-user behavior. They had plenty of host-based logs and instrumentation, as well as network data, but nothing that continuously observed the whole environment, showing dependencies, Layer 2 through Layer 7 transaction processing times, latency, and errors in a way that would let them understand the complex relationships and rapidly eliminate potential causes of latency.
ExtraHop was deployed to passively monitor all transactional behavior between the clients, application, network, and infrastructure components in real time. Within a few hours of collecting data, the teams saw in the ExtraHop UI a spike in end-user response time, and noted that the application server's TCP behavior showed intermittent bursts of retransmission timeouts (RTOs) and application stalls at the same moments the clients began seeing degraded performance.

Excessive RTOs often indicate overcommitment of host resources within a virtual environment, something that goes unrecognized unless a platform can observe all client and server behavior rather than relying on what the application or host reports about itself. TCP stops sending application-layer data while waiting for an acknowledgment of the data it previously sent, and the sender can wait up to a full second before retransmitting the unacknowledged segment. The root cause was a hypervisor scheduling issue: the hypervisor, which manages the shared resources, was not running the guest as often as the TCP stack required for optimal performance, leaving the application unable to meet TCP's strict timing requirements. Because the virtual CPUs, memory consumption, and I/O had all been running well within range, everything had looked fine in the log and agent data used for the original provisioning.
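The one-second wait described above comes from TCP's retransmission timer. A minimal sketch of the standard estimator from RFC 6298 (an illustration of the protocol behavior, not ExtraHop's implementation) shows why: the computed RTO is clamped to a one-second floor, so even on a fast network a stalled guest can leave clients waiting a full second per lost segment.

```python
# Simplified TCP retransmission-timeout (RTO) estimator per RFC 6298.
# Illustrative sketch only; real stacks track this per connection.

MIN_RTO = 1.0  # RFC 6298 recommends a lower bound of 1 second


def update_rto(srtt, rttvar, rtt_sample, alpha=0.125, beta=0.25):
    """Update smoothed RTT, RTT variance, and RTO from one RTT sample.

    Pass srtt=None for the first measurement on a connection.
    All times are in seconds.
    """
    if srtt is None:
        # First measurement: SRTT = R, RTTVAR = R/2 (RFC 6298, section 2.2)
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        # Subsequent measurements (RFC 6298, section 2.3)
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto = max(MIN_RTO, srtt + 4 * rttvar)
    return srtt, rttvar, rto


# Even with sub-millisecond datacenter RTTs, the RTO floor keeps the
# timeout at a full second, matching the stalls the teams observed.
srtt, rttvar, rto = update_rto(None, None, 0.001)
print(rto)  # 1.0
```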
After seeing these metrics, the virtual administrator decreased the number of guests provisioned on the virtual application farm and increased each guest's share of virtual resources, improving performance beyond what had originally been measured when the application ran on bare metal.
The CIO's private cloud build-out was able to continue, and with even greater speed and predictability. By using the ExtraHop platform to measure and characterize all non-virtualized application, infrastructure, database, and storage elements, the teams created a factual performance baseline before migrating. During virtual development and testing, this baseline served as the authoritative reference for overall performance as applications were virtualized. Once the application delivery stack was fully virtualized in the development environment, simulated load testing was run against the applications and another ExtraHop baseline was created. That baseline was then compared against real-time production performance when the application was rolled out.
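The baseline-then-compare workflow above can be sketched as a simple percentile check. This is a hypothetical illustration with made-up names and thresholds, not ExtraHop's analytics: it flags a regression if the candidate environment's tail latency exceeds the pre-migration baseline by more than an allowed tolerance.

```python
# Hypothetical baseline-comparison sketch; names and thresholds are
# illustrative, not taken from any product.

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]


def compare_baselines(baseline_ms, candidate_ms, p=95, tolerance=1.10):
    """Compare p95 latency of a candidate run against a baseline run.

    Returns (passed, baseline_p95, candidate_p95); the candidate passes
    if its p95 is within the tolerance (here, +10%) of the baseline.
    """
    base = percentile(baseline_ms, p)
    cand = percentile(candidate_ms, p)
    return cand <= base * tolerance, base, cand


# A steady 10 ms baseline vs. a candidate where 10% of requests stall:
ok, base, cand = compare_baselines([10.0] * 100, [10.0] * 90 + [30.0] * 10)
print(ok, base, cand)  # False 10.0 30.0 -- regression flagged
```

In the case described here, the production rollout was compared against the load-tested baseline in real time; the sketch just makes the pass/fail logic of such a comparison concrete.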
The CIO and her team estimated that this accelerated, more predictable process, grounded in factual and non-siloed data, reduced their time to production and rollout by over 30 percent. They also estimated they had spent well over $150,000 in personnel time, additional infrastructure, and vendor consulting while attempting to identify the performance problem, costs that would have continued had they not used ExtraHop.
They plan to integrate ExtraHop's wire data with their machine and agent data to correlate ExtraHop's observed behavioral analysis with the systems' self-reported and agent-reported data. The CIO anticipates that this new initiative will pay for itself in less than 12 months by reducing the number of tools in use and eliminating their maintenance costs, while providing better insight across teams and environments.