An online gaming company's CTO said that the rate of change in their environment was so fast that they were "living too close to the ragged edge." He was concerned about waste but especially about application downtime. A popular game that suffers even a blip in performance or an outage can be skewered on social forums, hurting both revenue and the brand.
The company was releasing an average of one new game per week, along with dozens of code updates for existing games, each of which had anywhere from hundreds to tens of thousands of players at any moment. Even with autoscaling procedures in place, managing an infrastructure to accommodate the right number of users without over-provisioning is tough. The team could scale only so far and needed a better alternative than throwing more people or servers at the problem.
The company tried many traditional approaches, including aggregating web server logs and deploying agents and probes to check the health and utilization of the web, application, database, and image servers. These tools were good at providing an availability picture and a host-based response time perspective, but couldn't provide a holistic view across the environment. If a developer or administrator forgot to set up log collection, deploy an agent, or configure a test probe, the team was blind to any operational information for that component.
These tools were also inconsistent in their ability to isolate the actual source of a problem across routers, firewalls, load balancers, top-of-rack switches, LDAP servers, the application stack, and backend storage. Keeping pace with the deployment, change, and retirement of even just the web tier was proving to be a burden.
The team set out the following requirements for a new solution:
- Continuously auto-discover and categorize all web servers, applications, and infrastructure components.
- Monitor everything without requiring instrumentation or application changes.
- Provide a real-time view of all end-user transactions by game, response time, errors, and total throughput.
- Reduce the time to resolution during outages by capturing comments found in error headers or payloads.
- Use usage metrics to plan capacity and to retire systems.
- Reduce reliance on logs and their associated costs.
The traditional approaches the company had tried, including web server logs, synthetic monitoring tools, and probes to check the health of the web servers, were good at giving an up/down picture but not always helpful in fixing the actual issue. Nor could they keep pace with the deployment, change, and retirement of the web tiers.
ExtraHop's auto-discovery and classification had an immediate impact. Within 15 minutes, the Operations team had a complete accounting of all active web servers along with their activity and performance levels. Within two hours of installation, the CTO had a dashboard showing response times broken out by game, total transactions (a real-time view of game popularity, traffic consumption, and trending patterns), and detailed errors for rapid remediation.
The CTO especially liked ExtraHop's real-time run book capability. Each dashboard the team created could now give the Operations team prescriptive, proactive guidance on what to do when certain conditions were observed or triggered an alert. Because ExtraHop's run book could also link to the company's internal support wiki, the insights were even more actionable. The CTO said it was like laying the foundation to transform his Tier-1 operators into Tier-3 experts.
To stay ahead of game popularity, the team created a dashboard for each game environment to chart the statistics that best indicated popularity and growth. Key metrics included Current vs Historical Per Server Process Time, Requests (per sec), Transactions (per sec), and Server Processing Time. Soon the Operations team knew which patterns to look for. A game gaining popularity would show an increase in requests, followed by an increase in server processing time, with the transaction charts growing accordingly. At a glance, the team could see which games were spiking or dipping in popularity and could rapidly schedule new resources (web, application, database, storage, and image processing) or retire excess ones.
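The pattern described above (requests rising first, then server processing time) can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not ExtraHop's implementation or API: the `GameMetrics` fields, the `classify` function, and the threshold values are all hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class GameMetrics:
    """Per-game samples. Field names are illustrative, not an ExtraHop schema."""
    game: str
    historical_rps: Sequence[float]      # requests/sec over a baseline window
    current_rps: Sequence[float]         # requests/sec over the latest window
    historical_proc_ms: Sequence[float]  # server processing time, baseline
    current_proc_ms: Sequence[float]     # server processing time, latest


def ratio(current: Sequence[float], historical: Sequence[float]) -> float:
    """Return mean(current) / mean(historical), treating an empty baseline as 1.0."""
    base = mean(historical)
    if base <= 0:
        return 1.0  # no usable baseline; treat as unchanged
    return mean(current) / base


def classify(m: GameMetrics, spike: float = 1.5, dip: float = 0.5,
             strain: float = 1.2) -> str:
    """Label a game by comparing current windows to historical ones.

    The thresholds are assumptions chosen for the example.
    """
    req = ratio(m.current_rps, m.historical_rps)
    proc = ratio(m.current_proc_ms, m.historical_proc_ms)
    if req >= spike:
        # Requests are up; rising processing time suggests servers are straining.
        return "spiking: add capacity" if proc >= strain else "spiking: watch"
    if req <= dip:
        return "dipping: candidate for retiring servers"
    return "steady"


if __name__ == "__main__":
    sample = GameMetrics(
        game="example-game",
        historical_rps=[100, 100],
        current_rps=[300, 300],
        historical_proc_ms=[40, 40],
        current_proc_ms=[60, 60],
    )
    print(sample.game, "->", classify(sample))
```

Comparing ratios against a per-game baseline, rather than against absolute numbers, lets the same rule apply to games with hundreds of players and games with tens of thousands.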
Executives used the Game Popularity and other executive-level dashboards daily to see how the business was performing. The CMO could now correlate online promotional efforts with real-time analysis of game use to understand the efficacy of marketing campaigns. He estimated that this analysis improved their target conversions by over 20% without an increase in spending.
The Operations team reduced the number of fire drills (defined as unscheduled capacity expansions) by 50 percent in one quarter and, within the same quarter, was able to definitively retire 25 percent of its web servers.
For the first time, the CTO felt they were controlling their infrastructure and web applications instead of being controlled by them. Best of all, the Operations team was able to cut on-call, after-hours firefighting time by nearly two-thirds, which, in a highly competitive labor market, was significant for employee retention.