Load Balancer Optimization

How do I know if my load balancers are performing as expected and improving user experience?

The Problem

The IT organization deployed load balancers, or application delivery controllers (ADCs), to improve availability and performance of their web applications and other key services. They had been sold on the ADCs' ability to perform Layer 7 load balancing, HTTP caching, content compression, and SSL offload, among other functions.

However, months later, they had not realized the expected improvements, and important applications still had occasional outages that were only resolved by rebooting all servers involved. The team couldn't install monitoring agents on the ADCs since the devices were locked down. Turning on logging to generate machine data added significant overhead while providing relatively little insight, because ADCs do not log every function or show the results of each one. To fully understand the impact and benefits of an ADC's services, its traffic and behavior must be observed passively on the wire.

Desired Outcome

  • Understand and prove the ADCs' benefit for availability and improved performance
  • Eliminate surprises or unknowns by continuous auto-discovery and dependency mapping
  • Isolate application performance problems faster across the entire application delivery chain
  • Continuously improve application performance, availability, and user experience
  • Limit war room time to only relevant personnel

The Solution

The ExtraHop platform automatically discovered, classified, and mapped all systems and applications communicating on the network and their dependencies. Focusing on the ADCs, the IT team began analyzing the caching, compression, TCP optimization, SSL, and load-balancing services in support of the applications to better understand the impact of those ADC services. With the ExtraHop platform, the IT team discovered several significant design and configuration issues:

Cache Analysis

  • The ADC and client computers were not caching as many resources as they should.
  • The cause was misconfigured HTTP cache control headers sent by the application servers. The backend applications had been incorrectly configured to send Cache-Control: no-cache for these resources.
  • Drilling down, the IT team could see all objects requested over time, and rapidly discern what should be cacheable by object type and size versus what was actually being cached. This data was not available in logs.
  • The IT team could easily search and sort by object type, by server, and by URI so they could configure the proper cache settings on their application servers and ADCs. They then were able to monitor the offload volume from the servers, demonstrate its benefit on improved page load times, and understand ADC cache utilization and capacity.
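The comparison described above, between what should be cacheable and what the headers actually allow, can be sketched in a few lines of Python. This is only an illustration, not ExtraHop functionality: the record layout (`uri`, `content_type`, `cache_control`), the list of static content types, and both function names are assumptions made for the example.

```python
# Hypothetical sketch: flag static objects whose Cache-Control header
# forces revalidation (no-cache) or forbids storage (no-store).
# The content types treated as "static" here are an assumption.
STATIC_TYPES = ("image/", "text/css", "application/javascript", "font/")


def parse_cache_control(header_value):
    """Split a Cache-Control header into a dict of lowercase directives."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


def find_cache_misconfigurations(responses):
    """Return the URIs of static objects whose headers block cache reuse.

    Each response is a dict with 'uri', 'content_type', and an optional
    'cache_control' key (an assumed layout for observed responses).
    """
    flagged = []
    for resp in responses:
        directives = parse_cache_control(resp.get("cache_control", ""))
        is_static = resp["content_type"].lower().startswith(STATIC_TYPES)
        blocks_reuse = "no-cache" in directives or "no-store" in directives
        if is_static and blocks_reuse:
            flagged.append(resp["uri"])
    return flagged
```

Grouping the flagged URIs by server or object type would give the same kind of sorted view the team used to decide which cache settings to change.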

Content Compression Analysis

  • The ADC was forwarding incompressible traffic through its compression engine. Depending upon traffic volume, this misconfiguration can consume upwards of 25% of an ADC's processing capacity and cause latency.
  • The IT team suspected that misconfigured and overloaded ADCs had caused several hour-long outages in the last year alone.
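To illustrate why sending incompressible traffic through a compression engine wastes processing capacity, the sketch below decides from the response headers whether compression is worth attempting, and measures how little already-dense data shrinks. The list of compressible media types and the function names are assumptions for this example, not an ADC's actual policy.

```python
import zlib

# Assumed list of media types that usually compress well.
COMPRESSIBLE_PREFIXES = (
    "text/",
    "application/json",
    "application/javascript",
    "application/xml",
    "image/svg+xml",
)


def should_compress(content_type, content_encoding=""):
    """Return True only for text-like content that is not already encoded."""
    if content_encoding:
        # Already gzip/br/deflate encoded: compressing again wastes CPU.
        return False
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.startswith(COMPRESSIBLE_PREFIXES)


def compression_ratio(payload):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload)) / len(payload)
```

Calling `compression_ratio` on repetitive HTML gives a small fraction, while high-entropy bytes (a stand-in for JPEG or already-compressed content) stay at roughly their original size, so the CPU time spent on them buys nothing.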

TCP Analysis

  • WAN-side Nagle delays had been accidentally configured, delaying transactions by up to 500ms (0.5 seconds).
  • Prior to the ADC deployment, their web tier had used the TCP_NODELAY socket option to address this issue. Offloading TCP optimization to the ADC is extremely efficient, but implementation is risky given that logs and internal ADC system information do not provide visibility into the impact of these settings.
  • By turning Nagle delays off, the IT team saw a 500ms improvement in response time.
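For reference, this is roughly what setting TCP_NODELAY looks like at the socket level. It is a minimal sketch in Python; the source does not say which language or server the web tier used, and the helper name is hypothetical.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 sends small writes immediately instead of waiting
    # to coalesce them with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Once an ADC terminates client connections, the equivalent setting has to be made in the ADC's TCP profile, since the option on the web server's socket only governs the server-to-ADC leg.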

SSL Analysis

  • Most ADCs license SSL capacity in a pay-as-you-go model. If SSL transactions exceed the licensed amount, those sessions may be dropped.
  • Most systems do not expose when SSL transactions have exceeded licensed capacity or show the user experience impact caused by dropped sessions.
  • By setting a simple threshold in ExtraHop, the IT team could easily plan and justify SSL capacity increases and proactively prevent service disruption.
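The threshold idea can be expressed as a small check against the licensed SSL rate. This is a hypothetical sketch, not ExtraHop's alerting API: the function name, the 80% warning fraction, and the use of transactions per second as the licensed unit are all assumptions.

```python
def classify_ssl_load(observed_tps, licensed_tps, warn_fraction=0.8):
    """Classify an observed SSL transaction rate against the licensed rate.

    Returns "over_license" when sessions may be dropped, "warning" when
    the rate is at or above the assumed warning fraction, else "ok".
    """
    if observed_tps > licensed_tps:
        return "over_license"
    if observed_tps >= licensed_tps * warn_fraction:
        return "warning"
    return "ok"


def ssl_capacity_alerts(samples, licensed_tps, warn_fraction=0.8):
    """Return (timestamp, tps, state) for every sample that is not "ok"."""
    alerts = []
    for timestamp, tps in samples:
        state = classify_ssl_load(tps, licensed_tps, warn_fraction)
        if state != "ok":
            alerts.append((timestamp, tps, state))
    return alerts
```

A history of "warning" samples is the kind of evidence that supports a capacity increase before any sessions are actually dropped.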

Armed with this information, the IT team began tuning the ADCs and measuring offload and performance benefits in real time. They were able to cut page load times in half using the ADC services they already owned, and they could prove the results.

A few days later, the Application Delivery and DevOps team began receiving calls about an application failure. They noticed that an abnormal number of method calls to a backend API were failing, but the front-end web servers were still responding with a 200 OK status. The web servers were returning a Service Unavailable page, something their web logs didn't record. Because the ADC was configured to mark servers as down only when they returned a 500-series error, the affected servers continued to receive traffic even though their services were unavailable. Leveraging ExtraHop's analysis, the team modified the ADC configuration and brought the web application back online within a matter of minutes.
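The failure mode above, a 200 OK status wrapped around an error page, is why a health check that looks only at status codes can miss an outage. The sketch below shows a check that also inspects the response body. It is an illustration under assumptions: the source does not say how the team changed the ADC configuration, and the function name and marker string are hypothetical.

```python
def is_backend_healthy(status_code, body,
                       failure_markers=("Service Unavailable",)):
    """Treat a server as healthy only if status AND body look healthy.

    A status-only check would pass a 200 response that carries an error
    page; scanning the body for known failure text closes that gap.
    """
    if not 200 <= status_code < 300:
        return False
    lowered = body.lower()
    # Any assumed failure marker in the body means the service is down.
    return not any(marker.lower() in lowered for marker in failure_markers)
```

Many ADCs support an equivalent idea natively, where the monitor expects a specific string in the response instead of accepting any 200.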

The team then began to incorporate the ExtraHop platform into their operational processes, leveraging the auto-mapping capability and their domain knowledge to create end-to-end application views. These dashboards correlate an observed problem on the front-end with a specific behavior on the back-end. For example, an application API called a specific database directly, bypassing the ADC. ExtraHop showed that the database connections would slowly climb until they hit a threshold, which caused the API queries to fail. The developers reviewed the related code for the API and found a resource leak in their database connection pooling code. After fixing that bug, the application stopped failing.
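The source does not describe the actual bug, but one common form of connection pool leak is a code path that borrows a connection and never returns it when a query raises an error. The sketch below shows a guard against that pattern using a context manager; the `ConnectionPool` class and its methods are hypothetical and stand in for whatever pooling code the application used.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool; connections come from a caller-supplied factory."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=1.0):
        # Raises queue.Empty if every connection is checked out, which is
        # how a leak eventually surfaces as failing queries.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the query raised.
            self._idle.put(conn)

    def idle_count(self):
        """Number of connections currently available to borrow."""
        return self._idle.qsize()
```

With this shape, callers write `with pool.connection() as conn:` and cannot forget the release step, so the count of open database connections stays flat instead of climbing toward the limit.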

User Impact

ExtraHop now plays a strategic role in the release and production process for every new feature and application rollout, providing the facts, insight, and dependency impacts of all application elements. The company not only accelerated time-to-market for new feature and product releases, but also streamlined their triage and troubleshooting workflow, capacity planning, and continuous improvement initiative. The estimated savings from productivity gains, downtime avoidance, and better utilization of infrastructure exceeded $250,000 in their first year. They also avoided an unnecessary $200,000 investment in a CDN service because of the performance gains they were able to achieve using their ADCs' acceleration features. Lastly, the IT team has sped up release cycles by 30% while releasing new services with greater reliability.