Paying Down Technical Debt in Your IT Infrastructure

The Phoenix Project co-author Gene Kim spoke at the ExtraHop sales conference in January, explaining how technical debt led to the "IT death spiral."

The idea of "technical debt" is one of the most important things I learned from The Phoenix Project. Described as "a novel about IT, devops, and helping your business win," the book has deeply influenced the IT operations community. In the simplest terms, technical debt is the result of not doing things right in the first place. Here's Erik, a lean-methodology guru in the book, describing technical debt:

"… like financial debt, the compounding interest costs grow over time. If an organization doesn't pay down its technical debt, every calorie in the organization can be spent just paying interest, in the form of unplanned work."

As illustrated in The Phoenix Project, the accumulation of technical debt results in constant firefighting and an inability to implement new projects quickly. A less recognized yet equally damaging result is the increased waste and noise in the IT infrastructure, causing:

  • Unnecessary infrastructure purchases
  • Greater load on critical resources
  • Low signal-to-noise ratio
  • Security vulnerabilities
  • More places for malware to hide

Most of the time, IT teams do not realize the extent of the waste and noise in their environment and simply assume that it is part of organic growth. Even when they suspect that waste is accumulating, it is difficult to allocate resources to investigate and fix the problems when so many other issues demand their attention. It is much like knowing we ought to exercise more and eat healthier, but only taking it seriously after a visit to the doctor.

Supporting Continuous Improvement with ExtraHop

ExtraHop makes it easier to measure and reduce the technical debt in your IT environment. By analyzing their wire data—all L2-L7 communications, including full bi-directional payloads—IT organizations can identify waste and inefficiency in their IT infrastructure, along with the details needed to fix those problems.

Many organizations use ExtraHop to support continuous improvement, applying methodologies adapted from lean manufacturing. ExtraHop's Atlas Services remote analysis reports are a perfect fit for these "lean IT" efforts. IT organizations receive regular analysis across all tiers of their environment, identifying both acute and chronic issues, and then use these reports to create work items for their kanban-type scheduling systems.

By dedicating resources to paying down their technical debt—fixing misconfigurations, adjusting settings, optimizing scripts, decommissioning legacy systems, etc.—these IT organizations are freeing up capacity, increasing goodput, addressing issues proactively, and improving signal-to-noise ratios so that it is easier to spot anomalous behavior.

Real-World Examples of Paying Down Technical Debt

The examples below show how organizations are using remote analysis reports from ExtraHop to make significant improvements to their IT infrastructure.

DNS

The chart below shows a decrease in DNS errors from August to October for one organization, dropping from an 11.6 percent error rate to less than 1 percent across their entire environment! In fact, when they first started receiving Atlas reports, this organization had a 21 percent error rate for DNS requests. DNS is often taken for granted and can be a silent killer of application performance, with failed lookups adding seconds to transactions as they resolve. Yet, shockingly, it is common to see DNS error rates as high as 50 percent in IT environments.

The red bars at the bottom show DNS errors. After problems are fixed in the middle of October, the errors drop significantly.
In August, DNS servers responded with 409,404 errors for 4.1 million DNS requests—an 11.6 percent error rate.
After the problems were fixed in October, the DNS servers responded with 15,987 errors for 3.09 million DNS requests, an error rate of less than 1 percent.
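
For readers who want to track the same metric themselves, here is a minimal sketch of the error-rate arithmetic using the counts quoted above. The function name `error_rate` is our own, and the request totals are the rounded figures from the text. With the rounded 4.1 million total, the August ratio works out to roughly 10 percent rather than the quoted 11.6 percent. The quoted figure presumably comes from exact counts or a different denominator that the text does not give.

```python
def error_rate(errors: int, requests: int) -> float:
    """Return errors as a percentage of total requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return 100.0 * errors / requests


# Error counts are quoted in the post; request totals are the rounded
# figures from the post (4.1 million and 3.09 million).
august = error_rate(409_404, 4_100_000)
october = error_rate(15_987, 3_090_000)

print(f"August:  {august:.1f}%")
print(f"October: {october:.2f}%")
```

Tracking this ratio month over month, rather than the raw error count, keeps the trend comparable even when request volume changes.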

TCP

Because ExtraHop recreates the TCP state machines for every sender and receiver in real time, the platform can understand TCP mechanisms, such as throttling. Monitoring solutions that only inspect L4 headers cannot do this. The screens below show how the previously mentioned organization decreased out-of-order segments and tinygrams by 90 percent.
In August, out-of-order segments and tinygrams were contributing to network congestion.
After the problems were fixed in October, out-of-order segments and tinygrams were reduced by 90 percent.
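
To make the two metrics concrete, here is a simplified sketch of how out-of-order segments and tinygrams could be counted for one direction of a single TCP flow. It is not how ExtraHop implements its analysis. The `FlowStats` and `analyze_flow` names and the 64-byte tinygram threshold are assumptions chosen for illustration, and the sketch ignores sequence-number wraparound and does not distinguish reordering from retransmission.

```python
from dataclasses import dataclass

# Assumed threshold: payloads at or below this size count as tinygrams.
# This is an illustrative value, not an ExtraHop setting.
TINYGRAM_MAX_PAYLOAD = 64


@dataclass
class FlowStats:
    segments: int = 0
    out_of_order: int = 0
    tinygrams: int = 0


def analyze_flow(segments):
    """Count out-of-order segments and tinygrams.

    segments: iterable of (sequence_number, payload_length) tuples,
    in arrival order, for one direction of one flow.
    """
    stats = FlowStats()
    expected_next = None  # highest sequence number seen so far, plus payload
    for seq, length in segments:
        stats.segments += 1
        if 0 < length <= TINYGRAM_MAX_PAYLOAD:
            stats.tinygrams += 1
        # A segment that starts before data we have already seen arrived late
        # (or is a retransmission; this sketch does not tell them apart).
        if expected_next is not None and seq < expected_next:
            stats.out_of_order += 1
        end = seq + length
        if expected_next is None or end > expected_next:
            expected_next = end
    return stats


# Hypothetical arrival order: the third segment arrives late,
# and the fourth carries only 20 bytes of payload.
sample = [(1000, 1460), (3920, 1460), (2460, 1460), (5380, 20)]
result = analyze_flow(sample)
print(result)
```

Keeping per-flow state like `expected_next` is what stateless header inspection lacks: each segment can only be judged late relative to what the flow has already delivered.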

HTTP

The chart below comes from a different IT organization that subscribes to Atlas Services remote analysis reports. The chart shows HTTP errors, most of which were internal server errors (HTTP status code 500), reduced by a factor of 9.5 after the problems were identified and fixed. This is a large environment with upwards of 3,000 web transactions per second at peak periods, and analyzing large amounts of data at the level of detail that ExtraHop does is no trivial task. For details on how ExtraHop does this, read our blog post, Monitoring at Scale: Questions You Should Ask Your Vendor.
HTTP errors are reduced by a factor of 9.5 after the problem is identified and fixed. In large environments, it can be difficult to analyze all transactions with sufficient detail to pinpoint problems.
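
As a rough illustration of the measurement behind a chart like this, the sketch below tallies HTTP status codes, finds the most common error code, and computes the reduction factor between two periods. The function names and the sample data are hypothetical. The samples were chosen so that the ratio matches the 9.5 figure above, and they are not the organization's actual traffic.

```python
from collections import Counter


def summarize_errors(status_codes):
    """Return (total_errors, most_common_error_code) for a list of statuses."""
    counts = Counter(status_codes)
    errors = {code: n for code, n in counts.items() if code >= 400}
    total = sum(errors.values())
    top = max(errors, key=errors.get) if errors else None
    return total, top


def reduction_factor(before: int, after: int) -> float:
    """How many times smaller the 'after' count is than the 'before' count."""
    if after <= 0:
        raise ValueError("after must be positive to compute a ratio")
    return before / after


# Hypothetical samples of 100 responses each, before and after a fix.
before = [500] * 16 + [404] * 3 + [200] * 81
after = [500] * 1 + [404] * 1 + [200] * 98

errors_before, top_before = summarize_errors(before)
errors_after, _ = summarize_errors(after)
factor = reduction_factor(errors_before, errors_after)

print(f"Most common error before the fix: {top_before}")
print(f"Errors reduced by a factor of {factor:.1f}")
```

A reduction by a factor of 9.5 means the error count fell to 1/9.5 of its earlier level, or a drop of about 89 percent.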

Database

The following chart shows an even more dramatic reduction in database errors at another organization that subscribes to the Atlas reports. After the delivery of the March report detailing the "(ORA-28000) the account is locked" errors at the database tier, the organization fixed the issue. The second graph shows that not only were errors almost eliminated, but that the variability in database processing time dropped precipitously on March 12, when the fix was implemented. Read about Oracle database monitoring with ExtraHop.
The problem causing the "(ORA-28000) the account is locked" errors is fixed on March 12, resulting in an almost complete elimination of database errors.

After the fix is implemented on March 12, database server response time is much more predictable (and fast, with responses in less than a millisecond).
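
"More predictable" can be quantified as lower variability in processing time. The sketch below compares the mean and standard deviation of two sets of response times. The millisecond values are invented for illustration, and only the sub-millisecond range of the "after" set reflects the description above.

```python
from statistics import mean, stdev


def variability(samples_ms):
    """Return (mean, sample standard deviation) of processing times in ms."""
    return mean(samples_ms), stdev(samples_ms)


# Invented processing times in milliseconds, for illustration only.
before_fix = [0.8, 4.5, 0.9, 12.0, 1.1, 7.3]
after_fix = [0.6, 0.7, 0.6, 0.8, 0.7, 0.6]

mean_before, sd_before = variability(before_fix)
mean_after, sd_after = variability(after_fix)

print(f"Before: mean {mean_before:.2f} ms, std dev {sd_before:.2f} ms")
print(f"After:  mean {mean_after:.2f} ms, std dev {sd_after:.2f} ms")
```

Watching the standard deviation alongside the mean helps catch cases where average response time looks acceptable but individual transactions swing widely.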

LDAP

The chart below shows a dramatic drop-off in LDAP requests and errors after an organization changed a general configuration that was causing a bad LDAP query. In this case, not only were fewer errors served, but load on the LDAP server was reduced by a factor of five. This is a great example of how previously unnoticed waste and inefficiency in the environment can be eliminated with ExtraHop. Although the effects of the bad LDAP query may have been tolerable to users, it was causing unnecessary load and could have masked anomalous activity indicating an acute performance issue or even a brute-force attack against the Active Directory server. Read about LDAP monitoring with ExtraHop.

A general configuration change results in a fivefold reduction in load on the LDAP server and a dramatic reduction in LDAP errors.
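
One way to surface a noisy query like this is to rank LDAP searches by volume and see how many of them fail. The sketch below assumes you already have records of (search filter, succeeded) pairs from some source. The `rank_queries` name, the record shape, and the filter strings are all hypothetical.

```python
from collections import defaultdict


def rank_queries(records):
    """Rank LDAP search filters by request volume.

    records: iterable of (search_filter, succeeded) pairs.
    Returns a list of (search_filter, requests, errors), busiest first.
    """
    totals = defaultdict(lambda: [0, 0])  # filter -> [requests, errors]
    for search_filter, succeeded in records:
        totals[search_filter][0] += 1
        if not succeeded:
            totals[search_filter][1] += 1
    rows = [(f, reqs, errs) for f, (reqs, errs) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical records: one misconfigured service account query dominates.
records = [("(uid=svc-report)", False)] * 8 + [("(uid=alice)", True)] * 2

ranking = rank_queries(records)
for search_filter, requests, errors in ranking:
    print(f"{search_filter}: {requests} requests, {errors} errors")
```

A single filter that accounts for most of the requests and most of the errors is a strong candidate for the kind of configuration fix described above.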

Make It Easier to Take the Doctor's Orders

Like taking care of our physical health, addressing the technical debt in your IT environment is easy to put off. However, ExtraHop can make continuous improvement much easier: first by providing you with the visibility you need across all tiers, and second through periodic remote analysis reports that identify low-hanging fruit for optimization and tuning.

Check out the sample Atlas remote analysis report below and then visit the web page to learn more.
