The Data Layer is Becoming the Performance Ceiling for Agentic AI in Security Operations

May 18, 2026

Why context engineering, not model size or orchestration, will determine whether your AI agents actually work.

The shift no one budgeted for

The pitch for agentic AI in security operations is familiar by now. Autonomous agents investigate alerts, correlate telemetry, build a timeline, propose containment, and hand a clean case to a human only when needed. Most large enterprises now have at least one pilot in flight. What is becoming clear from those pilots is that the bottleneck is rarely the large language model (LLM). It is the data the model is asked to reason over.

Security telemetry pipelines were built for two consumers: storage systems and human analysts who could search them. Both are forgiving. A SIEM does not care that the same event arrives three times from three sources with three different field names. A senior analyst, given enough coffee, can mentally reconcile that. An LLM operating inside a finite context window cannot — and the cost of asking it to try is showing up in benchmarks, in token bills, and in missed detections.

This piece is about why that is happening, what the research actually says, and what security and IT operations leaders should be doing about it now, before the first wave of agentic deployments calcifies into architectures you will be paying to undo in 2027.

Part 1: The research is in, and more context is not better context

For most of the past two years, the dominant narrative around LLMs was that context windows were a scarcity problem, and the answer was simply to make them larger. A context window is the maximum amount of text — measured in tokens — that an LLM can process at one time. It functions as the model's working memory, holding the system prompt, conversation history, tool results, and generated output. Tokens also have a price tag, with commercial LLM APIs charging per million input tokens and per million output tokens.

Gemini 1.5 Pro introduced a 1M-token window in February 2024. GPT-4.1 followed in April 2025, the same month Llama 4 announced a 10M-token window. The implication was that we would eventually be able to hand a model an entire incident's worth of telemetry and let it sort things out.

The research that has accumulated since then says something more uncomfortable: the size of the window is not the limit. The signal density inside it is.

If you’re a leader signing off on AI investments, here are a few findings worth knowing:

Every frontier model degrades as input length grows. Chroma's 2025 Context Rot study evaluated 18 leading models — including GPT-4.1, Claude 4 (Opus and Sonnet), Gemini 2.5 Pro/Flash, and Qwen3 — and found that performance dropped with increasing input length on every single one. Not most. All of them. Filling a 1M-token window with low-value content was worse than using a fraction of it well.

Position inside the window matters. The 2023 Lost in the Middle analysis demonstrated a U-shaped performance curve: models are better at using relevant information that occurs at the very beginning (primacy bias) or end (recency bias) of their input context, and performance degrades significantly when they must use information in the middle. In the worst case reported, with the relevant document buried mid-context among 20 to 30 documents, GPT-3.5-Turbo scored lower than it did when answering with no documents at all.
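
One common mitigation for this position effect is to reorder retrieved items so the strongest sit at the edges of the prompt and the weakest land in the middle. The sketch below is a minimal illustration, assuming each item already carries a relevance score from some upstream ranker; the `reorder_for_edges` name and the scores are hypothetical.

```python
from typing import List, Tuple

def reorder_for_edges(items: List[Tuple[str, float]]) -> List[str]:
    """Place the most relevant items at the start and end of the context.

    items: (text, relevance_score) pairs; a higher score means more relevant.
    """
    ranked = sorted(items, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    # Alternate placement: rank 1 to the front, rank 2 to the back, and so on.
    for i, (text, _score) in enumerate(ranked):
        if i % 2 == 0:
            front.append(text)
        else:
            back.append(text)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]

docs = [("a", 0.9), ("b", 0.1), ("c", 0.7), ("d", 0.3), ("e", 0.5)]
ordered = reorder_for_edges(docs)
print(ordered)
```

Reordering only limits the damage. It does not replace removing low-value content in the first place.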

AI has an attention span problem. The 2025 NoLiMa benchmark shows that LLMs are much worse at finding information in long documents than their advertised context lengths suggest, especially when the answer isn't written using the same words as the question. Researchers tested 13 major models that claim to support contexts of 128,000 tokens or more. While the models performed well in short contexts (under 1K tokens), performance degraded significantly as context length increased: at 32K, 11 of the 13 dropped below 50% of their short-length baselines.

Models become measurably worse at distinguishing relevant signal as noise accumulates — a property that maps almost perfectly onto how SOC telemetry behaves during an active incident. The practical translation for security leaders: an AI agent handed 50,000 raw events to investigate a single detection is not in the same operating regime as the same agent handed 50 pre-correlated, deduplicated, context-enriched objects. They are different products with different accuracy profiles, even when the underlying model is identical.

Part 2: Why this hits SOCs harder than other domains

If context rot were evenly distributed across enterprise AI use cases, security operations would still be affected. But SOC workloads have three properties that amplify it.

Accumulating context. A security investigation is not a single query. It is a session — pull the detection, retrieve related events, enrich the asset, check the user's recent behavior, examine network flows, query threat intelligence. Every tool call leaves residue in the context window.

High distractor density. SOC telemetry is unusually rich in near-duplicates and semantically similar events. A single failed-then-successful login from a service account generates events from the identity provider, the endpoint, the network sensor, and the SIEM correlation engine — often with different field names and inconsistent entity identifiers. To a human, this is redundant. To a model trying to reason over it, every duplicate is a potential distractor that competes for attention.

Long-horizon tasks. Unlike simple queries, long-horizon tasks require an AI to execute a plan over many steps. A meaningful security investigation may involve dozens of tool calls and tens of thousands of tokens of intermediate state, e.g., notes, tool outputs, and prior reasoning accumulated over the session. As this context grows, maintaining consistency and tracking relevant information becomes increasingly difficult for the model.
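
To make the accumulation problem concrete, here is a rough sketch of one way an agent harness might limit tool-call residue: keep raw output only for the most recent calls and a compact summary for everything older. The class names are hypothetical, and `summarize` is a truncation stand-in for whatever summarization step a real system would use.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToolResult:
    tool: str
    output: str

def summarize(result: ToolResult, max_chars: int = 80) -> str:
    # Stand-in for a real summarizer: tool name plus truncated output.
    return f"[{result.tool}] {result.output[:max_chars]}"

@dataclass
class InvestigationContext:
    keep_raw: int = 2  # how many recent results stay verbatim
    results: List[ToolResult] = field(default_factory=list)

    def add(self, result: ToolResult) -> None:
        self.results.append(result)

    def render(self) -> str:
        """Build the text that would be placed in the model's context."""
        cutoff = max(len(self.results) - self.keep_raw, 0)
        older = [summarize(r) for r in self.results[:cutoff]]
        recent = [f"[{r.tool}] {r.output}" for r in self.results[cutoff:]]
        return "\n".join(older + recent)

ctx = InvestigationContext(keep_raw=1)
ctx.add(ToolResult("get_detection", "x" * 500))
ctx.add(ToolResult("lookup_user", "svc-backup last login 02:14 UTC"))
rendered = ctx.render()
print(len(rendered))
```

The older 500-character result shrinks to 80 characters while the latest result stays intact, so the context grows far more slowly than the raw session history.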

Layered on top of all of this is the underlying alert-volume problem the SOC was already failing to solve. The numbers vary by source, but the direction does not:

The SANS 2025 SOC Survey found that 66% of teams cannot keep pace with incoming alert volumes.
The SANS 2024 Detection and Response Survey reported that 62.5% of SOC teams feel overwhelmed by data volume, with duplicate alerts cited as a leading cause.
Prophet Security’s State of AI in Security Operations 2025 puts the typical enterprise SOC at roughly 960 alerts per day, rising to 3,000+ for organizations over 20,000 employees, with 40% of alerts going uninvestigated.
The Verizon 2025 Data Breach Investigations Report showed that in 96% of breaches, attackers — not the security team — disclosed the incident.

The honest reading of this is that the SOC has had a data quality problem for a long time. The arrival of agentic AI does not create the problem; it changes how the cost is paid. What analysts once absorbed in time and burnout now shows up as token spend, latency, and reasoning errors.

Part 3: What “data that already thinks” actually means

The architectural response taking shape across the industry is something like this: instead of asking an AI agent to reason over raw event streams, give it telemetry that has already been normalized, correlated, deduplicated, and enriched with context — risk scoring, entity relationships, behavioral interpretation, MITRE ATT&CK mapping — before the agent ever sees it. The data layer is being asked to do reasoning work that used to live in the analyst's head, and now needs to live in the agent's input.

Context engineering is the practice of structuring, selecting, and managing the information provided to an AI model so it can perform a task effectively over time. Given that LLMs are constrained by a finite attention budget, good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome.
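
As a toy illustration of the "smallest set of high-signal tokens" idea, the sketch below greedily picks candidate context items by signal per token until a token budget is spent. The `select_context` helper, the token counts, and the signal scores are all hypothetical; a real system would estimate them with a tokenizer and a relevance model, and greedy selection is a heuristic rather than an optimal solution.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str
    tokens: int    # estimated token cost of including this item (must be > 0)
    signal: float  # assumed relevance score, higher is better

def select_context(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Greedy selection by signal-per-token under a token budget."""
    ranked = sorted(candidates, key=lambda c: c.signal / c.tokens, reverse=True)
    chosen: List[Candidate] = []
    used = 0
    for cand in ranked:
        if used + cand.tokens <= budget:
            chosen.append(cand)
            used += cand.tokens
    return chosen

pool = [
    Candidate("correlated_detection", tokens=400, signal=0.95),
    Candidate("asset_criticality", tokens=100, signal=0.60),
    Candidate("raw_firewall_logs", tokens=6000, signal=0.30),
    Candidate("user_risk_summary", tokens=250, signal=0.70),
]
picked = select_context(pool, budget=1000)
print([c.name for c in picked])
```

With a 1,000-token budget, the three dense items fit and the raw log dump is left out.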

Organizations are concluding that the cheapest place to fix data quality is upstream, before it hits the analytics layer — and certainly before it hits a model. Two developments illustrate this. First, OCSF is gaining traction. The Open Cybersecurity Schema Framework is an open-source, vendor-agnostic standard designed to normalize security telemetry data. Second, data pipeline tools are moving enrichment upstream. Products such as Cribl and Databahn are increasingly pitched on the basis of pre-ingestion normalization and enrichment rather than just volume reduction.

The architectural pattern that is emerging looks something like this:

Collect at the edge.
Normalize to a common schema as early in the pipeline as possible.
Deduplicate and correlate before storage, so a single security-relevant event is represented as one object with linked evidence, not a dozen near-duplicates.
Enrich with the context an agent would otherwise have to derive: asset criticality, user identity resolution, network topology, threat intelligence overlays, ATT&CK technique mapping where applicable.
Expose to agents through APIs that return semantically dense context — e.g., structured metadata, detections — not raw log streams.
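
The normalize, correlate, and enrich steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the field mappings, the common field names, and the criticality lookup are invented for the example and are not taken from OCSF or from any vendor's schema.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-source mappings from vendor fields onto a small common schema.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "idp": {"userName": "user", "clientIp": "src_ip", "result": "outcome"},
    "endpoint": {"account": "user", "remote_addr": "src_ip", "status": "outcome"},
}

# Hypothetical enrichment source: asset criticality keyed by account name.
ASSET_CRITICALITY = {"svc-backup": "high"}

def normalize(source: str, raw: Dict[str, str]) -> Dict[str, str]:
    mapping = FIELD_MAPS[source]
    event = {common: raw[vendor] for vendor, common in mapping.items()}
    event["user"] = event["user"].lower()  # resolve case-variant identifiers
    event["source"] = source
    return event

def correlate(events: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Collapse events sharing (user, src_ip, outcome) into one object with linked evidence."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["user"], ev["src_ip"], ev["outcome"])].append(ev["source"])
    return [
        {"user": u, "src_ip": ip, "outcome": o, "evidence_sources": sorted(srcs)}
        for (u, ip, o), srcs in groups.items()
    ]

def enrich(obj: Dict[str, object]) -> Dict[str, object]:
    return {**obj, "asset_criticality": ASSET_CRITICALITY.get(obj["user"], "unknown")}

raw_events = [
    ("idp", {"userName": "SVC-Backup", "clientIp": "10.0.0.5", "result": "success"}),
    ("endpoint", {"account": "svc-backup", "remote_addr": "10.0.0.5", "status": "success"}),
]
objects = [enrich(o) for o in correlate([normalize(s, r) for s, r in raw_events])]
print(objects)
```

Two differently shaped source events become a single enriched object with both sources listed as evidence, which is the kind of payload an agent-facing API would return instead of the raw logs.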

None of this is novel. Mature SOCs have been moving in this direction for years. What is new is that the economics now reward it twice: once in analyst time, and again in model accuracy and token cost.

Part 4: The economic argument that closes the deal

For most director-level readers, the cleanest way to make this case to a CFO is to frame it as cost optimization rather than capability investment.

As an illustrative example, a reasoning task that consumes 80,000 tokens of context to investigate one alert is roughly 4–8x more expensive than the same task consuming 10,000 tokens of pre-enriched context, depending on the model and pricing tier. At a few hundred investigations per day across a mid-sized SOC, the difference can reach six figures annually in inference spend alone, particularly when that context is re-sent on each of the many model calls in an investigation. That is before factoring in the accuracy improvements documented in the research above.
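
The arithmetic behind that comparison is easy to rerun with your own numbers. The sketch below is a back-of-the-envelope calculation only: the price per million tokens, the number of model calls per investigation, and the daily volume are all assumptions chosen for illustration, not quoted rates or measured figures, and output tokens are ignored.

```python
def annual_input_cost(tokens_per_call: int, calls_per_investigation: int,
                      investigations_per_day: int,
                      usd_per_million_tokens: float) -> float:
    """Yearly input-token spend, assuming the full context is re-sent on every call."""
    tokens_per_year = (tokens_per_call * calls_per_investigation
                       * investigations_per_day * 365)
    return tokens_per_year / 1_000_000 * usd_per_million_tokens

# Every value below is a hypothetical input.
PRICE = 3.00   # USD per million input tokens
CALLS = 10     # model calls per investigation
PER_DAY = 300  # investigations per day

raw_cost = annual_input_cost(80_000, CALLS, PER_DAY, PRICE)
lean_cost = annual_input_cost(10_000, CALLS, PER_DAY, PRICE)
savings = raw_cost - lean_cost
print(f"raw: ${raw_cost:,.0f}  lean: ${lean_cost:,.0f}  savings: ${savings:,.0f}")
```

Under these assumptions the gap is about $230,000 a year. The result is sensitive to the inputs: with a single model call per investigation the same gap is closer to $23,000, so the number of times context is re-sent matters as much as its size. The input-token ratio here is 8x; output tokens, which are the same in both cases, pull the blended ratio toward the lower end of the range.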

Telemetry hygiene is becoming an AI cost-optimization strategy. It also happens to be a detection-quality strategy and a workforce-retention strategy. Few security investments line up that cleanly.

Part 5: What to do about it in the next two quarters

For directors of security or IT operations evaluating where to spend attention before agentic AI deployments lock in:

Audit your telemetry as if a model were going to read it. Pick a recent incident. Pull every event that contributed to the investigation. Count duplicates, inconsistent entity identifiers, missing fields, and unjoined relationships. The gap between that raw pile and the clean set of facts the investigation actually needed is your model's overhead, and your bill.
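
A first pass at that audit can be a short script over the exported events. The sketch below counts exact duplicates, missing required fields, and user identifiers that differ only by case or domain prefix. The required-field list and the canonicalization rule are assumptions for illustration, and unjoined relationships are not covered here.

```python
from collections import Counter
from typing import Dict, List, Optional

REQUIRED_FIELDS = ("user", "src_ip", "timestamp")  # hypothetical minimum schema

def audit(events: List[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Count duplicates, missing fields, and inconsistent user identifiers."""
    # Duplicates: events identical on every required field.
    keys = Counter(tuple(ev.get(f) for f in REQUIRED_FIELDS) for ev in events)
    duplicates = sum(n - 1 for n in keys.values())

    # Missing fields: required fields that are absent or empty.
    missing = sum(1 for ev in events for f in REQUIRED_FIELDS if not ev.get(f))

    # Inconsistent identifiers: spellings that collapse to the same user
    # once case and a leading domain prefix are stripped.
    spellings: Dict[str, set] = {}
    for ev in events:
        user = ev.get("user")
        if user:
            canonical = user.split("\\")[-1].lower()
            spellings.setdefault(canonical, set()).add(user)
    inconsistent = sum(1 for variants in spellings.values() if len(variants) > 1)

    return {"events": len(events), "duplicates": duplicates,
            "missing_fields": missing, "inconsistent_users": inconsistent}

sample = [
    {"user": "jdoe", "src_ip": "10.0.0.5", "timestamp": "2026-05-01T02:14:00Z"},
    {"user": "jdoe", "src_ip": "10.0.0.5", "timestamp": "2026-05-01T02:14:00Z"},
    {"user": "CORP\\JDoe", "src_ip": None, "timestamp": "2026-05-01T02:14:03Z"},
]
report = audit(sample)
print(report)
```

Run against a real incident export, the counts give a rough measure of how much of what an agent would read is redundant or unresolved.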

Demand semantically dense APIs from your detection vendors. If your SIEM, XDR, NDR, or EDR platform exposes only raw event streams to agents, you are pre-committing to context-window problems. Ask for investigation-level or detection-level objects with embedded context.

Push enrichment upstream. Asset context, user identity resolution, and threat intelligence overlays belong in the pipeline or pre-correlated objects, not in the agent's prompt.

Measure agent performance with the data layer in mind. Token consumption per investigation, accuracy on held-out cases, and false-negative rates on known, confirmed incidents are more useful than generic "AI productivity" metrics. They will also tell you, faster than anything else, when your data layer is becoming the ceiling.
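
These three metrics are simple to compute once each agent run is logged against ground truth. The sketch below assumes a hypothetical `CaseResult` record per held-out case and a non-empty list of runs; the field names and sample values are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CaseResult:
    is_incident: bool    # ground truth from the held-out set
    agent_flagged: bool  # what the agent concluded
    tokens_used: int     # total tokens consumed by the investigation

def summarize_runs(runs: List[CaseResult]) -> Dict[str, float]:
    total = len(runs)
    correct = sum(r.is_incident == r.agent_flagged for r in runs)
    incidents = [r for r in runs if r.is_incident]
    missed = sum(not r.agent_flagged for r in incidents)
    return {
        "accuracy": correct / total,
        # Share of real incidents the agent failed to flag.
        "false_negative_rate": missed / len(incidents) if incidents else 0.0,
        "avg_tokens_per_investigation": sum(r.tokens_used for r in runs) / total,
    }

runs = [
    CaseResult(True, True, 12_000),
    CaseResult(True, False, 48_000),
    CaseResult(False, False, 9_000),
    CaseResult(False, True, 31_000),
]
metrics = summarize_runs(runs)
print(metrics)
```

Tracking these numbers before and after a data-layer change, with the model held constant, isolates how much of the agent's performance is coming from the telemetry rather than the model.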

The bottom line

The defining constraint on agentic AI in security operations is turning out not to be model size, orchestration framework, or even agent design. It is the quality, structure, and semantic density of the data the agent is asked to reason over. The research is clear: every leading model degrades as low-signal context accumulates, and the failure modes that follow — missed relationships, hallucinated causality, blown token budgets — show up exactly where SOC operations cannot afford them.

The leaders who will get the most out of agentic AI over the next two years are not the ones with the largest models or the most ambitious autonomy claims. They are the ones who recognized early that the data layer is now the performance ceiling, and started fixing it before the bill came due.

Robyn Fisher

Principal Product Marketing Manager

Robyn is a product marketing leader specializing in AI, cybersecurity, and emerging technologies. At ExtraHop, she focuses on how network context advances autonomous security and operational resilience. Previously, she held marketing roles at Google Cloud, Amazon, and Microsoft.