Vanity vs Real Metrics in Detection & Response

The following insights come directly from our CTO and Co-Founder, Andrew Ingalls.

There are a number of metrics currently being used in detection and response. Many of them provide some measure of value, but they don’t show the entire picture. Others are truly vanity metrics.

The Coverage Illusion

One of the most useless metrics today is how companies measure coverage.

Right now, everyone uses the MITRE framework, which in and of itself is a great framework to standardize on. It provides detailed categorization on how and why attackers do what they do. It helps us, as humans who love putting things into buckets, start organizing detections.

The problem comes when people start analyzing the buckets themselves.

“I have 1 rule in T1557 and 12 rules in T1110, so I must be better covered in T1110.”

That depends. What assets do you actually have that can be attacked using Adversary-in-the-Middle versus Brute Force techniques? Do you have rules that truly detect those techniques? Are they covering the data sources they need to cover? Are you focusing on techniques that don’t even matter to your industry?

All of those questions need to be considered when dealing with coverage. Unfortunately, we often sell those raw rule counts to leadership as “coverage,” when in reality they’re a vanity metric.

Fine-grained categorization is all well and good, but many times we forget that not all coverage is the same. That’s why I recommend starting with a metric that is easier to understand: detection-to-coverage ratio. Focus on your crown jewel assets and critical data sources. It doesn’t matter if that internal tool with no access to the outside world is protected. It does matter if that critical external system with user PII is protected.

While this doesn’t tell you how well your assets are covered, you can at least tell your bosses that X% of your critical resources have some sort of protection. That percentage should increase as you build out your repository. Focusing on critical components gives you the best return.
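As a sketch of what that ratio looks like in practice (the asset inventory and detection counts here are entirely hypothetical; in a real shop they would come from your CMDB and your detection repository):

```python
# Hypothetical asset inventory: each entry records whether the asset is a
# crown jewel and how many detection rules currently reference it.
assets = {
    "external-api":  {"critical": True,  "detections": 4},
    "billing-db":    {"critical": True,  "detections": 0},
    "sso-gateway":   {"critical": True,  "detections": 2},
    "internal-wiki": {"critical": False, "detections": 1},
}

critical = [a for a in assets.values() if a["critical"]]
covered = [a for a in critical if a["detections"] > 0]

# Detection-to-coverage ratio: % of critical assets with >= 1 detection.
ratio = 100 * len(covered) / len(critical)
print(f"{ratio:.0f}% of critical assets have at least one detection")
```

The internal wiki is deliberately excluded from the denominator; the whole point of the metric is that only the critical assets count.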

Mean-Time Metrics

Then we have the Mean-Time metrics: To-Detect, To-Acknowledge, To-Respond, To-Contain, To-Resolve.

I wouldn’t consider any of these metrics “vanity,” because they all provide insight into an adversary’s dwell time, a.k.a. how long someone has access to your system.

Unfortunately, these metrics can be found wanting.

The first reason is obvious. A mean can be heavily skewed by a single outlying data point. If you’re trying to understand what to typically anticipate for your company, the median provides a truer middle number, one not weighted by that one time it took 24 hours to detect an incident. That doesn’t mean you ignore those incidents. In fact, it’s often better to display them separately as outliers and explain the difference. They highlight gaps you may have since fixed.
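The skew is easy to demonstrate with the standard library. The detection times below are made up, but the pattern is the one you see in practice: a handful of quick detections and one 24-hour miss.

```python
from statistics import mean, median

# Hypothetical times-to-detect in minutes; one 24-hour (1440 min) outlier.
detect_minutes = [4, 6, 5, 7, 5, 6, 1440]

print(f"mean:   {mean(detect_minutes):.1f} min")  # dragged up by one incident
print(f"median: {median(detect_minutes):.1f} min")  # the typical case

# Report extreme values separately instead of letting them hide in the mean.
outliers = [m for m in detect_minutes if m > 10 * median(detect_minutes)]
print("report separately as outliers:", outliers)
```

Six of the seven detections landed within minutes, yet the mean lands over three hours; the median stays at the number an analyst would actually recognize.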

The second issue is definition. As time-based metrics, the measurement period must be consistent and clearly understood. Does the clock start when the bad actor first attempts something malicious? When the tool first observes the event? Or when the detection fires? The ambiguity becomes even more complicated when discussing acknowledgement and resolution.

You need clear definitions of an event, detection, acknowledgement, response, and resolution for these metrics to be meaningful.

Our definitions are straightforward. An event is the first time malicious or suspicious activity occurs in your system.

A detection is when your system alerts you to that event.

Acknowledgement is when your team claims the alert and begins investigating.

Response is when the team determines whether the alert is benign, a false positive, or malicious and takes action.

Resolution is when the incident is closed and the threat has been neutralized.

Without alignment on those definitions, the metrics lose meaning.
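Once the definitions are pinned down, the time-based metrics fall out of simple timestamp arithmetic. A minimal sketch, where the timeline is hypothetical and the choice of which clock each metric starts from (detect from the event, acknowledge from the detection) is one reasonable convention rather than a standard:

```python
from datetime import datetime

# Hypothetical incident timeline, using the definitions above.
t = {
    "event":        datetime(2024, 5, 1, 9, 0),   # first malicious activity
    "detection":    datetime(2024, 5, 1, 9, 12),  # the alert fires
    "acknowledged": datetime(2024, 5, 1, 9, 30),  # an analyst claims the alert
    "resolved":     datetime(2024, 5, 1, 13, 0),  # threat neutralized
}

def minutes(start, end):
    """Elapsed minutes between two points on the timeline."""
    return (t[end] - t[start]).total_seconds() / 60

print("time to detect:     ", minutes("event", "detection"), "min")
print("time to acknowledge:", minutes("detection", "acknowledged"), "min")
print("time to resolve:    ", minutes("event", "resolved"), "min")
```

Change any of those start points, say, starting the detection clock at tool observation instead of the event, and every downstream number shifts, which is exactly why the definitions have to be agreed on first.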


Precision, Recall, and What You’re Missing

Precision is another one you hear a lot. As a Data Scientist, I can get behind this one the most. However, while it has good intent, it misses a big part of the story: false negatives. What real attacks are your detections missing? That’s the gap you need to understand.

Because yes, while we want to reduce the number of times our detections cry wolf, what everyone should be worried about are the times they don’t… and your sheep get eaten.

Recall is the other half of that equation. Do you care more about finding all positive cases at the cost of false positives, or do you care more about not overwhelming engineers while potentially missing some?

In machine learning, we use a metric known as the F1-score. It forces the score to be low if either precision or recall is low, balancing the importance of these two metrics and far outclassing raw accuracy as a classification metric.
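These are the standard textbook formulas; the confusion counts below are hypothetical numbers for a single rule, not data from any real deployment:

```python
# Hypothetical confusion counts for one detection rule.
tp = 40  # true positives: real attacks the rule caught
fp = 10  # false positives: benign events it alerted on anyway
fn = 20  # false negatives: real attacks it missed

precision = tp / (tp + fp)  # of the alerts fired, how many were right?
recall = tp / (tp + fn)     # of the real attacks, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, a rule with 0.95 precision but 0.10 recall scores poorly; neither number can paper over the other.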

The challenge is that these metrics aren’t always easy to get accurate data for. False negatives are especially difficult. While actual breaches are the key item to report on and use for this metric, you don’t want to wait for one to happen before realizing your recall is poor. The fallback would be using simulated breaches or red team validation to generate meaningful data… easier said than done though. Those are generally large undertakings that happen maybe once or twice a year. Not a great way to understand the real-time recall of a single detection.

Related to precision is alert fidelity, or signal-to-noise ratio. What percentage of alerts that reach analysts are actually actionable? If alert fatigue is hitting your team hard, this is a strong place to start trimming “rule fat.” You’ll see this metric more often than precision among teams, and that’s because it focuses on operational and actionable events.
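Computing fidelity per rule, rather than as one aggregate, is what points you at the “rule fat” to trim. A sketch with a hypothetical alert stream and made-up rule names:

```python
from collections import Counter

# Hypothetical alert stream: (rule_name, analyst_found_it_actionable).
alerts = [
    ("brute-force", True), ("brute-force", False), ("brute-force", True),
    ("dns-tunnel", False), ("dns-tunnel", False), ("dns-tunnel", False),
    ("geo-anomaly", False), ("geo-anomaly", False),
]

totals, actionable = Counter(), Counter()
for rule, was_actionable in alerts:
    totals[rule] += 1
    actionable[rule] += was_actionable  # bool counts as 0 or 1

for rule in totals:
    fidelity = 100 * actionable[rule] / totals[rule]
    flag = "  <- tuning candidate" if fidelity < 25 else ""
    print(f"{rule}: {fidelity:.0f}% actionable{flag}")
```

Rules that generate pure noise surface immediately, while the aggregate number alone would only tell you the team is tired.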

Days to Patch and Volume Metrics

Days to Patch (D2P) is yet another commonly reported metric. It measures the time between when a vulnerability is disclosed and when your organization applies that patch. It’s a good indicator for how well your company responds to vulnerabilities.

But again, averages can be seriously skewed by outliers. Median days to patch, grouped by service criticality, is more robust. You can patch hundreds of low-priority internal systems quickly, while one critical external system remains vulnerable for months. The average will not tell you the story you actually care about. You care that your one critical system remained exposed for months, not that your team focused on low-priority, low-hanging fruit. Well… actually, that’s another issue entirely.
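Here is the skew in miniature, with hypothetical patch records. Grouping by criticality surfaces the exposure that the overall median buries:

```python
from collections import defaultdict
from statistics import median

# Hypothetical patch records: (service criticality, days from disclosure
# to applied patch).
patches = [
    ("low", 2), ("low", 3), ("low", 1), ("low", 2), ("low", 4),
    ("critical", 95), ("critical", 60),
]

by_tier = defaultdict(list)
for tier, days in patches:
    by_tier[tier].append(days)

print("overall median D2P:", median(d for _, d in patches), "days")
for tier, days in by_tier.items():
    print(f"{tier}: median {median(days)} days to patch")
```

The overall median looks healthy because the low-priority fleet dominates the count, while the critical tier’s median tells the story leadership actually needs to hear.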

D2P can also provide a false sense of security. It only deals with known vulnerabilities that have available patches. What about all those vulnerabilities that don’t have patches? Do you have detections for them?

I’ll limit myself here, but I truly hate most metrics that just provide “number of…”

For example, “number of incidents reported” is another metric that gets attention. These kinds of pure volume metrics can provide some insight, but they are incredibly easy to misinterpret. If you see a spike in incidents, did your security suddenly worsen, or did you gain more coverage? It’s hard to benchmark, hard to trend meaningfully, and it treats incidents equally, which we know isn’t realistic.

It’s the lack of context that really makes these types of metrics meaningless. Even segmenting the data tells a better story: number of incidents reported by detection source, by severity. You could map them to MITRE techniques and track the distribution over time. Are you seeing a shift from brute force attempts to credential theft? Now you’re cooking with threat intelligence. Overlay it on your basic coverage. If you are seeing few incidents in an area with strong coverage, that’s a success story. If your incidents are grouped around a few gaps, that’s an easy area for improvement.
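The brute-force-to-credential-theft shift above can be made visible with nothing more than a counter per quarter. The incident data is invented; T1110 (Brute Force) and T1555 (Credentials from Password Stores) are real MITRE ATT&CK technique IDs:

```python
from collections import Counter

# Hypothetical incidents tagged with MITRE ATT&CK technique IDs.
q1 = ["T1110", "T1110", "T1110", "T1555", "T1110"]
q2 = ["T1555", "T1555", "T1110", "T1555", "T1555", "T1555"]

for label, incidents in (("Q1", q1), ("Q2", q2)):
    dist = Counter(incidents)
    print(label, dict(dist))
```

The same raw count of incidents in both quarters would look flat; the distribution shows the adversary’s focus moving, which is the signal worth briefing.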


Cost Per Incident

Cost per incident is often a big one for leadership. Everything eventually translates to dollars gained or spent. It helps quantify financial impact and justify tooling investments. It should include investigation labor, downtime, lost productivity, legal or regulatory costs, and more.

As you’ve probably already said to yourself, this is extremely difficult to measure accurately, especially indirect costs like reputation damage or long-term scrutiny. Not to mention, most companies aren’t equipped to actually measure the time and resources spent resolving and reporting on the actual incident.


The Forgotten Metrics

Finally, there are the metrics that are too often forgotten.

Detection staleness is one. What is the median time between maintenance for your rules? How many rules haven’t been tested, tuned, or validated in six months? Like driving a car off the lot, the moment you publish a rule it begins to degrade.

We like this metric as a piece of the puzzle. It tells us a lot about available resources, team priorities, and most importantly, how out-of-date and useless our rules are.
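Staleness is cheap to compute if you record when each rule was last tested or tuned. A sketch with hypothetical rule names and dates, using the six-month threshold from above:

```python
from datetime import date
from statistics import median

today = date(2024, 6, 1)

# Hypothetical rules mapped to the date they were last tested or tuned.
last_touched = {
    "brute-force-login": date(2024, 5, 20),
    "dns-tunneling":     date(2023, 9, 1),
    "token-theft":       date(2024, 2, 10),
}

ages = {name: (today - touched).days for name, touched in last_touched.items()}
print("median days since maintenance:", median(ages.values()))

stale = [name for name, age in ages.items() if age > 180]  # ~6 months
print("rules overdue for validation:", stale)
```

Tracking the median alongside the list of overdue rules gives you both the trend for leadership and the concrete backlog for the team.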

Log source health and coverage is another. This is something Rilevera is keen on shouting from the rooftops. If your logs aren’t flowing, nothing downstream matters. If a log source changes its schema, your detection rules can silently break. If you’re not even sending logs for a critical source to your SIEM, it doesn’t matter how many rules claim to protect it.


The Theme

You can see the theme. Every metric has blind spots. Some more than others. That’s why Rilevera does not depend on any singular metric to determine detection quality or coverage.

What makes this worse is that each of these metrics takes significant time to continuously update. More than most teams can realistically handle on their own. New assets get added daily. Adversaries change patterns. Log schemas drift. And security ends up being the last to know about that new app that was added to the network.

If you are not continuously evaluating coverage relevance, rule quality, staleness, log health, and validation together, you are not measuring detection maturity. You are measuring fragments of it.

And fragments rarely tell the full story.