January 12, 2024
Defining Quality for Alert Triage, Investigation and Response
What Is The Best Detection and Response You Can Get?
Cybersecurity professionals face a daunting task. While the threats are endless, the talented workforce to defend against them is scarce and budgets are tight. Many CISOs realize that they cannot get top-quality detection and response; they have to settle for the best they can get given their talent and budget constraints.
To determine what’s best, you have to be able to compare two providers. But how do you do that before you sign on the dotted line? In practice, most small and medium-sized organizations end up buying MDR without doing a trial.
How Do You Benchmark Quality for Threat Investigation and Response?
In an ideal scenario, every CISO could have “really good” alert triage, investigation and response. The reality check is that any expectation of “good” must be set within available budget and resource constraints, so raising the bar within existing resources comes down to weighing the cost-effectiveness of the solutions you implement.
So, what's the journey from alert to incident and where could the biggest improvements be made? Let’s look at each phase of the process:
- Threat Detection
- Alert Triage and Investigation
- Incident Response
Breaking Down the Process From Threat Detection to Incident Response
Threat Detection and Alert Generation
It all starts with detection. Does the MDR vendor share what detections they have deployed? Do those detection use cases take into account what’s really important to the enterprise? And does the enterprise have the data sources that will trigger those detections when the relevant attack happens?
The cyber threat landscape is constantly evolving, so there can never be too many detections: the more detections you have, the lower the risk of an unnoticed attack. However, an even bigger problem than not having good detections is that the output of the detections teams already have in place either gets ignored or does not go through thorough investigation and triage. Less than 10% of the alerts an enterprise receives are real incidents, and often that number is even lower. It’s no wonder small security teams simply choose to ignore those alerts. But the right solution is to determine which alerts pose a real risk, and that requires investigation and triage.
Proper Triage and Investigation of Alerts
Alerts aren’t incidents until proven guilty, especially given the high false positive rates. To pinpoint a real threat, you first need to determine whether an alert is a false alarm, a warning, or a real incident. Investigation, which brings together more context (also known as alert enrichment), is the most crucial step in determining whether something requires additional action. It requires someone to sift through a stream of alerts, gather more information, decide whether an incident should be created, and set the urgency of the response.
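To make that concrete, here is a minimal sketch of an enrich-then-classify triage step in Python. The lookups, field names, and score thresholds are all illustrative assumptions, not any particular product’s logic:

```python
from dataclasses import dataclass, field

# Hypothetical enrichment lookups; a real SOC would query threat intel,
# asset inventory, and identity systems here.
def reputation_score(indicator: str) -> int:
    return {"203.0.113.5": 80}.get(indicator, 10)  # stubbed threat intel

def asset_criticality(host: str) -> int:
    return {"payroll-db": 90}.get(host, 20)        # stubbed asset inventory

@dataclass
class Alert:
    indicator: str   # e.g. an IP address or file hash
    host: str        # the asset the alert fired on
    context: dict = field(default_factory=dict)

def triage(alert: Alert) -> str:
    """Enrich the alert, then classify it as a false alarm,
    a warning, or a real incident from the combined context."""
    alert.context["reputation"] = reputation_score(alert.indicator)
    alert.context["criticality"] = asset_criticality(alert.host)
    score = alert.context["reputation"] + alert.context["criticality"]
    if score >= 120:
        return "incident"     # open a case, respond with urgency
    if score >= 60:
        return "warning"      # queue for analyst review
    return "false alarm"      # document and close

print(triage(Alert("203.0.113.5", "payroll-db")))  # -> incident
```

The exact scoring matters less than the shape of the process: every alert gets enriched, and every enriched alert gets an explicit, documented disposition.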
Incident Response Actions
Once an alert is investigated and confirmed as a real incident, the set of actions and next steps to take is part of the incident response process. Some incidents may have preset playbooks to help you work through them, while others may require new action steps tailored to the situation.
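As a simple illustration, a preset playbook can be as little as an ordered list of steps that a system (or an analyst) executes and documents. The playbook below is a hypothetical example, not a recommended procedure:

```python
# A minimal sketch of a preset response playbook as ordered steps.
# Step names are hypothetical examples, not a standard.
CREDENTIAL_THEFT_PLAYBOOK = [
    "disable_compromised_account",
    "force_password_reset",
    "revoke_active_sessions",
    "notify_user_and_manager",
    "document_findings_in_case",
]

def run_playbook(steps, execute):
    """Run each step in order; `execute` is whatever carries out
    (or merely logs) a step in your environment."""
    for step in steps:
        execute(step)

run_playbook(CREDENTIAL_THEFT_PLAYBOOK, print)
```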
Each of these phases can require a significant amount of resources to complete thoroughly.
Does More Money = Better Detection, Investigation and Response?
It’s a cliche: faster, better, cheaper; pick two. I would suggest a slight variation: peg the cost, then ask for better and faster. This is also the more practical approach, since in most enterprises cybersecurity budgets are finite and hard to change.
Pegging the Cost by Quantifying the Value of Risk Reduction
So, how do we measure risk reduction? It’s a tough nut to crack. However, we might have a proxy: ask the CISO this question:
“How much would you be willing to spend to make sure that every alert is thoroughly investigated and responded to in a timely manner?”
This establishes an implied value for risk reduction. The figure a CISO attaches to ensuring timely investigation and response for every alert may not be perfect, but it is a solid starting point for a better benchmark. Don’t let perfect be the enemy of good.
How to Measure Faster?
Mean time to detect (MTTD), mean time to investigate (MTTI) and mean time to respond (MTTR) are pretty common metrics. Unfortunately, more than 80% of security teams in the 100-1,000 employee range do not have these metrics, because they do not use processes and systems that can produce them in seconds.
This should be a key deliverable for every MDR provider.
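For reference, these metrics fall straight out of per-alert timestamps. The sketch below assumes a case management system that records four timestamps per alert; the field names and sample data are illustrative, and since definitions vary, MTTR is measured here from detection to completed response:

```python
from datetime import datetime
from statistics import mean

# Illustrative per-alert timestamps pulled from a case management system.
alerts = [
    {"occurred":     datetime(2024, 1, 2, 9, 0),
     "detected":     datetime(2024, 1, 2, 9, 20),
     "investigated": datetime(2024, 1, 2, 10, 5),
     "responded":    datetime(2024, 1, 2, 11, 0)},
    {"occurred":     datetime(2024, 1, 3, 14, 0),
     "detected":     datetime(2024, 1, 3, 14, 10),
     "investigated": datetime(2024, 1, 3, 14, 45),
     "responded":    datetime(2024, 1, 3, 15, 30)},
]

def mean_minutes(start: str, end: str) -> float:
    """Average elapsed minutes between two recorded events."""
    return mean((a[end] - a[start]).total_seconds() / 60 for a in alerts)

print(f"MTTD: {mean_minutes('occurred', 'detected'):.0f} min")      # 15 min
print(f"MTTI: {mean_minutes('detected', 'investigated'):.0f} min")  # 40 min
print(f"MTTR: {mean_minutes('detected', 'responded'):.0f} min")     # 90 min
```

If a provider cannot produce these numbers on demand, that itself is a data point about the maturity of their process.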
How to Measure Better?
Comparing two SOCs on the quality of detection and response, while not impossible, is not an exact science. Distinguishing between two providers is hard when the differences are minor, but when the differences are major, it is not hard to tell. Here are a few questions you can use to assess the quality of detection and response providers:
- Can you identify the major risks and threats that you want to protect against?
- Do you have the monitoring to detect those types of threats?
- Do you have a catalog of detections you have in place by sources and use cases?
- Can you map your detections to MITRE ATT&CK techniques?
- How complete is that coverage?
- How quickly are the alerts triaged?
- Do you have a quality check process in place to review the quality of investigations?
- What percentage of alerts are real incidents?
- For incidents, how quickly are you able to respond?
- Do you document what you do with every alert?
- Do you use a case management system, or is the process handled ad hoc over email and Slack (making it much harder to examine the process and measure how well it is, or isn’t, working)?
- Do you have documented playbooks in place?
- What percentage of the repetitive tasks your SOC has to do are automated?
Benchmarking Alert Triage Quality
One thing that is measurable in a SOC is alert fidelity, which captures two factors: what percentage of the alerts human analysts respond to are real threats, and how often threats turn up that were not detected early enough or were only detected by chance. The first factor measures false positives; the second measures false negatives. You can reduce your false negative rate by escalating every alert as an incident, but that is not a good idea, as it comes with a hefty cost: the human time required for in-depth analysis and response.
Similarly, you can turn off or ignore every alert, and the false positive rate drops to zero. But now you are incurring a significant risk of missing an alert that becomes an incident that turns into a breach. SOC teams have to constantly balance these two metrics given limited time and resources. Tracked together, the number of false positives and false negatives establishes a clear metric for benchmarking alert triage quality.
You can even combine both false positives and false negatives into a single metric using the following:
- False Positives: The cost of false positives is the person-time spent per analyzed alert (X minutes per alert multiplied by the cost of that analyst time).
- False Negatives: The cost of false negatives is the estimated dollar impact of a missed threat; it is likely higher per event, but it can still be given a dollar figure.
This gives you a single, unified metric for the quality of alert triage.
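As a back-of-the-envelope sketch of that combined metric (every number below is an illustrative assumption, not a benchmark):

```python
# Combine false positive and false negative costs into one dollar figure.
# All inputs are illustrative assumptions for a single quarter.
minutes_per_alert = 20            # analyst time spent per alert worked
analyst_cost_per_minute = 1.25    # loaded hourly cost of $75 / 60 min
false_positives = 900             # benign alerts analysts still triaged
false_negatives = 2               # real threats that slipped through
cost_per_missed_threat = 50_000   # estimated impact of a missed incident

fp_cost = false_positives * minutes_per_alert * analyst_cost_per_minute
fn_cost = false_negatives * cost_per_missed_threat

print(f"False positive cost:  ${fp_cost:,.0f}")            # $22,500
print(f"False negative cost:  ${fn_cost:,.0f}")            # $100,000
print(f"Combined triage cost: ${fp_cost + fn_cost:,.0f}")  # $122,500
```

Better triage, in dollar terms, is whatever lowers this combined number without quietly inflating one side to shrink the other.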
We hope this article gives you some ideas about how to measure the quality of detection and response.
Remember: what gets measured typically gets improved. Conversely, how do you know it is getting better if you can’t measure it?
Kumar Saurabh, CEO of AirMDR, has 20+ years in enterprise security, including roles at ArcSight and LogicHub.