CAUSALITY Project News: High 2026 Recall and CVE Trend Data Integrations; 252 Provable Predictions to Date

So one of the FRs that came out of RSAC was to join together CVE predictions with trend data. We added a tab to the CAUSALITY web application interface to check the prediction label rating of CVEs trending there:

So far, for 2026, model recall is between 95% and 96%, meaning that 95-96% of KEV CVEs are being predicted accurately. There are 252 provable predictions in the project journal (https://github.com/opendr-io/causality/blob/main/journal.md), which are timestamped by GitHub commit history so that anyone can audit them. Every intrusion prediction, and the associated incident response it avoids, saves around a thousand hours of time that goes back to the business; 252 avoided incidents is north of a quarter million hours.
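The arithmetic behind those two figures can be sketched in a few lines. The flagged and missed counts below are hypothetical placeholders, picked only so the result lands inside the quoted range; the thousand hours per avoided incident is the rough estimate from the paragraph above.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of KEV-listed CVEs that the model had flagged."""
    return true_positives / (true_positives + false_negatives)

# Hypothetical counts for illustration only; the audited figures live in the journal.
kev_flagged = 86
kev_missed = 4
model_recall = recall(kev_flagged, kev_missed)

HOURS_PER_AVOIDED_INCIDENT = 1_000  # rough estimate quoted in the text
provable_predictions = 252
hours_saved = provable_predictions * HOURS_PER_AVOIDED_INCIDENT

print(f"recall: {model_recall:.1%}")
print(f"hours returned to the business: {hours_saved:,}")
```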

The lift also continues to improve compared to conventional metrics. Ninety percent of the KEV CVEs come from 19% of the population, which allows for more precise risk targeting as shown below. Roughly a third of KEV CVEs are critical; 45% are high or medium; and another 30% had no severity label in the first quarter of the year. NIST is discontinuing the severity label this year for all but a subset of CVEs.
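For readers who want the lift as a single number, here is a minimal sketch using the two percentages quoted above. It assumes the usual definition of lift (share of positives captured divided by share of the population selected); the project may compute it differently.

```python
def lift(share_of_positives_captured: float, share_of_population_selected: float) -> float:
    """How much better the selected subset is than picking CVEs at random."""
    return share_of_positives_captured / share_of_population_selected

# 90% of KEV CVEs come from 19% of the CVE population (figures from the text).
kev_lift = lift(0.90, 0.19)
print(f"lift over random selection: {kev_lift:.1f}x")
```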

Detection Lattices For Emerging Threats: “Dirty Frag”

Today’s scenario is probably what the future looks like. A new Linux exploit, named “Dirty Frag”, was released yesterday after the embargo elapsed. No patch or broadly available detections yet exist; guidance from security vendors, at the time of this writing, is that they are “working on related detections.” We don’t have to wait for humans to write static detection rules; we can find this today using detection lattices.

The Dirty Frag PoC lights up a detection lattice like a Christmas tree with at least three conventional behavioral detections, four machine learning detections, 21 unusual syscalls, and three signal fusions between these detection classes as shown below:

While none of these, by themselves, are sufficient to produce a critical alert, the combined set creates a strong alpha signal detection lattice. This lattice tells us that anomalous compiler activity, followed by anomalous process activity from the temp directory, was followed by a cluster of syscalls, and ultimately a root shell. Each of these alerts does sometimes come from benign outlier activity, but the confluence is too unusual to be coincidence.

There is one school of thought that this class of exploit is low or medium risk because it is not remotely exploitable without existing shell access. I think this kind of assumption works against us, and is one of the ways vuln management misses attack paths and things get popped. If an attacker is going to obtain a root shell through an exploit like this, they’re most likely going to get initial execution by looking for an RCE or RCI vuln in a web application (often a target rich environment), not a CVE in the web server process. Maybe something that has a CVE, maybe something bespoke in the application code that the owners have not noticed, or have given low priority because it yields low privilege execution. In order to see these kinds of attack paths, you need a fusion of CVE and appsec data. Explore your combined attack surface and ask two questions: 1) What can I get right now? 2) What path can I take to get something valuable?

Also, I think the conventional “scan, attack, exploit, actions” model is not necessarily how operators work. It is not always feasible to achieve actions on objectives in linear time on a target of choice if the conditions are not quite right. So some crews are “collectors” in that they collect as much persistence as they can, in as many networks and cloud accounts as they can, so that when a privesc becomes available, they can jump on it and be the first. Sometimes we do see a conventional discovery, execution, initial access, persistence, privesc, lateral movement cycle in a short window of time, but sometimes it comes from a crew or an operator who has been persisting in the environment for a while. So sometimes people are looking for discovery or initial access that happened a long time ago and think everything is fine because they are not detecting the start of a cycle.

Although I suppose model based vuln development changes all this in that the time frame will probably be compressed, and there are rapid cycles to find. The question I see is what happens if the cycles become faster than human response or even human cognition. We’re scaling up offensive (attack) art first, with AI, because the models are good at that, and nothing attracts eyeballs and sells things like dramatic attack case studies. We’re not really scaling up agentic defense yet.

This scenario – a new exploit without a patch or available detection rules – is probably going to become more and more common as AI-assisted vulnerability and exploit development continues to scale. As the volume and velocity increase, we will not always have time for humans to manually write detections. We will increasingly need robust detection lattices capable of identifying emerging threats without de novo signatures, exploit-specific rules, or even prior threat intelligence.

Hunting the Copy Fail Exploit With Detection Lattices

So the Copy Fail exploit (https://copy.fail/) is the hunt of the week. If you’re one of the few who have not heard, it is a Linux privilege escalation vuln, with a working exploit, that advertises elevation to root on any Linux distro from the last nine years which would include most fleets. I can’t confirm that but I can confirm it works very reliably on the Ubuntu instances I have seen it on. I know there is one school of thought that local privesc exploits are of lesser importance because they require starting from an existing shell. Many of the flags I have seen taken were the result of combining low-privilege remote code execution, or command injection, with local privilege escalations. These threat models are usually not visible from either a vulnerability management tool, or an appsec tool, because neither of those usually show users this kind of attack path, because neither one knows much about the other. So this one is worth hunting or at least threat modeling; the odds are this attack path exists somewhere in your environment.

This is an excellent case study for the PROTOSTAR project because, as we are seeing, there isn’t really a good conventional alert rule that can be written for this; there are simply too many moving parts and shades of grey. None of the events involved are clearly and distinctly bad enough to fire an important alert. It does, however, produce an alpha detection signal set. Here is what it looks like in what we call the tactical view, which uses a table structure for packaging lattices as questions sent to an AI agent along with the appropriate prompts and knowledge:

The exploit creates what we call an alpha signal with a perfectly matched set of conventional and machine learning detections. As you can see, there are three types of regular alerts, for execution and privilege escalation, and two machine learning detections. These cannot be joined by a single process ID or username because there are at least two of each involved. What’s more important than any of those conventional joins or correlations is the fact that the conventional and machine learning detections fired on the same events. The “detection lattice” element was created when the PROTOSTAR engine noticed that pattern.

This is what a Copy Fail detection element set looks like in what we call the visual view:

This is the detail page, in the visual view, for a Copy Fail detection element. The trinary node structure is created by the relationships between the detection elements in the set. In the northern quadrant, there are two different machine learning detections for the Python activity and some of the syscalls used by this exploit payload – both are unusual, but as most of us know, many ML anomaly detections don’t have stronger signal than the conventional alerts they try to replace. There are simply too many normal outliers. These Python events, and the associated syscalls, are unusual, but Python scripts and syscalls are not inherently suspicious enough to turn them into alerts by themselves. In the southeast quadrant there are three types of conventional rule based detections for the Python shell activity and the associated syscalls. These again are not clearly malicious, when considered individually. None of these, so far, are decidable enough to fire alerts on individually.

What we call a detection lattice is in the southwest quadrant. What this indicates is that the engine saw multiple agreements, on the very same events, by the different detection pipelines. The conventional alerts tell us these events could be threat activity, possibly even something like Copy Fail. The machine learning detections tell us that these events are extremely unusual for this compute instance. One detection lattice is enough to create what we call a beta signal – essentially meaning suspicious activity is happening that is far outside what normal looks like. Multiple detection lattices are enough to create what we call an alpha signal. An alpha signal is usually a true positive detection because this many strange things, that resemble potential threat activity, happening in the same place, at the same time, by pure chance, is simply improbable.
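The agreement logic described above can be sketched roughly as follows. This is not the PROTOSTAR engine; the `Detection` record, the pipeline labels, and the sample detection names are hypothetical, and the only rules taken from the text are that a lattice is an event both pipelines fired on, one lattice gives a beta signal, and multiple lattices give an alpha signal.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    event_id: str  # identifier of the raw event the detection fired on
    pipeline: str  # "rule" (conventional) or "ml" (anomaly model)
    name: str

def find_lattices(detections):
    """Return the event ids where the conventional and ML pipelines both fired."""
    pipelines_by_event = defaultdict(set)
    for det in detections:
        pipelines_by_event[det.event_id].add(det.pipeline)
    return sorted(e for e, seen in pipelines_by_event.items() if {"rule", "ml"} <= seen)

def signal_level(lattice_count):
    """One lattice is a beta signal; more than one is an alpha signal."""
    if lattice_count >= 2:
        return "alpha"
    if lattice_count == 1:
        return "beta"
    return "none"

# Hypothetical detections for a single compute instance.
detections = [
    Detection("evt-1", "rule", "python spawned shell"),
    Detection("evt-1", "ml", "rare process for host"),
    Detection("evt-2", "rule", "privilege escalation"),
    Detection("evt-2", "ml", "rare syscall sequence"),
    Detection("evt-3", "rule", "execution from temp directory"),
]
lattices = find_lattices(detections)
print(lattices, signal_level(len(lattices)))
```

Note that the join key here is the event itself, not a process ID or username, which matches the point made earlier that those conventional joins do not hold for this exploit.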

The syscall events and associated detections here are something new we just added to our endpoint project. We’re going to start logging more types of syscalls using something like bpftrace (very carefully, as we generally don’t want to log every single syscall, particularly the ones that are already evented). The process detection elements, both conventional and machine learning, can be created today without eventing the syscalls used by Copy Fail.

Show and Tell: The DUNE Project Mark Three

So this video (https://www.youtube.com/watch?v=WDERBaYL05U) provides an overview of the DUNE (detection of unknown novel events) project that was being talked about on LinkedIn after being shown at the Boston area meetups. It was originally presented at RSA 2024. In summary, the project consists of these mark three versions of the notebooks used for applying machine learning to hunting detection resistant threats. The mark three notebooks have increased performance and additional integrity checks to ensure the scored data is matched precisely to the raw event data.

Jupyter

  • Silhouettes is a stand-alone notebook for calculating the optimal value of k, prior to running k-means
  • k-means-mark3- is a notebook for running k-means on a dataframe using the silhouette score as an input
  • pyod-mark-3 is a notebook for running an ensemble model on a dataframe using pyod.
  • viewer-mark-3 is a tool for aggregating, sifting and querying the output of the models in a series of interactive widgets.
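To show the idea behind the first two notebooks, here is a small pure-Python sketch of picking k by silhouette score and then clustering with k-means. It is not the notebook code (which presumably leans on a library such as scikit-learn); the deterministic starting centroids, the helper names, and the toy data are all assumptions made for the example, and it expects k of at least 2.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (centroids spread over the sorted data)."""
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette: near 1 is well separated, near 0 or below is poorly clustered."""
    clusters = set(labels)
    total = 0.0
    for i, p in enumerate(points):
        def mean_dist(cluster):
            d = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] == cluster and j != i]
            return sum(d) / len(d) if d else None
        a = mean_dist(labels[i])
        if a is None:  # a singleton cluster scores 0 by convention
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: three tight groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```

Once k is chosen, the points that sit far from every centroid, or in very small clusters, are the ones worth a hunter's attention.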

Cloudtrail

The dashboards folder contains anomaly detection dashboards for Elastic and Splunk that do not require Jupyter. The Cloudtrail, Flow Logs, and Kubernetes folders contain notebooks for working with those data types. There are some new notebooks in the Cloudtrail folder:

  • the two s3-ingestor notebooks contain code to read CloudTrail logs directly from an S3 bucket into a dataframe, including unzipping the files. They assume the notebook is running on a Jupyter instance that can reach the buckets using the s3 client via either an IAM user or an AssumeRole.
  • aws-ip-discovery is a notebook that will enumerate IP addresses associated with an account, for cases when the question is whether an IP address is associated with your account or not, in order to ascertain if the source is someone else’s account.
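As a rough idea of what the S3 ingestion step involves, here is a minimal sketch. It is not the notebook code; the function names are made up for the example, and `read_bucket` assumes boto3 is installed and that credentials are already available to the session. CloudTrail delivers gzipped JSON objects with a top-level `Records` array, which is what the parser relies on.

```python
import gzip
import json

def parse_cloudtrail_object(raw: bytes) -> list:
    """Decompress one gzipped CloudTrail log object and return its event records."""
    return json.loads(gzip.decompress(raw)).get("Records", [])

def read_bucket(bucket: str, prefix: str) -> list:
    """Collect records from every .json.gz object under a prefix."""
    import boto3  # imported here so the parser above works without AWS access
    s3 = boto3.client("s3")
    records = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json.gz"):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                records.extend(parse_cloudtrail_object(body))
    return records

# Local check with a synthetic log object; no AWS access is needed for this part.
sample = gzip.compress(json.dumps(
    {"Records": [{"eventName": "GetObject", "sourceIPAddress": "198.51.100.7"}]}
).encode())
events = parse_cloudtrail_object(sample)
print(events[0]["eventName"])
```

The resulting list of dicts can be handed to `pandas.DataFrame` or `pandas.json_normalize` to get the dataframe the other notebooks expect.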

Introducing Protostar

(originally code named “skynet” if you saw it at DEF CON or Blackhat MEA 2024)

Why do we ship products that create alert fatigue?

Three years ago, a team member asked that question. As is often the case when the best questions are asked, there was no clear answer from the room containing something like a millennium of assembled experience in security product R&D. It left a lasting impression because we realized we had stopped asking that question. To the youngest team member, unencumbered by the weight of precedent, it was obvious. To us, it had become partly invisible, as we answered a stream of requests for short-term relief and quick fixes.

Alerts historically have been the result of a logical test, usually called a rule, outputting a boolean true / false decision on an input; something is either good or bad. The thing in question is usually a data blob comprising a log or email message. It could be an artifact from memory or a file system. The decision has to be made more or less immediately, using only the available input to the rule. Each true output is packaged into an alert for the users to read and the false outputs are mostly discarded. The problem is that too many of the things the rules say are bad could or should have been labeled good or benign. One way to look at this is that it is an issue of quality or tuning and the logical tests in the rules need refinement or additional branching logic; this leads to detection engineering. Another approach is to build complex post-processing systems that try to sort out the true and false positives using statistical analysis or machine learning. Both of these approaches, while often productive, create significant work streams for human analysts that are often incomplete due to their time and effort cost. The latest, and perhaps most fashionable approach is to use agents and large language models to offload the work of alert decisioning from the humans. 

All of these approaches have one assumption in common: the assumption that we have to work with what we have, a list of alerts, a stream of relatively small discrete messages created by rules. This paradigm dates back to the first security systems of the 1990s when larger, more complex data structures were less practical. The alerts we have today are much more evolved, of course, but most of them are still a stream of Boolean true / false decisions on a small message, event, or data artifact. We have long known of the base rate fallacy and the sensitivity/specificity paradox. Alert rules, like people, cannot be in two places at the same time. A single rule can be optimized to minimize false positives, or to maximize true positives, but not both at once.
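A quick worked example of the base rate fallacy shows why even a very good single rule floods the queue. The numbers are illustrative assumptions, not measurements: one malicious event in ten thousand, and a rule that is 99% sensitive and 99% specific.

```python
def alert_precision(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability that a fired alert is a true positive (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Illustrative assumptions: 1 in 10,000 events is malicious; the rule catches 99%
# of those and raises a false alarm on 1% of benign events.
precision = alert_precision(prevalence=0.0001, sensitivity=0.99, specificity=0.99)
print(f"share of alerts that are real: {precision:.2%}")
```

Under those assumptions fewer than one alert in a hundred is real, even though the rule itself looks excellent on paper.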

Imagine a criminal trial where each piece of evidence, from the case, is shown to a single juror, who then makes a decision. The next piece of evidence is shown to another juror, who makes a decision based on that. There would be different verdicts from the jurors with some returning guilty or not guilty as best they could decide from what they were shown. A good number of them might return no decisions, citing insufficient information to form the basis for a decision. However we calculate the final verdict, after tallying the votes, we can easily imagine how inaccurate this would be, given the difficulty of making such a large decision from a single piece of information. We would never do this, in real life, but the analysis of individual alerts, one at a time, by security teams today, can be likened to this. 

what would we do if we had a blank slate? score the graphs, not the alerts

If we were going to do something different, what would it look like? One option is to plot alerts on graphs instead of putting alerts into tables. Plotting masses of alerts on graphs has been done, but a simple alert graph will quickly become overloaded beyond readability. Drawing alerts, in a graph, doesn’t necessarily have more information gain than a conventional list format. There is more untapped information in the graph, however, in the form of relationships. Most entities  – endpoints, servers, roles and users – in such a graph don’t have relationships with most of the others, so putting everything in one graph can again have low information gain and high density. 

Suppose we make a series of graphs instead; a separate graph for each entity, with each entity at the center, and the detection artifacts – alerts created by rules and by machine learning models – are the edges. Call each detection type a relationship, and add booster relationships for the entities with multiple detection relationships. Add another boost for detection relationships where different detection types agree on the same original input. Two or three witnesses are better than one. Add additional boosts for complementary ATT&CK tactics, machine learning based recommendations, agentic analysis, or local user sentiment. Now, instead of scoring the alerts individually, we score the detection graph for each entity, not relying on the scores or severities of the alerts themselves, but instead using the number and strength of the detection relationships in the sub-graph for the entity. Next up: some examples of how this is a better solution, at lower cost, to alert fatigue.
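Here is one way the per-entity scoring could be sketched. The weights, the tuple layout, and the sample detections are all hypothetical; the text does not publish the real values, so treat this as an illustration of the shape of the idea rather than the actual scoring function. Alert severities are deliberately absent from the inputs.

```python
from collections import defaultdict

# Hypothetical weights chosen for the example only.
BASE, MULTI_TYPE_BOOST, AGREEMENT_BOOST, TACTIC_BOOST = 1.0, 0.5, 2.0, 0.5

def score_entities(detections):
    """detections: iterable of (entity, detection_type, pipeline, input_id, tactic)."""
    by_entity = defaultdict(list)
    for det in detections:
        by_entity[det[0]].append(det)

    scores = {}
    for entity, dets in by_entity.items():
        types = {d[1] for d in dets}
        tactics = {d[4] for d in dets}
        # Count inputs where more than one pipeline fired (the "witnesses agree" boost).
        pipelines_per_input = defaultdict(set)
        for _, _, pipeline, input_id, _ in dets:
            pipelines_per_input[input_id].add(pipeline)
        agreements = sum(1 for seen in pipelines_per_input.values() if len(seen) > 1)

        score = BASE * len(types)                           # one relationship per detection type
        score += MULTI_TYPE_BOOST * max(len(types) - 1, 0)  # boost for multiple relationships
        score += AGREEMENT_BOOST * agreements               # boost for agreement on the same input
        score += TACTIC_BOOST * max(len(tactics) - 1, 0)    # boost for complementary tactics
        scores[entity] = score
    return scores

detections = [
    ("host-a", "python-shell", "rule", "evt-1", "execution"),
    ("host-a", "rare-process", "ml", "evt-1", "execution"),
    ("host-a", "setuid-change", "rule", "evt-2", "privilege-escalation"),
    ("host-b", "port-scan", "rule", "evt-9", "discovery"),
]
scores = score_entities(detections)
print(scores)
```

In this toy run host-a outranks host-b because of how its detections relate to each other, not because any single alert was marked severe.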

Show and tell: agent smith


So we are seeing more and more cases of malicious code injected into shared models, skills, tools, agents, and projects. I had a case where a shared model had a number of malicious code blocks that the users had not noticed due to the size of the codebase. At the time of this writing, we are seeing proliferation of malicious skills and prompts with a variety of payloads. When run in an IDE with AI tools, many of these produce execution and C2 activity that is detectable if something is watching closely enough. The challenge is that these events will be in a haystack of benign process and network events from tasks the AI tools were given permission for by the user. One of the purposes of the ODR (open DR) project is to hunt for this class of threat activity using anomaly detection techniques and a learning informed detection pipeline that is continuously updated.

One reason to run these hunts at the local level, in addition to conventional SOC and SIEM operations, is that the definition of what normal looks like, for a particular IDE, is largely in the head of the developer. Another reason is that much of the context is on the endpoint where the activity took place, and all of that state cannot generally be logged to a SIEM due to data volume and cost. At the same time, devs cannot monitor every action taken by their tools, and they are not trained in what to look for. This feels like a job for an autonomous agent pack.

Smith is an autonomous agent pack that will process most alerts and anomaly detections generated by ODR (open DR, also in this GitHub org.) ODR mainly looks for strange things happening in your AI dev tools at present but Smith will process most alerts generated by ODR. There is a show and tell video here: https://youtu.be/lsh3JRne9sg and the project lives in a repo here: https://github.com/opendr-io/agentic-park/tree/main/smith

Smith processes raw event and alert data from the ODR (open DR) project in order to investigate alerts and anomaly detections. It outputs an analysis of each alert. It lets you know when it thinks it has a high confidence detection and engages you in a collaborative analysis conversation in order to work out whether the detected activity is benign and expected or unexpected and potentially malicious.

Some sample alert data is provided so that it can do something out of the box. Sample event data is not provided due to its size, and the need to sanitize it, but can be generated by running openDR. It has a filter layer to try to stop prompt injections from reaching the agents and one filter test case alert is included in the sample data; you will see it get “intercepted” by the filter. That is an interesting area of research and we would like to hear from both offensive and defensive researchers as we add more filtering techniques and more detections.

SHOW AND TELL: OPENDR

There are a number of reasons that organizations and networks may have no meaningful EDR or endpoint instrumentation. With or without EDR tooling, something has to be placed to the left of the equals sign when we have indicators of compromise from a compute instance. In such cases, when there is threat hunting to be done, we have to use what is at hand, or worse, attempt to talk someone through live response who is not experienced or prepared. This video gives a quick overview of the openDR project which, given Python 3, can be running in a matter of minutes with zero security knowledge. The tool currently works under Windows, Linux and macOS. Over the summer we added network event enrichment and Sigma rule support, both covered in the video:

SHOW AND TELL: THE CAUSALITY PROJECT

Expanding on this latest post (https://www.linkedin.com/posts/activity-7389333041816481792-LdEb)

How was the prediction made? How did we predict that CVE-2025-33073, published in June, would eventually be added to a known exploited vulnerability (KEV) watch-list? How can we audit that the prediction was made forward in time? This show and tell video gives an explanation of the CAUSALITY project which has generated 132 provable CVE predictions since January with a mean early warning time of 124.5 days.

The difference between exploitation detection and exploitation prediction is akin to the difference between detecting a missile launch and detecting a missile detonation – two very different outcomes. Every exploitation cycle we can avoid gives time back to dev and business teams in addition to security.

the causality project does not require sharing of vulnerability data

One of the questions we are asked is how much vulnerability data you need to hand over to use the CAUSALITY model’s CVE predictions and the answer is none. The data flow is one-way from us to you and no customer vuln data needs to go anywhere.

There is a notebook in the repo that can be used to process and rate CVE data wherever you run Jupyter. If you have rules or policies about open source code, that’s also fine, and you don’t have to use our code. The only ingredient you need from us are the ratings files and those are text, not data. It isn’t actually necessary for users to send us data because of the way the model works.
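To make the one-way data flow concrete, here is a minimal sketch of rating a local CVE inventory against a downloaded ratings file. The CSV layout, the column names (`cve`, `rating`), and the placeholder CVE ids are assumptions for the example; check the repo for the actual file format.

```python
import csv
import io

def load_ratings(handle):
    """Map CVE id -> rating label from a ratings file (column names are assumed)."""
    return {row["cve"]: row["rating"] for row in csv.DictReader(handle)}

def rate_inventory(cve_ids, ratings):
    """Label each locally known CVE; ids missing from the file are marked explicitly."""
    return {cve: ratings.get(cve, "unrated") for cve in cve_ids}

# Stand-in for a downloaded ratings file; ids and labels are placeholders.
ratings_file = io.StringIO("cve,rating\nCVE-0000-0001,hot\nCVE-0000-0002,warm\n")
ratings = load_ratings(ratings_file)

# Scanner output stays on this machine; only the ratings file came from outside.
local_findings = ["CVE-0000-0001", "CVE-0000-0003"]
rated = rate_inventory(local_findings, ratings)
print(rated)
```

Nothing in this flow makes a network call, which is the point: the lookup happens entirely where your vulnerability data already lives.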

If you just want to see the ratings, there is a Streamlit app in the /web directory with a search interface for the ratings data. Search by CVE (exact match) or other fields. Ratings are available for CVEs between January 2024 – August 2025. I have not yet rated years before 2024; hit me up if you would like me to. This also runs locally and reads in two data files; no data is sent anywhere. This is what it looks like:

The Causality project

Some time ago, one of my stakeholders said, “We may never get to zero CVEs. How can we identify the ones that matter the most?” Annual CVE volume has quadrupled over the past decade. Recent research continues to explore the challenges associated with vulnerability management, and the limitations of existing prioritization methodologies, such as severity, which are not always good predictors of exploitation and risk. [1] [2] EPSS, while more sophisticated and predictive, is an ongoing topic of discussion as to whether it predicts exploitability or exploitation. [3] Of the roughly forty thousand CVEs issued last year, fewer than one percent were added to watchlists for observed exploitation activity and we lack a methodology for targeting this subset. Having spent a good deal of time with red teams, I believe exploit selection and usage resembles tool or equipment selection in other adversarial pursuits. I would liken it to athletes choosing equipment, lawyers choosing precedents and arguments, or warfighters choosing weapons and tactics. Factors such as theaters of operations, playing fields, opponents, past experience, and bias for successful tactics used in the past, are more influential to selection than mathematical scores and metrics used by existing prioritization methodologies.

Last year, I experimented with applying a number of machine learning models to the problem of CVE prediction and arrived at one that yielded the best results, which we named CAUSALITY. This model has, at the time of this writing, produced sixty provable correct predictions. A provable prediction means that a CVE was rated “hot” or “warm” – meaning it has potential to see heavy exploitation and be watchlisted – before it was added to a watch list. The prediction lead times range between days and months. The predictions are published in a GitHub repo (https://github.com/opendr-io/causality) where anyone can audit them to verify we are making predictions forward in time by comparing the time deltas. The correct predictions made to date are summarized in the readme for the repo where the raw data is published. I am not publishing output there constantly, only enough to prove prognostication, as extraordinary claims require extraordinary evidence.
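The audit described above comes down to comparing two dates per CVE. A minimal sketch follows, with hypothetical CVE ids and dates; in practice the first date would come from the GitHub commit history and the second from the watch list entry.

```python
from datetime import date

def lead_time_days(predicted_on: date, watchlisted_on: date) -> int:
    """Days between the published rating and the watch list addition."""
    return (watchlisted_on - predicted_on).days

def is_provable(predicted_on: date, watchlisted_on: date) -> bool:
    """A prediction only counts if it was published before the CVE was watchlisted."""
    return lead_time_days(predicted_on, watchlisted_on) > 0

# Hypothetical records: (CVE id, rating commit date, watch list date).
records = [
    ("CVE-0000-0001", date(2025, 1, 10), date(2025, 3, 1)),
    ("CVE-0000-0002", date(2025, 2, 1), date(2025, 2, 4)),
    ("CVE-0000-0003", date(2025, 4, 2), date(2025, 4, 1)),  # rated after listing
]
provable = [(cve, lead_time_days(p, w)) for cve, p, w in records if is_provable(p, w)]
mean_lead = sum(days for _, days in provable) / len(provable)
print(provable, mean_lead)
```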

On the questions of sensitivity, specificity, precision and recall, I am open to suggestion. Is a prediction a false positive if it does not come true in a month? In three months? A year? The interval for the published predictions ranges from a few days to as long as 137 days. Meanwhile, the watchlists continue to upgrade CVEs from prior years, even some from the prior decade, as they are selected for weaponization by threat actors. The way I think about this is more like having an advantage in an adversarial process. If this were hockey, instead of cybersecurity, and a model could predict that most successful shots on goal would come from a subset of 8-11% of the total shots, that would increase our odds of winning the game. Prioritizing a subset of CVEs according to their potential yields a larger risk reduction at a lower cost relative to existing processes. When exploitation cycle avoidance can be realized, where the prediction lead time is sufficient, the ROI is much higher.

CVEs have interesting differences from other data domains. CVE classification differs from malware classification in that there are no benign CVEs apart perhaps from those that have been rejected or withdrawn. They are on a gradient of risk potential, and some never amount to much of anything, but their presence cannot be considered benign. Rather, the objective is to try to identify the smallest set that yields the greatest risk reduction, and to deal with those quickly enough to avoid exploitation.

[1] https://arxiv.org/abs/2302.14172: Enhancing Vulnerability Prioritization: Data-Driven Exploit Predictions with Community-Driven Insights

[2] https://arxiv.org/pdf/2508.13644v1: Conflicting Scores, Confusing Signals: An Empirical Study of Vulnerability Scoring Systems

[3] https://www.linkedin.com/posts/resilientcyber_vulnerability-scoring-frameworks-activity-7363978158439600128-oS3t?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAAZIaEBGLaE7H8r2VCTwQayr6Vq_PFIqYY