Show and Tell: The DUNE Project Mark Three

So this video (https://www.youtube.com/watch?v=WDERBaYL05U) provides an overview of the DUNE (detection of unknown novel events) project that was being talked about on LinkedIn after being shown at the Boston area meetups. It was originally presented at RSA 2024. In summary, the project consists of these mark three versions of the notebooks used for applying machine learning to hunting detection resistant threats. The mark three notebooks have increased performance and additional integrity checks to ensure the scored data is matched precisely to the raw event data.

Jupyer

  • Silhouettes is a stand-alone notebook for calculating the optimal value of k, prior to running k-means
  • k-means-mark3- is a notebook for running k-means on a dataframe using the silhouette score as an input
  • pyod-mark-3 is a notebook for running an ensemble model on a dataframe using pyod.
  • viewer-mark-3 is a tool for aggregating, sifting and querying the output of the models in a series fo interactive widgets.

Cloudtrail

The dashboards folder contains anomaly detection dashboards for Elastic and Splunk that do not require Jupiter. The Cloudtrail, Flow Logs, and Kubernetes folders contain notebooks for working with those data types. There are some new notebooks in the Cloudtrail folder:

  • the two s3-ingestor notebooks contain code to read CloudTrail logs directly from an S3 bucket into a dataframe, including unzipping the files. They assume the notebook is running on a Jupiter instance that can reach the buckets using the s3 client via either an IAM user or an AssumeRole.
  • aws-ip-discovery is a notebook that will enumerate IP addressed associated with an account, for cases when the question of wether an IP address is associated with your account or not, in order to ascertain if the source is someone else’s account.

Introducing Protostar

(originally code named “skynet” if you saw it at DEF CON or Blackhat MEA 2024)

Why do we ship products that create alert fatigue?

Three years ago, a team member asked that question.  As is often the case when the best questions are asked, there was no clear answer from the room containing something like a millenium of assembled experience in the security product R&D. It left a lasting impression because we realized we had stopped asking that question. To the youngest team member, unencumbered by the weight of precedent, it was obvious. To us, it had become partly invisible, as we answered a stream of requests for short-term relief and quick fixes. 

Alerts historically have been the result of a logical test, usually called a rule, outputting a boolean true / false decision on an input; something is either good or bad. The thing in question is usually a data blob comprising a log or email message. It could be an artifact from memory or a file system. The decision has to be made more or less immediately, using only the available input to the rule. Each true output is packaged into an alert for the users to read and the false outputs are mostly discarded. The problem is that too many of the things the rules say are bad could or should have been labeled good or benign. One way to look at this is that it is an issue of quality or tuning and the logical tests in the rules need refinement or additional branching logic; this leads to detection engineering. Another approach is to build complex post-processing systems that try to sort out the true and false positives using statistical analysis or machine learning. Both of these approaches, while often productive, create significant work streams for human analysts that are often incomplete due to their time and effort cost. The latest, and perhaps most fashionable approach is to use agents and large language models to offload the work of alert decisioning from the humans. 

All of these approaches have one assumption in common: the assumption that we have to work with what we have, a list of alerts, a stream of relatively small discreet messages created by rules. This paradigm dates back to the first security systems of the 1990s when larger, more complex data structures were less practical. The alerts we have today are much more evolved, of course, but most of them are still a stream of Boolean true / false decisions on a small message, event, or data artifact. We have long known of the base rate fallacy and the sensitivity. specificity paradox. Alert rules, like people, cannot be in two places at the same time. A single rule can be optimized to minimize false positives, or to maximize true positives, but not both at once. 

Imagine a criminal trial where each piece of evidence, from the case, is shown to a single juror, who then makes a decision. The next piece of evidence is shown to another juror, who makes a decision based on that. There would be different verdicts from the jurors with some returning guilty or not guilty as best they could decide from what they were shown. A good number of them might return no decisions, citing insufficient information to form the basis for a decision. However we calculate the final verdict, after tallying the votes, we can easily imagine how inaccurate this would be, given the difficulty of making such a large decision from a single piece of information. We would never do this, in real life, but the analysis of individual alerts, one at a time, by security teams today, can be likened to this. 

what would we do if we had a blank slate? score the graphs, not the alerts

If we were going to do something different, what would it look like? One option is to plot alerts on graphs instead of putting alerts into tables. Plotting masses of alerts on graphs has been done, but a simple alert graph will quickly become overloaded beyond readability. Drawing alerts, in a graph, doesn’t necessarily have more information gain than a conventional list format. There is more untapped information in the graph, however, in the form of relationships. Most entities  – endpoints, servers, roles and users – in such a graph don’t have relationships with most of the others, so putting everything in one graph can again have low information gain and high density. 

Suppose we make a series of graphs instead; a separate graph for each entity, with each entity at the center, and the detection artifacts – alerts created by rules and by machine learning models – are the edges. Call each detection type a relationship, and add booster relationships for the entities with multiple detection relationships. Add another boost for detection relationships where different detection types agree on the same original input. Two or three witnesses are better than one. Add additional boosts for complementary ATT&CK tactics, machine learning based recommendations, agentic analysis, or local user sentiment.  Now, instead of scoring the alerts individually, we score the the detection graph for each entity in the graph, not relying on the scores or severities of the alerts themselves, but instead using the signal strength  of the number and strength of the detection relationships in the sub-graph for the entity. Next up: some examples of how this is better solution, at lower cost, to alert fatigue.

Show and tell: agent smith


So we are seeing more and more cases of malicious code injected into shared models, skills, tool, agents, and projects. I had a case where a shared model had a number of malicious code blocks that the users had not noticed due to the size of the codebase. At the time of this writing, we are seeing proliferation of malicious skills and prompts with a variety of payloads. When run in an IDE with AI tools, much of these produce execution and C2 activity that is detectable if something was watching closely enough. The challenge is that these events will be in a haystack of benign process and network events from tasks the AI tools were given permission for by the user. One of the purposes of the ODR (open DR) project is to hunt for this class of threat activity using anomaly detection techniques and a learning informed detection pipeline that is continuously updated.

One reason to run these hunts at the local level, in addition to conventional SOC and SIEM operations, is that the definition of what normal looks like, for a particular IDE, is largely in the head of the developer. Another reason is that much of the context is on the endpoint where the activity took place, and all of that state cannot generally be logged to a SIEM due to data volume and cost. At the same time, devs cannot monitor every action taken by their tools, and they are not trained in what to look for. This feels like a job for an autonomous agent pack.

Smith is an autonomous agent pack that will process most alerts and anomaly detections generated by ODR (open DR, also in this GitHub org.) ODR mainly looks for strange things happening in your AI dev tools at present but Smith will process most alerts generated by ODR. There is a show and tell video here: https://youtu.be/lsh3JRne9sg and the project lives in a repo here: https://github.com/opendr-io/agentic-park/tree/main/smith

Smith processes raw event and alert data from the ODR (open DR) project in order to investigate alerts and anomaly detections. It outputs an analysis of each alert. It lets you know when it thinks it has a high confidence detection and engages you in a collaborative analysis conversation in order to work out whether the detected activity is benign and expected or unexpected and potentially malicious.

Some sample alert data is provided so that it can do something out of the box. Sample event data is not provided due to its size, and the need to sanitize it, but can be generated by running openDR. It has a filter layer to try to stop prompt injections from reaching the agents and one filter test case alert is included in the sample data; you will see it get “intercepted” by the filter. That is an interesting area of research and we would like to hear from both offensive and defensive researchers as we add more filtering techniques and more detections.