ExtraHop Shares Huge Dataset for Detecting Domains Generated by Algorithm on GitHub

Over the past 25 years, detecting and blocking traffic from botnets and other malicious domains has been difficult for cybersecurity professionals as threat actors have used sophisticated techniques to avoid being shut down. 

Specifically, threat actors increasingly employ domains generated by algorithm (DGAs) in malware and botnet operations to establish communication between infected computers and their command-and-control (C&C) servers.

In traditional botnet architectures, the C&C server's IP address or domain name is hard-coded into the malware. However, this makes it easier for security professionals to identify and block these servers. To overcome this limitation, attackers employ DGAs to dynamically generate a large number of domain names that can be used to communicate with the C&C server.

These domain names are produced by an algorithm that takes inputs such as the current time or date, a seed value, or other variables. Because the algorithm yields a large pool of potential domain names, the malware can attempt to connect to one of them periodically or when certain conditions are met. This makes it difficult for security systems to predict or block the exact domain names the malware will use.
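
To make the idea concrete, here is a deliberately simplified sketch of how a date-and-seed-driven generator could work. It is a toy for illustration only and does not reproduce the algorithm of any real malware family. The function name `generate_domains`, the seed string, and the reserved `.example` suffix are all assumptions chosen for this example.

```python
import hashlib
from datetime import date


def generate_domains(seed, day, count=5, tld=".example"):
    """Derive `count` pseudo-random domain names from a seed and a date.

    Hypothetical toy generator: anyone who knows the seed and the date
    can recompute the same list, which is what lets both sides of a
    connection agree on where to meet without a hard-coded address.
    """
    domains = []
    for i in range(count):
        # Mix the seed, the date, and a counter into one hash input.
        material = f"{seed}:{day.isoformat()}:{i}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:12])
        domains.append(label + tld)
    return domains


june_1 = generate_domains("demo-seed", date(2023, 6, 1))
june_2 = generate_domains("demo-seed", date(2023, 6, 2))
```

The same seed and date always produce the same list, while a different date produces a different one, so a blocklist built from yesterday's names is of little use today.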

The purpose of DGAs is to make it harder for security researchers and network administrators to disrupt or shut down the communication channels between infected computers and the C&C server. By constantly changing the domain names, malware authors can maintain control over the botnet and continue to issue commands to compromised systems without detection.

Detecting and mitigating DGAs is a challenging task for cybersecurity professionals, as they need to analyze the algorithm used by the malware, monitor DNS requests, and implement advanced techniques to identify and block malicious domain names generated by the DGA.

New Tool to Help

Today, ExtraHop has taken a significant step toward helping organizations defend against DGA-aided attacks by releasing on GitHub a massive open source machine learning dataset designed to help detect DGAs.

The dataset, one of the largest available for this use case, consists of 16 million rows of data. In contrast, many other datasets we’ve reviewed contained much less data and had several limitations.

Originally built for the ExtraHop Reveal(x) network detection and response (NDR) platform, this dataset can now be used by any security researcher to construct their own machine learning (ML) classifier model to identify DGAs more quickly and intervene in attacks with greater speed and precision. Since its implementation in Reveal(x), the ExtraHop DGA model has demonstrated more than 98% accuracy.

Improving the Data for Detecting DGAs

ExtraHop began the research resulting in this dataset because we were not satisfied with the performance of existing models to identify DGAs. The ExtraHop team made several attempts to improve the models using feature engineering, model selection, and testing before hitting upon some methods to improve the accuracy of the data. 

We went through several cycles of reviewing academic research on DGAs, testing model architectures, feature engineering the data, writing training and testing code, and training and testing the models to ultimately create the dataset we are releasing.

Feature engineering, involving the extraction and transformation of variables from raw data, was an important part of the process. To create a good DGA tool, we needed both a good dataset and a strong feature engineering process. 

Make It Simple

When we started on the project, we used available automated feature engineering tools, but they produced overly complex features that in many cases negatively impacted the model. These tools have improved dramatically since then, but at the time, we realized we didn’t need to use such complicated tools, and we ended up using a rather simple method for encoding symbols.

Ultimately, our feature vector was simple:

  • We made a list of all legal characters that can be in a domain name. In Python it looks something like this: keys = ['a', 'A', '1', '2', ...]
  • We created a lookup table of keys to integer values. In Python it looks like this: lookup_table = {}, then lookup_table['A'] = 1, lookup_table['B'] = 2 and so on.
  • To ensure we were not injecting an ordering bias or magnitude bias, we randomly assigned both the keys and values for each.
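
Put together, the three steps above might look like the following minimal sketch. It is not the production Reveal(x) code. The exact character set, the fixed random seed, the `encode` helper, the zero padding, and the `max_len` value are all assumptions made for illustration.

```python
import random
import string

# Step 1: characters that can appear in a domain name.
# Assumption: letters in both cases, digits, hyphen, and the dot separator.
keys = list(string.ascii_letters + string.digits + "-.")

# Step 2 and 3: map each character to an integer, with both the key order
# and the value order shuffled so the codes carry no alphabetical ordering.
rng = random.Random(0)  # fixed seed (assumption) keeps the table reproducible
values = list(range(1, len(keys) + 1))  # 0 is kept free for padding
rng.shuffle(keys)
rng.shuffle(values)
lookup_table = dict(zip(keys, values))


def encode(domain, max_len=64):
    """Turn a domain into a fixed-length list of integer codes.

    Hypothetical helper: characters missing from the table and unused
    trailing positions are both encoded as 0.
    """
    codes = [lookup_table.get(ch, 0) for ch in domain[:max_len]]
    return codes + [0] * (max_len - len(codes))


vector = encode("example.com")
```

Each domain becomes a vector of the same length, which is the shape most ML classifiers expect as input. Repeated characters always receive the same code, so the model can still learn character patterns even though the codes themselves are arbitrary.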

To test the dataset, ExtraHop found three methods of identifying DGAs with promising results:

With this dataset, we were able to demonstrate the accuracy of results using these models, and we hope others in the cybersecurity community can use the dataset to implement a predictive DGA model that is highly accurate and protect their organizations against malware, botnets, and other attacks. Download the dataset at GitHub.
