One of the recurring issues in artificial intelligence is gathering enough data to train your model. In our case, working with Windows event logs is not an easy task, as no available dataset corresponds exactly to what we need. To fix that, we rolled up our sleeves and simulated an Active Directory environment to run attacks and build our own dataset.
Active Directory is an important part of IT infrastructure as a lot of companies use this technology to manage permissions.
Because Active Directory is so widely used and plays a central role in user authorization, it is exposed to several types of threats. For example, there are many tools for guessing passwords in an Active Directory environment. Such an attack usually works in three steps:
User name enumeration: getting a list of usernames
Password spraying: testing popular and common passwords for each username
Gaining access: one of the tested username and password combinations works, and the account can be abused to enumerate assets in the AD network, exploit authenticated services and put the organization at risk.
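To make the pattern concrete, here is a minimal sketch of a heuristic that flags a source address when it fails to log on as several distinct usernames within a short time window, which is the footprint password spraying tends to leave. It is an illustration only, not the model the AI team built: the event fields (event_id, time, user, source), the thresholds and the sample addresses are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FAILED_LOGON = 4625  # Windows Security event ID for a failed logon


def find_spraying_sources(events, min_users=3, window=timedelta(minutes=5)):
    """Return the sources that failed to log on as at least `min_users`
    distinct usernames within one sliding `window`."""
    failures = defaultdict(list)
    for event in events:
        if event["event_id"] == FAILED_LOGON:
            failures[event["source"]].append((event["time"], event["user"]))

    suspicious = set()
    for source, attempts in failures.items():
        attempts.sort()  # chronological order
        start = 0
        for end in range(len(attempts)):
            # Move the left edge until the window spans at most `window`.
            while attempts[end][0] - attempts[start][0] > window:
                start += 1
            users = {user for _, user in attempts[start:end + 1]}
            if len(users) >= min_users:
                suspicious.add(source)
                break
    return suspicious


# Hypothetical sample: one source tries four usernames in four seconds.
t0 = datetime(2023, 1, 1, 9, 0, 0)
sample_events = [
    {"event_id": 4625, "time": t0 + timedelta(seconds=i), "user": name, "source": "10.0.0.9"}
    for i, name in enumerate(["alice", "bob", "carol", "dave"])
] + [
    {"event_id": 4625, "time": t0, "user": "alice", "source": "10.0.0.5"},
    {"event_id": 4624, "time": t0, "user": "alice", "source": "10.0.0.5"},
]
print(find_spraying_sources(sample_events))
```

A single mistyped password, like the one from the second source, stays below the threshold; a real detector would need thresholds tuned on real traffic.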
The AI team is interested in learning how to detect these attacks. Let’s see how we ran attacks on virtual machines in order to analyze them.
Building a virtual infrastructure
What is an Active Directory (AD)?
Active Directory (AD) is a directory service developed by Microsoft that runs on Microsoft Windows Server. In Active Directory, users, groups, applications and devices are stored as objects. They are classified according to their name and attributes. A group of objects that share the same Active Directory database is called a domain.
This system enables administrators to manage permissions and control access to network resources. Users gain authenticated and authorized access to devices and to both cloud and on-premises applications, reliably and conveniently.
The main component of Active Directory is Active Directory Domain Services, which verifies access when a user logs into a system or tries to connect to one over the network, and which assigns and enforces security policies. A server that runs it is called a Domain Controller.
Creating the Active Directory environment
Our goal was to create a small IT infrastructure that includes a domain controller and four clients, each standing in for a user. We don’t need more, as this is enough to generate both benign and malicious data.
Running five physical computers continuously is cumbersome and costly in hardware. To overcome this problem, we turned to virtual machines. We relied on this tutorial to create the infrastructure.
On each of our machines, we installed the agent of our Endpoint Detection and Response (EDR) solution in order to track our data on a dedicated stack.
After creating all our virtual machines and connecting them, we set up these virtual machines on a server of our infrastructure. As these machines can take up space, the goal is to reduce the size as much as possible: delete files that are not needed, uninstall programs that are not used and empty the Recycle Bin.
To manage our machines, we use VirtualBox. You might wonder how we exposed our virtual machines on our server. After uploading the .ova file to your server, run the following commands:
vboxmanage import vm.ova # This command imports one or more virtual machines into Oracle VM VirtualBox
vboxmanage modifyvm vm --vrde on # Enables the VirtualBox Remote Desktop Extension (VRDE) server
vboxmanage modifyvm vm --vrdeaddress x # The IP address (replace x by the IP) of the host network interface the VRDE server will bind to
vboxmanage modifyvm vm --nic2 bridged --nictype2 x --bridgeadapter2 x # Configures the type of networking, replace x by what your computer needs : https://www.virtualbox.org/manual/ch06.html
vboxmanage startvm vm --type headless # Start the virtual machine
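With five machines to deploy, it can help to generate this sequence rather than retype it. The sketch below only builds and prints the same commands for one VM; the function name and every sample value (the address 192.0.2.10, the NIC type 82540EM, the adapter eth0) are placeholders to replace with your own, not values from our setup.

```python
import shlex


def vboxmanage_commands(ova_path, vm_name, vrde_address, nic_type, bridge_adapter):
    """Build the VBoxManage calls listed above as argument lists."""
    return [
        ["vboxmanage", "import", ova_path],
        ["vboxmanage", "modifyvm", vm_name, "--vrde", "on"],
        ["vboxmanage", "modifyvm", vm_name, "--vrdeaddress", vrde_address],
        ["vboxmanage", "modifyvm", vm_name, "--nic2", "bridged",
         "--nictype2", nic_type, "--bridgeadapter2", bridge_adapter],
        ["vboxmanage", "startvm", vm_name, "--type", "headless"],
    ]


# Placeholder values for illustration only.
commands = vboxmanage_commands("vm.ova", "vm", "192.0.2.10", "82540EM", "eth0")
for command in commands:
    # Dry run: print each command. On a host with VirtualBox installed,
    # subprocess.run(command, check=True) would execute it instead.
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting problems if a VM name or file path contains spaces.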
To connect to the machines, an RDP client is required (I recommend Remmina). Now that we have our infrastructure, we can generate data!
Generate data
Before generating data, we need to know what kind of data we are working on.
The needed data comes from the Windows Event Log, which is a report of events related to the system, security, and application stored on a Windows operating system. An event can be identified by its log name, event date, task category, event id, source, level, user and computer.
The events we need come from the “Microsoft-Windows-Security-Auditing” source, in particular those recorded under the Audit logon events and Audit account management categories.
Benign data
The events we want to analyze relate to authentication and are mostly generated by logging on, logging off, or locking and unlocking the computer. They correspond to event IDs 4624, 4634, 4648, 4625 and 4768.
As the data we look at comes from actions performed by a user, it is difficult to get a sufficient variety and quantity of events for analysis in the early days. Benign data cannot be obtained immediately: aside from repeatedly starting, stopping, locking and unlocking the virtual machine, there is no direct solution. However, data generated this way strays from real-world data, as it does not properly simulate typical user activity.
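As a reading aid, the sketch below maps these five IDs to the meaning Microsoft documents for each, and filters a list of events down to them. The dictionary-based event format and the helper name are assumptions made for the example; the EDR stack stores events in its own format.

```python
from collections import Counter

# Event IDs studied in this article, with their documented meaning.
AUTH_EVENT_IDS = {
    4624: "An account was successfully logged on",
    4625: "An account failed to log on",
    4634: "An account was logged off",
    4648: "A logon was attempted using explicit credentials",
    4768: "A Kerberos authentication ticket (TGT) was requested",
}


def keep_auth_events(events):
    """Keep only the events whose ID belongs to the authentication set."""
    return [event for event in events if event.get("event_id") in AUTH_EVENT_IDS]


# Hypothetical sample; 4688 (process creation) is not an authentication event.
sample_events = [
    {"event_id": 4624, "user": "alice"},
    {"event_id": 4688, "user": "alice"},
    {"event_id": 4625, "user": "bob"},
    {"event_id": 4624, "user": "bob"},
]
auth_events = keep_auth_events(sample_events)
for event_id, count in Counter(e["event_id"] for e in auth_events).items():
    print(event_id, AUTH_EVENT_IDS[event_id], count)
```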
It takes several days, or even weeks, for the number of benign events to be sufficient for analysis. This wait, in addition to being time-consuming, sometimes requires human intervention on the virtual machine. To accelerate the process, once the format of the events is known, we can synthetically generate more data by oversampling the events initially created, since the studied events are often structured in the same way. This oversampling is done by adding successful logon events (ID 4624), as these are the most easily reproducible.
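One simple way to picture this oversampling is to copy observed 4624 events and shift their timestamps. The sketch below does exactly that; the random time shift, the synthetic marker and the event fields are assumptions chosen for illustration, and the procedure we actually used may differ.

```python
import copy
import random
from datetime import datetime, timedelta

SUCCESSFUL_LOGON = 4624


def oversample_logons(events, n_new, max_shift=timedelta(hours=8), seed=0):
    """Create `n_new` synthetic events by copying observed 4624 events
    and moving each copy forward by a random amount of time."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    templates = [e for e in events if e["event_id"] == SUCCESSFUL_LOGON]
    if not templates:
        return []
    synthetic = []
    for _ in range(n_new):
        new_event = copy.deepcopy(rng.choice(templates))
        shift = timedelta(seconds=rng.uniform(0, max_shift.total_seconds()))
        new_event["time"] = new_event["time"] + shift
        new_event["synthetic"] = True  # keep generated events identifiable
        synthetic.append(new_event)
    return synthetic


# Hypothetical observed events.
base = datetime(2023, 1, 2, 8, 30, 0)
observed = [
    {"event_id": 4624, "time": base, "user": "alice"},
    {"event_id": 4634, "time": base + timedelta(hours=1), "user": "alice"},
]
synthetic = oversample_logons(observed, n_new=5)
print(len(synthetic), "synthetic events generated")
```

Marking generated events makes it possible to exclude them later, for example when evaluating a model on real data only.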
Malicious data
We looked at the Atomic Red Team’s GitHub to get an idea of possible attacks, mostly among the Credential Access tactics. One of the techniques we built on, for example, is brute force. There are many tools to generate attacks on an Active Directory. We focused on three well-known tools: Kerbrute, Talon and CrackMapExec.
When generating malicious attacks, it is essential to record the time of each attack in order to have properly labeled data for later analysis.
We also got help from a member of the Cyber Threat Intelligence (CTI) team (it is not the first time that we have worked with the CTI, see this article), who helped us understand how some attacks work and gave us valuable insights into the attacks we ran while crafting and executing complex attack scenarios on the machines.
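As a rough illustration of why the recorded times matter, the sketch below labels each event according to whether its timestamp falls inside a recorded attack window. The window format and event fields are hypothetical.

```python
from datetime import datetime


def label_events(events, attack_windows):
    """Return copies of the events with a 'label' field set to 'malicious'
    when the timestamp falls inside a recorded (start, end) attack window."""
    labeled = []
    for event in events:
        in_attack = any(start <= event["time"] <= end for start, end in attack_windows)
        labeled.append({**event, "label": "malicious" if in_attack else "benign"})
    return labeled


# Hypothetical attack log: one attack ran from 14:00 to 14:05.
attack_windows = [(datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 14, 5))]
sample_events = [
    {"event_id": 4624, "time": datetime(2023, 1, 2, 9, 0)},
    {"event_id": 4625, "time": datetime(2023, 1, 2, 14, 2)},
]
for event in label_events(sample_events, attack_windows):
    print(event["time"], event["event_id"], event["label"])
```

Time alone is a coarse criterion: a legitimate logon that happens during an attack window would also be marked malicious, so further filtering on the attacker’s machine or targeted accounts may be needed.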
For example, to launch a password spraying attack with CrackMapExec, we ran the following command:
python cme smb ip_address -u common-usernames.txt -p common-passwords.txt
Here, ip_address is the address of the machine we want to attack. The text files contain the most common usernames and passwords.
For each attack carried out, our CTI colleague gave us a detailed report on the scenario that was run and on the identifiable events it generated.
Conclusion: running a laboratory that generates real data to help AI
This infrastructure allows the AI team to save time on data generation and to work more independently. This article specifically outlined the generation of Windows Event logs because the AI team is currently working on a model that detects suspicious authentications on Windows computers.
Beyond Windows Event logs, these virtual machines will allow our team to study other types of data. On the technical side, we could set up a schedule that turns our virtual machines on and off at normal working hours to reproduce a customer’s behavior. As VirtualBox is slow and uses a lot of resources, it is not the best option, and we could use QEMU to manage our virtual machines instead.
Data remains the main problem when studying a new topic in AI, but we have seen that it can be solved by running a dedicated laboratory that generates real data, as similar as possible to what is seen in production.