
Machine Learning to identify malicious strings in a file


Why bother with strings?

When analyzing a new sample found “in the wild”, it may make sense to extract the strings within it to identify IP addresses, domains, log files or C&C server signatures. For example, if an Artificial Intelligence model such as HL-AI flags a binary named “calc.exe” as malicious, and “Mimikatz” is found among its strings, the executable was probably not designed to compute anything.

Too much Lj&&lZ66~A?<

The fact is that most extracted strings are just random data that happens to consist of valid ASCII characters. That is nothing to worry about when processing a sample manually, but how can you automate the extraction process to handle thousands of binaries? Blindly storing all extracted strings would waste storage space, not to mention the resources involved in filtering and extracting useful information afterwards. That’s why we need a system that can tell whether a string is nonsense or relevant (in the context of cybersecurity, POFXCALLBACKENUMERATEUNMASKEDINTERRUPTS and iisncizn.dll are relevant strings). The system needs to be fast enough to filter thousands of strings quickly, with a very low false-negative rate to avoid missing important strings, and a reasonably low false-positive rate to avoid unnecessary storage.
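
String extraction itself is the easy part. For illustration, here is a minimal `strings`-style extractor in Python that pulls out runs of at least four printable ASCII characters (the four-character minimum mirrors the classic `strings` utility; this is a simplified stand-in, not our production tool):

```python
import re

# Runs of at least four printable ASCII characters, like the `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def extract_strings(path: str) -> list[str]:
    with open(path, "rb") as f:
        data = f.read()
    return [m.decode("ascii") for m in PRINTABLE_RUN.findall(data)]
```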

Choosing a battle plan

Two approaches come to mind for this project:

  • The deterministic approach, where regular expressions are used to extract IP addresses, file names, domain names and so on.
  • The non-deterministic approach, which exploits Machine Learning either directly on strings, or on features extracted from them.

The first method is useful for extracting Indicators of Compromise (IoCs), but is difficult to generalize to strings without fixed patterns, such as function names. The second method is more generalizable, but we can’t be sure that it won’t miss strings of great importance. So why not combine the two approaches? A first filter detects IoCs based on a set of patterns to ensure that the attacker’s server is identified, then the Machine Learning-based method filters out the gibberish.
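
In pseudo-Python, the combined filter might look like this (matches_ioc_rule is a simplified stand-in for the deterministic rule set detailed later; model and featurize are the trained classifier and feature extractor discussed below):

```python
import re

IP_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")  # crude stand-in pattern

def matches_ioc_rule(s: str) -> bool:
    # Stand-in for the deterministic pass; the full rule set (domains,
    # URLs, file names, etc.) is detailed later in the article.
    return bool(IP_RE.match(s))

def is_relevant(s: str, model, featurize) -> bool:
    # Rules run first, so IoCs can never be lost to the model;
    # the ML filter then decides on everything else.
    return matches_ioc_rule(s) or model.predict([featurize(s)])[0] == 1
```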

Building the database: sorting strings, or the Tinder of strings

We now have a clearer idea of what we want to do, and we need a database of strings identified as relevant or irrelevant to train and test our filtering system. So we created a tool that takes binaries as input, extracts their strings, and asks the user to reject or approve each one by swiping it left or right with the keyboard arrows. Fortunately, strings are generally extracted in runs of “good” or “bad” strings, so we were able to build a database of 15,000 labeled strings in an hour.
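
A hypothetical, stripped-down terminal version of that tagging loop could look like this (the real tool uses arrow keys and a graphical display; the key bindings and CSV output here are invented):

```python
import csv

# Show each string and record a 1 ("good") or 0 ("bad") label in a CSV.
def label_strings(strings: list[str], out_path: str = "labels.csv") -> None:
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for s in strings:
            answer = ""
            while answer not in ("g", "b"):
                answer = input(f"{s!r}  [g]ood / [b]ad? ").strip().lower()
            writer.writerow([s, 1 if answer == "g" else 0])
```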

Choosing the best Machine Learning model

As mentioned earlier, we want our model to perform as well as possible while quickly filtering meaningless strings out of a file. Many conventional Machine Learning models come to mind: Random Forest, XGBoost, SVMs, even something as simple as logistic regression or naive Bayes classification. Each of these models is relatively quick to train, yet allows for extremely fast inference, which is exactly what we’re looking for.

Now that we have a set of classification models, we need to convert the strings into something a model can understand. In other words, we need to find a numerical representation of the strings that carries enough information for the model to tell the difference between “good” and “bad” strings. The simplest way to convert strings into numerical values is to compute the features we consider interesting. Shannon entropy is a good example: it evaluates the distribution of characters within a string and determines how unpredictably they are arranged. Strings with higher entropy show a more varied and disordered distribution of characters, implying a higher level of randomness; strings with lower entropy show a more structured arrangement, indicating more regularity and predictability.
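
As a quick sketch, Shannon entropy over a string’s character distribution can be computed like this:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character: H = -sum(p * log2(p))."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

# A structured string scores low, random-looking data scores high:
# shannon_entropy("aaaaaaaa")      -> 0.0
# shannon_entropy("Lj&&lZ66~A?<")  -> ~3.25
```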

In addition to Shannon entropy, the following features are used (a sketch of how to compute them follows the list):

  • string_length Length of the string
  • proportion_numbers Proportion of digits found
  • proportion_letters Proportion of alphabetical characters found
  • proportion_space Proportion of white spaces found
  • proportion_special Proportion of special characters found
  • proportion_lower Proportion of lower-case characters found
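
A minimal sketch of the feature extraction, reusing shannon_entropy() from the previous snippet (the exact definitions, e.g. what counts as a special character, are simplified assumptions):

```python
def featurize(s: str) -> list[float]:
    # Builds the feature vector described above.
    n = len(s) or 1  # guard against division by zero on empty strings
    return [
        shannon_entropy(s),
        float(len(s)),                                    # string_length
        sum(c.isdigit() for c in s) / n,                  # proportion_numbers
        sum(c.isalpha() for c in s) / n,                  # proportion_letters
        sum(c.isspace() for c in s) / n,                  # proportion_space
        sum(not c.isalnum() and not c.isspace() for c in s) / n,  # proportion_special
        sum(c.islower() for c in s) / n,                  # proportion_lower
    ]
```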

We now have the models and numerical representations. So we’re ready to train and test the performance of the Machine Learning models we’ve chosen.

Choosing which model to use

Since several Machine Learning algorithms can perform this task, we need to find out which one gives the best results. In Machine Learning, it is customary to split the data into a training set and a test set, in order to evaluate the model’s performance on examples it has never seen during training. However, a single split can produce misleading results, as it is always possible that the test set is not representative of the training data distribution.

Imagine, for example, that “bad” strings frequently include special characters such as %, !, #, while “good” strings show a more uniform distribution of characters. In a single random split, if the test set contains mainly “good” strings while the training set has more “bad” strings containing these special characters, the model may wrongly associate these characters with maliciousness. Evaluated on that test set, the model may appear highly accurate, yet it performs well only because it has learned those specific characters (%, !, #) from the “bad” strings in the training set.

A common and fairly effective way to avoid this pitfall is cross-validation. This technique helps us choose the Machine Learning algorithm best suited to the task by running a series of mini-tests with each algorithm. Instead of relying on a single split, cross-validation divides the data into several folds that better represent its overall distribution, training and evaluating on each fold in turn.
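
With scikit-learn, the comparison boils down to a few lines; here X is the matrix of feature vectors produced by featurize() and y the 0/1 labels from the tagging tool (both assumed to exist; model settings are defaults, for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Compare candidate classifiers by 5-fold cross-validated accuracy.
models = {
    "random_forest": RandomForestClassifier(),
    "xgboost": XGBClassifier(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} over 5 folds")
```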

The right algorithm to limit false negatives

Scoring on accuracy, Random Forest comes out as the best classifier with an average of 0.969 over 5 folds, closely followed by logistic regression (0.968), then XGBoost (0.967), and finally naive Bayes classification (0.886).

Random Forest can be further improved by adjusting certain parameters, called hyperparameters. Hyperparameters are like settings on a Machine Learning model that you can tune to improve its performance on a specific task. To tune them, we define a set of candidate hyperparameter values and build a grid of combinations to evaluate; each combination is scored by cross-validation. With better-chosen hyperparameters, the average score of the Random Forest increases by about one point, reaching an accuracy of 0.98.
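
A grid search over assumed hyperparameter ranges might look like this (the grid we actually explored is not reproduced here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; every combination is scored by 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)  # X, y as in the previous snippet
print(search.best_params_, search.best_score_)
```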

Now that the algorithm is tuned, our goal is to reduce the number of strings that our classifier wrongly marks as unimportant; in other words, to minimize false negatives as much as possible. One option is to lower the decision threshold of the binary classifier until the false-negative rate is sufficiently low. The approach we ultimately prefer, however, is to establish rules that quickly spot strings likely to be important: these special strings don’t need to go through the Machine Learning algorithm at all and are directly classified as relevant.
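
For completeness, the threshold option would look something like this (clf being the tuned Random Forest, X_test a held-out feature matrix, and the threshold value purely illustrative):

```python
# Column 1 of predict_proba() is the probability of the "relevant" class.
THRESHOLD = 0.2  # below the default 0.5: fewer false negatives,
                 # at the cost of more false positives

proba_relevant = clf.predict_proba(X_test)[:, 1]
keep = proba_relevant >= THRESHOLD  # boolean mask of strings to keep
```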

These rules are designed to quickly identify strings that look important according to specific criteria. In this case, here’s what we’re looking for (a sketch of the rules follows the list):

  • File Extension Detection: if a string ends with a known file extension (for example, “random.dll” or “evenmorerandom.pdf”), it is automatically categorized as being of interest.
  • Domain Name Matching: strings that match domain names are deemed to be of interest.
  • URL Validation: if a string conforms to a valid URL format, it is automatically categorized as being of interest.
  • IP Address Validation: strings resembling valid IP addresses are considered of interest.
  • Base64 Validation: strings meeting the criteria for valid Base64 encoding are considered of interest.
  • Windows Object Detection: if a string begins with “\\”, indicating a Windows object, it is automatically categorized as of interest.
  • Bitcoin Address Recognition: strings resembling Bitcoin addresses are deemed to be of interest.
  • Email Detection: if a string appears to correspond to an e-mail address, it is considered to be of interest.
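
A rough sketch of such a rule-based pre-filter (each pattern is a simplified stand-in for the real heuristics, the extension list is truncated, and domain-name matching is omitted for brevity):

```python
import base64
import binascii
import re

FILE_EXT_RE = re.compile(r"\.(dll|exe|pdf|doc|ps1|bat|sys)$", re.I)
URL_RE = re.compile(r"^https?://\S+$", re.I)
IPV4_RE = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
BTC_RE = re.compile(r"^(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[a-z0-9]{25,59})$")

def looks_like_base64(s: str) -> bool:
    # Short strings and wrong lengths are rejected before decoding.
    if len(s) < 16 or len(s) % 4:
        return False
    try:
        base64.b64decode(s, validate=True)
        return True
    except binascii.Error:
        return False

def matches_rule(s: str) -> bool:
    return bool(
        FILE_EXT_RE.search(s)
        or URL_RE.match(s)
        or IPV4_RE.match(s)
        or EMAIL_RE.match(s)
        or BTC_RE.match(s)
        or s.startswith("\\\\")   # Windows object prefix
        or looks_like_base64(s)
    )
```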

And what about embeddings and neural networks?

Despite the appreciable results obtained with classic Machine Learning algorithms, we did try a few alternatives, namely string embeddings. String embedding represents strings as vectors, enabling algorithms to capture semantic relationships between characters and substrings that may not be obvious in their raw form. Two embedding approaches were tested: n-grams and CANINE. The n-gram approach breaks strings down into smaller sub-units (character subsequences of length n) and creates features based on their frequencies. CANINE, on the other hand, is a pre-trained language model designed to understand and represent strings efficiently.
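
The n-gram representation, for instance, takes a couple of lines with scikit-learn (here strings is the list of labeled strings, and the 2-to-3 character range is an illustrative choice):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each string becomes a sparse vector of 2- and 3-character n-gram counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X_ngrams = vectorizer.fit_transform(strings)
```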

Neural networks are naturally well suited to classifying these embeddings, particularly RNN (recurrent neural network) architectures. RNNs are efficient at processing sequences, making them a good fit for string data: they process a sequence step by step, retaining an internal memory of past elements that influences the processing of future ones.
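
A minimal sketch of such a classifier in Keras, assuming strings are encoded as padded sequences of ASCII character ids (the layer sizes are illustrative, not our exact architecture):

```python
import tensorflow as tf

# Bidirectional LSTM over character ids in [0, 128).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=128, output_dim=32),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # relevant vs. gibberish
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```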

Unfortunately, the RNN (here a bidirectional LSTM) did no better than the classic Machine Learning algorithms with our bespoke features. Accuracy on the test set was 0.85 for n-grams and 0.83 for CANINE, despite satisfactory loss convergence during training. This is not surprising, given the specific nature of the strings (which do not represent real words) and the fact that our features are tailored to the context of malicious and benign strings.

It seems that Random Forest (like other classic Machine Learning algorithms) can effectively exploit our features, with the added advantage of interpretability. It is crucial to choose methods that match the complexity of the task at hand: the most sophisticated methods are not always the most suitable.

Conclusion

For our mission to detect malicious strings, we opted for a hands-on approach combining simple rules and Machine Learning. Our main objective was to train a fast and efficient algorithm, and to pay particular attention to minimizing false negatives.

We created a database of labeled strings, tested several Machine Learning models, and studied cutting-edge techniques including string embedding and neural networks. In the end, we found that the Random Forest model, combined with specially adapted features and identification rules, proved to be the most effective.

The final algorithm is now used by the CTI team at HarfangLab, making string analysis much easier and less time-consuming than before: on our test samples, the model eliminated over 80% of strings, leaving only relevant ones. The algorithm will also be integrated into the product itself to store in memory the relevant strings found by the agents, thus facilitating analysts’ investigations.


Article co-written with Axel Rousselot from HarfangLab’s CTI team.