This article was originally written in French and has been automatically translated into English.
Cyber threat intelligence is the product of a process whereby cyber attacks and threats are identified, analysed and tracked. The point is to acquire intelligence, notably technical information, which enables organisations to make decisions and improve their defence capabilities. It is a ‘multi-source’ approach: the more complementary sources are used, the greater the intelligence potential.
Open sources are of course essential to this process. They offer significant benefits: they are rich, constantly updated, easily accessed, inexpensive and quick to use. Their diverse nature may require acquisition, standardisation and technical processing efforts before any use. However, the development of systematic large-scale data processing (‘data-driven’, ‘big-data’) is causing many organisations to expose content via the Internet and make it available to machines (‘APIs’), simplifying this use.
From the overwhelming pool of open-source data that exists, what intelligence about cyber threats can be mined, and how can it help identify threat actors?
Types of open sources useful for intelligence
In addition to all the conventional content deliberately published on the Internet [2] by its producers or authors (websites, scientific publications, digitised books, technical documents, photographs or videos, etc.), the open sources of particular interest to our use case can be divided into three main categories:
- Social sources: all forms of social networks (forums, messaging, X/Twitter, etc.). They convey a variety of unstructured information, but also provide information about individuals and organisations (profiles) and their links (interactions).
- Technical sources: structured databases made publicly available (often for a fee and via a service), which are generally easy to consume automatically or integrate. We will be looking for those that are of specific interest in terms of cybersecurity (indicators of compromise – IOCs; malicious files or URLs, e.g. MalwareDB, VirusTotal, URLScan; results of Internet address scans, e.g. Censys, Shodan, Onyphe; archive data and metadata on Internet names and addresses, e.g. WHOIS, Farsight DNSDB, RiskIQ, Validin; or statistical flows on network communications, e.g. Pure Signal), although others may also meet a need (job offers, company status, invitations to tender, computer source code repositories, etc.).
- Grey sources: more or less structured data that was not intended to be made public. This is the case with data leaks obtained illegally (for example as part of computer attacks) but made available on the Internet, or administrative data (logs, statistics, direct access to databases, etc.) left freely accessible on the Internet by mistake.
These sources and generic content exposed via the Internet can also be made accessible through two main types of channels:
- Indexed Internet content: data exposed directly on the Internet, the existence of which can be revealed by a directory or search tool (e.g. Google, Internet Archive, X/Twitter search engine, digital archive index, etc.).
- Alternative Internet content: data exposed on the Internet but not indexed by conventional search tools (“Deep Web”), or exposed through third-party communication systems (overlay networks, e.g. Tor) that require the use of specific browsing tools (sometimes called “Dark Web” or “Dark Net”).
Common use of open sources
In our case, the most common use of open sources is to collect IOCs or reputation scores, which analysts can then use to build up their own collections, detect and qualify computer attacks, develop search heuristics or train expert models (machine learning, AI). Numerous technical sources provide qualified data that can be consumed and exploited directly by automated systems. The data collected in this way can be enriched and cross-referenced with other technical sources, whether open or not. Exploiting social sources or indexed Internet content for this purpose is often more complex, because their content is unstructured. The development of natural language processing (NLP) algorithms and tools now makes it possible to automatically extract technical data from written content (attack analysis reports, for example) or from social exchanges.
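As a rough illustration of extracting technical data from unstructured text, the sketch below uses plain regular expressions rather than NLP. It is a minimal example under stated assumptions: the patterns are deliberately simplified, the function names `refang` and `extract_iocs` are hypothetical, and only two common "defanging" conventions (`hxxp`, `[.]`) are handled.

```python
import re

# Simplified patterns (assumption): real extractors cover many more formats.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def refang(text: str) -> str:
    """Undo two common 'defanging' conventions found in reports."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_iocs(text: str) -> dict[str, list[str]]:
    """Return de-duplicated candidate indicators found in free text."""
    clean = refang(text)
    return {
        "ipv4": sorted(set(IPV4_RE.findall(clean))),
        "sha256": sorted({h.lower() for h in SHA256_RE.findall(clean)}),
        "domain": sorted({d.lower() for d in DOMAIN_RE.findall(clean)}),
    }
```

The output should be treated as a list of candidates only: the domain pattern, for instance, will also match file names such as `report.pdf`, so results still need qualification before being used for detection.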
Another common use is investigation (“pivoting”): starting with an IOC, the idea is to exploit technical sources to identify links, and then new, previously unknown indicators which will in turn enable attacks to be anticipated or detected. This is what we routinely do when we study a threat, and this tactic enables us to continuously improve our knowledge of the malicious infrastructure exploited by attackers. For example, at the very beginning of 2024, by studying the characteristics of a compromised router (designated by the Ukrainian government as exploited by APT28) and querying an open technical source, we were able to identify and then share the addresses of thousands of other routers compromised in the same way. More recently, using the same approach, we were able to discover the creation of a disinformation infrastructure that could very probably have been exploited against France later.
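The pivoting logic described above can be sketched as a breadth-first expansion over shared characteristics. This is a minimal illustration, not the method actually used in the cases cited: the `pivot` function, the `observations` structure and the feature labels are all hypothetical, and the data is assumed to have already been collected from a technical source.

```python
from collections import deque


def pivot(seed, observations, max_depth=2):
    """Breadth-first expansion from a seed indicator.

    `observations` maps each indicator to the set of features seen for it
    (certificate fingerprint, banner hash, registrant e-mail, ...).
    Two indicators are linked when they share at least one feature.
    Returns {indicator: depth at which it was reached}, excluding the seed.
    """
    # Invert the mapping: feature -> indicators exhibiting it.
    by_feature = {}
    for indicator, features in observations.items():
        for feature in features:
            by_feature.setdefault(feature, set()).add(indicator)

    found = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        if found[current] >= max_depth:
            continue  # do not expand beyond the requested depth
        for feature in observations.get(current, ()):
            for neighbour in by_feature[feature]:
                if neighbour not in found:
                    found[neighbour] = found[current] + 1
                    queue.append(neighbour)
    del found[seed]
    return found


# Hypothetical observations using documentation-only address ranges.
observations = {
    "192.0.2.10": {"cert:aa11", "banner:ssh-x"},
    "192.0.2.44": {"cert:aa11"},
    "198.51.100.9": {"banner:ssh-x", "cert:bb22"},
    "203.0.113.5": {"cert:bb22"},
    "203.0.113.77": {"cert:cc33"},
}
linked = pivot("192.0.2.10", observations)
```

In practice the choice of features matters more than the traversal: a feature shared by millions of hosts (a default banner, for instance) links everything to everything, so only sufficiently distinctive characteristics should be used as pivots.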
Social sources in particular are also exploited by analysts for monitoring purposes. They provide information on new attacks or vulnerabilities, which are sources of inspiration and research for threat detection. These social sources can also be the vector of weak signals and user testimonials (“crowdsourcing”) revealing the start of an IT incident or confirming its scale, such as the complaints from French users of the NordNet provider during the attack on ViaSat modems in February 2022, or more recently the first symptoms of the outage caused by CrowdStrike.
Using open sources to identify threat actors
The most common uses of open sources are specialised and technical, and in turn enable the collection of new technical data. However, these sources can also be used to support the attribution of cyber-attacks – in other words, to identify the organisations or individuals contributing to cyber-attacks – and thus provide strategic intelligence.
The exfiltrated data provided by grey sources often contains information that is useful for attribution. For example, a leak of Chinese documents in February 2024 contained a network address that made it possible in retrospect to attribute the “Poison Carp” cyber-attacks, which targeted the Tibetan community, to the Chinese company I-Soon. In 2016, data leaks published in 2015, which contained presentation documents for the NSO Group's “Pegasus” tool, made it possible to attribute a cyber-attack against a dissident to the United Arab Emirates. Back in 2013, data recovered by Edward Snowden had already made it possible to attribute computer attacks targeting Europe to the American NSA.
Technical sources contain equally valuable attribution data. In 2015, public records of Internet names made it possible to determine that “CyberCaliphate”, allegedly affiliated to the terrorist organisation ISIS and responsible for the computer attacks on TV5Monde, was in fact almost certainly a front for Russian military intelligence. Such name registrations have also betrayed other malicious actors on several occasions. Combined with information presented on social networks, they regularly even make it possible to identify individuals contributing to computer attacks, as in the case of “Mr WU”, a member of the Chinese group APT3, identified in 2017. In 2022, an Austrian company marketing computer attack services was betrayed by an attack testing infrastructure hosted on the Internet in its name. Analysis of the tools deployed by this company even enabled me to identify the author of a piece of malicious code: rare names present in the tools also existed in source code that an individual had published on the Internet.
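The idea of matching rare names between a tool and public source code can be illustrated with a simple set intersection. This is only a sketch under assumptions: the function names, the six-character minimum and the `common_names` background list are hypothetical choices, and the sample strings are invented. It does not reproduce the analysis described above.

```python
import re

# Candidate identifiers: at least six characters, to skip short generic tokens.
IDENT_RE = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]{5,}\b")


def identifiers(text):
    """Return the set of identifier-like tokens found in a text."""
    return set(IDENT_RE.findall(text))


def rare_shared_names(tool_strings, public_code, common_names):
    """Names present in both inputs, minus a background list of common ones."""
    shared = identifiers(tool_strings) & identifiers(public_code)
    return sorted(shared - set(common_names))


# Invented example: strings dumped from a tool vs. a public code snippet.
tool_strings = "initSocket zq_cfgLoaderX sendBuffer"
public_code = "void zq_cfgLoaderX(void) { initSocket(); }"
common_names = {"initSocket", "sendBuffer"}
matches = rare_shared_names(tool_strings, public_code, common_names)
```

A shared rare name is a lead rather than proof: code can be copied, leaked or reused by third parties, so such matches need corroboration from other sources.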
Information that is useful for attribution and available from open sources is sometimes not sufficient on its own, and is only useful as a complement to private information. For example, in 2018, it was possible to attribute computer attacks targeting mobile phones to a Lebanese intelligence service. It was the combination of private information obtained by a security company on mobile phones and geolocation information on WiFi access points available from open sources that enabled the first tests of the attack tools to be linked to a government building.
Ironically, open sources are also exploited by attackers to prepare or carry out computer attacks: in OSINT territory at least, attackers and defenders are playing on equal terms.
[1] Taken as a whole, open sources form an unstructured collection, i.e. one whose format varies and may be locally undefined.
[2] Hereafter we deliberately ignore open sources available only on physical media (libraries, archives, registers, etc.). Although they can be exploited, the interest/difficulty ratio is too often unfavourable in our case compared with sources accessible via the Internet. In fact, the trend is to digitise these physical sources and then publish them on the Internet.