Cyber Threat Intelligence Web Pages Dataset
Cyber Threat Intelligence (CTI), i.e., evidence-based knowledge about potential cyber threats that can inform decisions on how to respond to them, can be found in online sources. When discovering and acquiring such CTI-related online sources, in particular CTI-related web pages, filtering out non-relevant content is beneficial and even necessary. Non-relevant content includes web pages completely unrelated to the cybersecurity domain, as well as web pages that are related to cybersecurity but are not informative in CTI terms (i.e., they do not contain technical information about vulnerabilities, threats, and attacks, such as Tactics, Techniques and Procedures (TTPs) or Indicators of Compromise (IoCs), which include virus signatures, IP addresses, etc.).
The selection of articles based on their topic is commonly approached as a text classification task, using supervised machine learning techniques trained on annotated data. The lack of such datasets in the cyber security domain motivated the creation of this dataset: a corpus of website articles annotated in terms of CTI information, which can be used for CTI-related text classification.
The Cyber Threat Intelligence Web Pages Dataset contains the URLs of 920 web pages, classified into three classes: (i) not related to cyber security, (ii) related to cyber security but not containing CTI-related information, and (iii) CTI-related web pages. The web pages were collected from nine websites, using the respective sitemaps or via crawling. The dataset covers a variety of topics: six websites cover cyber security-related topics, two websites are technology-related, and one is a well-known general news source.
For each web page, we provide its URL, its class according to our annotation, and its download date. The dataset is available as a .tsv file. Additional details about the dataset, concerning the web sources, the data collected, the class distribution, and the labelling process, are available in our publication entitled “Towards Selecting Informative Content for Cyber Threat Intelligence”.
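As an illustration of how a .tsv file with these three fields could be read, here is a minimal Python sketch. It rests on assumptions not stated above: the column order (URL, class, download date), the field names `url`, `label`, and `download_date`, the absence of a header row, and the label strings in the sample are all hypothetical, so adjust them to match the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical field names and column order; the real .tsv layout may differ.
FIELDS = ["url", "label", "download_date"]


def read_cti_tsv(handle, has_header=False):
    """Parse tab-separated rows into dicts keyed by FIELDS."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    return [dict(zip(FIELDS, row)) for row in reader if row]


def class_counts(records):
    """Count how many web pages fall into each class label."""
    return Counter(record["label"] for record in records)


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = (
    "https://example.com/a\tnot_related\t2021-01-01\n"
    "https://example.com/b\tcti\t2021-01-02\n"
    "https://example.com/c\tcti\t2021-01-03\n"
)

records = read_cti_tsv(io.StringIO(SAMPLE))
counts = class_counts(records)
print(counts)
```

To read an actual file, one could pass `open(path, newline="", encoding="utf-8")` in place of the in-memory sample; the per-class counts can then be checked against the class distribution reported in the publication.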
If you are interested in obtaining the dataset, please contact us.
If you use the Cyber Threat Intelligence Web Pages Dataset, or if our work is useful to your research activity, please cite the following publication:
1. P. Panagiotou, C. Iliou, K. Apostolou, T. Tsikrika, S. Vrochidis, I. Kompatsiaris, “Towards Selecting Informative Content for Cyber Threat Intelligence”, in the proceedings of the 2021 IEEE International Conference on Cyber Security and Resilience (IEEE CSR) – Workshops, Virtual Conference, July 26 (Mon) – 28 (Wed), 2021 (accepted for publication)