During the PRECINCT project conference on 16-17 May 2023, the AI4CYBER project organised a workshop session on “Cybersecurity Datasets: Challenges and Opportunities”. The workshop was led by AI4CYBER coordinator Ms. Erkuden Rios (TECNALIA) and was attended by 18 participants, including AI4CYBER partners Montimage, Search Lab and EOS. All participants took an active part in the discussions and exchanged views on the difficulties and opportunities of using and sharing cybersecurity-related datasets for cybersecurity research and solution development.

The workshop grew out of challenges that AI4CYBER partners have encountered in their own research. In particular, training the models and algorithms behind AI-based cybersecurity solutions requires large amounts of well-structured and sanitised data, which is sometimes difficult to obtain. The discussion showed that other researchers face similar issues and that there is a need to incentivise the sharing of cybersecurity data with projects that can contribute to the security and resilience of infrastructures.

This article summarises the key points made in the discussions and the key conclusions drawn.

Major barriers to sharing data

To allow participants to understand the background and context around the challenges faced and presented, the workshop began with a presentation of the AI4CYBER project, its tools, added value for critical entities resilience, and use cases in the energy, healthcare and banking sectors.

The workshop then presented the challenges the project has experienced with cybersecurity datasets and explained that they are major barriers to the opportunities cybersecurity research can offer critical infrastructure (CI) providers. The accuracy of AI algorithms and models, for example, depends directly on the quality and amount of data used to train them. The barriers presented included:

  • Lack of trust between CI providers and researchers means that providers do not share important data
  • Open datasets are often very specific and do not meet the needs of all researchers
  • Companies often do not want external researchers to access their data, or even to know what types of data they use
  • Commercial datasets often exist, but using EU funds to purchase them is uncommon and often not possible
  • Lack of data in European data pools

The solution often chosen is to replicate the provider’s system and the attacks against it in a lab. However, participants noted that this approach can be too “artificial”, as there is no guarantee that every vulnerability present in the real system has been reproduced in the lab. In addition, many researchers noted that simulating IT systems is often easier than simulating OT systems, as OT environments most likely differ from sector to sector. In-house OT experts often have equipment that is used for testing, but they do not log data that could be used for security research, because they do not have detection or protection systems in place.

Participants also identified an inherent contradiction between the principles of cybersecurity and sharing data with researchers. The act of sharing data can itself create vulnerabilities that can be exploited. Even anonymised datasets can be linked with other information that already exists or has been leaked, and can therefore reveal confidential information.
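The linkage risk described above can be illustrated with a minimal Python sketch. It was not shown at the workshop: all records, field names and host names are made up for illustration, and the choice of "os" and "site" as quasi-identifiers is an assumption. It only shows the general idea that a pseudonymised record becomes identifiable when its remaining attributes match exactly one entry in some other source.

```python
# Illustrative sketch only: every record below is invented, not real data.

# A dataset shared with researchers, with host names replaced by pseudonyms.
anonymised_events = [
    {"host_id": "h-01", "os": "Windows Server 2012", "site": "Plant A", "open_port": 502},
    {"host_id": "h-02", "os": "Ubuntu 20.04", "site": "Plant A", "open_port": 22},
    {"host_id": "h-03", "os": "Ubuntu 20.04", "site": "Plant B", "open_port": 22},
]

# A separate, hypothetical source (for example a leaked asset inventory).
leaked_inventory = [
    {"hostname": "scada-gw.example", "os": "Windows Server 2012", "site": "Plant A"},
    {"hostname": "web-01.example", "os": "Ubuntu 20.04", "site": "Plant A"},
    {"hostname": "web-02.example", "os": "Ubuntu 20.04", "site": "Plant B"},
    {"hostname": "web-03.example", "os": "Ubuntu 20.04", "site": "Plant B"},
]

# Attributes left in the shared data that also appear in the other source.
QUASI_IDENTIFIERS = ("os", "site")


def link_records(events, inventory, keys=QUASI_IDENTIFIERS):
    """Map each pseudonym to the inventory host names sharing its quasi-identifiers."""
    index = {}
    for row in inventory:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["hostname"])
    return {e["host_id"]: index.get(tuple(e[k] for k in keys), []) for e in events}


def reidentified(links):
    """Keep only pseudonyms with exactly one candidate: these are effectively exposed."""
    return {hid: names[0] for hid, names in links.items() if len(names) == 1}


links = link_records(anonymised_events, leaked_inventory)
print(reidentified(links))
```

In this toy data, two of the three pseudonyms match a single inventory entry and are therefore re-identified, while the third stays ambiguous because two hosts share its attributes. Removing or coarsening quasi-identifiers reduces the risk, at the cost of making the data less useful for research.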

A final challenge raised by participants was the lack of a common understanding of the vocabulary used in projects. AI4CYBER has followed the definitions used by other projects and by EU Directives (CER and NIS 2), but such shared definitions do not exist for all terms. For example, many projects define cyber resilience or hybrid threats differently.

Possible solutions

Participants suggested that the best solution might be for CI providers to apply the AI models internally, run the tests themselves and then share the results with researchers. AI4CYBER is following this approach, and also has an NDA in place with the operators involved in the project.

The issue of certification and the example of GAIA-X were also brought up. An EU-approved certification is needed to build trust between researchers and operators. An analytics program certified by the EU, for example, can be trusted not to use data for any other purpose. This does not solve all the issues, but it is a starting point.

Finally, a good idea would be to ensure interoperability between projects that focus on the same area of research and data sharing in the EU data pools.

Conclusions

All participants agreed that there are significant barriers to cybersecurity research today, with secrecy and a lack of trust being the most common. There is a culture of “cybersecurity by obscurity”, and more groundwork is required to showcase the value of EU projects and research for cybersecurity solution development and, by extension, the need for quality data to develop algorithms and solutions.

We thank all participants for the discussion and the PRECINCT EU project for the invitation and the organisation of a successful conference. AI4CYBER is open to continuing the discussion, and we invite you to stay in touch by following us on Twitter and LinkedIn.