When will an AI Model Reveal your Sensitive Data?

Wednesday, June 5, 2024

AI models can leak their training data; this much is known. However, such leakage has mainly been observed in lab-like conditions that favor the attacker. Today, there are few answers about what the risks look like in real-life operations.

LeakPro, a project spearheaded by AI Sweden, aims to change that.

The LeakPro project is an ambitious initiative to provide Swedish organizations with the necessary knowledge and tools to use AI models that are trained on sensitive data. In a collaboration between AstraZeneca, Sahlgrenska University Hospital, Region Halland, Scaleout, Syndata, RISE, and AI Sweden, the project targets critical questions about AI model data leakage:

  • How can one assess the likelihood of an AI model leaking its training data? 
  • How can the risk of this happening be mitigated? 
  • How can the legal and technical definitions of risk be aligned to make informed decisions about AI model usage? 

When asked why LeakPro is an important initiative, Magnus Kjellberg, Head of AI Competence Centre at Sahlgrenska University Hospital says:

“The world isn't simply divided into 'risky' and 'risk-free'. All activities carry some level of risk, and it's essential to identify and evaluate these risks just as you would assess potential benefits.”

LeakPro came about because AI models may implicitly contain personal data, an issue raised during the Regulatory Pilot Testbed project. There, the Swedish Authority for Privacy Protection (IMY) pointed out that it is currently infeasible to assess the likelihood of a model leaking sensitive training data. Therefore, such models must be treated as personal data and managed accordingly.

Johan Östman and Fazeleh Hoseini, Research engineers at AI Sweden. Photo: © AI Sweden.

Johan Östman, research scientist and project manager at AI Sweden, elaborates on the implications:
“IMY’s findings limit the healthcare sector's ability to use AI, but the risk of leakage must also be addressed from a business perspective. Similar legal challenges may arise in other Swedish organizations and have the potential to stifle innovation and the use of AI. For example, the issue has emerged in our discussions with government agencies”, he says.

In the Regulatory Pilot Testbed, participants explored federated learning. However, the conclusions apply to many scenarios where models trained on sensitive data are shared, for example through cloud services or when models are released publicly.

The Regulatory Pilot Testbed, therefore, has a natural successor in LeakPro. The project is part of the larger context in which AI Sweden, through several initiatives within the framework of AI Labs, is building capabilities and knowledge within the area of AI Safety (see fact box).

The need for risk assessment is evident

“Sharing models offers significant benefits, but it also comes with inherent risks that can never be completely eliminated. That's why having a thorough risk assessment is crucial. Our goal is to estimate and minimize the risks of data leakage when sharing models, whether within a collaboration or through scientific publications,” says Ola Engkvist, Head of Molecular AI, AstraZeneca.

Markus Lingman, Chief Strategy Officer at Region Halland, makes a similar point about the healthcare sector’s needs:

“We always need to balance risk and benefit. With modern methods, it's hard to claim that something is 100 percent anonymous; hence, clarity about the potential risk is essential. There's little guidance from legislators, and subjective terms like 'reasonable risk' are not very helpful in these contexts,” he adds.

LeakPro tackles these knowledge gaps on three fronts: technical, organizational, and legal.

The technical perspective focuses on understanding how and when models leak data, and what can be done to reduce the likelihood or prevent it from happening. LeakPro aims to develop tools to assess the risk of information leakage, allowing users to test different defenses, reassess, and build sufficiently secure solutions.

“From tests conducted in ‘lab conditions’, we know that models can leak data under conditions favorable to the attacker. We want to find out if, how, and when this can happen with models that are operating under more realistic conditions,” Johan Östman explains.

The organizational and legal perspectives aim to provide decision-makers with better tools for making informed decisions about AI usage. These tools will help non-technical people understand the leakage risks of specific models, enabling well-informed decisions about when, how, and whether to use them. Since trained models may be classified as personal data, their handling falls under GDPR.

To address this, LeakPro includes a legal reference group, in which IMY participates. Johan Östman puts it as an attempt to make "the technical definition of risk meet the legal definition of the same concept."

Magnus Kjellberg further explains:
“To make well-informed decisions that balance benefits against risks, we need a clear way to measure risk. From a legal point of view, there must be some form of consensus on managing and assessing risks for new AI solutions. Tools for evaluating the latest AI technologies are lacking, which is why a project like LeakPro is important to us,” he says.

Broad impact

The technical aspects of the project, involving AstraZeneca, Sahlgrenska University Hospital, and Region Halland, focus on life science and healthcare applications. However, through a cross-sectoral reference group, the project receives input from various sectors to develop methods and tools applicable across industries.

Linda Lindström, legal expert at eSamverkansprogrammet (eSam), sees great potential benefits for the public sector:

"The ability to share data and AI models is anticipated to bring significant benefits to public services. Doing so in a legally secure way is crucial for our government agencies, and we at eSam are involved in several related activities. When we learned about the LeakPro project and its goals, along with the opportunity to join the reference group, we saw another chance to make progress on these issues together," she says, and continues:

"Without more specific metrics on risks, it can often be challenging to conduct legal and security assessments. This could result in data sharing or model exchanges not happening at all, ultimately leading to a loss of public benefit."

Johan Östman explains that the work within LeakPro also has clear connections to other projects driven by AI Sweden:

“We are working with banks to develop federated learning solutions to detect money laundering. The financial sector is another example of an industry with strict personal data laws and where there is great interest in technology that reduces the risks of leakage,” says Johan Östman.

How will your organization benefit from LeakPro?

LeakPro is crucial for us to leverage recent advancements in AI, such as federated machine learning or synthetic health datasets. 

Magnus Kjellberg, Sahlgrenska University Hospital.

The project will facilitate collaborative efforts with other regions, especially for training AI models across organizational boundaries. 

Markus Lingman, Region Halland.

Initially, pre-clinical models can be shared within collaborations or through publications. Eventually, we hope to explore potential clinical applications as well. 

Ola Engkvist, AstraZeneca.

Facts: LeakPro

LeakPro will provide a holistic platform that can be run locally and assess information leakage in the following contexts:

  1. Evaluating the risk of membership inference attacks and reconstruction attacks on training data, with both full model access (white-box) and API-only access (black-box). The platform will support multiple data types, including tabular data, images, and text.
  2. During the training stage of federated learning, where the attacker is either a client or the server. The attacks considered are membership inference and training data reconstruction.
  3. Evaluating information leakage between synthetic data and its original data source. Relevant attacks include membership inference, linkability, and missing-value inference.
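To give a flavor of the simplest attack family listed above, the sketch below shows a minimal loss-threshold membership inference attack. It is an illustrative toy, not LeakPro's implementation: the attacker exploits the fact that models typically achieve lower loss on records they were trained on than on unseen records.

```python
import numpy as np

def loss_threshold_membership_attack(losses, threshold):
    """Guess membership per record: a loss below the threshold suggests
    the record was part of the training set, since models usually fit
    training data more closely than unseen data."""
    losses = np.asarray(losses, dtype=float)
    return losses < threshold

# Toy example with hypothetical per-record losses.
member_losses = np.array([0.05, 0.10, 0.20])      # records the model was trained on
non_member_losses = np.array([0.90, 1.20, 0.75])  # unseen records

guesses_members = loss_threshold_membership_attack(member_losses, threshold=0.5)
guesses_non = loss_threshold_membership_attack(non_member_losses, threshold=0.5)
```

In practice the threshold is calibrated on reference models, and the attack's success is reported as, for example, a true-positive rate at a fixed false-positive rate.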

For more information, contact Johan Östman.


Facts: Regulatory Pilot Testbed

Federated learning means that the data stays where it is, and the AI models move around instead. In healthcare, this could be a way to comply with legislation on patient data and privacy while still capitalizing on the potential of artificial intelligence. In each hospital or region, a model would be trained on the data available there. Those models can then be merged into one, with the combined knowledge from all the training.
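The merging step described above can be sketched as a sample-weighted average of the locally trained parameters (the scheme commonly known as FedAvg). This is a minimal illustration under simplified assumptions, not the testbed's actual setup:

```python
import numpy as np

def federated_average(local_models, n_samples):
    """Merge locally trained model parameters into one global model via a
    sample-weighted average: the data never leaves each hospital or region,
    only the model parameters are shared and combined."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    stacked = np.stack([np.asarray(m, dtype=float) for m in local_models])
    return np.average(stacked, axis=0, weights=weights)

# Toy example: parameter vectors from three hospitals with different data sizes.
hospital_params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
merged = federated_average(hospital_params, n_samples=[100, 100, 200])
```

As the article notes, the catch is that even these locally trained parameter vectors can leak information about the underlying patient data, which is precisely the risk LeakPro sets out to quantify.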

"Such an approach rests on the assumption that the smaller, locally trained models are not considered personal data. But since a model can leak training data, IMY's conclusion was that, at least in some cases, such a model will be considered personal data," says Johan Östman.

The purpose of the regulatory sandbox was to learn more about both the technical and legal aspects of federated learning in healthcare. Project participants were Region Halland, Sahlgrenska University Hospital, and the Swedish Authority for Privacy Protection, IMY, together with AI Sweden.

Facts: AI Safety in AI Labs

The question of AI and safety has become central worldwide, and AI Safety has been a core part of AI Labs' projects for many years.

"We divide AI Safety into three overall categories: ensuring that AI learns the right things, that AI does the right things, and that AI does not leak information," says Mats Nordlund.

The safety aspects of AI are included in a number of projects at AI Sweden in addition to LeakPro, including Federated Machine Learning in the Banking Sector and the Industrial Immersion Exchange Program, which is organized together with Dakota State University in the US and will be held for the third time in 2024.
If you would like to know more about AI Sweden's work in AI Safety, please contact us.

For more information, contact

Johan Östman
Research Scientist - Decentralized AI
+46 (0)73-561 97 64
Mats Nordlund
Director of AI Labs
+46 (0)70-398 08 37