Sustainable data strategies for natural language understanding

Wednesday, November 5, 2025

The accelerating use of large language models is unlocking enormous value for organizations. But to fully realize this potential, a critical bottleneck must be addressed: sustainable data management.

“This challenge, spanning legal, ethical, and practical dimensions, requires a new level of leadership and competence to ensure that AI systems are safe, transparent, and contextually relevant,” says Danila Petrelli, senior data lead in AI Sweden’s NLU team.

AI Sweden’s NLU team is currently involved in four major EU-funded projects: OpenEuroLLM, TrustLLM, EuroLinguaGPT, and DeployAI. OpenEuroLLM is the most ambitious, with its goal of building an open European family of large language models that cover all European official languages and are compatible with the AI Act.

A challenge shared by all four is training data: as large language models create more and more value in organisations, good data management becomes crucial.

This means that Danila Petrelli is a key person in AI Sweden’s work on large language models. Her conclusion from the lessons learned: The way forward is to focus on the real needs in Sweden and the EU, and to strengthen collaboration within the region. 

“Sweden has the opportunity to be really competitive if we focus on relevance and quality in our own context. Nobody will prioritise the Swedish language and public sector use cases the way we do, and that’s where we can make a real difference. For instance, we can focus on creating evaluation benchmarks, becoming excellent at fine-tuning and post-processing, and developing well-curated datasets. Those are areas where smaller teams can make a real difference,” she says and continues:

“Europe’s reliance on models developed outside the EU is becoming more apparent by the day. We are training our own models, but not fast or coordinated enough to keep up with the international competition. At the European level, the smartest way to stay competitive is through collaboration: sharing infrastructure, datasets, and governance frameworks instead of recreating them in every country.”

Data management for AI is the discipline of making data usable, lawful, and meaningful. But several factors currently make data a bottleneck in the effort to build European models, according to Danila Petrelli. The constraints run along three dimensions: legal, ethical, and practical. New regulation without a consensus on how it should be interpreted is one example. The fact that many European languages are small and underrepresented is another. A lack of benchmarks that could help evaluate open base models on specific languages and/or use cases is a third.

“Taken together, this all means that on a really high level the biggest challenge is access to data at all. Thankfully, we see that it's being worked on throughout the EU. One important reason for this is a mindset shift. A few years ago, having AI systems that worked was enough for most organisations. But increasingly there are requirements of safety, sustainability, transparency, and more – and with that follows the need of better data management,” says Danila Petrelli. 

Danila Petrelli in a meeting with Sofia Hedén.

To get there, Danila Petrelli, together with colleagues at AI Sweden and other project participants, is exploring possible solutions along several dimensions. Some are technical, such as developing methods to create high-quality synthetic data as a replacement for authentic data.

“That could help when private personal information is a challenge as well as for smaller languages where there is a limited amount of data available. Still, it’s not a long-term substitute for grounded, authentic data. Models trained mainly on synthetic material risk losing touch with real-world use. In Scandinavia, we’ve seen both the value of synthetic data for smaller languages and the importance of strong metadata and traceability to stay connected to reality.”

On the legal side, Danila Petrelli sees a need for standardized rubrics for risk evaluations, methods and processes to keep track of data provenance, license terms, and metadata in various forms, and EU-wide, harmonized interpretations of current regulations. 

“I also think it could help a lot if more legal experts were trained in the technical details of how large language models are trained and used.”

She also sees a need for tailor-made benchmarks in addition to the widely used ones by which new models are measured.

“The reason we see so many new benchmarks is that language itself is complex, varied, and constantly shifting. No single evaluation can capture all aspects of performance. Every language, domain, and use case requires its own way of testing models. For large language models, we need multiple benchmarks because we should measure everything from reasoning and factual accuracy to cultural and linguistic nuances,” says Danila Petrelli.

As for why this work is important, she says that what’s ultimately at stake is Europe’s digital sovereignty. 

“In Europe, we are still building, but not fast or coordinated enough to stay independent. Our dependence on infrastructure and models developed elsewhere is increasing, and that dependence is a concrete risk. It limits our ability to govern systems on our own terms and to respond when issues arise. It also connects to competence, if we are not deeply involved in building and understanding these systems ourselves, we lose the expertise needed to shape them responsibly.”

Why data management is increasingly important for LLMs

  • To meet regulatory and legal requirements
  • To protect personal and sensitive information
  • To track and respect copyright and licensing conditions
  • To document dataset composition and limitations
  • To enable traceability and accountability in model outputs

The EU projects AI Sweden’s NLU team is part of

  • OpenEuroLLM – The project aims to build an open European family of large language models (LLMs) that cover all European official languages and are compatible with the AI Act. Partnership: 20 organizations.

  • TrustLLM – To develop European language models (LLMs) with a focus on Germanic languages. The goal is to create an open, reliable and sustainable ecosystem for the next generation of modular and extensible European LLMs. Partnership: 11 organizations.

  • EuroLinguaGPT – The aim of EuroLingua-GPT is to develop and train new, large-scale language models (LLMs) covering all official languages of the European Union. It is a strategic collaboration project between AI Sweden and the German research institute Fraunhofer IAIS (Institute for Intelligent Analysis and Information Systems).
     
  • DeployAI – The main objective of the DeployAI project is to build, implement and launch a fully functional AI-on-demand platform (AIoDP) that promotes reliable, ethical and transparent European AI solutions for use in industry, primarily for SMEs, as well as in the public sector.

Funded by the European Union.
