Technical Optimization for Limited Data Sets – Increasing Performance and Efficiency in LLM Training
Large Language Models (LLMs) require massive amounts of data. For languages like Swedish, finding sufficient volumes is difficult, and the task becomes even harder when filtering and adapting data for specific purposes. Consequently, scaling and performance are often limited for smaller languages. AI Sweden's researchers aim to enhance the performance, quality, and stability of LLMs through systematic improvements.
Felix Stollenwerk, PhD, Senior Research Scientist at AI Sweden, has contributed to several papers in 2025 with European partners, published at prestigious conferences such as ACL (Association for Computational Linguistics) and EMNLP (Empirical Methods in Natural Language Processing). These papers cover two distinct parts of the training process: more effective data filtering for cleaner input, and the creation of more robust and balanced word representations (embeddings). Both are united by the ambition to create better and more efficient language models.
Felix Stollenwerk presenting his poster at the ACL 2025 conference.
Data Filtering with Cross-Lingual Effects
Early in the training process, data is collected and then filtered into a usable dataset. In "Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models," the research group describes a new method for data filtering. This approach not only makes larger amounts of curated data usable but also highlights a cross-lingual effect that occurs when translating English data into various other languages.
Traditional rule-based data filtering in LLM training risks discarding large amounts of usable text, because hand-written heuristics cannot distinguish genuine junk from high-quality content that merely deviates from the rules. Such rules are also difficult to transfer to smaller languages. The JQL (Judging Quality Across Languages) method addresses this by introducing new ways to prune data that retain larger quantities of high-quality text, even for smaller languages.
Instead of strict rules, JQL uses "LLM-as-a-judge" to process large volumes of data with an optimized AI model—an innovative approach in the field. The preceding step, however, is equally vital: the researchers established a "ground truth" from human-annotated data. "You need the human component to have a point of comparison. There are no technical metrics for human judgment; you have to create a ground truth to ensure the model behaves similarly to a human," Stollenwerk clarifies. By using LLMs of different sizes (Gemma, Mistral, and Llama), the researchers could compare which open models yielded the best results.
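Choosing which open judge model best matches the human annotations boils down to an agreement check between judge scores and the ground truth. The model names below come from the article, but the scores and the use of Pearson correlation as the selection criterion are illustrative assumptions, not details from the paper:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical quality scores from three candidate judge models on the
# same five documents, plus the human ground-truth annotations.
human = [0.9, 0.2, 0.7, 0.4, 0.8]
judges = {
    "gemma":   [0.8, 0.3, 0.6, 0.5, 0.9],
    "mistral": [0.5, 0.5, 0.5, 0.4, 0.6],
    "llama":   [0.9, 0.1, 0.8, 0.3, 0.7],
}

# Pick the judge whose scores agree best with the human annotations.
best_judge = max(judges, key=lambda name: pearson(judges[name], human))
```

The point is that the "best" judge is defined entirely relative to human judgment: without the annotated ground truth there is nothing to correlate against.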
Crucially, quality here is defined in strictly technical terms: data that leads to a better model or maintains performance with less data. This efficiency gain is decisive. "Even if you had unlimited data, it is still vital not to train on just anything, but to utilize the highest quality data optimally. It’s about efficiency—training models is expensive in terms of compute, energy, and finances. With high-quality data, less compute and energy are needed to train the same models," Stollenwerk explains.
High data quality can be hard to define. JQL establishes what constitutes high quality by having humans teach models to discriminate quality through annotation. For example, coherent text was retained even if it was incomplete or slightly off-topic, provided it contained key concepts suitable for educational purposes. New methods were also developed to handle the relationship between punctuation and letters, which rule-based heuristic methods often classify as "junk." Both approaches allow more exceptions than rigid rules do, preserving a larger volume of data.
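In practice, score-based pruning of this kind reduces to keeping the documents whose judge score clears a threshold. A minimal sketch, where the example documents, scores, and threshold value are all hypothetical:

```python
def filter_by_judge_score(docs, scores, threshold):
    """Keep documents whose LLM-judge quality score meets the threshold.

    Unlike a hard rule-based filter, a coherent but incomplete document
    can survive here as long as its judged quality is high enough."""
    return [doc for doc, score in zip(docs, scores) if score >= threshold]

docs = ["coherent but unfinished article...", "a$f%! link spam !!", "clean essay"]
scores = [0.72, 0.08, 0.95]          # hypothetical judge scores in [0, 1]
kept = filter_by_judge_score(docs, scores, threshold=0.5)
```

Here the first document survives despite being unfinished, exactly the kind of exception a rigid heuristic filter would not allow.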
From Large Language Models to Small Annotators
To make the process scalable and cost-effective, the next step was distillation. The expertise of the large LLM judges was transferred to significantly smaller annotators (built on Snowflake Arctic Embed v2.0), which performed quality assessments very efficiently—approximately 11,000 annotations per minute on an A100 GPU.
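Distillation here means fitting a small, cheap student to reproduce the large judge's scores from document features. A toy sketch of that idea using plain gradient descent on a linear student; the features and teacher scores are invented, and the real annotators are small heads trained on Snowflake Arctic Embed v2.0 embeddings rather than this one-dimensional stand-in:

```python
def distill_linear(features, teacher_scores, lr=0.1, epochs=500):
    """Fit a tiny linear student to mimic the scores of a large LLM judge
    via plain gradient descent on the mean squared error.

    Illustrative stand-in for training a small annotator head on document
    embeddings; not the method's actual implementation."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, teacher_scores):
            # Prediction error of the student against the teacher's score.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy one-dimensional "embeddings" and invented teacher scores.
feats = [[1.0], [2.0], [3.0]]
teacher = [0.5, 1.0, 1.5]
w, b = distill_linear(feats, teacher)
```

Once trained, the student scores new documents at a fraction of the teacher's cost, which is what makes judge-quality filtering feasible at pretraining scale.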
Fundamentally, it's about using AI to distinguish between good and bad data. By distilling large models into smaller, faster tools, we get the same judgmental capability at a fraction of the cost and energy consumption.
Felix Stollenwerk
Senior Research Scientist at AI Sweden, PhD
One of the most critical insights from the JQL research is the robust cross-lingual ability. By translating a small amount of English, human-annotated data into 35 different languages, the distilled models can learn to judge data quality across vastly different language families—even in zero-shot scenarios with languages like Thai, Mandarin, and Arabic. This demonstrates that the method is useful even for languages where the model lacks sufficient data.
As early as the development of GPT-SW3, AI Sweden's research group noted that cross-lingual effects strengthen the robustness of language models and provide desirable performance improvements. This research further strengthens the argument for developing multilingual models and aligns with the EU projects in which the NLU team participates. Training shared filtering models together with other languages maximizes the cross-lingual effect and makes the most of the limited data available. This is an advantage for languages like Swedish, and the hope is that it will also benefit minority languages such as Sami and Meänkieli.
However, performance gains are not achieved through data quality alone; an equally important part of increasing LLM efficiency lies in optimizing the training process itself.
Coupled Adam: For Better Word Representations
A common problem with word representations (embeddings) is anisotropy: instead of spreading out across the embedding space, the vectors cluster in a narrow cone of it. This limits how much semantic distinction the representations can carry and thus the model's expressive power. In the paper "Better Embeddings with Coupled Adam," Felix Stollenwerk, together with Tobias Stollenwerk, focuses on optimizing the training process itself to address this issue. Through a modified optimization algorithm called Coupled Adam, the model retains more of the language's vital structural and semantic components.
The research on Coupled Adam identifies the root cause of the bias in the standard Adam optimizer: its second moment. Because this moment is estimated for each parameter individually, the effective step size is normalized per parameter, which is helpful for infrequent words (sparse gradients). A side effect, however, is that the entire set of word representations collectively drifts away from the origin. Stollenwerk's research shows a simple but powerful adjustment: the second moment is coupled across all word representation vectors by replacing it with their average, giving the model a more balanced distribution of embeddings.
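Written out in the standard Adam notation, the adjustment is small. This is a paraphrase of the idea, not the paper's exact formulation. With gradient $g_t$, learning rate $\alpha$, decay rates $\beta_1, \beta_2$, and a small constant $\epsilon$, standard Adam updates each parameter $\theta$ as

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
$$

$$
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}.
$$

Coupled Adam changes only the embedding parameters: for embedding vectors $e_1, \dots, e_V$, each per-vector second moment $\hat v_t^{(i)}$ is replaced by the shared average

$$
\bar v_t = \frac{1}{V} \sum_{i=1}^{V} \hat v_t^{(i)},
$$

so all embedding vectors see the same effective step size and the collective drift away from the origin is removed.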
Instead of adapting the learning rate individually for every single word, Felix and Tobias use a shared average value so that all words are treated more equally. This avoids the problem of all embeddings being pulled in the same direction, a failure mode that plain Stochastic Gradient Descent (SGD) does not exhibit, while retaining Adam's ability to learn quickly and adaptively.
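As a concrete sketch, one coupled update step for an embedding matrix might look like this. It illustrates the averaging idea under the assumption that the bias-corrected second moment is averaged over the vocabulary axis; it is not the authors' reference implementation:

```python
import numpy as np

def coupled_adam_step(E, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One update step for an embedding matrix E (vocab size V x dim D)
    with the second moment coupled across all embedding vectors.

    Illustrative sketch of the idea in "Better Embeddings with Coupled
    Adam", not the paper's reference implementation."""
    # Standard Adam moment estimates, computed per parameter.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    # Coupling: every embedding row shares the averaged second moment, so
    # rare and frequent words get the same effective step size and the
    # embedding cloud is not collectively pushed away from the origin.
    v_bar = v_hat.mean(axis=0, keepdims=True)      # shape (1, D)
    E = E - lr * m_hat / (np.sqrt(v_bar) + eps)
    return E, m, v
```

With standard Adam, the denominator would be the per-parameter `np.sqrt(v_hat)`; the single changed line is the coupling.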
Results show that Coupled Adam creates significantly higher quality word representations. This, in turn, has a positive impact on both upstream and downstream performance for LLMs trained on sufficiently large datasets, leading to a more robust and efficient learning process.
Streamlining and Improving Language Models
Although JQL and Coupled Adam are technically distinct, they are driven by the same ambition: to raise LLM performance through systematic improvements that are significant for low-resource languages and for enhancing model quality and robustness generally.
For researchers and developers in language technology, these methods point toward new strategies for developing future LLMs. While core components like massive computing power and data volumes remain important, Stollenwerk’s research shows that there are still underdeveloped mechanisms within the field. His contributions demonstrate that higher input quality (JQL) and more robust training processes (Coupled Adam) are achievable. These methods pave the way for doing more with less data, leading to solutions with lower environmental impact and more cost-effective architectures. This research highlights the NLU team’s contributions toward democratizing LLM development for resource-poor languages.
Related content
Scientific publications
Language technologies (NLU)