Swedish Language Data Lab

A well-developed foundation for Natural Language Processing (NLP) is one of the cornerstones of successful AI applications. NLP is one of AI Sweden's strategic areas, and the Swedish Language Data Lab was the first NLP project to be initiated.

Update from the Swedish Language Data Lab

Johanna Bergman, Head of Project Portfolio, tells us about the project status of the Swedish Language Data Lab, September 2020.

Background

Natural Language Processing (NLP) offers the opportunity to develop methods, tools, and applications that are based on machine understanding of human language. NLP makes the meaning contained within data more accessible to us in many different contexts. NLP-based applications can help us extract the information that is relevant in a given context by producing summaries, simulations, interpretations, and much more from large amounts of language data.

The algorithms that form the basis of these applications are called language models. The development of Swedish-specific language models relies to a large extent on data written (or spoken) in Swedish. Swedish is a small language, and global players rarely have an interest in producing annotated data sets for it. Developing Swedish language models is important for maintaining linguistic diversity and promoting innovation in the field of NLP in Sweden, which will benefit a wide array of organisations in academia, industry, and the public sector.

Purpose

The Swedish Language Data Lab is a project funded by Vinnova and coordinated by AI Sweden. It is an explorative project, based on collaboration between leading players in the field of NLP and stakeholders from the public sector and academia. The aim of the project is to gather the know-how, and the challenges, associated with some of the important steps in the NLP implementation process, from identifying needs to evaluating trained language models. The work is divided into several areas with the following goals:

Develop and make available trained Swedish language models: a named entity recognition (NER) model and two sentiment analysis models.
Produce a technical, legal, and ethical framework for processing Swedish language data sets and facilitating access to them.
Analyse text and models from the perspective of spoken dialogue.
Perform requirements analysis and data harvesting in the public sector.
Conduct preliminary studies for NLP specifically developed for the medical and legal domains.
Develop a platform for training models without seeing the actual data.

Project goal

The overall goal of the project is to create a national knowledge hub for NLP that will accelerate innovation, research, and applications in this area. The project forms part of Vinnova’s “Data-driven innovation” funding programme, which aims to “increase the level of expertise in reusing data in innovations in Sweden”.

Facts

The project is coordinated by AI Sweden. Recorded Future, Gavagai, and Talkamatic provide language technology expertise, while Språkbanken, the language research unit at the University of Gothenburg, and the Swedish Association of Local Authorities and Regions (SKR) are stakeholders and data owners. A wide variety of other stakeholders also support the project by providing letters of support and taking part in the reference group.

Project period: June 2019 - May 2021