Swedish Language Data Lab
Head of Project Portfolio Johanna Bergman tells us about the project status of the Swedish Language Data Lab, September 2020
Natural Language Processing (NLP) creates opportunities to develop methods, tools, and applications that are based on machine understanding of human language. In this way, NLP makes information-bearing data more available and accessible to us in many different contexts. NLP-based applications can help us extract the relevant information from large amounts of language data, depending on the context, by producing summaries, simulations, interpretations, and much more.
The algorithms that form the basis of these applications are called language models. The development of language models specifically designed for Swedish relies to a large extent on data written (or spoken) in Swedish. However, Swedish is a small language, and global players rarely have an interest in producing annotated data sets for it. Developing language models for Swedish is important for maintaining linguistic diversity and promoting innovation in the field of NLP in Sweden, which will benefit a wide range of organisations in academia, industry and the public sector.
The Swedish Language Data Lab is a project funded by Vinnova and coordinated by AI Sweden. It is an explorative project, based on collaboration between leading players in the field of NLP and stakeholders from the public sector and academia. The aim of the project is to gather know-how about some of the important steps in the NLP implementation process, and the challenges connected with them, from identifying the needs to evaluating trained language models. The work is divided into several work packages, which aim to:
- Develop and make available trained Swedish language models: a named entity recognition (NER) model and two sentiment analysis models.
- Produce a technical, legal, and ethical framework for processing and facilitating accessibility to Swedish language data sets.
- Analyse text and models from the perspective of spoken dialogue.
- Perform requirements analysis and data harvesting in the public sector.
- Conduct preliminary studies for NLP specifically developed for the medical and legal domains.
In terms of data, the focus for the upcoming year will be on investigating the alternatives for facilitating and increasing access to Swedish datasets in general. One part of this work is to begin developing a platform for training models without seeing the actual data.
The overall goal of the project is to create a national knowledge hub for NLP that will accelerate innovation, research, and applications in this area. The project forms part of Vinnova’s “Data-driven innovation” funding programme, which aims to “increase the level of expertise in reusing data in innovations in Sweden”. It is also in line with the EU strategy for the digital transition and, in particular, for data and artificial intelligence. The strategy highlights the importance of broadening access to data in order to “create added value for citizens”, while at the same time ensuring that individuals have greater control over their own data.
The project is coordinated by AI Sweden. Recorded Future, Gavagai and Talkamatic provide language technology expertise, while Språkbanken, the language research unit at the University of Gothenburg, and the Swedish Association of Local Authorities and Regions (SKR) are stakeholders and data owners. A wide variety of other stakeholders also support the project by providing letters of support and taking part in the reference group.
Project period: 2019-06-01 to 2021-05-30