Swedish Language Data Lab
Natural language processing (NLP) makes it possible to interpret and analyze human language, including the underlying meaning and the experience-based conclusions that humans draw naturally when reading or listening to a text. Developing models adapted for Swedish is becoming increasingly important, but the amount of open Swedish data is limited. Developing NLP models requires large data sets, and a limiting factor is that texts may be protected by copyright or contain personal data and therefore cannot be shared.
In the Swedish Language Data Lab (Svenska Språkdatalabbet), we will create an infrastructure for Swedish data sets in which annotated, classified data and fully trained models are made available. The objective is to create a national knowledge node within language technology and to develop Swedish reference data sets for NLP, which will be made openly accessible in AI Innovation of Sweden's data factory.
Swedish NLP data and trained models as a national resource
In addition to ordinary glossaries and existing Swedish data sets, raw data will be collected from a large number of sources, including news text, social media, internet forums and reviews from various fields. Besides annotated data, the project will produce a number of fully trained models, for entity tagging and sentiment analysis among other things, to enable further research and innovation.
In this project, we will also develop background-trained models to handle the restrictions on copyrighted and personal data: we will provide models trained on sensitive data without sharing the underlying data. Providing Swedish text and models will help maintain linguistic diversity and promote innovation within the NLP field in Sweden, which will benefit a wide range of interested parties.
The Swedish Language Data Lab project is partially financed by Vinnova and is run by AI Innovation of Sweden in collaboration with partners with NLP expertise, including Recorded Future, Gavagai, Talkamatic, Språkbanken and SKR, together with a reference group of stakeholders from various fields who have concrete needs for the results.
Project period: 1 June 2019 to 30 May 2021