
New NLP project to improve linguistic understanding in Swedish AI applications

Tuesday, May 28, 2019

A thoroughly developed base for NLP (Natural Language Processing) is one of the cornerstones of successful AI applications. AI Innovation of Sweden is now starting a project together with our partners, Swedish Language Data Lab (Svenskt Språkdatalabb), to create a comprehensive set of NLP data for the Swedish language.


A massive vocabulary is needed to interpret the human language

NLP (Natural Language Processing) makes it possible to interpret and analyze human language, including the underlying meaning and the experience-based conclusions that we humans draw naturally when we read or listen to a text.

By means of mathematics and algorithms, computers can process language automatically, but in addition to pure understanding, other analyses such as sentiment analysis are also required to understand a text. Sentiment analysis means attempting to understand the feelings or opinions contained within a text, and can involve determining whether the text indicates someone's opinion or mood. In order to train computers in this respect and thereby create a linguistically intelligent AI engine, a basic requirement is access to large quantities of comprehensively annotated data that can be used to train the models.
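
To make the role of annotated data concrete, the following is a minimal sketch of a sentiment classifier trained on labeled text. It is an illustration only, not the project's method: the six Swedish example sentences, their labels, and the function names `train` and `classify` are invented for this sketch, and the technique shown is a plain Naive Bayes classifier with add-one smoothing. A real system would need far more annotated data than this.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated examples (text, label), invented for illustration only.
TRAINING_DATA = [
    ("jag älskar den här filmen", "positive"),
    ("fantastisk service och trevlig personal", "positive"),
    ("mycket bra produkt", "positive"),
    ("jag hatar den här filmen", "negative"),
    ("dålig service och otrevlig personal", "negative"),
    ("mycket dålig produkt", "negative"),
]


def train(examples):
    """Count words per label and examples per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest Naive Bayes log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the prior: how common this label is in the training data.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING_DATA)
print(classify("bra film och trevlig personal", word_counts, label_counts))  # prints: positive
```

The classifier can only weigh words it has seen with a label attached, which is why the quantity and quality of annotated text matters so much for languages with limited open data.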

Swedish Language Data Lab will be a national NLP resource

AI Innovation of Sweden will now begin work to create an infrastructure for Swedish data sets in which annotated, classified data and fully trained models are made available. The objective of the Swedish Language Data Lab is to create a national knowledge node within language technology and to develop Swedish reference data sets for NLP, which will be made openly accessible in AI Innovation of Sweden's data factory.

In addition to ordinary glossaries and existing Swedish data sets, raw data will be collected from a large number of sources, including news text, social media, internet forums and reviews from various fields. Besides annotated data, the project will produce a number of fully trained models for entity tagging and sentiment analysis, among other things, in order to enable additional research and innovation.

The Swedish Language Data Lab project is partially financed by Vinnova and will be operated by AI Innovation of Sweden in collaboration with partners with expertise within NLP, including Recorded Future, Gavagai, Talkamatic, Språkbanken and SKL, together with a reference group consisting of a wide range of stakeholders from various fields.

Vanja Carlén, Project Manager at AI Innovation of Sweden, will lead the project.

Why does AI Innovation of Sweden want to be a part of this project?
"The new technology places high demands on models that understand and can generate natural language, and we are increasingly encountering NLP without being aware of it. Access to models and open data for the Swedish language will facilitate development of Swedish language applications for industry, academia and the public sector. We don't want to fall behind!"

Who do you think will be the major users of Swedish Language Data Lab?
"There has been a great deal of interest in the project and we see stakeholders ranging from academia to industry and the public sector. Many applications require large sets of Swedish training data, and we are now seeing a great deal of interest in pre-trained models and, for example, translation applications. We also see a great deal of interest in further development of the data sets for applications within specific domains, such as law and medicine."

What do you see as the greatest difficulty associated with the linguistic part of AI? What are and what will be the greatest challenges?
"Swedish is a small language and global players do not ordinarily have any interest in developing annotated data sets for Swedish. The development of models adapted for the Swedish language is becoming all the more important, and the amount of open Swedish data is limited. Large sets of data are needed for the development of NLP models, and a limiting factor is that the texts may be protected by copyright or contain personal data and therefore cannot be shared. In this project, we will also develop background-trained models that will handle this: we will provide models trained on sensitive data without sharing the underlying data. The provision of Swedish text and models will contribute to maintaining linguistic diversity and promote innovation within the NLP field in Sweden, which is something that will benefit a multitude of interested parties."

You have been a project manager at AI Innovation of Sweden for a while now. What other projects do you work on?
"In addition to my involvement in the Swedish Language Data Lab, I will be project manager for the National Swedish Space Data Lab (Nationella Rymddata-labbet) together with the Swedish National Space Agency, RISE and Luleå University."