Swedish Language Data Lab
The Swedish Language Data Lab is a project funded by Vinnova, Sweden’s innovation agency, and implemented by AI Innovation of Sweden. By working with leading players in the field of language technology and natural language processing (NLP) and with other stakeholders, the project aims to:
- develop and make available a general Swedish language data set.
- develop and make available trained Swedish language models.
- produce a technical, legal and ethical framework for processing and publishing Swedish language data sets.
- analyse text and models from the perspective of spoken dialogue.
- conduct preliminary studies on the need for language data sets that are specific to the medical and legal domains.
The development of language models for Swedish relies on data sets of Swedish text, and access to these is currently limited. Swedish is a small language, and global players rarely have an interest in producing annotated data sets for it. Providing reference data sets and language models in Swedish will therefore help to maintain linguistic diversity and promote innovation in the field of NLP in Sweden, benefiting a wide range of organisations in academia, industry and the public sector.
One limitation on openly sharing data in text form is that it may be protected by intellectual property rights and/or include personal data. A set of clearly formulated ethical values that describe how AI systems should be developed and used will play an important role in attracting talented people who want to contribute to positive developments for the benefit of people and society as a whole. The data lab will make data available on the basis of ethical principles, safeguard personal integrity and comply with current legislation.
The overall goal of the project is to create a national knowledge hub and an NLP resource that will accelerate innovation, research, and applications in this area. The project forms part of Vinnova’s “Data-driven innovation” funding programme, which aims to “increase the level of expertise in reusing data in innovations in Sweden”. It is also in line with the EU strategy for the digital transition, in particular its strands on data and artificial intelligence. The strategy highlights the importance of broadening access to data in order to “create added value for citizens”, while at the same time ensuring that individuals have greater control over their own data.
Project plan and status (June 2020)
WP1: Project coordination
- (Ongoing) Continuous collaboration and coordination with the other NLP projects: Language Models for Swedish Authorities and Swedish Medical Language Data Lab
WP2: Data access
- (Ongoing) Mapping of the data to be used for training of the sentiment analysis model
- (Ongoing) Needs and requirements analysis among members of SKR
- (Ongoing) Compiling annotation guidelines for the sentiment analysis, including evaluation of the guidelines from the spoken dialogue perspective
- Summer 2020: Annotation of data
WP4: Licenses, integrity, and intellectual property rights
- (Ongoing) The technical procedure for distribution of the NER model is to be decided
- (Ongoing) Project learnings so far about the process for sharing models and data in terms of GDPR and IP rights are to be shared
WP5: Data Factory, adjustment, and development
- (Ongoing) The technical, legal, and ethical framework for distribution of data and models in the Data Factory is under development.
WP6: Models for named entity recognition (NER) and sentiment analysis
- December 2019: A first reference group meeting was held where the NER model was presented, along with an evaluation of the model from the perspective of spoken dialogue. Presentations were also given by the stakeholders and experts.
- (Ongoing) Identifying additional use cases for testing, evaluation, and application of the models
- (Ongoing) Further development of the NER model
- Fall/winter 2020: Development of the sentiment analysis model
WP7: Models trained on background data
- Fall 2020: Start of the pilot to develop a solution that lets developers train models without accessing the data
- November 2019: The preliminary study on language data specific to the medical domain was completed and resulted in another project funded by Vinnova, the Swedish Medical Language Data Lab.
- Fall 2020: Starting up the preliminary study on data specific to the legal domain
Two examples of related advances in the field of Swedish NLP that were published in 2020:
- The National Library of Sweden has distributed three BERT-based models in Swedish on their GitHub page.
- The Swedish Public Employment Service has shared two BERT models, trained on Swedish Wikipedia, on their GitHub page.
The project is being coordinated by AI Innovation of Sweden. Recorded Future, Gavagai and Talkamatic are providing language technology expertise, while Språkbanken, the language research unit at the University of Gothenburg, and the Swedish Association of Local Authorities and Regions (SKR) are stakeholders and owners of data. A wide variety of other stakeholders are also supporting the project by providing letters of support and taking part in the reference group.
Project period: 2019-06-01 to 2021-05-30