The Data Factory of AI Innovation of Sweden aims to accelerate research and innovation by making data available in a uniquely enabling way.
The Data Factory shall therefore:
- Make data available to AI researchers and developers. It will also provide the expertise and infrastructure for managing and accessing such datasets (whether donated, acquired or newly developed)
- Be a professional recipient of data (e.g. donated data), providing legal, ethical and technical frameworks by design so that datasets are used according to agreed principles and constraints
- Provide training for users on the frameworks above
- Initiate and run projects that: generate strategically important datasets; provide and develop expertise and legal structures; create an infrastructure that facilitates research on datasets; help partners set up and fund projects to develop datasets
- Actively strive to ensure that datasets are made available and easily accessible across industries and application areas, in the most open way possible
- Provide resources, such as IT infrastructure, tools and expertise related to the data engineering process and machine learning
- Provide competence and methods for using secure cloud services and external resources for training models
- Generate and publish important and interesting research problems related to its activities
- Seek collaboration with other, larger data centers and computational initiatives to optimize service and capacity
- Consider and limit environmental impact, for example the electricity consumption of data centers
The Data Factory will recruit a team of experienced individuals focusing on data procurement, data science engineering, methods, versioning, and access & support.
Resources & Services
The Data Factory shall have the capacity and know-how to manage data end-to-end for pre-commercial projects. This includes methods, storage, processing power for training algorithms, and versioning and access management from both an IP and a legal perspective. The Data Factory will provide relevant tools for annotation, de-identification, etc. Resources (primarily processing power, tools, etc.) will be made available within current budget constraints and charged at cost for high-volume users/projects. Collaborative projects between partners will be prioritised over the needs of individual partners.
The resources provided by the Data Factory will be available to all partners, across geographical boundaries and in accordance with the terms & conditions for each dataset. AI Innovation of Sweden will periodically update which resources and services its partners have access to, and specify areas where further contributions on a project-by-project basis might be necessary.
The main focus is to support the data process with the center's own hardware, tools and competence. For training, contracts with external resources (such as HPC centers or cloud services) will be used when possible, depending on the security and integrity level of the datasets; otherwise the center's own resources (a computer cluster or tightly coupled GPUs) will be used. Finally, a small testbed for inference is also planned to be included in the center.
In order to facilitate the use of external facilities rather than the center's own (on-premises) resources, the center will provide competence and support for secure connections to cloud services. This includes initiating innovation activities, holding seminars and auditing cloud service providers. Process support and contracts with selected cloud service providers (e.g. Microsoft Azure) will also be provided to the partners of the center.
Typically, the infrastructure will consist of software and hardware tools for de-identification, image calibration, annotation platforms for images, signal values or text, tools for synthetic data (e.g. rendering), simulators, version handling, access control and cyber security, backup systems, high-capacity storage, analysis workstations, encryption, AI training frameworks and data tooling (e.g. Caffe, Caffe2, Microsoft CNTK, CUDA, MXNet, PyTorch, TensorRT and TensorRT Inference Server, Theano, Torch, Hops, Hadoop, MINERVA, Chainer, OpenDeep, Mocha (Julia), Spark, Kafka, Keras, Pylearn2, Flink, Jupyter, Hive, Airflow, InfluxDB, Grafana, Kibana, Elasticsearch), access to external high-performance computing, access and support for cloud services (e.g. Azure), GPU stacks, visualization and analysis tools, and testbeds for inference.
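To make the de-identification tooling mentioned above concrete, the following is a minimal, hypothetical sketch of one common step: replacing a direct personal identifier with a salted, non-reversible pseudonym. The record fields and salt value are illustrative assumptions, not the Data Factory's actual tooling, and real de-identification also has to handle quasi-identifiers, free text and images.

```python
import hashlib

# Hypothetical project-specific secret, stored separately from the
# published dataset so pseudonyms cannot be recomputed by data users.
SALT = b"project-specific-secret"

def pseudonymize(identifier: str) -> str:
    """Map a personal identifier to a stable, non-reversible pseudonym."""
    digest = hashlib.sha256(SALT + identifier.encode("utf-8"))
    return digest.hexdigest()[:16]

# Example records with an illustrative (fake) personal number.
records = [{"person_id": "19790101-1234", "value": 42}]
deidentified = [
    {"person_id": pseudonymize(r["person_id"]), "value": r["value"]}
    for r in records
]
```

Because the mapping is deterministic, the same person receives the same pseudonym across dataset versions, which preserves linkability for research while removing the direct identifier.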
During the first year of operation, an investment plan for the Data Factory will be developed.
Data Factory Process
In a series of workshops and meetings with stakeholders, a data factory process has been developed and agreed. The figure shows the process model (note that this is a highly iterative process). The objective is that the Data Factory shall support all the included (dotted-line) activities after three years.