More than you think. Inside AI Sweden’s Data Factory, Zenseact has used a dataset from the Swedish University of Agricultural Sciences to speed up important R&D work by 6-12 months.
Mina Alibeigi works as an AI Researcher in Zenseact’s Technology and Collaboration Team. Zenseact is developing a software platform for autonomous driving. A lot of Mina’s time is spent on research and proof-of-concept projects, i.e. demonstrating that state-of-the-art methods or ideas have practical potential. This is often done in collaboration with universities or startup companies.
When building autonomous vehicles, “federated learning,” i.e. training an algorithm across multiple decentralized devices or servers, without exchanging locally stored data samples, plays a crucial role.
Mina points out, “With federated learning, there is less need to move data from one point to another. Instead, you train models locally at the same location as the data. These local models can be shared with fewer regulatory challenges, and then combined into a common global working model. So, for Zenseact, federated learning is an important technology. We can, for example, train models in one country without moving the data across borders.”
Nonetheless, there are still data challenges even when training models in a federated way. Mina Alibeigi and her colleagues at Zenseact have access to autonomous driving data which can’t be shared freely in R&D projects with external partners. Even within Zenseact’s organization, there are regulatory hurdles to overcome when you want to move raw data and non-trained models between borders.
“So, we started to examine if we could use proxy data during our R&D phase,” says Mina Alibeigi.
Proxy data are datasets that share a lot of the same features of the data related to the solution you are aiming to build, minus the regulatory aspects of the data.
In this specific case, video recordings of seabirds, breeding on the cliffs of Stora Karlsö off the coast of Gotland, Sweden, turned out to be a perfect match for Zenseact’s needs.
Ebba Josefson Lindqvist, project manager at AI Sweden’s Data Factory, explains, “The dataset was part of a project called FedBird, where we worked with the Swedish University of Agricultural Sciences to help their researchers build a model to annotate birds in video material.”
When working on other projects in the Data Factory, Mina Alibeigi saw similarities between road data and bird data. She realized the value that could be gained by experimenting with the FedBird dataset in Zenseact’s R&D methods and activities.
“For one thing, the FedBird data is real-world data, not synthetic. Also, the challenges related to the bird dataset are like the ones we need to address in developing solutions for autonomous driving; for example, variations in light conditions, and differences in the birds' appearance are like differences between pedestrians. The dataset being split in two, one larger and one smaller, was also valuable to us, since we can’t get access to the same amount of data from all countries around the world.“
There are also two important differences between road data and bird data. There is no GDPR regulating the privacy of seabirds and there are no business secrets hidden in the “proxy” seabird data.
“Once we realized that we could use the FedBird dataset that was available in AI Sweden’s Data Factory, we could start working with our partners straight away. I estimate that the FedBird data gave us between a 6 to 12-month head start, compared to waiting for clearance on the real-world road data.”
How do you reuse your work with proxy data in actual implementations?
“We rarely use work based on proxy data directly in our production models or other code. Instead, it’s the knowledge and learnings we gain from working with the proxy data, often in collaboration with others, that speed things up.”
1. Develop an understanding of what proxy data is and the benefits it can bring, for example, faster development times, easier collaboration with other organizations and the possibility to build different solutions based on the same shared dataset.
2. Find datasets that are representative of the challenges you have in the actual dataset your models are going to use. Without these similarities between the real data and the proxy data, it will be hard to draw any generalized conclusions from the work with the proxy data.
3. Find organizations to collaborate with. Share knowledge with each other and/or benchmark different solutions’ results.
Further read
→ Working with proxy data in Data Factory
→ The Baltic Seabird dataset