Neuroscience data joins the cloud
August 9, 2018
Neuroscientists and open data experts have teamed up to make a new, large set of mouse brain data publicly available and open for analysis on the cloud. Through a collaboration launched earlier this summer, Amazon Web Services is hosting the Allen Brain Observatory – Visual Coding dataset through the AWS Public Dataset Program.
That large dataset consists of the raw data from the Allen Brain Observatory, a set of experiments that captures neurons’ activity in real time in the mouse visual system. Putting this neuroscience data in the cloud, where powerful remote servers store the information and let anyone around the world access the public database, has already opened more doors than was possible before the team turned to cloud computing, said David Feng, Ph.D., Associate Director of Technology at the Allen Institute for Brain Science.
The Allen Institute is built on a model of sharing its data with the research community. Sometimes, greater insights result when multiple groups sift through the same raw numbers and images with different perspectives, asking different research questions. But when Feng and his colleagues got together to discuss how to share the observatory data, they hit a stumbling block. There was just too much data to share the way the researchers had always made information available in the past, through the Institute’s dedicated online data portal, the Allen Brain Atlas.
David Feng, Ph.D., Associate Director of Technology at the Allen Institute for Brain Science, and Justin Kiggins, Ph.D., a scientist at the Allen Institute for Brain Science, work together on generating and sharing data from the Allen Brain Observatory.
The observatory experiments entail capturing precise information about brain cell activity as mice look at different photos or movie clips, with the ultimate goal of understanding how brains process visual information. The research team on the Allen Brain Observatory has already recorded information from more than 65,000 different neurons. Some of that information is easily shareable, but some — namely, the raw video files of brain cells in action — presents a larger challenge.
To date, that dataset totals 40 terabytes, about four times the size of the Hubble Space Telescope’s yearly output, and more experiments are on the way.
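To put the 40-terabyte figure in perspective, here is a rough back-of-the-envelope sketch. The dataset size and the "four times Hubble" ratio come from the article; the 100 megabit-per-second download speed is purely an illustrative assumption, and the function names are hypothetical.

```python
# Back-of-the-envelope sizing for the 40-terabyte figure quoted above.
DATASET_TB = 40      # dataset size, from the article
HUBBLE_RATIO = 4     # "about four times" Hubble's yearly output, from the article

# Hypothetical sustained download speed; an assumption, not from the article.
ASSUMED_MBIT_PER_S = 100


def implied_hubble_tb_per_year(dataset_tb: float, ratio: float) -> float:
    """Yearly Hubble output implied by the article's comparison, in terabytes."""
    return dataset_tb / ratio


def download_days(dataset_tb: float, mbit_per_s: float) -> float:
    """Days needed to download the dataset at a constant speed (decimal units)."""
    total_bits = dataset_tb * 1e12 * 8          # terabytes -> bytes -> bits
    seconds = total_bits / (mbit_per_s * 1e6)   # megabits per second -> bits per second
    return seconds / 86_400                     # seconds -> days


if __name__ == "__main__":
    hubble_tb = implied_hubble_tb_per_year(DATASET_TB, HUBBLE_RATIO)
    days = download_days(DATASET_TB, ASSUMED_MBIT_PER_S)
    print(f"Implied Hubble output: about {hubble_tb:.0f} TB per year")
    print(f"Download time at {ASSUMED_MBIT_PER_S} Mbit/s: about {days:.0f} days")
```

Under that assumed connection speed, a single full download would run for more than a month without interruption, which hints at why a conventional download link was not a practical option.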
An iceberg of information
When the scientists initially began sharing the observatory data in 2016, they put the curated results of the experiments — the analyses of those videos — online for anyone to download. It’s still a large amount of information, but it’s manageable, said Justin Kiggins, Ph.D., a scientist at the Allen Institute for Brain Science who is part of the observatory team.
“But that data is really just the tip of the iceberg,” Kiggins said.
Without any better way to share the massive piles of raw data, the researchers decided to advertise an old-fashioned work-around: External researchers could mail a hard drive to the Allen Institute, where the researchers would load it up with as much data as would fit (at most around a dozen of the hundreds of experiments, Kiggins said) and put it back in the mail.
In the two years since they posted that note on their website, they’ve gotten a grand total of two requests to distribute the data via hard drive, Feng said.
“It was really only open in the technical sense of the word,” he said. “And that wasn’t because we didn’t want to make it as easy as possible for people to access the videos, but because we just didn’t have a way to do it.”
Enter the cloud.
Feng, Kiggins and their colleagues realized that they could use shared cloud computing services like AWS to make the data available to the research community outside the Allen Institute. In 2017, they set up a pilot project through the Allen Institute and University of Washington’s Summer Workshop on the Dynamic Brain, a two-week workshop on San Juan Island that introduces students to a variety of neuroscience and data science topics through hands-on computational projects.
The previous year, they’d brought a stack of hard drives to the workshop and had to spend a few days at the start configuring each student’s laptop to work with the data. In 2017, through their pilot collaboration with AWS, the workshop organizers not only gave the students access to much more data but also set up a universal programming environment in the cloud so the students could boot up and immediately get to work.
Instead of several days of setup, “now it’s about five clicks, five minutes of waiting, and you have a powerful computer running remotely and you can start analyzing the data” with the latest machine learning tools, Feng said.
Opening new doors
With a successful pilot behind them, the researchers started exploring a larger, longer-term solution. The collaboration was a perfect fit for the AWS Public Dataset Program, said Jed Sundwall, Open Data Lead at AWS.
“There are research questions that people would like to ask but the cost of asking the questions is too high; there’s all this pain to get to the data,” Sundwall said. “We have a very obvious solution to that.” After the Dynamic Brain workshop, which Sundwall also helped coordinate on the cloud computing side, “it became very clear that the Allen Institute team understood the value of cloud computing and knew how much further it could go,” he said.
The AWS Public Dataset Program covers the cost of hosting public datasets on the cloud for two years. The Allen Brain Observatory dataset joined that program in June. Since then, a handful of outside groups have already accessed the data, an encouraging increase from the two hard drive requests in the previous two years, Feng said.
The researchers are excited not only about a better storage solution for the information, but about new ways to interact with that data. There’s a lot being done in the broader community to develop new tools to work with scientific data through cloud computing services like AWS, Kiggins said.
“This could completely open up doors to exploring and communicating about this work,” he said. “When you have 40 terabytes of data right behind your browser, the future opportunities are awesome.”