
Scientific Computing

Creating a best-in-class open, reproducible, and scalable platform for scientific data storage, analysis, and sharing.

Goals and Approach

The Scientific Computing group at the Allen Institute for Neural Dynamics is working to accelerate scientific discovery for the scientific community at large by creating a best-in-class open, reproducible, and scalable platform for scientific data storage, analysis, and sharing. This group manages data standardization, automated processing, analysis infrastructure, and sharing. They continue the Allen Institute’s tradition of pushing the boundaries of open and reproducible science in the face of unprecedented challenges brought by the massive size and complexity of their data.

Sharing data with the community soon after acquisition is essential to maximizing impact, and this has important implications for data infrastructure. This group’s data will be standardized, compressed, described by rich metadata, and pushed to the public cloud. Anyone with an internet connection can then view and analyze the data without having to download increasingly massive files. They leverage modern, cloud-based data storage and analysis tools that scale to the most demanding data modeling tasks.
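As an illustration of this kind of download-free access, the sketch below opens a public Zarr dataset directly from object storage and reads only a small slice, so the network fetches just the chunks that are needed. The bucket name, dataset path, and group layout are placeholders, not actual AIND assets.

```python
# Minimal sketch: remote, lazy access to a public Zarr dataset on S3.
# Bucket, path, and array layout below are illustrative placeholders.
import s3fs
import zarr

# Anonymous access to a public S3 bucket (no credentials needed).
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="example-open-data-bucket/example_session/image.ome.zarr", s3=fs)

group = zarr.open(store, mode="r")
print(group.tree())  # browse the hierarchy without downloading any arrays

# Assumed layout: "0" is the full-resolution level of a t, c, z, y, x multiscale image.
# Only the chunks covering this slice are transferred over the network.
subvolume = group["0"][0, 0, 100:110, :512, :512]
print(subvolume.shape)
```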

Projects


The well-documented reproducibility crisis in science is in part caused by the immense work required to write and maintain sustainable data processing and analysis software. Code Ocean is a cloud-native platform that enables scientists to easily version and track their entire software environment, including hardware configuration, to ensure that every processing or analysis run can be re-run to produce identical results at any point in the future. We use Code Ocean as a platform for both internal teams and external collaborators to share data, code, and results with each other from the comfort of a web browser.

Our cutting-edge microscopy platforms (see Multi-scale Molecular Anatomy Group) collect hundreds of terabytes of data at gigabytes per second around the clock. Handling data at this scale requires carefully managed network infrastructure, state-of-the-art storage systems, and fast data compression algorithms. To share massive imaging data with our community in the public cloud, we upload imaging data in the OME-NGFF file format with the cloud-friendly Zarr backend. We are building on community tooling to deploy cloud-based image processing pipelines that handle large-scale stitching, object identification, and alignment to the Allen Common Coordinate Framework.
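The sketch below shows one way a volume can be written to OME-NGFF with the Zarr backend using the community ome-zarr-py library. The output path, array contents, and axis layout are illustrative; production pipelines add scale metadata, chunking choices, and cloud storage targets.

```python
# Sketch: write a 3-D volume to an OME-NGFF (Zarr) group with ome-zarr-py.
# Path, data, and axes are placeholders for illustration only.
import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

volume = np.random.randint(0, 2**16, size=(64, 1024, 1024), dtype="uint16")  # z, y, x

store = parse_url("example_tile.ome.zarr", mode="w").store
root = zarr.group(store=store)

# write_image builds the multiscale pyramid and the OME-NGFF metadata for it.
write_image(image=volume, group=root, axes="zyx")
```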

Rich metadata is critical to reproducible science. Inspired by the BIDS, HCA, and OME efforts in metadata standards, we extend and develop standardized metadata to ensure that all raw and processed data is completely self-describing. Metadata describing specimen history, experimental procedures, and data processing is stored alongside the data it describes in simple human- and machine-readable file formats. Because metadata standards must continue to evolve with the science, all metadata will be carefully versioned.
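A versioned, machine-readable sidecar might look like the sketch below, which writes a small JSON file next to the data it describes. The field names, file name, and schema version are placeholders rather than the actual AIND schema.

```python
# Sketch: a human- and machine-readable metadata sidecar stored alongside the data.
# All field names and values are illustrative placeholders, not the AIND schema.
import json
from datetime import datetime, timezone

metadata = {
    "schema_version": "0.1.0",                       # metadata is versioned as standards evolve
    "subject_id": "000000",                          # placeholder specimen identifier
    "acquisition_time": datetime.now(timezone.utc).isoformat(),
    "processing": [
        {"name": "compression", "software": "example-codec", "version": "x.y.z"},
    ],
}

with open("data_description.json", "w") as f:        # placeholder file name
    json.dump(metadata, f, indent=2)                 # readable by humans and by tools
```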

Tools & Resources

This repository contains the schema used to define metadata for Allen Institute for Neural Dynamics (AIND) data. All data is accompanied by a collection of JSON files containing metadata that provides detailed information about the data, how it was acquired, and how it was processed and analyzed. The metadata also includes administrative information such as licenses and restrictions on publication. The purpose of these files is to provide complete experimental details and documentation so that all users have a thorough understanding of the data. Learn more.
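As a rough illustration of how such files can be checked programmatically, the sketch below validates a metadata record against a JSON Schema with the jsonschema library. The file names are placeholders; the repository itself defines the authoritative schemas.

```python
# Sketch: validate a metadata JSON file against a JSON Schema definition.
# File names are placeholders for a schema exported from the repository
# and a metadata file accompanying a data asset.
import json
from jsonschema import validate, ValidationError

with open("subject_schema.json") as f:    # placeholder schema file
    schema = json.load(f)

with open("subject.json") as f:           # placeholder metadata file
    record = json.load(f)

try:
    validate(instance=record, schema=schema)
    print("metadata is valid")
except ValidationError as err:
    print(f"metadata failed validation: {err.message}")
```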

This repository contains scripts used to transfer AIND data from local data acquisition systems to cloud storage. These scripts are often modality-specific, as they are also responsible for data compression and file format standardization. Learn more.
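A stripped-down version of the upload step might look like the following sketch, which walks a local acquisition folder and copies it to an S3 bucket with boto3. The bucket, prefix, and local path are placeholders, and the real scripts layer modality-specific compression and format standardization on top of a step like this.

```python
# Sketch: upload a local acquisition folder to cloud (S3) storage.
# Bucket, prefix, and local paths are illustrative placeholders.
from pathlib import Path
import boto3

def upload_session(local_dir: str, bucket: str, prefix: str) -> None:
    """Copy every file under local_dir to s3://bucket/prefix/, preserving layout."""
    s3 = boto3.client("s3")
    root = Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root).as_posix()}"
            s3.upload_file(str(path), bucket, key)

upload_session("/data/example_session", "example-open-data-bucket", "example_session")
```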

This package enables lossless and lossy compression of extracellular electrophysiology data. WavPack is an open-source audio codec, which we have wrapped as a codec for numcodecs, the compression library used by Zarr. Learn more.
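The sketch below shows the intended usage pattern: the codec acts as a Zarr compressor when writing electrophysiology traces. The class name, import path, and parameters are assumptions; consult the wavpack-numcodecs documentation for the actual API.

```python
# Sketch: compress extracellular ephys traces with a WavPack numcodecs codec
# used as the Zarr compressor. Class name and parameters are assumptions.
import numpy as np
import zarr
from wavpack_numcodecs import WavPack   # assumed import path / class name

traces = (np.random.randn(30_000, 384) * 50).astype("int16")  # ~1 s of 384-channel data at 30 kHz

z = zarr.open(
    "traces.zarr",
    mode="w",
    shape=traces.shape,
    chunks=(30_000, 96),
    dtype="int16",
    compressor=WavPack(level=3),        # assumed parameter; lossless in this sketch
)
z[:] = traces
print(z.info)
```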

This is a Python wrapper around Code Ocean’s REST API. It enables AIND to automate data transfer and processing from Python. Learn more.
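At the REST level, a computation can be triggered with a plain HTTP call, roughly as sketched below. The endpoint path, payload fields, and deployment domain are assumptions; the wrapper exists precisely so these details do not have to be handled by hand.

```python
# Sketch: trigger a Code Ocean computation over its REST API with requests.
# Endpoint path, payload, domain, and token handling are assumptions.
import requests

CO_DOMAIN = "https://codeocean.example.org"   # placeholder deployment URL
API_TOKEN = "my-access-token"                 # placeholder personal access token

response = requests.post(
    f"{CO_DOMAIN}/api/v1/computations",       # assumed endpoint path
    auth=(API_TOKEN, ""),                     # token passed as basic-auth username (assumption)
    json={"capsule_id": "xxxx-xxxx"},         # placeholder capsule identifier
)
response.raise_for_status()
print(response.json())
```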
