Workshop: Mapping of Cell Types Data

Data Challenges for Workshop: Mapping of Cell Types Data

Data challenge: Mapping of Cell Type Data

Single cell technologies have made it possible to identify novel cell types, considerably advancing our knowledge of a tissue’s composition. Cell types are typically derived by clustering single cell transcriptomics data, but the quality and the granularity of these clusters remain difficult to assess. In the absence of gold standard information, benchmarks commonly validate predicted cell types against author annotations, implicitly favoring approaches that mimic the original study.

Participants will complete one or both of the challenges described below. Preliminary results are due for evaluation by May 2, in advance of the workshop on May 12, where each team will briefly present its approach and progress.

Teams with creative solutions to these challenges will have the opportunity to present to a national audience at the BRAIN Initiative Cell Census Network (BICCN) meeting on June 6, 2022, and to participate in a published review article.

Key dates:

Submit preliminary results for evaluation: May 2

Challenge participants updated on workshop presentations: May 6

Workshop: May 12

Selected participants present to BICCN meeting: June 6

Selected participants participate in review article: Summer 2022, with submission targeted for August 30

Challenge 1 details

Challenge 2 details

Data challenge overview and rationale

A key part of the Frameworks for Cell Type Definition, Ontology, and Nomenclature Workshop III will be a transcriptomic data mapping comparison of state-of-the-art approaches. In this Critical Assessment of Cell Type Matching (CACM), several teams will be provided with synthetic and real biological datasets for measuring mapping accuracy and potential novel association discovery. Teams will perform the mappings in advance of the workshop, and the results will be presented and compared during the workshop. The algorithms will be benchmarked on both real and synthetic datasets, which differ in the availability of exact ground truth. Synthetic datasets will be constructed from real datasets to simulate specific, well-defined differences and perturbations. Below we summarize the use case objectives of the data mapping challenges and identify key reference datasets.

Using spatial transcriptomics methods, we also hope to shed new light on clustering assessment by taking advantage of multi-modal technologies. We assess the validity of cell types using a completely independent modality, the spatial location of neurons. This heuristic is rooted in classical neuroscience, where different classes of neurons have specific spatial patterns (layer- or brain region-specificity) that also correlate with the neuron’s functions and connectivity patterns. Our rationale is that when a cell type is split into subtypes, the subtypes should have more refined spatial patterns than the original type. This second challenge, Critical Assessment through Spatially-resolved Transcriptomics (CAST), involves identifying cell types in a BARseq2 dataset that jointly measures transcriptomic signatures of marker genes and physical locations of neurons across a whole mouse cortex. Participants are asked to identify cell types based on the expression data alone; the cell types will then be assessed through their spatial patterns in the held-out physical locations.

We emphasize that interested participants need not tackle both challenges; either of the challenges below may be selected on its own. Each challenge definition is flexible: participants may refine the use cases, making them more precise, interesting, or relevant, where doing so better illustrates the concept described.

Data Challenge 1

Set by Jesse Gillis, Mike Hawrylycz, Gerald Quon, Richard Scheuermann, Uygar Sümbül, and Sarah Teichmann

Essential to our understanding of cell types and their function is the ability to establish cell type references and to map and compare new experimental data with these references, thereby updating our understanding of types and their characteristics. Analogous to mapping sequencing reads to a reference genome, the ability to map query cells onto complex reference atlases allows identification of common and novel cell types and states. The problem of molecular data mapping to reference standards has been successfully approached by several groups and it is important to evaluate, compare, and identify the relative strengths of these methods in practice.

For Challenge #1, participants are provided with two cell x gene expression matrices, Challenge1_reference and Challenge1_query, derived from a single cell RNA-seq experiment. Each cell has been assigned a cell type cluster membership based on unsupervised clustering results: R_cl_1 to R_cl_86 for the reference cells and Q_cl_1 to Q_cl_85 for the query cells. The goal is to match query clusters to reference clusters and to quantify the overlap.

Based on these datasets, participants are asked to provide:

  • A list of one-to-one cell type cluster matches between the reference and query datasets

  • A list of cell type clusters in the query dataset that are absent from the reference dataset as examples of novel cell types in the query

  • A list of cell type clusters in the reference dataset that are absent in the query dataset as examples of missing cell types in the query

  • A list of cell type clusters in the query that map to more than one cell type cluster in the reference as evidence of underpartitioning of the query clusters

  • A list of cell type clusters in the reference that map to more than one cell type cluster in the query as evidence of overpartitioning of the query clusters

  • A detailed description of the computational method used

  • Open source code or pseudo-code of the computational method used

The matching results can be provided as either qualitative or quantitative/probabilistic or both.
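
As an illustration only, the deliverables listed above could be derived from a cluster-by-cluster similarity matrix. The sketch below is not a method prescribed by the challenge: it assumes both matrices are NumPy arrays of shape cells x genes over the same genes, compares cluster mean profiles by Pearson correlation, and uses a hypothetical threshold `min_corr` whose value is arbitrary. The function names and the dictionary keys are likewise made up for this example.

```python
import numpy as np

def cluster_centroids(expr, labels):
    """Mean expression profile per cluster (expr: cells x genes)."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    cents = np.vstack([expr[labels == n].mean(axis=0) for n in names])
    return names, cents

def row_correlations(a, b):
    """Pearson correlation of every row of a with every row of b."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def match_clusters(ref_expr, ref_labels, qry_expr, qry_labels, min_corr=0.8):
    """Summarize query-to-reference cluster matches (min_corr is an assumed cutoff)."""
    ref_names, ref_c = cluster_centroids(ref_expr, ref_labels)
    qry_names, qry_c = cluster_centroids(qry_expr, qry_labels)
    corr = row_correlations(qry_c, ref_c)   # rows: query, columns: reference
    hits = corr >= min_corr                 # which pairs count as a match
    best_ref = corr.argmax(axis=1)          # best reference for each query
    best_qry = corr.argmax(axis=0)          # best query for each reference
    return {
        # reciprocal best hits above the cutoff
        "one_to_one": [(qry_names[i], ref_names[j], float(corr[i, j]))
                       for i, j in enumerate(best_ref)
                       if best_qry[j] == i and hits[i, j]],
        # query clusters with no reference match: candidate novel types
        "novel": [q for i, q in enumerate(qry_names) if not hits[i].any()],
        # reference clusters with no query match: missing from the query
        "missing": [r for j, r in enumerate(ref_names) if not hits[:, j].any()],
        # one query cluster matching several reference clusters
        "underpartitioned_query": [q for i, q in enumerate(qry_names)
                                   if hits[i].sum() > 1],
        # one reference cluster matching several query clusters
        "overpartitioned_query": [r for j, r in enumerate(ref_names)
                                  if hits[:, j].sum() > 1],
    }
```

The correlation matrix itself can serve as the quantitative result, while the thresholded lists give the qualitative one. Centroid correlation is only one of many possible similarity measures, and a cluster with a constant mean profile would need special handling before normalization.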

Data for challenge 1     Readme for Challenge 1     Submit final results

Data Challenge 2

Set by Jesse Gillis, Mike Hawrylycz, Gerald Quon, Richard Scheuermann, and Uygar Sümbül

Single-cell technologies have made it possible to identify novel cell types, considerably advancing our knowledge of a tissue’s constituents. Cell types are typically derived by clustering transcriptomics data, but the quality and the granularity of these clusters remain difficult to assess. In the absence of gold standard information, benchmarks commonly validate predicted cell types against author annotations, implicitly favoring approaches that mimic the original study. In CAST (Critical Assessment through Spatially-resolved Transcriptomics), we propose to shed new light on clustering assessment by taking advantage of the rise of multi-modal technologies.

We assess the validity of cell types using a completely independent modality, the spatial location of neurons. This heuristic is rooted in classical neuroscience, where different classes of neurons have specific spatial patterns (layer- or brain region-specificity) that also map to the neuron’s functions and connectivity patterns. The CAST challenge consists of identifying cell types in a BARseq2 dataset that jointly measures transcriptomic signatures of marker genes and physical locations of neurons across a whole mouse cortex. Participants are asked to identify cell types based on the expression data alone; the cell types will then be assessed through their spatial patterns in the held-out physical locations.

For Challenge #2, the task consists of identifying cell types in a BARseq2 dataset provided by Xiaoyin Chen and Tony Zador that jointly measures transcriptomic signatures of marker genes and physical locations of neurons across a whole mouse cortex. The dataset contains two matrices: one expression matrix (dimension 109 x 642,340) and a held-back physical location matrix (dimension 3 x 642,340) with matched samples.

Subtask 1: Mapping cells to a reference dataset (see below). Each cell should be mapped to one of the types of the reference dataset or labeled as “unassigned” if it does not seem to correspond to any of the reference cell types.
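
One simple way to approach subtask 1, shown here purely as a sketch, is nearest-centroid assignment with a rejection threshold. The code assumes the expression matrix has been transposed to cells x genes, that mean profiles of the reference types are available over the same genes, and that a correlation cutoff `min_corr` decides when a cell is left “unassigned”. The function name, the cutoff, and its default value are assumptions made for this example, not part of the challenge definition.

```python
import numpy as np

def map_cells_to_reference(cell_expr, ref_centroids, ref_names, min_corr=0.5):
    """Assign each cell to its most correlated reference type, or "unassigned".

    cell_expr:     cells x genes array
    ref_centroids: types x genes array, same gene order as cell_expr
    ref_names:     one name per row of ref_centroids
    """
    x = cell_expr - cell_expr.mean(axis=1, keepdims=True)
    c = ref_centroids - ref_centroids.mean(axis=1, keepdims=True)
    xn = np.linalg.norm(x, axis=1, keepdims=True)
    cn = np.linalg.norm(c, axis=1, keepdims=True)
    xn[xn == 0] = 1.0  # flat profiles get correlation 0 instead of NaN
    cn[cn == 0] = 1.0
    corr = (x / xn) @ (c / cn).T            # cells x types
    best = corr.argmax(axis=1)
    best_corr = corr[np.arange(len(best)), best]
    return [ref_names[j] if r >= min_corr else "unassigned"
            for j, r in zip(best, best_corr)]
```

With several hundred thousand cells, the correlation matrix may need to be computed in batches of rows to keep memory use manageable.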

Subtask 2: De novo clustering of cells. Each cell can be mapped to an arbitrary label. Each type should contain at least 20 cells.
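
The minimum size of 20 cells per type can be enforced as a post-processing step after any clustering method. The sketch below is one possible approach, not a required one: it reassigns every cell of an undersized cluster to the nearest centroid (Euclidean distance) among the clusters that already meet the minimum. It assumes a cells x genes NumPy array; the function name is hypothetical.

```python
import numpy as np
from collections import Counter

def merge_small_clusters(expr, labels, min_size=20):
    """Reassign cells of clusters smaller than min_size to the nearest large cluster."""
    labels = np.asarray(labels).copy()
    counts = Counter(labels.tolist())
    large = [k for k, n in counts.items() if n >= min_size]
    small = [k for k, n in counts.items() if n < min_size]
    if not large or not small:
        # nothing to fix, or no cluster is large enough to absorb the others
        return labels
    cents = np.vstack([expr[labels == k].mean(axis=0) for k in large])
    idx = np.flatnonzero(np.isin(labels, small))  # cells to reassign
    # squared Euclidean distance from each such cell to each large centroid
    d = ((expr[idx, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    labels[idx] = np.asarray(large)[d.argmin(axis=1)]
    return labels
```

Centroids are computed once, before any reassignment, so the result does not depend on the order in which small clusters are processed.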

Reference dataset: For subtask 1, cells should be mapped to the glutamatergic supertypes from the isocortex and hippocampus taxonomy of Yao et al. 2021 (PMID: 34004146), which serves as the reference dataset. The reference taxonomy contains cell type descriptions at varying levels of granularity: 388 clusters (finest types), 101 supertypes (medium resolution) and 42 subclasses (coarse types). The data is a compendium of 39 scRNAseq datasets (21 using the SMART-Seq v4 technology, 18 using 10x v2) totalling ~1.3M cells. Each dataset samples a specific cortical or hippocampal brain region (e.g., MOp, VISp, HIP). The final taxonomy uses an integrated version of all datasets, enabling the identification of shared and region-specific types.

Expected output: For both subtasks, the expected output is a two-column CSV file containing one row per cell (642,340 rows). The first column contains the sample ID of the cell (the column names in the expression matrix), and the second column contains the cell type label of the cell (one of the reference types or “unassigned” for subtask 1, an arbitrary label for subtask 2).
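
A minimal sketch of writing such a file with Python's standard `csv` module follows. The text above does not say whether a header row is expected, so this example writes none, which keeps the row count equal to the number of cells; that choice, the function name, and the optional `allowed` check for subtask 1 are assumptions.

```python
import csv

def write_submission(path, cell_ids, labels, allowed=None):
    """Write one "cell ID, label" row per cell and return the number of rows.

    allowed: optional set of reference type names (subtask 1); labels outside
    this set, other than "unassigned", raise an error.
    """
    cell_ids, labels = list(cell_ids), list(labels)
    if len(cell_ids) != len(labels):
        raise ValueError("need exactly one label per cell")
    if len(set(cell_ids)) != len(cell_ids):
        raise ValueError("cell IDs must be unique")
    if allowed is not None:
        bad = {lab for lab in labels
               if lab != "unassigned" and lab not in allowed}
        if bad:
            raise ValueError(f"labels not in reference: {sorted(bad)}")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(zip(cell_ids, labels))
    return len(cell_ids)
```

Comparing the returned row count with the number of columns in the expression matrix is a quick check that no cell was dropped.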

Data for Challenge 2     Submit interim results     Submit final results
