Developing a Computational Pipeline and Benchmark Datasets for Image-Based Transcriptomics

Deep Ganguli
Ambrose Carr
Brian Long
Ed Lein

A New Technology

Multiplexed image-based transcriptomics is an emerging technology for measuring spatially resolved gene expression in cells and intact tissues at scale. These transcriptomic profiles can be correlated with other measures of cellular structure, function, and localization to obtain a better understanding of cellular biology in health and disease. However, a wide variety of new methods exist, and significant computational problems remain around storing, analyzing, sharing, and comparing the data obtained from them. These computational demands pose significant challenges for individual labs. They also pose challenges, and opportunities, for large-scale consortium efforts like the Human Cell Atlas (HCA), the BRAIN Initiative, and the Human BioMolecular Atlas Program (HuBMAP), all of which aim to generate spatially resolved molecular atlases of cells in major organs.

A Team Effort

To address these challenges, the SpaceTx consortium was formed among the developers of image-based transcriptomics assays in an effort to systematically compare how these assays differentiate cell types in the human brain. This grassroots consortium is coordinated by the Allen Institute for Brain Science, under the umbrella of the Human Cell Atlas project, and is funded by grants from the Chan Zuckerberg Initiative (CZI). Each consortium member will be provided with a sample of partitioned human brain tissue, and will then apply its method to the sample to generate a publicly available reference dataset suitable for comparative analysis and benchmarking.

Additionally, a team of computational biologists and software engineers from CZI is working with SpaceTx in an effort to solve several challenges around data dissemination and standardized analysis through an open-source software package called Starfish (https://github.com/spacetx/starfish). Through open science and open source software, SpaceTx aims to enable and accelerate the adoption and utility of multiplexed image-based transcriptomics.

Below, we briefly summarize the state of the field, then describe our current efforts and roadmap for the future.

The Landscape of Image-Based Transcriptomics Technology

A variety of molecular technologies have been developed for transcriptional profiling of cells, varying in their scale (number of cells and number of genes profiled) and spatial resolution (coarse-scale sequencing versus fine-scale imaging) [1,2,3]. Single-cell RNA sequencing is currently a popular technology for obtaining transcriptome-wide profiles of up to a million cells. However, it loses information about the spatial location of cells and transcripts, and it has low RNA capture efficiency.

Single-molecule fluorescence in-situ hybridization (smFISH) is another established technology. It retains spatial information and has higher capture efficiency, but it does not yield transcriptome-wide profiles. In smFISH, fluorescently labeled probes are hybridized onto target sequences of RNA transcripts. These probes are subsequently visualized as diffraction-limited spots through fluorescence microscopy. Although smFISH achieves high spatial resolution, the number of genes profiled is limited by the number of resolvable fluorescence channels, which is typically three to five. To profile more genes, researchers can strip existing probes, re-hybridize to target new transcripts with the same fluorophores, re-image, and repeat over multiple rounds. One such method is cyclic smFISH (osmFISH), which has been shown to accurately profile 33 genes over 13 imaging rounds in mouse somatosensory cortex.
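
As a rough illustration of why sequential smFISH needs many imaging rounds, the arithmetic can be sketched in Python. The function name and the choice of three channels are assumptions made for this example; the text above gives only the typical range of three to five channels.

```python
def sequential_capacity(channels: int, rounds: int) -> int:
    """Upper bound on genes profiled when each (round, channel) slot images one gene."""
    if channels < 1 or rounds < 1:
        raise ValueError("channels and rounds must be positive")
    return channels * rounds


# Assumed example: 3 channels over 13 rounds gives at most 39 gene slots,
# which is enough to hold the 33 genes reported for osmFISH.
print(sequential_capacity(3, 13))
```

Because capacity grows only linearly with the number of rounds, each additional handful of genes costs another full cycle of stripping, re-hybridizing, and imaging.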

Imaging across many rounds of hybridization, along with the chemistry required to strip and re-hybridize probes, is time-consuming. As such, barcode-based methods like multiplexed error-robust FISH (MERFISH) and sequential FISH (seqFISH) have been developed. These methods encode each gene as a specific pattern of fluorescence across channels and rounds. The predefined patterns (“barcodes”) can be imaged and then decoded to localize the expression of hundreds of genes or more with only a few rounds of imaging. There are also early proofs of principle that these methods can scale to measure more than 10,000 spatially resolved genes.
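
The decoding idea can be sketched as a nearest-barcode lookup. This is a minimal illustration under stated assumptions, not the algorithm used by MERFISH, seqFISH, or Starfish: the codebook, the gene names, and the `max_errors` threshold are hypothetical, and each barcode is simplified to a string of on/off bits, one per imaged (round, channel) combination.

```python
from typing import Dict, Optional

# Hypothetical codebook: gene name -> on/off pattern across imaging bits.
# Every pair of these barcodes differs in 4 positions, so a spot with a
# single misread bit still has one unique nearest barcode.
CODEBOOK: Dict[str, str] = {
    "GENE_A": "110100",
    "GENE_B": "011010",
    "GENE_C": "101001",
}


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: str,
           codebook: Dict[str, str] = CODEBOOK,
           max_errors: int = 1) -> Optional[str]:
    """Return the gene with the closest barcode, or None if the match is
    further than max_errors bits away or tied between two genes."""
    ranked = sorted((hamming(observed, code), gene)
                    for gene, code in codebook.items())
    best_distance, best_gene = ranked[0]
    if best_distance > max_errors:
        return None
    if len(ranked) > 1 and ranked[1][0] == best_distance:
        return None  # ambiguous: two barcodes are equally close
    return best_gene


print(decode("110100"))  # exact match to a codebook entry
print(decode("110000"))  # one bit misread, still within tolerance
print(decode("000000"))  # too far from every barcode to call
```

Spacing the barcodes apart is what makes such a scheme robust to errors: a small number of misread bits moves an observed pattern closer to no barcode other than the true one.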

An alternative approach to multiplexed FISH is to sequence RNA directly in situ. Fluorescent in-situ sequencing (FISSEQ) sequences RNA after amplification and has been shown to detect up to thousands of spatially resolved genes in cell culture. In-situ sequencing (ISS) uses padlock probes that target specific RNA sequences and carry oligonucleotide barcodes, which are subsequently sequenced by ligation and decoded to obtain spatially resolved target gene identities. This method works well for up to hundreds of genes, and in principle can be scaled higher.

Challenges for Sharing, Comparing, and Analyzing the Data

Independent of the method, the raw image data generated from multiplexed image-based transcriptomics are large—up to terabytes per experiment—and need to be stored and processed to obtain images of gene expression and cell-by-gene expression matrices for biological analysis. Furthermore, image-based transcriptomics methods currently lack standardization around file formats and analysis methods. This makes it difficult to compare these methods on equal footing and generate user-friendly benchmark datasets.

To address the file format standardization problem, we collaboratively arrived at a consensus input file format for the microscopy data and codebook that each method generates. All methods developers have agreed to provide data in this standardized file format to facilitate a comparison of these methods. The format is human readable, amenable to local and cloud-based computing/storage/visualization, and open source (https://github.com/spacetx/starfish/tree/master/sptx_format). Additionally, we are working with the developers of the open microscopy environment (OME) to facilitate easy conversion from OME-TIFF and other popular microscope formats to this standardized file format.

Even after the data are stored in the same file format, each chemistry or multiplexing scheme initially appeared to require bespoke data analysis software, which has made it difficult to derive and compare biological results across assays. Furthermore, there is no consensus on a standardized data processing strategy. To address this issue, the team has collaboratively developed an open-source software package called Starfish (https://github.com/spacetx/starfish). Starfish exposes a standardized library of tools to solve the difficult image processing problems required to analyze the data: image registration to align spots and cells across rounds, image filtering to remove background signal and enhance spots, spot detection, decoding to assign gene identities to spots, and segmentation to assign genes to cells (Figure 1, images on left). Starfish can run locally or in a cloud-based environment and is designed to scale with data volume.
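
To make the last step concrete, the sketch below shows how decoded spots and a segmentation mask could be combined into a cell-by-gene count table. It does not use the Starfish API; the function name, the toy mask, and the spot list are hypothetical, and the mask is assumed to label background pixels with 0 and each cell with a positive integer.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Spot = Tuple[int, int, str]  # (row, column, decoded gene name)


def cell_by_gene(spots: List[Spot],
                 label_mask: List[List[int]]) -> Dict[int, Counter]:
    """Count decoded spots per gene for every segmented cell.

    Spots that fall on background (label 0) or outside the image are skipped.
    """
    counts: Dict[int, Counter] = defaultdict(Counter)
    for row, col, gene in spots:
        inside = 0 <= row < len(label_mask) and 0 <= col < len(label_mask[row])
        if not inside:
            continue
        cell_id = label_mask[row][col]
        if cell_id != 0:
            counts[cell_id][gene] += 1
    return dict(counts)


# Toy segmentation: cell 1 in the top-left corner, cell 2 in the right column.
mask = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]
# Toy decoded spots; the last one lands on background and is not counted.
spots = [(0, 0, "GENE_A"), (1, 1, "GENE_A"), (0, 3, "GENE_B"), (2, 1, "GENE_B")]

matrix = cell_by_gene(spots, mask)
for cell_id in sorted(matrix):
    print(cell_id, dict(matrix[cell_id]))
```

A table of this shape, with cells as rows and genes as columns, is the kind of cell-by-gene expression matrix that downstream biological analysis starts from.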

Figure 1.

Opportunities for the Future

So far, we have shown that Starfish can be used to reproduce results obtained from many existing pipelines developed for specific image-based transcriptomics assays. This is exciting, as the analysis can now be done in one standardized software package, for many assays, instead of one custom software package per assay. Although Starfish enables the implementation of a variety of pipelines (Figure 1, subway diagram), we aim to show that Starfish can be employed by SpaceTx consortium members as a unified data processing approach.

In addition to deriving consensus standards around file formats and processing, SpaceTx is making all data and analysis code publicly available through the HCA. We hope this will enable the community to mine the data in different and creative ways to drive discoveries about the physical cellular architecture of all organs in health and disease. Communities like ASCB are crucial for advancing this goal, and we encourage contributions, usage, and feedback on Starfish.

References

1. Svensson et al. (2018). Exponential scaling of single-cell RNA-seq in the past decade. Nature Protocols 13, 599–604.

2. Lein et al. (2017). The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358, 64–69.

3. Crosetto et al. (2015). Spatially resolved transcriptomics and beyond. Nature Reviews Genetics 16, 57–66.
