Marilyn Matz, CEO of Paradigm4, and Zachary Pitluk, Vice President of Life Sciences at Paradigm4, explain why scalable data science platforms are key to supporting integrated analysis of single-cell genomic data sets.
As researchers approach the practical limits of flow cytometry and face the challenge of simultaneously exploring the thousands of proteins encoded by the genome and present in a single cell, which requires a significantly larger number of markers, single-cell ‘omics technologies such as single-cell RNA sequencing (scRNA-seq) have become an established part of routine drug discovery. scRNA-seq generates vast, information-rich datasets that can, for example, help clarify specific molecular mechanisms and pathways, and even reveal the nature of cell heterogeneity. With this technology, researchers can look for insights into the transition from ‘healthy’ to ‘disease’ states, investigate potential biomarkers, identify subpopulations for drug response and patient stratification, and study and classify how expression within a cell type varies across biological conditions, in order to understand the cellular basis of off-target drug effects or find new indications for already approved drugs.
These new possibilities require new, powerful, optimised processing, data management and computational methodologies to extract translational value for drug discovery and development scientists. In this article, we highlight several examples of how scRNA-seq data is powering progress in pharma and discuss why scalable data science platforms are key to supporting the integrated analysis of single-cell genomic data sets.
A pivotal role for scRNA-seq
The Sanger Institute1 summed up the significance of single-cell data, commenting in a 2020 blog post that: “The volume of single-cell data that will be generated will exceed the volume of genotype sequencing data by orders of magnitude. While the human genome has 20 thousand genes, there are 300 different cell types in the human body comprising 37 trillion cells. Previously, scientists would take billions of cells together and measure an average of gene activity. Now, it is feasible to measure each cell’s individual gene expression profile.”
On a practical level, ground-breaking work over the last two years in respiratory research provides a good example of what has been achieved. Applying scRNA-seq led to the detection of a transcriptionally novel cell type – termed a ‘pulmonary ionocyte’ – that expresses large quantities of CFTR (cystic fibrosis transmembrane conductance regulator), the causal gene of cystic fibrosis.2,3
This finding had important implications for gene-therapy approaches to cystic fibrosis, and changed the way drugs are developed to treat such diseases. It also helped fuel the rapid adoption of single-cell omics – a recent report suggests that by 2028, the single-cell analysis market is expected to reach US$6.7 billion4. The authors note that whilst much of this growth is accounted for by sequencing instrumentation and reagents, a significant base driver is the adoption of the technique by pharma and biopharma research groups.
The challenge of data integration
If we consider the wider application for scRNA-seq, we can start to understand the challenge behind the headline of the sheer size of single-cell datasets – the Immune Cell Survey in the Human Cell Atlas (HCA) project, for example, contains 780,000 cells, which come from just 16 donors. As datasets grow to include enough individuals to offer statistical power to drug discovery, or require cross-comparison between studies, the number of cells will grow to billions (as it has with flow cytometry), meaning the issues of integration, analysis across studies, and making comparisons across data types, become very significant questions.
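To give a rough sense of scale, the figures quoted in this article can be turned into a back-of-the-envelope estimate. The short Python sketch below assumes a dense cells-by-genes matrix with one 4-byte value per entry; that storage format and the hypothetical `dense_matrix_gigabytes` helper are illustrative assumptions only, and real scRNA-seq matrices are typically sparse and stored far more compactly.

```python
# Back-of-the-envelope size of a dense cells x genes expression matrix.
# Assumption (not from the article): one 4-byte value per entry, no sparsity.

BYTES_PER_VALUE = 4          # assumed 32-bit storage
GENES = 20_000               # human gene count quoted by the Sanger Institute
HCA_IMMUNE_CELLS = 780_000   # cells in the HCA Immune Cell Survey (16 donors)


def dense_matrix_gigabytes(n_cells: int, n_genes: int = GENES) -> float:
    """Return the size, in gigabytes (10**9 bytes), of a dense matrix."""
    return n_cells * n_genes * BYTES_PER_VALUE / 1e9


if __name__ == "__main__":
    examples = [
        ("HCA Immune Cell Survey", HCA_IMMUNE_CELLS),
        ("one billion cells", 1_000_000_000),
    ]
    for label, n_cells in examples:
        print(f"{label}: {dense_matrix_gigabytes(n_cells):,.1f} GB")
```

Under these assumptions the 780,000-cell survey comes to roughly 62 GB and a billion-cell collection to roughly 80 TB, which suggests why reloading whole matrices for every question quickly stops being practical.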
Some experts have proposed a ‘knowledge network’ approach to integrate single-cell data with the complex mix of datasets and measurement parameters that make up the new healthcare landscape, from pre-clinical aspects such as biomarker identification and investigations into molecular mechanisms, to patient data and observational studies5.
Importantly, even if we restrict the discussion to the key requirements of a computational platform for single-cell ‘omics data within this overall web of interconnected, heterogeneous data, we must consider that cross-study comparisons are needed for many drug discovery and development processes: for example, target screening, phase II efficacy testing, hypothesis testing to identify differences between diseased and normal states, and patient selection once a drug is approved for certain patient groups.
New trends, expanded applications, same needs
A survey of recent literature allows us to highlight just three of many areas where the application of scRNA-seq is developing rapidly.
Firstly, Nature Methods recently (January 2021) selected spatially resolved transcriptomics as its 2020 ‘Method of the Year’6. This new technique allows researchers to localise expressed genes, integrate the spatial response to treatment across cell types and anatomical markers (such as blood vessels and tissue boundaries), and evaluate the actual dose response by cell type. A key requirement here is the preservation of spatial coordinates for inter-sample comparisons to show dose response – a significant challenge for many data science platforms.
Secondly, recent work explored population-scale multi-‘omics analysis of single-cell data7. A total of 224 antibody-derived tags on 20 genes were evaluated in parallel with scRNA-seq data in peripheral blood mononuclear cells (PBMCs). The data reveal the well-known differences between an mRNA transcript and its translated protein (Figure 1). Most importantly, the data analytics package used for this work was able to query multi-‘omics datasets across modalities irrespective of the number of cells (~1 million in this case).
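A comparison of this kind can be sketched as a per-marker correlation between the two modalities measured in the same cells. The Python example below is a minimal illustration on synthetic data, not the analysis behind Figure 1; the `per_marker_correlation` helper, the array shapes and the simulated noise level are all assumptions.

```python
import numpy as np


def per_marker_correlation(rna: np.ndarray, protein: np.ndarray) -> np.ndarray:
    """Pearson correlation across cells for each paired marker (column).

    Both inputs are cells x markers arrays; column j of `rna` is assumed to
    be paired with column j of `protein`. Columns must not be constant.
    """
    if rna.shape != protein.shape:
        raise ValueError("rna and protein must both be cells x markers")
    rna_c = rna - rna.mean(axis=0)
    prot_c = protein - protein.mean(axis=0)
    numerator = (rna_c * prot_c).sum(axis=0)
    denominator = np.sqrt((rna_c ** 2).sum(axis=0) * (prot_c ** 2).sum(axis=0))
    return numerator / denominator


if __name__ == "__main__":
    # Synthetic stand-in data: protein signal is the RNA count plus noise.
    rng = np.random.default_rng(seed=1)
    n_cells, n_markers = 5_000, 20
    rna = rng.poisson(lam=3.0, size=(n_cells, n_markers)).astype(float)
    protein = rna + rng.normal(scale=3.0, size=rna.shape)
    print(np.round(per_marker_correlation(rna, protein), 2))
```

A low correlation for a given marker would flag a transcript whose abundance is a poor proxy for its protein, which is the kind of cross-modality question a platform needs to answer without reshaping the data first.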
Finally, at the time of writing (May 2021), the US National Library of Medicine lists 53 clinical trials that are underway, close to completion or currently recruiting subjects8. Because these trials use scRNA-seq as a key methodology for population-scale studies, they highlight the urgent need for tools to manage data from large numbers of patients. Areas under investigation range from infectious disease through oncology to fibrosis (Table 1), and the researchers are looking for outcomes that include:
- Understanding mechanism of action of a drug or candidate
- Confirming mechanism of action of a drug in development
- Providing explanatory power for mass cytometry proteomics
- Understanding disease aetiology
- Understanding transcriptional profile of disease and response to therapy.
All this demands an analysis platform that can evaluate key biological hypotheses by querying at scale, and that facilitates data integration across studies or populations without requiring elaborate IT knowledge and plumbing. Many current methods require repetitive extract/transform/load operations (data wrangling), which increase setup time and computational cost with every question asked of the data. Because of fundamental technical limitations, they also restrict the use of preferred computational methods and significantly constrain the total number of cells and datasets that can be compared, preventing correctly powered analyses (i.e., with enough patients or biological replicates).
Within the range of data analytics ‘ecosystems’ and ‘platforms’ competing in this space, one novel approach has emerged that blurs the line between storage and computation (REVEAL, Paradigm4). It uses a fundamentally different scientific data management and computing platform, purpose-built for large-scale multidimensional scientific data. Data is stored on disk as arrays that can easily be queried from scientific languages such as R and Python. The old way of working – opening many files and bringing the data together into a matrix – is no longer necessary, because the data is natively stored for rapid exploration and analysis.
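The idea of querying arrays in place, rather than assembling a matrix from many files, can be illustrated with a generic sketch. The example below uses NumPy's `memmap` as a stand-in for array-native storage; it is not the REVEAL or Paradigm4 API, and the file name, array dimensions and the `write_expression_array` and `mean_expression` helpers are hypothetical.

```python
import os
import tempfile

import numpy as np

N_CELLS, N_GENES = 1_000, 50  # toy dimensions, chosen only for illustration


def write_expression_array(path: str) -> None:
    """Write a synthetic cells x genes array to disk once."""
    rng = np.random.default_rng(seed=0)
    arr = np.memmap(path, dtype=np.float32, mode="w+", shape=(N_CELLS, N_GENES))
    arr[:] = rng.poisson(lam=2.0, size=(N_CELLS, N_GENES))
    arr.flush()
    del arr  # release the mapping so the file can be reopened or removed


def mean_expression(path: str, cell_rows: np.ndarray, gene_col: int) -> float:
    """Read only the requested cells for one gene from the on-disk array."""
    arr = np.memmap(path, dtype=np.float32, mode="r", shape=(N_CELLS, N_GENES))
    value = float(arr[cell_rows, gene_col].mean())
    del arr
    return value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "expression.dat")
        write_expression_array(path)
        subset = np.arange(0, N_CELLS, 10)  # e.g. every tenth cell
        print(mean_expression(path, subset, gene_col=3))
```

The point of the sketch is the access pattern: the array is written once, and each later question touches only the rows and columns it needs instead of repeating a load-and-merge step.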
The application is designed to handle spatial transcriptomics and all flavours of multi-omics single-cell analysis, including mass spectrometry-based single-cell proteomics. Results from multiple types of clustering or normalisation can be stored as attributes of each and every cell. Public and custom ontologies, with versioning, can be integrated and queried independently of the underlying cell data.
In addition, a purpose-built data schema with scientist-friendly R and Python interfaces (REVEAL™: Single Cell, Paradigm4) connects users to the data and computational platform through task-relevant functionality and application-appropriate vocabulary. New data types, as well as user- and community-developed analytics such as novel normalisation and clustering capabilities, can be integrated. Moreover, this new approach has been consistently shown to handle enough patients’ worth of data for clinical applications.
The view ahead seems clear – drug discovery scientists will be collecting, collating and probing larger and larger datasets in their search for understanding and innovation. There is now a path forward to meet the challenge of extracting clinical value from these datasets, and a new analytical approach that looks set to be an essential element for timely testing of key biological hypotheses in target evaluation, disease progression, and precision medicine.
Marilyn Matz is CEO and co-founder of Paradigm4. Prior to Paradigm4, after completing an MS degree at the MIT AI lab, she was one of three co-founders of Cognex Corporation. Matz was the recipient of the sixth annual Women Entrepreneurs in Science and Technology (WEST) Leadership Award; a co-recipient of the SEMI industry award for outstanding technical contributions to the semiconductor industry; and a 2020 NACD Directorship 100 honoree. She also serves on the Board of Directors of Teradyne.
Zachary Pitluk has worked in sales and marketing for 23 years, from being a pharmaceutical representative for BMS to management roles in life science technology companies. Since 2003, his positions have included VP of Business Development at Gene Network Sciences and Chief Commercial Officer at Proveris Scientific. Pitluk has held academic positions at the Yale University Department of Molecular Biophysics and Biochemistry (Assistant Research Scientist, NIH Postdoctoral Fellow and Graduate Student), and has been named as co-inventor on numerous patents.
1. Mapping the Human Cell Atlas – charting the body’s cellular world. Sanger Institute (2020). https://sangerinstitute.blog/2020/04/08/mapping-the-human-cell-atlas-charting-the-bodys-cellular-world/
2. Alexander MJ, Budinger GRS, Reyfman PA. Breathing fresh air into respiratory research with single-cell RNA sequencing. European Respiratory Review, 2020;29(156):200060.
3. Plasschaert LW, Žilionis R, Choo-Wing R, et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature, 2018;560:377–381.
4. Single-cell Analysis Market Size Worth $6.7 Billion By 2028 | CAGR: 15.0%: Grand View Research, Inc. (2021) [online] Available at: https://www.benzinga.com/pressreleases/21/05/n21146133/single-cell-analysis-market-size-worth-6-7-billion-by-2028-cagr-15-0-grand-view-research-inc [Accessed 24 May 2021].
5. Seyhan et al. Are innovation and new technologies in precision medicine paving a new era in patients centric care? J Transl Med, 2019;17:114. https://doi.org/10.1186/s12967-019-1864-9
6. Marx V. Method of the Year: spatially resolved transcriptomics. Nat Methods, 2021;18:9–14. https://doi.org/10.1038/s41592-020-01033-y
7. Srikant Sarangi, Ryan Golhar, Namit Kumar, Sergey Fridrikh, Connie Brett, Jason Kinchen, Kriti Sen Sharma, Zachary Pitluk. AGBT 2020: Scaling up multi-omics scRNA-seq analysis using REVEAL: Single Cell. Paradigm4 (2021).
8. Clinicaltrials.gov. Search of: single cell RNAseq – List Results – ClinicalTrials.gov (2021) [online] Available at: https://clinicaltrials.gov/ct2/results?cond=&term=single+cell+RNAseq&cntry=&state=&city=&dist=&Search=Search [Accessed 25 May 2021].