Johannes Goll, Global Head of Emmes’ Biomedical Data Science and Bioinformatics department, explains how big data analytics can deliver improved therapies and bring clinical research one step closer to realising the potential of personalised care.
A key breakthrough area where big data analytics has shown tremendous scope and application is systems biology, which has bolstered holistic drug research and development, ultimately providing patients with access to the right therapies. A case in point is the emerging omics and other high-throughput laboratory technologies that allow simultaneous profiling of genetic alterations in combination with profiling thousands of molecules in plasma, cell populations, or single cells. These technologies give researchers an unprecedented view of human cellular responses to vaccines, therapeutics, or medical devices and play a crucial role in identifying next-generation diagnostic, predictive, and prognostic biomarkers. Typically, the data obtained from such studies is large and high-dimensional, with thousands of molecules measured for each individual sample. Integrating this information and correlating it with the clinical outcome of interest requires special statistical, data science, and bioinformatics skills. It’s a rapid growth area for clinical trials in the next few years as we seek to understand the underlying biology behind many of the targets and diseases we study.
Clinical research organisations (CROs) therefore stand to benefit from having a dedicated team of data scientists and bioinformaticians that can provide the right focus and insights for deeper client engagement and business growth. A CRO that utilises 1) statistical methodologies, including machine learning and advanced data visualisations, 2) cloud computing to scale data storage and analysis with data size and complexity, and 3) reproducible data analysis methods stands to gain a competitive advantage by ensuring high quality, timely delivery, and reproducibility of clinical trial results without being held back by data volume or complexity.
How to harness the potential of big systems biology data
Here, we outline some lessons learnt and case studies from more than 25 years of collaboration with the National Institutes of Health (NIH), particularly in the field of influenza vaccines and cancer biomarkers. Below is a summary of these practices along with some examples of real-world applications.
Given the complexity of systems biology analyses, a crucial step for a CRO is to establish a scalable computational framework that facilitates reproducibility. This is accomplished by using scalable cloud computing, virtual machine image technologies that maintain operating system and software snapshots for each project, and a combination of custom-built and open-source analysis software to build analytical workflows.

The important point for these analytical workflows is that all analysis steps, from raw data processing through to report generation, are fully automated, facilitating reproducible research (Figure 1). Components that can be parallelised, i.e. executed on multiple threads, should be implemented that way so that the overall execution time decreases as more threads are made available. Such an approach, if effectively implemented, increases efficiency, as 1) analyses can be scaled with project needs (available resources, turnaround time, data size), 2) a report that includes all tables, figures, and listings (TFLs) can be generated automatically, avoiding the manual effort of laying out TFLs, and 3) the likelihood of human error in this process is greatly reduced.
Because the whole process is automated, it not only ensures a high level of integrity of the results but can also be highly beneficial, both in cost savings and analytically, for clients that have multiple trials with similar endpoints, as it facilitates meta-analysis. It also helps to quickly address unplanned changes. For example, we noticed in a report that the signal of the treatment effect in the data was not aligned with what would be expected. When we brought this back to the laboratory, the laboratory realised that specimens had been mislabelled. Having the workflow in place allowed us to return an updated 500+ page report within two days.
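As an illustration of a parallelisable workflow component, the following is a minimal Python sketch, not our production workflow. It assumes a hypothetical per-sample step (`summarise_sample`) that can run independently for each sample; the function names, the sample data, and the thread counts are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def summarise_sample(sample):
    """Hypothetical per-sample step: total and mean of raw counts."""
    name, counts = sample
    total = sum(counts)
    return name, {"total": total, "mean": total / len(counts)}


def run_step(samples, n_threads=4):
    """Run the per-sample step on a pool of worker threads.

    executor.map yields results in input order, so the output is
    identical whatever the number of threads.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        return dict(executor.map(summarise_sample, samples))


# Toy input: (sample name, raw counts). Values are made up for illustration.
samples = [("S1", [10, 20, 30]), ("S2", [5, 5, 5]), ("S3", [0, 8, 1])]
results = run_step(samples, n_threads=2)
print(results["S1"])
```

Because the result does not depend on `n_threads`, the same step can be rerun with more or fewer threads as project needs change. In Python, threads mainly help when a step waits on files or external tools; for CPU-bound pure-Python steps a process pool would be the more suitable choice.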
The key analytical components that we typically include in all our systems biology workflows 1) assess the quality of the raw and processed data (batch effects and outliers visualised using data reduction techniques such as PCA and MDS plots), 2) address limitations of the data (missing values via imputation strategies such as k-nearest neighbours, systematic technical biases via suitable data normalisation strategies such as median or quantile normalisation), and 3) identify and characterise biomarkers of interest (machine learning, univariate/multivariate, and pathway enrichment statistical methods). Finally, results need to be summarised in an intuitive way that facilitates interpretation of key signals in the data. We use MA plots, heatmaps, pathway maps, and trend plots of clusters (clusters of features with correlated responses) as shown in Figure 2, as well as UpSet plots, an advanced form of Venn diagram, to summarise the overlap in differential features across multiple timepoints and study arms. This approach has led to the development of two software packages, one for reproducible RNA sequencing analysis and one for reproducible ribosomal profiling [1,2].
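To make one of these normalisation strategies concrete, here is a minimal Python sketch of quantile normalisation. It is an illustration only, not the implementation used in our software packages; the function name and the toy values are assumptions, and ties are simply broken by feature order.

```python
def quantile_normalise(matrix):
    """Quantile-normalise a list of samples.

    matrix: list of samples, each a list of feature values of equal length.
    Returns a new list of samples that all share the same distribution.
    """
    n_samples = len(matrix)
    n_features = len(matrix[0])

    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_samples = [sorted(sample) for sample in matrix]
    reference = [
        sum(s[k] for s in sorted_samples) / n_samples
        for k in range(n_features)
    ]

    normalised = []
    for sample in matrix:
        # Feature indices ordered by value (ties broken by index).
        order = sorted(range(n_features), key=lambda i: sample[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised


# Toy data: three samples, four features each (made-up values).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(samples)
```

After normalisation every sample contains the same set of values, while the rank order of features within each sample is preserved, which is what removes sample-wide technical shifts in intensity.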

What opportunities exist for CROs to broaden their systems biology expertise?
Working closely with laboratories that generate systems biology data provides very valuable insights into where the field is moving next. By interacting with laboratories, we can access their knowledgeable staff, who are familiar with the data and suitable analysis strategies and can provide valuable feedback for statistical analysis planning. There might even be the opportunity for a collaboration to optimise experimental and analytical methodology. For example, we recently collaborated with two sequencing laboratories to determine optimal parameters for RNA-Sequencing (RNA-Seq) in the context of a vaccine clinical trial. RNA-Seq is a revolutionary technology that enables researchers to assess expression changes in thousands of genes over time. Results provide deep insights into mechanistic processes, such as which innate and adaptive immune system processes are activated following vaccination and when. However, the analysis of this data is not trivial, and there are many different choices to make in terms of the analytical approaches and cut-offs that can be applied.
In addition, RNA-Seq experiments are costly, and careful experimental design can save a lot of money. In collaboration with the central sequencing laboratories, we evaluated the impact of filtering lowly expressed genes, using external RNA controls, fold change and false discovery rate (FDR) filtering, read length, and sequencing depth on the detection of differentially expressed genes (DEGs), assessing concordance between aliquots of the same samples. We also developed an R package to determine optimal gene filtering cut-offs and modelled the statistical power to detect DEGs for a range of sample sizes, effect sizes, and coverage depths [3]. A summary of the study and additional lessons learned related to RNA-Seq in the context of clinical trials is provided in our recent webinar [4]. Together, such collaborative work can help CROs to become better acquainted with emerging systems biology data and analysis paradigms, and to add value for clients by helping them save costs on the experiment and optimise their analytical approaches.
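The combined fold change and FDR filtering mentioned above can be sketched in a few lines of Python. This is a hedged illustration rather than the code from our R package: it assumes the Benjamini-Hochberg procedure for FDR control, and the cut-offs (FDR below 0.05, absolute log2 fold change of at least 1), gene names, and values are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, fdr_cutoff=0.05, min_abs_log2_fc=1.0):
    """Select genes passing both the FDR and the fold change cut-off.

    genes: list of (name, log2_fold_change, p_value) tuples.
    """
    adjusted = benjamini_hochberg([p for _, _, p in genes])
    return [
        name
        for (name, log2_fc, _), q in zip(genes, adjusted)
        if q < fdr_cutoff and abs(log2_fc) >= min_abs_log2_fc
    ]


# Made-up example: (gene, log2 fold change, raw p-value).
genes = [
    ("g1", 2.1, 0.001),
    ("g2", 0.3, 0.004),
    ("g3", -1.5, 0.02),
    ("g4", 1.2, 0.30),
    ("g5", -0.1, 0.80),
]
degs = select_degs(genes)
print(degs)
```

Here `g2` is statistically significant but fails the fold change cut-off, while `g4` has a large fold change but fails the FDR cut-off, which shows why the choice of both thresholds affects the final DEG list.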
What can be learned from systems biology studies?
Through a partnership with NIH as the Statistical Data Coordinating Centre for its vaccine trials, we had the opportunity to support the analysis for a large-scale A/H5N1 influenza (bird flu) systems biology study, for which we analysed and integrated transcriptomics, proteomics, and flow cytometry data [5,6,7]. We explored the difference in immune responses between individuals who received the H5N1 vaccine with the AS03 adjuvant and a control group who received the vaccine alone. Utilising machine learning techniques, we identified early biomarkers that best differentiated between AS03-adjuvanted and unadjuvanted subjects or that best correlated with later antibody response against the vaccine antigen. The AS03 adjuvant, which is known to modulate and improve immune response, was seen to stimulate subsets of white blood cells, increasing expression of genes and proteins that improve uptake and processing of antigens 24 hours after vaccination. This led to better antigen presentation, ultimately improving the efficacy of the vaccine as measured by the degree of protective levels of antibodies after the second dose.
An implication of the above is that the machine learning model we applied enables us to predict the effectiveness of the immune response in an individual based on a small subset of gene expression biomarkers 24 hours post-vaccination. This could be expanded to assessing predictors at baseline. The outcome could be used to develop a diagnostic kit to identify high responders in a population before giving the vaccine. Based on that, in theory, one could also do a dosage assessment, for example, giving high responders less of each dose or a single shot instead of two doses. Ultimately, such approaches lead to personalised treatment.
Another application of a systems biology approach is the development of a diagnostic classifier based on targeted exon sequencing to detect somatic mutations that are characteristic of certain forms of blood cancer. This work was conducted as part of the NHLBI National MDS Study, a prospective cohort study conducted at dozens of community hospitals and academic centres that enrolled patients undergoing work-up for suspected myelodysplastic syndromes (MDS) to understand the genetic, epigenetic, and biological factors associated with the initiation and progression of the disease [8].
For this study, our team supported cloud-based targeted exon sequencing data processing and analysis to identify and characterise somatic variants. We then built a two-stage diagnostic classifier using machine learning. It utilises somatic mutational profiles of 18 select genes in bone marrow samples to diagnose whether subjects have a myeloid malignancy and, if so, whether they have MDS. Ultimately, utilising genetic information as part of MDS classification schemes will be critical to help establish the optimal treatment.
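The control flow of a two-stage classifier can be sketched as follows. This Python example shows only how two binary classifiers are chained; it is not the classifier built for the study. The gene names (`GENE_A`, `GENE_B`) are placeholders, and the two stand-in rules are purely hypothetical; in practice each stage would be a model trained on mutational profiles.

```python
def two_stage_diagnosis(profile, is_myeloid_malignancy, is_mds):
    """Apply two binary classifiers in sequence.

    Stage 1 decides whether the subject has a myeloid malignancy.
    Stage 2 runs only for stage-1 positives and decides whether the
    malignancy is MDS.
    """
    if not is_myeloid_malignancy(profile):
        return "no myeloid malignancy"
    return "MDS" if is_mds(profile) else "other myeloid malignancy"


# Hypothetical stand-in rules, for illustration only (not clinical logic).
def toy_stage_one(profile):
    # Positive if any profiled gene carries a somatic mutation.
    return any(profile.values())


def toy_stage_two(profile):
    # Positive if the placeholder gene GENE_A is mutated.
    return profile.get("GENE_A", False)


# Mutational profiles: placeholder gene name -> somatic mutation detected.
subjects = {
    "subject_1": {"GENE_A": True, "GENE_B": False},
    "subject_2": {"GENE_A": False, "GENE_B": True},
    "subject_3": {"GENE_A": False, "GENE_B": False},
}
diagnoses = {
    name: two_stage_diagnosis(profile, toy_stage_one, toy_stage_two)
    for name, profile in subjects.items()
}
print(diagnoses)
```

Passing the two stages in as functions keeps the cascade independent of the underlying models, so each stage can be trained, validated, and replaced separately.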
Summary
To summarise, biomedical research is increasingly conducted by large, interdisciplinary collaborations to address problems with significant public health impact. These include improving existing vaccines or developing new vaccines for emerging pathogens, reducing antibiotic resistance, and identifying disease subtypes, including different forms of cancer, together with subtype-specific therapies. Many of these projects are data-driven and involve the collection and analysis of biological data at a large scale. As a result, life-science projects, which are frequently diverse, large, and geographically dispersed, have created unique challenges for collaboration.
With the development of modern sequencing technology, more clinical trials seek to include such information in clinical decision making and trial design. Intelligent methods of analysing this complex data are also constantly being developed and will become useful tools for future clinical trial design in the era of precision medicine.
CROs or sponsors that innovate and develop streamlined solutions for big data analysis and stay at the forefront of emerging experimental and analytical technologies will have an advantage over others by saving vital time, money, and resources. But more importantly, this will enable researchers to better understand the molecular mechanisms of the treatment effects they are exploring, ultimately delivering improved therapies and bringing clinical research one step closer to realising the potential of personalised care.
The implications of these analytic developments are hugely exciting, and to say one should ‘watch this space’ would be an understatement. However, there are still further challenges to overcome, especially around data mining and presentation in clinical trials. Yet I would expect that, with the growth of big data analysis and, just as importantly, greater industry knowledge of these methods, aided by machine learning, we will constantly be reimagining the art of the possible over the next few years. We now have the tools, technologies, and experience for transformational changes to trials, with systems biology approaches potentially the key to unlocking individualised approaches that are built into the trials rather than developed post-approval.
References
1. Jensen TL, Frasketi M, Conway K, Villarroel L, Hill H, Krampis K, Goll JB. RSEQREP: RNA-Seq Reports, an open-source cloud-enabled framework for reproducible RNA-Seq data processing, analysis, and result reporting. F1000Research. 2018 Apr 13;6.
2. Jensen TL, Hooper WF, Cherikh SR, Goll JB. RP-REP Ribosomal Profiling Reports: an open-source cloud-enabled framework for reproducible ribosomal profiling data processing, analysis, and result reporting. F1000Research. 2021 Feb 24;10(143):143.
3. Goll, Johannes B., et al. “The Vacc-SeqQC Project: Benchmarking RNA-Seq for Clinical Vaccine Studies.” bioRxiv (2022).
4. https://pmi-live.com/events/vaccine-clinical-trials-with-rna-seq-endpoints-to-measure-gene-expression-lessons-learned
5. Howard, Leigh M., et al. “Cell-based systems biology analysis of human AS03-adjuvanted H5N1 avian influenza vaccine responses: a phase I randomized controlled trial.” PloS one 12.1 (2017): e0167488.
6. Galassie, Allison C., et al. “Proteomics show antigen presentation processes in human immune cells after AS03‐H5N1 vaccination.” Proteomics 17.12 (2017): 1600453.
7. Howard, Leigh M., et al. “AS03-Adjuvanted H5N1 Avian Influenza Vaccine Modulates Early Innate Immune Signatures in Human Peripheral Blood Mononuclear Cells.” The Journal of infectious diseases 219.11 (2018): 1786-1798.
8. Sekeres MA, Gore SD, Stablein DM, DiFronzo N, Abel GA, DeZern AE, Troy JD, Rollison DE, Thomas JW, Waclawiw MA, Liu JJ. The National MDS Natural History Study: design of an integrated data and sample biorepository to promote research studies in myelodysplastic syndromes. Leukemia & Lymphoma. 2019 Nov 10;60(13):3161-71.
About the author
Johannes Goll is the Global Head of Emmes’ Biomedical Data Science and Bioinformatics department. As Senior Biostatistician, he led analyses of multiple cutting-edge systems biology vaccine clinical trials to identify genes, proteins, or metabolites that best predict vaccine efficacy, reactogenicity, or adjuvant effects. Prior to joining Emmes, Goll served as a Senior Bioinformatics Engineer at the J. Craig Venter Institute, where he contributed to groundbreaking genomics research including the NIH Human Microbiome Project. He holds an MS degree in Statistics and a BS-equivalent in Bioengineering.