Tariq E. Khoyratty DPhil, Computational Biologist, Aigenpulse, and Satnam Surae PhD, Chief Product Officer, Aigenpulse, discuss the challenge of the large volume of data generated and how automation/AI/ML can help to manage this.
As pharmaceutical companies expand their research and development (R&D) efforts to meet the growing global demand for new drugs, the need for efficient analysis of preclinical and clinical research data is becoming a priority. The current lack of access to consistent and reliable data limits the success rate of progressing a new pharmaceutical compound through to market, regardless of the advanced analytical technology used to produce the data. Flow cytometry is a diverse and crucial technique in pharma research, used to investigate disease aetiology and alterations in immune responses as well as for quantitative pharmacokinetic studies.
However, because flow cytometry measures dozens of parameters, the data generated is vast, often causing a bottleneck in research. Significant computational power is required to capitalise on the benefits of flow cytometry as a technique. Here we explore how automation, artificial intelligence and machine learning can expedite Big Data processing and management, leading to improved insights in flow cytometry analysis in pharma R&D.
An introduction to flow cytometry
A range of analytical techniques are implemented in pharma R&D, such as chromatography, spectroscopy, and flow cytometry, to aid the development of successful pharmaceutical compounds. Commercial flow cytometry methods have been used in research and clinical labs since the 1970s, primarily for immunology and haematology applications, but their benefits have more recently been recognised at various stages of drug discovery and development. Flow cytometry collects complex information as streams of cells in suspension pass through a focussed laser. As particles are exposed to the laser, they scatter light, and any fluorochromes used to label the cell fluoresce. Both of these signals are detected, reflecting the physical and biological properties of the cell.
By using multiple fluorochromes with different emission spectra, many data points can be captured simultaneously for every event detected. Technological developments over recent years have resulted in the availability of high-throughput flow cytometric approaches, extending applications in cell-based assays such as cell proliferation, differentiation, cell death, adhesion, ligand binding, transport and cellular signalling.1 The high-speed quantitative analysis of cells and particles makes flow cytometry an appealing technology for drug discovery research, by enabling rapid drug molecule screening. In addition, the method’s renowned multiparameter capability produces different types of information, ranging from elucidation of mechanisms of action (for drugs and disease progression) to functional assays, giving flow cytometry an important role in the prioritisation, verification, and clinical validation of new biomarkers.2
However, the clear advantages of the high throughput, multiparameter functionality of flow cytometry are hampered by the immense output of highly complex data. Significant expertise is required to interpret this data correctly, and there is a lack of standardisation in assay and instrument set-up. In terms of biomarker discovery and analysis, algorithms are needed to establish the most appropriate correlation among biomarkers, drug effect, and clinical outcome, and therefore inform personalised treatment.2
Challenges of analysing complex data
Flow cytometry data analysis is built upon the principle of gating, which is necessary for the visualisation of correlations in multiparameter data. Populations of interest are sequentially identified and refined using a panel of fluorochromes conjugated to antibodies that target a specific protein (marker). The fluorescence detected in the unique emission spectrum of the fluorochrome is therefore proportional to the amount of the marker present on the cell. However, gating is often completed manually and is a laborious process, requiring significant expertise and hands-on time.
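The sequential nature of gating can be illustrated with a minimal Python sketch. It assumes events are simple dictionaries of channel intensities. The channel names (FSC, SSC, CD3, CD4), the thresholds and the event values are hypothetical and chosen only for illustration. Real gates are drawn on compensated, transformed data and are rarely simple rectangles.

```python
# Hypothetical events: each dict holds scatter and marker intensities.
events = [
    {"FSC": 52000, "SSC": 21000, "CD3": 4200, "CD4": 3900},
    {"FSC": 48000, "SSC": 19000, "CD3": 3800, "CD4": 150},
    {"FSC": 9000, "SSC": 3000, "CD3": 90, "CD4": 60},      # low scatter, e.g. debris
    {"FSC": 50000, "SSC": 20000, "CD3": 120, "CD4": 80},
    {"FSC": 51000, "SSC": 60000, "CD3": 100, "CD4": 70},   # high side scatter
]


def range_gate(population, channel, low, high):
    """Keep events whose intensity on `channel` lies in [low, high)."""
    return [e for e in population if low <= e[channel] < high]


def frequency(child, parent):
    """Fraction of the parent population that falls inside the child gate."""
    return len(child) / len(parent) if parent else 0.0


# Gates are applied in sequence; each one refines the previous population.
scatter_gate = range_gate(events, "FSC", 30000, 70000)
lymphocytes = range_gate(scatter_gate, "SSC", 10000, 30000)
t_cells = range_gate(lymphocytes, "CD3", 1000, float("inf"))
cd4_t_cells = range_gate(t_cells, "CD4", 1000, float("inf"))

print(len(lymphocytes), len(t_cells), len(cd4_t_cells))
print(frequency(cd4_t_cells, t_cells))
```

Each threshold here stands in for a decision that an analyst would make by eye, which is where the operator-to-operator variability discussed in this section comes from.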
Quality control (QC) and analysis are also highly manual, often leading to a lack of consistency between operators and laboratories. Such variability compromises the reproducibility of results, impacting the quality of output and limiting any sharing or publishing of data. Accurate gating is further complicated by asymmetric and overlapping signals, frequent outliers, and errors in fluorescence channels – all of which may influence the output of both manual and automated gating, and subsequent downstream analysis.3 Having an algorithm available that facilitates flow cytometry gating and, ultimately, makes it possible to compare data between samples, is therefore important in pharmaceutical research and can facilitate enhanced insight generation from Big Data.
Machine learning is starting to make inroads into clinical trials data analysis, and is becoming a necessity for interpreting the vast quantities of complex data resulting from pharma R&D. Such tools not only increase the speed of discovery, but allow more complete discoveries to be achieved. Machine learning and cloud computing go hand-in-hand, facilitating quick and stress-free data storage and sharing.
The data lifecycle
The steps involved in the flow cytometry data lifecycle can be grouped into the following stages:
- Data acquisition
- Processing (QC and gating)
- Sub-population selection
- Results integration
- Data analytics
- Insight generation
During the processing step, gating can be completed manually (sequential gating) or automatically, where homogenous particle populations are identified in the data. Manual gating is the traditional approach used by many labs, but is time-consuming and can only be completed accurately by users with sufficient experience of the technique and knowledge of the biological processes at play.
Automated gating is based on the mathematical modelling of the fluorescence intensity distribution of particle populations.3 As well as drastically reducing analysis time, automated gating addresses the challenge of subjectivity in manual methods, and could even lead to the discovery of novel, biologically relevant populations that had not previously been considered (Table 1).
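As a hedged illustration of modelling an intensity distribution, the sketch below fits a two-component Gaussian mixture to one fluorescence channel using expectation-maximisation, then gates events by the more likely component. This is only one of many possible modelling approaches and is not presented as the method used by any particular platform. The function names and the synthetic data are assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_two_populations(values, iterations=50):
    """Fit a two-component 1D Gaussian mixture by expectation-maximisation.

    Returns (means, sds, weights); component 0 starts as the dimmer one.
    """
    lo, hi = min(values), max(values)
    spread = (hi - lo) or 1.0
    means = [lo + 0.25 * spread, lo + 0.75 * spread]  # deterministic start
    sds = [spread / 4, spread / 4]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each event belongs to the brighter component.
        bright = []
        for x in values:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            bright.append(p1 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate each component from its weighted events.
        for k, resp in ((0, [1 - b for b in bright]), (1, bright)):
            total = sum(resp)
            if total == 0:
                continue  # leave an empty component unchanged
            means[k] = sum(r * x for r, x in zip(resp, values)) / total
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, values)) / total
            sds[k] = max(math.sqrt(var), 1e-6)  # avoid a zero-width component
            weights[k] = total / len(values)
    return means, sds, weights


def is_positive(x, means, sds, weights):
    """True if the brighter component is the more likely source of x."""
    p0 = weights[0] * normal_pdf(x, means[0], sds[0])
    p1 = weights[1] * normal_pdf(x, means[1], sds[1])
    return p1 > p0


# Synthetic log-scale intensities (hypothetical): 300 negative, 100 positive events.
rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
intensities = [rng.gauss(2.0, 0.3) for _ in range(300)]
intensities += [rng.gauss(4.0, 0.3) for _ in range(100)]

means, sds, weights = fit_two_populations(intensities)
gated = [x for x in intensities if is_positive(x, means, sds, weights)]
print(means, len(gated))
```

The starting values are deterministic here, so repeated runs on the same data give the same gate. Algorithms that rely on random initialisation need a fixed seed to achieve the same repeatability.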
Despite the growing adoption of automated methods, manual gating has been used for decades and is a trusted approach by many labs due to its simplicity and ease of interpretation. There has also been some reluctance to move to computational approaches due to the biological interpretability of results – gated populations are not always representative of the biology and can be difficult to match with manually gated data. Additionally, random variation in automated clustering algorithms can lead to inconsistent results. Comparing results from automated gating with each other, as well as with traditional manual gating results, has therefore been an ongoing challenge, as every new algorithm developed is assessed using distinct datasets and evaluation methods.4
So, although a range of software platforms exist that enable automated gating, there is a lack of tools available that enable data sharing. There is therefore a need for a solution that allows gated and analysed data to be exported from one platform and imported into another, to reproduce analyses from raw files and facilitate demonstrable reproducibility.
What are the regulatory considerations?
Flow cytometry has the potential to be used in every stage of drug discovery and development and, as such, there are important considerations regarding the regulated pharma market. There is a need to develop robust flow cytometric methods and ensure compliance with appropriate regulatory guidelines, including Good Laboratory Practice (GLP) and those of the United States Food and Drug Administration (US FDA), the European Medicines Agency (EMA) and the International Organization for Standardization (ISO), such as the standard for clinical laboratory accreditation (ISO 15189).
Although regulated method validation is not mandatory for assays developed to support early drug discovery, method qualification is advisable to ensure consistent and reproducible data.5 A number of factors complicate the validation of flow cytometric methods, including the complexity of the data output and interpretation of results. Additional attention should be given to capturing metadata for validating data, ensuring comparability between experiments and therefore strengthening the position of flow cytometry as a robust and repeatable method.
Case example: Automating cytometry analysis in highly compliant environments
The manager of an Early Clinical Development group for a large pharmaceutical company was becoming increasingly frustrated with the time, effort and resources that her team of senior scientists was spending on manually analysing cytometry data from clinical trials. Added to this, strict validation protocols meant that if manual gating was not within a variance threshold, it had to be repeated – a common occurrence, with an estimated 35-50% of analyses being repeated, adding further time, effort and frustration. The team was frustrated because higher-value and more interesting scientific tasks, such as identifying relevant immune signatures as clinical biomarkers, were being neglected.
In-house solutions to automate some processing steps had been built but became stale and unmaintained. Further, output from these in-house tools could not be used when providing data for regulatory submissions due to uncontrolled code and lack of FDA 21 CFR Part 11 compliance. The group manager also surveyed the established solutions available and could not find an approach that combined automated processing capabilities with compliance with FDA and GxP standards and interoperability with other lab systems such as LIMS.
The Aigenpulse Platform provides a next-generation approach to software solutions in the compliant space without compromise on features and usability. The Aigenpulse Platform [GxP] with the CytoML suite may be rapidly deployed on-premises or in the Cloud and provides a compliant solution enabling streamlined, automated cytometry analysis at scale, leveraging ML-assisted data processing. The team was able to quickly configure their regulatory-required gating and analysis strategies – including multi-step gating isolating sparse populations and representing data in lower dimensions using t-SNE.
Now, the team could automate analysis across all of their live and previous clinical trials using their configured pipelines. Repeat analyses caused by results falling outside the variance threshold fell from as much as 50% to zero, freeing up scientists’ time for higher-value, insight-generating tasks. Scientists’ frustrations were further reduced by the ability of the Aigenpulse CytoML suite to access experiment metadata from LIMS, ensuring a single point of truth across systems. Further, because of the scalable, automated and semi-automated tools provided, better-performing analysis pipelines for more complex datasets could now be deployed – increasing the quality of results and providing more precise, better-defined clinical biomarkers.
Simultaneously saving time, increasing quality and reducing frustration in a highly regulated environment were all made possible with the Aigenpulse CytoML Suite [GxP].
Case example: COVID-19 immunophenotyping
In collaboration with Guy’s and St Thomas’ NHS Foundation Trust and the Francis Crick Institute in London, and the European Bioinformatics Institute (EMBL-EBI) in Cambridge, UK, King’s College London (KCL) launched the Covid-IP (COVID–ImmunoPhenotype) project in March 2020, to better understand the immunophenotype of patients infected with SARS-CoV-2 (the coronavirus responsible for COVID-19).6 Immunophenotyping is a test used to identify cells on the basis of the types of markers or antigens present on the cell’s surface, nucleus, or cytoplasm. This technique helps identify the lineage of cells using antibodies that detect markers or antigens on the cells.
Immunophenotypes vary greatly across different individuals, giving strong clues as to what mechanisms the human immune system must employ to protect us from COVID-19, and indicating ways in which it can go wrong, worsening rather than improving the patient’s condition. The COVID-IP project performed immunophenotyping on blood samples from >120 COVID-19 patients, consisting of eight complementary panels per patient for a comprehensive overview of the immune response. This generated thousands of datasets, requiring significant manpower to analyse the results and increasing the risk of inconsistency and inter-operator variability.
To overcome this challenge, automated pipelines were implemented for their flow cytometry analysis (CytoML, Aigenpulse), encapsulating all steps from data import and QC through gating, statistical analysis and visualisation. This enabled the researchers to apply guided algorithms that mimic human gating strategies to the entire dataset without manual intervention, saving time and effort and minimising the risk of introducing bias. The no-code interface allowed scientists to make use of computational tools without relying on uncontrolled, highly complex programmatic scripting in R or Python, removing a barrier to data processing and enabling data to be analysed in real time. This gives researchers more time to focus on data exploration and hypothesis testing – a crucial factor when investigating the cause of a global pandemic.
Compared with the KCL manual pipeline, automated processing has provided a strong correlation (Pearson’s R = 0.93) (Figure 1a) and reduced variation for each gating step (Figure 1b). The fast processing time has reduced the full time equivalent (FTE) from >10 over eight weeks using manual gating, to 1.5 over two weeks using automated gating.
A) Direct comparison of population sizes from 210 samples gated using manual or automated strategies. Colours represent gated populations. Correlation was measured with Pearson’s correlation coefficient, R = 0.93, p-value = 2.2 × 10⁻¹⁶.
B) Coefficient of variation for gated populations (normalised to live cells), comparing manual and automated gating strategies.
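The two statistics used in this comparison, Pearson’s correlation coefficient and the coefficient of variation, can be computed as in the short Python sketch below. The population sizes and replicate values are invented for illustration and are not the COVID-IP data. The function names are likewise assumptions.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


# Hypothetical population sizes (% of live cells) for the same four samples.
manual = [10.0, 20.0, 30.0, 40.0]
automated = [12.0, 19.0, 33.0, 41.0]
print(pearson_r(manual, automated))

# Hypothetical repeat measurements of one population by each strategy.
manual_repeats = [18.0, 22.0, 25.0, 15.0]
automated_repeats = [19.5, 20.5, 20.0, 20.0]
print(coefficient_of_variation(manual_repeats))
print(coefficient_of_variation(automated_repeats))
```

A coefficient of variation closer to zero indicates that repeated gating of the same population gives more consistent results.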
Automated flow cytometry data processing platforms, such as CytoML, not only take the labour out of routine analysis and facilitate traceability and consistency, but also enable the reuse of processed cytometry data, integrating population counts identified by manual gating (in .csv format) to increase the value of the data and enable cross-project analysis. Such capabilities are particularly vital for projects such as Covid-IP, where numerous laboratories from different institutions rely on sharing data in real time and obtaining insights that could aid the diagnosis and treatment of COVID-19.
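To show what integrating population counts from a .csv file might involve, the following Python sketch reads counts and normalises them to live cells so that samples can be compared. The column names (sample_id, population, count) and the values are hypothetical. They do not describe a documented CytoML export format.

```python
import csv
import io

# Hypothetical export of manually gated population counts.
CSV_TEXT = """sample_id,population,count
S1,live,10000
S1,T cells,4000
S1,B cells,1000
S2,live,8000
S2,T cells,2000
S2,B cells,1200
"""


def load_counts(handle):
    """Read rows into {sample_id: {population: count}}."""
    counts = {}
    for row in csv.DictReader(handle):
        counts.setdefault(row["sample_id"], {})[row["population"]] = int(row["count"])
    return counts


def normalise_to_live(counts, live_key="live"):
    """Express each population as a fraction of that sample's live cells."""
    fractions = {}
    for sample, populations in counts.items():
        live = populations[live_key]
        fractions[sample] = {
            name: count / live
            for name, count in populations.items()
            if name != live_key
        }
    return fractions


counts = load_counts(io.StringIO(CSV_TEXT))
fractions = normalise_to_live(counts)
print(fractions)
```

In practice the same functions could be pointed at a file opened with `open(path, newline="")` instead of the in-memory text used here.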
The Big Data generated by pharmaceutical R&D holds enormous opportunity for the development of life-changing therapies but cannot be leveraged without appropriate data analytics to unlock insights and facilitate decision making. For example, more value can be derived from integrating flow cytometry data with both in-house and public proteomics and transcriptomics data, using platforms that integrate every step of the data lifecycle.
Researchers can rapidly explore large data assets to drive development decisions, and use the time saved on laborious data processing for higher value-added tasks. Only recently have solutions become available that leverage AI and ML to speed up analysis and allow the sharing of gated cytometry data between researchers working across different platforms. Such solutions are invaluable for validating and verifying the reproducibility of analyses, filling the gap that currently exists in the optimisation of flow cytometry data for pharma research.
Tariq E. Khoyratty DPhil, Computational Biologist, Aigenpulse
Khoyratty works at the intersection of science and technology, driving the development of the Aigenpulse Platform with a focus on scientific rigour. Previously, he studied transcriptional regulation in the innate immune system, with a focus on signal dependent inflammatory responses at the Kennedy Institute of Rheumatology, University of Oxford. During his DPhil, he focussed on transcription factors governing macrophage responses, initially starting as a wet lab biologist before progressing to bioinformatics. Subsequently, as a Postdoctoral Fellow, Tariq studied neutrophil transcriptional responses to infection and inflammation, in conjunction with Celgene.
Satnam Surae PhD, Chief Product Officer, Aigenpulse
Surae has been active in life sciences for more than 10 years. While originally focussing on biochemistry, he discovered early on his passion for applying information technologies to biological challenges. His unique ability to transition between both worlds enables our teams to effectively develop the product and at the same time provides him with the ability to shape our customer engagement.
1. Flow cytometry: breaking bottlenecks in drug discovery and development, Drug Target Review (2016).
2. Millán O and Brunet M (2015) Flow Cytometry as Platform for Biomarker Discovery and Clinical Validation. In: Preedy V, Patel V (eds) General Methods in Biomarker Research and their Applications. Biomarkers in Disease: Methods, Discoveries and Applications. Springer, Dordrecht.
3. Montante S and Brinkman RR (2019) Flow cytometry data analysis: Recent tools and algorithms, Int J Lab Hematol, 41(Suppl. 1), 56–62.
4. Aghaeepour N et al. (2013) Critical assessment of automated flow cytometry data analysis techniques, Nature Methods, 10(3), 228–238.
5. van der Strate B, Longdin R, Geerlings M et al. (2017) Best practices in performing flow cytometry in a regulated environment: feedback from experience within the European Bioanalysis Forum, Bioanalysis, 9(16), 1253–1264.
6. The COVID-IP project, https://www.immunophenotype.org/