The role of machine learning in cancer drug development

Throughout the continuum of drug development, from target discovery to patient selection, machine learning approaches are being adopted to reliably mine vast amounts of data and make predictions with higher accuracy. Anita Ramanathan discusses how machine learning is currently used across different stages of cancer research.

Applying computational methods to prevailing challenges in cancer research is still an emerging field, but it promises shorter timelines and lower costs, and it is already in use at various stages of the research process.

Genome interpretation: To decipher genetic variants in cancer 

The success of targeted therapies in cancer comes on the heels of identifying specific molecular targets resulting from genetic variations. Thanks to next-generation sequencing experiments, volumes of datasets profiling both normal and cancer-related mutations are now readily accessible. Modulations in protein expression levels or the presence of dysfunctional proteins observed in individuals with specific mutations act as biomarkers in selecting patients for targeted therapies.  

Current problem: The genetic variants currently identified through genomic sequencing lie mostly in the coding regions of the genome. This leaves out the majority of cancer-associated variants, which sit in the non-coding regions outside exons that make up over 98% of the genome. “When selecting patients for therapy, we currently look at variants that directly affect protein expression or function. But what if there’s a regulatory variant that is shutting down the production of that protein or is interfering with splicing? These non-coding contributors can also lower protein expression levels, and are generally missed,” says Dr Olga Troyanskaya, Professor at Princeton University, who specialises in developing machine learning models to interpret disease complexity. “In the past few years, there has been a realisation that understanding the transcriptional and post-transcriptional effects of disease variants is critical in drug development, especially in identifying new targets.”

How machine learning is used: Emerging computational approaches are being used to predict the biochemical impact of non-coding variants in numerous diseases, including cancer1. Algorithms essentially learn the regulatory code to make predictions. For example, the sequence model, a framework of deep learning, is trained on central dogma principles using expansive databases. Profiles of regulatory modifiers, such as chromatin modifications, transcription factor binding sites, histone marks and so on, serve as training material. Subsequently, the algorithm can model the relationship between genetic sequences and the many factors that influence gene regulation. When a novel sequence from, say, a cancer variant is presented to the algorithm, it can now predict the impact of this variant on regulatory attributes such as chromatin modifications or histone marks.  

One of the highlights of this deep learning model is that it can also predict the effect of mutations that are rare or never seen before. This is because the model is trained on the regulatory rules followed in the human genome, so it remains unbiased by any existing cancer variant information. “We don’t train these algorithms using disease-specific data because the model would then try to overfit,” explains Dr Troyanskaya. “Also, the available data on diseases is limited to existing studies and can limit the scope of predictions. Instead, the model uses the fundamental rules of genetic regulation so it can make accurate predictions about non-coding regulatory factors.”
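The variant-scoring idea described above can be sketched as follows: the same model scores the reference sequence and the mutated sequence, and the difference is read as the predicted regulatory effect. This is only an illustration. The `TOY_MOTIF` weights, the `predict_binding` function and the example sequence are all hypothetical stand-ins, and a small motif scan takes the place of a trained deep learning sequence model.

```python
import math

# Hypothetical position weight matrix (log-odds style scores) for a made-up
# 4-bp transcription-factor motif. A real sequence model would learn many
# such features from regulatory profiling data.
TOY_MOTIF = [
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.5, "G": -0.8, "T": -1.0},
    {"A": -0.7, "C": -1.0, "G": 1.4, "T": -1.0},
    {"A": -1.0, "C": -0.6, "G": -1.0, "T": 1.3},
]


def predict_binding(sequence: str) -> float:
    """Stand-in for a trained model: best motif match, squashed to (0, 1)."""
    width = len(TOY_MOTIF)
    best = max(
        sum(TOY_MOTIF[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )
    return 1.0 / (1.0 + math.exp(-best))


def variant_effect(reference: str, position: int, alt_base: str) -> float:
    """Predicted change in the regulatory score caused by one substitution."""
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return predict_binding(mutated) - predict_binding(reference)


reference = "TTGACGTAAC"  # hypothetical non-coding sequence containing the motif ACGT
effect = variant_effect(reference, position=4, alt_base="T")  # C -> T inside the motif
print(f"predicted change in binding score: {effect:+.3f}")
```

Because the score depends only on the sequence, the same function can be applied to a variant that has never been observed in any patient, which is the property highlighted above.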


Targeted therapy: In the coming years, as these promising computational models in genomics go through further refinement for use in precision medicine, one direct application would be to identify additional patients for existing treatment modalities. “By focusing on not only protein expression levels due to genetic mutations but also the underlying regulatory genes that influence protein expression, we can expand the patient population benefitting from current therapies,” notes Dr Troyanskaya.  

Additionally, this method can also be used to find new targets in drug discovery or to classify patient subgroups based on disease susceptibility or treatment resistance. 

To predict drug treatment response 

Current problem: High-throughput screening is a fundamental step in drug discovery. However, it’s not financially or practically feasible to screen for all possible drugs or drug combinations for a specific target. “To put this into context, imagine that you’re testing a thousand drug candidates against a thousand different cell lines… you’ll need to perform a million experiments. That can be challenging for even high-throughput centres to pull off. And we haven’t even considered replicates or drug concentration dilutions yet,” notes Dr Miguel Rocha, Associate Professor at the University of Minho, a computer scientist applying machine learning models to biological systems. “But if you’re able to test even 1% of these combinations, then machine learning models can use this data to make predictions about the rest of this experimental matrix. It significantly reduces time and effort.”  
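The arithmetic behind Dr Rocha's example is easy to check. In the sketch below, the thousand drugs, thousand cell lines and 1% figure come from the quote, while the eight concentrations and three replicates are hypothetical values chosen only to show how quickly the count grows.

```python
def screening_experiments(n_drugs, n_cell_lines, n_concentrations=1, n_replicates=1):
    """Number of individual measurements needed for a full screening matrix."""
    return n_drugs * n_cell_lines * n_concentrations * n_replicates


full_matrix = screening_experiments(1000, 1000)
tested_subset = full_matrix // 100  # the 1% that is actually run in the lab

# Hypothetical design: 8 concentrations and 3 replicates per drug-cell line pair.
with_design = screening_experiments(1000, 1000, n_concentrations=8, n_replicates=3)

print(f"full matrix:             {full_matrix:,} experiments")
print(f"1% tested:               {tested_subset:,} experiments")
print(f"with doses + replicates: {with_design:,} experiments")
```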

How machine learning is used: Of the different computational methods that can be used for drug response predictions, deep learning may be most suitable due, in part, to its ability to handle massive datasets and capture nonlinear, complex relationships seen in biology2. A typical model used to predict drug responses may use pharmacological and cell line ‘omics data profiles to predict the half-maximal inhibitory concentration (IC50) of test drugs.  
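For readers less familiar with IC50, the following sketch shows one common way of thinking about it: the concentration at which a dose-response curve crosses 50% viability. It assumes a simple Hill curve and log-scale interpolation between measured points. The doses, the "true" IC50 of 2.0 and the function names are hypothetical, and real pipelines usually fit the curve rather than interpolate.

```python
import math


def hill_viability(concentration, ic50, hill_slope=1.0):
    """Fraction of cells surviving under a simple Hill dose-response model."""
    return 1.0 / (1.0 + (concentration / ic50) ** hill_slope)


def estimate_ic50(concentrations, viabilities):
    """Interpolate, on a log-concentration scale, the dose giving 50% viability."""
    points = list(zip(concentrations, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 0.5 >= v_hi and v_lo != v_hi:
            fraction = (v_lo - 0.5) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + fraction * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None  # the curve never crosses 50% within the tested range


doses = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical concentrations, in micromolar
observed = [hill_viability(d, ic50=2.0) for d in doses]  # simulated measurements
estimated_ic50 = estimate_ic50(doses, observed)
print(f"estimated IC50: {estimated_ic50:.2f} micromolar")
```

The estimate is only as good as the spacing of the tested doses, which is one reason dilution series add so many experiments to a screen.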

Deep learning workflows for drug response prediction start by defining what needs to be predicted, i.e., the sensitivity of one drug or the synergistic effects of a combination of drugs. The model is then trained using public repositories of cancer drug screening datasets, along with cell line data. Then, parameters are further fine-tuned, the model’s ‘fit’ is determined and its performance is evaluated with a scoring system2.
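The workflow above (define the target, train on the tested pairs, then score the fit) can be walked through on a toy example. The sketch below is deliberately not a deep learning model: it swaps in a simple additive baseline (global mean plus a per-drug and a per-cell-line offset) and uses synthetic data, so that the train, predict and evaluate steps stay visible. All names and numbers are assumptions made for illustration.

```python
import math
import random

random.seed(0)
n_drugs, n_lines = 20, 30

# Synthetic "ground truth": each response is a drug effect plus a cell-line
# effect plus noise. Real screening data are far less tidy.
drug_effect = [random.gauss(0, 1) for _ in range(n_drugs)]
line_effect = [random.gauss(0, 1) for _ in range(n_lines)]
response = {
    (d, c): drug_effect[d] + line_effect[c] + random.gauss(0, 0.1)
    for d in range(n_drugs)
    for c in range(n_lines)
}

# Step 1: treat 30% of the matrix as tested; the rest plays the untested pairs.
pairs = list(response)
random.shuffle(pairs)
split = int(0.3 * len(pairs))
train, test = pairs[:split], pairs[split:]

# Step 2: "train" by estimating the global mean and per-drug / per-line offsets.
global_mean = sum(response[p] for p in train) / len(train)


def offsets(index):
    """Mean deviation from the global mean, grouped by drug (0) or cell line (1)."""
    sums, counts = {}, {}
    for pair in train:
        key = pair[index]
        sums[key] = sums.get(key, 0.0) + response[pair] - global_mean
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


drug_offset, line_offset = offsets(0), offsets(1)


def predict(drug, line):
    return global_mean + drug_offset.get(drug, 0.0) + line_offset.get(line, 0.0)


# Step 3: score the fit on the held-out pairs with root-mean-square error.
def rmse(predict_fn):
    errors = [(predict_fn(d, c) - response[(d, c)]) ** 2 for d, c in test]
    return math.sqrt(sum(errors) / len(errors))


model_rmse = rmse(predict)
baseline_rmse = rmse(lambda d, c: global_mean)
print(f"additive model RMSE: {model_rmse:.3f}  |  mean-only RMSE: {baseline_rmse:.3f}")
```

Comparing against a mean-only predictor is a cheap sanity check: a model that cannot beat it has not learned anything about individual drugs or cell lines.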


Drug sensitivity prediction: In a preclinical translational study performed at The Institute of Cancer Research (ICR), London, UK, 35 patient-derived lung cancer cell lines were exposed to seven different drugs for an hour at a concentration similar to what will likely be achieved in humans3. The sensitivity of each drug on 50 different phosphoproteins depicting cellular inflammation was then predicted using trained machine learning algorithms. 

“We used seven different drugs that each blocked a different part of the kinase pathway. With 35 different cell lines and 50 phosphoproteins being examined for each drug, that’s a significant amount of data representing the changing cell health in response to drugs,” says Professor Udai Banerji, Deputy Director of the Drug Development Unit at the ICR who led this study. “In predicting which cancer cells would die after drug treatment, the machine learning algorithms outperformed conventional biomarker-based predictions.” Prof Banerji continues: “Moreover, the model was able to predict what drug combinations to use to achieve the desired result. That is, if one drug blocks a particular pathway but doesn’t succeed in killing the cancer cell, what is the next most suitable drug to add to successfully kill it. We also confirmed the accuracy of these predictions by testing these drug combinations on cell lines and noticed a strong correlation.” 

This proof-of-concept study shows potential in assessing treatment response in a non-invasive manner. “Although we’re far from applying this method to clinical applications, theoretically, it’s possible to obtain a patient’s biopsy, expose cancer cells to the drugs, lyse the cells, and run proteomic experiments to derive predictions about the patient’s drug response in about 48 hours,” Prof Banerji explains. 

Drug repurposing: To discover novel targets for existing cancer drugs  

Current problem: Despite making considerable progress in technological advancement, drug development outcomes haven’t improved. Most clinical studies continue to face high attrition rates, with numbers for cancer being higher than other therapeutic areas. The prolonged timelines in bringing new drugs to market also result in exorbitant investment costs, making drug development a risky enterprise for pharmaceutical sponsors. 

One way to reduce costs and shrink timelines is to repurpose existing drugs for a novel clinical application. Not only is the repurposed drug less likely to fail in safety assessments due to earlier successful trials, but it also wouldn’t need extensive preclinical tests to get approved. In doing so, machine learning can be used to parse through the enormous information contained in existing databases to extract promising drug candidates with repurposing potential4. 

How machine learning is used: Developing drug repurposing hypotheses using computational approaches involves systematically analysing large datasets such as high-throughput drug profiling assays or protein databases containing chemical structure and ligand affinity data. It can also involve examining electronic health records of patients to perform a retrospective clinical analysis4,5. 

“The advent of big data has made it possible to develop machine learning methods to predict potential drug candidate compounds in cancer as well as identify novel anti-cancer targets,” says Dr Sivanesan Dakshanamurthy, Professor and Director of Computational Chemistry and Drug Discovery Resources at Georgetown University Medical Center. “It can provide new insights into how existing drugs bind to targets, and also offer predictions about the phenotypic outcomes of this drug-target interaction.” 

Several machine learning approaches are being used to obtain meaningful information for cancer drug repurposing. Below are two such models regularly used in cancer studies4,5: 

Signature matching: Unique attributes of a drug product, also known as the ‘signature’, i.e., its chemical structure or transcriptomic characteristics, are matched with other drugs or disease phenotypes. This can be used to make drug-disease comparisons or assess drug-drug similarities to determine potential shared applications or similar mechanisms of action. 

Computational molecular docking: This structure-based approach predicts the binding complementarity between a target protein and drug ligand. When the target protein is known, several drug candidates can be tested for binding affinities. On the other hand, several target proteins can also be assessed against drug libraries to explore new interactions or uncover novel applications.  

This method can be used for large-scale virtual screening projects in lead optimisation. The model can also provide a prioritised ranking order of drug compounds along with the proposed structural hypothesis for each target-ligand interaction.  
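As a rough illustration of signature matching, the sketch below compares hypothetical transcriptomic signatures using Pearson correlation. The gene values and drug names are invented, and the reading of a strongly negative correlation as "this drug may reverse the disease signature" is one common working assumption rather than a rule.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log fold-changes measured across the same five genes.
disease_signature = [2.1, 1.8, -1.5, 0.2, -2.0]
drug_signatures = {
    "drug_A": [-1.9, -1.6, 1.2, 0.1, 1.8],  # roughly opposite to the disease
    "drug_B": [2.0, 1.5, -1.3, 0.4, -1.7],  # mimics the disease
    "drug_C": [0.3, -0.2, 0.1, 1.9, 0.2],   # largely unrelated
}

scores = {name: pearson(sig, disease_signature) for name, sig in drug_signatures.items()}
ranked = sorted(scores, key=scores.get)  # most negative correlation first

for name in ranked:
    print(f"{name}: correlation with disease signature = {scores[name]:+.2f}")
```

The same `pearson` function applied to two drug signatures gives a drug-drug similarity score, which is the other comparison mentioned above.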


Improved prediction of drug candidates: Machine learning models trained with thousands of known protein-ligand binding datasets are able to provide a more accurate prediction of drug candidates. The reliability of the predictive model is determined by a set of performance metrics that uses a combination of factors such as precision, sensitivity, and specificity, among others. Repeatability is evaluated by running over 30 iterative simulations and checking the consistency of the predicted outcomes across all iterations. 
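The metrics named above have simple definitions, and the repeatability check amounts to computing them across repeated runs. In the sketch below, `noisy_model` is a hypothetical stand-in for a stochastic predictor, the labels are synthetic, and 30 runs are used to mirror the figure quoted in the text.

```python
import random
import statistics


def classification_metrics(y_true, y_pred):
    """Precision, sensitivity and specificity for binary labels (1 = binder)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def noisy_model(y_true, rng, error_rate=0.1):
    """Hypothetical stochastic predictor: flips each true label with a fixed probability."""
    return [1 - t if rng.random() < error_rate else t for t in y_true]


y_true = [1] * 50 + [0] * 50  # synthetic labels: 50 binders, 50 non-binders
runs = [
    classification_metrics(y_true, noisy_model(y_true, random.Random(seed)))
    for seed in range(30)  # 30 repeated simulations, each with its own seed
]
sensitivities = [run["sensitivity"] for run in runs]
print(
    f"sensitivity over 30 runs: mean={statistics.mean(sensitivities):.3f}, "
    f"sd={statistics.stdev(sensitivities):.3f}"
)
```

A small standard deviation across runs suggests the predictions are consistent; a large one is a warning that a single run should not be trusted.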

Prioritising compounds: In drug screening, computational methods can improve the scoring and ranking of candidate compounds to prioritise them in an unbiased manner based on the data used to train the algorithm. Moreover, millions of compounds can be screened with minimal resource requirements. “In one of our projects, we’re working on a challenging cancer protein that is uncharacterised and considered an undruggable target. In this case, conventional technologies cannot be used for screening drug compounds,” explains Dr Dakshanamurthy. “Using our in-house hybrid neural network machine learning method, we prioritised candidate compounds based on the predicted ligand-binding affinities.” 

Computational pathology: To digitalise image analysis  

Pathologists visually examine tissue slides obtained from patient biopsies under the microscope to provide a diagnosis. This report contains details such as gross morphology, tumour size, and tumour margins, along with the type and grade of the tumour. Immunohistochemistry (IHC) analyses may also be performed to further distinguish between cancer subtypes and classify them. Information gleaned from analysing pathology slides also feeds into a better understanding of the tumour microenvironment and serves as a method for patient selection in clinical trials.  

Current challenge: Manual examination is highly subjective and can yield variable results across pathologists. Given the high inter-observer variability, when subtle changes need to be measured, for instance, in response to drug treatment, these changes may be indiscernible. Human fatigue poses further constraints on reproducibility, especially when observations need to be scaled for larger clinical studies. Moreover, the human eye, regardless of expertise, may not always catch all the microscopic details present within the slide, missing opportunities for discovering new biomarkers.   

“Pathology has been and still is a manual, subjective process. But there’s a growing need within the pharmaceutical industry to get accurate, reliable data from pathology samples, and to make this process repeatable, scalable, and quantitative,” notes Dr Mike Montalto, Chief Scientific Officer and Biopharma President at PathAI, a company specialising in applying artificial intelligence to pathology to improve patient outcomes. 

How machine learning is used: A convolutional neural network, a deep learning model that can be applied to images, is typically used to digitalise pathology. Images from slides annotated by pathologists are used to train algorithms that eventually learn the cellular architecture and characteristics of healthy and diseased tissues. “We’re developing pathologist-guided algorithms that can examine slides in the same way a trained pathologist would,” says Dr Beatrice Knudsen, Professor of Pathology at the University of Utah. “Traditionally, pathology readouts use non-quantitative or semi-quantitative reporting systems, primarily using categorical scales. Using machine learning frameworks, we’re able to determine probabilities on a linear scale and make the process more quantitative. Plus, the computer quantifies H&E staining or IHC more objectively, eliminating the issue of human-to-human variability.”
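To give a sense of what the "convolutional" part means, the sketch below slides a small filter over a toy image patch and records how strongly each location matches. The patch and the hand-written edge filter are hypothetical; in a trained network the filter values are learned from annotated slides, and many filters are stacked in layers.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a convolutional layer."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Toy 4x5 "tissue patch": unstained background (0) on the left, stained region (1) on the right.
patch = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-written filter that responds where intensity rises from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(patch, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only along the boundary between the two regions, which is how a filter can localise a structure such as a tumour margin without being told where to look.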

In addition to improving accuracy, these models can be reliably applied to large cohorts of patients, even in the order of thousands. Protocols can be scaled across institutes to allow reproducible measurements irrespective of staff experience or observer bias. Most notably, drug responses to treatment can be quantified by comparing ‘before’ and ‘after’ images. 

Another interesting benefit of applying machine learning to pathology is its ability to perform ‘microscopically impossible’ analyses. Looking into the microscope, trained professionals examine known variables, but may miss undetectable details such as nuclear textures or cellular patterns. These newly surfaced pathological characteristics caught by computer algorithms have the potential to become novel biomarkers or provide insights into unexplored tumour biology. 

It must be noted, however, that pathologists, who spend hours meticulously engaged in image analysis, bring invaluable institutional wisdom that an algorithm simply cannot replace. For example, if the underlying tissue sample itself was of poor quality, a pathologist would promptly pick up on that whereas an algorithm may not, until it is programmed to do so. As such, pathologists form integral partners in developing machine learning models, often providing annotations to program algorithms or confirmations on AI-based pathology reports. Keeping pathologists in the loop even when applying computational methods helps maintain the trust factor, especially when the results influence patient outcomes. 


Tumour microenvironment: Deep learning methods are being employed to examine intricate details of the tumour microenvironment and quantitatively study tumour-immune interactions using only pathology slides. The computational method developed by Dr Knudsen’s team analysed six different cell-expressing biomarkers that labelled immune cells and tumour cells within a single tissue section of pancreatic cancer samples6. Each cell population is classified by detecting uniquely coloured chromogens. The algorithm was developed to accurately identify characteristics such as colour, texture, shape, and a combination of these. It also predicted the nearest cell neighbour using spatial analysis.
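The nearest-neighbour step can be illustrated with a few lines of geometry. The sketch below assumes the cells have already been detected and classified, and that each has an (x, y) position; the coordinates, identifiers and cell types are hypothetical, and this is not the published pipeline.

```python
import math


def nearest_neighbour(cell, candidates):
    """Return the candidate cell closest to `cell`, with the Euclidean distance."""
    closest = min(candidates, key=lambda other: math.dist(cell["xy"], other["xy"]))
    return closest, math.dist(cell["xy"], closest["xy"])


# Hypothetical cell centroids (in micrometres) after detection and classification.
cells = [
    {"id": "t1", "type": "tumour", "xy": (10.0, 12.0)},
    {"id": "t2", "type": "tumour", "xy": (40.0, 35.0)},
    {"id": "i1", "type": "immune", "xy": (12.0, 15.0)},
    {"id": "i2", "type": "immune", "xy": (38.0, 30.0)},
    {"id": "i3", "type": "immune", "xy": (25.0, 24.0)},
]

tumour_cells = [c for c in cells if c["type"] == "tumour"]
immune_cells = [c for c in cells if c["type"] == "immune"]

nearest_tumour = {}
for immune in immune_cells:
    tumour, distance = nearest_neighbour(immune, tumour_cells)
    nearest_tumour[immune["id"]] = (tumour["id"], distance)
    print(f"{immune['id']} -> {tumour['id']} at {distance:.1f} um")
```

On a whole slide with many thousands of cells, a spatial index would normally replace this brute-force search, but the quantity being measured is the same.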

These proof-of-concept studies are opening doors to machine learning-based applications in clinical settings where insights about the patient’s tumour microenvironment can be obtained without requiring any additional tests. “In the future, when we’re able to apply this technique to precision medicine, it has the potential to make testing more affordable and accessible,” says Dr Knudsen. “By running images of routine pathology slides through an algorithm, we’d be able to obtain prognosis readouts without requiring additional expensive testing.” 

Patient selection: To be eligible for immunotherapy, non-small cell lung cancer patients are tested for their levels of PD-L1, a protein whose interaction with the PD-1 receptor is blocked by the immune checkpoint inhibitor used in this therapy. The PD-L1 test involves a microscopic examination of the immunostained tumour tissue to measure what percentage of cells express PD-L1. Patients with higher PD-L1 expression levels are considered good candidates. Manual readings, however, don’t always accurately capture sub-moderate or lower expression levels, often classifying these patients as ‘negative’, and therefore, ineligible for therapy.
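The measurement itself is a percentage, which makes it straightforward to sketch. In the example below, the per-cell staining intensities and both thresholds are hypothetical and are not clinical cut-offs; the point is only that the percentage of "positive" cells depends heavily on how faint a stain the reader can register.

```python
def percent_positive(intensities, positive_threshold):
    """Percentage of tumour cells whose staining intensity reaches the threshold."""
    positive = sum(1 for value in intensities if value >= positive_threshold)
    return 100.0 * positive / len(intensities)


# Hypothetical per-cell PD-L1 staining intensities on a 0-1 scale.
tumour_cell_intensities = [0.05, 0.10, 0.22, 0.30, 0.08, 0.65, 0.12, 0.28, 0.03, 0.25]

# Illustrative thresholds only: one that registers strong staining alone,
# and one that also registers weak staining.
strong_only = percent_positive(tumour_cell_intensities, positive_threshold=0.5)
including_weak = percent_positive(tumour_cell_intensities, positive_threshold=0.2)

print(f"cells counted as positive (strong staining only): {strong_only:.0f}%")
print(f"cells counted as positive (weak staining included): {including_weak:.0f}%")
```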

To make the patient selection procedure more reliable and quantifiable, the team at PathAI used machine learning to perform PD-L1 measurements7. “We found that an entire subset of patients considered ‘PD-L1-negative’ through manual readouts actually had low expression levels that the computer could detect. These patients could respond to the treatment in the same way that a ‘PD-L1-positive’ patient would,” explains Dr Montalto. “Although this study was performed on a retrospective cohort, it demonstrates the real-world benefits of AI-driven applications in pathology. Including those patients previously considered ‘negative’ can significantly expand the number of eligible patients receiving life-saving therapies.” 

Advancing machine learning with continued improvements and collaborations 

Evidently, machine learning has promising applications in finally addressing – or even resolving – the enduring challenges slowing down cancer drug development. However, researchers working on this topic will be quick to point out that it’s far from perfect. Several improvements will be necessary to make models perform better, to obtain good quality training data, and, most importantly, to define how the algorithm arrives at a particular conclusion. 

Trusting machine learning results: Many cancer researchers and physicians are wary of trusting machine learning results due to ‘black box’ models that are exceedingly complex but simply cannot be interpreted. This can be especially concerning when decisions need to be made about a patient’s health, but there’s no way to ascertain how the machine learning model made that decision. 

To alleviate this concern, the emerging subfield of ‘explainable machine learning’ is now gaining traction. “One way to understand what the computer is learning is to visualise it,” says Dr Knudsen, providing the perspective of a seasoned pathology expert working on machine learning projects. “We’ve applied a backpropagation method to look back into the tissue used for training and see what the algorithm has picked up and learned from. Attention maps or activation maps can be used to identify what patterns the computer is detecting, and how these patterns are distributed across tissues.” 
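One simple way to visualise what a model relies on is to hide one part of the image at a time and watch how the score changes. The sketch below uses this occlusion approach, which is a different and simpler technique than the backpropagation-based method Dr Knudsen describes, and it uses a hypothetical `toy_score` function in place of a trained classifier.

```python
def toy_score(image):
    """Stand-in for a trained classifier: responds to bright pixels in the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_map(image, score_fn):
    """Drop in the model's score when each pixel is blanked out in turn."""
    baseline = score_fn(image)
    heatmap = []
    for r, row in enumerate(image):
        heat_row = []
        for c, _ in enumerate(row):
            occluded = [list(line) for line in image]  # copy so the input is untouched
            occluded[r][c] = 0.0
            heat_row.append(baseline - score_fn(occluded))
        heatmap.append(heat_row)
    return heatmap


# Hypothetical 4x4 image patch with a bright structure in the top-left corner.
image = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.2, 0.1],
    [0.1, 0.0, 0.3, 0.2],
    [0.0, 0.1, 0.2, 0.1],
]

heatmap = occlusion_map(image, toy_score)
for row in heatmap:
    print(["{:.3f}".format(value) for value in row])
```

Pixels with a large value in the heatmap are the ones the score depends on; overlaying such a map on the tissue image lets a pathologist judge whether the model is looking at biologically meaningful regions.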

Recently, a machine learning model that was interpretable or ‘visible’ was developed, where the inner workings of the algorithm were mapped to the cellular functions in a simple eukaryotic cell8. Interpretable models have since been applied to human cancer cells. In 2020, a predictive, ‘visible’ machine learning model, DrugCell, was built to simulate the cellular responses of human cancer cells to therapeutic drug compounds9, providing a precedent for building interpretable models in translational research.

The importance of collaborations: Successfully applying machine learning to cancer research is a collaborative effort between three different groups of experts: computer scientists, cancer biologists, and clinicians. Each group brings a unique perspective to the problem at hand. In serving the same collective cause, working in silos can create unnecessary delays or even yield inaccurate results. There have been instances of papers published where a machine learning model was deemed capable of detecting mutations in cancer tissues when it was only catching the five different lung adenocarcinoma patterns, an easy observation for a pathologist to spot. 

“The machine learning question being asked about how well a particular model fits, will be very different from the scientific or clinical question that needs to be asked in the same project,” notes Prof Banerji, whose team of bioinformaticians, scientists, and clinicians at the ICR work closely to test drug response predictions. “It’s important to have multidisciplinary teams where each group has an equal footing on the project and differing opinions are heard and respected. It can get problematic when teams simply outsource to or hire a computer scientist or a medical professional but don’t really treat them as equal partners. Having mutual respect for different domain experts and an inquisitive approach to each other’s areas of expertise is key to making collaborative projects work in machine learning.” 


1: Wong AK et al., Decoding disease: from genomes to networks to phenotypes. Nat Rev Genet. 2021

2: Baptista D et al., Deep learning for drug response prediction in cancer. Brief Bioinform. 2021 

3: Coker EA et al., Individualised prediction of drug response and rational combination therapy in NSCLC using artificial intelligence enabled studies of acute phosphoproteomic changes. Mol Cancer Ther. 2022 

4: Pushpakom S et al., Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov. 2019 

5: Issa NT et al., Machine and deep learning approaches for cancer drug repurposing. Semin Cancer Biol. 2021 

6: Fassler DJ et al., Deep learning-based image analysis methods for brightfield-acquired multiplex immunohistochemistry images. Diagn Pathol. 2020 

7: Duan C et al., Association of digital and manual quantification of tumor PD-L1 expression with outcomes in nivolumab-treated patients. Poster presented at AACR 2020 

8: Ma J et al., Using deep learning to model the hierarchical structure and function of a cell. Nat Methods. 2018 

9: Kuenzi BM et al., Drug Response and Synergy Using a Deep Learning Model of Human Cancer Cells. Cancer Cell. 2020 

Included in DDW Oncology Research ebook  

About the author 

Anita Ramanathan is a science writer and award-winning speaker based in Bristol, UK. In her capacity as a science writer/editor at several digital publications, including NIH Research Matters, she has crafted dozens of stories buried under numbers and scientific findings. A storyteller at heart, Anita also delivers science communication workshops. 



