Professor Sonika Bhatnagar explores how artificial intelligence is enabling drug discovery at rates never seen before.
A bullet for cure
The year was 1907, and Paul Ehrlich, the German Nobel laureate, had proposed the idea of a magic bullet to cure diseases. The magic in the bullet was that the human at whom it was fired would be untouched, but the tiny disease-causing agents would be annihilated. Here started the story of drug discovery. Ehrlich proved that magic bullets existed with Salvarsan, an anti-syphilis compound that cured a rabbit with a single dose and left it unharmed. The magic bullet was the drug, while the specific mark it would reach in the syphilis-causing bacterium to eliminate the disease was the drug target. Salvarsan was soon overtaken by the penicillin antibiotics. The drug target for penicillin is a protein that helps bacteria to synthesise their cell walls.
The potential of drug targets
Over the next 50 years, new targets emerged and drugs were developed against them. Many targets also failed or were discontinued due to safety issues, leading to the loss of millions of dollars. Thus, there was a dire need for new and effective drug targets.
One of the first and most basic requirements for a protein to be characterised as a drug target is to show that inhibiting its function prevents the disease. Here, new bioassay technologies helped with testing, but more definite answers were required. Thus came a slew of loss-of-function strategies to validate the efficacy of a drug target.
As more drug targets became available, their common features started to get noticed. For instance, most successful drug targets were proteases and G-protein coupled receptors with definite characteristics. Pharmaceutical R&D focused on these and other successful drug target families. Recombinant DNA technology and improved protein purification techniques further contributed to our understanding and exploration of drug targets. The advent of X-ray crystallography and nuclear magnetic resonance (NMR) made structures of many of the drug targets available, alone or in complex with inhibitors. In this way, Ehrlich’s magic bullets materialised before our eyes years after inception.
Drug discovery: accelerated
The process of drug discovery remained essentially slow until the advent of high-throughput Omics technologies that enabled large-scale production of biological data. Expression patterns, proteomic profiles, bioassay results, and structures could now be captured en masse in normal and disease conditions. Something was needed for us to view these results in the context of a synchronous, connected large-scale system instead of as disjointed fragments. Systems Biology emerged to give us a bird’s eye view of a large amount of interconnected data, thus allowing us to model, and furthermore predict, the behaviour of the entire system.
Computers and computational chemistry grew simultaneously with experimental drug discovery technologies. Virtual screening made it possible to predict thousands of drug activities at a click. Structure Activity Relationships of drug molecules could now be quantified to assimilate molecular properties of hundreds of molecules at once, while fragment-based strategies strengthened de novo design. The computational chemistry algorithms and force fields developed in tandem with hardware, processors, parallel processing and supercomputers. Big data handling and machine learning progressed by leaps and bounds. Combined with internet technologies, the progress in drug discovery has been phenomenal.
The constant and urgent need for new drug targets
The starting point of modern-day drug discovery is the selection of a drug target. Often, drug targets already validated in academic literature are picked up for drug discovery and development efforts. Unseen problems with target selection may emerge at a later stage, leading to heavy losses of time, money and effort.
There are still many common and rare diseases for which there are no drugs. Apart from this, there are a large number of diseases that arise as complications of a primary microbial infection. As an example, about 10% of Covid-19 patients suffer from its long-term effects. That amounts to 65 million people with over 200 symptoms arising from effects on multiple organs. Currently, there are no known specific drug targets for these conditions, but ML can be used to predict them.
Perspective: organisms through the lens of network biology
It’s a small, small world – says the song. The concept of six degrees of separation states that anyone can be connected to another person through no more than five intermediaries or “friend of a friend”. While social networks connect people, biological networks allow us to visualise connected biological entities.
Network biology is one of the techniques of systems biology. Essentially, a biological network is a connected graph of nodes that could represent diseases, genes, proteins, atoms, organs, microorganisms, drugs or ligands. The edges between the nodes can likewise represent a variety of relationships, such as relatedness, origin, forces or interactions.
Biological networks have been likened to road transport networks providing connectivity of different locations. Just as the maximum traffic is witnessed between important central locations, the highly connected nodes in a biological network are extremely significant. Studies in network biology have shown that biological networks have a distinct ‘small world’ structure in which a few hub nodes have many connections, so that any two nodes are only a few steps apart. So, a biological network consists of multiple clusters of highly connected nodes that are in turn connected to one another.
In fact, the connectedness or centrality of the different nodes can be directly correlated with their biological significance. In this way, biological networks let us estimate the node connections and significance. Networks also allow us to integrate diverse types of data from biological pathways, literature, genomics, transcriptomics, proteomics and ligand-binding studies.
Network biology has also given us some interesting insights into the behaviour of drug targets. For disease-causing microorganisms, the best strategy is to target the highly connected hub nodes. This was also shown by loss-of-function studies. The highly interacting central or hub nodes were the ones that were essential for the survival of the microbe. On the other hand, the highly connected hub nodes are the ones to avoid targeting in the human host, as inhibiting them can cause too many side-effects.
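To make the idea of a hub node concrete, here is a minimal Python sketch that computes degree centrality, the fraction of the other nodes that each node is directly connected to. The protein names and edges are invented for illustration and do not come from any real interaction dataset; degree is also only one of several centrality measures that could be used.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected graph given as edge pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Hypothetical toy network: P1 interacts with every other protein.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"), ("P4", "P5")]

centrality = degree_centrality(toy_edges)
hub = max(centrality, key=centrality.get)  # the most connected node
print(hub, centrality[hub])
```

In a pathogen network, a node like P1 would be a candidate target; in the host network, the same high score would be a reason for caution.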
It takes two to tango: host-pathogen interactions
For a disease to spread, biological molecules of the host and the pathogen must interact with each other. As protein-protein interactions (PPIs) initiate infection and lie at the root of infectious diseases, many host-pathogen interaction studies have concentrated on PPI networks.
Some very interesting facts have come to light using this approach. As an example, it can be shown that viruses have multifunctional proteins that interact in large numbers with many of the host proteins. Also, the interactions with key host proteins can show us which biological pathways will be involved, and what are the associated complications that a patient may develop.
Highly interacting proteins of hosts are targeted by pathogens using molecular mimicry, which can be countered using new types of drugs. These studies also help us to repurpose existing drugs for treatment of new infectious diseases. Another interesting observation is that infectious diseases can be treated either by blocking the pathogen proteins, the host proteins, or the interactions between the two. Due to open questions about the structure and significance of host-pathogen networks, this remains an exciting area where a lot of new work is anticipated.
The emergence of machine learning
It is said that the human mind frequently imagines technology before it is made. Thus, the idea of a Frankenstein-like intelligence first floated into literature, and then condensed into reality. In today’s world, artificial intelligence (AI) pervades a number of familiar settings like the Google search engine, Netflix recommended watch list, Alexa/Siri speech recognition, etc.
Machine learning (ML) is one of the sub-domains of AI. It uses mathematical and statistical approaches to improve the outcome on a specific learning task. ML has been used to accomplish many specific scientific tasks through open access coding libraries and modules. We can now conceive a generalised computer program that can be trained on a large number of learning instances to carry out tasks of decision making.
Just like network biology, machine learning allows for the integration of many diverse data types. Taken together, both techniques can make powerful tools for drug discovery.
ML made uncomplicated
Imagine you were teaching a toddler how to recognise an egg. To do this, you would collect a number of eggs of different shapes, sizes and colours. This is our positive dataset. You would also collect objects that resemble eggs but are not eggs, for instance a round stone, an orange, an apple, and a ball. That is a negative dataset.
Taken together, the negative and positive sets are used for training the toddler, and together they form the training set. Shape, size and colour are the features based on which the egg can be recognised. Once the learning process has taken place, the toddler can easily go through a test or prediction set and separate eggs from non-eggs. This is essentially how an ML program works too.
In the ML approach, features are extracted from the training dataset and used to teach ML algorithms how to differentiate between positive and negative datasets. Repetitive rounds of improved training of the algorithm are carried out till the best results can be obtained. Next, the trained model is used to classify drug targets in an unknown set of molecules.
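The train-then-predict loop described above can be sketched in a few lines of plain Python. This is only an illustration using the egg analogy: the two features and their values are invented, and logistic regression trained by gradient descent is chosen here purely because it is short to write, not because it is the method used in the work described later.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def log_loss(weights, bias, X, y):
    """Average cross-entropy; lower means the model fits the labels better."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(weights, bias, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def train(X, y, rounds=500, learning_rate=1.0):
    """Repeated rounds of gradient descent, starting from all-zero parameters."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(rounds):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            grad_w = [g + error * v for g, v in zip(grad_w, xi)]
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias

# Hypothetical features, each scaled to 0..1: (ovalness, shell smoothness).
eggs = [(0.80, 0.90), (0.70, 0.80), (0.90, 0.70), (0.75, 0.85)]      # positive set
non_eggs = [(0.20, 0.30), (0.10, 0.40), (0.30, 0.10), (0.25, 0.20)]  # negative set
X = eggs + non_eggs
y = [1] * len(eggs) + [0] * len(non_eggs)

loss_before = log_loss([0.0, 0.0], 0.0, X, y)
weights, bias = train(X, y)
loss_after = log_loss(weights, bias, X, y)

# Classify two unseen objects with the trained model (1 = egg, 0 = not an egg).
unknown = [(0.85, 0.80), (0.15, 0.25)]
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in unknown]
print(predictions, round(loss_before, 3), round(loss_after, 3))
```

The drop in loss between the untrained and trained model is the numerical counterpart of the "repetitive rounds of improved training" in the text.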
ML: Decoding a bewildering array of models and algorithms
Many types of ML algorithms have been explored, each representing a different approach. Each of these algorithms shows different performance metrics for different problems. Some of these are:
- The Naïve Bayes classifier assumes each feature to be independent and works by calculation of the posterior or conditional probability of each outcome.
- Support vector machine (SVM) aims to construct an optimal hyperplane that can accurately separate two distinct classes of objects.
- Decision trees work by using the features of the dataset to produce a set of decision rules, each with a specific outcome.
- The random forest (RF) method uses an ensemble of decision trees, each built using randomly selected features, and the class predicted by the most trees is chosen.
- The K-Nearest Neighbour (KNN) method attempts to classify a data point based on the labels of the training points that are closest (most similar) to it.
- The logistic regression (LR) method is based on sigmoidal curve fitting to predict the probability of an object belonging to a certain class.
- Artificial Neural Network (ANN) is a popular ML technique inspired by the learning and decision-making capabilities of biological neurons.
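Two of the ideas in the list above are simple enough to sketch directly: KNN classification, and the majority vote that a random forest takes over its trees. The sketch below is illustrative only. The data points are invented, and the three "trees" are hand-written rules standing in for trees that a real random forest would learn from data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def forest_vote(trees, x):
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["target", "target", "target", "non-target", "non-target", "non-target"]

knn_a = knn_predict(points, labels, (1.1, 1.0))
knn_b = knn_predict(points, labels, (3.9, 4.1))

# Stand-in "trees": each is a fixed rule on the features, not a learned tree.
stand_in_trees = [
    lambda x: "target" if x[0] < 2.5 else "non-target",
    lambda x: "target" if x[1] < 2.5 else "non-target",
    lambda x: "target" if x[0] + x[1] < 4.0 else "non-target",
]

vote_a = forest_vote(stand_in_trees, (2.0, 1.0))  # all three rules agree
vote_b = forest_vote(stand_in_trees, (1.0, 3.5))  # rules disagree, two against one
print(knn_a, knn_b, vote_a, vote_b)
```

With an odd number of trees and two classes, the vote can never tie, which is one reason ensembles are often built that way.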
The performance of the ML model is determined by a number of factors including: a) the type, number and interdependence of the features used to train the algorithm; b) having an accurate and exhaustive dataset for training; and c) ensuring unbiasedness in the model.
Once the model is made, its performance can be measured by various metrics. A receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold is varied. Models with an Area Under the Curve (AUC) close to 1 have very good performance, while an AUC of 0.5 is no better than random guessing.
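As a rough illustration of these metrics, the sketch below computes the true and false positive rates at one threshold, and the AUC through its equivalent pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The labels and scores are made up for the example.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at one decision threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
    positives = sum(1 for l in labels if l == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / positives, fp / negatives

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical model scores: 1 = drug target, 0 = non-target.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

tpr, fpr = rates_at_threshold(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(round(tpr, 3), round(fpr, 3), round(auc, 3))
```

Sweeping the threshold from 1 down to 0 and plotting each (false positive rate, true positive rate) pair traces out the ROC curve itself.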
How to use ML to recognise new drug targets
In the past, there have been several efforts to use ML algorithms to distinguish between drug targets and non-targets.
We applied ML to the problem of recognising drug targets of the cardiac complications caused by microbial infections using host-pathogen PPI networks.
Initially, the positive training dataset was prepared with successful drug targets from the DrugBank database, producing a set of 2,652 host and 1,929 pathogen drug targets. The negative dataset was difficult to identify. So, a random set of proteins was chosen from UniProt as non-targets. Some of the proteins that are presently not drug targets could become so within the next few years. However, we reasoned that this was a very small fraction and would not impact our results significantly. To ensure that the negative data was as error-free as possible, five different negative datasets were tested.
Next, the features of the drug targets had to be selected. Apart from sequence features and post-translational modifications, descriptors of the physical and chemical nature of the targets were chosen. Structural and functional features were added. Then, network centrality features were extracted from extended host-pathogen networks of all the proteins in the training set. Overall, a set of 68 descriptors was computed for the pathogen proteins and 73 for the host proteins.
The information was then loaded into the Python environment. The features were encoded and scaled, and the ML models were built. A part of the training data was used to train the ML algorithm while another was used as a test set. The ML methods chosen were RF, SVM, LR and KNN. Then each method was optimised in multiple rounds.
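The preparation steps mentioned here (encoding, scaling and holding out a test set) can be sketched as follows. The text does not say which encoding, scaling method or split ratio was used, so one-hot encoding, z-score scaling and a 75/25 split are assumptions made for illustration. The protein records and feature names are invented too.

```python
import random
import statistics

def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicator features."""
    return [1.0 if value == c else 0.0 for c in categories]

def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the rows for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(rows) * test_fraction)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # avoid dividing by zero
    return means, stds

def scale(rows, means, stds):
    """Apply z-score scaling using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical records: (sequence length, degree centrality, cellular location).
records = [
    (350, 0.80, "membrane"), (420, 0.75, "cytoplasm"), (290, 0.90, "membrane"),
    (510, 0.70, "nucleus"), (150, 0.10, "cytoplasm"), (600, 0.05, "nucleus"),
    (220, 0.20, "cytoplasm"), (480, 0.15, "membrane"),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = known target, 0 = presumed non-target
locations = ["membrane", "cytoplasm", "nucleus"]

encoded = [[float(length), centrality] + one_hot(loc, locations)
           for length, centrality, loc in records]
X_train, X_test, y_train, y_test = train_test_split(encoded, labels)

means, stds = fit_scaler(X_train)  # statistics come from the training rows only
X_train_scaled = scale(X_train, means, stds)
X_test_scaled = scale(X_test, means, stds)
print(len(X_train_scaled), len(X_test_scaled), len(X_train_scaled[0]))
```

Fitting the scaler on the training rows alone keeps information about the test set from leaking into training, which would otherwise inflate the measured performance.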
New drug targets for microbial CVDs
A very good performance score of 0.99 was obtained using the RF model for both the host and pathogen training sets. Interestingly, the features most important for the high level of accuracy attained were those describing network centrality, or connectedness. This shows that combining network centrality with ML has led to a very accurate method for the selection of drug targets.
The next step was to test our models on an independent prediction set. This came from a previous work from our laboratory, MorCVD. The MorCVD database contains all the human proteins that can be potentially involved in heart disease. It also shows the experimental interactions of these proteins with pathogen proteins. As no specific drug targets are currently known for heart disease induced by microbes, this dataset was the perfect option for mining for drug targets – both microbial and human.
The MorCVD proteins were prepared as a dataset, and the best performing RF method was used to find the drug targets – both in different pathogens, as well as in the human host. The drug targets predicted with high confidence were further filtered to give a list of 331 host and 743 pathogen proteins. A description of our method was published with the full list of host and pathogen targets predicted using this method (Singh and Bhatnagar, 2022).
Prospecting for new drug targets for Covid-19 related heart disease
As mentioned before, Covid-19 has serious implications for pre-existing and ensuing heart disease. Increasingly frequent cases of heart failure, stroke, thrombosis and cardiac impairment are being seen in long Covid patients even a year after recovery.
At the molecular level, the experimental and predicted PPIs of SARS-CoV-2 with the human host became available in 2020-21. The RF model trained on all the known host and pathogen data in our work was then used to predict drug targets for Covid-19 induced cardiovascular complications. A provisional patent for the work has been filed by the Council of Scientific & Industrial Research, India. Work is also under way on developing drugs for one of the promising targets and on further laboratory testing.
The most frequent symptoms of Covid-19 are seen in the cardiovascular system, but also extend to the lungs, gastrointestinal tract and the neurological system. The ML method trained by us can be extended for finding the drug targets for any of the complications of Covid-19. It can also be used for many of the new and emerging infections.
ML and systems biology: potent tools for new drug discovery
When the super-intelligent Skynet system from the Terminator series gained self-awareness, it turned against humans. In direct contrast, AI today is being used to enhance human health and happiness across many frontiers. Combining ML with systems biology will bring enhancements to the understanding and practice of drug discovery and new drug design. It will help us identify novel ways to tackle existing and new diseases from diagnosis to treatment and management.
DDW Volume 24 – Issue 2, Spring 2023
About the author:
Sonika Bhatnagar is Professor and Head of the Department of Biological Sciences and Engineering, Netaji Subhas University of Technology Dwarka, New Delhi. Professor Bhatnagar has applied computational techniques to study pathogenesis and drug targets in cardiovascular and infectious diseases.