Machine learning needs Big Data to revolutionise drug discovery

Machine learning (ML) holds the promise to revolutionise drug discovery. By integrating vast datasets into predictive models, ML will help researchers arrive at the optimal candidate compounds more quickly, disrupting the long cycle times of preclinical discovery research. DNA-encoded library (DEL) technology is unmatched in its ability to produce ultra-large datasets. By harnessing the power of next-generation sequencing, DEL can produce hundreds of millions of chemical affinity data points in a single experiment. In this article, drug discovery expert Matthew Clark, CEO of X-Chem, covers the use of ML and DEL.

Whether we realise it or not, artificial intelligence (AI) is increasingly touching many aspects of our lives. AI allows the distillation of vast collections of data into predictive tools that can accelerate and enhance numerous processes in science, business and everyday life. Algorithms based on AI help us translate documents without the need for a human translator. They help social media companies target us with content and advertising that are relevant to our desires and interests. They can beat grandmasters at games like chess and Go. They can even predict the folded shape of novel proteins.

What most AI approaches have in common is their reliance on a large collection of source data, sometimes referred to as a corpus. The corpus forms the training set that “teaches” the AI algorithm the patterns of the system or domain in question. In the realm of translation, a corpus could be millions of pages of text translated into two or more languages. Such a training set lets the AI learn the correspondence of the grammar and vocabulary of one language with those of another. For social media companies, a corpus could be the billions of clicks and likes of its many users. For games like Go, thousands of archived games could form a corpus. In the area of protein folding, it is the thousands of experimentally derived protein structures solved over the last 50 years. Since these structures contain both protein shape and protein primary sequence, the AI learns the correspondence between the two.  In each of these cases, we can see that an AI requires copious amounts of real-world data as the foundation of its learning. The AI then applies its learning to novel situations, like an untranslated text or a new protein. In areas where such vast corpora do not exist, AI will have less predictive power. If a data collection is dirty, biased, sparse or otherwise inconsistent, its derived AI models will be less effective. Like most things in science and technology, AI is a “garbage in, garbage out” paradigm. 

One field where the hopes for AI are especially high is pharmaceutical discovery. Developing a new drug is an incredibly long and expensive process (>$1B) and it is plagued by high attrition rates. It can require the preparation and testing of thousands of compounds to arrive at one suitable for clinical study, and even then, it may only have about a 10% chance of ending up as an approved and marketed drug. Anything that could decrease the time and expense of pharmaceutical discovery would have a huge impact on healthcare costs and patients’ lives. It is not surprising, therefore, that pharmas and biotechs have eagerly embraced AI as a means to improve the productivity of drug discovery. Most pharmas have invested in some kind of AI initiative or capability. One example is the GSK-led public/private ATOM Consortium, which is aimed at accelerating drug development using AI approaches. Additionally, small- and medium-sized entities such as Exscientia, Recursion and BenevolentAI have attracted investment by building AI capabilities to support drug discovery and applying those capabilities to an internal pipeline of discovery programs. All three have compounds in clinical development, with Exscientia claiming the first AI-derived compound to enter human trials. 

While the prospects for AI in drug discovery appear bright, there is one challenge that continues to stifle the approach: the difficulty of assembling a high-quality corpus of the appropriate data. Unlike Facebook user data or Protein Data Bank structures, drug discovery data can be wildly variable in terms of accessibility, quality, consistency, relevance and scope. Putting aside that most drug discovery data sit behind locked doors at pharmaceutical companies, the data that are widely available are of relatively limited utility. Variables as simple as the formatting of data values can make it difficult to combine information from different sources. Varying assay formats can lead to values that cannot be compared like-for-like. While data for a relatively simple measurement, like biochemical activity, may be clean and comprehensive, other important data points, like cellular response and in vitro ADME data, may be much sparser. Even when enough data exist to assemble a decent-sized corpus, lack of reproducibility in the assays could make the data much dirtier than they first appear. Such data are assembled over the course of years, in different locations, with different operators, often using different instruments or reagents. It should be no surprise that such data require extensive curating, cleaning and winnowing before they can serve as a training set for AI.
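To make the formatting problem concrete, here is a minimal sketch of harmonising potency values that different sources report in different units. It assumes the values are IC50 measurements and converts them to pIC50 (the negative base-10 logarithm of the molar IC50). The record layout, compound names, unit strings and the `to_pic50` helper are all hypothetical, chosen only for illustration.

```python
import math

# Conversion factors to molar; the unit spellings are illustrative assumptions.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def to_pic50(value, unit):
    """Convert an IC50 in the given unit to pIC50 = -log10(IC50 in molar)."""
    if value <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])

# Hypothetical records: the first two describe the same potency in different units.
records = [
    {"compound": "cmpd_1", "ic50": 50.0, "unit": "nM"},
    {"compound": "cmpd_1", "ic50": 0.05, "unit": "uM"},
    {"compound": "cmpd_2", "ic50": 2.0, "unit": "uM"},
]

# After conversion every value sits on one scale and can be compared directly.
harmonised = [
    (r["compound"], round(to_pic50(r["ic50"], r["unit"]), 2)) for r in records
]
```

Unit conversion of this kind only addresses formatting. It does nothing about differences in assay format or reproducibility, which is why curation of historical data remains laborious.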

For this reason, a handful of companies have turned to an innovative technique that can generate billions of data points in a matter of weeks, all under a single experimental condition. That technique is DNA-encoded library (DEL) technology. In DEL, millions or even billions of compounds are assembled, each concomitantly attached to a unique piece of encoding DNA. Since each compound is labelled by a unique DNA sequence, the compounds can be pooled together and screened simultaneously. And because the chemical structure is represented in the DNA’s sequence, the output of a screening experiment is read out by high-throughput DNA sequencing, which can generate hundreds of millions of sequence reads in a single run. DEL technology, therefore, overcomes many of the data quality challenges cited above. The compounds are screened in a single experiment in a single condition by a single operator, so extensive data cleaning is no longer required.
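The readout described above can be sketched in a few lines: sequence reads are mapped back to compounds via their DNA barcodes, counted, and compared against a reference. This is a toy illustration, not any provider's pipeline. The four-letter barcodes, the compound names, the no-target control selection and the pseudocount are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical barcode-to-compound lookup; real DEL tags are longer and
# typically encode each building block separately.
BARCODE_TO_COMPOUND = {
    "ACGT": "compound_A",
    "TTGA": "compound_B",
    "GGCC": "compound_C",
}

def count_compounds(reads):
    """Count how often each known barcode appears in a list of reads."""
    counts = Counter()
    for read in reads:
        compound = BARCODE_TO_COMPOUND.get(read)
        if compound is not None:  # skip reads that match no known barcode
            counts[compound] += 1
    return counts

def enrichment(selected, control, pseudocount=1.0):
    """Ratio of each compound's read frequency in the target selection to its
    frequency in the control; the pseudocount avoids division by zero."""
    n = len(BARCODE_TO_COMPOUND)
    total_sel = sum(selected.values()) + pseudocount * n
    total_ctl = sum(control.values()) + pseudocount * n
    result = {}
    for compound in BARCODE_TO_COMPOUND.values():
        f_sel = (selected[compound] + pseudocount) / total_sel
        f_ctl = (control[compound] + pseudocount) / total_ctl
        result[compound] = f_sel / f_ctl
    return result

# Toy reads: one selection against the target, one assumed no-target control.
target_reads = ["ACGT"] * 8 + ["TTGA"] + ["GGCC"] + ["NNNN"]
control_reads = ["ACGT"] * 3 + ["TTGA"] * 4 + ["GGCC"] * 3

selected = count_compounds(target_reads)
control = count_compounds(control_reads)
scores = enrichment(selected, control)
top_hit = max(scores, key=scores.get)  # compound most enriched on the target
```

Because every compound is counted in the same pooled experiment, the resulting enrichment values share one set of conditions, which is the property that makes such data attractive as a training set.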

DEL technology was invented in the mid-2000s, but only in the last five years or so has it broadly penetrated the biopharma sector. A number of pharmas operate an internal DEL platform, while biotechs are able to access DEL through the handful of providers who offer DEL services. The leading provider of DEL services is the Waltham, MA-based X-Chem, Inc. In 2020, X-Chem was the first group to demonstrate the power of using DEL data as the corpus for a drug discovery AI. In a paper published in the Journal of Medicinal Chemistry, X-Chem (and its collaborators at Google) used DEL data to build a predictive binding model across three therapeutically relevant protein targets. The models were then used to score novel compounds for their likelihood to bind to the targets in question. When the candidate compounds were tested in the lab, it was revealed that the AI models had strong predictive power, with confirmation rates ranging from 15% to 75%.  

In the 18 months since X-Chem published the DEL+AI paper, the industry has seen an accelerating embrace of DEL by companies who want to accelerate drug discovery with AI. The paper demonstrated that DEL-enabled AI generates models with the power to accurately predict binding of unknown compounds. Such models are useful for identifying starting points for medicinal chemistry, but their real potential value lies in the later stages of drug discovery. By integrating the DEL-derived binding models with other data streams, such as ADME and selectivity data, AI could create a holistic candidate selection model. Such a model could accelerate the difficult multi-parameter optimisation process of medicinal chemistry, arriving at clinic-ready compounds more quickly.  
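One simple way to picture a holistic candidate selection model is a composite score that combines per-property predictions. The sketch below uses a weighted geometric mean, a technique chosen here for illustration rather than one described in the paper. The property names, weights and candidate scores are hypothetical, and each score is assumed to be scaled to the interval (0, 1].

```python
import math

# Hypothetical weights; a real project would set these to its own priorities.
WEIGHTS = {"binding": 0.5, "adme": 0.3, "selectivity": 0.2}

def composite_score(scores, weights=WEIGHTS):
    """Weighted geometric mean of per-property scores, each in (0, 1]."""
    total_weight = sum(weights.values())
    log_sum = 0.0
    for name, weight in weights.items():
        value = scores[name]
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} score must be in (0, 1], got {value}")
        log_sum += weight * math.log(value)
    return math.exp(log_sum / total_weight)

# Hypothetical predictions: cmpd_X binds well but has a poor ADME score,
# while cmpd_Y is moderate on every property.
candidates = {
    "cmpd_X": {"binding": 0.9, "adme": 0.2, "selectivity": 0.8},
    "cmpd_Y": {"binding": 0.7, "adme": 0.7, "selectivity": 0.7},
}

ranked = sorted(
    candidates, key=lambda c: composite_score(candidates[c]), reverse=True
)
```

A geometric mean is used rather than an arithmetic one because a single weak property drags the overall score down sharply, which mirrors the way one liability can sink an otherwise potent compound. Here the balanced cmpd_Y ranks above the more potent cmpd_X.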

With this vision in mind, several companies have sought to couple DEL capability with AI to accelerate their internal discovery efforts. One such company is Insitro. Led by AI pioneer Daphne Koller, Insitro acquired the DEL technology company Haystack in October 2020. As stated in Insitro’s press release, “…DEL technology is uniquely aligned with Insitro’s philosophy of addressing the critical challenges in pharmaceutical R&D through predictive machine learning models, all enabled by producing quality data at scale.” Another company combining DEL with AI is Valo Health. Valo’s OPAL platform is “a unique ‘closed-loop’ active learning, self-reinforcing, in silico and in lab-experimental platform” aimed at accelerating the development of new drugs. Valo operates a DEL platform at its Lexington, MA, lab, presumably to feed ultra-large DEL datasets into OPAL. Anagenex is another player in this space. Their website states that “ML [machine learning] oriented DELs generate enormous amounts of useful data enabling us to identify better compounds faster.” Finally, in April 2021, Relay Therapeutics announced the acquisition of ZebiAI. The accompanying press release asserted that ZebiAI applies “massive experimental DNA encoded library data sets to power machine learning for drug discovery (ML-DEL).” 

It is clear from these examples that the DEL+AI field is continuing to grow and attract interest and investment. What remains to be seen is whether these efforts will bear fruit in the form of clinical candidates. The DEL platforms cited above do not have the deep track record of the more mature DEL providers. While the X-Chem DEL+AI paper serves as the touchstone for these efforts, it is worth noting that no other successful DEL-driven model generation has been reported in the ensuing 18 months. It could be the case that only mature DEL platforms, bolstered by long-term investment in library preparation and screening experience, can generate the kinds of datasets that truly enable AI.

With the recent acquisition of leading AI provider Glamorous AI, X-Chem has positioned itself as the only drug discovery service provider to bring the power of DEL+AI to the biopharma industry. In conjunction with its additional investments in synthetic and medicinal chemistry, X-Chem is unique in its ability to go beyond using AI to generate starting points for lead discovery. Rather, by integrating the DEL-based binding models with proprietary ADME prediction tools invented by Glamorous AI, X-Chem can accelerate hit-to-lead and lead optimisation medicinal chemistry, ultimately helping customers arrive at clinical candidates more quickly, with AI integration throughout the process. This approach will improve the efficiency of moving candidates from discovery to market, a major current challenge in the industry.

The potential of AI-accelerated drug discovery is currently at an all-time high. Making good on that promise will require the biggest and cleanest datasets that wet science can produce. DEL practitioners are thrilled that DEL technologies have an important role to play in advancing this exciting field. We look forward to hearing of, and reporting on, the successful application of these techniques in the near future, so that medicines can get to patients even faster. 

About the author

Matthew Clark was part of X-Chem’s founding team and served as VP of chemistry and SVP of research prior to his appointment as CEO. Before joining X-Chem, Clark was director of chemistry at GlaxoSmithKline, where he led the group responsible for design and synthesis of early-iteration DELs. He began his professional career at Praecis Pharmaceuticals. He received his B.S. in Biochemistry from the University of California, San Diego, holds a Ph.D. in Chemistry from Cornell University and conducted postdoctoral studies at the Massachusetts Institute of Technology.
