AI: Advancing drug discovery with ultra-large library processing


Rob Scoffin, CEO, and Matthew Habgood, Principal Computational Chemistry Developer at Cresset, discuss the role of AI/ML in advancing screening methods by utilising virtual screening (VS) approaches to process compound libraries on an ultra-large scale.

Drug discovery is a complex process that relies on designing and filtering potential drug candidates through a funnel until a single drug compound remains. Traditionally, libraries of compounds are synthesised and screened against a druggable disease target to identify molecules that cause the desired effect, such as inhibiting or disrupting target functionality. Typical library sizes would be tens of thousands to low millions of compounds (for the very largest companies). However, this process is quasi-random and has a very low hit rate, thereby requiring ever larger libraries of compounds and more expensive screening resources to find a suitable starting point for a project [1].

Computer-aided drug discovery (CADD) makes this more efficient, using in silico screening to filter out compounds before they are screened in vitro. Screening methods are sometimes categorised into ligand-based methods, such as structural, shape, or field similarity, and target structure-based methods, such as docking, static binding energy calculation, and more recently free energy-based predictions of binding affinity such as Free Energy Perturbation (FEP). These methods offer many advantages, both in combination with and as a partial replacement for the ‘wet’ experiments; they also open up the chemical ‘landscape’ to tens of millions of structures.

With validated applications of artificial intelligence (AI) and machine learning (ML) on the rise, the drug discovery process is changing. By incorporating AI/ML methods into the early stages of the design-make-test-analyse (DMTA) cycle, billions of virtual compounds can be screened against a disease target. Applications of AI/ML in this respect allow for the identification of potential drug molecules based on ultrafast approximations to docking scores or binding energies. The pattern recognition abilities of AI/ML can also help to identify molecules that may have been missed using traditional approaches. The most dramatic examples of the approach in action utilise a positive feedback loop: the AI tools are used to generate a new population of molecules and predict their activities, a selection of the compounds is then synthesised and tested in vitro, and the results are fed back into the next AI generation process.

The promise of AI in drug discovery was recently highlighted, with the FDA granting orphan drug designation to a compound designed using AI for treating idiopathic pulmonary fibrosis [2]. Currently, 73 drug candidates from biotechs using an AI-first approach have entered the clinical trial stage [2].

The move towards virtual screening methods

Once a disease target is identified and validated, screening methods are adopted to identify molecules that exhibit the desired activity on the target, each referred to as a ‘hit’. Traditionally, a tailored and optimised assay (biochemical, biophysical, or cell-based) is used to screen compound libraries against the disease target, where an observed desired change (e.g. colour, fluorescence, downstream signalling) signifies a ‘hit’. Relying heavily on design, optimisation, and manual input, these methods are resource- and labour-intensive, and the insight gained is limited by the number of screens that can be run within budget constraints.

Utilising advances in automated workflows, high throughput screening (HTS) methods allow for accelerated processing of tens of thousands to millions of potential drug candidates per day [4]. This significantly increases the scale of compounds handled and reduces the need for manual input. However, these libraries still need to be synthesised or sourced commercially, and at a typical cost of $50 per compound, assembling a screening library is costly: a library of one million compounds would cost around $50 million.

Advances in computational techniques have allowed screening methods to move away from their reliance on ‘wet lab’ resources to in silico alternatives. So-called virtual screening (VS) was first proposed in the mid-1970s [5]. For the first time, scientists could harness 3D structures of target active sites to design drugs. Coupled with a growing understanding of structure-activity relationships (SARs) and advances in protein structure determination (e.g. crystallography or cryogenic electron microscopy), the use of VS methods is now common practice. However, there are still shortcomings and issues that create opportunities for improvement.

The integration of AI/ML approaches into VS methods, allied to the availability of ultra-large databases, leads to the accelerated processing of billions of virtual compounds. The enlarged accessible chemical space promises better starting points for a discovery project, meaning improved compound properties as well as chemical novelty for improved patentability.

Integrating AI/ML into virtual screening

One approach to VS is through structure-based drug design. Two of the most important techniques are:

  • Docking: Docking algorithms provide a rapid estimate of ligand binding stability by placing possible ligand conformations into a protein active site and then determining a corresponding score using a variety of methods. For example, by estimating the component energy contributions to the binding, a scoring function can estimate the binding affinity of a ligand to the target protein. A compound with a higher affinity, where the ligand is more likely to impact target activity, is indicated by stronger binding and, therefore, a more favourable docking score (whether that means a higher or a lower number depends on the scoring convention of the docking program).
  • Free energy perturbation (FEP): An FEP calculation, which proceeds through a series of detailed molecular simulations, allows for either the direct calculation of binding affinity or a relative estimate of the same, with near-experimental accuracy.
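
For readers who want the relation behind the ‘relative estimate’ mentioned above, a standard textbook formulation (not taken from the authors' own workflow) is the thermodynamic cycle for two ligands A and B, together with the Zwanzig free energy perturbation identity. The symbols are introduced here for illustration only: U_A and U_B are the potential energies of the two end states, k_B is Boltzmann's constant, T is the temperature, and the angle brackets denote an ensemble average over configurations sampled in state A.

```latex
\Delta\Delta G_{\mathrm{bind}}(A \to B)
  = \Delta G_{\mathrm{bind}}(B) - \Delta G_{\mathrm{bind}}(A)
  = \Delta G_{\mathrm{complex}}(A \to B) - \Delta G_{\mathrm{solvent}}(A \to B)

\Delta G(A \to B)
  = -k_{\mathrm{B}} T \,
    \ln \left\langle \exp\!\left( -\frac{U_{B} - U_{A}}{k_{\mathrm{B}} T} \right) \right\rangle_{A}
```

In practice the transformation from A to B is usually split into a series of intermediate states, each simulated separately, which corresponds to the series of molecular simulations described in the bullet above.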

These methods can be combined in a two-step approach, with docking used as an initial screen that gives an approximate answer, and FEP then used to give a much more accurate prediction for the most promising candidates. This is commonly used in traditional CADD workflows. However, by incorporating AI/ML approaches, it is possible to massively increase the speed and utility of the approaches. The overall structure of this approach is shown in schematic form in Figure 1.

Figure 1: Structure of AI/ML approaches to drug discovery.

The process starts with the generation of a large set of pseudo-random molecules using generative AI or library-building approaches. The molecules are docked to the protein target of interest, and then the top-scoring compounds are fed into the FEP calculation. The best compounds from there are synthesised and tested, and the results are then fed into an ML process. This builds a model based on structural descriptors, docking scores, and FEP energies against the measured activities. The process is then looped: further compounds, similar to the already identified ‘best’, are generated and fed back through the process.
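
The cycle described above can be sketched as a toy loop. This is an illustration under stated assumptions, not the authors' implementation: molecules are reduced to a single number in [0, 1], and `mock_dock`, `mock_fep`, `mock_assay`, `TARGET`, and every parameter name are hypothetical stand-ins for docking software, FEP simulations, and wet-lab assays. The ‘ML model’ is reduced to a one-variable least-squares fit of measured activity against FEP score, and how a real workflow would use the model to steer generation is left open.

```python
import random

TARGET = 0.8  # hypothetical optimum of the toy 1-D descriptor


def generate(rng, n, seeds=None, spread=0.05):
    """Propose n candidates in [0, 1]; sample near the seeds when provided."""
    if not seeds:
        return [rng.random() for _ in range(n)]
    return [min(1.0, max(0.0, rng.choice(seeds) + rng.gauss(0.0, spread)))
            for _ in range(n)]


def mock_dock(x):
    """Cheap, coarse score (higher = better): resolves x to one decimal only."""
    return -abs(round(x, 1) - TARGET)


def mock_fep(x):
    """Costlier, finer score (higher = better)."""
    return -abs(x - TARGET)


def mock_assay(x, rng):
    """Stand-in for synthesis and in vitro testing, with measurement noise."""
    return -abs(x - TARGET) + rng.gauss(0.0, 0.02)


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; the toy 'ML model'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def discovery_loop(n_cycles=4, n_generated=500, n_fep=50, n_tested=10, seed=0):
    rng = random.Random(seed)
    measured = []  # (descriptor, FEP score, measured activity)
    seeds = None
    model = None
    for _ in range(n_cycles):
        pool = generate(rng, n_generated, seeds)
        # Step 1: dock everything, keep the top scorers for FEP.
        top_docked = sorted(pool, key=mock_dock, reverse=True)[:n_fep]
        # Step 2: run FEP on the survivors, keep the best for synthesis.
        top_fep = sorted(top_docked, key=mock_fep, reverse=True)[:n_tested]
        # Step 3: 'synthesise and test', then refit the model on all data.
        measured += [(x, mock_fep(x), mock_assay(x, rng)) for x in top_fep]
        model = fit_line([m[1] for m in measured], [m[2] for m in measured])
        # Step 4: seed the next generation with the best measured compounds.
        best = sorted(measured, key=lambda m: m[2], reverse=True)[:n_tested]
        seeds = [m[0] for m in best]
    return measured, model
```

Calling `discovery_loop()` returns every tested compound together with the final fitted slope and intercept. The funnel sizes (500 docked, 50 passed to FEP, 10 tested per cycle) are arbitrary choices that mirror the narrowing described in the text.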

There are, of course, issues with this novel method – the AI models are best suited to interpolation of existing chemical space, rather than extrapolation to novel areas. However, where this is an issue, additional compounds can be manually ‘injected’ into the workflow to provide a broader domain of applicability.

Overcoming challenges with innovative AI/ML methods

Processing ultra-large virtual libraries (ULVLs) with VS methods can be computationally resource-intensive. Although VS is significantly faster than experimental approaches, run time still grows with the number of compounds processed. This challenge can be addressed using deep learning (DL), a powerful AI method which utilises complex neural networks to build predictive models based on a set of training data. For ULVLs, this could be used in the development of a Deep Docking (DD) platform. This is an iterative method where only a subset of a chemical library is docked, and the remaining docking scores are predicted by the deep learning model. This helps to overcome the limitation of conventional docking, which often relies on extensive computational resources. This DL method allows for accurate ULVL screening while also reducing costs [6,7].
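
The iterative idea described above (dock a subset, predict the rest, discard the weakest predictions, repeat) can be sketched as follows. This is a minimal illustration, not the published Deep Docking protocol: the surrogate is a k-nearest-neighbour average rather than a deep neural network, purely to keep the example dependency-free, molecules are hypothetical 2-D descriptor pairs, and `mock_dock` is an invented stand-in for a real docking run in which a higher score is assumed to be better.

```python
import random


def mock_dock(descriptor):
    """Hypothetical stand-in for an expensive docking run (higher = better)."""
    x, y = descriptor
    return -((x - 0.7) ** 2) - (y - 0.2) ** 2


def knn_predict(train, descriptor, k=5):
    """Predict a score as the mean score of the k nearest docked molecules."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - descriptor[0]) ** 2
        + (item[0][1] - descriptor[1]) ** 2,
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)


def surrogate_screen(library, n_rounds=3, sample_size=50,
                     keep_fraction=0.5, seed=0):
    rng = random.Random(seed)
    candidates = list(range(len(library)))  # indices still in play
    docked = {}  # index -> docking score actually computed
    for _ in range(n_rounds):
        # Dock only a small random sample of the remaining candidates.
        undocked = [i for i in candidates if i not in docked]
        for i in rng.sample(undocked, min(sample_size, len(undocked))):
            docked[i] = mock_dock(library[i])
        # Train data for the surrogate: everything docked so far.
        train = [(library[i], s) for i, s in docked.items()]
        # Use the real score where known, the predicted score otherwise.
        predicted = {
            i: docked[i] if i in docked else knn_predict(train, library[i])
            for i in candidates
        }
        # Keep the best-predicted fraction and discard the rest undocked.
        candidates.sort(key=predicted.get, reverse=True)
        candidates = candidates[: max(1, int(len(candidates) * keep_fraction))]
    return candidates, docked
```

With a 2,000-molecule library and the default settings, only 150 molecules are ever docked, while the library is narrowed to 250 survivors that can then be docked conventionally. The saving comes from replacing most docking runs with cheap predictions, which is the trade-off the text describes.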

AI methods are also highly useful in other aspects of a discovery project. For instance, a different application of deep learning to virtual screening comes from the need to access a 3D structure of the protein target. Although over 100,000 protein structures can be accessed via freely available databases, not all target proteins of interest have been, or can be, elucidated [8]. An alternative approach is to build a DL model of protein structure based on the available databases, and then use this model to predict unknown structures, starting from their amino acid sequence. These systems can reproduce protein crystal structures with near-experimental accuracy, and whilst not perfect, they provide a starting point where none could be provided experimentally [9].

Looking ahead with AI-driven VS

Integration of AI/ML into VS platforms has already led to considerable improvements, scaling VS to process ULVLs and surface compounds that have a high potential to bind to the target. With continuous advancement in structure prediction and DL methods, drug discovery will continue to be streamlined, making AI/ML-driven VS an essential addition to the drug discovery toolbox.


  1. Singh N, Vayer P, Tanwar S, et al. Drug discovery and development: Introduction to the general public and patient groups. Frontiers in Drug Discovery 2023;3:1201419.
  2. Insilico Gains FDA’s First Orphan Drug Designation for AI Candidate.
  3. Unlocking the Potential of AI in Drug Discovery.
  4. Lundblad RL. Drug Design. Elsevier eBooks 2023;182-192.
  5. Shoichet BK. Virtual screening of chemical libraries. Nature 2004;432(7019):862.
  6. Gentile F, Yaacoub JC, Gleave J, et al. Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking. Nature Protocols 2022;17(3):672-697.
  7. Cherkasov A, Ban F, Li Y, et al. Progressive Docking: A Hybrid QSAR/Docking Approach for Accelerating In Silico High Throughput Screening. Journal of Medicinal Chemistry 2006;49(25):7466-7478.
  8. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research 2019;47:D520–D528.
  9. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596(7873):583-589.

About the authors

Matthew Habgood has a PhD from the University of Oxford and has 20 years’ experience as a computational chemist and scientific modeller. He develops, sources, and evaluates new computational techniques for Cresset’s software.


Dr Robert Scoffin is an expert in the fields of molecular modeling and cheminformatics. His DPhil is in Chemistry from the University of Oxford. Scoffin joined Cresset as CEO in 2010 and now also serves as Chairman. He also serves as co-Chairman for Torx Software, a collaboration between Cresset and Elixir Software.

