Combining chemists’ expertise and a computer’s advanced capabilities to generate good ideas.

By Dr Matthew Segall, Edmund Champness, Dr Chris Leeding, Dr Ryan Lilien, Dr Ramgopal Mettu and Dr Brian Stevens

One of the defining challenges of drug discovery is the need to make complex decisions regarding the design and selection of potential drug molecules based on a relative scarcity of experimental data.

A high quality lead or drug candidate requires a balance of many properties, including potency, selectivity, absorption, distribution, metabolism, elimination (ADME) and safety. Synthesising compounds and generating experimental data, even using modern high-throughput methods, is time-consuming and expensive.

Therefore, the opportunity to explore new compound ideas has been limited. An experienced medicinal chemist can easily generate enough ideas to keep a team of synthetic chemists and biologists busy and each idea must be carefully considered. In this scenario, the risk is that opportunities to quickly identify high quality compounds may be missed, as the tendency to quickly focus on a relatively small range of chemical diversity prevents a broad search of chemical possibilities.

The emergence of predictive in silico models of the properties of potential drug compounds offers the ability to quickly and inexpensively generate vast quantities of predicted data on large numbers of compounds (1).

Furthermore, modern ‘multi-parameter optimisation’ methods make it possible to integrate this information and assess a large number of compound ideas against the ideal profile of properties required in a high quality lead or candidate drug (2,3). In this new scenario, the limitation becomes the time and experience necessary to generate a wide diversity of compound ideas and manually enter these into a computer.

This article discusses an approach to overcome the relative scarcity of ideas in the case where it is easy to assess potential new compounds using predictive methods. The approach automatically generates chemically relevant compound ideas, assesses them against a project’s requirements and prioritises the ideas for detailed consideration by an expert. In this way, an optimal combination of the strengths of a computer and an expert user can be achieved.

The computer’s ability to analyse and prioritise a large number of chemical possibilities complements the fact that an expert cannot possibly examine all generated structures individually. At the same time, the expert’s ability to define and manage the project requirements ensures that the computational model explores the right regions of chemical space. The goal is to stimulate the creative process in hit-to-lead and lead optimisation, not necessarily to automatically find the final, optimal molecule.

By helping to consider a wide diversity of possible chemical strategies, this approach can also help to mitigate the risks of inherent biases in the way that people make decisions about potential courses of action (4). In particular, one common bias, called ‘confirmation bias’, reflects the tendency of people to focus on experiments that will tend to confirm, rather than challenge, their existing hypothesis.

In the context of drug discovery, this can lead to premature narrowing of the scope of exploration, with the potential to miss valuable opportunities to find high quality compounds. Automatic generation and prioritisation of new compound ideas can highlight alternative strategies and help to ‘think outside the box’.

The next section highlights an approach to generating relevant compound ideas that are interesting and acceptable to medicinal chemists. The following sections then consider how in silico models can be applied effectively in this scenario, despite the inherent uncertainties in the data generated by computational methods, and how all of this data can be brought together to prioritise the compound ideas.

Finally, the article will describe an illustrative example of the application of these methods to explore chemistry which is based on the lead compound that ultimately gave rise to the marketed serotonin reuptake inhibitor Duloxetine, before drawing some conclusions.

Generating relevant compound ideas

A successful method to generate compound ideas must satisfy a number of requirements:

– It must generate a wide diversity of chemistry, as the objective is to explore many ideas in the search for an optimal solution.

– The compound structures generated must be relevant. In particular, the number of ‘nonsensical’, eg chemically unstable or infeasible, compounds must be kept to a minimum.

– The user must be able to control the generation process; for example by specifying a group or template that must remain present or by limiting the breadth of search.

– The ideas generated must tend towards ‘druglike’ compounds.

Early approaches to computational generation of new compound structures, described under the term ‘de novo design’ (5), commonly worked by ‘growing’ a small fragment known to weakly bind to a biological target or linking two or more fragments. The newly generated molecules were chosen to fit a model of the binding pocket of the target, forming multiple interactions and hopefully resulting in increased binding efficiency.

The success of these methods was limited by the fact that the molecules proposed were often chemically infeasible or did not have sufficiently ‘druglike’ physicochemical and ADME properties. These limitations could, to some extent, be addressed by post-filtering to remove inappropriate compounds (6).

An alternative approach that helps to meet the requirements above was pioneered by a package called ‘Drug Guru’ (drug generation using rules), developed by a team at Abbott Laboratories (7), and has also been applied in other platforms such as Pareto Ligand Designer (8) and StarDrop™ (9). This approach works by applying a set of medicinal chemistry ‘transformation rules’ to an initial ‘parent’ molecule to generate related ‘child’ structures.

These transformations are based on collective medicinal chemistry experience. Examples range from simple substitutions or functional group replacements to more dramatic modifications of the molecular framework, such as ring opening or closing. This approach ensures that a high proportion of the compound structures generated are relevant (typically 90%-95% are acceptable to medicinal chemists), while still encoding a wide range of different chemistries.

The transformations do not have to correspond to specific chemical reactions or synthetic routes; they are intended to describe changes to molecules that a medicinal chemist might consider in the course of an optimisation project. A single transformation might require multiple synthetic steps or the synthesis of new building blocks. However, the transformations are typically not major rearrangements – they are relatively feasible moves in chemical space.

Applying many transformations iteratively to generate multiple ‘generations’ of compound ideas can result in very large numbers of molecules. Therefore, it is important to allow the user to exert some control on the generation process. For example, it may be desirable to specify a region of the parent compound that must not be modified, to limit the number and types of transformations that are applied or to specify a property criterion against which to select a subset of the compounds in each generation to control the growth of compound numbers. An example of such a workflow is shown in Figure 1.

Figure 1 Illustration of workflow to initiate the generation of new compound structures, as implemented in StarDrop
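The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the StarDrop implementation: compounds are represented as toy tuples of substituent tags rather than real structures, and `add_group`, `generate`, `toy_score` and the `keep_fraction` parameter are hypothetical names introduced here. The sketch covers only one of the controls mentioned in the text, namely selecting a top-scoring subset of each generation to seed the next.

```python
from typing import Callable, Iterable, List, Tuple

# Toy stand-in for a structure: a sorted tuple of substituent tags (not real chemistry).
Compound = Tuple[str, ...]
Transformation = Callable[[Compound], Iterable[Compound]]


def add_group(tag: str) -> Transformation:
    """Hypothetical transformation rule: attach one substituent tag to the parent."""
    def rule(mol: Compound) -> List[Compound]:
        return [tuple(sorted(mol + (tag,)))]
    return rule


def generate(parent: Compound,
             rules: List[Transformation],
             score: Callable[[Compound], float],
             generations: int = 3,
             keep_fraction: float = 0.1) -> List[Tuple[int, Compound]]:
    """Apply every rule to each seed compound, one generation at a time.

    Only the top-scoring fraction of each generation seeds the next one,
    which keeps the number of compound ideas from growing out of control.
    """
    seen = {parent}          # avoid proposing the same idea twice
    seeds = [parent]
    ideas: List[Tuple[int, Compound]] = []
    for gen in range(1, generations + 1):
        children: List[Compound] = []
        for mol in seeds:
            for rule in rules:
                for child in rule(mol):
                    if child not in seen:
                        seen.add(child)
                        children.append(child)
        ideas.extend((gen, child) for child in children)
        if not children:
            break
        n_keep = max(1, round(len(children) * keep_fraction))
        seeds = sorted(children, key=score, reverse=True)[:n_keep]
    return ideas


# Toy usage: three rules and a placeholder score that simply counts "OH" tags.
rules = [add_group("F"), add_group("Me"), add_group("OH")]
toy_score = lambda mol: mol.count("OH")
ideas = generate((), rules, toy_score, generations=2, keep_fraction=0.1)
for gen, mol in ideas:
    print(gen, mol)
```

In a real setting the score function would come from predictive models and the rules from a curated set of medicinal chemistry transformations; the pruning step is what makes several generations tractable.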

How much can we trust predictive models?

In order to prioritise the large number of generated compound ideas and understand which are most likely to have appropriate properties, it is important to use in silico predictive models of the key properties. However, it is reasonable to ask how reliable the predictions from these models are. All in silico models have a high degree of statistical uncertainty in the values they predict and limitations to the range of chemistries to which they are applicable.

It is important that models explicitly indicate the uncertainties in their predictions and that these are taken into account to understand when a user can confidently distinguish between compounds based on these predictions.

It is not yet possible to design or select a specific molecule in a computer, confident that it will have the properties required when synthesised and tested. Instead, it is vital to use the information provided by predictive models to focus on the compounds with the highest likelihood of success and bias the odds of finding a high quality compound in our favour (10).

Given the uncertainty in the assessment of the quality of compound ideas, it is also important to explore a range of diverse compounds. It doesn’t make sense to focus too heavily on one group of closely related compounds, as it is possible that they may all fail for a common reason.

Wherever possible, users should look for a balance of quality with chemical diversity when choosing compounds, to mitigate risk, validate the predicted hypothesis and better understand the relationship between the compound structures and their properties. Automatic generation of compound ideas helps to ‘look outside the box’ and expand the search for a diverse range of options for further investigation.
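One simple way to balance quality with diversity when choosing compounds is a greedy selection that walks down the ranked list and skips any compound that is too similar to one already chosen. The sketch below is an assumption made for illustration, not a method described in this article: compounds are reduced to hypothetical sets of integer feature identifiers, similarity is the Tanimoto coefficient on those sets, and the names `tanimoto`, `select_diverse` and `max_similarity` are invented here.

```python
from typing import FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_diverse(candidates: List[Tuple[str, float, FrozenSet[int]]],
                   n: int,
                   max_similarity: float = 0.7) -> List[str]:
    """Pick up to n compounds in descending score order, skipping near-duplicates.

    candidates: (name, score, feature set) triples; the feature sets are a
    hypothetical stand-in for a structural fingerprint.
    """
    chosen: List[Tuple[str, float, FrozenSet[int]]] = []
    for name, score, feats in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(tanimoto(feats, f) <= max_similarity for _, _, f in chosen):
            chosen.append((name, score, feats))
        if len(chosen) == n:
            break
    return [name for name, _, _ in chosen]


# Toy usage: B scores well but is a close analogue of A, so C is picked instead.
candidates = [
    ("A", 0.90, frozenset({1, 2, 3, 4})),
    ("B", 0.85, frozenset({1, 2, 3, 4, 5})),
    ("C", 0.60, frozenset({7, 8, 9})),
]
picked = select_diverse(candidates, n=2)
print(picked)
```

The similarity threshold is a design choice: a lower value forces a broader spread of chemistry at the cost of passing over some high-scoring close analogues.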

Bringing together the data to prioritise compound ideas

The large volume of compound ideas and associated property data that may be generated is impossible for a person to examine ‘manually’. A successful compound must achieve a balance of multiple, often conflicting, property requirements and the objective is to identify compound ideas that are likely to meet this property profile. Furthermore, each drug discovery project with different therapeutic objectives is likely to require a different profile of properties. Therefore, prioritisation of the compound ideas to reduce the number subjected to detailed consideration is a major challenge.

Visualisation of the compound data may help, but is unlikely to be sufficient given the complexity of the data and objectives along with the uncertainties discussed above (11). However, a solution is provided by ‘multi-parameter optimisation’ that can integrate all of this data into an assessment of the overall quality of a compound against a profile of property criteria. One example is the probabilistic scoring algorithm implemented in StarDrop (3), which allows the user to define a scoring profile that represents the goals for an ideal compound (see Figure 2).

Figure 2 An example scoring profile for a project with the objective of identifying suitable compounds for a serotonin reuptake inhibitor

For each property, the user defines the desired outcome and the importance of that criterion to ensure that the overall score reflects the acceptable trade-offs between different properties. A score is then calculated for each compound, reflecting the likelihood of the compound successfully achieving the overall profile. An uncertainty in each score, due to the uncertainties in the underlying data, can also be calculated, to clearly identify when compounds can be confidently distinguished.
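To make the idea of a likelihood-based score concrete, the sketch below shows one simplified way such a score could be computed. It is an illustration under stated assumptions rather than the published StarDrop algorithm: each prediction is assumed to carry a Gaussian uncertainty, each criterion is a single threshold, and the property names, the `Criterion` class and the functions `prob_meets` and `profile_score` are hypothetical. The uncertainty in the overall score, which the text also mentions, is not computed here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Criterion:
    prop: str               # property name (hypothetical labels in this sketch)
    threshold: float        # boundary of the desired range
    higher_is_better: bool  # True if values above the threshold are desired
    importance: float       # 0 = ignore the criterion, 1 = essential


def prob_meets(value: float, sd: float, crit: Criterion) -> float:
    """Probability that the true value meets the criterion,
    assuming a Gaussian error with standard deviation sd on the prediction."""
    z = (value - crit.threshold) / sd
    p_above = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return p_above if crit.higher_is_better else 1.0 - p_above


def profile_score(predictions: Dict[str, Tuple[float, float]],
                  profile: List[Criterion]) -> float:
    """Combine per-property probabilities into one score between 0 and 1.

    predictions maps property name -> (predicted value, standard deviation).
    Failing a criterion of importance w scales the score by (1 - w), so
    unimportant criteria have little effect and essential ones dominate.
    """
    total = 1.0
    for crit in profile:
        value, sd = predictions[crit.prop]
        p = prob_meets(value, sd, crit)
        total *= crit.importance * p + (1.0 - crit.importance)
    return total


# Toy usage with invented numbers: both predictions sit exactly on their thresholds.
profile = [
    Criterion("potency", 7.0, True, 1.0),
    Criterion("logS", 1.0, True, 0.5),
]
predictions = {"potency": (7.0, 0.5), "logS": (1.0, 0.8)}
print(profile_score(predictions, profile))
```

Weighting each criterion by its importance is what lets the score express acceptable trade-offs: a compound that probably misses a minor criterion can still outrank one that probably misses an essential one.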

Plotting this information in a ‘chemical space’ which reflects the diversity of the chemistry being explored allows ‘hot spots’ to be quickly identified in which high quality compounds are most likely to be found. An example of a chemical space can be seen in Figure 5.

Illustrative example: from lead to drug

A number of examples have been published that illustrate how this approach could be used to aid the search for high quality compounds (7-9). Here we will summarise one example: a retrospective application in which the lead molecule that ultimately gave rise to the drug Duloxetine was used as a starting point.

The application of a set of 206 transformations produced 172 child compounds, which suggests that three generations would create approximately 1.7 million child compounds. Therefore, three generations were applied, but only the top-scoring 10% of the compounds in each of generations one and two were used as the basis for subsequent generations. The scores were generated using predictions from QSAR models of inhibition of the serotonin transporter and key ADME properties.

The resulting data set contained 2,208 compounds out of the potential ~1.7 million and the scores for these compounds are plotted in Figure 3.

Figure 3 Graph of compounds generated by three generations of transformations starting with the lead compound that yielded the drug Duloxetine

From this it can be seen that the score typically increases with generation (the score for the initial lead is 0.09 and the averages for the compounds in subsequent generations are 0.32, 0.44 and 0.53 respectively), indicating that the compounds’ overall quality is improving. However, as the results from multiple uncertain predictions are combined to calculate the score, the uncertainties in the score are high, as shown by the error bars in Figure 3.

Therefore, it is difficult to discriminate between compounds with confidence, particularly in the later generations. Finally, it is notable that Duloxetine itself is present in the final generation, with a score that is significantly higher than the initial lead (level of significance ~0.1) and not significantly below that of the highest scoring compounds.
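The question of whether one score is significantly higher than another can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that the errors on the two scores are independent and Gaussian and applies a one-sided test on the difference; the article does not state which test was used, and the function name `p_value_higher` and the numbers in the usage example are hypothetical.

```python
import math


def p_value_higher(score_a: float, sd_a: float,
                   score_b: float, sd_b: float) -> float:
    """One-sided p-value for the claim that compound A truly scores higher than B.

    Assumes independent Gaussian errors on both scores (an assumption of this
    sketch). A small value means the two compounds can be confidently
    distinguished; a value near 0.5 means they cannot.
    """
    combined_sd = math.hypot(sd_a, sd_b)   # standard deviation of the difference
    z = (score_a - score_b) / combined_sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy usage with invented scores and uncertainties.
p = p_value_higher(0.6, 0.30, 0.1, 0.25)
print(round(p, 3))
```

With large error bars, even a sizeable gap between two scores may correspond to only a modest level of significance, which is why close-ranked compounds in the later generations are hard to separate.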

The structures and scores of the initial lead and Duloxetine are shown in Figure 4, along with the three highest ranking molecules generated.

Figure 4 Compound diagrams: the initial lead that ultimately gave rise to Duloxetine

Although none of the top three compounds could be identified in a search of PubChem (12), the second-ranked compound bears a strong similarity (Tanimoto similarity >0.9) to Litoxetine, also shown on the right in Figure 4, which was progressed to clinical trials and is active against the serotonin transporter with an IC50 of 6 nM (13).

The chemical space of the data set generated is shown in Figure 5.

Figure 5 The chemical space of compounds generated from the initial lead that gave rise to Duloxetine

From this it is clear that there are multiple ‘hot spots’ containing high-scoring compounds; the best scoring compounds are not concentrated in one region, indicating a number of different chemical strategies have been found that are worthy of further consideration. The top three ranked molecules are structurally diverse, within the range of diversity explored around the initial lead, and are distinct from both the initial lead and Duloxetine itself.


Conclusions

In order to get the most out of predictive methods, they should be used to evaluate a wide range of ideas and prioritise the best for detailed consideration by an expert. This achieves the best combination of experienced scientists, who can define the desired property profile and scrutinise the top-ranked compounds, with a computer’s capability to generate and objectively analyse large quantities of data.

However, while generating property data and prioritising ideas is inexpensive and quick, the bottleneck comes in creating the ideas and entering them into the computer. Here too, computational approaches can help by encoding and applying the rules used by medicinal chemists to modify and optimise molecules. This, again, achieves a synergy between the chemists’ expertise (defining the transformations to be applied and controlling their application) and the computer’s capability to store and apply more transformations than an individual.

This approach may also be used as a tool to capture and share knowledge or even as an educational resource for less experienced scientists, as transformations may be shared and organised into groups tailored to specific objectives, such as improving metabolic stability or reducing plasma protein binding.

There is a wide range of potential applications of this technology, which include: aiding the rigorous exploration of chemistry around early hits, to identify those hits most likely to yield high quality lead series; helping to find strategies to overcome problems with compound properties in lead optimisation; and identifying patent-busting opportunities by expanding the chemistry around existing development candidates or drugs to search for compounds with improved properties.

This article originally featured in the DDW Fall 2011 Issue

Dr Matthew Segall is Director and CEO of Optibrium Ltd. Matt has a Master of Science in computation from the University of Oxford and a PhD in theoretical physics from the University of Cambridge. As Associate Director at Camitro (UK), ArQule Inc and then Inpharmatica, he led a team developing predictive ADME models and state-of-the-art intuitive decision-support and visualisation tools for drug discovery. In January 2006, he became responsible for management of Inpharmatica’s ADME business, including experimental ADME services and the StarDrop software platform. Following acquisition of Inpharmatica, Matt became Senior Director responsible for BioFocus DPI’s ADMET division and in 2009 led a management buyout of the StarDrop business to found Optibrium.

Edmund Champness is Director and CSO of Optibrium Ltd. After graduating with a degree in Mathematics in 1995, Ed joined GlaxoWellcome working as part of a pioneering team building predictive pharmaceutical tools. He developed the first graphical user-interfaces for working with predictive models which were adopted globally within GlaxoWellcome. He was a core member of the team which established the UK operation of Camitro in 2001 and remained with that company (now operating within BioFocus DPI following merger and acquisition) until 2008. During this time he designed and built the StarDrop software and, in 2009, co-founded Optibrium.

Dr Chris Leeding is the Product Manager, responsible for the StarDrop software platform, at Optibrium. Chris received a PhD in Chemistry from King’s College London and has more than 10 years’ experience in software development roles. In 2006, Chris joined the team responsible for StarDrop and has played a key role in its development, including implementation of the Auto-Modeller and Nova modules.

Dr Ryan Lilien’s research focuses on the use of advanced computational methods to provide Biologists and Chemists informational leverage in solving their problems. He is the Chief Scientific Officer at Cadre Research Labs, a Massachusetts-based scientific computing contract research organisation, and he maintains an adjunct faculty appointment in the University of Toronto’s Department of Computer Science. Ryan has contributed papers in the areas of Protein Redesign, Drug Discovery, Clinical Medicine, Structural Biology, Mass Spectrometry, Search and Optimisation, Human Computer Interfaces, Machine Learning, and Machine Vision. Ryan received a BS in Computer Science with a concentration in Chemistry from Cornell University. At Dartmouth, he received a PhD in the Department of Computer Science and completed an MD at Dartmouth Medical School.

Dr Ramgopal Mettu completed his BS, MS and PhD degrees at the University of Texas at Austin in Computer Science. Ram’s dissertation research focused on developing approximation algorithms for basic problems in resource placement and clustering. Ram completed a postdoc in Computational Biology at Dartmouth College from 2002-05. Since then, he has held a faculty position in the Department of Electrical and Computer Engineering at the University of Massachusetts Amherst and is now a visiting faculty member in the Computer Science programme at Tulane University. Ram joined Cadre Research Labs in 2010 as a means of translating his academic research to real-world practice. He is partly supported by an NSF CAREER award and has published in the areas of Approximation Algorithms, Discrete Optimisation, Randomised Algorithms, Networking, Machine Learning, Structural Biology and Mass Spectrometry.

Dr Brian Stevens is a senior research associate at Cadre Research Labs where he focuses on projects involving small molecule biochemistry and techniques for molecular biology. He completed his PhD in Biochemistry at Dartmouth and his BA at Skidmore with a double major in Biology and Chemistry. His graduate research focused on understanding and modifying the substrate specificity of the phenylalanine-adenylating domain of gramicidin synthetase. Brian’s post-graduate research focused on the genetic associations of ADD/ADHD and the phylogeny of the invasive Asian Longhorned Beetle.

1 Van de Waterbeemd, H, Gifford, E. ADMET in silico modelling: towards prediction paradise? Nat. Rev. Drug Discovery. 2003;2:192-204.

2 Ekins, S, Boulanger, B, Swaan, P, Hupcey, M. Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J. Comp. Aided Mol. Design. 2001;16:381-401.

3 Segall, M, Champness, E, Obrezanova, O, Leeding, C. Beyond Profiling: Using ADMET models to guide decisions. Chemistry & Biodiversity. 2009;6:2144-2151.

4 Chadwick, AT, Segall, MD. Overcoming psychological barriers to good discovery decisions. Drug Discovery Today. 2010;15(13/14):561-569.

5 Schneider, G, Fechner, U. Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery. 2005;4(8):649-663.

6 Hartenfeller, M, Schneider, G. Enabling future drug discovery by de novo design. Wiley Interdisciplinary Reviews: Computational Molecular Science. 2011.

7 Stewart, K, Shiroda, M, James, C. Drug Guru: a computer software program for drug design using medicinal chemistry rules. Bioorg. Med. Chem. 2006;14:7011-22.

8 Ekins, S, Honeycutt, J, Metz, J. Evolving molecules using multiobjective optimization: applying to ADME/Tox. Drug Discov. Today. 2010;15:451-60.

9 Segall, M, Champness, E, Leeding, C, Lilien, R, Mettu, R, Stevens, B. A new Generation of Possibilities: Applying med chem transformations to guide the search for high quality leads and candidates. [Internet]. 2010 [cited 2010 March 7]. Available from:

10 Segall, MD. Why is it still Drug Discovery? European Biopharmaceutical review. 2008.

11 Segall, MEC. The Difference between Guiding and Supporting Decisions: Enhancing Decisions and Improving Success in Drug Discovery. Genetic Engineering News. 2010 September.

12 Bolton, E, Wang, Y, Thiessen, P, Bryant, S. PubChem: Integrated Platform of Small Molecules and Biological Activities. In: Annual Reports in Computational Chemistry. Vol 4. Washington DC: American Chemical Society; 2008. p. 217-241.

13 Andrews, M, Brown, A, Chiva, J, Fradet, D, Gordon, D, Lansdell, M, MacKenny, M. Design and optimisation of selective serotonin re-uptake inhibitors with high synthetic accessibility: part 2. Bioorg. Med. Chem. Lett. 2009;19:5893-5897.
