For physicochemical and ADME properties, the popular matched molecular pair analysis method has been a successful strategy; however, it notably fails in the goal of improving potency. Here we discuss a lead optimisation approach involving matched series, the extension of matched pairs to more than two R-groups, which can successfully be used to guide molecular design towards improved potency.

Furthermore, this approach retains the attractive features of matched pair analysis in that it is entirely driven by experimental data and is a natural fit to the medicinal chemistry approach of designing analogues by successive small changes to an existing molecule.

What molecule should I make next? This is the question that occurs again and again at each step of a lead optimisation project.

Answering this question well may mean the difference between project success and failure, or at least between rapid progress and wasting time following numerous dead-ends.

How one decides which molecule to synthesise will clearly vary from one person to the next, but ultimately it boils down to one of two things. The first of these is the medicinal chemist’s experience from working on related projects; for example, what worked last time? However, for the most part, deciding what compound to make next is based on observed activity trends, from which a particular structure-activity relationship is inferred and then extrapolated to a new structure. This is commonly referred to as ‘chemical intuition’, but in fact relies on a chemist’s knowledge of potential structure-activity relationships and the relative property values of common R-groups.

Matsy [1] is a lead optimisation strategy that combines both of these approaches and, rather than relying on an individual or group’s experience, it uses the experience garnered by tens of thousands of medicinal chemists and available in the literature. Instead of inferring structure-activity relationships using ‘intuition’, it bases them on this broader experience so that all predictions are made on the basis of previously observed experimental results.

Matched pairs and series
The starting point for this method is the concept of a matched pair (or more formally, a Matched Molecular Pair or MMP) although, as we shall see, such MMPs are not in themselves sufficient for this purpose. An MMP refers to two molecules with the same scaffold but different R-groups at the same position [2], and the concept has become very popular in recent years for rationalising trends in SAR [3,4]. The success of this approach is due to the fact that relative changes in property values are easier to predict than absolute values. It also fits very well with the common lead optimisation procedure of changing R-groups while keeping the underlying scaffold constant.

Predictions based on this approach work well for physicochemical properties as well as for biological activities that correlate highly with such properties. However, in general, MMP analysis does not work well for predicting R-groups that improve biological activity. This was most clearly shown in a 2008 study by Hajduk and Sauer at Abbott [5] for MMP data drawn from a broad range of targets. Potency changes associated with most MMP transformations were found to be nearly normally distributed around zero. The simple reason for this limitation of MMP analysis is that for one binding site environment changing group A to group B may increase activity, while for another binding site environment it may decrease activity. While attempts have been made to address this problem, for example by focusing on MMPs from just the target of interest [6] or with a particular atom environment [7], the underlying problem remains.

But all is not lost. If we revisit the analysis by Hajduk and Sauer, we can show that there may be a way forward. Let us take as an example those assays in the ChEMBL database [8] (https://www.ebi.ac.uk/chembl/) with pIC50 activity data for compounds with ethyl, propyl and butyl as substituents at the same location on a scaffold. Figure 1a shows the distribution of the pIC50 of the ethyl analogue minus that of the butyl, and sure enough the activity differences are distributed symmetrically around zero. In other words, changing ethyl to butyl is equally likely to increase activity as to decrease activity, consistent with the results found by Abbott.

However, what if we include additional knowledge about the context of the matched pair transformation? For example, suppose that we already know that the propyl analogue has a greater pIC50 than the butyl for our scaffold of interest. If we take the subset of the data in ChEMBL where the propyl analogue is more active than the butyl, and then regenerate the original plot (Figure 1b), the distribution of the ethyl minus butyl activities is now shifted to the right away from zero. In other words, knowing that the propyl is more active than the butyl dramatically increases the chance that ethyl is also more active. More generally, if we already know additional information about activities, it should improve our ability to predict the effect of a given R-group replacement.
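The effect of conditioning on the propyl result can be sketched in a few lines of Python. The pIC50 values below are invented purely for illustration and are not taken from ChEMBL or from Figure 1; the function name `fraction_more_active` is likewise hypothetical.

```python
# Hypothetical pIC50 values for ethyl, propyl and butyl analogues on six
# made-up scaffolds. Illustrative only: these are NOT ChEMBL data.
series = [
    {"Et": 6.8, "Pr": 6.5, "Bu": 6.0},
    {"Et": 5.9, "Pr": 6.4, "Bu": 6.9},
    {"Et": 7.2, "Pr": 7.0, "Bu": 6.1},
    {"Et": 6.0, "Pr": 6.6, "Bu": 6.3},
    {"Et": 5.5, "Pr": 5.7, "Bu": 6.2},
    {"Et": 6.6, "Pr": 6.2, "Bu": 6.4},
]


def fraction_more_active(records, a, b):
    """Fraction of records in which R-group a has a higher pIC50 than b."""
    if not records:
        raise ValueError("no records to analyse")
    return sum(r[a] > r[b] for r in records) / len(records)


# Unconditional: how often is ethyl more active than butyl?
overall = fraction_more_active(series, "Et", "Bu")

# Conditional: keep only the scaffolds where propyl already beats butyl.
subset = [r for r in series if r["Pr"] > r["Bu"]]
conditional = fraction_more_active(subset, "Et", "Bu")

print(f"P(Et > Bu)           = {overall:.2f}")      # 0.50
print(f"P(Et > Bu | Pr > Bu) = {conditional:.2f}")  # 0.67
```

In this toy data set, ethyl beats butyl in half of all cases but in two out of three cases once we restrict to scaffolds where propyl beats butyl, which is the kind of shift described for Figure 1b.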

The question is, how best to do this? One approach would be to throw more matched pairs at the problem; rather than simply considering two R-groups and the associated MMP, for any three R-groups, consider the three associated MMPs. However, this quickly becomes unwieldy once one progresses to four R-groups and the associated six MMPs or even longer series with larger numbers of combinations.
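To make the combinatorics concrete: N R-groups give N(N-1)/2 matched pairs. The short sketch below enumerates them with the standard library; the R-group labels are placeholders.

```python
from itertools import combinations
from math import comb


def matched_pairs(r_groups):
    """Return every unordered pair of R-groups in a series."""
    return list(combinations(r_groups, 2))


print(matched_pairs(["Me", "Et", "Pr"]))
# [('Me', 'Et'), ('Me', 'Pr'), ('Et', 'Pr')]

for n in (3, 4, 8):
    print(f"{n} R-groups: {comb(n, 2)} pairs")
# 3 R-groups: 3 pairs
# 4 R-groups: 6 pairs
# 8 R-groups: 28 pairs
```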

In fact, a much simpler and more elegant approach is to consider all of the associated R-groups as parts of a single Matched Molecular Series (MMS), a concept introduced by Bajorath in 2011 [9]. This is simply a generalisation of the MMP concept to a series of any length, that is, N molecules with the same scaffold but different R-groups at the same position. With a matched pair, we are asking the question: “Will changing B to C increase the activity?”; in contrast, if using a matched series of length 3, we are asking “Will changing B to C increase the activity, given that B is more active than A?” In other words, using longer series introduces a context regarding a particular binding site environment.
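As a data structure, a matched series is just the set of R-groups observed on one scaffold together with their activities. The following is a minimal sketch, assuming the molecules have already been fragmented into (scaffold, R-group, pIC50) records; the function name, the record layout and the example values are all hypothetical, and a real implementation would also keep measurements from different assays apart.

```python
from collections import defaultdict


def build_matched_series(records, min_length=3):
    """Group (scaffold, r_group, pIC50) records into matched series.

    Returns a dict mapping each scaffold to its R-groups ordered from most
    to least active. Scaffolds with fewer than min_length distinct R-groups
    are dropped. If an R-group is repeated for a scaffold, the last value
    seen is kept (a simplification).
    """
    by_scaffold = defaultdict(dict)
    for scaffold, r_group, pic50 in records:
        by_scaffold[scaffold][r_group] = pic50

    series = {}
    for scaffold, activities in by_scaffold.items():
        if len(activities) >= min_length:
            series[scaffold] = sorted(activities, key=activities.get, reverse=True)
    return series


# Invented example records.
records = [
    ("scaffold_1", "Me", 5.0),
    ("scaffold_1", "Et", 6.2),
    ("scaffold_1", "Pr", 5.8),
    ("scaffold_2", "Me", 7.1),
    ("scaffold_2", "Et", 6.4),
]

print(build_matched_series(records))
# {'scaffold_1': ['Et', 'Pr', 'Me']}
print(build_matched_series(records, min_length=2))
# {'scaffold_1': ['Et', 'Pr', 'Me'], 'scaffold_2': ['Me', 'Et']}
```

With `min_length=2` the same function yields ordinary matched pairs, which reflects the point that a pair is simply a series of length two.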

Although the term matched pair was first introduced in 2005 [2], and the limitations of the approach were already shown by Hajduk and Sauer in 2008, it is interesting to ask why it took so long to start looking beyond pairs to longer series. One hypothesis is that by naming the concept using the term ‘pair’, chemists focused on thinking in terms of exactly two R-groups and found it difficult to think outside this box. Furthermore, the concept of matched pairs has become synonymous for many with ‘a matched pair transformation’ (that is, a replacement of a terminal R-group), and this cemented the idea of two R-groups as a fundamental concept, rather than just a specific instance of a general case.

The following sections describe two approaches to guide lead optimisation using MMS, namely SAR Transfer and Matsy. In both cases, it will be apparent that such predictions are at their least reliable when based on matched pair data rather than data from longer matched series.

SAR Transfer predictions
The concept of SAR Transfer, as introduced by Bajorath [10], is best explained with an example. Suppose that we have synthesised the set of eight analogues shown in Figure 2 that have different R-groups at a particular location on a common scaffold (a). This, of course, is an MMS of length eight. Having measured the biological activities of these analogues, we need to decide what R-group to make next. One approach to do this is to search a database of biological activities to find MMS containing the same eight R-groups (or a large subset thereof) and where the order of the activities of the analogues is a close match to the original series (this can be measured using rank correlation). Having found such a match, e.g. the scaffold (b) and its corresponding R-groups, it is a reasonable assumption that any additional R-groups in this new series that have improved activity relative to the original eight R-groups may also further improve the activity for the original series involving scaffold (a). In other words, we are transferring SAR from a database match to our own series. For the particular case in the example, the NH2 and SMe groups may offer improved activity for scaffold (a) based on the match to scaffold (b).
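The matching step can be sketched as follows. This is an illustrative simplification rather than the published method: it compares a query series with a single reference series using Spearman rank correlation over the shared R-groups, and proposes any additional R-group that beats every shared R-group in the reference series. The function names, the thresholds `min_common` and `min_rho`, and all activity values are assumptions made for the example and are not the data behind Figure 2.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, i.e. Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined; treat as no correlation
    return cov / (var_x * var_y) ** 0.5


def sar_transfer(query, reference, min_common=4, min_rho=0.8):
    """Suggest R-groups from a reference series for a query series.

    query and reference map R-group -> activity (e.g. pIC50).
    Returns (rho, suggestions); suggestions is empty if the series share
    too few R-groups or their activity orders do not correlate well.
    """
    common = sorted(set(query) & set(reference))
    if len(common) < min_common:
        return None, []
    rho = spearman([query[r] for r in common], [reference[r] for r in common])
    if rho < min_rho:
        return rho, []
    best_common = max(reference[r] for r in common)
    extras = [r for r in reference if r not in query and reference[r] > best_common]
    return rho, sorted(extras, key=reference.get, reverse=True)


# Invented activities for a query scaffold (a) and a database scaffold (b).
scaffold_a = {"H": 5.0, "F": 5.4, "Cl": 6.0, "Me": 6.3, "OMe": 6.8}
scaffold_b = {"H": 4.2, "F": 4.9, "Cl": 5.5, "Me": 5.9, "OMe": 6.1,
              "NH2": 6.9, "SMe": 6.5, "CF3": 5.0}

rho, suggestions = sar_transfer(scaffold_a, scaffold_b)
print(rho, suggestions)
```

Here the five shared R-groups have the same activity order in both series, so the rank correlation is 1 and the two extra R-groups that beat OMe in the reference series are proposed; CF3 is not, because it is less active than the best shared R-group.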

It should be clear that this approach is more likely to work the greater the number of R-groups in common and the higher the correlation of the relative activities of the R-groups in the two series. In particular, this is a useful technique to identify gaps that are worth exploring in a dense R-group matrix; for example, a scaffold with two R-group positions where many of the R1 × R2 combinations have been synthesised and tested.

Unfortunately, the longer the MMS, the less likely it is that a match to a particular series will be found in a reference database, let alone a match with a high correlation of activities. On the other hand, if the length of the MMS is short, even if the activities have perfect correlation, you are unlikely to be confident that the SAR can be transferred to the original series from a single match in the reference database. Furthermore, the number of matches to a short series may be of the order of hundreds or even thousands. The next section describes the Matsy method, an approach that was developed to handle this situation.

Matsy
The Matsy algorithm [1] can be considered a statistical version of the SAR Transfer method that can handle predictions based on short MMS. The origin of the method is the observation that, given a set of R-groups, certain activity orders are found more commonly in matched series composed of those R-groups. Given an existing matched series, the algorithm searches an activity database for all R-groups that have been measured along with those in the input, and calculates the percentage of times each R-group increased the activity beyond the most active R-group in the input series. The R-groups with the highest percentages are presented as the most likely candidates to try next.

Figure 3 presents this approach in the context of an MMS database where only five matches are found in the database. In this case, the R-group D had improved activity relative to the best R-group in the query (A) three times out of three, i.e. 100% of the time; in contrast, C only improved the activity once out of four times, i.e. 25% of the time. In practice, a higher cut-off is applied to the number of observations so that the user can have some confidence in the results.
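The counting step can be sketched in Python as below. This is a simplified illustration rather than the published implementation: it assumes that a database series only counts as a match if it contains all of the query R-groups in the same strict activity order, and the function name `matsy_predict`, the `min_observations` parameter and the toy database are all hypothetical. The activities were invented to give the same counts as quoted above (D three out of three, C one out of four) and are not the data behind Figure 3.

```python
from collections import Counter


def matsy_predict(query_order, database, min_observations=1):
    """Rank candidate R-groups by how often they beat the best query R-group.

    query_order: query R-groups listed from most to least active.
    database: list of dicts, one per matched series, mapping R-group -> activity.
    Returns (r_group, n_improved, n_observed, percentage) tuples, best first.
    """
    best = query_order[0]
    observed, improved = Counter(), Counter()
    for series in database:
        if not all(r in series for r in query_order):
            continue
        activities = [series[r] for r in query_order]
        # Assumption: the series must show the same strict order as the query.
        if any(a <= b for a, b in zip(activities, activities[1:])):
            continue
        for r_group, activity in series.items():
            if r_group in query_order:
                continue
            observed[r_group] += 1
            if activity > series[best]:
                improved[r_group] += 1
    results = [
        (r, improved[r], n, 100.0 * improved[r] / n)
        for r, n in observed.items()
        if n >= min_observations
    ]
    return sorted(results, key=lambda row: (-row[3], -row[2], row[0]))


# Toy database: five series match the query order A > B, one does not.
database = [
    {"A": 7.0, "B": 6.0, "C": 7.5, "D": 7.8},
    {"A": 6.5, "B": 6.1, "C": 6.0, "D": 6.9},
    {"A": 8.0, "B": 7.2, "C": 7.1},
    {"A": 5.9, "B": 5.0, "C": 5.5, "D": 6.3},
    {"A": 6.2, "B": 5.8, "E": 6.0},
    {"A": 5.0, "B": 6.0, "C": 7.0},  # B > A, so this series is ignored
]

for row in matsy_predict(["A", "B"], database):
    print(row)
# ('D', 3, 3, 100.0)
# ('C', 1, 4, 25.0)
# ('E', 0, 1, 0.0)
```

Raising `min_observations` plays the role of the cut-off mentioned above: with `min_observations=3`, the single observation of E is discarded.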

A more realistic example would be to search an MMS database derived from ChEMBL. Let’s assume that we have synthesised a matched series in which ethyl is more active than propyl and propyl itself is more active than methyl (that is, Et > Pr > Me). The top prediction from the Matsy algorithm is cyclopentyl on the basis of 23 observations in ChEMBL, of which 39% increased the activity. The next best prediction is bromine, which increased activity in 38% of 21 observations. It is worth noting that swapping an ethyl for a bromine will reduce the logP; this illustrates the fact that the predictions are not solely driven by logP (a frequently asked question), but are driven by observed trends in the data.

Practical application of matched series to guide design
As discussed, MMPs have proved to be attractive because the corresponding transformations are easily interpreted; the improved predictive power of MMS can also be accessed in an intuitive way. Existing series of compounds can be analysed to find corresponding matched series in a database and, from these, automatically generate new suggestions for optimisation. To gain confidence in the rationale for these suggestions, the underlying experimental evidence can be presented and easily explored. Coupling this with predictive modelling of other properties enables true multi-parameter optimisation to quickly prioritise new compounds to pursue, as illustrated in Figure 4.

Conclusion
The term Matched Molecular Pair made concrete a concept and technique that medicinal chemists had been aware of for years previously; namely that comparisons between the properties of two molecules that differ in a single substituent may be used to guide lead optimisation. However, the focus on two molecules rather than a set of molecules has hindered advances in property prediction. Now that there is an increasing awareness of Matched Molecular Series among chemists, we hope that they will start to look beyond matched pairs to matched-series-based techniques such as SAR Transfer and Matsy that overcome some of the limitations of matched pairs and open up new ways of thinking about, searching and predicting structure-activity relationships.


Dr Noel O’Boyle joined NextMove Software as a Senior Software Engineer in 2012. He has a PhD in computational chemistry from Dublin City University, and has held postdoctoral positions at the University of Cambridge, Cambridge Crystallographic Data Centre and University College Cork.

Dr Roger Sayle is the CEO of NextMove Software. He gained his PhD in computer science at the University of Edinburgh. Before starting NextMove Software, Roger worked at Glaxo-Wellcome, Metaphorics LLC and OpenEye Scientific Software.

Dr Matt Segall is CEO of Optibrium, developers of the StarDrop software platform. He has an MSc in computation from the University of Oxford and a PhD in theoretical physics from the University of Cambridge. Matt’s career has focused on developing intuitive decision-support, predictive modelling and visualisation tools for drug discovery.

References
1 O’Boyle, NM et al (2014). Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity. J. Med. Chem. 57, 2704-2713.

2 Kenny, PW and Sadowski, J (2004). Structure Modification in Chemical Databases. In Cheminformatics in Drug Discovery (Oprea, T. I., ed), pp. 271-285, Wiley-VCH.

3 Griffen, E et al (2011). Matched Molecular Pairs as a Medicinal Chemistry Tool. J. Med. Chem. 54, 7739-7750.

4 Dossetter, AG et al (2013). Matched Molecular Pair Analysis in drug discovery. Drug Discov. Today 18, 724-731.

5 Hajduk, PJ and Sauer, DR (2008). Statistical Analysis of the Effects of Common Chemical Substituents on Ligand Potency. J. Med. Chem. 51, 553-564.

6 Gleeson, P et al (2009). ADMET rules of thumb II: A comparison of the effects of common substituents on a range of ADMET parameters. Bioorg. Med. Chem. 17, 5906-5919.

7 Warner, DJ et al (2010). WizePairZ: A Novel Algorithm to Identify, Encode, and Exploit Matched Molecular Pairs with Unspecified Cores in Medicinal Chemistry. J. Chem. Inf. Model. 50, 1350-1357.

8 Bento, AP et al (2014). The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083-D1090.

9 Wawer, M and Bajorath, J (2011). Local Structural Changes, Global Data Views: Graphical Substructure-Activity Relationship Trailing. J. Med. Chem. 54, 2944-2951.

10 Wassermann, AM and Bajorath, J (2011). A Data Mining Method To Facilitate SAR Transfer. J. Chem. Inf. Model. 51, 1857-1866.

11 Penning, TD et al (1997). Synthesis and Biological Evaluation of the 1,5-Diarylpyrazole Class of Cyclooxygenase-2 Inhibitors: Identification of 4-[5-(4-Methylphenyl)-3-(trifluoromethyl)-1H-pyrazol-1-yl]benzenesulfonamide (SC-58635, Celecoxib). J. Med. Chem. 40, 1347-1365.

12 Puig, C et al (2000). Synthesis and Biological Evaluation of 3,4-Diaryloxazolones: A New Class of Orally Active Cyclooxygenase-2 Inhibitors. J. Med. Chem. 43, 214-223.