Some 5% of commercially available research compounds are sold as salt forms. These can have better properties than their parents, including preferable stability, solubility, bioavailability, and manufacturability. By selecting the correct salt form for an assay or as a reagent, scientists can optimise their research projects, says Andrievs Auseklis Auzin, Developer, Molport, Latvia.
- The research chemical supply market
In the search through chemical space for novel therapeutic agents to meet unmet medical needs, pharma and biotech companies require access to millions of diverse organic chemicals – as building block reactants for input to robotic synthesizers and for immediate use in high-throughput screening (HTS) lead detection systems. Some companies synthesize the needed compounds themselves, while others outsource the syntheses to chemical suppliers and then purchase the compounds they want.
It is estimated that 30 or more research chemical suppliers are active in the screening compound market at any time, providing some 10 million in-stock, varied compounds. Some maintain off-the-shelf inventories to pick from and pre-selected, targeted lists, while others list synthesizable compounds that can be made in short order. The much larger building block market has 200+ suppliers offering 1 million off-the-shelf unique compounds. As well as individual suppliers, five major aggregator companies consolidate the catalogues and compound lists from multiple suppliers. These aggregators aim to save consumers time and effort by providing one-stop shops with streamlined searching, ordering, and value-added sample processing, such as consolidated orders and sample dissolving and plating.
- The importance of salt forms in the commercially available chemicals market
When pharma and biotech companies buy commercially available research compounds as building blocks or screening compounds, they often have a choice of different forms of the molecule. Overall, some 5% of all commercial research compounds are listed in supplier catalogues not as the simple compound (i.e. an unadorned ‘parent’ molecule) but as a salt or solvated form (e.g. a hydrochloride salt, a hydrate, or some more complex set of adducts). Analysis of 87 suppliers’ catalogues identified 2522 unique salt/solvate descriptors (see Section 7: Success metrics), which illustrates the amount of variability.
Salts, solvates, and other addition compounds typically have different, sometimes preferable, physical properties to their parent form. They may be more soluble or more stable, and therefore a better choice as a building block reactant for an automated synthesizer or as a screening sample in a particular solvent for HTS.
- Representing salt structures
As salt forms can have more favourable properties than their parents, it is crucial that they are visible and easy to locate. Most chemical suppliers and aggregators have online catalogues, often with chemical structure and substructure searchability, so that researchers can focus on particular scaffolds or substructures of interest. Suppliers also provide corresponding machine-readable files of data on the compounds they supply, typically including full structure, systematic or trivial name, and molecular formula (MF) and molecular weight (MW). This data is typically extracted from a supplier’s internal database and exported for use by the consumer.
Representing unadorned parent compound structures in chemical databases is straightforward, with well-accepted International Union of Pure and Applied Chemistry (IUPAC) drawing conventions described in Graphical Representation Standards for Chemical Structure Diagrams and covering most classes of compound. However, while ‘Salts and Related Form’ are included in the IUPAC standards, the examples there are relatively simple and limited in scope. This lack of clear guidance can give rise to different ways of both drawing and naming salts. Even something as simple as the chloride salt of a compound has been named in 40 different ways, as shown in Figure 1.
- Challenges arising from inconsistent salt representations
This lack of standardisation or commonly accepted structure and naming conventions can cause problems when suppliers and catalogue indexers are faced with more complex multicomponent salts and/or partially solvated compounds (e.g. sesquihydrates), and can lead to inconsistent ad hoc representations in these more complex cases. These inconsistent ad hoc structures may be simple to draw, and may even pass some chemical integrity checkers, but any derived calculated values such as MF or MW could be suspect.
Modern chemical drawing programs make drawing parents, salts, and solvates simple, easy, and mostly accurate. But in the absence of agreed drawing conventions for salts, even a simple salt compound can be drawn and named in two different ways (Figure 2).
As salts and solvated forms get more complex, the salt/solvate information is sometimes omitted from the structure diagram and only included in the compound name or a salt text field; or is only partially added to the structure diagram and described more completely in a salt text field. This has an important knock-on effect if the compound’s MF and MW are calculated algorithmically from the structure diagram, as they may no longer accurately reflect the effective MF value which has been normalised to the version of the salt/hydrate that contains a single molecule of the parent compound.
This can have devastating experimental consequences. Suppose you want to make up a 0.1 millimolar solution of the proton-pump inhibitor pantoprazole sodium sesquihydrate for an assay. That requires knowing the compound’s MW, and a quick Google search (Figure 3) suggests that it is 864.8.
But is it?
Other open access chemical indexes agree: PubChem and ChemSpider both give its MF as C32H34F4N6Na2O11S2 and its MW as 864.8. These are derived from the structure in Figure 4. But further investigations via chemical suppliers’ websites tell another story. Both Sigma-Aldrich and Sinson Pharma give its molecular formula as C16H14F2N3NaO4S.1.5H2O and molecular weight as 432.37, which is the effective MW that should be used. This is a direct result of representing the compound in two different ways, with either two molecules of pantoprazole sodium or one, and leads to two putative molecular weights that differ by a factor of two. Weighing out a sample based on the wrong molecular weight, whether to create a 0.1 millimolar solution for a bioassay or a milligram amount of a reactant for a synthesizer, could skew a derived dose-response curve or lead to an unbalanced reaction.
Today’s automated high-throughput bioassay systems and robotic synthesizers require accurate and correctly formatted machine-readable data to ensure flawless, uninterrupted operation. Any manual intervention to correct or supply salt form data will be disruptive and time-consuming. Pharma and biotech companies purchasing tens of thousands of compounds for HTS expect to receive correct compound data for immediate use. In addition, most will want to incorporate the structures of the purchased compounds into their in-house registration and inventory systems, and any differences in drawing and naming conventions may again require time-consuming manual intervention before the incoming data can be loaded.
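The arithmetic behind this discrepancy can be reproduced in a few lines. The following is a minimal sketch, assuming rounded standard atomic weights and a 100 mL batch volume chosen purely for illustration; the helper names `formula_mw` and `mass_mg` are hypothetical, and the two-parent formula is written with both sulfur atoms (C32H34F4N6Na2O11S2).

```python
import re

# Rounded standard atomic weights (g/mol); only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "F": 18.998, "N": 14.007,
                 "Na": 22.990, "O": 15.999, "S": 32.06}


def formula_mw(formula: str) -> float:
    """Molecular weight of a simple formula such as 'C16H14F2N3NaO4S'.

    Raises KeyError for any element missing from ATOMIC_WEIGHT.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def mass_mg(conc_mm: float, volume_ml: float, mw: float) -> float:
    """Mass in mg to weigh out for a target concentration (mM) and volume (mL)."""
    moles = (conc_mm / 1000.0) * (volume_ml / 1000.0)
    return moles * mw * 1000.0


# Effective form: one pantoprazole sodium plus 1.5 waters.
effective_mw = formula_mw("C16H14F2N3NaO4S") + 1.5 * formula_mw("H2O")
# Structure-derived form: two pantoprazole sodium plus three waters.
doubled_mw = formula_mw("C32H34F4N6Na2O11S2")

target_mm = 0.1     # target concentration from the text
volume_ml = 100.0   # illustrative volume (assumption)

correct_mg = mass_mg(target_mm, volume_ml, effective_mw)
wrong_mg = mass_mg(target_mm, volume_ml, doubled_mw)

# Concentration of parent actually obtained if the doubled MW is used.
actual_mm = target_mm * wrong_mg / correct_mg

print(f"effective MW {effective_mw:.2f}, doubled MW {doubled_mw:.2f}")
print(f"weigh {correct_mg:.3f} mg (correct) vs {wrong_mg:.3f} mg (wrong)")
print(f"resulting parent concentration: {actual_mm:.3f} mM")
```

Weighing against the doubled MW puts twice the intended amount of parent compound into the solution, so every point on a dose-response curve built from it is shifted by the same factor.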
Data exchange is also complicated by the use of different data file formats. Two thirds of catalogue structure/data files are sent as SD files, with the remainder sent as MS Excel files. The hardest to process are ‘dead’ PDF files.
- Approaches to processing and resolving salt form data
There would be a clear benefit from an approach that bridged the gap between suppliers struggling to represent salts and solvates in a consistent and systematic way, and consumers of the supplied compounds expecting to receive correct and immediately usable structure, MF, and MW data. Efforts to address this important issue are now under way.
While it is feasible to manually inspect and adjust the data on small numbers of salt/solvate compounds, this becomes impractical with the vast numbers of compounds involved in HTS and combinatorial syntheses, making an automated algorithmic approach necessary. The key requirements are the ability to analyse the chemical structure, chemical name, salt text fields, and MF data. If suppliers provide no information about a salt compound, nothing can be done. However, any salt information that is provided is amenable to algorithmic analysis and standardisation. Salt data is provided in three different ways:
a. In a text field only.
b. In the structure only, including the parent compound structure and the salt/solvate structure.
c. In both text and structure.
As an example, one salt analysis algorithm developed by aggregator Molport proceeds stepwise:
a. Identify if the compound is a salt, solvated, or some other type of addition compound, and ignore those that are unadorned parents.
b. For the remainder, identify where the salt information is provided and analyse the information in the salt text field and the salt/solvate fragments in the structure diagram.
c. Create the full structure, i.e. parent molecule + salt structure(s) in the correct ratio.
d. Clean the parent structure and the separate salt structure.
e. Adjust any charges so that the complete structure is neutral.
f. Save and register all three structures (parent, salt, and parent + salt, each with its own unique registration number) plus their stoichiometric ratio and corresponding data.
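The flow of these steps can be sketched in plain Python. This is an illustrative outline only, not the aggregator's actual implementation: the class names, the `KNOWN_FRAGMENTS` table, and the `count + fragment` tokens joined by `&` are all assumptions made for the example, and the structure-level steps d and e (cleaning and charge neutralisation) are left out because they need a cheminformatics toolkit.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Dict, Optional

# Hypothetical lookup of recognised salt/solvate fragments and their MWs (g/mol).
KNOWN_FRAGMENTS = {"HCl": 36.46, "H2O": 18.015}


@dataclass
class CatalogueEntry:
    parent_mw: float                 # MW of the unadorned parent molecule
    salt_text: Optional[str] = None  # e.g. "2HCl&H2O"; None when absent


@dataclass
class RegisteredSalt:
    parent_mw: float
    ratio: Dict[str, Fraction]  # equivalents of each fragment per one parent
    effective_mw: float


def parse_salt_text(text: str) -> Dict[str, Fraction]:
    """Step b: read 'count + fragment' tokens separated by '&'."""
    ratio: Dict[str, Fraction] = {}
    for token in text.split("&"):
        token = token.strip()
        split_at = 0
        while split_at < len(token) and token[split_at] in "0123456789.":
            split_at += 1
        # Fraction() raises ValueError on malformed counts such as ".".
        count = Fraction(token[:split_at]) if split_at else Fraction(1)
        fragment = token[split_at:]
        if fragment not in KNOWN_FRAGMENTS or count <= 0:
            raise ValueError(f"unrecognised salt token: {token!r}")
        ratio[fragment] = ratio.get(fragment, Fraction(0)) + count
    return ratio


def process(entry: CatalogueEntry) -> Optional[RegisteredSalt]:
    # Step a: unadorned parents need no salt handling.
    if not entry.salt_text:
        return None
    # Step b: analyse the salt text; a ValueError lets the caller flag the entry.
    ratio = parse_salt_text(entry.salt_text)
    # Step c: combine parent and fragments in the stated ratio.
    effective_mw = entry.parent_mw + sum(
        float(count) * KNOWN_FRAGMENTS[fragment]
        for fragment, count in ratio.items()
    )
    # Steps d and e (structure cleaning, charge neutralisation) are omitted.
    # Step f: return the record that would be saved and registered.
    return RegisteredSalt(entry.parent_mw, ratio, effective_mw)


record = process(CatalogueEntry(parent_mw=300.0, salt_text="2HCl&H2O"))
print(record)
```

Raising an error on anything unrecognised, rather than guessing, keeps the sketch conservative: an entry is either resolved or handed back for attention.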
- Programmatic approaches to manage difficult, novel, and edge cases of salt form information
In the drive to automatically process incoming salt data from suppliers and generate correct and immediately usable MF/MW values, the salt structure/data analysis algorithm can have several detailed sub-routines. These parse the data to cover as many complex and difficult cases as possible while remaining conservative: the algorithm either generates a correct MF/MW or flags an error in uncertain or unresolvable cases. Where assumptions are made (e.g. approximations in salts/solvates with 12 or more components), these are clearly stated.
This approach allows seamless and uninterrupted use of salt form compounds in assays and as reactants, as users can confidently auto-process the incoming compounds. There is no need to manually intervene to recalculate MW values, and data can be directly consumed by internal compound registry and inventory systems in all but a small number of difficult edge cases which have been flagged for attention.
- Success metrics
In one study, the programmatic approach described above analysed 8M catalogue entries from 87 suppliers where around 4% had salt information either in the structure or in a text field. This successfully processed 320K salt compounds and generated an accurate effective MF and MW for each.
The analysis identified 268 individual salt fragments with all their possible charge configurations. It processed 1083 different salt combinations (e.g. HCl&H2O vs. 2HCl&H2O). In total it recognised 2252 unique salt texts, which exemplifies the extent of the problem and the urgent need for automated processing and improved formats for supplying data.
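Much of this variety is spelling rather than chemistry, so a first processing step is to collapse equivalent salt texts onto one canonical key. The sketch below assumes a tiny, hypothetical synonym table and a handful of separator characters; a real table would need to cover the hundreds of fragments and thousands of texts mentioned above.

```python
import re
from fractions import Fraction
from typing import Dict

# Hypothetical synonym table: lower-cased supplier spelling -> canonical fragment.
SYNONYMS = {
    "hcl": "HCl", "hydrochloride": "HCl",
    "h2o": "H2O", "hydrate": "H2O", "water": "H2O",
}

# Optional leading count (integer or decimal) followed by a fragment name.
TOKEN = re.compile(r"(\d+(?:\.\d+)?)?\s*([A-Za-z][A-Za-z0-9]*)")


def canonical_salt_key(text: str) -> str:
    """Map a free-form salt text to a sorted 'countFragment&...' key."""
    counts: Dict[str, Fraction] = {}
    for part in re.split(r"[&,;+]", text):
        match = TOKEN.fullmatch(part.strip())
        if match is None or match.group(2).lower() not in SYNONYMS:
            raise ValueError(f"unrecognised salt text: {part!r}")
        fragment = SYNONYMS[match.group(2).lower()]
        count = Fraction(match.group(1) or "1")
        counts[fragment] = counts.get(fragment, Fraction(0)) + count
    pieces = []
    for fragment in sorted(counts):
        n = counts[fragment]
        # A count of one is left implicit, as in 'HCl'.
        pieces.append(fragment if n == 1 else f"{n}{fragment}")
    return "&".join(pieces)


for raw in ["HCl&H2O", "hydrate, hydrochloride", "2HCL&H2O"]:
    print(raw, "->", canonical_salt_key(raw))
```

The first two inputs collapse to the same key, while the dihydrochloride keeps a distinct one, so spelling variants merge without losing genuine stoichiometric differences.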
- Proposals for improved, consistent formats for salt form structures and data
The analysis of millions of suppliers’ catalogue entries and automatically processing salt records to generate accurate MF and MW values has led us to generate the following proposals for improved methods for describing and supplying salt form information in text and structure diagrams.
a. If salt information is provided in the text column/field in the catalogue:
i. Normalise all salt stoichiometry to one equivalent of the main molecule. For example, 2 (main component):3 (HCl):1 (H2O) should be 1:1.5:0.5.
ii. If possible, assume the main component is neutral.
iii. Provide the MW of the main component and of the full salt/solvate normalised to one molecule of the main component.
b. If salt information is in the structure, include the full stoichiometry. For example, if there are 12 waters, include all 12, not just one.
c. If salt information is in a text string, include the stoichiometry in the text and make sure the text is uniquely identifiable.
d. Do not provide part of the salt in text form and the other part in the structure.
e. When providing both text and structure salt information, make sure they match.
f. Do not add text labels/comments in the structure, as they are not machine readable.
g. Do not use pseudo/undefined atoms in defining the structure.
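Two of these proposals, normalising to one equivalent of the main component and checking that text and structure agree, are simple to express in code. The following is a small sketch using exact fractions; the function names are hypothetical and the component counts are assumed to have already been extracted from the text field or the structure.

```python
from fractions import Fraction
from typing import Dict


def normalise_ratio(main_count: int, others: Dict[str, int]) -> Dict[str, Fraction]:
    """Express each salt/solvate component per one equivalent of the main molecule."""
    if main_count <= 0:
        raise ValueError("main component count must be positive")
    return {name: Fraction(count, main_count) for name, count in others.items()}


def text_matches_structure(text_ratio: Dict[str, Fraction],
                           structure_ratio: Dict[str, Fraction]) -> bool:
    """True when both sources describe the same normalised stoichiometry."""
    return text_ratio == structure_ratio


# The 2:3:1 example from the proposals, as it might appear in a text field.
from_text = normalise_ratio(2, {"HCl": 3, "H2O": 1})
# The same salt drawn in a structure with four parent molecules.
from_structure = normalise_ratio(4, {"HCl": 6, "H2O": 2})

print({name: float(value) for name, value in from_text.items()})
print(text_matches_structure(from_text, from_structure))
```

Using `Fraction` rather than floating-point division keeps ratios such as 1.5 exact, so two descriptions of the same salt compare equal regardless of how many parent molecules each one was drawn with.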
Salt forms of compounds are important in biopharma research, but there are difficulties in providing accurate structures and MF/MW values. As a pragmatic interim measure, an automated algorithmic process can resolve many of these issues, avoid manual intervention, and provide correct, immediately usable values. Longer term, a more systematic approach to providing salt form information is needed, and we have proposed recommendations to help achieve this, with a view towards the market having consistent, reliable, and easily analysed information for researchers.
Volume 23 – Issue 3, Summer 2022
About the author:
Andrievs Auseklis Auzin, Developer, Molport, is a specialist in chemoinformatics and software development. Auseklis is currently working to advance chemical data standardisation and chemical search capabilities at Molport. Prior to joining the company, he served as a computational chemist at The Latvian Institute of Organic Synthesis and has many years of experience in software development.
1: 2008 Pure & Applied Chemistry, 80, 277-410.