This paid-for advertorial by Benchling appeared in the DDW ebook – AI in drug discovery: How technology is driving innovation.
By Ashoka Rajendra, Head of Development & Manufacturing Products at Benchling
Artificial Intelligence (AI) is emblematic of the new era of modern biotech: data-driven, collaborative and, ultimately, faster than ever. From discovery to lead optimisation, process development, pre-clinical work, and even investigational new drug filings, each stage of R&D stands to benefit from AI.
But for all we hear about AI and machine learning (ML), not many labs are taking advantage of it in their scientific work. They understand that it’s not as simple as sprinkling ML pixie dust into scientific design.
In this guide, we’ll talk you through the common hurdles we see, offer practical advice on setting up a strong foundation for AI, and explore the opportunity of moving AI from theoretical to operational in R&D.
Machine learning: expectations vs. reality
When applying AI, people often think that the majority of their time will be spent writing complex algorithms to find previously unknown insights. In reality, most of that time is spent acquiring and cleaning data.
In 2018, the CEO of Novartis declared that Novartis would become a data science company. A year later, he discussed the challenges of powering their R&D with data science and the need for good, clean data.
“The first thing we’ve learned is the importance of having outstanding data to actually base your ML on. In our own shop, we’ve been working on a few big projects, and we’ve had to spend most of the time just cleaning the data sets before you can even run the algorithm. It’s taken us years just to clean the datasets. I think people underestimate how little clean data there is out there, and how hard it is to clean and link the data,” said Vasant Narasimhan, CEO of Novartis.
So what’s the solution? If all the work scientists do is captured in software, it ought to be easier to analyse.
The solution is software, but it’s not that simple
Many companies have embarked on a digitisation journey, convinced that their notebook entries, DNA and protein designs, samples used in experiments, and experimental conditions should all be tracked and stored in software. If we go down this path and digitise everything scientists are doing, we should be well on our way to leveraging all the insights data science can provide.
However, there are challenges with this approach. First, your users are scientists, who are extremely well trained but not software specialists. Second, their workflows are complex, constantly evolving, and highly domain-specific. It’s not easy to build high-quality software under these conditions.
User-centric is key
When I met with a leading technologist at a pharmaceutical company recently, he jokingly said: “The best way to get compliance and adoption of tools that you introduce to scientists is to uninstall Excel from all of their computers.” It seemed a bit extreme, but I saw where he was coming from. Excel is a very powerful and flexible tool, and it’s popular with scientists for a reason. If another system isn’t easy for them to use, if it doesn’t link data together, and if it doesn’t bring them immediate productivity, they’ll use Excel instead.
I don’t think the solution is to uninstall Excel; I think the solution is to build tools that bring scientists enough value such that they choose to use purpose-built software over Excel.
Software needs to accommodate the complexity of the data that scientists are producing, and it needs to do so with a flexible data structure. If you over-rotate on the end goals of digitisation and don’t put enough emphasis on the productivity and usability of the end-user scientist, you will not drive adoption.
R&D data is highly complex
Why is this such a challenge in life sciences? Large molecule data is extremely complex. Take a relatively simple example from antibody engineering: you might start with an antibody, its target, and its binding affinity. But you also need the DNA sequences that encode the antibody chains, the combination of plasmids used to express the antibody, the growth conditions for the cell lines expressing it, screening data, and so on. All of this needs to be modelled, stored, tracked, and analysed, but most software isn’t equipped to deal with the complexity specific to large molecule R&D. Scientists will only use your software if it accommodates the complexity of the data they create and is flexible to change.
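To make that complexity concrete, here is a minimal sketch of how those linked antibody-engineering entities might be modelled in code. Every class and field name is an illustrative assumption for this guide, not Benchling’s actual data model, and a real R&D schema would carry far more detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified data model for antibody engineering records.

@dataclass
class Plasmid:
    plasmid_id: str
    dna_sequence: str            # sequence encoding one antibody chain

@dataclass
class CellLine:
    cell_line_id: str
    plasmid_ids: List[str]       # combination of plasmids used for expression
    growth_conditions: dict      # e.g. {"media": "CD CHO", "temp_c": 37}

@dataclass
class ScreeningResult:
    assay: str                   # e.g. "SPR binding"
    antibody_id: str
    binding_affinity_nm: float   # KD in nanomolar

@dataclass
class Antibody:
    antibody_id: str
    target: str                                   # antigen the antibody binds
    heavy_chain_plasmid: Plasmid
    light_chain_plasmid: Plasmid
    expressed_in: Optional[CellLine] = None
    screening: List[ScreeningResult] = field(default_factory=list)
```

Even this toy model already links four entity types, with one-to-many relationships between them; every new assay, process change, or molecule format adds more, which is why a flexible, linked data structure matters.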
Life science R&D data needs to be centralised…
Specialised point solutions for each of your research and development teams can cause significant problems for teams doing data analysis. Each tool comes from a different vendor and models its data differently, so instead of a ‘data lake’ pooling the data together, you end up with a data swamp. Rather than analysing data, your data science or bioinformatics team will spend their time linking, reconciling, and cleaning it. Capturing data on a unified platform can dramatically reduce this burden.
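As a sketch of what that reconciliation burden looks like in practice, the snippet below joins exports from two hypothetical point solutions that track the same samples. The file names, column names, identifier formats, and unit conversions are all assumptions for illustration only.

```python
import pandas as pd

# Hypothetical exports from two separate point solutions tracking the same samples.
lims = pd.read_csv("lims_export.csv")         # columns: SampleID, Target, KD_nM
assay_db = pd.read_csv("assay_export.csv")    # columns: sample_id, titer_g_per_L

# Reconcile identifiers: one system uses "AB-0001", the other "ab_0001".
lims["sample_key"] = lims["SampleID"].str.upper().str.replace("_", "-")
assay_db["sample_key"] = assay_db["sample_id"].str.upper().str.replace("_", "-")

# Harmonise units before any analysis (here nM -> pM, purely as an example).
lims["kd_pM"] = lims["KD_nM"] * 1000

# Only after this linking and cleaning can the actual analysis begin.
merged = lims.merge(assay_db, on="sample_key", how="inner")
print(merged[["sample_key", "Target", "kd_pM", "titer_g_per_L"]].head())
```

Multiply this by every vendor, every team, and every workflow change, and the cleaning work quickly dwarfs the analysis itself.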
Centralising and standardising data across disparate teams is a crucial challenge for life science R&D organisations.
…on a flexible software platform
Another big challenge for software adoption in life science is that while a tool may be initially configured to represent a scientific workflow, that workflow constantly evolves. Scientists may need to test, for example, new versions of a protein purification process in order to improve the quality of the purified sample generated.
Scientists will go where the data leads them, and it’s very important to have software that is adaptive to changes in experimental workflows. If the software doesn’t keep up with the science, the tools will become out of date and scientists will opt for unstructured notebook entries, Word documents, or pen and paper. Given the pace of the scientific process today, you need software to be able to change in days or weeks, not months.
Companies that want to leverage advanced data science techniques need to digitise their scientists’ work. Many companies have embarked on this journey but have had challenges with adoption and fragmentation of tools. At Benchling, we work with companies to consolidate many tools onto a single, unified platform.
Unlocking new capabilities
Laying this foundation with a suite of applications on a scientifically aware platform allows companies to more easily layer on advanced analysis techniques that are internally or externally developed.
Benchling’s rich application functionality sits on top of a platform that allows data to be accessed via our API and data warehouse. Your data scientists and engineers can write code that pulls unified data out of Benchling, integrates it with internal systems, and feeds data through analysis pipelines. Scientists can get recommended experimental conditions, managers can look across programs and determine where more resources are needed, and executives can flag programs that are promising or risky.
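As a rough sketch of what that pipeline code might look like, the snippet below pulls results over a REST API and feeds them into a simple analysis step. The base URL, endpoint path, field names, and token handling are hypothetical placeholders rather than Benchling’s actual API, so treat this as an outline of the pattern rather than working integration code.

```python
import os
import requests
import pandas as pd

# Hypothetical REST call; the base URL, endpoint, and response schema are placeholders.
BASE_URL = "https://example.benchling.com/api/v2"
headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}

resp = requests.get(f"{BASE_URL}/assay-results", headers=headers,
                    params={"projectId": "prj_example"})
resp.raise_for_status()
results = resp.json()["results"]   # assumed response shape

# Flatten into a DataFrame and run a simple program-level summary.
df = pd.DataFrame(results)
summary = df.groupby("program")["binding_affinity_nm"].median()

# A downstream step might recommend conditions or flag programs for review.
print(summary.sort_values().head())
```

From there, the same unified data could feed condition-recommendation models, programme-level dashboards for managers, or the kind of portfolio views executives use to spot promising or risky programs.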
AlphaFold is an exciting example of how externally developed breakthroughs can be easily applied to your scientific data. AlphaFold is an AI system built by Google’s DeepMind that predicts 3D structure from an amino acid sequence with relatively high accuracy. It has been described as a significant achievement in computational biology and a game-changing tool that has the potential to speed up protein structure characterisation, work that generally takes months or years at the bench.
Benchling’s scientifically aware data model allows us to easily add this functionality for scientists, without the need to invest in major computational know-how or power. With this foundation in place and this level of data analysis achieved, our clients are already beginning to apply AI to their R&D efforts using data from Benchling.