COVID-19 has shown the pace of innovation that is possible when developing vaccines and therapeutics. Matt Jones, Lead Data Science Strategist at Tessella, explains how to set up R&D and data science teams to keep that going.
The drive to develop new therapeutics and vaccines for COVID-19 saw researchers move at unprecedented speeds. Years became months, months became weeks.
We are still in the midst of this pandemic and things will continue at this pace for a while. But we also need to ask what we will have learned that we can take forward. All crises reveal opportunities and push people to discover what is possible when they think differently. Having shown it is possible, the pressure to maintain this pace is likely to remain.
The accelerated pace has, at least in part, been made possible by data – we have never had so much of it, nor so many sophisticated tools for drawing insights from it.
But the digital acceleration throws up major data challenges. Decisions must now be made with new and limited data, which is full of uncertainties. And models based on this data need to be built quickly.
The experts designing drugs and vaccines do not necessarily have the expertise, or time, to curate, engineer and validate unfamiliar data and build models at this speed. R&D departments need to establish new ways of working to ensure the people and processes are in place to deliver the right data and models to researchers as quickly as possible.
At Tessella, we have long experience of modelling complex situations, at speed, with limited data. We see the following four areas as crucial to doing data science quickly in uncertain environments.
Accessing the right data
A core challenge for COVID research was uncertainty in the data. Most disease research is built on years of study. Here, we are dealing with data on biological mechanisms and patient responses where understanding is still evolving. Investigating potential secondary indications of existing drugs, for example, involves data based on subjective assessments by doctors who are still getting to grips with the disease.
Open source data, or data from clinical trials or hospitals, may include bias or misreporting, and different sources may use different labels and data capture mechanisms. Even if a model is perfect, it will still produce a wrong result if the data going in is incorrect or incomplete.
To resolve this, as data enters the organisation's databases, it must be assessed by subject matter experts for errors and bias. Data scientists must make the necessary changes to ensure consistency. They must also remove confounding elements – eg labels added to scans by physicians – which will confuse models.
Metadata should be added on what the data represents – eg type of molecule or toxicology, but also provenance, timestamps, usage licences, etc. There must also be a consistent taxonomy established for naming things so models (and humans) can find and make sense of the data.
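As an illustration, a metadata record along these lines might look like the following minimal Python sketch – the field names, dataset identifier, and taxonomy terms are all hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class DatasetRecord:
    """Hypothetical metadata record attached to each incoming dataset."""
    dataset_id: str        # unique, searchable identifier
    description: str       # what the data represents, eg molecule type or toxicology
    source: str            # provenance: lab, hospital, open data repository
    captured_at: datetime  # timestamp of capture
    usage_licence: str     # rules on how the data may be used
    taxonomy_terms: List[str] = field(default_factory=list)  # consistent naming terms

record = DatasetRecord(
    dataset_id="tox-2020-0042",
    description="in vitro toxicology screen, small-molecule candidates",
    source="partner-hospital-A",
    captured_at=datetime(2020, 5, 1, tzinfo=timezone.utc),
    usage_licence="research use only, anonymised",
    taxonomy_terms=["toxicology", "small-molecule", "in-vitro"],
)
print(record.dataset_id, record.usage_licence)
```

However it is implemented, the point is that every dataset carries its meaning, provenance, and usage rules with it, so both models and humans can find it and judge whether it is fit for purpose.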
Once complete, data must be fed into central, accessible data stores, and tools and integrators set up to pipe data to data science teams.
Good data is FAIR: Findable, Accessible, Interoperable, Reusable. It is stored in a way that makes it easy to identify by anyone who searches for it. It is in formats that can be read by humans and machines. And it is clear about any limitations or rules about how it can be used.
We have seen many data projects come undone, or fail to get off the ground, because modellers drew invalid conclusions from data sets which contained errors, bias, or lacked contextual information. This was the case when we worked with a pharma company to explore how pre-clinical data could be used to predict late-stage failures.
They had expected the problem to be about piping the data to the right place, but after speaking to the modellers, it became clear that the data was hard to understand, laborious to use, and risky to draw conclusions from. By focusing instead on improving data management, we significantly improved models and reduced late-stage failures.
Choosing the right models
The next potential sticking point is building the right model.
There is no rule for which approach is best for a particular problem. The nature and context of the problem, data quality and quantity, computing power needs, speed, and intended use, all feed into model choice and design. An image recognition tool for screening cell lines will look very different to a model which analyses molecule libraries to identify likely candidates for new drug targets.
Start by understanding the type of problem. Is it classification or regression, supervised or unsupervised, predictive, statistical, physics-based, etc? Don’t just go for the approach you are most familiar with.
Screen data to understand what is possible. Perform rapid and agile early explorations using simple techniques to spot the correlations that will guide your plan. From this analysis, identify candidate modelling techniques (eg empirical, physical, stochastic, hybrid) before narrowing down to the most suitable model for that specific problem.
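As a rough illustration of that early screening, the following minimal pandas sketch ranks features by their correlation with an outcome of interest – the file name and the 'response' column are hypothetical stand-ins for the curated data store:

```python
import pandas as pd

# Hypothetical file; in practice this would come from the curated data store
df = pd.read_csv("assay_results.csv")

# Rank features by absolute correlation with an assumed outcome column, 'response'
correlations = (
    df.corr(numeric_only=True)["response"]
    .drop("response")
    .abs()
    .sort_values(ascending=False)
)
print(correlations.head(10))  # strongest candidate signals to guide model choice
```

A few lines like these will not produce a model, but they quickly reveal which signals are worth pursuing and which modelling techniques the data can realistically support.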
‘Most powerful’ is not the same as ‘most suitable’. Techniques such as machine learning need lots of well-understood data and so are ill-suited to challenges at the cutting edge, where data is still being understood. Approaches such as Bayesian uncertainty quantification – which involves updating our knowledge and its uncertainty with each data point, so each piece adds incrementally to the richness of the model – may be better where limited trusted data is available.
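To make the idea concrete, here is a minimal sketch of that incremental updating using a conjugate Beta-Binomial model – the prior and the patient outcomes are invented for illustration:

```python
from scipy.stats import beta

# Weakly informative prior belief about a response rate: Beta(2, 2)
a, b = 2, 2

# Hypothetical patient outcomes (1 = response, 0 = no response); each one
# updates the posterior incrementally: success -> a + 1, failure -> b + 1
for outcome in [1, 0, 1, 1, 0, 1]:
    a += outcome
    b += 1 - outcome

posterior = beta(a, b)
low, high = posterior.interval(0.95)  # the credible interval narrows as data accrues
print(f"posterior mean={posterior.mean():.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

Each observation refines both the estimate and its uncertainty, which is exactly what is needed when trusted data arrives a little at a time.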
Ensure your answers are trusted
The best model in the world will fall down if users don’t trust it.
Trust requires more than just a working model. Over-complicated or frustrating user interfaces, or models which cause more problems than they solve, undermine trust and slow uptake. We have seen this very prominently in track and trace apps, but it is equally true of drug discovery platforms.
So does a lack of explainability. If users can’t understand why the model reached a result, they will end up having to repeat work manually. A good model includes tools to analyse what data was used, its provenance, and how the model weighted different inputs, and to report those conclusions in clear language.
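One simple way to surface those weightings – sketched here with an assumed scikit-learn linear model and invented descriptor names and data – is to translate model coefficients into plain-language statements:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: three hypothetical molecular descriptors, binary outcome
X = np.array([[0.2, 1.1, 3.0], [0.8, 0.4, 2.1], [0.5, 0.9, 1.2], [0.9, 0.2, 0.5]])
y = np.array([0, 1, 0, 1])
features = ["solubility", "logP", "molecular_weight"]

model = LogisticRegression().fit(X, y)

# Report how the model weighted each input, in plain language
for name, weight in sorted(zip(features, model.coef_[0]), key=lambda p: -abs(p[1])):
    direction = "increases" if weight > 0 else "decreases"
    print(f"{name} {direction} the predicted likelihood (weight {weight:+.2f})")
```

More sophisticated explainability techniques exist, but even a report this simple helps users see why a result was reached rather than taking it on faith.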
Privacy and ethical concerns also undermine trust. When using patient data, we must be sure it was freely given (labelling usage rights in metadata, as mentioned above) and keep it secure.
We are regrettably seeing that the virus disproportionately affects certain ethnic and social groups. This needs to be reflected carefully in models and the data behind them – drug formulations or dosages which work particularly well in one ethnic group and badly in another will reflect very badly on a company.
Deploying models at scale
Models must work for researchers in their day-to-day lives, not as a data science proof of concept.
Usually that involves engineering the final model into a piece of software and integrating it into a mobile or web app, or a bespoke piece of technology. This requires an understanding of the rules and complexities of enterprise IT or edge computing where the model must operate.
This may involve wrapping models in software (‘containers’) which translates incoming and outgoing data into a common format, allowing the model to slot into an IT ecosystem. It will require allocating power and compute appropriate to the application. It means planning for ongoing maintenance, support, and retraining. This is where a lot of models face big hold-ups, since pharma researchers, and even data scientists, are not usually software engineers.
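As a sketch of that wrapping step, a minimal Flask service might translate JSON in and out of the model so it can slot into an IT ecosystem – the endpoint name and input fields are assumptions, and the model itself is a stand-in:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(properties):
    """Stand-in for the trained model; returns a dummy score here."""
    return {"candidate_score": 0.87}

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()             # common JSON format in...
    result = predict(payload["properties"])
    return jsonify(result)                   # ...and common JSON format out

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)       # typically packaged into a container image
```

In practice, such a service would be containerised and deployed, monitored, and retrained under the organisation’s IT processes.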
If all goes well, the user is presented with a clear interface. They enter the relevant inputs, eg desired pharmacological properties. The model runs and presents the resulting insight in an easy-to-understand way that the user is comfortable acting upon.
Bringing it all together for rapid results
The effective use of data is a critical part of shortening cycles in life sciences R&D, as well as in improving efficiency and reducing failures. The steps above outline how to get key elements of data right to support fast but accurate R&D.
Experience of responding to COVID-19 has shown the benefit of establishing focussed teams who are responsible for ensuring data is captured and curated correctly and for building robust models. These must not exist in isolation, but be able to work closely with researchers to understand data and needs, and with IT teams to deploy models into enterprise IT.
Speed isn’t about cutting corners; it is about doing things right first time so you don’t have to abandon projects and start again. That means efficient allocation of resources – selecting the right skills for the right job. Getting data experts to handle the data, modellers to do the models, and software engineers to do the software.

Dr Matt Jones holds a PhD in synthetic organic chemistry and has over 20 years’ experience in pharmaceutical R&D. He has been at Tessella since 2014, before which he held a number of technical and management roles at GSK. In 2015 he was elected to the board of directors of Pistoia Alliance.
This article is based on Tessella’s whitepaper, COVID-19: Effective Use of Data and Modelling to Deliver Rapid Responses, developed with input from a range of modelling experts.