The life sciences data landscape is growing in complexity and scale, with organisations generating increasingly high volumes of data. Yet, these data are still stored and searched using outdated methods. As a result, there are vast amounts of unstructured data stored in various siloed locations. Vladimir Makarov, Consultant at Pistoia Alliance, discusses this challenge and ways to move forward.
One example of this challenge lies in current approaches to bioassay protocols, which document the methods used in biological research. This information is hard to find, compare, analyse, or use in data mining; doing so requires considerable time and specialised expertise. In fact, the Pistoia Alliance conducted a series of interviews with scientists on this subject, finding that they spend up to twelve weeks per assay selecting and planning new experiments.
The challenge of siloed data is symptomatic of a larger failure to collaborate in the life sciences, and it also represents the reluctance of life science organisations to fully take advantage of digitalisation. In many ways, digitalisation has transformed the life sciences, yet it seems data management systems are the last remaining mark of legacy approaches.
Making Data FAIR
It’s essential that any new method of storing and searching life science information ensures that data are made FAIR: findable, accessible, interoperable and reusable. To facilitate a more collaborative approach and ensure that organisations and individuals all work to the same standards, the FAIR principles should be applied across the industry. If data are made FAIR, they become more easily retrievable and sharable, preventing unnecessary duplication of research, and perhaps more importantly, repetition of experiments that failed in the past.
The earlier example of bioassay protocols illustrates the broader issue with current data systems. At present, bioassay data are not recorded in a FAIR format. Assay protocols are widely accessible, as they are stored in public data banks, either in the form of research papers or as metadata attached to scientific results; however, both exist in plain-text formats. This means assay protocols are not machine-readable and therefore require manual review. Many current assay protocol annotations also lack the depth or quality needed to drive research forward. As a result, scientists spend huge amounts of time manually sifting through vast libraries of old records, rather than conducting new research or applying AI and machine learning to their datasets.
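To make the contrast concrete, the sketch below shows the same assay described as free text versus as structured, ontology-annotated metadata. This is a hypothetical illustration, not the actual DataFAIRy schema, and the BAO-style term identifiers shown are invented examples rather than verified codes.

```python
# The same assay, described two ways.

# 1. Plain-text protocol description: a machine cannot reliably answer
#    "which assays measure IC50 in a cell-based format?" from this alone.
plain_text = (
    "Compounds were tested in a cell-based luminescence assay and "
    "IC50 values were determined after 48 h incubation."
)

# 2. Structured annotation using illustrative, ontology-style term IDs
#    (the "BAO:..." codes here are placeholders, not real identifiers).
structured = {
    "assay_format": {"label": "cell-based format", "id": "BAO:0000000"},
    "detection_method": {"label": "luminescence", "id": "BAO:0000001"},
    "endpoint": {"label": "IC50", "id": "BAO:0000002"},
    "incubation_time_h": 48,
}

def find_assays(records, endpoint_label):
    """Return records whose annotated endpoint matches the query label."""
    return [r for r in records if r["endpoint"]["label"] == endpoint_label]

# With structured metadata, the query becomes a trivial filter.
matches = find_assays([structured], "IC50")
print(len(matches))  # prints 1
```

The point is not the specific field names, which are assumptions here, but that a query which takes a scientist hours of reading against plain text becomes a one-line filter against annotated records.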
AI and Machine Learning: A Case Study
The FAIR principles make data machine-readable and so mitigate this challenge. With this change, it becomes far easier to implement AI and machine learning, which can transform the data-searching process by saving time and reducing room for error. One example of this in practice is the DataFAIRy project, which demonstrates the advantages of an approach that combines FAIR with AI and ML. In the DataFAIRy project, unstructured assay metadata is processed by an automated natural language processing (NLP) engine, and the output is then vetted by human experts (a 'human in the loop' approach) to ensure the quality of the annotations. To develop the DataFAIRy method, the Pistoia Alliance first conducted an extensive analysis of the needs of a typical scientist in the pharmaceutical industry. The project team then developed an ontology-based model that allows typical data-mining questions to be answered.
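A human-in-the-loop pipeline of this general shape can be sketched in a few lines. This is a minimal illustration, not the DataFAIRy implementation: the `extract_annotations` keyword matcher, the confidence scores, and the `human_review` stub are all stand-ins for a real NLP engine and a real expert-curation step.

```python
def extract_annotations(protocol_text):
    """Hypothetical NLP step: return (annotation, confidence) pairs.

    A real engine would use a trained language model; simple keyword
    matching stands in for it here, purely for illustration."""
    vocabulary = {"luminescence": "detection_method", "IC50": "endpoint"}
    found = []
    for keyword, field in vocabulary.items():
        if keyword in protocol_text:
            found.append(({field: keyword}, 0.9))  # toy confidence score
    return found

def human_review(annotation):
    """Stand-in for expert vetting: a real system would queue the
    annotation for a curator and record their accept/reject decision."""
    return True

def annotate_protocol(text, confidence_threshold=0.8):
    """Accept an annotation only if the engine is confident enough
    AND a human expert approves it (the 'human in the loop')."""
    accepted = []
    for annotation, confidence in extract_annotations(text):
        if confidence >= confidence_threshold and human_review(annotation):
            accepted.append(annotation)
    return accepted

annotations = annotate_protocol(
    "Compounds were tested in a luminescence assay; IC50 was determined."
)
print(annotations)
```

The design point the sketch captures is the division of labour: the machine does the high-volume extraction, while the expert's time is spent only on verification, which is what makes scaling to thousands of protocols plausible.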
Perhaps the greatest advantage offered by adopting a DataFAIRy-type approach to data management is the vast potential for time saving. As the cost of developing new drugs continues to rise, it is vital for scientists to work more productively, spending more time on analysis and as little as possible on preliminary research.
Why change now?
As datasets generated by organisations grow in volume and complexity, there is an urgent need for new search and storage methods that assist scientists working in R&D, rather than slowing them down. With the acceleration of digitalisation, we should look to new technologies and standards to solve the problems presented by manual search methods.
Projects like DataFAIRy encourage a collaborative approach between scientists and organisations, so that data can be accurately shared between teams and organisations, thus reducing the time wasted by errors in data, or by repeating experiments that have already been completed. With the Pistoia Alliance aiming to scale up DataFAIRy’s annotation process to thousands of assay protocols at a time in the next phase of the project, this method – and others like it – has the potential to transform the way in which bioassay protocols, as well as other types of important data, are recorded and searched in life sciences.
About the author
Vladimir Makarov is a consultant at Pistoia Alliance, a global, not-for-profit members’ organisation collaborating to lower barriers to R&D in life sciences. He has experience working in informatics and biotechnology, and has a PhD in Computational Biology.