Disruptive change by linked data and semantic technologies in healthcare and life sciences.
The pharmaceutical industry is undergoing an enormous shift in structure and strategy to reinvent how to bring new drugs to market efficiently.
Global drug development strategies are changing from a ‘we can do it all’ approach to functional outsourcing relationships, academic alliances, innovation networks with multiple players, consolidation and partnerships, expansion into emerging markets, public-private partnerships (PPP), orphan drugs and translational science. All reflect new thinking aimed at achieving higher R&D efficiency and greater return on investment.
Semantic technologies, based on advanced statistics, data mining, machine learning and knowledge management, have existed for years, but are gaining momentum as the need for efficient information management (IM) grows. Gartner recently named them in its ‘Top 10 Technology Trends Impacting Information Infrastructure, 2013’1 because of the renewed business requirement for monetising information as a strategic asset. Increasing volume, variety and velocity (big data) require semantic technology that makes sense of data for humans, or automates decisions.
Better application of semantic technologies in linked data combined with the urgent need for better data-sharing may well be the next disruptive wave for efficient information handling in healthcare and life sciences.
The need for disruptive innovation in the pharmaceutical sector
Sales of new drugs grow slowly, and it is no longer realistic to expect them to skyrocket to blockbuster status soon after launch. It typically takes five years or longer to achieve such success, and many never do. Against a backdrop of unprecedented investment in pharmaceutical research and development (R&D), drug approvals at the FDA are on the upswing if we look at the three-year trend, provided that trend is compared with the overall record rather than with the commonly cited productivity spike of 1996, which was due to a backlog. Still, very few blockbusters were released in 2012 and 2013, which certainly has an impact on aggregate industry revenue and on the need for new business models.
Pharmaceutical companies have introduced about 1,200 new drugs approved by the FDA since 1950. New-drug output over this period has been essentially constant, and remains so despite attempts to increase it. This reflects the limitations of current R&D models. The crucial question is how to achieve sustainability for a pharmaceutical industry that needs to embrace more radical change and seize the opportunity to redesign its business model2.
Strangely enough, this innovation crisis in the pharmaceutical industry is occurring in the midst of a new golden age of scientific discovery. Large companies must innovate and reinvent their research model, lowering costs and increasing output by harnessing the scientific diversity of biotechnology companies and academic institutions and combining it with their own development expertise. Some experiments have already brought radical and successful change and could be used as building blocks or for inspiration; for example, InnoCentive3, Chorus4, the Open Malaria Product Pipeline5, public-private partnerships such as IMI OpenPHACTS6 and IMI eTRIKS7, open-source R&D8, FIPNet9, opening clinical trial data10, accelerated clinical discovery using self-reported patient data collected online and patient-matching algorithms11, and various combinations of these and other initiatives. Harnessing the ‘global brain’ to access the best science and ideas, wherever they may be, demonstrates the key advantages of open R&D architectures: they amplify competition, reduce costs, increase agility and enable ‘disruptive innovation’ that originates outside the corporate walls, in bottom-up approaches driven by passionate entrepreneurs.
What are linked data, open data and semantic technologies?
Understanding linked data requires an understanding of the world wide web (WWW) itself. Most data published on the WWW exists as raw dumps in comma-separated values (CSV), XML or marked-up HTML table formats. These formats make it very difficult for people to mix and match different content and so make use of this growing wealth of information. The terms ‘linked data’ and ‘open data’ are not the same: data that is accessible on the WWW is ‘open’, but that does not mean it can be linked to other sources. ‘Open data’ is simply data on the WWW, whereas ‘linked data’ is data ‘in’ the web. The term ‘linked data’ is often associated with the ‘Web of Data’ or the ‘Semantic Web’, a concept proposed by Tim Berners-Lee. Linked data uses machine-readable formats that make it easy for web applications to operate and communicate without human intervention or additional programming.
Linked data is being adopted more and more widely to discover and connect new and legacy data sources, private and public, making big datasets useful by surfacing unique information and uncovering new patterns that improve decision-making. In the next five years, semantic technologies will clearly play key roles in modernising information management and in making information governance increasingly important, closely tied to the trends of big data and modern information infrastructure.
Semantic technologies extract meaning from data, ranging from quantitative data and text to video, voice and images. Many of these techniques have existed for years and are based on advanced statistics, data mining, machine learning and knowledge management. One reason they are garnering more interest is the renewed business requirement for monetising information as a strategic asset. Even more pressing is the technical need: increasing volume, variety and velocity (big data) in IM and business operations require semantic technology that makes sense of data for humans, or automates decisions1.
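To make the difference between a raw CSV dump and data ‘in’ the web more concrete, the following is a minimal sketch, not a production converter. It turns CSV rows into subject-predicate-object triples written as N-Triples, the kind of machine-readable form linked data relies on. The `example.org` namespaces, the column names and the compound and protein identifiers are all hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical namespaces, used only for illustration.
COMPOUND = "http://example.org/compound/"
PROTEIN = "http://example.org/protein/"
VOCAB = "http://example.org/vocab/"

# A small, made-up CSV dump of the kind commonly published on the web.
RAW_CSV = "id,name,target\nC-1,compound one,P-1\nC-2,compound two,P-1\n"


def csv_to_triples(text):
    """Turn each CSV row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{COMPOUND}{row['id']}>"
        # The name stays a plain literal value.
        triples.append((subject, f"<{VOCAB}name>", f'"{row["name"]}"'))
        # The target becomes a URI, so any other dataset that uses the
        # same URI is automatically linked to this row.
        triples.append((subject, f"<{VOCAB}target>", f"<{PROTEIN}{row['target']}>"))
    return triples


def to_ntriples(triples):
    """Serialise triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(csv_to_triples(RAW_CSV)))
```

In this sketch both compounds point at the same protein URI, which is what allows an application to connect them without any additional programming.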
The ‘five-star scheme’: the rules reusable open data must follow
It is essential that open data complies with the five-star scheme for assessing the degree to which datasets are reusable, proposed by Tim Berners-Lee, inventor of the WWW and director of the World Wide Web Consortium (W3C)12.
★ Available on the web (whatever format) but with an open licence.
★★ As machine-readable structured data (eg Excel instead of an image scan of a table).
★★★ Using non-proprietary formats (eg CSV instead of Excel).
★★★★ All the above plus use of open standards from W3C to identify things.
★★★★★ All the above, plus link your data to other people’s data providing context.
All data should have metadata about the data itself, and that metadata should be available from a major catalogue. Any open dataset (or even a dataset that is not open but should be) should be registered. Linked data is then essential to actually connect the semantic web. Linking is quite easy to do and should become second nature, taking into account various common-sense considerations to determine when to make a link and when not to. If healthcare data supports and applies these conventions in a scalable architecture, linked data will disrupt data-driven R&D, resulting in significantly shorter drug discovery cycles.
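The cumulative nature of the five-star scheme above, where each star requires all the previous ones, can be sketched as a small scoring function. This is an illustrative assumption about how one might encode the scheme; the `Dataset` class and its field names are hypothetical and not part of any W3C tooling.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset."""
    open_licence: bool          # on the web with an open licence
    machine_readable: bool      # structured data, not a scanned image
    non_proprietary: bool       # eg CSV instead of Excel
    uses_w3c_identifiers: bool  # open W3C standards to identify things
    links_to_other_data: bool   # linked to other people's data


def star_rating(ds):
    """Count stars cumulatively: a level only counts if all earlier ones hold."""
    levels = [
        ds.open_licence,
        ds.machine_readable,
        ds.non_proprietary,
        ds.uses_w3c_identifiers,
        ds.links_to_other_data,
    ]
    stars = 0
    for met in levels:
        if not met:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    open_csv = Dataset(True, True, True, False, False)
    print(star_rating(open_csv))
```

A dataset that links to other data but is published in a proprietary format would still stop at two stars under this reading, because the third level is not met.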
Open data to lever open innovation in the biomedical space: trust, but verify
Open innovation is the use of purposive inflows and outflows of knowledge to accelerate innovation. With knowledge now widely distributed, companies cannot rely entirely on their own research, but must acquire inventions or intellectual property from other companies to help advance their business model. HW Chesbrough explains this well in his book Open Innovation: The New Imperative for Creating and Profiting from Technology18. Open innovation is changing the research function: researchers are expanding their thinking from mainly using inside knowledge to using more and more external knowledge through the mechanism of knowledge brokering. This means linking and sharing inside and outside knowledge in both directions to get the most value out of it. The approach is gaining uptake in government, in life sciences and in other industries, with new incentive and reward mechanisms. Merck already understood this in 2000, as its annual report stated: “Merck accounts for about 1% of the biomedical research in the world. To tap into the remaining 99%, we must actively reach out to universities, research institutions and companies worldwide to bring the best of technology and products into Merck. The cascade of knowledge flowing from biotechnology and the unravelling of the human genome is far too complex for any one company to handle alone.”
What we still see in similar linked data initiatives in healthcare and life sciences is that researchers remain sceptical about the completeness, freshness and quality of the data. Atul Butte, researcher in biomedical informatics and biotechnology entrepreneur in Silicon Valley, gave a very good answer to a question about scepticism towards mining public databases for research. Should someone trust data from experiments they haven’t done or overseen themselves? “Any one data set or database might not be fully trustworthy but there is wisdom and accuracy in the crowd. If two, three, a dozen or a hundred such databases all exist and are open for use, it might be intuitive for some to see that what is seen in common across those databases might be of the highest accuracy, and not subject to the measurement biases that might be present in any one database. In the end, one has to follow that old Ronald Reagan aphorism, ‘trust, but verify’”19.
This is, of course, the right approach, and technology is ready to support this research model. Open data will create added value not only in early R&D, where open data is more prominent, but also in decision-making for physicians in clinical trials and beyond. Independent researchers must be able to reproduce findings and use metadata to analyse and link similar trials, so that they can easily discover new insights and find new ways to faster and better drug discovery.
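The ‘wisdom in the crowd’ idea quoted above, trusting what several independent databases have in common, can be sketched as a simple agreement count. This is a minimal illustration under stated assumptions: the database names, the (compound, target) pairs and the `min_support` threshold are all made up, and real cross-database verification would also need identifier mapping and provenance checks.

```python
from collections import Counter


def consensus(sources, min_support=2):
    """Keep only assertions reported by at least `min_support` sources.

    `sources` maps a source name to an iterable of hashable assertions.
    Each source counts at most once per assertion.
    """
    counts = Counter()
    for assertions in sources.values():
        counts.update(set(assertions))
    return {a: n for a, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Hypothetical (compound, target) associations from three open databases.
    databases = {
        "db_a": [("C-1", "P-1"), ("C-2", "P-9")],
        "db_b": [("C-1", "P-1"), ("C-3", "P-4")],
        "db_c": [("C-1", "P-1"), ("C-3", "P-4"), ("C-3", "P-4")],
    }
    print(consensus(databases, min_support=2))
```

Assertions seen in only one source are not discarded as wrong; they are simply the ones that most need verifying.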
Who can ask the right question will win
Data-driven science works faster and more effectively because billions of measurements are made across the health system every day. Every time a physician orders a medication, every time a pharmacist dispenses a drug, every time a blood test is performed, every scan taken… it all ends up in a database somewhere. Understanding how all these measurements relate or link to each other generates new ideas and the ability to ask the right question.
In the past 10 years a lot has been invested in the growing number of online resources that enable new methods in drug discovery. Many freely available resources include software, databases and online tools covering molecular properties including ADMET, QSAR study descriptors, compound profiling, protein-protein interactions, sequence alignment information and off-target predictions17. Moreover, clinical trial data and patient information are becoming available online. These growing, massive sources of big data have barely been tapped in a linked way for medical research. New, easy-to-use solutions driven by community efforts and crowdsourcing practices must emerge on the horizon to allow curated cross-domain mappings, so that linked standards, parameters and critical values enable researchers to ask the right question in this wealth of big data. The right standards, benchmarks, content-based authorisation and transparent licensing need to be in place in a scalable architecture to make appropriate use of all of this data. Federated querying could be the answer for generating ‘our view’ of the domain in question. It is flexible and easy to modify, removes the need for low-level knowledge of the underlying data sources, simplifies querying, manages changes in the integrated data sources and allows easier integration with other data sources, strengthening the knowledge feed that comes out of combined internal and external sources. Performance is still the biggest bottleneck of the federated approach, and where heavy analytics are required the traditional data warehouse clearly still wins. We believe that a combined approach will be the answer for integrating huge sets of heterogeneous data.
This is quite a challenge, but the technology pieces are out there, and linked data using the semantic technology stack is one of them. Of course, we and our community hope and plead for more funding and sponsorship in this space to make it happen sooner rather than later.
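The federated approach described above can be sketched in a few lines: each source keeps its own schema, a thin adapter per source maps records to a shared view, and one query fans out to all adapters. This is a simplified in-memory illustration, not a real federated query engine; the two sources, their field names and the records in them are hypothetical.

```python
# Two hypothetical sources with different field names for the same concept.
INTERNAL_ASSAYS = [
    {"cmpd": "C-1", "gene": "GENE_A"},
    {"cmpd": "C-2", "gene": "GENE_B"},
]
PUBLIC_TRIALS = [
    {"compound_id": "C-1", "phase": "II"},
    {"compound_id": "C-3", "phase": "I"},
]


def query_internal(compound):
    """Adapter: map internal assay records to the shared view."""
    return [
        {"compound": r["cmpd"], "target": r["gene"], "source": "internal"}
        for r in INTERNAL_ASSAYS
        if r["cmpd"] == compound
    ]


def query_public(compound):
    """Adapter: map public trial records to the shared view."""
    return [
        {"compound": r["compound_id"], "phase": r["phase"], "source": "public"}
        for r in PUBLIC_TRIALS
        if r["compound_id"] == compound
    ]


def federated_query(compound, endpoints):
    """Send the same question to every endpoint and merge the answers."""
    results = []
    for endpoint in endpoints:
        results.extend(endpoint(compound))
    return results


if __name__ == "__main__":
    for record in federated_query("C-1", [query_internal, query_public]):
        print(record)
```

The caller never sees the source-specific field names, and adding a data source only means adding one adapter. Every query still visits every source, which is where the performance bottleneck mentioned above comes from.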
Democratising technologies and gamification make disruptive innovation happen
Clayton Christensen explains disruptive innovation in a way that applies to semantic technologies, which must be transformed from an expensive and complicated product to which only a few people have access into a more affordable and accessible solution available to a much larger public13. There is a very urgent need for democratised semantic technology that everybody has access to, so that linked data becomes reality not only for R&D experts but for all scientists, healthcare professionals, policymakers and patients.
A team of MIT and Max Planck researchers released EyeWire14,15, an online game that turns any user into a citizen scientist. By ‘playing’ EyeWire, a game of colouring brain images, citizen scientists help map the connections of a neural network without prior specialised knowledge of neuroscience. Improvements in the underlying computational technology will eventually make it powerful enough to detect ‘miswirings’ of the brain that are hypothesised to underlie disorders such as autism and schizophrenia. This is an example of a new way of doing science that harnesses the power of up to 50,000 people online for free, instead of paid experts, and it has many years to go. It shows that democratised tools with easy-to-understand interfaces and a layer of gamification lower the threshold for engaging users quickly and on a large scale. The same approach will be applied successfully in the linked data field in the coming years, helping semantic technologies break through in sharing and integrating information. These approaches are often called killer apps, but in reality they are agile, bottom-up, fast-growing start-up business models taking on new markets.
Semantic technologies and linked data to reinforce the call for sharing in healthcare and life sciences
From the dawn of time, the sharing of knowledge has been one of the main forces driving science and innovation. Yet in recent decades, a proprietary culture, which wrongly posits that all intellectual property must be restricted, has spread across the pharmaceutical industry and threatens to stall the engine that has given us so many valuable outcomes. Pharmaceutical companies, together with universities and government agencies, will gain much from reversing that trend and engaging in widespread collaboration early in the research process, expanding foundational knowledge and creating a shared infrastructure to tap it, so that everyone is better informed and more successful as a result.
The new internet, the semantic web, offers unprecedented potential and holds the hope of repowering a new golden age of drug discovery. By eliminating duplication and redundancy, it lets pharmaceutical companies compete in areas that offer a viable return on investment, while pre-competitive collaboration helps all of us discover new therapies more effectively and efficiently, as patients and society demand.
In summary, we want to reinforce and reissue the call to action stated by Bernard Munos and William Chin, together with many colleagues in the pharmaceutical industry, universities and government agencies, to join hands and intensify sharing in order to help repower pharmaceutical innovation16. We believe semantic technologies, if applied appropriately and in an agile, democratised way, could catalyse this change and make it happen at an incredible and continuously growing speed.
Hans Constandt is CEO of Ontoforce. He has a background in medicine, biotechnology, business modelling, bioinformatics, portfolio management, knowledge management and business liaison in the pharmaceutical industry. His expertise in drug discovery, ICT, sustainable business models and fundraising serves to bring innovative solutions like DISQOVER to reality in healthcare and life sciences.
1 Zaino, Jennifer. March 7, 2013. Gartner Names Semantic Technologies To Its Top Technology Trends Impacting Information Infrastructure in 2013 by SemanticWeb.com. http://semanticweb.com/gartner-names-semantictechnologies-to-its-toptechnology-trends-impactinginformation-infrastructure-in-2013_b35767.
2 Munos, B. Lessons from 60 years of pharmaceutical innovation. Nat. Rev. Drug Discov. 8, 959-968 (2009).
3 Wilan, K. Profile: Alpheus Bingham. Nature Biotech. 25, 1072 (2007).
4 Bonabeau, E, Bodick, N and Armstrong R. A more rational approach to new product development. Harvard Bus. Rev. (March 2008).
5 Moran, M et al. The malaria product pipeline. The George Institute for International Health, Sydney, 2007.
6 OpenPHACTS IMI JU funded project – http://www.openphacts.org/.
7 eTRIKS IMI JU funded project – http://www.etriks.org/.
8 Munos, B. Can open-source R&D reinvigorate drug research? Nature Rev. Drug Discov. 5, 723–729 (2006).
9 Maurer, S. Choosing the right incentive strategy for research and development in neglected diseases. Bull. World Health Organ. 84, 376–381 (2006).
10 Rabesandratana, Tania. Drug Watchdog Ponders How to Open Clinical Trial Data Vault. Science 22 March 2013: Vol. 339 no. 6126 pp. 1369-1370.
11 Wicks, Paul, Vaughan, Timothy E, Massagli, Michael P and Heywood, James. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm. Nature Biotechnology 29, 411-414 (2011).
12 Berners-Lee, Tim. Tim Berners-Lee proposed 5 star scheme. June 2009. http://www.w3.org/DesignIssues/LinkedData.html.
13 Disruptive Innovation Explained. HBR Blog Network. March 6 2012. http://blogs.hbr.org/video/2012/03/disruptive-innovationexplaine.html.
14 Eyewire, play a game to map the brain. http://eyewire.org/.
15 Neural ‘connectomics’ game unveiled. Nature News Blog 11 December 2012. http://blogs.nature.com/news/2012/12/neural-connectomicsgame-unveiled.html.
16 Munos, Bernard H and Chin, William W. A Call for Sharing: Adapting Pharmaceutical Research to New Realities. Drug Discovery – 2 December 2009; Volume 1 Issue 9 9cm8.
17 Villoutreix, BO, Lagorce, D, Labbé, CM, Sperandio, O and Miteva, MA. One hundred thousand mouse clicks down the road: selected online resources supporting drug discovery collected over a decade.
18 Chesbrough, HW. Open Innovation: The New Imperative for Creating and Profiting from Technology.
19 Steakley, Lia. Atul Butte discusses why big data is a big deal in biomedicine. Ask Stanford Med, Public Health, Research, Technology. April 29th, 2013. http://scopeblog.stanford.edu/2013/04/29/atul-butte-discusseswhy-big-data-is-a-big-deal-inbiomedicine/.
20 Say goodbye to endless lists of pointless search results at app.disqover.com.
21 Heudecker, Nick (Analyst). Hype Cycle for Big Data, 2013. 31 July 2013 G00252431.