How large language models can elevate clinical research

Deepika Khedekar investigates the use of LLMs in clinical trials and how we can overcome the potential challenges to their wider application.

$2.3 billion1 is the estimated cost of bringing a new drug from the drawing board to pharmacy shelves today. Clinical trials make up a large part of this investment, and yet 95% of them fail2. Among the many challenges these trials face, a new one has lately become obvious: data.

The healthcare industry is a data-producing giant, churning out about 30% of the world’s data3. Hospitals alone generate 50 petabytes of data every year4. A single phase III clinical trial can rack up 3.56 million data points5. With nearly 479,000 trials6 registered globally as of early 2024, we are facing an outright data avalanche. Each piece of this data is a potential breakthrough waiting to happen, but the sheer volume is overwhelming.

With all these incredibly diverse and complex research datasets, how do we sift through them to find those golden insights that can lead to more efficient clinical trials and, ultimately, better healthcare outcomes? How do we move past the daunting 95% failure rate2 we encounter in clinical trials? This is where large language models (LLMs) can help.

LLMs in clinical trials: Overview and impact

LLMs are artificial intelligence (AI) systems designed to understand and interpret text with an impressive level of precision. You’ve probably heard of some big names like OpenAI’s GPT, Meta’s LLaMA, or Google’s PaLM. These are top-tier LLMs in the field. Now, you might wonder: how exactly can these advanced AI models propel clinical research forward?

In the context of clinical trials, these models can help us add context to insights. Conventional AI models allow us to extract insights from clinical data, but those insights lack context. The strength of LLMs lies in their ability to supply it. We can therefore use these models to chew through vast amounts of text – from medical studies and patient records to heaps of research data – and extract meaningful, context-driven insights to support clinical studies. This capability is particularly beneficial in identifying patterns, trends, and correlations within the data, which may not be immediately apparent to human researchers. By doing so, LLMs can uncover valuable insights that inform decision-making processes, enhance the accuracy of clinical research outcomes, and streamline the development of therapeutic interventions. The fusion of LLMs with clinical research processes marks a significant evolution and can help us reshape our approach to drug development and patient care.

So, what are the roles of LLMs in clinical trials? First off, data management has been a herculean task in the clinical research cycle, with researchers traditionally sifting through electronic health records (EHRs) for patient information. LLMs, such as GatorTron, are revolutionising this process by enabling the extraction of context-driven insights with precision and speed, turning what used to be a month-long endeavour into a matter of days in clinical trials.

Patient recruitment, a critical yet challenging phase of trials, is also getting a boost from LLMs. By swiftly analysing vast datasets, these models identify potential participants more efficiently, ensuring trials commence without unnecessary delays. Similarly, for protocol development, platforms like Vial leverage LLMs to condense weeks of work into just a couple of days. And when it comes to patient engagement, simplifying informed consent documents into clear and concise language has become more manageable, thanks to the natural language processing capabilities of LLMs.

However, it’s not just about speeding up processes. LLMs offer a more nuanced understanding of data, assisting in everything from fine-tuning trial protocols to ensuring regulatory compliance. They’re essentially becoming invaluable assistants in clinical trials, tasked with a range of responsibilities that extend far beyond what was initially imagined. Organisations across the world have started integrating LLMs in healthcare and clinical research.

The following examples showcase just how promising the technology is and illustrate the profound impact LLMs have on the field of clinical research.

Streamlining clinical trials with GPT-3 inspired LLMs

Clinical trials often grapple with the cumbersome task of sifting through electronic health records (EHRs) to extract critical patient data, a process mired in inefficiency due to the prevalence of medical jargon and abbreviations. This challenge not only delays the initial phases of trials, such as patient selection and recruitment, but also hampers the accurate monitoring and assessment of trial outcomes. Traditional methods for processing these records require extensive manual labour and domain expertise, leading to significant delays and potential inaccuracies in clinical research.

A study7 by MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) shines a light on the capabilities of LLMs, such as a GPT-3 inspired model, to revolutionise this process. The model demonstrated up to 90% accuracy in sifting through and making sense of complex medical information contained within clinical notes, all without relying on manually labelled data. This represents a significant stride towards optimising the data extraction process, bolstering the efficiency and precision of patient selection flows, data normalisation methods, and enabling a holistic trial management system. The CSAIL team’s application of LLMs in enabling automated and context-driven processing of clinical data not only offers a scalable answer to a longstanding challenge but also underscores the vast potential of these models to elevate the calibre and velocity of clinical research. This is especially true for therapeutic areas where the failure rates of clinical trials are extremely high, such as oncology.
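To make the idea of working "without manually labelled data" concrete, here is a minimal Python sketch of a zero-shot prompt for expanding abbreviations in a clinical note. It is an illustration only, not the CSAIL team's method: the prompt wording, the function names and the `fake_model` stand-in are all assumptions, and a real deployment would replace `fake_model` with a call to an actual LLM.

```python
import json
from typing import Callable


def build_prompt(note: str) -> str:
    """Build a zero-shot prompt: instructions only, no labelled examples."""
    return (
        "Expand every medical abbreviation in the clinical note below.\n"
        "Reply with a JSON object mapping each abbreviation to its expansion.\n\n"
        f"Note: {note}\n"
    )


def expand_abbreviations(note: str, complete: Callable[[str], str]) -> dict[str, str]:
    """Ask the model for expansions and keep only abbreviations found in the note.

    `complete` is a hypothetical callable that sends a prompt to an LLM and
    returns its text reply.
    """
    raw = complete(build_prompt(note))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # Unparseable reply: return nothing rather than guess.
    if not isinstance(parsed, dict):
        return {}
    # Drop any key the model invented that does not appear in the note.
    return {
        abbr: meaning
        for abbr, meaning in parsed.items()
        if isinstance(meaning, str) and abbr in note
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): returns a hard-coded reply."""
    return (
        '{"pt": "patient", "SOB": "shortness of breath", '
        '"MI": "myocardial infarction"}'
    )


note = "pt c/o SOB, hx of HTN"
result = expand_abbreviations(note, fake_model)
print(result)  # {'pt': 'patient', 'SOB': 'shortness of breath'}
```

The "MI" entry is discarded because it never appears in the note, a small guard against the model adding content that is not in the record.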

The COLT initiative: Specialised LLMs for oncology trials

In the realm of oncology trials, the extraction and interpretation of clinical data from notes written by doctors and healthcare professionals poses a significant challenge, hindering the scalability and efficiency of advanced cancer research programmes. An oncology trial generates about 3.1 million data points, roughly 60% more than the 1.9 million generated by non-oncology studies8. The prevailing method of manual abstraction, despite being the most accurate, is labour-intensive, expensive, and severely limits the potential for large-scale data analysis in oncology research programmes.

Addressing this critical issue, Triomics, in collaboration with the National Cancer Center Informatics Society (Ci4CC)9, has launched COLT (Collaboration for Oncology-focused LLM Training), an initiative to develop an oncology-focused clinical trials language model. This project aims to leverage the capabilities of LLMs to automate the data extraction and quality control workflows in oncology clinical research programmes. By training a specialised LLM with over 30 billion parameters on structured and unstructured oncology datasets, the initiative promises to significantly enhance the scalability and efficiency of such research programmes. This targeted approach is designed to overcome the drawbacks of traditional LLMs, providing a scalable way to address key challenges in cancer clinical trials.

By employing LLMs innovatively, COLT not only streamlines the data management process but also accelerates the pace at which clinical trials can deliver actionable insights, ultimately advancing the field of cancer treatment. The trend of developing specialised LLMs extends beyond oncology, as demonstrated by GatorTron, showcasing the growing interest in leveraging this technology for comprehensive clinical research applications.

The GatorTron breakthrough: A leap forward in clinical trial data management

GatorTron GPT, the largest specialised Large Language Model (LLM) for clinical research to date10, marks a significant leap forward in tackling the primary challenge of clinical Natural Language Processing (NLP): the efficient extraction and interpretation of the vast amounts of patient information hidden within EHRs. This challenge, critical for identifying suitable trial participants, monitoring their progress, and accurately assessing outcomes, has traditionally been met with slow, expensive, and hard-to-scale manual parsing of unstructured clinical narratives, creating a bottleneck in clinical research.

Built from a dataset of over 90 billion words, including clinical notes and scientific literature, and trained on data from more than two million patient records from the University of Florida health system11, GatorTron utilises up to 8.9 billion parameters. At this scale, it is hundreds of times larger than any pre-existing language model in healthcare, enabling the automation of clinical information extraction with unparalleled accuracy. By significantly reducing the time and resources needed for data management in clinical trials, GatorTron streamlines the processes of identifying trial participants and assessing outcomes. This not only accelerates the research cycle, paving the way for the swift development and deployment of medical innovations, but also highlights the crucial role of specialised LLMs like GatorTron in evolving clinical research.

Navigating challenges: Integrating LLMs in clinical trials

Integrating LLMs into clinical trials brings its own set of challenges and risks, spanning from ethical considerations to patient safety and beyond. At the heart of these concerns is the ethical handling of sensitive patient data. Ensuring the confidentiality and security of this information is paramount, yet the integration of LLMs introduces complex cybersecurity risks. The vast pools of data fed into these models can become targets for breaches, putting patient privacy at risk. Moreover, the current landscape lacks comprehensive regulations specifically tailored to the use of AI in healthcare, leaving a grey area around the responsible deployment of LLMs. This regulatory vacuum raises questions about accountability and oversight, especially in scenarios where trial outcomes and patient safety could be directly affected.

Another significant hurdle is the phenomenon known as the AI model hallucination effect, where LLMs might generate convincing yet inaccurate or unverified outputs. In clinical trials, such errors could mislead research efforts, endangering patient safety. Compounding this issue is data fragmentation, a scenario where patient data is scattered across various healthcare organisations, stored in incompatible formats, and located in different places. This lack of data uniformity and accessibility hampers LLMs’ learning processes, as they thrive on vast, integrated datasets to develop accurate insights. Without access to comprehensive and diverse patient information, there’s a risk that LLMs could produce biased or incomplete analyses. Addressing these challenges requires careful consideration of how LLMs are implemented in clinical research, ensuring that advancements in AI technology are matched with robust ethical, safety, and regulatory standards to support their beneficial use.
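One simple safeguard against hallucinated output is to check that every value a model claims to have extracted actually appears in the source note, and to route anything that does not to a human reviewer. The Python sketch below illustrates that idea under stated assumptions: the function names and sample data are hypothetical, and a verbatim match is a deliberately crude test that will miss paraphrases, so it complements expert review rather than replacing it.

```python
import re


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(source_note: str, extracted: dict[str, str]) -> list[str]:
    """Return the names of extracted fields whose values are absent from the note."""
    haystack = normalise(source_note)
    flagged = []
    for field, value in extracted.items():
        needle = normalise(value)
        # An empty value, or one not found verbatim in the note, is unsupported.
        if not needle or needle not in haystack:
            flagged.append(field)
    return flagged


# Hypothetical note and model output, invented for illustration.
note = (
    "62-year-old female. Diagnosis: stage II breast cancer.\n"
    "Started tamoxifen 20 mg daily."
)
extracted = {
    "diagnosis": "Stage II breast cancer",
    "drug": "tamoxifen 20 mg",
    "allergy": "penicillin",  # Not mentioned anywhere in the note.
}

flagged = ungrounded_fields(note, extracted)
print(flagged)  # ['allergy']
```

Here the "allergy" field is flagged because nothing in the note supports it, while the diagnosis and drug pass despite differences in capitalisation.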

A path forward with LLMs in clinical research

As we explore the potential of LLMs in clinical research, it’s evident that their adoption comes with significant challenges, including risks of misinformation, biases, data fragmentation, cybersecurity threats, and ethical concerns. Addressing these issues demands a proactive approach, combining advanced cybersecurity measures, ethical guidelines, and tailored regulatory frameworks, alongside fostering interdisciplinary collaboration among technologists, clinicians, ethicists, and policymakers. By tackling these challenges head-on, we can harness LLMs’ innovative capabilities to streamline clinical trials, ensuring that the evolution of healthcare research is both revolutionary and responsible, with unwavering commitment to scientific integrity, ethical conduct, and patient safety.


  1. Deloitte pharma study: Drop-off in returns on R&D investments – sharp decline in peak sales per asset. Accessed Jan 2024.
  2. Wong CH, Siah KW, Lo AW. Estimation of clinical trial success rates and related parameters. Biostatistics 2019;20(2):366.
  3. RBC Capital Markets. The healthcare data explosion. Accessed Feb 2024.
  4. Alysa Taylor. Microsoft Official Blog. Microsoft introduces new data and AI solutions to help healthcare organizations unlock insights and improve patient and clinician experiences. Accessed Feb 2024.
  5. Greg Licholai. Forbes. AI In Clinical Research Now And Beyond. Accessed Feb 2024.
  6. Total number of registered clinical studies worldwide since 2000 as of January 2024. Accessed Jan 2024.
  7. Rachel Gordon. MIT CSAIL. MIT News. Large language models help decipher clinical notes. Accessed Jan 2024.
  8. Charlie Passut. CenterWatch. Oncology Trials Outpacing Rest of the Field in Complexity and Duration, Study Shows. Accessed Jan 2024.
  9. National Cancer Center Informatics Society (Ci4CC). Onco-LLM – Multi Center Oncology AI Initiative & Collaboration Building The Nation’s first Clinical Research LLM. Accessed Feb 2024.
  10. Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. npj Digit Med. 2023;6:210.
  11. Gatortron – The Biggest Clinical Language Model. Accessed Feb 2024.

About the author

Deepika Khedekar is a Centralized Clinical Trial Lead at IQVIA, where she spearheads clinical trial monitoring programmes for major pharmaceutical companies. In her 12+ years in the pharmaceutical industry, she has led clinical trial programmes in diverse therapeutic areas for leading US and Australia-based pharmaceutical organisations, such as Gilead Sciences, Macleods Pharma and Arrowhead Pharmaceuticals.

