Language models in biomedical and clinical tasks

Exploring the use cases and limitations

Ahmad Albarqawi
8 min read · Jan 23


Imaginary image representing language models as a growing brain, generated by the author using Midjourney

Large language models (LLMs) provide unprecedented opportunities to augment human work in many industries, including healthcare. However, understanding their limitations and the available mitigations is essential before applying them in regulated environments. In recent years, multiple studies have proposed new techniques for tuning existing language models for medical tasks, and selecting the most suitable model for a specific task requires weighing each model’s capabilities and weaknesses.
The availability of scientific data on the internet has advanced fine-tuning techniques and produced domain-specific models like SciBERT and BioBERT, which are trained on biomedical and clinical sources to perform well on specific tasks. Most medical models in industry are fine-tuned models, which require extensive training data to achieve high-quality results in a new domain, so their expansion is limited by data availability. Few-shot learners such as GPT-3 can adapt to a new domain from zero or a few examples, removing the operational cost of labeling vast amounts of data. Still, they come with issues, like the tendency to generate nonfactual information, and research is restricted because GPT-3 is available only behind an API. In response, Meta released open-source few-shot learners that researchers can download, leading the way to understanding these models and preparing them for industry. A recent alternative to relying on pre-training alone is retrieval models that can search trillions of words, which can address potential privacy, bias, and toxicity concerns of using language models in a healthcare system while maintaining quality.

Biomedical and clinical tasks for language models

Biomedical and clinical tasks encompass many research areas, but this review focuses on the functions and datasets that language models can address. They consist of, but are not limited to:

  • Named entity recognition (NER) to identify chemical and disease concepts and entities of interest in microbiology.
  • Identify mentions of proteins, genes, and species.
  • Annotate a variety of medical concepts in clinical text.
  • Classify cancer concepts and chemical-protein interactions from scientific articles.
  • Extract annotated gene-disease interactions from literature, journals, and books.
  • Summarise medical dialogue, capturing medically relevant information.
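
To make the sequence labeling tasks in this list more concrete, here is a minimal Python sketch of how entity annotations are commonly turned into per-token labels using the BIO (begin, inside, outside) convention. The sentence, the spans, and the helper name `spans_to_bio` are made up for illustration; the datasets discussed in this review may use different label sets or tagging schemes.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) over token indices to BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"  # remaining tokens of the entity
    return tags


# Made-up sentence and annotations, for illustration only.
tokens = ["Aspirin", "reduces", "the", "risk", "of", "heart", "attack", "."]
spans = [(0, 1, "Chemical"), (5, 7, "Disease")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A fine-tuned model for NER is then trained to predict one such tag per token.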

For conclusions and proposed research areas, you can skip to the “Proposed solutions and research area” section.

Review and results

language models types — image by author

The focus of this study is not only to evaluate existing language models for medical tasks, but also to propose research areas that can improve the use of existing models in production systems. To that end, I reviewed research focusing on biomedical tasks as well as papers focusing on improvements to large language models.

Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art

This paper evaluates fine-tuned language models for biomedical and clinical tasks. The authors compared six models with five publicly available models, including biomedical-domain models like SciBERT and BioBERT, clinical models like ClinicalBERT, and general language models like RoBERTa, as well as BioMed-RoBERTa. They also built a custom bio-clinical-RoBERTa model.
The evaluation used a wide range of datasets, including ten sequence labeling tasks and eight classification tasks, among others. The results showed that larger models performed better, even when comparing a large general model against a smaller fine-tuned model (Table 1). The large general models did well on clinical tasks without fine-tuning. However, fine-tuning significantly improved results on biomedical tasks as well as clinical tasks.
The paper provides a detailed introduction to the different biomedical tasks and is a valuable reference for future work. However, it does not cover GPT-3-scale language models with billions of parameters. The papers reviewed next address these much larger models and their expected impact on the biomedical industry.

Table 1 — Mean (M) of results for evaluated tasks.

Medically Aware GPT-3 as a Data Generator

Creating dialogue summarization models for the medical industry requires a lot of data to correctly train the model to capture relevant medical concepts and affirmations. However, obtaining this data can be difficult due to patient privacy concerns and the high cost and time required to label the data manually. In “Medically Aware GPT-3 as a Data Generator” the authors evaluate the use of GPT-3 to summarize medical dialogues while capturing relevant medical information.
The authors used OpenAI’s GPT-3 model, which can perform data generation tasks from a few examples. They found that by using GPT-3 to generate synthetic training data for a fine-tuned summarization model, they could scale 210 manually labeled examples to yield results comparable to training on 6400 manually labeled examples. The doctors who participated in the evaluation preferred the output of the model trained on a combination of human labels and synthetic data over that of the model trained only on human-labeled data. However, using GPT-3 directly for dialogue summarization missed important medical concepts and affirmations, which made the results unhelpful. To overcome this, the researchers introduced GPT3-ENS, an ensembling mechanism that infuses medical knowledge without training on extensive data. The technique invokes GPT-3 several times, each time with N examples randomly selected from the training data, to generate K candidate summaries, and chooses the one with the highest recall of medical concepts. The doctors found that GPT3-ENS produced higher-quality summaries than plain GPT-3.
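
The selection step can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the authors’ implementation: `summarize` is a hypothetical stand-in for a call to a few-shot model, the concept extractor is a toy substring match against a small made-up vocabulary (the paper relies on a proper medical concept recognizer), and all dialogues and summaries are invented.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Example = Tuple[str, str]  # (dialogue, reference summary)


def concepts_in(text: str, vocabulary: Set[str]) -> Set[str]:
    """Toy concept extractor: vocabulary terms that occur in the text."""
    lowered = text.lower()
    return {term for term in vocabulary if term in lowered}


def concept_recall(dialogue: str, summary: str, vocabulary: Set[str]) -> float:
    """Fraction of the dialogue's concepts that also appear in the summary."""
    source = concepts_in(dialogue, vocabulary)
    if not source:
        return 1.0  # nothing to capture
    return len(source & concepts_in(summary, vocabulary)) / len(source)


def ensemble_summary(
    dialogue: str,
    labeled_pool: Sequence[Example],
    summarize: Callable[[List[Example], str], str],  # hypothetical model call
    vocabulary: Set[str],
    n_examples: int = 2,
    k_trials: int = 3,
    seed: int = 0,
) -> str:
    """Generate k candidate summaries and keep the one with the highest concept recall."""
    rng = random.Random(seed)
    best_summary, best_recall = "", -1.0
    for _ in range(k_trials):
        shots = rng.sample(list(labeled_pool), n_examples)  # N random few-shot examples
        candidate = summarize(shots, dialogue)
        recall = concept_recall(dialogue, candidate, vocabulary)
        if recall > best_recall:
            best_summary, best_recall = candidate, recall
    return best_summary


# Stand-in for the model: returns canned candidates in order (illustration only).
_canned = iter([
    "Patient feels unwell.",
    "Patient reports fever and cough.",
    "Patient reports fever.",
])


def fake_summarize(shots: List[Example], dialogue: str) -> str:
    return next(_canned)


vocabulary = {"fever", "cough", "headache"}
pool = [
    ("My head hurts a lot.", "Patient reports headache."),
    ("I keep coughing at night.", "Patient reports cough."),
    ("I felt hot all day.", "Patient reports fever."),
]
dialogue = "I have had a fever and a dry cough since Monday."
best = ensemble_summary(dialogue, pool, fake_summarize, vocabulary)
print(best)
```

Scoring by recall alone favors summaries that mention every concept in the dialogue, which matches the goal described above of not missing medically relevant information.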

Table 2 shows sample GPT-3 summaries and illustrates that the GPT3-ENS summaries are of higher quality:

Table 2 — dialogue summarization results.

Overall, this paper presents a technique for building a medically aware GPT-3 by using an ensemble mechanism to infuse medical knowledge without training on extensive data. It addresses the limited availability of labeled data and helps the model’s dialogue summaries capture medical concepts.

Improving language models by retrieving from trillions of tokens

Retrieval-Enhanced Transformer (RETRO) by DeepMind is an autoregressive language model that uses a chunked cross-attention module to incorporate text retrieved from a database with trillions of tokens, with linear time complexity. RETRO is not limited to the data seen during training; retrieval extends the model’s knowledge with content from a massive dataset.
The experiments show that RETRO’s performance improves with the size of the retrieval dataset and the number of neighbors retrieved per input. The retrieval model produces results comparable to large language models like GPT-3 while using 25x fewer parameters on datasets such as C4, Wikitext103, Curation Corpus, Lambada, and the Pile. However, it performed poorly on the Pile’s DM Mathematics subset. An example of the model’s output:

Input: Beavers are interesting animals that live near rivers. They build

Completion (RETRO 7.5B): their houses called beaver dams in the riverbeds. They also live on land. Beavers use their strong teeth and strong jaws to cut down trees and branches to build their homes. They also use their teeth and jaws to chew up food.

This research shows that a retrieval-augmented language model can match or outperform larger models with fewer parameters, and that the model can be guided and its knowledge extended without full pre-training. This can significantly increase trust in using large language models in biomedical and clinical research by addressing some of their limitations related to patient safety and privacy. Additionally, retrieval models can mitigate privacy concerns because retrievable data can be erased at inference time without retraining the model, offensive data can be filtered retroactively, and the retrieved passages provide more insight into model outputs.
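
The point about erasing retrievable data can be illustrated with a toy retrieval store in Python. This is a simplified sketch, not RETRO itself: the bag-of-words `embed` function and the brute-force scan are stand-ins for the neural encoder and nearest-neighbor index a real system would need, and the class name, chunk identifiers, and texts are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a real system would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class RetrievalStore:
    """Hypothetical chunk database that the language model consults at inference time."""

    def __init__(self) -> None:
        self._chunks: Dict[str, str] = {}

    def add(self, chunk_id: str, text: str) -> None:
        self._chunks[chunk_id] = text

    def erase(self, chunk_id: str) -> None:
        # Removing a chunk changes future retrievals; no model weights are touched.
        self._chunks.pop(chunk_id, None)

    def neighbours(self, query: str, k: int = 2) -> List[str]:
        """Return the ids of the k chunks most similar to the query (brute-force scan)."""
        q = embed(query)
        ranked = sorted(
            self._chunks,
            key=lambda cid: cosine(q, embed(self._chunks[cid])),
            reverse=True,
        )
        return ranked[:k]


store = RetrievalStore()
store.add("guideline-1", "metformin is a first line treatment for type 2 diabetes")
store.add("note-42", "patient john started metformin for diabetes last week")
store.add("guideline-2", "ibuprofen can relieve mild pain and fever")

query = "metformin for diabetes"
before = store.neighbours(query)
store.erase("note-42")  # e.g. a patient record that must no longer be used
after = store.neighbours(query)
print(before, after)
```

Because the knowledge lives in the database rather than only in the weights, removing or filtering a source takes effect on the next query.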

Proposed solutions and research area

Many fine-tuned models offer a private environment for training specialized use cases, as they are published as open source and are small enough to run offline at an acceptable cost. For example, BioBERT is a fine-tuned BERT model focusing on the biomedical domain, and ClinicalBERT focuses on clinical tasks. The limitation of fine-tuned models is that they require large amounts of data to produce quality results.

The OpenAI API provides a GPT-3 model with billions of parameters that can perform data generation tasks in a zero-shot or few-shot setting. GPT-3 is available only through an API in a shared cloud, which introduces research limitations and privacy risks. Another hurdle is that the API accepts only a limited number of examples, with a maximum context window of 2048 tokens.
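
To show how the context window caps the number of few-shot examples, here is a minimal Python sketch that packs examples into a prompt until a token budget is reached. It rests on several assumptions: `count_tokens` approximates tokens with whitespace-separated words (a real tokenizer will count differently), the `Input:`/`Output:` prompt layout is made up, and `reserved_for_answer` is a hypothetical allowance for the completion, which shares the window with the prompt.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def build_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    max_tokens: int = 2048,
    reserved_for_answer: int = 256,
) -> Tuple[str, int]:
    """Add few-shot examples in order until the next one would exceed the budget."""
    budget = max_tokens - reserved_for_answer
    tail = f"Input: {query}\nOutput:"
    used = count_tokens(tail)
    if used > budget:
        raise ValueError("query alone does not fit in the context window")

    blocks: List[str] = []
    for source, target in examples:
        block = f"Input: {source}\nOutput: {target}\n\n"
        cost = count_tokens(block)
        if used + cost > budget:
            break  # no room for further examples
        blocks.append(block)
        used += cost
    return "".join(blocks) + tail, len(blocks)


# Tiny budget so the effect is visible; the texts are placeholders.
examples = [("a b c", "x")] * 5
prompt, n_used = build_prompt(examples, "q1 q2", max_tokens=30, reserved_for_answer=10)
print(n_used)
print(prompt)
```

With long medical dialogues as examples, only a handful fit in the window, which is one reason approaches that sample a few examples per call are attractive.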

The Retrieval-Enhanced Transformer (RETRO) by DeepMind is an alternative autoregressive language model that uses a chunked cross-attention module to incorporate text retrieved from a database with trillions of tokens. RETRO is not limited to the data seen during training; its chunking and embedding mechanism lets it extend the model’s knowledge with data retrieved from a massive dataset in linear time. Retrieval models help to resolve the nonfactual information issue because you can guide the model to retrieve from specific sources. However, these models do not specialize in particular tasks.

Retrievals with specialised models mixer — image by author

When domain specialization is required and we want to keep the model from answering with outdated training data or hallucinating nonfactual information, we can mix retrieval techniques with existing models to provide results from trusted sources tuned to a specific task.

Few-shot learners with fine-tuned models mixer (image by author)

When privacy is a priority, fine-tuned models allow training without privacy concerns because they are open and can run in a local environment with reasonable computing resources, while GPT-3-scale models with billions of parameters can assist them by providing training data and guidance. In this scenario, GPT-3 does not have access to private data like patient information; it is used only to synthesize training data from a few training examples.
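
A rough Python sketch of this split is shown below. It assumes every record carries a `private` flag and that `generate` is a hypothetical wrapper around an external few-shot model; the prompt wording, the records, and the function names are made up for illustration. Only non-private seed examples are ever placed in a prompt, while the local fine-tuned model would train on the private records plus the synthetic ones.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    text: str
    label: str
    private: bool  # True if the record contains patient information


def synthesize_training_set(
    examples: List[Example],
    generate: Callable[[str], str],  # hypothetical call to an external few-shot model
    per_seed: int = 2,
) -> List[Example]:
    """Expand the training set with synthetic examples built from non-private seeds only."""
    seeds = [e for e in examples if not e.private]  # private records never leave
    synthetic: List[Example] = []
    for seed in seeds:
        prompt = (
            f"Write a new example for the label '{seed.label}'.\n"
            f"Example: {seed.text}\nNew example:"
        )
        for _ in range(per_seed):
            synthetic.append(Example(generate(prompt), seed.label, private=False))
    # The local model trains on everything; the external model saw only the seeds.
    return list(examples) + synthetic


sent_prompts: List[str] = []


def fake_generate(prompt: str) -> str:
    """Stand-in for the external model; records what it was sent."""
    sent_prompts.append(prompt)
    return f"synthetic text {len(sent_prompts)}"


records = [
    Example("Generic note: persistent cough for two weeks.", "respiratory", private=False),
    Example("Jane Doe, MRN 0000, reports chest pain.", "cardiac", private=True),
]
training_set = synthesize_training_set(records, fake_generate)
print(len(training_set), len(sent_prompts))
```

In practice the synthetic examples would also need review before use, given the tendency of these models to generate nonfactual text.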

In conclusion, the challenges of existing models, such as generating nonfactual information and the tendency to generate toxic text, have limited their use in regulated production environments; however, following best practices and using a mix of models can mitigate those concerns.




Ahmad Albarqawi

Master’s data science scholar at UIUC.