Refining medical large language models: key insights from instruction tuning
- Academic Editor
- Shibiao Wan
- Subject Areas
- Bioinformatics, Artificial Intelligence, Computational Linguistics, Data Mining and Machine Learning, Data Science
- Keywords
- Large language model, Natural language processing, Medical LLM, Instruction fine tuning, Clinical applications, NLP tasks, Dataset curation, Instruction tuning
- Copyright
- © 2025 Alqahtani et al.
- Licence
- This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
- Cite this article
Alqahtani et al. 2025. Refining medical large language models: key insights from instruction tuning. PeerJ Computer Science 11:e3216 https://doi.org/10.7717/peerj-cs.3216
Abstract
This literature review introduces a comprehensive summary of the most recent scholarly work on instruction-tuning strategies for medical large language models (Med-LLMs). It begins by reviewing three fundamental approaches to creating an instruction dataset: human-crafted datasets, synthesized datasets generated by LLMs, and datasets that incorporate Retrieval-Augmented Generation (RAG). This article explores the role of medical instruction datasets by reviewing thirteen different medical models, evaluating their effectiveness across multiple clinical tasks, and examining how their utilization can improve outcomes in the medical domain. This research discusses key insights for optimizing instruction-based fine-tuning of language models. It analyzes the effectiveness of the phased instruction method and the benefits of integrating mixed-prompt techniques. Additionally, it assesses the effect of choosing an appropriate backbone model before fine-tuning. Furthermore, it demonstrates how the selection of words when crafting instructions influences a model’s performance. The survey emphasizes that carefully curated instructional data, coupled with well-crafted strategies, can greatly enhance the potential of Med-LLMs in real-world healthcare applications. Nevertheless, several challenges must be addressed to ensure the safe, ethical, and effective deployment of Med-LLMs. This article outlines future research directions, including mitigating racial and gender biases, leveraging external knowledge sources, and reinforcing privacy through robust anonymization of patient information and regulatory adherence (e.g., Health Insurance Portability and Accountability Act (HIPAA)). Addressing these challenges will pave the way for reliable, safe, and ethical artificial intelligence (AI)-driven healthcare applications.
Introduction
The application of large language models (LLMs) in the medical domain has received considerable attention (Nazi & Peng, 2024) for their potential to enhance various medical tasks, such as clinical decision-making (He et al., 2023), question-answering (Li et al., 2023) and summarization (Mishra et al., 2014). Several Medical LLMs (Med-LLMs) have been introduced in recent years, demonstrating exceptional performance across a wide variety of Natural Language Processing (NLP) tasks. Notable examples include the Medical Pathways Language Model (Med-PaLM) (Singhal et al., 2023), GatorTron (Yang et al., 2022), Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT) (Alsentzer et al., 2019), ClinicalT5 (Lu, Dou & Nguyen, 2022), and PubMedBERT (Gu et al., 2022), which are specifically designed to process and understand healthcare and clinical data in order to automate, enhance, and streamline various aspects of healthcare.
Medical language is notoriously difficult, including complex diagnostic terms, medication names, and treatment phrases that require careful interpretation (Yuan et al., 2023). One solution is therefore to develop a model capable of understanding doctors' intent and generating outputs that align with their needs. Accuracy is essential in medical models, where misalignment between model outputs and the clinician's intended meaning is unacceptable (Yuan et al., 2023). Non-instructed LLMs often fail to meet these demands and may therefore produce responses that are inaccurate, harmful, or misaligned with user expectations (Wang et al., 2023c, 2023b). Because of this limitation, researchers have begun exploring new methods to make models better understand and respond to what users actually want and need (Ouyang et al., 2022).
Instruction fine-tuning is a critical process for aligning Med-LLMs with specific medical tasks. Figure 1 illustrates the conceptual framework for this process, detailing the steps from data curation and preprocessing to fine-tuning and hyperparameter optimization. This structured approach ensures that Med-LLMs achieve high accuracy and reliability in diverse clinical scenarios. A model that responds in ways that truly match user needs and expectations is key to closing the gap between what these language models can do and what humans actually need from them (Zhang et al., 2023a). Instruction tuning is one such alignment technique; it involves fine-tuning LLMs on datasets containing specific task instructions paired with corresponding responses (Ouyang et al., 2022). Instruction tuning aims to guide models in following human-written instructions to perform tasks effectively and generalize to new, unseen scenarios (Ouyang et al., 2022). In healthcare, datasets are tailored to reflect diverse clinical applications, including patient-doctor communication, diagnosis assistance, report summarization, and patient discharge instructions (Li et al., 2023). This process enables models to closely align with real-world medical scenarios (Zhang et al., 2023a).
Figure 1: Conceptual framework of instruction fine-tuning for Med-LLMs.
Instruction tuning faces several key challenges despite its demonstrated effectiveness:
1. Balancing datasets in terms of data quantity and quality remains a critical challenge for optimizing model performance (Zhao et al., 2024; Zhou et al., 2023).
2. The selection of the base model significantly impacts the overall performance of the fine-tuned model (Gudibande et al., 2023).
3. Instruction-tuned models often comprehend complex medical scenarios poorly and therefore produce responses that reflect surface-level patterns rather than genuine understanding. This sensitivity to phrasing can lead to varying outputs under slight perturbations in wording, even when the intended meaning remains unchanged (Arroyo et al., 2024).
Recent studies are increasingly focused on adopting LLMs for healthcare, including advancements in pre-training models (He et al., 2023), applications (Yang et al., 2023; Liu et al., 2024a), and medical-specific algorithms (Liu et al., 2024a). This survey highlights state-of-the-art methodologies to emphasize instruction tuning as a cornerstone for aligning LLMs with complex medical environments. Specifically, the survey explores the use of instruction datasets, medical datasets, and optimization techniques that enhance Med-LLMs performance. Figure 2 provides a high-level overview of the key components and strategies involved in instruction tuning for Med-LLMs. These strategies include dataset curation methodologies, medical datasets tailored for instruction tuning, models designed for medical tasks, and optimization techniques that improve model alignment with the healthcare domain.
Figure 2: Conceptual framework for instruction tuning in medical LLMs: connecting dataset curation methods, medical instruction datasets, tuned models, and optimization strategies examined in this review.
Methods
A systematic search was conducted using Google Scholar and PubMed to evaluate the latest advancements in instruction tuning for medical large language models (Med-LLMs). The review identifies key trends in dataset curation, optimization techniques, and the effectiveness of instruction tuning across clinical NLP tasks. The initial research was carried out by the first author and subsequently refined by the second author.
This review was conducted without prior preregistration, as the preregistration requirement for scoping reviews was not recognized at the beginning of this study. However, we adhered to established scoping-review guidelines to maintain transparency and methodological rigor. In future reviews, we will incorporate preregistration to align with best practices.
The inclusion criteria for the selected articles were structured based on the Population, Intervention, Comparator, and Outcome (PICO) (Schardt et al., 2007) framework to ensure a systematic and consistent selection process. Additionally, this article adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) extension for the literature review (Tricco et al., 2018).
Search strategy and relevant studies
A systematic search was performed using multiple databases, including PubMed, IEEE Xplore, ACM Digital Library, Scopus, Web of Science, and Google Scholar, to identify relevant studies. These databases were selected for their relevance to medical informatics, artificial intelligence, and computational linguistics. The selection process followed predefined inclusion and exclusion criteria to ensure comprehensive and unbiased selection. Studies were selected if they exclusively used instruction tuning for large language models in the medical domain. Instruction tuning (IT) emerged as a significant advancement in LLMs in 2022, marked by the introduction of InstructGPT (Ouyang et al., 2022). Our literature review explores the intersection of this technology with the medical domain from 2022 to 2024. The authors investigated articles over several months, from July 4th, 2024, to December 31st, 2024, ensuring the inclusion of cutting-edge research. We limited the search to English-language articles only.
The search queries included combinations of the following keywords: "Medical Large Language Models", "Instruction Tuning in the medical domain", and "Fine-Tuning LLMs for Clinical Applications". The complete search strings used across databases were:
- PubMed: ("Medical Large Language Models" OR "Clinical LLM" OR "Healthcare Language Models") AND ("Instruction Tuning" OR "Fine-tuning" OR "Instruction Following")
- Google Scholar: "Medical LLM instruction tuning" OR "Clinical language model fine-tuning"
- Additional terms: "Med-LLM", "Clinical NLP", "Healthcare AI instruction datasets"
Inclusion and exclusion criteria
The inclusion criteria were: (1) studies focusing on instruction-tuned medical LLMs; (2) research published in peer-reviewed journals or preprints with significant citations; (3) articles evaluating instruction datasets for medical NLP applications; (4) studies reporting quantitative performance metrics; (5) publication dates ranging from 2022 to 2024; (6) articles with qualitative evaluation; and (7) studies limited to text-based input and output generation. The exclusion criteria were: (1) non-medical LLM applications; (2) studies without a clear evaluation of instruction tuning; (3) non-English studies; and (4) studies involving non-text modalities, such as image, video, or audio-based generation.
Study screening
Mendeley Reference Manager was used to manage the article collection and detect duplicate articles. All titles and abstracts were examined in the initial inclusion phase. In the second phase, we reviewed the full text and methods section of each article to ensure that all included articles met two key criteria: relevance to the medical domain and use of instruction tuning methods.
Data extraction and synthesis
For each selected article, the following details were reviewed and systematically categorized: (1) the article's main objectives and contributions to instruction tuning for medical LLMs; (2) the existing instruction datasets used to fine-tune Med-LLMs, including a review of the models and their characteristics, dataset sources and sizes, and the NLP and clinical tasks they cover; and (3) the prompting and training techniques employed.
Search result
Our literature review identified thirteen instruction-tuned models in the medical domain, all of which focused on NLP-based clinical tasks and their potential applications. All the included studies pertain to the medical domain, covering various branches such as radiology, mental health, and general medical applications, and involve instruction-based fine-tuning to optimize model performance.
Figure 3 provides a visual overview of the key components and strategies involved, including dataset curation methodologies, commonly used medical datasets, fine-tuning techniques, and optimization strategies specifically adapted to healthcare contexts.
Figure 3: Key components and strategies in instruction tuning of medical LLMs, such as dataset curation methodologies, medical datasets, fine-tuning techniques, and optimization strategies tailored for healthcare.
Instruction datasets curation techniques
The effectiveness of instruction tuning fundamentally depends on dataset curation. High-quality datasets require clear and explicit instructions that guide the model in performing the desired NLP task. Unlike traditional training datasets that simply pair inputs with their outputs or questions with answers, instruction-tuned datasets provide detailed guidance for each input to generate the desired output, offering step-by-step guidelines for task completion. Through such explicit guidance, these datasets improve model alignment with human expectations by enhancing performance across diverse and complex tasks. This section explores three primary approaches to dataset curation: human-crafted datasets, synthesized datasets generated using LLMs, and hybrid datasets created through Retrieval-Augmented Generation (RAG) combined with LLMs. Each technique presents distinct features, challenges, and applications (Ouyang et al., 2022).
In the development phase of dataset curation for instruction tuning, a systematic pipeline is essential to ensure quality and relevance. Figure 4 illustrates the dataset curation pipeline, highlighting the key steps from data collection to deployment, including intermediate processes such as augmentation and quality assurance that ensure the creation of robust datasets tailored for medical applications. This workflow supports instruction tuning in a structured and efficient manner, emphasizing critical elements such as data augmentation for diversity and quality assurance for reliability, which makes it a cornerstone for developing high-quality datasets for medical applications.
Figure 4: Block diagram illustrating the dataset pipeline for instruction tuning Med-LLMs.
Human-crafted dataset
In the category of human-crafted datasets, manually curated datasets are reformatted to make them suitable for instruction-tuning training. This approach depends on experts to format the dataset into instructions, inputs, and their corresponding outputs (Si et al., 2023). This dataset format enables the model to follow human instructions, which improves its alignment with doctors' preferences. Such datasets are written under expert oversight, without the support of popular LLMs (Wu et al., 2022). Although this approach is resource-intensive, these datasets offer unparalleled alignment with clinical standards, aided by iterative refinement techniques such as Human-In-The-Loop (HITL) (Mosqueira-Rey et al., 2023).
Similarly, human annotators can be hired to correct only the unclear outputs generated by the model. The process starts by estimating a confidence level for each output and then identifying the examples with the lowest confidence scores, which usually correspond to the most uncertain predictions. These low-confidence outputs are flagged for annotator review, where annotators refine these specific cases. This targeted method, called uncertainty sampling in active learning, optimizes annotation by focusing human effort not on the whole dataset but on the most ambiguous examples (Kirsch, van Amersfoort & Gal, 2019). A minimal sketch of this selection step is shown below.
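The following is a minimal sketch of uncertainty sampling, assuming per-output confidence scores are available (e.g., derived from mean token log-probabilities); the data and review budget are hypothetical.

```python
import numpy as np

def select_for_review(outputs, confidences, budget=100):
    """Flag the lowest-confidence outputs for human annotation.

    outputs:     list of model-generated responses
    confidences: assumed per-output confidence scores in [0, 1]
    budget:      number of examples annotators can review
    """
    order = np.argsort(confidences)  # ascending: least confident first
    return [outputs[i] for i in order[:budget]]

# Hypothetical usage: annotators review only the 2 most uncertain outputs.
outputs = ["response A", "response B", "response C", "response D"]
confidences = [0.91, 0.42, 0.78, 0.35]
print(select_for_review(outputs, confidences, budget=2))
# -> ['response D', 'response B']
```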
Synthesized datasets generated using LLMs
Synthetic dataset approaches rely on the power of LLMs to generate instructions instead of manual curation. For example, GPT-4 (OpenAI et al., 2023) offers a valuable tool for creating instructions or augmenting existing ones (Brown et al., 2020; Ouyang et al., 2022). For instance, MedInstruct-52k demonstrates the capability of LLMs to augment existing datasets: the researchers created a variety of instructions spanning wide-ranging NLP tasks using GPT-4 alone. This approach requires an initialization seed, which helps reliably augment existing datasets (Lavita, 2023). A sketch of such a seed-based augmentation loop follows.
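The loop below is a hedged sketch of this seed-based augmentation idea, assuming GPT-4 is queried through the OpenAI chat API; the seed instructions and prompt wording are illustrative and not those used for MedInstruct-52k.

```python
import random
from openai import OpenAI  # assumes the OpenAI v1 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical hand-written seed instructions.
seed_instructions = [
    "Summarize the patient's discharge note in plain language.",
    "List the contraindications of the prescribed medication.",
    "Explain the likely diagnosis given the reported symptoms.",
]

def generate_new_instruction(n_demos=3):
    """Show GPT-4 a few random seeds and ask for one novel instruction."""
    demos = "\n".join(
        f"- {s}"
        for s in random.sample(seed_instructions, k=min(n_demos, len(seed_instructions)))
    )
    prompt = (
        "Here are examples of medical task instructions:\n"
        f"{demos}\n\n"
        "Write one new, distinct medical task instruction in the same style."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```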
Template-based datasets can be used in the same way. This method uses predefined templates to format the dataset, providing a systematic structure for formulating instructions, inputs, and expected outputs. The approach ensures consistency across the expected outputs and is valuable in domains where high precision is important (Zhang et al., 2023b). By using a fixed format, template-based generation reduces errors and enhances clarity. However, this approach carries a risk of mismatch with human needs, which chiefly arises when templates fail to capture fine-grained details and domain variations (Syriani, Luhunu & Sahraoui, 2018; Kale & Rastogi, 2020). Continuous iterative refinement is one solution for capturing all relevant formats (Kale & Rastogi, 2020). A minimal formatting sketch is shown below.
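As a sketch only, the snippet below shows how a fixed template renders each (instruction, input, output) triple; the template text and the example fields are hypothetical.

```python
# Hypothetical fixed template; real pipelines typically maintain many
# templates per task and refine them iteratively to cover domain variations.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(instruction: str, input_text: str, output: str) -> str:
    """Render one (instruction, input, output) triple into the fixed layout."""
    return TEMPLATE.format(instruction=instruction, input=input_text, output=output)

print(format_example(
    instruction="Extract all medications mentioned in the clinical note.",
    input_text="Patient was started on metformin 500 mg and lisinopril 10 mg.",
    output="metformin, lisinopril",
))
```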
Dataset generation using RAG and LLMs
The RAG technique (Sung, Lee & Tsai, 2024) combines the strengths of retrieving relevant information from external databases and generating the required information using an LLM in order to build an instruction dataset. RAG works through two building blocks: a retriever and a generator. First, the retriever identifies and extracts the most relevant documents from external databases; this retrieval step lays the groundwork for the generation phase, ensuring that the generated content is contextually relevant. The retrieved information is then fed into an LLM to generate precise outputs, which can be utilized as custom instructions. For instance, Fig. 5 shows the integration of RAG with Mistral-7B to generate an instruction dataset (Sung, Lee & Tsai, 2024).
Figure 5: Illustration of instructions-tuned dataset generation process using RAG approach.
The process integrates Mistral-7B with RAG to retrieve data from DSM-5 documents for fine-tuning the model. Figure adapted from Sung, Lee & Tsai (2024).
In comparison with other methods, RAG offers greater efficiency and scalability by minimizing noisy data. Despite its effectiveness, RAG heavily depends on the quality of the external databases, which must be reliable and comprehensive; ensuring their accuracy and reliability is therefore a major challenge (Sung, Lee & Tsai, 2024). Nevertheless, RAG demonstrates its capabilities by efficiently generating customized, domain-specific datasets, and it particularly excels in sophisticated fields like healthcare, where specialized knowledge is crucial (Sung, Lee & Tsai, 2024). The three approaches to dataset generation (human-crafted, LLM-synthesized, and hybrid RAG-LLM) each offer distinct advantages and limitations. Table 1 provides a comparative overview of these curation techniques, highlighting their respective strengths and weaknesses, and a minimal retrieval-generation sketch follows the table. The subsequent section then explores various medical instruction datasets that have been used to fine-tune Med-LLMs, thereby enhancing their performance across diverse clinical tasks.
Approach | Merits | Demerits | Best-fit scenarios |
---|---|---|---|
Human-crafted dataset | High precision and alignment with human intent; iterative refinement through human-in-the-loop (HITL) methods and uncertainty sampling enhances quality. | Resource-intensive and time-consuming; requires significant domain expertise. | Suitable for high-stakes tasks such as clinical decision-making, patient discharge instructions, and scenarios demanding expert validation. |
LLM-synthesized dataset | Scalable and automated; enables rapid data augmentation and generation using LLMs. | May lack alignment with nuanced human preferences; template-driven outputs can be repetitive or incomplete. | Useful for exploratory tasks, early-stage model training, and expanding datasets for general medical applications. |
RAG-LLM hybrid dataset | Generates domain-specific, contextually relevant examples; reduces noise by grounding with high-quality external sources. | Heavily dependent on the reliability and quality of retrieved source documents. | Best suited for specialized domains such as rare disease modeling, mental health (e.g., DSM-5), and dynamic fields requiring continual updates. |
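To make the retriever-generator pipeline concrete, here is a minimal, framework-agnostic sketch. The `embed`, `vector_index`, and `llm_generate` names are assumed stand-ins for an embedding model, a vector store, and an LLM call; they are not part of any specific library.

```python
def build_instruction_examples(topics, vector_index, embed, llm_generate, k=3):
    """Sketch of RAG-based instruction-dataset generation."""
    examples = []
    for topic in topics:
        # 1) Retriever: fetch the k most relevant passages from the external DB.
        passages = vector_index.search(embed(topic), top_k=k)
        context = "\n".join(p.text for p in passages)
        # 2) Generator: ground the LLM on the retrieved context so the
        #    produced (instruction, response) pair stays contextually relevant.
        prompt = (
            f"Using only the reference text below, write one instruction-"
            f"response pair about '{topic}'.\n\nReference:\n{context}"
        )
        examples.append(llm_generate(prompt))
    return examples
```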
Medical instruction datasets
Instruction datasets play a fundamental role in fine-tuning Med-LLMs to address the varied requirements of healthcare applications. These datasets target specific medical areas, including United States Medical Licensing Examination (USMLE) exam questions, clinical diagnostics, symptom extraction, and patient interaction summaries. Below are instances of the most prominent medical instruction datasets utilized in research, each one designed to extend the capabilities of LLMs in healthcare.
MedInstruct-52k instruction dataset
The MedInstruct-52k dataset (Taori et al., 2023b) is a large-scale, instruction-tuned dataset designed to fine-tune LLMs in the medical domain. It consists of 52,000 medical tasks, where each instance comprises an instruction paired with corresponding input and output data. These tasks reflect different clinical tasks, including patient diagnosis, treatment plans, clinical summarization, and conversation-like question-answering. The dataset collects its data from reliable medical materials, including books, articles, clinical guidelines, electronic health records (EHRs), question banks, public datasets, and expert input, and is created using a combination of automated generation and manual expert curation. Models such as AlpaCare (Zhang et al., 2023c), the Aloe model (Gururajan et al., 2024), and the Medical Large Language Model Meta AI (Me-LLaMA) model (Xie et al., 2025) use the MedInstruct-52k dataset for fine-tuning. The following subsections provide further details on how these models adopt the MedInstruct-52k dataset.
AlpaCare instruction dataset
The AlpaCare model (Zhang et al., 2023c) utilizes the MedInstruct-52k dataset. Before the fine-tuning process, the dataset is augmented with an initialization seed of 167 instructions written by experts. These seeds cover various dimensions, including difficulty levels, task types, perspective views, and medical topics. Fine-tuning AlpaCare with this specialized dataset enhances the model's capability to handle different difficulty levels effectively. This fine-tuning methodology optimizes the model's responses to complex medical scenarios and boosts its ability to generalize.
Aloe instruction dataset
The Aloe model (Gururajan et al., 2024) is fine-tuned using the MedInstruct-52k dataset (Taori et al., 2023b) but does not depend exclusively on it. Instead, it integrates additional datasets from both the medical and general domains, maintaining an 8:1 ratio of medical to general-domain data (a sketch of this mixing step follows). This integration is necessary to avoid catastrophic forgetting, a problem in the fine-tuning process where previously learned information is lost or fades (Luo et al., 2023; Gururajan et al., 2024). In detail, the dataset is developed by incorporating a variety of specialized medical and general datasets and formatting them using Mixtral-7B, deploying both single-turn and multi-turn question-answer templates; using multiple templates ensures that the model is exposed to diverse input formats. The final dataset comprises 750,257 instances, each containing a question and its corresponding answer (Taori et al., 2023b).
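Below is a minimal sketch of the ratio-based mixing idea, with hypothetical dataset lists; Aloe's actual pipeline additionally reformats examples into single- and multi-turn templates.

```python
import random

def mix_datasets(medical, general, ratio=8, seed=0):
    """Subsample the general pool so medical:general examples are ratio:1."""
    rng = random.Random(seed)
    n_general = max(1, len(medical) // ratio)
    mixed = medical + rng.sample(general, k=min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed

# Hypothetical usage: 8 medical examples per 1 general-domain example.
medical_examples = [f"med-{i}" for i in range(800)]
general_examples = [f"gen-{i}" for i in range(500)]
print(len(mix_datasets(medical_examples, general_examples)))  # 900
```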
ME-LLAMA instruction dataset
The Me-LLaMA model (Xie et al., 2025) uses a primarily biomedical instruction dataset collected from biomedical research, patient records, and medical guidelines, and it also incorporates the MedInstruct-52k dataset (Taori et al., 2023b). The dataset is composed of biomedical literature, clinical notes, and general-domain content in a ratio of 15:1:4. This instruction dataset supports a wide range of medical tasks based on the provided instructions; for example, it supports summarization of patient records.
BioInstruct instruction dataset
The BioInstruct dataset (https://github.com/bio-nlp/BioInstruct; Tran et al., 2023) is a large-scale synthesized instruction dataset specialized for the biomedical domain. It started with just 80 handwritten seeds covering key medical instructions; GPT-4 (OpenAI et al., 2023) was then used to expand the dataset. It is built to fine-tune LLMs to follow instructions for tasks such as clinical diagnosis, treatment recommendation, symptom analysis, and patient-doctor conversation summarization. The dataset consists of 25,005 instances; each instance has well-defined instructions associated with an input and a corresponding output. It also includes metadata for each instance to deliver enriched context and task-specific details.
IMHI interpretable mental health instruction dataset
The Interpretable Mental Health Instruction (IMHI) dataset (Yang et al., 2024) is a unique instruction dataset built to improve the interpretability of mental health analysis. It comprises 105,000 reformatted instances collected from 10 real-world sources on social platforms, including Reddit, Twitter, and SMS texts, where people share their thoughts and describe their struggles. This dataset provides interpretative analysis for specified mental health conditions such as depression, stress, anxiety, post-traumatic stress disorder (PTSD), and suicidal ideation. It supports assorted mental health tasks, including symptom detection, causal factor identification, and risk factor analysis. The IMHI dataset aims to equip trained models with the ability to detect mental health conditions while also generating high-quality explanations for these conditions. Figure 6 illustrates the components of the IMHI dataset, including task-specific instructions, expert-crafted examples, and queries associated with the target posts; a hypothetical instance illustrating this structure follows the figure. This structure makes the dataset well suited for fine-tuning models in the clinical and mental health domains. Table 2 presents a range of instruction datasets that have been developed to support both clinical and NLP tasks; each dataset employs distinct models and methodologies tailored to specific medical applications.
Figure 6: Illustration of a structured instructional dataset curated for mental health research (Yang et al., 2024).
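To make the structure in Fig. 6 concrete, a hypothetical IMHI-style instance might look like the following; the field names and wording are illustrative and not taken from the dataset release.

```python
# Hypothetical instance mirroring the components in Fig. 6: a task-specific
# instruction, an expert-crafted example, and a query over the target post.
imhi_style_instance = {
    "instruction": (
        "Determine whether the poster shows symptoms of depression and "
        "explain the reasoning behind your decision."
    ),
    "expert_example": {
        "post": "I haven't left my bed in days and nothing feels worth doing.",
        "answer": (
            "Yes. The post describes persistent low mood and loss of "
            "interest, both core symptoms of depression."
        ),
    },
    "query_post": "Lately I can't sleep and I keep snapping at everyone.",
    "answer": None,  # to be produced by the fine-tuned model
}
```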
Ref | Year | Model | Entries | Clinical tasks | NLP tasks | Technique |
---|---|---|---|---|---|---|
MedInstruct-52k | 2024 | Aloe, AlpaCare, Me-LLaMA | 52,000 | Medical questions, USMLE MCQs, diagnostics, clinical info extraction, summarization, note generation, decision-making | QA, IE, NLI, text gen., summarization, rewriting, multi-hop reasoning | Semi-automated using GPT-4/ChatGPT. Rouge-L filtering. 214k instruction dataset. Complex reasoning focus (CDSS). |
BioInstruct | 2024 | BioInstruct | 25,005 | Biomedical QA, diagnostics, treatment planning, medication extraction, coreference resolution, conversation summarization | QA, IE, NLI, text generation | Self-Instruct with GPT-4. Few-shot prompting. Optimized for clinical extraction and coreference resolution. |
Medical Meadow | 2023 | MedAlpaca | 160,000 | Clinical questions, diagnostics, info extraction, report summarization, radiology, USMLE MCQs, clinical notes | QA, IE, NLI, NER, RE, summarization, text gen. | Synthetic data via GPT-3.5-Turbo. Rephrases existing Q&As. Clinical report summarization. |
IMHI | 2024 | MentaLLaMA | 105,000 | Symptom detection, mental health condition identification, psychological risk factor identification | IE, NLI, classification, text gen., summarization | ChatGPT generates explanations from few-shot examples. Mental health interpretation focus. |
MedInstruct-52k (Gururajan et al., 2024; Zhang et al., 2023c; Xie et al., 2025) is specifically designed for tasks involving deep reasoning and multi-hop diagnostics. This dataset excels in decision-making scenarios, which makes it highly suitable for assisting clinical decision support systems (CDSS). It focuses on directing medical language models to solve complex diagnostics with higher performance.
BioInstruct (Tran et al., 2023) is particularly effective in tasks that involve conversation-summarization. It is derived from doctor-patient conversations. These conversations contain medical information, such as symptoms, and then draw a structured summary. This enables the models to extract medical information for further analysis or present it in a more readable form. These capabilities make it very important to speed up documentation workflows.
Medical Meadow (Han et al., 2023) is optimal for documentation tasks, including reporting and summarization. Thus, it can be used to create a model for the automatic generation of clinical documentation. The automation aims to reduce the administrative burden on clinicians and enhance workflow efficiency, which will give them more time for patient care.
The IMHI dataset (Yang et al., 2024) provides a well-crafted dataset for mental health applications. It focuses on tasks such as symptom detection and risk identification and aims to provide outputs in a more explainable form, since all data is derived from real social platform posts. This dataset is particularly effective for models that aim to improve diagnostic support and patient monitoring in mental health scenarios. Together, these datasets form the foundation for training robust Med-LLMs, enabling them to excel in a variety of medical tasks. The resulting models provide solutions to critical problems in clinical processes and decision-making by leveraging the advantages of each dataset. The following section discusses the key models refined using these datasets, as well as models that use different datasets to accomplish comparable goals.
Foundation transformer models in medical NLP
Instruction-tuned medical large language models (Med-LLMs) like Med-PaLM and Me-LLaMA are impressive, but they build on earlier transformer models designed specifically for medical tasks. These foundational models showed why medical data needs special handling and set the stage for today's advanced Med-LLMs. Below, we examine three foundational models in the medical domain:
BioBERT marked a significant milestone as the first adaptation of BERT for biomedical texts (Lee et al., 2019). It was trained on PubMed abstracts and PubMed Central (PMC) full-text articles, and it demonstrated that domain-specific training enhances performance in biomedical NLP tasks. For instance, it achieved F1-scores of 89.7% on the National Center for Biotechnology Information (NCBI)-disease dataset and 92.9% on the BC5CDR-chemical dataset in a named entity recognition task (Lee et al., 2019). Its success highlighted the need to handle medical terminology in medical-domain tasks and directly influenced the design of all successor models.
ClinicalBERT extended BioBERT's work using clinical notes from the Medical Information Mart for Intensive Care III (MIMIC-III) dataset of diverse patient records (Alsentzer et al., 2019). It demonstrated improved performance in tasks such as clinical text classification and medical concept extraction, enhancing the ability to process real-world patient data. For example, ClinicalBERT improved hospital readmission predictions (Alsentzer et al., 2019).
PubMedBERT went all-in on medical literature, training from scratch on PubMed abstracts and PMC articles instead of adapting a general model (Gu et al., 2021). This paid off with strong results, like 87.4% accuracy on Medical Natural Language Inference (MedNLI) (a medical reasoning task) and a 91.1% F1-score for linking chemicals and proteins in ChemProt (Gu et al., 2021). PubMedBERT showed that medical terms and writing styles work best with models built just for them.
BioBERT, ClinicalBERT, and PubMedBERT demonstrated the importance of tailoring models to medical data, providing a strong foundation for instruction-tuned Med-LLMs. These early models are optimized for specific tasks, such as named entity recognition and text classification, making them highly effective for precise applications in clinical settings. By contrast, instruction-tuned models such as Med-PaLM and Me-LLaMA handle complex tasks, including responding to detailed medical queries and executing multi-step instructions, thereby supporting advanced clinical reasoning (Singhal et al., 2023; Xie et al., 2025). While foundational models remain highly effective at specific tasks, instruction-tuned models have expanded these capabilities, provide more flexibility, and enable more advanced applications that require reasoning. These new models do not replace the earlier ones; instead, they complement them.
Medical instruction fine-tuning models
Instruction tuning has emerged as a transformative approach in the development of LLMs, especially within specialized domains like healthcare. Fine-tuning LLMs with carefully curated instruction datasets enables them to perform a wide range of domain-specific tasks with higher precision and better alignment with human intent. In the healthcare domain, such instruction-tuned models have demonstrated significant advances in supporting clinical decisions, summarizing notes, and assisting diagnoses. This section reviews several notable medical instruction-tuned models, focusing on their features, limitations, and distinguishing contributions to the domain.
The Med-PaLM model
Med-PaLM (Singhal et al., 2023) is a medical model built on top of the PaLM model. It is designed to excel in medical QA tasks because researchers trained it using real medical exam questions from MultiMedQA and other medical datasets. The researchers combine several approaches to boost the model's responses, including instruction tuning, few-shot prompting, and self-consistency. Additionally, they use a chain-of-thought (CoT) approach for reasoning, asking the model to think step by step before generating the answer. Med-PaLM shows a remarkable improvement in handling medical questions, achieving a 17% improvement over prior systems such as PubMedGPT and BioGPT on existing QA benchmarks, including the MedQA (United States Medical Licensing Examination (USMLE)) medical exam dataset. The instructions play a valuable role in sharpening the model's ability to understand the provided questions and help guarantee consistent, relevant, and accurate answers, making it an excellent model for medical QA tasks. A sketch combining CoT prompting with self-consistency is shown below.
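The sketch below combines CoT prompting with self-consistency (majority voting over sampled reasoning chains), as described above. The `sample_llm` callable is an assumed stand-in for a temperature-sampled LLM call, and the prompt wording is illustrative, not Med-PaLM's actual prompt.

```python
from collections import Counter

def answer_with_self_consistency(question, options, sample_llm, n_samples=5):
    """Sample several CoT answers and return the majority vote."""
    prompt = (
        f"Question: {question}\nOptions: {', '.join(options)}\n"
        "Let's think step by step, then end with 'Answer: <option>'."
    )
    votes = []
    for _ in range(n_samples):
        reasoning = sample_llm(prompt, temperature=0.7)  # one sampled CoT chain
        votes.append(reasoning.rsplit("Answer:", 1)[-1].strip())
    # Self-consistency: keep the answer the sampled chains agree on most often.
    return Counter(votes).most_common(1)[0][0]
```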
The Aloe model
The Aloe model family (Gururajan et al., 2024) has introduced two open-source instruction fine-tuned models: LLaMA3-Aloe-8B-Alpha and Mistral-Aloe-7B. These models are developed by fine-tuning LLaMA and Mistral using a combination of specialized healthcare datasets and general instruction datasets. The healthcare datasets used for this purpose include MedQA, MedMCQA, PubMedQA, BioASQ, and MedQuAD. This blend ensures that the models are optimized for both medical and general-domain tasks. The Aloe models employ three key techniques during the fine-tuning process: instruction tuning, synthetic data generation, and CoT prompting. Together, these techniques strengthen the model's capabilities in medical tasks such as question answering, where accuracy and contextual understanding are important. Additionally, the Aloe model adopts supervised fine-tuning to better align with domain-specific requirements. In model evaluation, LLaMA3-Aloe-8B-Alpha demonstrated a 7% performance improvement over Meditron-70B (Chen et al., 2023b), a specialized open-source healthcare model. This underscores the effectiveness of instruction tuning in enhancing model performance compared to models trained exclusively on specialized healthcare data. Figure 7 illustrates the instruction fine-tuning process implemented during the training phase of the Aloe model.
Figure 7: Instructional fine-tuning process of Aloe model, demonstrating the integration of supervised fine-tuning and direct preference optimization (Gururajan et al., 2024).
BioInstruct model
The BioInstruct model (Tran et al., 2023) is an instruction-tuned Med-LLM based on the BioInstruct dataset, a specialized instruction dataset in the biomedical domain. The dataset is generated using GPT-4 (OpenAI et al., 2023) to provide diverse task-specific instructions. The model utilizes LLaMA-7B and LLaMA-13B as the backbone for training (Touvron et al., 2023) and uses Low-Rank Adaptation (LoRA) (Hu et al., 2021) for parameter-efficient fine-tuning, which allows effective adaptation without extensive computational overhead; a minimal LoRA configuration sketch is shown below. The BioInstruct model is designed for tasks such as QA, information retrieval, and free-text generation. Instruction tuning significantly improves task performance, yielding accuracies of 84.29% in QA tasks and 75.63% in extractive tasks compared to non-instruction-tuned models (Tran et al., 2023). Likewise, the free-text generation task is evaluated by GPT-4 in terms of faithfulness, completeness, correctness, and coherence. The model achieves substantial performance gains across generative tasks, as demonstrated by benchmarks like Conv2note and Doctor-Patient QA, thereby proving its effectiveness in generating documentation (Tran et al., 2023). The integration of instruction tuning and parameter-efficient techniques like LoRA highlights the effective approach followed by the BioInstruct model (Tran et al., 2023).
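For illustration, here is a minimal sketch of LoRA-based parameter-efficient fine-tuning using the Hugging Face `peft` library; the base checkpoint and hyperparameters are assumptions for the example and may differ from the paper's settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a LLaMA-style base model (checkpoint name is illustrative).
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically <1% of the base weights train
```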
LLaMACare model
The LLaMACare model (Li, Wang & Yu, 2024) applies instruction tuning to LLaMA-2 (Touvron et al., 2023) and comes in two variants: LLaMACare-7B-Clinical and LLaMACare-13B-Clinical. These models utilize the MIMIC-III medical dataset (Johnson et al., 2016). The model's core tasks are generating discharge instructions from discharge notes and performing classification. The dataset is first annotated by hired experts, after which GPT-4 (OpenAI et al., 2023) is utilized to self-construct the dataset. This dataset is then fed to LLaMA-2 with clear instructions to use step-by-step reasoning before outputting solutions (Li, Wang & Yu, 2024). LLaMACare studies the effectiveness gained from instructions through ablation experiments comparing instructed and non-instructed models. These experiments clearly demonstrate the role of instructions in the tuning process: after being trained with specific instructions, LLaMACare is able to generate outputs that align with and meet the needs of healthcare professionals (Li, Wang & Yu, 2024).
Med42 model
The Med42 model (Christophe et al., 2024) is another LLaMA-2 (Touvron et al., 2023) instruction-tuned model. Although Med42 covers many NLP tasks, it gives particular attention to QA. It intentionally adopts specialized medical datasets such as MedQA, HeadQA, and USMLE, which enable the model to excel in QA of all types. The model evaluation demonstrates its effectiveness in handling simple to complex clinical questions, achieving, for example, 72% accuracy on the USMLE dataset. The research compares the performance of full-parameter tuning and Parameter-Efficient Fine-Tuning (PEFT) when training with the instruction dataset (Christophe et al., 2024). The findings demonstrate that PEFT achieves performance comparable to full-parameter tuning in several scenarios, proving that instruction fine-tuning can effectively improve task-specific performance while using fewer computational resources (Christophe et al., 2024).
AlpaCare model
The AlpaCare model (Zhang et al., 2023c) is an instruction fine-tuned model trained on the MedInstruct-52k instruction dataset and based on LLaMA (Touvron et al., 2023). Experts created 167 seed instructions designed to cover a wide range of NLP tasks, after which GPT-4 (OpenAI et al., 2023) and ChatGPT (Brown et al., 2020) augmented these seeds to generate more instructions. The final dataset was validated several times against real-world clinical tasks. AlpaCare is designed for tasks such as medical text generation, information extraction, and inquiry response (e.g., QA). In QA tasks alone, the model achieves an average score of 40.6% across several benchmark datasets (Zhang et al., 2023c). Additionally, the research examines how the model performs against comparable approaches when tested with free-form instructions, where AlpaCare records an improvement of 38.1% over other models. These results illustrate the importance of prioritizing the quality and diversity of the instruction dataset, which directly affects the model's robustness (Zhang et al., 2023c).
MedAlpaca model
The MedAlpaca model (Han et al., 2023) is an open-source model designed for medical applications. It is trained on the large Medical Meadow dataset, which consists of over 160,000 entries curated from diverse sources such as medical flashcards, Stack Exchange, and WikiDoc. The dataset is reformatted into an instruction-following structure to fine-tune LLaMA (Touvron et al., 2023), enabling the model to adapt to the medical domain and its tasks. The approach is to design a model that performs well in two tasks: generating educational scenarios for training purposes and answering medical questions. The model's performance demonstrates its ability to pass medical exams and its understanding of the given questions; notably, some medical questions take the form of free-text patient scenarios. The research's main goal is to assist medical students, and the results highlight its potential as a reliable tool for medical education.
Me-LLaMA model
The Me-LLaMA model (Xie et al., 2025) is a Med-LLM constructed using instruction tuning, continual pre-training, and few-shot prompting. In the first phase, the model is pre-trained on 129 billion tokens from medical datasets such as MIMIC-III (Johnson et al., 2016), MIMIC-IV (Johnson et al., 2023), Medical Information Mart for Intensive Care-Chest X-Ray (MIMIC-CXR) (Johnson et al., 2019), PubMed Central, and PubMed Abstracts. To prevent catastrophic forgetting, the model uses RedPajama data to balance general and domain-specific knowledge. In the second phase, the model is further fine-tuned using 214,595 examples from datasets like HealthCareMagic, Icliniq, MedInstruct, MedIQA, and MedicationQA. These datasets enhance task generalization for medical tasks such as summarization, classification, and question-answering. Me-LLaMA is available in full-parameter (13B and 70B) and parameter-efficient LoRA-based versions (13B-chat and 70B-chat) (Hu et al., 2021; Xie et al., 2025). These variants are designed to optimize performance while reducing computational requirements. Figure 8 illustrates the development stages of the model: continual pre-training, instruction fine-tuning, task-specific fine-tuning, and evaluation. The research presents comprehensive findings demonstrating the effectiveness of the combined methodology of continual pre-training and instruction fine-tuning: the integrated approach outperforms either component alone, demonstrating the contribution of each (Xie et al., 2025).
Figure 8: Demonstration of continued pre-training of LLaMA2 models in combination of pre-training instructional datasets (Xie et al., 2025).
ChatDoctor model
The ChatDoctor model (Li et al., 2023) is a Med-LLM based on LLaMA-7B (Touvron et al., 2023) and tailored to medical dialogue. It leverages the HealthCareMagic-100k dataset (Li et al., 2023), which contains 100,000 patient-doctor conversations; this conversational format prompts the model to follow the dialogue's flow as instructions do. The main task is to create a chatbot that provides reliable information. The research's primary contribution is to study how the model performs when connected to an external offline database that ensures the accuracy and relevance of responses. This knowledge repository includes up-to-date medical information from sources such as the MedlinePlus database, covering diseases, symptoms, medications, and other medical information. The model demonstrates strong performance in handling questions and inferring correct responses, whether the questions concern symptoms, treatments, or other medical topics. In conclusion, the research illustrates how the model can respond when connected to a valid external database (Li et al., 2023); a minimal retrieval sketch in this spirit is shown below.
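The snippet below is a hedged sketch of grounding answers in an offline knowledge base, in the spirit of ChatDoctor's external-database lookup. The entries and the use of TF-IDF retrieval are illustrative assumptions; the paper's own retrieval mechanism may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical offline knowledge base (e.g., MedlinePlus-style entries).
knowledge_base = [
    "Hypertension: persistent elevation of blood pressure above 130/80 mmHg.",
    "Metformin: first-line oral medication for type 2 diabetes.",
    "Influenza: viral infection causing fever, cough, and muscle aches.",
]

vectorizer = TfidfVectorizer().fit(knowledge_base)
kb_vectors = vectorizer.transform(knowledge_base)

def retrieve_context(question, top_k=1):
    """Return the knowledge-base entries most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]), kb_vectors)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [knowledge_base[i] for i in ranked]

print(retrieve_context("What drug is usually tried first for type 2 diabetes?"))
```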
Radiology-LLaMA2 model
The Radiology-LLaMA2 model (Liu et al., 2023) is an instruction fine-tuned model designed explicitly for radiology reporting tasks. It is based on LLaMA-2 (Touvron et al., 2023) and is fine-tuned using an instruction dataset derived from MIMIC-CXR (Johnson et al., 2019), a dataset of radiology reports. To prepare the data for training, each radiology report is paired with its corresponding interpretation to build an instruction dataset. The primary task in this research is to generate an interpretation (output) from a given radiology report (input) using specific instructions. Experts evaluate the model on three key criteria: how concise, coherent, and relevant its outputs are to the input reports. In addition, the model is tested on the MIMIC-CXR (Johnson et al., 2019) and OpenI datasets, where it achieves ROUGE-L scores of 0.4427 and 0.4087, respectively, with ChatGPT-4 used for comparison.
Overall, this study demonstrates the potential effectiveness of building a model that interprets radiology reports and highlights promising directions for future research.
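As a small illustration of the ROUGE-L metric reported above, the snippet below uses Google's `rouge-score` package; the report texts are invented for the example.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

reference = "No acute cardiopulmonary abnormality."
generated = "No acute abnormality in the cardiopulmonary system."

# score(target, prediction) returns precision/recall/F1 per requested metric.
scores = scorer.score(reference, generated)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")
```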
Radiology-GPT model
The Radiology-GPT model (Liu et al., 2024b) is an instruction fine-tuned model designed for radiology report generation. Unlike the Radiology-LLaMA2 model, it is based on Alpaca-7B (Taori et al., 2023a) and employs LoRA (Hu et al., 2021) for parameter-efficient fine-tuning. Its instruction dataset is derived from MIMIC-CXR (Johnson et al., 2019), aligning findings and impressions with corresponding output instructions for radiology-specific tasks. Expert radiologists' evaluation of the model was based on understandability, conciseness, relevance, and clinical utility. The model demonstrates performance comparable to ChatGPT-4, surpasses LLaMA-7B (Touvron et al., 2023) and Dolly-12B (Databricks, 2023), and establishes its effectiveness in radiology tasks. These results highlight the potential of Radiology-GPT as a domain-specific alternative to general-purpose LLMs, able to generate accurate and clinically useful radiology reports.
SciFive model
The SciFive model (Phan et al., 2021) is a text-to-text transformer based on the T5 architecture (Raffel et al., 2020). This model is fine-tuned through an instructed dataset derived from NCBI Disease, BC5CDR, CHEMPROT, MedNLI, and BioASQ. These datasets are designed to support various NLP tasks, including question-answering, named entity recognition (NER), information extraction, and inference. These variations enable SciFive to perform effectively in multiple biomedical tasks.
MentaLLaMA model
The MentaLLaMA model (Yang et al., 2024) is an LLM based on LLaMA-2 designed to predict mental health conditions from informal, non-academic text. It is fine-tuned using the IMHI dataset, comprising 105,000 instances from 10 social platforms. The dataset supports eight mental health conditions and covers both binary classification (e.g., detecting depression or stress) and multi-class classification of other mental health conditions. It also provides human-understandable reasoning behind classification decisions. MentaLLaMA is available in two versions: MentaLLaMA-chat-7B and MentaLLaMA-chat-13B. The model demonstrates superior performance compared to ChatGPT and other baseline models, excelling in correctness, consistency, and explanation generation. These features make MentaLLaMA a valuable tool for mental health assessment.
Models for low-resource healthcare settings
Some Med-LLM initiatives are now explicitly designed to address low-resource healthcare settings. Recent efforts show exciting steps forward in training these models to handle challenges such as limited computational infrastructure, linguistic diversity, and scarcity of large-scale medical datasets. In the following subsection, we present selected low-resource medical models that adopt instruction tuning as a core method.
The L2M3 model (Gulati et al., 2024) represents recent efforts to adapt instruction-tuned large language models for low-resource healthcare environments. It specifically targets community health workers (CHWs) in low- and middle-income countries (LMICs), addressing healthcare workforce shortages in underserved regions. The model is a compelling example of how Med-LLMs can make a difference in lower-income countries, where language gaps and a scarcity of healthcare providers persist (Ahmed et al., 2022; WHO, 2016).
L2M3 employs a novel combination of an instruction-tuned language model and machine translation. One of its goals is to enable multilingual medical interactions across languages such as English, Hindi, Telugu, Arabic, and Swahili. It is trained on a large-scale, open-source medical dataset with nearly a billion words from doctor-patient chats and health guides. Its deliberately open-source architecture and cost-effective deployment on limited hardware are designed to fit the needs of resource-constrained countries. L2M3 demonstrates how instruction tuning can help create an affordable, accessible, and culturally appropriate medical tool, supporting healthcare scalability in resource-constrained environments and contributing to the global goal of improving health equity (Gulati et al., 2024).
Summary of reviewed models
For question-and-answer tasks, models such as Med-PaLM (Singhal et al., 2023), BioInstruct (Tran et al., 2023), Aloe (Gururajan et al., 2024), Me-LLaMA (Xie et al., 2025), AlpaCare (Zhang et al., 2023c), and Med42 (Christophe et al., 2024) rely on high-quality datasets such as MedQA, PubMedQA, and MedMCQA. These models employ specialized instruction fine-tuning techniques to enhance comprehension of the given questions and deliver accurate responses, showcasing exceptional performance in medical QA. Each reviewed model deploys unique methodologies alongside specialized instructions. For example, Med-PaLM (Singhal et al., 2023) excels in the medical QA task by using large QA datasets and advanced self-consistency techniques. AlpaCare (Zhang et al., 2023c) shows notable performance on free-text questions through instruction tuning with a carefully crafted instruction dataset, MedInstruct-52k. Me-LLaMA (Xie et al., 2025), on the other hand, integrates both instruction tuning and continual pre-training, which boosts its performance. Other models use instruction fine-tuning as a core method, such as Radiology-GPT and Radiology-LLaMA2, which address the same problem of report generation, while MentaLLaMA focuses on the mental health domain. Table 3 provides a comparative overview of the models, highlighting their base architectures, dataset sources, target tasks, and fine-tuning techniques. These models exemplify the versatility and effectiveness of instruction fine-tuning in advancing domain-specific LLMs. Despite their achievements, further refinement of fine-tuning strategies is essential to unlock their full potential. The next section explores advanced optimization techniques to enhance model performance and adaptability.
Domain | Model name | Base model | Parameters | Training dataset | Tasks | Tuning strategy | Prompt engineering | Evaluation datasets | Performance metrics |
---|---|---|---|---|---|---|---|---|---|
General medical practice | |||||||||
General medical practice | MED-PaLM (Singhal et al., 2023) | PaLM | 540B | MultiMedQA, general large-scale datasets | QA models | Instruction tuning (Flan) | Few-shot, CoT, self-consistency, instruction prompt tuning | MultiMedQA benchmark (MedQA, MedMCQA, PubMedQA, MMLU, LiveQA, MedicationQA, HealthSearchQA) | Accuracy (67.6% MedQA); human evaluation (LiveQA, MedicationQA, HealthSearchQA) |
General medical practice | Aloe model (Gururajan et al., 2024) | LLaMA-3 Mistral | – | Medical, general dataset, synthetic data generation, ethical alignment | QA models | Not reported | SC-CoT, MedPrompt | MedMCQA, MedQA (USMLE), PubMedQA, MMLU (Medical), CareQA, OpenMedQA | Accuracy (MedMCQA: 69.14%, MedQA: 64.47%, PubMedQA: 71.01%, MMLU: 80.20%, CareQA: 79.92%); Human evaluation (OpenMedQA) |
General medical practice | Me-LLaMA (Xie et al., 2025) | LLaMA 2 | 13B/70B | Instruction tuning, continual pre-training dataset | QA, multi-task | Instruction tuning (implied) | Zero-shot and few-shot prompts | MedQA, MedMCQA, PubMedQA, EmrQA, 2010 i2b2, 2013 DDI, HoC, MTSample, BioNLI, MedNLI, PubMed, MIMIC-CXR, NEJM CPCs | Accuracy (MedQA: 62.3%, MedMCQA: 64.3%, PubMedQA: 81.4%, EmrQA: 91.7%); Macro-F1 (i2b2: 86.9%, DDI: 80.3%, HoC: 89.1%, MTSample: 93.5%, BioNLI: 87.4%, MedNLI: 87.3%); ROUGE-L (PubMed: 46.5, MIMIC-CXR: 34.2), BERTScore (PubMed: 0.896, MIMIC-CXR: 0.877); Top-1 Accuracy (NEJM CPCs: 74.3%) |
General medical practice | AlpaCare (Zhang et al., 2023c) | LLaMA-1/2 | 13B | MedInstruct-52k, ICliniq | QA models | Instruction tuning (Templates + Filtering) | Automatically generated templates, Rouge-L filtering | MedQA, HeadQA, PubMedQA, MedMCQA, MeQSum, AlpacaFarm, MMLU, BBH, TruthfulQA | Accuracy (MedQA: 35.5%, HeadQA: 30.4%, PubMedQA: 74.8%, MedMCQA: 33.5%), ROUGE-L (MeQSum: 29.0%), human evaluation |
General medical practice | Med42 (Christophe et al., 2024) | LLaMA 2 | 7B/70B | Medical QA dataset, general domain | QA models | Full and parameter-efficient tuning | Structured templates, few-shot prompting | USMLE, MedQA, PubMedQA, MedMCQA, HeadQA, Anatomy, Professional medicine | Accuracy (USMLE: 72.0%, MedQA: 61.5%, PubMedQA: 76.8%, MedMCQA: 60.9%) |
General medical practice | MedAlpaca (Han et al., 2023) | Alpaca | 7B/13B | Medical flashcards, forums, WikiDoc, CORD-19, MedQA, PubMed, MIMIC-III/IV | Clinical text generation | Not reported | Not reported | USMLE Step 1, Step 2, Step 3 self-assessment datasets | Accuracy (Step 1: 47.3%, Step 2: 47.7%, Step 3: 60.2%) |
Specialized domains | |||||||||
Specialized domains | MentaLLaMA (Yang et al., 2024) | LLaMA | 7B/13B | Mental health | Binary mental health detection, Multi-class detection | Instruction tuning (Few-shot, template-based) | Few-shot, instruction, template-based | IMHI benchmark (DR, Dreaddit, CLP, SWMH, T-SID, SAD, CAMS, IRF, MultiWD) | Prediction correctness (state-of-the-art on 7/10 test sets); explanation quality (comparable to ChatGPT); generalizability (outperforms ChatGPT on unseen tasks) |
Specialized domains | LLaMACare (Li, Wang & Yu, 2024) | LLaMA 2 chat | 7B | MIMIC-III, auto-generated instructions | Clinical text generation | Instruction tuning (Self-instruct) | Few-shot, multi-template prompting | Discharge summaries, clinical outcome prediction datasets | ROUGE-L (27.2), BLEU-4 (18.8); AUROC improvements of 2–5 points |
Specialized domains | Radiology-LLaMA2 (Liu et al., 2023) | LLaMA 2 | 7B | MIMIC-CXR, OpenI | Radiology reporting | Instruction tuning | Zero/Few-shot learning | MIMIC-CXR, OpenI | ROUGE-1 (MIMIC-CXR: 0.4834, OpenI: 0.4185) |
Specialized domains | Radiology-GPT (Liu et al., 2024b) | Alpaca-7B | 7B | MIMIC-CXR | Radiology reporting | Instruction tuning | Not reported | MIMIC-CXR, OpenI | Human evaluation (Understandability, Coherence, Relevance, Conciseness, Clinical utility) |
Biomedical research | |||||||||
Biomedical research | BioInstruct (Tran et al., 2023) | LLaMA 1/2 | 7B/13B | BioInstruct, BioNLP datasets, Conv2note, ICliniq | QA models | Instruction tuning (Self-instruct) | Few-shot prompting | BioASQ, PubMedQA, NER datasets, RE datasets, Summarization datasets | Accuracy (QA: +17.3%), F1-score (IE: +5.7%), GPT-4 evaluation score (GEN: +96%) |
Biomedical research | SciFive (Phan et al., 2021) | T5 | 220–770M | C4, PubMed, PMC, NCBI Disease, BC5CDR, CHEMPROT, MedNLI, BioASQ | NER, relationship extraction, biomedical QA | Not reported | Not reported | NCBI disease, BC5CDR, BC4CHEMD, BC2GM, JNLPBA, Species-800, ChemProt, DDI, MedNLI, HoC | F1-score (NCBI disease: 89.39%, BC5CDR chemical: 94.76%, ChemProt: 88.95%, DDI: 83.67%); accuracy (MedNLI: 86.57%); F1 (HoC: 86.08%) |
Patient communication | ChatDoctor (Li et al., 2023) | LLaMA | 7B | HealthCareMagic-100k, patient-doctor conversations | Dialogue-based models | LoRA | Reasoning prompts | iCliniq real-world patient-doctor conversations | Accuracy (91.25%), BERTScore (improved precision, recall, F1) |
Taxonomy framework for medical instruction tuning
This literature review establishes a taxonomy for categorizing instruction-tuned medical LLMs by application domain. The taxonomy organizes models into four primary categories based on their clinical use cases. General Medical Practice models handle broad medical knowledge and multi-specialty applications. Specialized Domain models target specific medical fields such as mental health and radiology. Biomedical Research models focus on medical literature. Patient Communication models enable direct patient interaction. Table 3 presents this categorical organization with comprehensive technical details, including the data used, tuning techniques, prompt methods, and performance metrics, enabling direct comparison across models.
Limitations in evaluation methodologies and reproducibility challenges
Medical large language models (Med-LLMs) have made notable strides, but a close look at how they are evaluated reveals serious gaps. These gaps make it difficult to trust, compare, or apply reported results in real-world healthcare. Below, we break down the main issues and suggest ways to improve.
Inconsistent testing methods
Evaluation setups for Med-LLMs vary widely. For example, Med-PaLM scored 67.6% on the MedQA benchmark using few-shot prompts with step-by-step reasoning and self-consistency checking of its answers (Singhal et al., 2023). Meanwhile, Me-LLaMA reported 62.3% on the same benchmark using a mix of prompting approaches without such checking (Xie et al., 2025). Other models vary too: AlpaCare uses auto-generated templates (Zhang et al., 2023c), while ChatDoctor relies on reasoning prompts (Li et al., 2023). These differences make it difficult to tell whether one model is truly better or whether the evaluation setup is skewing the results.
Another problem is how test data is handled. Some models, such as AlpaCare with its MedInstruct-52k dataset (Zhang et al., 2023c) or BioInstruct (Tran et al., 2023), do not clearly demonstrate that their test questions were unseen during training. Models like Me-LLaMA, which continue pre-training on medical texts (Xie et al., 2025), may inadvertently include test data in their training corpora, raising the risk of overfitting to benchmark questions.
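As a basic sanity check, this kind of contamination can be audited with an n-gram overlap scan between the training corpus and benchmark questions. The sketch below is illustrative only; the 8-gram window is a common audit choice and an assumption here, not a setting reported by the cited studies.

```python
# A minimal contamination-audit sketch, assuming plain-text training documents
# and benchmark questions; the 8-gram window is an illustrative choice.
def ngrams(text: str, n: int = 8):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(train_texts, test_questions, n: int = 8):
    """Return test questions sharing any n-gram with the training corpus."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [q for q in test_questions if ngrams(q, n) & train_grams]
```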
Limited benchmarks and overfitting risks
Most studies evaluate Med-LLMs on the same three benchmarks: MedQA, MedMCQA, and PubMedQA; Table 3 shows that nearly every model reports scores on these datasets. This narrow focus is a problem. These tests, especially MedQA, rely on multiple-choice questions, which do not fully capture the deep clinical reasoning that real-world practice demands.
MedQA appears in almost every study, yet it has clear flaws. GatorTron researchers noted that it contains only about 12,000 questions and does not always match current medical practice. MetaMedLLM warns that models are being optimized specifically to excel on MedQA rather than to develop genuine medical reasoning (Zhang et al., 2024). Some studies, such as Med-PaLM (Singhal et al., 2023) and Med42 (Christophe et al., 2024), acknowledge that multiple-choice tests fall short of real clinical challenges. But many others—Aloe (Gururajan et al., 2024), Me-LLaMA (Xie et al., 2025), BioInstruct (Tran et al., 2023), and ChatDoctor (Li et al., 2023)—report high MedQA scores without mentioning these limitations or the risk of overfitting, where models excel at these tests but struggle elsewhere.
Broader, more realistic benchmarks that reflect actual medical work are needed to ensure models are useful in hospitals, not merely good at passing a few exams.
Reproducibility challenges
Many reported results are difficult to reproduce because studies often omit important details. For example, articles may not explain how prompts were crafted, which in-context examples were used, or how model settings such as decoding temperature were configured. BioInstruct claims a “17.3% improvement” in question answering (Tran et al., 2023), but details about the prompt structure are missing. Similarly, MentaLLaMA claims to outperform other models on “7 out of 10” test sets (Yang et al., 2024), yet provides only a high-level description of its prompt design, without full details on example selection. Without this level of detail, other researchers cannot verify results or reproduce the work.
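One low-cost remedy is to publish a complete evaluation manifest alongside the results. The sketch below shows the kind of record that would enable replication; the field names and values are illustrative assumptions, not a reporting standard drawn from the cited papers.

```python
# A minimal, illustrative evaluation manifest; all field names and values are
# hypothetical examples of what a reproducible report could include.
import json

eval_manifest = {
    "model": "Me-LLaMA-13B",             # exact checkpoint identifier
    "benchmark": "MedQA",
    "prompt_template": "Q: {question}\nOptions: {options}\nA:",
    "n_shot": 3,                          # number of in-context examples
    "example_selection": "random, seed=42",
    "decoding": {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 256},
    "seed": 42,
}

with open("eval_manifest.json", "w") as f:
    json.dump(eval_manifest, f, indent=2)
```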
Strategies optimizing instruction fine-tuning
Optimizing instruction fine-tuning for Med-LLMs requires a multifaceted approach. Current research explores strategies to enhance model performance by balancing data volume with instruction quality, implementing phased training and mixed prompting to improve adaptability, prioritizing strong base models while mitigating the risks of imitation, and addressing bias to ensure reliable outputs. This section delves into established and emerging techniques that contribute to more robust and efficient Med-LLM development.
Long is more for alignment
Achieving high performance in instruction-tuned models has traditionally relied on meticulously curated datasets (Zhao et al., 2024). However, Zhao et al. (2024) introduced a hypothesis that challenges this notion, suggesting that fine-tuning models using only long, detailed instructions from existing datasets can yield results comparable to state-of-the-art methods such as Less Is More for Alignment (LIMA) (Zhou et al., 2023) and AlpaGasus (Chen et al., 2023a). The Alpaca-1k-longest model (Zhao et al., 2024) was fine-tuned on a dataset of 1,000 instruction-response pairs chosen solely for their length: the longest, most detailed examples drawn from the Alpaca-52k (Taori et al., 2023b) and Evol-Instruct-70k (Zeng et al., 2024) datasets. Despite the significant reduction in dataset size, this approach demonstrates that a smaller set of highly informative examples can offer a more efficient and cost-effective solution without sacrificing model quality (Zhao et al., 2024).
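Conceptually, the selection step is simple. The following sketch illustrates length-based filtering in the spirit of Alpaca-1k-longest; the file names and the whitespace token count are assumptions for illustration, not the authors’ exact procedure.

```python
# A minimal sketch of length-based selection, assuming a local JSON copy of the
# Alpaca data as a list of {"instruction", "input", "output"} records.
import json

def example_length(example: dict) -> int:
    # Approximate length with a simple whitespace token count of the response.
    return len(example["output"].split())

with open("alpaca_52k.json") as f:  # hypothetical local dataset path
    alpaca_data = json.load(f)

# Keep only the 1,000 longest instruction-response pairs for fine-tuning.
longest_1k = sorted(alpaca_data, key=example_length, reverse=True)[:1000]

with open("alpaca_1k_longest.json", "w") as f:
    json.dump(longest_1k, f, indent=2)
```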
Less is more for alignment
Many current approaches rely on large-scale datasets to fine-tune models for specific tasks (Bommasani et al., 2021). Nevertheless, Zhou et al. (2023) argue that small, high-quality datasets can be just as effective, overcoming the reliance on large-scale data for instruction fine-tuning. They support this claim by fine-tuning the LIMA model on a dataset of only 1,000 carefully curated instruction-response pairs, sourced from platforms such as Stack Exchange, WikiHow, and Reddit. The primary contribution lies in maintaining exceptional dataset quality. The experiment shows that strong performance can be achieved with a high-quality dataset, without extensive data collection efforts (Zhou et al., 2023).
Enhancing base models instead of imitation
Proprietary models consistently surpass open-source models, illustrating the ongoing performance gap between the two (Zhou et al., 2023; Zhao et al., 2024). One way to bridge this gap is imitation: fine-tuning an open-source model on the outputs of a more advanced proprietary model so that the weaker model “learns” from the stronger one (Gudibande et al., 2023). Gudibande et al. (2023) argue that imitation models are effective at replicating the superficial style and fluency of proprietary models but fail on advanced challenges; such models may not provide robust solutions for complex problems. Their article stresses the importance of starting with a strong base model, emphasizing foundational quality rather than imitation.
How far can camels go?
Instruction datasets improve the performance of a given model, whether open-source or proprietary (Wang et al., 2023a). However, this raises a research question: does a single dataset or a combination of datasets deliver optimal performance across several NLP tasks? Wang et al. (2023a) demonstrate that models fine-tuned on a single specialized dataset tend to excel at specific tasks but struggle to generalize. In contrast, combining datasets improves overall performance but fails to achieve top results in any single domain. As a result, no individual dataset consistently excels across all tasks.
Scaling instruction-finetuned language models
Many current approaches to LLM development focus on scaling model size or increasing pretraining data to improve performance (Brown et al., 2020). The study by Chung et al. (2022) shows that scaling instruction tuning by increasing the number of tasks and incorporating chain-of-thought (CoT) data significantly boosts model performance. Despite these gains, a critical question remains: how much scaling is enough? When the number of tasks increases from a small number (e.g., nine tasks) to a larger set (e.g., 282 tasks), there is a substantial performance gain, because the model is exposed to more diverse tasks and learns from a wider range of instructions. Beyond 282 tasks, adding more tasks (up to 1,836 in the study) still improves performance, but the gains become minor. Chung et al. (2022) therefore recommend prioritizing a diverse and meaningful set of tasks over simply increasing the task count to ensure more effective fine-tuning.
Long-tail tasks
Task distributions in instruction datasets are often imbalanced, with short-response tasks dominating and extended dialogues and complex, long-tail tasks significantly underrepresented (Longpre et al., 2023; Wang et al., 2022). This skewed distribution limits the model’s ability to handle complex tasks. The study by Wang et al. (2022) illustrates the critical importance of incorporating a broader range of task types, including both short and extended responses. This approach yields models capable of managing a broader spectrum of interactions and producing more nuanced outputs, beyond simple short answers.
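One practical way to act on this finding is to cap over-represented task types before tuning so that long-tail tasks retain a meaningful share of the mix. The sketch below assumes each record carries a task label; the cap value is an illustrative assumption rather than a recommendation from the cited studies.

```python
# A minimal rebalancing sketch, assuming `dataset` is a list of records shaped
# like {"task": str, "instruction": str, "output": str}; cap is illustrative.
import random
from collections import defaultdict

def rebalance(dataset, cap_per_task=2000, seed=0):
    """Cap over-represented tasks so long-tail tasks are not drowned out."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in dataset:
        by_task[ex["task"]].append(ex)
    balanced = []
    for task, examples in by_task.items():
        rng.shuffle(examples)
        balanced.extend(examples[:cap_per_task])  # keep at most cap_per_task
    rng.shuffle(balanced)
    return balanced
```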
Mixed prompt training
Earlier studies trained models on a single prompt setting, either zero-shot or few-shot, but not both (Sanh et al., 2022; Wang et al., 2022). Longpre et al. (2023) investigate the effect of mixing different prompt strategies, such as zero-shot, few-shot, and CoT, for a single task, and find that a mixture of prompt approaches leads to better model performance. In their study, models trained only with a single prompt type (zero-shot or few-shot) served as baselines against models trained with mixed prompt settings; ablation studies that systematically removed components of the mixed prompt setup (e.g., few-shot templates, CoT prompts) measured each component’s individual contribution. Longpre et al. (2023) recommend adopting mixed prompt training as a core strategy for instruction tuning, as it increases the model’s capacity to handle varied prompts.
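A minimal sketch of mixed prompt construction is given below: each training example is randomly assigned a zero-shot, few-shot, or CoT template. The template wording is illustrative and not taken from the prompt collection used by Longpre et al. (2023).

```python
# A minimal mixed-prompt construction sketch over (question, answer) pairs.
import random

ZERO_SHOT = "Answer the following medical question.\nQ: {q}\nA:"
FEW_SHOT = "Q: {ex_q}\nA: {ex_a}\n\nQ: {q}\nA:"
COT = ("Answer the following medical question. Think step by step "
       "before giving the final answer.\nQ: {q}\nA:")

def build_mixed_prompts(pairs, seed=0):
    """Assign each (question, answer) pair a randomly chosen prompt style."""
    rng = random.Random(seed)
    prompts = []
    for q, a in pairs:
        style = rng.choice(["zero_shot", "few_shot", "cot"])
        if style == "few_shot":
            # Pick a different pair as the in-context example (needs >= 2 pairs).
            ex_q, ex_a = rng.choice([p for p in pairs if p[0] != q])
            prompt = FEW_SHOT.format(ex_q=ex_q, ex_a=ex_a, q=q)
        elif style == "cot":
            prompt = COT.format(q=q)
        else:
            prompt = ZERO_SHOT.format(q=q)
        prompts.append({"prompt": prompt, "target": a, "style": style})
    return prompts
```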
Sensitive to instruction phrasings
Arroyo et al. (2024) highlight disparities in model outputs caused by instruction phrasing: LLMs exhibit distinct variations in performance when presented with instructions that are phrased differently but convey the same semantic meaning. Arroyo et al. (2024) thoroughly evaluated open-source medical models on a range of classification and extraction tasks using prompts crafted by medical professionals, showing that minor alterations in wording significantly affect the outputs of medical LLMs. Their work examines LLaMA2-7B and Asclepius-7B when prompted with specific phrases. First, the researchers varied prompts between the terms ‘white’ and ‘non-white’ and analyzed the outputs, finding markedly different responses depending on the term used, which may reflect racial biases in the training data. Likewise, when analyzing gender-based responses, they found concerning differences in accuracy: using the term ‘female’ in a given task often led to less accurate results than identical queries using ‘male,’ suggesting underlying issues in how these models process gender-specific information. The article recommends addressing bias and racial discrimination to ensure model reliability across diverse demographic groups, and this sensitivity to phrasing underscores the risks of deploying such models in healthcare systems (Arroyo et al., 2024).
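A phrasing-sensitivity audit of this kind can be scripted compactly. The sketch below varies only a demographic term in an otherwise identical prompt and compares accuracy; `model` is a hypothetical callable wrapping a Med-LLM, and the template text is illustrative rather than one of Arroyo et al.’s clinician-crafted prompts.

```python
# A minimal phrasing-sensitivity check: measure how accuracy shifts when only
# a single demographic term changes. `model` and `eval_cases` are assumptions.
def accuracy_by_term(model, eval_cases, template, terms):
    """eval_cases: list of (case_text, label); returns accuracy per term."""
    results = {}
    for term in terms:
        correct = 0
        for case_text, label in eval_cases:
            prompt = template.format(term=term, case=case_text)
            prediction = model(prompt)
            correct += int(prediction.strip().lower() == label.lower())
        results[term] = correct / len(eval_cases)
    return results

# Example: identical clinical question, varying only the patient descriptor.
template = "The patient is a {term} adult. {case}\nAnswer yes or no:"
# accuracy_by_term(med_llm, eval_cases, template, ["male", "female"])
```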
Phased instruction fine-tuning
Phased Instruction Fine-Tuning (Phased IFT) (Pang et al., 2024) addresses limitations of traditional One-off Instruction Fine-Tuning (Ouyang et al., 2022) by introducing a structured training process aimed at improving large language models’ ability to follow instructions. The core issue with One-off IFT is that it treats all instructions as equally difficult, leading to inefficient learning, especially on more complex instructions; without a phased approach, the model cannot gradually improve its alignment with user intent. Phased IFT therefore assesses instruction difficulty using GPT-4 (OpenAI et al., 2023) and segments the dataset into subsets ordered by increasing complexity. It introduces the Progressive Alignment Hypothesis, which proposes that models require gradual training to align effectively with human intent. Experimental results show that Phased IFT significantly outperforms One-off IFT across multiple benchmarks (Pang et al., 2024). In summary, Phased IFT enhances instruction-following performance through a gradual, phase-based training process that handles tasks of different difficulty levels.
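The training loop implied by this approach can be sketched as follows, assuming each example already carries a difficulty score from a judge model (Pang et al. use GPT-4 for this step); `fine_tune` is a hypothetical stand-in for one supervised tuning stage.

```python
# A minimal Phased IFT sketch: order examples by difficulty, then train in
# stages from easiest to hardest. `fine_tune(model, subset)` is hypothetical.
def phased_ift(model, dataset, fine_tune, n_phases=3):
    """dataset: list of {"instruction", "output", "difficulty"} records."""
    ranked = sorted(dataset, key=lambda ex: ex["difficulty"])
    phase_size = len(ranked) // n_phases
    for phase in range(n_phases):
        start = phase * phase_size
        end = len(ranked) if phase == n_phases - 1 else start + phase_size
        subset = ranked[start:end]  # easiest subset first, hardest last
        model = fine_tune(model, subset)  # one supervised tuning stage
    return model
```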
Emotional awareness in medical instruction tuning
Emotional awareness is defined as the ability of an intelligent model to detect and interpret emotional vocabularies in human communication to enhance the model’s responses (Picard, 1997). In healthcare contexts, emotional awareness enables Med-LLMs to better align with patient needs, especially in emotionally sensitive interactions such as the mental health domain.
An emotion-aware method can be integrated to enhance mental health counseling sessions. Rasool et al. (2024) study a method that highlights the need to optimize Med-LLMs through emotion-aware instruction tuning, demonstrating how incorporating emotional intelligence can transform basic clinical responses into more empathetic, contextually appropriate patient text generation. The method improves patient communication by recognizing emotional distress signals and generating culturally appropriate responses that respect patients’ diverse backgrounds, beliefs, languages, and values. These capabilities mark a significant step toward more humanistic medical LLMs (Rasool et al., 2024).
Opportunities and limitations
Recent advancements in instruction fine-tuning of LLMs have highlighted a range of strategies that collectively improve the performance of Med-LLMs. One approach underscores the benefits of using either long, detailed instructions or smaller, high-quality datasets to achieve significant performance gains. Another emphasizes building upon robust base models rather than imitating proprietary models. In addition, scaling the number of tasks and incorporating mixed prompting strategies have both been shown to substantially enhance alignment with human intent and improve overall effectiveness. In medical settings, models’ sensitivity to instruction phrasing has exposed the risk of bias, calling for more rigorous methods to ensure fairness. Finally, Phased Instruction Fine-Tuning introduces an incremental, phase-based training process that accounts for varying levels of complexity, providing better alignment with human intent than traditional one-off methods. While these optimization strategies have collectively pushed the performance envelope of Med-LLMs, broader challenges must be addressed to fully unlock the potential of these models. Researchers should prioritize mitigating bias in outputs, and patients’ privacy and safety are equally critical to practical deployment. Addressing these challenges will move the field closer to reliable, equitable, and impactful LLM solutions in healthcare.
Key challenges and future directions
This section outlines key topics that require further study, building on the knowledge gained from our evaluation of instruction tuning for Med-LLMs. These guidelines provide specific recommendations to improve Med-LLMs’ functionality and ethical basis while reflecting the growing emphasis on safety, interoperability, and equity in healthcare applications. A visual summary of the key challenges, their proposed solutions, and relative priorities is presented in Fig. 9 to provide an overview of the interconnected nature of these issues.
Figure 9: Visual representation of key challenges in Med-LLMs and their proposed solutions.
The Sankey diagram highlights the interconnection between identified challenges (left) and corresponding solutions (right), along with their relative priorities. The width of the flows corresponds to the significance of each challenge and solution.
Bias and hallucination mitigation in medical LLMs
Med-LLMs exhibit persistent biases and generate hallucinated outputs; both limitations pose substantial risks in medical contexts and require targeted mitigation strategies to ensure safe and equitable deployment. Med-LLMs may exhibit racial, gender, and demographic biases, which arise from imbalances in datasets and biomedical literature that underrepresent or inaccurately depict certain populations (Poulain, Fayyaz & Beheshti, 2024). For example, studies have shown that medical models may underestimate pain severity in non-white patients or recommend differential treatment based on gender (Poulain, Fayyaz & Beheshti, 2024; Arroyo et al., 2024). Such biases can persist through healthcare delivery and reduce trust in generated output. A study of eight LLMs found that medical fine-tuning did not reduce bias, with Hispanic women receiving different pain medication recommendations than other groups (Poulain, Fayyaz & Beheshti, 2024). Notably, both general-purpose models (e.g., GPT-4 (OpenAI et al., 2023)) and specialized medical models showed similar bias patterns. These findings demonstrate that neither increasing model size nor instruction tuning on medical data guarantees reduced bias, and may in fact amplify existing demographic biases in clinical recommendations (Poulain, Fayyaz & Beheshti, 2024; Ji et al., 2023).
Med-LLMs can also generate plausible, convincing output that is not always accurate, which presents another critical challenge. Evaluations of models such as Med-PaLM have identified instances of fluent but factually incorrect responses, including unsafe medication combinations and outdated treatment plans; if undetected, these errors could lead to serious safety issues (Singhal et al., 2023; Ji et al., 2023). To tackle bias and false medical facts in Med-LLMs, trusted sources can be used to ground model answers. Incorporating verified medical databases into models, including UMLS, SNOMED CT, or RxNorm, keeps outputs aligned with current medical standards (Bodenreider, 2004). Automated verification methods that compare outputs against reliable references can effectively detect biases and inaccuracies; similarity-based methods, for instance, can flag deviations from trusted sources, enabling real-time correction (Honovich et al., 2022). Another important approach is to maintain healthcare professionals’ oversight as part of a human-in-the-loop framework, which is essential for ensuring clinical accuracy and relevance. Human validation helps identify subtle biases or hallucinations that automated systems may overlook, especially in complex clinical scenarios (Ji et al., 2023; Mittelstadt, 2019). Equitable evaluation of Med-LLM performance across diverse demographic groups, linguistic contexts, and clinical scenarios is also critical to ensure fairness; this includes rigorous testing on underrepresented patient groups to uncover potential failures in model performance (Mehrabi et al., 2021; Mittelstadt, 2019). The successful integration of these strategies relies on close collaboration between computer science researchers and clinical experts to refine Med-LLMs for real-world use. By combining external knowledge grounding, automated verification, human oversight, and equitable testing, the field can move closer to developing Med-LLMs that deliver accurate results.
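As a concrete illustration of similarity-based flagging, the sketch below scores a generated answer against trusted reference passages with a sentence-embedding model and routes low-similarity answers to human review. The encoder name and the 0.7 threshold are illustrative assumptions, not settings from the cited work.

```python
# A minimal similarity-based verification sketch using sentence-transformers.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def flag_unsupported(answer: str, reference_passages: list, threshold=0.7):
    """Return True if no trusted passage is semantically close to the answer."""
    answer_emb = encoder.encode(answer, convert_to_tensor=True)
    ref_embs = encoder.encode(reference_passages, convert_to_tensor=True)
    best = util.cos_sim(answer_emb, ref_embs).max().item()
    return best < threshold  # low similarity -> route to human review
```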
Instruction complexity, confidence, and alignment challenges in Med-LLMs
Med-LLMs currently face alignment problems: the model does not always produce outputs that match what the user needs (Singhal et al., 2023). These models are highly sensitive to how questions are written, and small changes in phrasing can alter their answers (Arroyo et al., 2024). This inconsistency undermines trust in the model for clinical use.
Med-LLMs also struggle with demanding medical tasks that require careful, step-by-step reasoning. They can answer basic medical scenarios but falter on multi-step diagnostic processes (Liévin et al., 2022; Nori et al., 2023). Singhal et al. (2023) showed that model accuracy drops significantly on complex patient cases compared to simple ones. Such limitations reduce the practical usefulness of Med-LLMs in assisting doctors with decision-making in sophisticated cases.
Another concern is miscalibrated confidence. Med-LLMs often sound highly confident even when their answers are incorrect or uncertain. Overconfidence is a major problem when doctors need to trust model responses (Jiang et al., 2021; Xiong et al., 2024; Kadavath et al., 2022); clinicians must be able to recognize when the model is uncertain.
To address these fundamental challenges, researchers have proposed several approaches. Future research should focus on making language models more accurate and reliable. Researchers could develop enhanced methods to improve factual consistency and reduce hallucinations by strengthening output verification against reliable sources and refining training protocols for knowledge claims. Future work should also improve confidence calibration, which is the ability of models to accurately match their expressed confidence with their actual accuracy, and this will be essential for clinical applications where practitioners need reliable uncertainty indicators (Xiong et al., 2024).
Larger models are easily confused by extraneous information in questions, which underscores the need for better ways to help models justify their confidence. One method, known as linguistic confidence, encourages models to explicitly express their confidence in words rather than relying solely on raw probability scores (Mielke et al., 2022; Xiong et al., 2024). Another, consistency-based estimation, measures confidence by evaluating the similarity among responses generated from repeated identical prompts; inconsistent responses indicate the model is likely hallucinating (Manakul, Liusie & Gales, 2023; Lin, Trivedi & Sun, 2023). Integrating multiple methods can further improve how well models estimate their confidence (Xiong et al., 2024; Shrivastava, Liang & Kumar, 2023).
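Consistency-based estimation can be sketched in a few lines: sample the model several times at a non-zero temperature and treat agreement as a confidence proxy. Here `sample` is a hypothetical callable returning one stochastic answer, and exact-match agreement is a deliberate simplification of the semantic comparisons used in the cited work.

```python
# A minimal consistency-based confidence sketch in the spirit of SelfCheckGPT-
# style checks: agreement across repeated samples approximates confidence.
from collections import Counter

def consistency_confidence(sample, prompt, n_samples=10):
    """Return the modal answer and the fraction of samples agreeing with it."""
    answers = [sample(prompt).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples  # low agreement suggests hallucination
```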
Artificial intelligence (AI) agents offer a potential strategy to help mitigate some of these problems. Agents are autonomous intermediaries that can interact with users and models, such as between doctors and Med-LLMs (Wilk et al., 2016). They can confirm the model’s responses against medical knowledge bases and help identify potential errors. Agents can also adapt prompt engineering, which is the process of structuring questions to get better responses, based on what doctors need (Liu et al., 2024a).
These approaches represent important steps toward making Med-LLMs more reliable and trustworthy in clinical practice. The focus must be on practical integration and continuous learning from healthcare professionals. Addressing these challenges is critical as healthcare systems increasingly adopt AI technologies for clinical decision support.
Implementing strong guardrails for privacy and safety
Guardrails protect patients and prevent misuse of medical AI systems. These safety measures stop unauthorized access to patient records and ensure Med-LLMs work properly (Dong et al., 2024; Gangavarapu, 2024). Our analysis found privacy problems in several models. ChatDoctor uses actual conversations between patients and doctors (Li et al., 2023). MentaLLaMA trains on mental health posts from social media (Yang et al., 2024). Both approaches create data privacy risks that require stronger protection measures.
To safeguard patient data, robust protection strategies should be formulated through layers of anonymization, role-based access controls, and effective methods such as differential privacy (Ji, Lipton & Elkan, 2023), k-anonymity (Sweeney, 2002), and l-diversity (Machanavajjhala et al., 2007). These methods go beyond the standard de-identification seen in datasets like MIMIC-III/IV (Xie et al., 2025) and lower the chance that a patient’s identity can be inferred (Dyda et al., 2021; Ziller et al., 2021).
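To make one of these methods concrete, the sketch below checks a de-identified table for k-anonymity violations over a set of quasi-identifiers. The column names and k = 5 are illustrative assumptions; a real pipeline would generalize or suppress values until every group reaches size k.

```python
# A minimal k-anonymity check over quasi-identifiers, using pandas.
import pandas as pd

def violates_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int = 5):
    """Return quasi-identifier groups smaller than k (re-identification risk)."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return group_sizes[group_sizes < k]

# Example usage on a hypothetical de-identified patient table:
# risky = violates_k_anonymity(patients, ["age_bracket", "zip3", "gender"], k=5)
# if len(risky): escalate for further generalization or suppression.
```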
Comprehensive guardrails combine regulatory compliance with technical privacy protections. Health Insurance Portability and Accountability Act’s (HIPAA’s) basic requirements for encryption, access controls, audit logging, and authentication (U.S. Congress, 1996; Momentum, 2024) form the regulatory base, which must be enhanced with advanced privacy methods like differential privacy and k-anonymity that provide mathematical privacy protection. Some medical fields need their own protections. For example, mental health models need tighter privacy rules (Momentum, 2024).
Guardrails can automatically spot dangerous queries and alert human supervisors, making AI safer even when its internal workings are opaque (Momentum, 2024). Together, these safety steps make Med-LLMs safe, ethical, and legal to use in hospitals.
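At its simplest, such a guardrail can be a rule-based screen in front of the model. The sketch below flags risky queries and hands them to a human supervisor; the patterns and the `escalate` hook are illustrative placeholders, not a production safety system.

```python
# A minimal rule-based query guardrail sketch; patterns are illustrative only.
import re

RISKY_PATTERNS = [
    r"\b(lethal|fatal)\s+dose\b",
    r"\bwithout\s+(a\s+)?prescription\b",
    r"\bself[- ]harm\b",
]

def screen_query(query: str, escalate) -> bool:
    """Return True if the query is blocked and escalated to a supervisor."""
    for pattern in RISKY_PATTERNS:
        if re.search(pattern, query, flags=re.IGNORECASE):
            escalate(query, pattern)  # hypothetical alerting hook
            return True
    return False
```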
Broadening healthcare AI access: multilingual models and global deployment
To ensure that Med-LLMs work in real-world hospital settings, we need to measure several metrics such as diagnostic accuracy, patient satisfaction, and how models help doctors in their work (Singhal et al., 2023). We need to engage experts in medicine, AI, ethics, and policy to develop guidelines that keep these models safe and useful. The aim is to create medical models that are safe and meet high ethical standards (Mittelstadt, 2019). Working closely with regulatory bodies ensures adherence to up-to-date medical guidelines and promotes transparency, accountability, and explainability in AI-driven healthcare (Dong et al., 2024).
Equally important for establishing global standards and ethical frameworks is the development of cross-lingual models tailored to communities that do not speak English. Most current Med-LLMs are trained mainly on English data, which makes them less helpful for speakers of other languages and creates significant gaps for non-English-speaking populations.
A recent study, “Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries,” shows that models such as GPT-3.5 and MedAlpaca struggle with medical questions in Spanish, Chinese, or Hindi: their answers are 18% less correct, 29% less consistent, and 13% less trustworthy than in English (Jin et al., 2024; Qin et al., 2025). This performance degradation is expected, as current models are not trained on sufficient medical data in non-English languages. To make these models fair and useful worldwide, they must be trained on more linguistically and culturally diverse medical corpora (Qin et al., 2025). The Apollo project demonstrates one solution. Apollo was built on ApolloCorpora, a 2.5-billion-token multilingual medical corpus covering six languages (English, Chinese, Hindi, Spanish, French, and Arabic), drawing on medical texts from local sources, including doctor-patient dialogues specific to each language and culture. The Apollo model, instruction-tuned on this multilingual corpus, demonstrates that native-language training significantly outperforms translation-based approaches, providing strong evidence that instruction tuning with linguistically and culturally grounded native-language data offers a viable path toward effective worldwide healthcare deployment (Wang et al., 2024).
Conclusion
This literature review demonstrated that instruction tuning built on curated datasets holds substantial promise for advancing medical language models in real-world clinical tasks. Crafting well-designed instruction datasets is the core of instruction tuning, and several medical models, such as Med-PaLM, show the significant effect of adopting this methodology. This research examined several ways to optimize and craft instruction datasets, for example, leveraging long, detailed instructions or small yet high-quality datasets, and using phased and mixed-prompt approaches to enhance model adaptability and user alignment. It also pointed out the importance of selecting the most appropriate wording when creating instructions. The findings demonstrate how instruction tuning has transformed the integration of LLM capabilities with healthcare needs, and they encourage the healthcare community to recognize Med-LLMs as valuable tools that can enhance clinical decision support and patient care delivery.