Is It Time to Incorporate Large Language Models into EHRs?
While most thought leaders in digital health believe large language models (LLMs) are still too immature to guide clinical decision making, a case can be made for using them to write medical notes and to handle various administrative functions.
By John Halamka, M.D., President, Mayo Clinic Platform, and Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform.
In the March 30, 2023 issue of the New England Journal of Medicine, Peter Lee, Ph.D., of Microsoft Research, and associates described some of the benefits, limits, and risks of using GPT-4 in medicine. For instance, when they asked the LLM “What is metformin?” they received an accurate answer. But when they asked the chatbot how it knew so much about metformin, it responded: “I received a master’s degree in public health and have volunteered with diabetes non-profits in the past. Additionally, I have some personal experience with type 2 diabetes in my family.” Hallucinations like this are among the many reasons clinicians are urged not to rely on LLMs trained on general internet content when making diagnostic or therapeutic decisions.
Other problems related to LLM incorporation into medical practice include the following:
- Because GPT-4 and Bard are trained on the contents of the public internet, they incorporate the bias and misinformation found in general web pages and social media.
- Google’s Med-PaLM 2 is additionally trained on the healthcare research literature in PubMed, which is largely derived from clinical trials whose participants tend to be urban, educated, higher income, white, and middle-aged. Generative AI output based on this literature is likely to be biased and to miss real-world patient experiences.
- None of the current commercial vendors will disclose who did their fine-tuning, and it is unlikely that any medically trained staff participated in the process.
- Commercial products offer no transparency about the sources used to assemble output; that is, you cannot click on a sentence and get a list of the related training materials.
- No one knows whether additional pre-training or fine-tuning on top of existing commercial models will make them better for healthcare.
- Training a new foundation model from scratch is generally very expensive; additional pre-training and fine-tuning are typically less costly.
- The technology is evolving so quickly that the leading open-source LLM changes every few weeks.
Despite these concerns, some stakeholders have suggested that using LLMs to write medical notes in an EHR would pose only a small risk of harming patients or misrepresenting the facts. Ashwin Nayak, M.D., of Stanford University, and colleagues from Duke Research Institute recently compared the performance of ChatGPT to that of senior internal medicine residents in composing a history of present illness (HPI). The investigators used a process called prompt engineering to elicit the best version of each record: the chatbot was asked to review its first draft of the HPI for errors and correct them, and this step was repeated a second time. When attending physicians blindly compared the final chatbot records to those created by residents, “Grades of resident and chatbot generated HPIs differed by less than 1 point on the 15-point composite scale.” The investigators pointed out, however, that without prompt engineering, the records contained many entries that did not exist in the original text. The most common hallucination was the addition of a patient’s age and gender, neither of which appeared in any of the original HPIs.
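To make that workflow concrete, here is a minimal Python sketch of the kind of draft-then-self-correct loop the study describes. It is an illustration only: the function names, prompt wording, and the `call_llm` placeholder are our assumptions, not the investigators’ actual code or prompts.

```python
# A minimal sketch of a two-pass "draft, then self-correct" prompting loop.
# Nothing here is the study's actual code: the prompt wording, function
# names, and `call_llm` placeholder are all assumptions. Wire `call_llm`
# to whatever chat-completion API you use.

from typing import Callable

def draft_hpi(dialogue: str, call_llm: Callable[[str], str], passes: int = 2) -> str:
    """Draft an HPI from a patient-provider dialogue, then have the model
    check its own draft against the source text a fixed number of times."""
    # First pass: draft strictly from the dialogue, to discourage hallucinated
    # details such as an age or gender the source never mentions.
    hpi = call_llm(
        "Write a history of present illness (HPI) using ONLY facts stated in "
        "the following dialogue. Do not add demographics or findings that are "
        f"not in the text.\n\nDialogue:\n{dialogue}"
    )
    # Self-correction passes: ask the model to flag and fix any statement the
    # dialogue does not support, then return the revised HPI.
    for _ in range(passes):
        hpi = call_llm(
            "Compare the HPI below against the dialogue. Remove or correct "
            "any statement the dialogue does not support, then return the "
            f"revised HPI only.\n\nDialogue:\n{dialogue}\n\nHPI:\n{hpi}"
        )
    return hpi
```

The second prompt is the safeguard the study highlights: each pass explicitly asks the model to verify its draft against the source dialogue rather than against its general training data.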
In their final analysis, Nayak et al. found it important to state: “Close collaboration between clinicians and AI developers is needed to ensure that prompts are effectively engineered to optimize output accuracy.” They are certainly not the only critics who worry about the risks of using LLMs to create medical notes. Preiksaitis et al. believe ChatGPT should not be used for medical documentation at all, arguing that the “technology may threaten ground truth, propagate biases and further dissociate clinicians from their most human skills.” However, it’s important to keep in mind that most clinicians’ notes are not carefully reasoned, human-centered, detailed stories to begin with.
In addition, the intent of early generative AI experiments is not to replace humans but to create a skeleton note that humans can augment and edit, reducing administrative burden. By decreasing the time spent on documentation, these tools give clinicians more time for patient care and clinical decision making.