Use of GPT-4 to Diagnose Complex Clinical Cases


Abstract

We assessed the performance of the newly released large language model GPT-4 in diagnosing complex medical case challenges and compared its success rate with that of medical-journal readers. GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis; however, further improvement, validation, and attention to ethical considerations are needed before clinical implementation. (No funding was obtained for this study.)

Introduction

The combination of a shortage of physicians and the increasing complexity of medicine, driven in part by rapidly expanding diagnostic possibilities, already poses a significant challenge to the timely and accurate delivery of diagnoses. Given demographic changes and an aging population, this workload is expected to grow even further in the years to come, highlighting the need for new technological development. AI has existed for decades and has previously shown promising results in single-modality fields of medicine, such as medical imaging.1 The continued development of AI, including the large language model (LLM) known as the Generative Pretrained Transformer (GPT), has enabled research in exciting new areas, such as the generation of discharge summaries2 and patient clinical letters. Recently, a paper exploring the potential of GPT-4 showed that it was able to answer questions from the U.S. Medical Licensing Examination correctly.3 However, how well it performs on real-life clinical cases is less well understood. For example, it remains unclear to what extent GPT-4 can aid in clinical cases that contain long, complicated, and varied patient descriptions and how it performs on these complex real-world cases compared with humans.

We assessed the performance of GPT-4 in real-life medical cases by comparing its performance with that of medical-journal readers. Our study utilized available complex clinical case challenges with comprehensive full-text information published online between January 2017 and January 2023.4 Each case presents a medical history and a poll with six options for the most likely diagnosis. To solve the case challenges, we provided GPT-4 with a prompt and a clinical case (see Supplementary Methods 1 in the Supplementary Appendix). The prompt instructed GPT-4 to solve the case by answering a multiple-choice question, followed by the full unedited text of the clinical case report. Laboratory information contained in tables was converted to plain text and included in the case. The version of GPT-4 available to us could not accept images as input, so we added the unedited image descriptions given in the clinical cases to the case text. The March 2023 edition of GPT-4 (maximum determinism: temp=0) was given each case five times to assess reproducibility across repeated runs. The same procedure was repeated with the current (September 2023) edition of GPT-4 to test the model's behavior over time. Because the cases were published online from 2017 to 2023 and GPT-4's training data include online material only up to September 2021, we also performed a temporal analysis comparing performance on cases published before and after that cutoff. For medical-journal readers, we collected the number and distribution of votes for each case. Using these observations, we simulated 10,000 sets of answers to all cases, resulting in a pseudopopulation of 10,000 generic human participants. The answers were simulated as independent Bernoulli-distributed variables (correct/incorrect answer) with marginal distributions as observed among medical-journal readers (see Supplementary Methods 2).
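A minimal sketch of this simulation step is shown below, assuming the per-case fraction of correct reader votes is available; the per-case rates and the GPT-4 score used here are illustrative placeholders, not the study data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder inputs (illustrative only, not the study data):
# observed fraction of correct reader votes for each of the 38 cases ...
p_correct = rng.uniform(0.15, 0.60, size=38)

# ... and an example GPT-4 score (the study reported a mean of 21.8 of 38).
gpt4_correct = 22

# Simulate 10,000 pseudoreaders; each answers every case independently,
# with a case-specific probability of being correct (Bernoulli draws).
n_readers = 10_000
answers = rng.random((n_readers, p_correct.size)) < p_correct
reader_scores = answers.sum(axis=1)

# Share of the simulated pseudopopulation that GPT-4 outperforms.
share_beaten = (reader_scores < gpt4_correct).mean()
print(f"GPT-4 outperforms {share_beaten:.2%} of simulated readers")
```

With the study's actual per-case vote distributions in place of the placeholder rates, the same comparison yields the reported percentile of GPT-4 within the pseudopopulation.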

We identified 38 clinical case challenges and a total of 248,614 answers from online medical-journal readers.4 The most common diagnoses among the case challenges were in the field of infectious disease, with 15 cases (39.5%), followed by endocrinology with 5 cases (13.1%) and rheumatology with 4 cases (10.5%). Patients represented in the clinical cases ranged in age from newborn to 89 years (median [interquartile range], 34 [18 to 57]), and 37% were female. With six poll options per case, the number of correct diagnoses among the 38 cases expected by chance alone is 6.3 (16.7%). The March 2023 edition of GPT-4 correctly diagnosed a mean of 21.8 cases (57%) with good reproducibility across the five runs (55.3%, 57.9%, 57.9%, 57.9%, and 57.9%), whereas the medical-journal readers on average correctly diagnosed 13.7 cases (36%) (see Supplementary Table 1 and Supplementary Methods 1). GPT-4 correctly diagnosed 15.8 cases (52.7%) of those published up to September 2021 and 6 cases (75.0%) of those published after September 2021. Based on the simulation, we found that GPT-4 performed better than 99.98% of the pseudopopulation (Fig. 1). The September 2023 edition of GPT-4 correctly diagnosed 20.4 cases (54%).

Figure 1. Number of Correct Answers of GPT-4 Compared with Guessing and a Simulated Population of Medical-Journal Readers.

Limitations

An important study limitation is the use of a poorly characterized population of human journal readers with unknown levels of medical skill. Moreover, we cannot assess whether the responses provided for the clinical cases reflect the readers' maximum effort. Consequently, our results may represent a best-case scenario in favor of GPT-4. The assumption of independent answers across the 38 cases in our pseudopopulation is somewhat unrealistic, because some readers might consistently perform better or worse than others, and the probability of answering a case correctly likely depends on each reader's level of medical skill and on how those skills are distributed in the population. However, even in the extreme case of maximally correlated correct answers among the medical-journal readers, GPT-4 would still perform better than 72% of human readers.
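One way to read the "maximally correlated" bound is as a comonotonic coupling: every reader is summarized by a single shared latent draw, and a reader answers a case correctly whenever that draw falls below the case's observed correct-answer rate. The sketch below illustrates how such a bound could be computed under that assumption; the inputs are placeholders, and this is not necessarily the exact calculation used in the study.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Placeholder inputs (illustrative only, not the study data).
p_correct = rng.uniform(0.15, 0.60, size=38)  # per-case correct-answer rates
gpt4_correct = 22                             # example GPT-4 score

# Comonotonic ("maximally correlated") readers: each reader is described by
# one latent draw u and answers case i correctly iff u <= p_i.
n_readers = 10_000
latent = rng.random(n_readers)

# A reader's score is therefore the number of cases whose rate exceeds u.
reader_scores = (latent[:, None] <= p_correct[None, :]).sum(axis=1)

share_beaten = (reader_scores < gpt4_correct).mean()
print(f"Under maximal correlation, GPT-4 beats {share_beaten:.2%} of readers")
```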

Conclusions

In this pilot assessment, we compared the diagnostic accuracy of GPT-4 on complex challenge cases with that of journal readers who answered the same questions on the Internet. GPT-4 performed surprisingly well in solving the complex case challenges, and even better than the medical-journal readers. GPT-4 showed high reproducibility, and our temporal analysis suggests that the accuracy we observed is not explained by these cases' appearing in the model's training data. However, performance did appear to change between versions of GPT-4, with the newest version performing slightly worse. Although it demonstrated promising results in our study, GPT-4 still missed almost every second diagnosis. Furthermore, real-world diagnostic work does not come with a predefined list of answer options, as case challenges do. However, a recently published letter reported that GPT-4, tested on a closely related data set, demonstrated diagnostic abilities even without multiple-choice options.5

Currently, GPT-4 is not specifically designed for medical tasks. However, progress on AI models is expected to continue to accelerate, which could lead to faster diagnoses and improve outcomes and efficiency in many areas of health care.1 While efforts to develop such specialized models are under way, our results, together with recent findings by other researchers,5 indicate that the current GPT-4 model may hold clinical promise today. However, proper clinical trials are needed to ensure that this technology is safe and effective for clinical use. Additionally, whereas GPT-4 in our study worked only on written records, future, more specialized AI tools are expected to incorporate other data sources, including medical imaging and structured numerical measurements, into their predictions. Importantly, future models should include training data from developing countries to ensure a broad, global benefit of this technology and to reduce the potential for health care disparities. AI based on LLMs might be relevant not only for in-patient hospital settings but also for first-line screening, whether performed in general practice or by patients themselves. As we move toward this future, the ethical implications of the lack of transparency of commercial models such as GPT-4 need to be addressed,1 as do regulatory issues concerning data protection and privacy. Finally, clinical studies evaluating accuracy, safety, and validity should precede future implementation. Once these issues have been addressed and AI improves, society is expected to rely increasingly on AI as a tool that supports the decision-making process with human oversight, rather than as a replacement for physicians.

What the New GPT-4 AI Can Do


OpenAI just released an updated version of its text-generating artificial intelligence program. Here’s how GPT-4 improves on its predecessor.

Tech research company OpenAI has just released an updated version of its text-generating artificial intelligence program, called GPT-4, and demonstrated some of the language model’s new abilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor; it can also process images in addition to text. But the AI is still vulnerable to some of the same problems that plagued earlier GPT models: displaying bias, overstepping the guardrails intended to prevent it from saying offensive or dangerous things and “hallucinating,” or confidently making up falsehoods not found in its training data.

On Twitter, OpenAI CEO Sam Altman described the model as the company’s “most capable and aligned” to date. (“Aligned” means it is designed to follow human ethics.) But “it is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it,” he wrote in the tweet.

Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot output pictures (as do generative AI models such as DALL-E and Stable Diffusion), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was told to identify what was funny about a humorous image. Being able to do so means “understanding context in the image. It’s understanding how an image is composed and why and connecting it to social understandings of language,” she says. “ChatGPT wasn’t able to do that.”

A device with the ability to analyze and then describe images could be enormously valuable for people who are visually impaired or blind. For instance, a mobile app called Be My Eyes can describe the objects around a user, helping those with low or no vision interpret their surroundings. The app recently incorporated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”

But GPT-4’s image analysis goes beyond describing the picture. In the same demonstration Vee watched, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. Next the model was asked to write the code required to produce such a website—and it did. “It looked basically like what the image is. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate professor at the University of Southern California. “So that was cool.”

Even without its multimodal capability, the new program outperforms its predecessors at tasks that require reasoning and problem-solving. OpenAI says it has run both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of a lawyer’s bar exam, the SAT and Advanced Placement tests for high schoolers, the GRE for college graduates and even a couple of sommelier exams. GPT-4 achieved human-level scores on many of these benchmarks and consistently outperformed its predecessor, although it did not ace everything: it performed poorly on English language and literature exams, for example. Still, its extensive problem-solving ability could be applied to any number of real-world applications—such as managing a complex schedule, finding errors in a block of code, explaining grammatical nuances to foreign-language learners or identifying security vulnerabilities.

Additionally, OpenAI claims the new model can interpret and output longer blocks of text: more than 25,000 words at once. Although previous models were also used for long-form applications, they often lost track of what they were talking about. And the company touts the new model’s “creativity,” described as its ability to produce different kinds of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 imitated the style of Argentine author Jorge Luis Borges in English translation, Vee noted that the more recent model produced a more accurate attempt. “You have to know enough about the context in order to judge it,” she says. “An undergraduate may not understand why it’s better, but I’m an English professor…. If you understand it from your own knowledge domain, and it’s impressive in your own knowledge domain, then that’s impressive.”

May has also tested the model’s creativity himself. He tried the playful task of ordering it to create a “backronym” (an acronym reached by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “It came up with ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial intelligence And Machine Education,’” he says. “‘Machine Education’ is not great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen way worse.” (For context, his lab’s actual name is CUTE LAB NAME, or the Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence). In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet—he requested a form used by Italian poet Petrarch—the model, unfamiliar with that poetic setup, defaulted to the sonnet form preferred by Shakespeare.

Of course, fixing this particular issue would be relatively simple. GPT-4 merely needs to learn an additional poetic form. In fact, when humans goad the model into failing in this way, this helps the program develop: it can learn from everything that unofficial testers enter into the system. Like its less fluent predecessors, GPT-4 was originally trained on large swaths of data, and this training was then refined by human testers. (GPT stands for generative pretrained transformer.) But OpenAI has been secretive about just how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published alongside the release of the new model, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects this newly competitive generative AI environment, where GPT-4 must vie with programs such as Google’s Bard and Meta’s LLaMA. The paper does go on to suggest, however, that the company plans to eventually share such details with third parties “who can advise us on how to weigh the competitive and safety considerations … against the scientific value of further transparency.”

Those safety considerations are important because smarter chatbots have the ability to cause harm: without guardrails, they might provide a terrorist with instructions on how to build a bomb, churn out threatening messages for a harassment campaign or supply misinformation to a foreign agent attempting to sway an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop—they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before GPT-4’s release. “I don’t think [version] four is going to change that.”

And the more humanlike these bots become, the better they are at fooling people into thinking there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well, through language, we believe that—but underneath the hood, it’s not reasoning in any way similar to the way that humans do,” Vee cautions. If this illusion fools people into believing an AI agent is performing humanlike reasoning, they may trust its answers more readily. This is a significant problem because there is still no guarantee that those responses are accurate. “Just because these models say anything, that doesn’t mean that what they’re saying is [true],” May says. “There isn’t a database of answers that these models are pulling from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data—and that training data can become outdated. “I believe GPT-4 doesn’t even know that it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there’s no such thing as GPT-4. I’m GPT-3.’”
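As a minimal toy sketch of the "one word at a time" process May describes, the snippet below repeatedly picks the most plausible continuation from a tiny probability table. The table and function here are invented for illustration only; a real model such as GPT-4 computes these probabilities with a neural network over an enormous vocabulary.

```python
# Toy illustration of next-word generation: at each step, choose the most
# probable continuation from a hand-made probability table. Real LLMs compute
# these probabilities with a neural network; this table is a stand-in.
NEXT_WORD_PROBS = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "dog": {"barked": 0.9, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "barked": {"<end>": 1.0},
}

def generate(max_words: int = 10) -> str:
    word, output = "<start>", []
    for _ in range(max_words):
        # Greedy choice: take the single most plausible next word.
        word = max(NEXT_WORD_PROBS[word], key=NEXT_WORD_PROBS[word].get)
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)

print(generate())  # e.g. "the cat sat"
```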

Now that the model has been released, many researchers and AI enthusiasts have an opportunity to probe GPT-4’s strengths and weaknesses. Developers who want to use it in other applications can apply for access, and anyone who wants to “talk” with the program will have to subscribe to ChatGPT Plus. For $20 per month, this paid program lets users choose between talking with a chatbot that runs on GPT-3.5 and one that runs on GPT-4.

Such explorations will undoubtedly uncover more potential applications—and flaws—in GPT-4. “The real question should be ‘How are people going to feel about it two months from now, after the initial shock?’” Marcus says. “Part of my advice is: let’s temper our initial enthusiasm by realizing we have seen this movie before. It’s always easy to make a demo of something; making it into a real product is hard. And if it still has these problems—around hallucination, not really understanding the physical world, the medical world, etcetera—that’s still going to limit its utility somewhat. And it’s still going to mean you have to pay careful attention to how it’s used and what it’s used for.”

GPT-4 is here: what scientists think


Researchers are excited about the AI — but many are frustrated that its underlying engineering is cloaked in secrecy.


Artificial intelligence company OpenAI this week unveiled GPT-4, the latest incarnation of the large language model that powers its popular chatbot ChatGPT. The company says GPT-4 contains big improvements — it has already stunned people with its ability to create human-like text and generate images and computer code from almost any prompt. Researchers say these abilities have the potential to transform science — but some are frustrated that they cannot yet access the technology, its underlying code or information on how it was trained. That raises concerns about the technology’s safety and makes it less useful for research, say scientists.

GPT-4 was released on 14 March, and one upgrade is that it can now handle images as well as text. And as a demonstration of its language prowess, OpenAI, which is based in San Francisco, California, says that it passed the US bar legal exam with results in the ninetieth centile, compared with the tenth centile for the previous version of ChatGPT. But the technology is not yet widely accessible — only paying subscribers to ChatGPT so far have access.

“There’s a waiting list at the moment so you cannot use it right now,” says Evi-Anne van Dis, a psychologist at the University of Amsterdam Medical Center. But she has seen demos of GPT-4. “We watched some videos in which they demonstrated capacities and it’s mind blowing,” she says. One instance, she recounts, was a hand-drawn doodle of a website, which GPT-4 used to produce the computer code needed to build that website, demonstrating its ability to handle images as inputs.

But there is frustration in the science community over OpenAI’s secrecy around how the model was trained and what data were used, and how GPT-4 actually works. “All of these closed-source models, they are essentially dead ends in science,” says Sasha Luccioni, a research scientist specializing in climate at HuggingFace, an open-source AI cooperative. “They [OpenAI] can keep building upon their research, but for the community at large, it’s a dead end.”

‘Red team’ testing

Andrew White, a chemical engineer at the University of Rochester, New York, has had privileged access to GPT-4 as a ‘red-teamer’: a person paid by OpenAI to test the platform and try to make it do something bad. He has had access to GPT-4 for the past six months, he says. “Early on in the process, it didn’t seem that different,” compared with previous iterations.

He put queries to the bot about the chemical-reaction steps needed to make a compound, predicting the reaction yield, and choosing a catalyst. “At first, I was actually not that impressed,” White says. “It was really surprising because it would look so realistic, but it would hallucinate an atom here. It would skip a step there,” he adds. But when, as part of his red-team work, he gave GPT-4 access to scientific papers, things changed drastically. “It made us realize that these models maybe aren’t so great just alone. But when you start connecting them to the Internet to tools like a retrosynthesis planner, or a calculator, all of a sudden, new kinds of abilities emerge.”

And with those abilities come concerns. For instance, could GPT-4 allow dangerous chemicals to be made? With input from people such as White, OpenAI’s engineers fed such findings back into their model to discourage GPT-4 from creating dangerous, illegal or damaging content, White says.

Fake facts

Outputting false information is another problem. Luccioni says that models such as GPT-4, which exist to predict the next word in a sentence, can’t be cured of coming up with fake facts — known as hallucinating. “You can’t rely on these kinds of models because there’s so much hallucination,” she says. And this remains a concern in the latest version, she says, although OpenAI says that it has improved safety in GPT-4.

Without access to the data used for training, OpenAI’s assurances about safety fall short for Luccioni. “You don’t know what the data is. So you can’t improve it. I mean, it’s just completely impossible to do science with a model like this,” she says.

The mystery about how GPT-4 was trained is also a concern for van Dis’s colleague at Amsterdam, psychologist Claudi Bockting. “It’s very hard as a human being to be accountable for something that you cannot oversee,” she says. “One of the concerns is they could be far more biased than, for instance, the bias that human beings have by themselves.” Without being able to access the code behind GPT-4, it is impossible to see where the bias might have originated, or to remedy it, Luccioni explains.

Ethics discussions

Bockting and van Dis are also concerned that these AI systems are increasingly owned by big tech companies. The researchers want to make sure the technology is properly tested and verified by scientists. “This is also an opportunity because collaboration with big tech can, of course, speed up processes,” she adds.

Van Dis, Bockting and colleagues argued earlier this year for an urgent need to develop a set of ‘living’ guidelines to govern how AI and tools such as GPT-4 are used and developed. They are concerned that any legislation around AI technologies will struggle to keep up with the pace of development. Bockting and van Dis have convened a summit of invited participants at the University of Amsterdam on 11 April to discuss these concerns, with representatives from organizations including the science-ethics committee of UNESCO, the United Nations’ scientific and cultural agency, the Organisation for Economic Co-operation and Development and the World Economic Forum.

Despite the concern, GPT-4 and its future iterations will shake up science, says White. “I think it’s actually going to be a huge infrastructure change in science, almost like the Internet was a big change,” he says. It won’t replace scientists, he adds, but could help with some tasks. “I think we’re going to start realizing we can connect papers, data programs, libraries that we use and computational work or even robotic experiments.”