In one study, OpenAI’s GPT-4 correctly diagnosed 52.7% of complex and difficult clinical cases, compared with 36% for medical journal readers, and outperformed 99.98% of simulated human readers. The research was published by the New England Journal of Medicine.
The evaluation, conducted by Danish researchers, used GPT-4 to examine 38 complex clinical cases, drawn from textual case information published online between January 2017 and January 2023, and asked the model to provide a diagnosis for each. GPT-4’s responses were compared with 248,614 answers from readers of online medical journals.
Each clinical case included a medical history along with a multiple-choice question offering six options for the most likely diagnosis. The prompts instructed GPT-4 to select a diagnosis by answering the multiple-choice question after analyzing the unedited full text of the clinical case report. Each case was presented to GPT-4 five times to assess reproducibility.
For comparison, the researchers collected medical journal readers’ votes for each case, simulated 10,000 sets of responses, and thereby created a pseudopopulation of 10,000 human participants.
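The study does not publish its simulation code, but the pseudopopulation idea can be sketched as follows: for each case, the observed reader votes define a probability distribution over the six answer options, and each simulated "reader" answers every case by sampling from that distribution. The vote shares below are toy numbers invented for illustration, and the convention that option 0 is the correct diagnosis is an assumption of this sketch.

```python
import random

random.seed(0)

# Toy data (not from the study): vote shares across the six answer options
# for three hypothetical cases; option 0 is taken to be the correct diagnosis.
vote_shares = [
    [0.40, 0.20, 0.15, 0.10, 0.10, 0.05],
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
    [0.55, 0.15, 0.10, 0.10, 0.05, 0.05],
]

def simulate_reader():
    """Count how many cases one simulated reader answers correctly,
    sampling each answer from that case's observed vote distribution."""
    correct = 0
    for shares in vote_shares:
        choice = random.choices(range(6), weights=shares)[0]
        if choice == 0:  # option 0 is the correct diagnosis in this toy setup
            correct += 1
    return correct

# Pseudopopulation of 10,000 simulated readers, mirroring the study's design.
scores = [simulate_reader() for _ in range(10_000)]
mean_accuracy = sum(scores) / (len(scores) * len(vote_shares))
```

A model’s score (e.g., GPT-4’s number of correct answers) can then be ranked against `scores` to estimate what fraction of simulated readers it outperforms.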
The most common case categories were infectious disease (15 cases, 39.5%), endocrinology (5 cases, 13.1%), and rheumatology (4 cases, 10.5%).
Patients in the clinical cases ranged in age from neonates to 89 years, and 37% were female.
The March 2023 release of GPT-4 correctly diagnosed a mean of 21.8 cases (57%) with good reproducibility, while medical journal readers correctly diagnosed a mean of 13.7 cases (36%).
The March release of GPT-4 was trained on online data available through September 2021, so the researchers also evaluated cases published before and after that training-data cutoff separately. In that analysis, GPT-4 correctly diagnosed 52.7% of cases published up to September 2021 and 75% of cases published after September 2021.
“GPT-4 has high reproducibility, and our temporal analysis suggests that the accuracy we observed is not due to these cases appearing in the model’s training data. However, performance appeared to vary between different versions of GPT-4, with the latest version performing slightly worse. Although our study showed promising results, GPT-4 missed almost all secondary diagnoses,” the researchers wrote.
“…our results, together with recent findings by other researchers, indicate that the current GPT-4 model may already be clinically promising. However, appropriate clinical trials are required to ensure that the technology is safe and effective for clinical use.”
Why is it important?
The researchers noted limitations of the study, including the unknown medical expertise of the medical journal readers, and cautioned that their results may therefore represent a best-case scenario in favor of GPT-4.
Still, the researchers concluded that even when compared against the medical journal readers with the most correct answers, GPT-4 outperformed 72% of human readers.
The researchers emphasized the importance of future models incorporating training data from developing countries, and the need for ethical considerations to ensure the technology benefits patients globally.
“As we move towards this future, we need to address not only regulatory issues around data protection and privacy, but also the ethical implications of the lack of transparency in commercial models such as GPT-4,” the study authors wrote.
“Finally, clinical studies assessing accuracy, safety, and effectiveness should be conducted prior to future implementation. Once these issues are resolved and AI improves, we can expect society to increasingly rely on AI, under human oversight, as a tool to support the decision-making process rather than as a replacement for doctors.”