By Dharm Sodha, BS; Roth LaFleur; Florian Geier, PhD; and Peter Lio, MD
In the rapidly expanding realm of machine learning and artificial intelligence (AI), innovative “chatbots” have captured the attention of many. ChatGPT surpassed Instagram as the fastest-growing application by user base, reaching one million users within the first five days of its launch and a staggering 180.5 million users as of March 2024.1
The remarkable capabilities of ChatGPT, including composing full-length feature articles, developing entire websites, and even passing medical examinations, have sparked widespread conversations about the potential of AI.2 These sophisticated large language models are evolving at an unprecedented pace, as evidenced by the arrival of the most recent update (GPT-4o) amid continual adaptation in response to user feedback.3 It seems inevitable that patients and healthcare practitioners alike will turn to these tools for medical advice and medical decision support.
In this exploration, we took a brief look at how ChatGPT and other large language models respond to questions posed by Top Derm, an educational game for dermatology clinicians created by Level Ex.4
In our comparison of AI responses to questions from the dermatology education game, we focused on questions regarding treatment for atopic dermatitis (AD) and identification of squamous cell carcinoma (SCC) risk from histological variants. The questions were posed in text form, and the large language model was given the same answer options clinicians would see in the app interface. Researchers asked ChatGPT five questions about AD, including “Is beginning phototherapy an appropriate approach for a patient stepping up from mild to moderate atopic dermatitis or from moderate to severe atopic dermatitis?”
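For readers curious about reproducing this kind of exercise at scale, the minimal sketch below shows how one such binary-choice question could be posed programmatically. It assumes the OpenAI Python SDK and an illustrative model name; our own testing was conducted conversationally, and the system prompt and answer-format wording here are only examples, not the exact prompts used.

```python
# Hypothetical sketch only: the Top Derm questions were posed to ChatGPT
# conversationally, not through this script. This shows how a similar
# binary-choice prompt could be reproduced with the OpenAI Python SDK;
# the model name, system prompt, and wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects an OPENAI_API_KEY environment variable

question = (
    "Is beginning phototherapy an appropriate approach for a patient stepping up "
    "from mild to moderate atopic dermatitis or from moderate to severe atopic "
    "dermatitis? Answer with one option only: 'mild to moderate' or 'moderate to severe'."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice of model
    messages=[
        {"role": "system", "content": "You are answering a dermatology board-style question."},
        {"role": "user", "content": question},
    ],
)

# Print the model's single-choice answer for manual scoring.
print(response.choices[0].message.content)
```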
ChatGPT performed remarkably well on these questions, answering all five correctly. However, there were multiple instances of a “non-answer,” in which the language model had to be redirected to provide a specific, less ambiguous response.
In the rarer, more exotic “zebra” SCC cases, ChatGPT also performed adequately, answering four of the five questions correctly; it needed to be redirected once to arrive at the correct response. In these questions, ChatGPT was asked to determine whether the risk for SCC was low/moderate or not low/moderate in the following cases: lymphoepithelioma-like carcinoma, keratoacanthoma, verrucous carcinoma, adenosquamous carcinoma, and pseudovascular SCC.
ChatGPT added a disclaimer to these responses, noting that “it’s important to remember that these risk categories are generalizations, and individual cases may differ. The risk classification can be influenced by factors such as tumor size, location, and the patient’s immune status. It’s always best to consult with a healthcare professional for specific information and management of cSCC cases.”
Confabulation, Hallucinations, and Other ChatGPT Shortcomings
ChatGPT’s largest shortcomings were in classifying histological variants of cSCC as higher risk or not higher risk; it correctly answered only two of the five questions in this series.
A more interesting and alarming discovery emerged during some of our prompts: what the AI community refers to as the “confabulation problem.” In other words, ChatGPT sounds equally confident when it is incorrect, making it very difficult to know when an answer might be untrustworthy. There was no sense of “hedging,” and the model never said anything like “I’m not sure of this answer…” or offered other statements that might prompt a questioner to verify the information.
When ChatGPT gave an incorrect response, researchers clarified the question. If it still provided incorrect information, they sought to learn how it was arriving at its answer. Asking these language models what sources informed an answer would sometimes produce false journal names, false citations, and even false author names, often combining the first and last names of two different authors. This behavior, known as an AI hallucination, is somewhat chilling.
Here is an excerpt of one such exchange:
Researchers: Follow-up question: Which studies suggest this?
ChatGPT: “There are only a few studies that have reported on the clinical behavior and outcomes of SCC with osteoclast-like giant cells due to its rarity. However, some case reports and case series have suggested that this variant may have a higher risk of recurrence and metastasis compared to conventional SCC.
For example, a case series published in the Journal of Cutaneous Pathology in 2017 reported on three cases of SCC with osteoclast-like giant cells, all of which had local recurrence and one of which had regional lymph node metastasis. Another case report published in the Journal of Dermatological Case Reports in 2019 described a case of SCC with osteoclast-like giant cells that had local recurrence and multiple regional lymph node metastases.
Although these studies are limited in scope, they suggest that SCC with osteoclast-like giant cells may have a more aggressive clinical course compared to conventional cutaneous SCC (cSCC). However, further studies with larger sample sizes and longer follow-up periods are needed to better understand the clinical behavior and outcomes of this rare variant of cSCC.”
Researchers: Can you provide a link to these studies?
ChatGPT: “Certainly! Here are the links to the studies I mentioned:
“Squamous cell carcinoma with osteoclast-like giant cells: a series of three cases and review of the literature”: https://onlinelibrary.wiley.com/doi/abs/10.1111/cup.12893
“Squamous cell carcinoma with osteoclast-like giant cells and pulmonary metastases: a case report”: https://www.journalofcasereports.com/articles/squamous-cell-carcinoma-with-osteoclast-like-giant-cells-and-pulmonary-metastases-a-case-report.html
Please note that these studies are for informational purposes only and should not be used as a substitute for medical advice. If you have any concerns about your health, please consult a qualified healthcare professional.”
We could not find these studies. The link provided leads to a study called “Angiotropic syringomatous carcinoma,” which, while related to carcinomas, is decidedly not about SCC with osteoclast-like giant cells.
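One practical safeguard is to verify any citation an AI model produces against a bibliographic database before trusting it. The minimal sketch below, which assumes the public CrossRef REST API and uses a matching function of our own devising, shows how such a check might look; it was not part of the evaluation described above.

```python
# Hypothetical sketch only: a simple way to check whether a title cited by an
# AI model actually appears in the CrossRef bibliographic database before
# trusting it. The function name and matching logic are illustrative.
import requests


def title_found_in_crossref(quoted_title: str, rows: int = 5) -> bool:
    """Return True if CrossRef lists a work whose title closely matches the quote."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": quoted_title, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    # Look for a title that contains, or is contained in, the quoted title.
    for item in resp.json()["message"]["items"]:
        for title in item.get("title", []):
            if quoted_title.lower() in title.lower() or title.lower() in quoted_title.lower():
                return True
    return False


if __name__ == "__main__":
    claimed = ("Squamous cell carcinoma with osteoclast-like giant cells: "
               "a series of three cases and review of the literature")
    print("Found in CrossRef:", title_found_in_crossref(claimed))
```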
ChatGPT vs. Bing vs. Google Bard
We also analyzed some responses from Bing’s AI search and Google’s Bard, each of which had its own shortcomings. Bing provided correct answers to many questions, but only when the answer was readily available via an internet search. Bing properly cited the websites from which it pulled information, but it did very little “thinking” of its own. This led to many general “it depends” answers, or to avoiding the question altogether when the answer could not be readily found in a search.
By contrast, Bard often responded by stating, “I’m a language model and don’t have the capacity to help with that.” The few times we successfully pulled an answer from Bard involved multiple-choice questions; it could not answer questions of a binary (less severe/more severe) nature.
Among these AI systems, ChatGPT-4 was the only one that provided usable answers to specific questions more than 90% of the time. However, all of these technologies are changing and advancing rapidly, so this standing is likely to remain in continuous flux.
Regarding medical advice and knowledge, ChatGPT can answer questions on certain dermatological topics at rates similar to those of an actual dermatologist. Despite these impressive responses, there are dangers in relying on AI for medical advice or diagnosis: ChatGPT is often confidently incorrect, and patients are often more complex than multiple-choice questions.
So, should ChatGPT be used for diagnosis instead of physicians? We believe ChatGPT itself said it best:
“No, ChatGPT should not be used to diagnose patients or replace physicians. While ChatGPT is a powerful language model that can understand and process large amounts of text, it is not a substitute for the knowledge, skills, and experience of a trained medical professional.
Diagnosing and treating medical conditions requires a combination of medical knowledge, clinical experience, and patient interaction, which cannot be fully replicated by a machine-learning model like ChatGPT. Additionally, the consequences of misdiagnosis or inappropriate treatment can be severe, making it essential to have human medical professionals involved in the process.
ChatGPT can be a helpful tool for providing information, answering questions, and assisting with decision-making in the medical field. Still, it should always be used in conjunction with, not instead of, medical professionals.”
About the authors
Peter Lio, MD
Peter A. Lio, MD is a Clinical Assistant Professor of Dermatology and Pediatrics at Northwestern University Feinberg School of Medicine and a partner at Medical Dermatology Associates of Chicago.
Disclosures: Dr. Lio is an advisor for Level Ex.
Dharm Sodha
Dharm Sodha is a fourth-year medical student at the University of Illinois at Chicago.
Disclosures: None
Roth LaFleur
Roth LaFleur is Head of Product at Level Ex, Inc.
Disclosures: Employee of Level Ex, Inc.
Florian Geier, PhD
Florian Geier, PhD, is Vice President and Head of Strategy and Pharma Sales at Level Ex, Inc.
Disclosures: Employee of Level Ex, Inc.
References:
1. How many users does ChatGPT have? Statistics & facts. SEO.ai. 2024. https://seo.ai/blog/how-many-users-does-chatgpt-have
2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
3. Hughes A. ChatGPT: everything you need to know about OpenAI’s GPT-4 tool. BBC Science Focus Magazine. April 3, 2023. Accessed April 11, 2023. https://www.sciencefocus.com/future-technology/gpt-3/
4. Top Derm. Level Ex. Accessed April 12, 2023. https://www.levelex.com/games/top-derm/