A Qualitative Assessment of Medical Diagnosis Capabilities of Three Artificial Intelligence Models: ChatGPT-4o, CodyMD, and Dr. Gupta
1. Introduction
The rapid development of technologies related to artificial intelligence (AI) offers new opportunities for its implementation in clinical practice. The vast availability of diverse patient data, such as medical images [1], text, and electronic health records [2], combined with numerous studies evaluating the potential of AI in areas such as prevention [3], screening and diagnosis [4], disease progression prediction [5], and clinical decision-making and treatment selection [6], provides the foundation for developing AI solutions that can greatly enhance the diagnostic process.
Studies evaluating the diagnostic capabilities of various AI models exist [7], but important questions remain regarding their reliability, accuracy, and ability to interact with medical professionals and patients. This study focuses on three AI models (ChatGPT-4o, CodyMD, and Dr. Gupta) with the aim of assessing their diagnostic capabilities based on patient-reported symptoms. Key aspects such as diagnostic accuracy, level of interaction, and the differential diagnoses provided are evaluated. Understanding these factors is critical: AI has the potential to assist medical professionals in decision-making, but it also carries a risk of errors, especially in more complex cases. Effective implementation of AI in medical practice therefore requires research that evaluates the diagnostic accuracy of different models, which in turn supports their development, improvement, and the training of the underlying Natural Language Processing (NLP) models. Such research is vital for advancing medical care and public health, as it enriches the datasets required for model training and provides key insights for optimizing and adapting algorithms to real clinical needs.
This study aims to qualitatively evaluate the diagnostic capabilities of three different AI models—ChatGPT-4o, CodyMD, and Dr. Gupta—in terms of accuracy, user approach, level of detail, user interaction, and the number of differential diagnoses generated, based solely on patient-reported symptoms. ChatGPT-4o is one of the most widely used AI models, while CodyMD and Dr. Gupta are specifically designed for medical purposes. The qualitative evaluation of their diagnostic methods in this study provides valuable insights for both users and developers.
2. Materials and Methods
Based on the results of our previous research [7] on the diagnostic capabilities of ChatGPT-3.5, where the AI demonstrated over 70% diagnostic accuracy, we decided to extend our research and investigate the diagnostic capabilities of ChatGPT-4o as well as other active medical AIs. The methodological approach (Figure 1) follows a structured process that includes selecting the AI models, choosing clinical cases, and analyzing the diagnostic results from the AI models.
Figure 1. Flowchart of the research methodology.
2.1. AI Model Selection
The selection of AI models for this study was carefully considered to ensure a diverse and relevant evaluation. ChatGPT-4o was chosen due to its broad usage by the general population for a wide range of informational purposes, including medical diagnostics. In contrast, CodyMD and Dr. Gupta were selected specifically for their focused applications in the medical field, offering specialized diagnostic capabilities that align with the study’s objectives. During the research phase, other AI models like Docus and Doctronic also emerged through a search using the keyword “AI doctors”. However, CodyMD and Dr. Gupta were ultimately selected for their established reputation and strong performance in medical diagnosis. CodyMD is known for its integration with clinical decision-making, and Dr. Gupta has a track record of being used in real-world healthcare scenarios, making them more suitable for this comparison.
CodyMD was created by David Sanders and Albert DiPiero. It is designed to work alongside real doctors and offers several features, including “Medical Diagnosis”, “Specialists”, “Treat”, “Ask”, “Interpret”, and “Talk”. Its website presents it as a sophisticated product with a wide range of capabilities. In the present study, we use only the “Medical Diagnosis” feature.
Dr. Gupta is another AI specifically developed for medical purposes by DL Software Inc. It utilizes advanced natural language processing and machine learning tools to interpret user queries and provide accurate medical information and advice. Dr. Gupta allows users to input data such as age, weight, symptoms, allergies, medications, vitals (temperature, heart rate, respiratory rate, oxygen saturation, waist circumference, hip circumference, systolic blood pressure, and diastolic blood pressure), and lab test results. While this offers a more comprehensive view of the patient’s condition, our goal is to explore diagnostic models based solely on patient-reported symptoms. Therefore, we limit our input to age and symptom data for each patient. This ensures that all three AIs receive the same information and operate under identical conditions for this study.
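To make this restriction concrete, the minimal sketch below expresses the input given identically to each model: only the patient’s age and reported symptoms. The record type and field names are ours, introduced purely for illustration; they are not part of any of the three products.

```python
# Illustrative only: this record mirrors the restricted input given to all
# three AIs (age plus reported symptoms). Richer Dr. Gupta fields such as
# vitals and lab results were deliberately left unused in this study.
from dataclasses import dataclass


@dataclass(frozen=True)
class VignetteInput:
    age: int                   # patient age in years
    symptoms: tuple[str, ...]  # patient-reported symptoms, verbatim

# Case 1 from Table 1, expressed in this form.
case_1 = VignetteInput(
    age=18,
    symptoms=(
        "partially blocked left nostril with bilateral nasal itching",
        "sneezing for up to 1 h, 80-100 sneezes per day, usually in the morning",
        "watery discharge from the nose (rhinorrhea)",
        "heaviness in the head region",
        "loss of concentration",
        "weakness",
    ),
)
```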
2.2. Literature Review
The literature review was conducted through an extensive search of databases including PubMed, Google Scholar, Web of Science, and Scopus (Figure 1). Our focus was on articles published in English that relate to medical diagnoses made by AI. For our search, we used a combination of the keywords “medical diagnosis” and “artificial intelligence”. Due to the large number of articles, we narrowed our focus to “ChatGPT and medical diagnosis”. Literature searches using the keywords “Dr. Gupta and medical diagnosis” and “CodyMD and medical diagnosis” yielded no results; the absence of prior studies involving either model makes this study unique.
2.3. Selection of Clinical Cases to Evaluate the Diagnostic Capabilities of Models
For an effective evaluation of the AI models’ diagnostic performance, it was essential to select clinical cases representing a diverse range of medical conditions with varying levels of complexity and severity. This approach ensures that the AI models are challenged across various diagnostic scenarios, mimicking the daily challenges faced by a general practitioner and allowing for a better assessment of their diagnostic capabilities. For this purpose, we conducted an extensive search in PubMed (Figure 1) and selected cases that cover a spectrum of diseases, ranging from rare to common, with varying levels of complexity and clearly defined diagnoses confirmed by specialists. The case selection criteria were as follows:
Variety of Medical Conditions: The selected cases needed to include a wide range of conditions, including both common diseases and less common, diagnostically difficult cases. This diversity tests the flexibility of the AI models and ensures they are not trained on a specific subset of medical conditions.
Clear and Confirmed Diagnoses: Each case report had to include a well-documented diagnosis, confirmed by medical professionals and supported by clinical research. This criterion ensures a reliable benchmark against which the diagnostic accuracy of AI models can be measured.
Real-World Complexity: The selected cases presented varying levels of diagnostic difficulty to simulate real-world clinical scenarios that a general practitioner might encounter. Our chosen cases involved conditions with overlapping symptoms (e.g., COVID-19, systemic lupus erythematosus, celiac disease), requiring AI models to accurately distinguish between similar diseases. This allows us to assess the models’ ability to manage ambiguity in medical conditions.
Diversity in Symptom Presentation: The cases had to include a broad spectrum of symptoms, ranging from clear and distinctive to more subtle or non-specific manifestations. This criterion tests the AI’s ability to recognize both clear and nuanced clinical presentations.
Risk and Severity Levels: To evaluate how well AI models prioritize and recognize emergencies, we included cases with a mix of high-risk (e.g., appendicitis, COVID-19, heart failure) and low-risk conditions (e.g., allergic rhinitis, celiac disease). This aspect is crucial for assessing the models’ ability to differentiate between conditions requiring immediate intervention and those that do not.
Potential for Diagnostic Ambiguity: To challenge the AI models’ capacity for critical thinking and differential diagnosis, we included cases (e.g., irritable bowel syndrome, celiac disease, Alzheimer’s disease) where symptoms closely resemble those of other conditions, complicating the diagnosis. This encourages the models to consider a broader range of possibilities.
Based on these criteria, ten clinical cases were selected (Table 1). Each case was assigned a diagnostic difficulty rating on a scale of one to five stars, determined by the following factors (an illustrative scoring sketch follows the list):
Complexity of Symptoms: The variety and number of symptoms, including the degree of overlap with other conditions.
Diagnostic Ambiguity: The extent to which the symptoms resemble those of other diseases, making accurate diagnosis more challenging.
Severity and Level of Risk: The potential impact on the patient’s health and the urgency required for diagnosis and treatment.
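The study does not publish a formula for combining these three factors into a star rating; purely as a hypothetical illustration, one could score each factor from 1 to 5 and round the mean:

```python
# Hypothetical sketch only: the study does not specify a scoring formula.
# Each factor is rated 1 (low) to 5 (high), and the star rating is the
# rounded mean of the three scores.

def difficulty_stars(symptom_complexity: int, ambiguity: int, severity: int) -> str:
    """Combine the three factors into a one-to-five star rating."""
    for score in (symptom_complexity, ambiguity, severity):
        if not 1 <= score <= 5:
            raise ValueError("each factor must be scored from 1 to 5")
    stars = round((symptom_complexity + ambiguity + severity) / 3)
    return "★" * stars + "☆" * (5 - stars)

# Example: highly overlapping symptoms, high ambiguity, moderate risk.
print(difficulty_stars(symptom_complexity=4, ambiguity=5, severity=3))  # ★★★★☆
```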
For each case, a clinical vignette was created that included the patient’s age and disease symptoms (Table 1). These vignettes were presented to the three AI models in separate, isolated chat sessions, with each model being asked, “What is the most likely diagnosis?” This process ensured that none of the models was influenced by previous interactions. Each model received the same patient-reported symptoms and medical history, presented consistently and without any additional cues or guidance, allowing each model to analyze and respond to the clinical data independently. The diagnoses generated by the models are presented in Table 2.
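Of the three models, only ChatGPT-4o is also reachable programmatically; CodyMD and Dr. Gupta were used through their web chat interfaces. As a minimal sketch of how a vignette could be submitted to GPT-4o in a fresh, isolated session (assuming the official openai Python client, v1.x, and an OPENAI_API_KEY in the environment; this illustrates the protocol, not necessarily the exact tooling used in the study):

```python
# Minimal sketch, assuming the official `openai` Python client (v1.x) and an
# OPENAI_API_KEY set in the environment. CodyMD and Dr. Gupta expose no
# comparable public API and were queried via their web chat interfaces.
from openai import OpenAI

client = OpenAI()

# Case 1 vignette from Table 1.
VIGNETTE = (
    "An 18-year-old female patient presents the following symptoms: "
    "partially blocked left nostril with bilateral nasal itching; sneezing "
    "for up to 1 h, 80-100 sneezes per day, usually in the morning; watery "
    "rhinorrhea; heaviness in the head region; loss of concentration; weakness."
)

# A single stateless call with no prior messages acts as a separate, isolated
# chat session, so earlier cases cannot influence the answer.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"{VIGNETTE}\nWhat is the most likely diagnosis?",
    }],
)
print(response.choices[0].message.content)
```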
Table 1. Reported clinical symptoms, corresponding diagnoses, and diagnostic difficulties.
| Case Report | Complaints of the Patient | Correct Diagnosis | Diagnostic Difficulty |
|---|---|---|---|
| 1 | An 18-year-old female patient presents the following symptoms: partially blocked left nostril along with bilateral nasal itching; sneezing for up to 1 h, with 80 to 100 sneezes every day, usually in the morning; watery discharge from the nose (rhinorrhea); heaviness in the head region; loss of concentration; weakness [8] | Allergic Rhinitis | |
| 2 | A 27-year-old woman has progressively deteriorating abdominal symptoms over the previous 5 years. Complaints include: diarrhoea alternating with constipation and, on occasion, episodes of faecal incontinence; daily colicky abdominal pain accompanied by abdominal distension; lethargy; low back pain; nausea; bladder symptoms consistent with a diagnosis of irritable bladder [9] | Irritable Bowel Syndrome | |
| 3 | A 7-year-old boy presents the following symptoms: intermittent fevers; lower quadrant abdominal pain; vomiting, without bilious or bloody emesis [10] | Appendicitis | |
| 4 | A 36-year-old woman presents the following symptoms: cyclic pain at the C-section scar; moderate to severe dysmenorrhoea and dyspareunia; a painful, palpable, small firm mass of approximately 3 cm in the lower abdominal wall, at the site of the caesarean section scar [11] | (Abdominal Wall) Endometriosis | |
| 5 | An 18-year-old boy presents the following symptoms: acute onset of fever; rhinitis; myalgia; headache; decreased taste and smell sensation [12] | COVID-19 | |
| 6 | A 41-year-old woman has hypertension, hypothyroidism, and asthma. She presents the following symptoms: 1-month history of fever associated with chills and rigors; pleuritic chest pain; pain in the small joints of the hand; cold in the extremities; photosensitivity; 1-year history of progressive fatigue, arthralgia, 20 kg weight loss, and intermittent low- and high-grade fever [13] | Systemic Lupus Erythematosus | |
| 7 | A 35-year-old man has a medical history of hyperglycemia, hyperhidrosis, and high blood pressure. He presents the following symptoms: fatigue and cough; dyspnea accompanied by chest tightness and inability to lie supine at night; a bout of cold followed by general malaise [14] | Heart Failure | |
| 8 | A 6-year-old girl, a known asthmatic, presents the following symptoms: generalized bruises; fever; eczematous rashes; six or seven episodes of loose stools per day for 3 months, accompanied by loss of appetite [15] | Celiac Disease | |
| 9 | A 30-year-old white male presents the following symptoms: decrease of visual acuity; intermittent diplopia; photophobia in both eyes; paresthesia of the left hand [16] | Multiple Sclerosis | |
| 10 | A woman in her late 70s presents the following symptoms: significant short-term memory impairment; episodes of confusion; difficulty with language skills; reclusiveness and disengagement from her previous social networks; disorientation during seasonal changes that leads to periods of wandering and becoming lost [17] | Alzheimer’s disease | |
Table 2. Diagnoses offered by the three AI models.
| Case Report | ChatGPT-4o | CodyMD | Dr. Gupta | Correct Diagnosis |
|---|---|---|---|---|
| 1 | Allergic Rhinitis | 1. Allergic Rhinitis (60% probability); 2. Vasomotor Rhinitis (25% probability); 3. Common Cold (15% probability) | Allergic Rhinitis | Allergic Rhinitis |
| 2 | Irritable Bowel Syndrome | 1. Irritable Bowel Syndrome (IBS) (60% probability); 2. Inflammatory Bowel Disease (IBD) (30% probability); 3. Functional Dyspepsia (10% probability) | 1. Irritable Bowel Syndrome (IBS); 2. Inflammatory Bowel Disease; 3. Gastrointestinal infection | Irritable Bowel Syndrome |
| 3 | Appendicitis | 1. Gastroenteritis (40% probability); 2. Urinary tract infection (25% probability); 3. Appendicitis (20% probability) | 1. Viral or bacterial gastroenteritis; 2. Appendicitis; 3. Urinary tract infection | Appendicitis |
| 4 | Cesarean Scar Endometriosis | 1. Cesarean Scar Endometriosis (70% probability); 2. Scar Tissue Adhesion (20% probability); 3. Incisional Hernia (10% probability) | Incisional Hernia | (Abdominal Wall) Endometriosis |
| 5 | COVID-19 | 1. Influenza (40% probability); 2. Common Cold (30% probability); 3. COVID-19 (30% probability) | 1. Common cold or the flu; 2. Sinus infection | COVID-19 |
| 6 | Systemic Lupus Erythematosus (SLE) | 1. Rheumatoid arthritis (40% probability); 2. Systemic lupus erythematosus (30% probability); 3. Tuberculosis (20% probability) | 1. Systemic Lupus Erythematosus (SLE); 2. Tuberculosis (TB); 3. Rheumatoid Arthritis or other autoimmune conditions | Systemic Lupus Erythematosus |
| 7 | Congestive Heart Failure (CHF) | 1. Chronic obstructive pulmonary disease (COPD) (40% probability); 2. Asthma (30% probability); 3. Gastroesophageal reflux disease (GERD) (20% probability) | 1. Respiratory infection (bronchitis or pneumonia); 2. Allergies, asthma, anxiety | Heart Failure |
| 8 | Wiskott-Aldrich syndrome (WAS) | 1. Inflammatory Bowel Disease (30% probability); 2. Eczema Herpeticum (25% probability); 3. Food Allergy (20% probability) | 1. Viral Gastroenteritis; 2. Inflammatory Bowel Disease; 3. Blood disorder or a bleeding disorder | Celiac Disease |
| 9 | Multiple Sclerosis (MS) | 1. Multiple Sclerosis (30% probability); 2. Migraine (25% probability); 3. Optic Neuritis (20% probability) | 1. Multiple Sclerosis (MS); 2. Asthma; 3. Uveitis or optic neuritis | Multiple Sclerosis |
| 10 | Alzheimer’s disease | 1. Alzheimer’s disease (40% probability); 2. Vascular dementia (30% probability); 3. Mild cognitive impairment (20% probability) | Alzheimer’s disease or another form of dementia | Alzheimer’s disease |
2.4. Selection of Criteria for Evaluating the Diagnostic Capabilities of Models
The diagnostic results provided by each AI model were evaluated based on the following qualitative criteria:
Accuracy: Defined as the percentage of correct primary diagnoses made in relation to the total number of clinical vignettes.
Approach: The AI’s approach to providing information is categorized as informative, educational, or interactive (engaging in dialogue with the patient).
Level of Detail: Categorized as low, moderate, or high, depending on the specificity and detail of the information provided in the diagnosis:
Low-Level Information: Presented in simple, easy-to-understand language for patients or a general audience. It describes the main symptoms and essence of the disease without using medical terminology. The response includes a plausible explanation but does not cover all possible diagnoses, focusing on a single condition, which may be limiting given the variety of presenting symptoms.
Moderate-Level Information: More detailed and explained with some medical terms. It may be comprehensible to general practitioners or students in training, but not sufficiently understandable for the general public.
High-Level Information: Comprehensive information using medical terms, intended for specialists, researchers, and medical personnel with expert knowledge.
The level of detail can vary based on several factors, including the AI model itself, the data it is trained on, and how the diagnosis is presented.
Interaction with the User: The ability of the model to participate in patient-centered dialogue, define diagnoses, and provide possible solutions to improve the patient’s condition. This includes imitating empathy and exploring possible solutions for enhancing the patient’s well-being.
Number of Differential Diagnoses Made: This criterion evaluates the AI’s capability to generate multiple potential diagnoses based on the provided symptoms.
3. Results
The AI diagnoses (Table 2) were compared with those published in the literature and evaluated based on the criteria outlined above. The results are presented in Table 3. Out of ten clinical vignettes, ChatGPT-4o made the correct diagnosis in nine. In contrast, CodyMD and Dr. Gupta provided differential diagnoses, with the correct diagnosis listed first in about half the cases. This can be viewed as a positive feature of these models, as they enable physicians to explore multiple potential diagnoses instead of being restricted to just one. By promoting a broader differential diagnosis, these models help prevent premature conclusions and foster a more thorough assessment, ultimately leading to more accurate and informed decision-making in patient care.
For case 3, CodyMD assigned a probability of 20% to the correct diagnosis, and for cases 5 and 6, 30%. Dr. Gupta provided potential differential diagnoses but did not assign any probability percentages. In case 3, it listed the correct diagnosis second, while in cases 4, 5, 7, and 8 it did not provide the correct diagnosis at all. Both AIs are designed to provide medical information, but symptoms alone may not be sufficient for accurate diagnoses, given the numerous diseases with overlapping symptoms. As a result, they may require additional information about the patient’s physiological state, including blood tests, imaging studies (e.g., X-rays, MRI, or CT scans), and other diagnostic tools, such as electrocardiograms (ECGs) to assess heart function, pulmonary function tests to evaluate lung capacity, biopsies for tissue analysis, genetic testing to identify hereditary diseases, and urinalysis to detect abnormalities in kidney function.
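As a cross-check on the accuracy figures reported in Table 3, the short tally below recomputes, from the diagnoses in Table 2, each model’s top-1 accuracy (correct diagnosis listed first) and the rate at which the correct diagnosis appears anywhere in the differential list. The per-case booleans are transcribed by hand from Table 2.

```python
# Recomputation of the Accuracy row of Table 3 from Table 2. For each case we
# record (first, anywhere): was the correct diagnosis listed first, and did it
# appear anywhere in the model's list? ChatGPT-4o gives a single diagnosis,
# so its two values always coincide.
results = {
    "ChatGPT-4o": [(1, 1), (1, 1), (1, 1), (1, 1), (1, 1),
                   (1, 1), (1, 1), (0, 0), (1, 1), (1, 1)],
    "CodyMD":     [(1, 1), (1, 1), (0, 1), (1, 1), (0, 1),
                   (0, 1), (0, 0), (0, 0), (1, 1), (1, 1)],
    "Dr. Gupta":  [(1, 1), (1, 1), (0, 1), (0, 0), (0, 0),
                   (1, 1), (0, 0), (0, 0), (1, 1), (1, 1)],
}

for model, cases in results.items():
    top1 = sum(first for first, _ in cases) / len(cases)
    anywhere = sum(hit for _, hit in cases) / len(cases)
    print(f"{model}: top-1 accuracy {top1:.0%}, "
          f"correct diagnosis anywhere in list {anywhere:.0%}")

# Output:
#   ChatGPT-4o: top-1 accuracy 90%, correct diagnosis anywhere in list 90%
#   CodyMD: top-1 accuracy 50%, correct diagnosis anywhere in list 80%
#   Dr. Gupta: top-1 accuracy 50%, correct diagnosis anywhere in list 60%
```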
ChatGPT-4o provides only a single diagnosis based on the patient’s symptoms, using an informative approach without engaging in direct dialogue with the patient. It communicates in language accessible to the general public and provides a brief rationale for its response. While it suggests possible conditions, it typically emphasizes the need for further investigations, tests, and a physical examination by a medical professional. Although it mentions potential causes and provides context for the symptoms, unlike the other two AIs it does not offer more than one diagnosis. This focus on a single condition can be limiting given the variety of presenting symptoms. The level of detail in its responses is moderate, providing plausible explanations in accessible language. However, it lacks personalized interaction, making it less engaging for the patient.
CodyMD, on the other hand, provides three differential diagnoses for all cases and assigns a percentage probability to each potential diagnosis. This AI uses an interactive approach, creating a conversational format that resembles a doctor-patient dialogue. It asks detailed questions to gather additional information about symptoms and pain levels, aiming to refine the diagnosis. CodyMD provides a list of three potential diagnoses and offers a detailed treatment plan for one of them. The level of detail is high, including comprehensive explanations, self-care tips, and potential lifestyle changes. CodyMD is very interactive, friendly, and polite, frequently using encouraging phrases such as “Thanks for the confirmation!” and “Thanks for sharing with me!” It maintains constant communication with the patient by asking questions like “What should I call you?” and “Would you like to see a treatment plan?” This fosters a collaborative atmosphere, strengthens the physician-patient connection, and mimics a sense of empathy for the patient.
Dr. Gupta, like CodyMD, provides differential diagnoses in most cases (see Table 2) but does not state the percent likelihood for each. It begins by stating the most likely diagnosis and offers a brief explanation of the condition, along with potential causes of the symptoms. Dr. Gupta employs an educational approach, making it suitable for medical students during their training. The AI explains the disease and the causes of specific symptoms, emphasizing that a formal diagnosis can only be made by a medical professional through evaluation. It suggests appropriate diagnostic steps (such as physical exams and imaging) and recommends treatment options (including pain control, hormone therapy, and surgery). While it provides clear and precise explanations, it does not delve deeply into the specifics of the disease and lacks the interactivity and personalization found in CodyMD’s responses.
Table 3. Comparative analysis of the three AI models.
| Criteria | ChatGPT-4o | CodyMD | Dr. Gupta |
|---|---|---|---|
| Accuracy | 90% | 50% | 50% |
| Approach | informative | interactive | educational |
| Level of Detail | moderate | high | high |
| User Interaction | Does not engage in dialog | Engages in dialog and seeks more information from the patient | Does not engage in dialog |
| Number of possible diagnoses provided | 1 | 3 | 1, 2, or 3 |
4. Conclusion
The results of the present study indicate that the three AIs investigated have the potential to assist healthcare professionals in diagnosing complex cases with multiple symptoms. They can also be used as virtual assistants for initial patient consultations, referring patients to appropriate tests or specialists based on their reported symptoms. In both roles, however, optimization and refinement are needed to improve their accuracy and reliability in clinical settings. The interactivity demonstrated by CodyMD is particularly useful for acquiring more accurate and complete information about the disease and its history. There is a need to advance more personalized AI decision-making, based on patient-supplied symptoms and the analysis of genetic and clinical data, to enable more accurate diagnoses and individualized treatment plans. Future studies will need to focus not only on refining the algorithms, but also on integrating these systems into real clinical scenarios to ensure their effectiveness and practical applicability.