Artificial Intelligence and Computer Vision during Surgery: Discussing Laparoscopic Images with ChatGPT4: Preliminary Results
1. Introduction
Robotic surgery has dramatically changed the way we think and operate over the past 24 years. Since surgical robots first emerged in operating theaters, questions about robotic autonomy in the future of surgery have been raised in almost all surgical society discussions, including whether a robot might someday perform surgery autonomously, with the potential to gradually replace surgeons. Recent rapid developments in artificial intelligence (AI) have multiplied these concerns, as robotic autonomy seems closer than ever before. Lately, chatbots able to code in any programming language, play chess, suggest financial and marketing strategies, and analyze psychological problems have appeared as potentially disruptive game changers. Some of them can edit documents, write lyrics, compose music and even control small or larger robots. The literature on the potential of AI during surgery is rather scarce. Image recognition in laparoscopy has made tremendous progress in recognizing intraabdominal anatomy, the position of instruments, and the phases of the operation using deep learning. It has been successfully applied to real-time vascular anatomical image navigation [1] and to automatic intraoperative measurement of the distance between anatomical landmarks or the size of organs, for example bowel length [2]. In 2022, Kitaguchi et al. also reported the development of a model for a laparoscopic colorectal surgical instrument recognition system using convolutional neural network-based instance segmentation and videos [3]. Real-time surgical phase recognition using neural network-based deep learning techniques has been presented in many papers [4] [5]. Similarly, Shinozuka et al. presented artificial intelligence software offering surgical phase recognition in laparoscopic cholecystectomy [6].
To our knowledge, there is no current report on a clinician’s interaction with a chatbot using laparoscopic images, making this the first report on computer vision for cognitive understanding of the surgical context by an AI chatbot.
2. Materials & Methods
Current commercially available AI tools for image analysis were applied to a series of laparoscopic scenes. OpenAI ChatGPT 4.0 (3 August 2023 version; https://chat.openai.com/) with its novel image recognition plugin (SceneXplain) was used with full access to GPT4 capabilities.
The SceneXplain (referred to as the image-processing bot in this paper) and ChatGPT (referred to as the chatbot) platforms were fed n = 100 images from a selected list of laparoscopic snapshots, in separate chat sessions. The images covered General Surgery procedures, namely cholecystectomy (n = 14), hiatal hernia (n = 14), gastrectomy (n = 10), splenectomy (n = 10), colectomy (n = 10), appendectomy (n = 12) and abdominal wall hernias (n = 10), as well as Gynecology cases (n = 20).
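The case mix can be tallied to confirm that the per-procedure counts add up to the reported total of 100 images. The following is a minimal Python sketch; the dictionary and function names are illustrative assumptions and not part of the study's tooling.

```python
# Per-procedure image counts as reported in the text.
# The names IMAGE_COUNTS, total_images and share_by_procedure are illustrative.
IMAGE_COUNTS = {
    "cholecystectomy": 14,
    "hiatal hernia": 14,
    "gastrectomy": 10,
    "splenectomy": 10,
    "colectomy": 10,
    "appendectomy": 12,
    "abdominal wall hernias": 10,
    "gynecology": 20,
}


def total_images(counts):
    """Return the total number of images across all procedure groups."""
    return sum(counts.values())


def share_by_procedure(counts):
    """Return each group's share of the image set, in percent."""
    total = total_images(counts)
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(total_images(IMAGE_COUNTS))
    print(share_by_procedure(IMAGE_COUNTS))
```

Because the set contains exactly 100 images, each count is also that group's percentage share.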
Image Selection Criteria and Grouping
Real-life, clear and self-explanatory laparoscopic surgery pictures were used, according to the inclusion criteria presented in Table 1. The set of images was divided into two groups: unlabeled (Group A) and labeled (Group B). Out of the 100 images processed, 35 were labeled and 65 were unlabeled. Anatomical landmark abbreviations were poorly recognized during trials; as a result, images containing only these abbreviations (n = 10) were not included in the ‘labeled’ Group. In addition, the images were categorized according to surgical field of interest, image resolution, and whether they represented single or multiple snapshots of the same procedure.
Building a Chatbot Surgical Assistant Persona
The chatbot was configured to follow specific instructions prior to the commencement of the study (Chatbot Persona/Table 2), which were applied to all evaluations. It was informed a priori that it would have to assess a set of laparoscopic surgical pictures that it needed to describe. It was asked to answer formally and to name relevant anatomy, organs and any surgical instruments. It was then asked to name the relevant disease from the current scene, to guess the surgical procedure implied, and to analyze the anatomy, the probable disease context and the surgical procedure related to the image.
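As an illustration only, the persona described above could be written down as a reusable instruction text. The sketch below is a hypothetical Python encoding: the wording paraphrases the paragraph above and is not the verbatim content of Table 2, and the names `PERSONA_TASKS` and `build_persona_prompt` are assumptions introduced here.

```python
# Hypothetical encoding of the persona described in the text.
# The task wording is a paraphrase, NOT the verbatim instructions of Table 2.
PERSONA_TASKS = [
    "Name relevant anatomy, organs and any surgical instruments.",
    "Name the relevant disease suggested by the current scene.",
    "Guess the surgical procedure implied by the scene.",
    "Analyze the anatomy related to the image.",
    "Analyze the probable disease context related to the image.",
    "Analyze the surgical procedure related to the image.",
]


def build_persona_prompt(tasks):
    """Join a fixed header and a numbered task list into one instruction text."""
    header = (
        "You will assess a set of laparoscopic surgical pictures "
        "and describe them. Answer formally."
    )
    numbered = [f"{i}. {task}" for i, task in enumerate(tasks, start=1)]
    return "\n".join([header, *numbered])


if __name__ == "__main__":
    print(build_persona_prompt(PERSONA_TASKS))
```

Keeping the tasks in a list makes it straightforward to apply exactly the same instructions to every chat session, as the study did for all evaluations.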
Interpretation Scale
An interpretation scale ranging from 0 to 5 was developed in order to score the reliability of the answers received from the image-recognition bot (Table 3(a)), and another scale of 0 to 5 was used to assess the chatbot’s final answers (Table 3(b)). Our evaluation identified and assessed the accuracy of isolated answers, even in cases when the conclusion of the report was unclear or contained wrong, irrelevant or irrational clues. This means that even a single correct comment in the report counted in favor of the chatbot.
Avoiding Bias of Personas
In order to ensure that the results were not biased by the custom instructions given to the chatbot, a set of control images (non-medical, everyday-life themes) was used in the middle of each session. The AI was able to recognize that the scene or image did not focus on medical or anatomical aspects of surgery and that there were no surgical instruments, and it would comment that the picture in question “involved a leisurely scene”, for example that of a beagle puppy, “irrelevant to the clinical context of the discussion”.
Table 2. Instructions given to chatbot (creation of ChatGPT Persona).
Avoiding Bias of Labels
A similar method was employed to assess the weight of titles and labels in the image recognition process. For every labeled surgical image, we used a false control image (an everyday-life, nonmedical image deliberately titled as a surgical procedure). In all cases, the AI realized that there was an issue; more specifically, it replied that “there was a discrepancy in the image explanation and that the image analysis might not have been accurate”.
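The two control strategies described in this section (a non-medical image placed in the middle of each session, and a falsely titled everyday image paired with each labeled surgical image) can be sketched as follows. This is a hypothetical Python outline: the function names, dictionary keys and file names are assumptions, and reusing the paired surgical image's title as the false title is one possible choice that the text does not specify.

```python
def insert_mid_session_control(surgical_images, control_image):
    """Place one non-medical control image in the middle of a session."""
    middle = len(surgical_images) // 2
    return surgical_images[:middle] + [control_image] + surgical_images[middle:]


def pair_with_false_label_controls(labelled_images, everyday_images):
    """Pair each labeled surgical image with a falsely titled everyday image.

    Assumption: the control borrows the surgical image's own title.
    """
    if len(everyday_images) < len(labelled_images):
        raise ValueError("one control image is needed per labeled image")
    pairs = []
    for surgical, everyday in zip(labelled_images, everyday_images):
        control = {
            "file": everyday,
            "title": surgical["title"],  # deliberately wrong for this picture
            "is_control": True,
        }
        pairs.append((surgical, control))
    return pairs


if __name__ == "__main__":
    # File names below are made up for illustration.
    print(insert_mid_session_control(["img1", "img2", "img3", "img4"], "beagle"))
    labelled = [{"file": "chole.png", "title": "Laparoscopic cholecystectomy"}]
    print(pair_with_false_label_controls(labelled, ["market.png"]))
```

Marking controls explicitly with a flag keeps them easy to exclude when the surgical scores are aggregated.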
3. Results
Of the 100 images processed, 35 were labelled and 65 were unlabeled. Eighteen contained multiple images of the case, while the remaining 82 were single images. Average image resolution was 903.71 × 546.3 px (median 800 × 508.5 px) (Figure 1). Scoring of the two bots during recognition of images from different surgical fields is presented in Figure 2. The overall score of the AI application in interpreting laparoscopic surgery images was 3.265 out of 5 (65.3%).
Figure 1. Raw data from image assessment (panels a, b and c).
Figure 2. Scoring of the two bots during recognition of images from different surgical fields (light = image-processing bot, dark = chatbot).
The image-recognition plugin scored 3.49/5 and correctly recognized the context of surgery-related images in 97% of its reports. Recognition of labelled images (n = 35, score = 3.84) was better than for unlabeled ones (n = 65, score = 2.89) (Figure 3). For labelled pictures the chatbot scored 3.91/5, while for unlabeled ones it scored 2.6/5. For labelled surgical pictures (Group B) the two bots together scored 3.95/5 (79%), while for unlabeled pictures (Group A) they scored 2.905/5 (58.1%). With ratings of 4 to 5 out of 5, the chatbot was able to talk in detail about the implied surgical procedure. When correctly scanned by the image-processing bot, labels affected the chatbot’s train of thought and its final reports, leading to a fully correct conclusion (5/5 score) in 21 out of 35 (60%) descriptions.
Figure 3. AI interpretations with and without image labelling.
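The percentages quoted alongside the scores follow from simple arithmetic on the 0 to 5 scale. The Python sketch below reproduces them from the figures given in the text. It also computes a case-weighted mean of the chatbot's labelled and unlabeled scores; weighting by image count is our assumption about how an overall value would be derived, and under that assumption the result rounds to the 3.06/5 overall chatbot score reported in this section. Function and variable names are illustrative.

```python
def to_percent(score, max_score=5.0):
    """Express a score as a percentage of the scale maximum, to one decimal."""
    return round(100 * score / max_score, 1)


def weighted_mean(groups):
    """Case-weighted mean of (n_images, mean_score) pairs."""
    total_n = sum(n for n, _ in groups)
    if total_n == 0:
        raise ValueError("at least one image is required")
    return sum(n * score for n, score in groups) / total_n


# Values copied from the Results text.
overall_pct = to_percent(3.265)        # overall score of the AI application
labelled_pct = to_percent(3.95)        # both bots, labelled images
unlabelled_pct = to_percent(2.905)     # both bots, unlabeled images
full_correct_pct = to_percent(21, 35)  # fully correct conclusions, labelled images

# Chatbot alone: 35 labelled images at 3.91 and 65 unlabeled images at 2.6.
# Weighting by image count is an assumption, not a method stated in the paper.
chatbot_overall = weighted_mean([(35, 3.91), (65, 2.6)])

if __name__ == "__main__":
    print(overall_pct, labelled_pct, unlabelled_pct, full_correct_pct)
    print(round(chatbot_overall, 2))
```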
Recognition of higher-resolution images (even of the same images) was not better than that of lower-quality ones (score 3.22 for n = 48 images > 800 px vs. score 3.29 for n = 52 images < 800 px) (Figure 4). Recognition of multiple images from different procedure phases bound together (n = 18, score = 3.19) was not better than that of isolated snapshots of the procedure (n = 82, score = 3.28). “Right” and “left” orientation was mixed up when locating structures on the photo (3% of cases). Among viscera, there was a slight preference for recognizing “stomach” as well as “uterus”. The term “stomach” was mentioned multiple times, even when it did not appear in the image, most likely as a synonym for “viscera”. The bot also seemed to recognize materials more easily, albeit with bizarre terms such as “rope” instead of suture, “piece of cloth” instead of sponge, “plastic” instead of suture applicators and drains, and “needle”, “scissors”, or just “piece of metal” instead of laparoscopic grasper. These observations underscore the importance of enhancing the recognition of more patterns of internal organ anatomy and surgery in similar future applications.
Multiple labelled images and high-resolution, clear anatomy often did not prevent an unfavorable outcome. In contrast, really poor snapshots from a laparoscopic video yielded full recognition of the anatomy and the related operation. The overall accuracy of the image-processing bot depended on details such as the shape of internal organs from different views and angles, the lighting and shading of the photos, and the presence of blood or cutting instruments. It is worth mentioning that an obvious “craving” was observed in the chatbot’s behavior for a triggering piece of information that would immediately “activate” it and put it into action. Otherwise, it would simply perform another uninteresting delivery of its task, indifferent to the poor result. This threshold depended upon logical clues, such as a clear, big title of the operation over the image, multiple labels of relevant anatomy or pathology, or a characteristic surgical instrument.
Figure 4. Image resolutions used in descending order (left axis—black line) in comparison with overall scoring of image interpretation (right axis—gray dots).
Optical character recognition by the image-bot failed in 8 of the 35 labelled images (22.86%), with consequent irrational conclusions from the chatbot. Abbreviated labels were recognized in five out of ten cases. In the first case, the image-bot identified the uterus and the chatbot correctly connected the abbreviation “UT” with it. In the second case, it accurately expanded the “GE” junction to “gastroesophageal” in the context of hiatal hernia surgery. In the third and fourth cases, the chatbot realized that the abbreviations were explained at the bottom of the image. In the fifth case, the chatbot concluded that the abbreviations were possibly describing anatomical landmarks.
Our scoring was performed by validating the accuracy of the answers even in cases when the report was not very clear or contained wrong, irrelevant or irrational clues. Steady recognition of the surgical field throughout the whole study (97% of cases) was sufficient rationale for reporting our results. However, in cases where the bot reported an inaccurate diagnosis or surgery with certainty, this was clearly judged as a wrong answer. As can be observed in the tables below, only answers scored below 4 have been considered wrong.
Throughout the study, the image-processing bot, being aware of its limitations, added the following comment to most of its reports: “Please note that my analysis is based on the image and should not replace professional medical advice.” The chatbot was admittedly trying to do its best (overall score 3.06/5), often with only poor information from its partner image-recognition bot. At times, it would even seek help from the file name of the given image, as a last resort, in order to find clues for the correct answer.
Upon successful interpretation, theory around the procedure was reported accurately in both groups. As soon as the first bot correctly recognized the topic, the chatbot showed a high-level capability to talk in detail about the indications, contraindications, stages, relevant instrumentation, complications and outcome rates of the operation under discussion. Characteristically, even with a lot of misleading information from the image-processing bot, the chatbot was often able to select the most probable situation and give a correct report. Wrong answers, including bizarre and paradoxical responses, were also noted.
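The rule that only answers scored below 4 count as wrong can be expressed as a small check on the 0 to 5 interpretation scale. The following Python sketch is illustrative: the function names are assumptions, and the example scores in the usage section are invented for demonstration and are not study data.

```python
SCALE_MIN = 0
SCALE_MAX = 5
WRONG_BELOW = 4  # from the text: answers scored below 4 were considered wrong


def validate_score(score):
    """Reject scores that fall outside the 0 to 5 interpretation scale."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score {score} is outside {SCALE_MIN}..{SCALE_MAX}")
    return score


def is_wrong(score):
    """Return True when a valid score falls below the 'wrong answer' cut-off."""
    return validate_score(score) < WRONG_BELOW


def wrong_answer_rate(scores):
    """Fraction of answers judged wrong in a list of scores."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(is_wrong(s) for s in scores) / len(scores)


if __name__ == "__main__":
    example_scores = [5, 4, 3, 0]  # invented values, for illustration only
    print(wrong_answer_rate(example_scores))
```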
4. Discussion
The existing literature on AI and medicine is rapidly proliferating, with most published works focusing on the chatbot’s ability to analyze data and knowledge and recast them as formally generated reports. For example, the literature has examined the chatbot’s assessment of teaching methods in medicine [7], diagnostic abilities for certain pathology [8], its role in multiple clinical and research scenarios [9], proofing of surgery documentation [10], or other impact on the surgical profession [11] [12] [13] [14]. Image recognition with AI is currently published mostly for radiological applications [15]. Many colleagues have examined the impact of AI on specific specialties, such as colorectal surgery and gynecology [16].
ChatGPT (Generative Pre-trained Transformer) technology, developed by OpenAI©, is a state-of-the-art conversational agent. It utilizes deep learning through neural networks with multiple layers and attention mechanisms. ChatGPT is trained on a diverse set of big data to provide human-like responses in a conversational environment [17]. The model’s architecture allows it to capture complex patterns in language, including semantics, syntax, and context. This makes it a powerful tool for applications ranging from customer service to healthcare [18]. However, it is essential to note that while the model is proficient in text generation, it lacks true understanding or consciousness [19]. In our experience with feeding laparoscopic images to the chatbot, it was surprising that it showed no interest in the unique aspects of our research. The SceneXplain plugin technology by Jina AI GmbH is a groundbreaking advancement in the field of computer vision and natural language processing. It is based on convolutional neural networks (CNNs) for image recognition [20] and transformer-based language models for textual description [21]. This image-recognition “bot” provides detailed, context-aware explanations of visual content. It exploits various technologies, such as pattern recognition, scene segmentation, and contextual analysis, to generate comprehensive descriptions of images. It could find application in thousands of health-related fields, ranging from aiding the visually impaired to aiding diagnostic processes in medical imaging [21]. It aims at improving human-computer interaction and automating tasks that require visual understanding [22]. Its manufacturers admit that although the technology is highly accurate, it is not without errors and should be used as an adjunct to expert human judgment for critical applications.
The novel image-recognition plugin still appears to be in its infancy, in contrast to the widely accepted chatbot. The image-processing bot performed better with everyday-life pictures, such as a crowded market or dogs running in a field. In some of them, apart from describing what was seen, it would also comment on the feelings evoked by a certain picture. Clearly, that was what it was originally designed for. If existing robotic systems armed with AI gain the ability to “see” and “understand” relevant anatomy and pathology in a similar fashion, then the way for an autonomous surgical assistant will be paved. A lot of work on autonomous instrument tracking and visual servoing has been published over the years, and it seems that AI has arrived to connect the missing dots [23] [24] [25]. An autonomous surgical assistant with deep learning capabilities could customize its learning curves according to each surgeon’s personal preference and style, with active input during surgery aiding decision-making or protecting against human error or poor human judgment. For younger colleagues, it could act as a supervisor, aiding the training process. Teaching could take place by means of assisting maneuvers, or by drawing the dissection planes on the laparoscopic screen. Finally, although still looking distant, an autonomous robot-surgeon could carry the accumulated experience of thousands of surgeons (their motion data sets have been digitally recorded since the era of the first robotic system) and offer faster turnover rates, lower complication rates and less need for “human” conditions to perform surgery. An autonomous robot would use more standardized instrumentation and techniques, with more reproducible results. Such material would be even more standardized and homogeneous for higher-quality multicenter randomized studies.
Ethical concerns have been raised since the era of science fiction and continue to appear in the existing literature on evolving AI chatbots [26] [27] [28] [29]. Rational questions have arisen, such as what happens when a robot surgeon misinterprets an image, video or situation during surgery. A robot might reach erroneous conclusions and react to them in an inappropriate manner. The consequences of AI errors in a surgical environment could prove fatal and fall into legislative inconsistencies about the source of responsibility. Every important ethical and medicolegal ramification of AI decisions and interventions during surgery should be analyzed separately in a multidisciplinary setting, and obligatory international regulations should be outlined for its use. We are not the first to report paradoxical statements from chatting with an AI platform. Extreme, unrealistic, and even frightening interpretations often appeared throughout our short experience in this study. Paradoxically, this also happened in conjunction with completely correct answers. This behavior was particularly disturbing, especially when, in simple anatomy descriptions, the bot would also recognize “a surgeon putting a pen in a patient’s mouth” or “a woman bleeding profusely from the neck”, sometimes also doubting the rationality of its own conclusions. In our opinion, this type of “error” is of utmost importance, stressing that AI should be secured by strict safety measures before establishing a permanent presence in the hospital environment.
This study coincided with the very first appearance of a novel image recognition GPT plugin and its beta testing period. As a result, and subject to several limitations, we attempted to assess new software in a very immature phase. Furthermore, this software has not been designed for medical image interpretation, an ability which, at present, requires many years of experience and training from human health care professionals. There was no attempt to teach the chatbot, since the study aimed to assess its knowledge and reactions from its “factory settings”, i.e., the first version of a newborn (but fully functional) AI application. Deep learning technicalities behind the familiarization process of the chatbot were considered beyond the scope of this paper. Our concept was far simpler than that: if the chatbot (without a prior teaching process) can recognize a woman walking down the street looking sad, why shouldn’t it recognize a gallbladder being removed? Laparoscopic snapshots can be really difficult to interpret even for human surgeons, and their interpretation is affected by altered anatomy from severe pathology, level of illumination, reflection of red color from bleeding sites, shadows, angles of the scope and amount of smoke or fog.
Variability in the compression and formats of the selected images may have affected the quality of results, although resolution alone (acceptable for human eyes) did not seem to affect the efficiency of interpretations. Other technical limitations of the present study include the small sample size; the different quality, lighting and sources of the images used; the early phase of the newly released image-recognition bot; the mix of labelled and unlabelled pictures; and the absence of structured preparation of the chatbot through a surgical teaching protocol. Therefore, no safe conclusions can be drawn on the validity of this technology in the clinical setting. Awareness among clinicians, and the design and conduct of further studies on AI in the clinical and surgical setting, seem obligatory in the face of this unavoidable evolution.
Future goals of this study include a uniform set of images created by the same device under the same conditions of light and color balance, as a more standardized, ideal setting for further evaluations of AI vision in laparoscopy. Real-time assessment of streaming surgical videos is a critical next step, as soon as appropriate plugins are able to process them efficiently at such speeds. This step should yield far better results since, in our limited experience, responses to mixed laparoscopic images from different phases of the operation were generally better. Furthermore, real-time correlation with radiological images from the hospital PACS network, shown simultaneously within the surgical field, is another promising target for AI interpretation.
5. Conclusion
Interaction between surgeon and chatbot appears to be an interesting frontend for further research by clinicians, in parallel with the evolution of its complex underlying infrastructure. In this early phase of using artificial intelligence for image recognition in surgery, no safe conclusions can be drawn from small cohorts with commercially available software. Further development of medically oriented AI software and greater awareness in the clinical world are expected to bring fruitful information on the topic in the years to come.