DETECT IPN: Real-World Experience with Automated Detection of Incidental Pulmonary Nodules in an All-Comer Population
1. Introduction
Lung cancer (LC), with an annual incidence of nearly 57,000 cases (11.5% of all cancer cases), is one of the most common cancers in Germany and has the highest mortality rate, with 45,000 deaths per year (19.5% of all cancer-related deaths) [1]. National screening programs based on low-dose computed tomography (LDCT) examinations can help diagnose the disease at earlier stages and thus demonstrably reduce the risk of mortality [2]-[9]. In Germany, a corresponding lung cancer screening (LCS) regulation (LuKrFrühErkV) was enacted by the Federal Ministry for the Environment (BMUV), allowing heavy smokers aged 50 - 75 years to undergo standardized LDCT examinations starting from July 1, 2024 [9] [10]. This regulation also mandates the use of software for computer-aided detection of pulmonary nodules.
Computer-aided detection (CAD) systems to assist in the radiologic evaluation of imaging modalities have been in development for more than twenty years [11]-[13]. CAD systems autonomously mark regions suspected of pathology (such as pulmonary nodules) within image data. Radiologists can then confirm these marked areas as true positives or reject them as false positives. This can improve diagnostic quality while reducing workload [13]-[15]. In the past, conventional CAD systems have been based on manually created image processing algorithms. However, the introduction of deep learning models based on artificial neural networks (“artificial intelligence”, AI) in recent years has led to significant development and improvement of these systems [16]-[22].
Deep learning models “learn” autonomously based on large training datasets, gradually recognizing specific patterns within the data using neural layers until the result is optimized. A deep learning model developed by Google, which outperformed six experienced radiologists in the detection of cancer in terms of sensitivity and specificity in the 2017 Data Science Bowl (Booz Allen Hamilton & Kaggle), garnered significant attention [23]. This model was trained, tuned, and tested on 42,290 CT datasets from the U.S. National Lung Cancer Screening program. However, typical pitfalls of automated detection systems are the potential divergence between (homogeneous) training datasets and (heterogeneous) application data, as well as the difficulty of balancing sensitivity and specificity [24]-[26].
Targeted screening programs using LDCT are effective for early lung cancer detection but are generally restricted to older individuals with a significant smoking history. However, these criteria overlook many cases, as up to 25% of lung cancer cases occur in younger individuals who have never smoked [27] [28]. In addition to people with insufficient smoking history and people not willing to participate, screening programs might not cover more than half of the LC-vulnerable population, highlighting a significant gap in current screening approaches. An efficient way to increase early LC detection rates, regardless of smoking status, is through the management of incidental pulmonary nodules (IPNs). This approach not only complements existing screening efforts but also enables early LC detection in persons ineligible for conventional LC screening. IPNs are incidental findings from diagnostic thoracic CTs performed for reasons other than LCS or clarification of suspected malignancy [29]. Recommendations for the standardized diagnosis and follow-up of IPN have been incorporated into the current S3 guidelines in Germany [30].
The growing use of diagnostic CT scans in daily practice enhances the potential for detecting IPNs. However, this is balanced by the substantial workload it places on radiologists [31]-[35]. An obvious solution to increase the sensitivity and efficiency of IPN detection is the use of AI-based detection systems. Therefore, within the DETECT-IPN study, we investigated the extent to which commercially available AI systems for detecting pulmonary nodules can be used for IPN detection in clinical routine by analyzing their precision, defined as the proportion of AI-detected IPN confirmed as true positive (true positive IPN/all detected IPN).
2. Methods
This was a retrospective study investigating two AI-based, computer-aided IPN detection systems at three German radiological centers: the Institute for Diagnostic and Interventional Radiology at Hannover Medical School (MHH), the Department of Diagnostic and Interventional Radiology at Heidelberg University Hospital (UKHD), and the Radiology Center for Diagnostics and Therapy Munich (RDTM). The study included diagnostic thoracic CT scans (trauma CTs, vascular CTs, thoracic CTs, and pulmonary angiographies), collected within clinical routine at the respective centers in 2021 (observation period: 12 months). In most cases, slice thickness was 1 mm. Various CT devices and examinations with and without contrast agents were used, representing the spectrum of procedures used in clinical practice. Due to this unselected real-world approach, neither analysis of nor correction for these technical differences between the study sites was performed. Only adult patients without known malignancies or acute suspicion of lung cancer were included. At UKHD and RDTM, immunosuppressed patients were also excluded from the study. Two detection systems for pulmonary nodules were used for IPN detection: a research software prototype with no clearance for medical use (ChestCT Explore) from Siemens Healthcare GmbH (AI1) and the commercially available ADVANCE Chest CT CE from contextflow (AI2). The suitability for the intended application was clarified with the manufacturers beforehand. Subsequently, cases with AI-detected IPN of both systems were reviewed and verified separately by experienced radiologists from the three centers (one per site) to study the rate of IPN detection and precision of the two detection systems. Radiologists could confirm (true positive) or reject (false positive) AI-detected IPN.
At MHH, only cases with previous CT examinations were reviewed by radiologists, and the IPN review was based solely on the guideline definition of a lung nodule as a coincidental, single, localized process in the lung that does not exceed 3 cm in diameter and is completely surrounded by lung tissue. In contrast to this broad definition, at UKHD and RDTM only IPN requiring further clarification were judged as true positive, whereas obviously benign nodules like inflammatory processes, calcified granulomas, perifissural lymph nodes, and lingular calluses were judged as false positive. Due to this difference, the guideline-based (MHH) and the interpretation-based (UKHD & RDTM) IPN verification are presented and discussed separately. Based on the German guidelines and expert recommendations [30] [36], the following cut-off values of IPN diameters (automated CAD-based measurement) were applied to select IPN, and only IPN meeting these cut-offs were reviewed by radiologists within the three centers:
• MHH: Mean diameter ≥ 5 mm and ≥ 8 mm;
• UKHD: Mean diameter ≥ 5 mm and ≥ 8 mm;
• RDTM: Maximum diameter ≥ 5 mm and ≥ 8 mm.
At RDTM, IPN with a maximum diameter ≥ 5 mm and ≥ 8 mm were detected and recorded using the AI-based systems. Due to local data protection regulations, the analysis data had to be stored online and were mostly lost during automatic data cleanup. Therefore, detection results at RDTM are only known for AI2, and only IPN with a maximum diameter ≥ 8 mm could be radiologically reviewed for both systems.
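The site-specific cut-off rules listed above can be sketched in a few lines of Python. This is only an illustration: the `Nodule` record, its field names, and the example values are hypothetical, since the export format of the CAD measurements is not described here. The only element taken from the text is which diameter (mean or maximum) each site compared against the 5 mm and 8 mm cut-offs.

```python
from dataclasses import dataclass


# Hypothetical record layout; the actual CAD export format is not described in the text.
@dataclass(frozen=True)
class Nodule:
    site: str                 # "MHH", "UKHD" or "RDTM"
    mean_diameter_mm: float   # automated CAD-based mean diameter
    max_diameter_mm: float    # automated CAD-based maximum diameter


# Diameter measure each site compared against the cut-off (from the list above).
SITE_MEASURE = {"MHH": "mean", "UKHD": "mean", "RDTM": "max"}


def diameter_for_site(nodule: Nodule) -> float:
    """Return the diameter that the nodule's site uses for its size cut-off."""
    if SITE_MEASURE[nodule.site] == "mean":
        return nodule.mean_diameter_mm
    return nodule.max_diameter_mm


def select_for_review(nodules, cutoff_mm: float):
    """Keep only nodules whose site-specific diameter reaches the cut-off."""
    return [n for n in nodules if diameter_for_site(n) >= cutoff_mm]


# Made-up example nodules, for illustration only.
examples = [
    Nodule("MHH", mean_diameter_mm=4.6, max_diameter_mm=5.4),
    Nodule("RDTM", mean_diameter_mm=4.6, max_diameter_mm=5.4),
    Nodule("UKHD", mean_diameter_mm=8.2, max_diameter_mm=9.0),
]
at_5mm = select_for_review(examples, cutoff_mm=5.0)
at_8mm = select_for_review(examples, cutoff_mm=8.0)
```

In this made-up example, two nodules with identical measurements are treated differently: the one at RDTM passes the 5 mm cut-off because its maximum diameter is used, while the one at MHH does not because its mean diameter is used. Since the maximum diameter is never smaller than the mean diameter, the RDTM criterion is the more inclusive of the two.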
3. Results
3.1. Patient Characteristics
Overall, the datasets of 1552 CT scans were analyzed using the two AI-based detection systems (Table 1). Of these, the datasets and IPN findings of 317 (AI1) and 308 (AI2) patients who met the relevant size cut-off criteria and, in the case of MHH, had previous examinations, were subsequently reviewed by experienced radiologists (Table 2). Additionally, the baseline data of the included patients were documented at UKHD. The gender ratio was balanced with 45% female and 55% male patients, with included women being slightly older with a median age of 64 years compared to men with a median age of 58 years. Pulmonary angiographies accounted for 43% of the total CT scans and 53% of the scans in women, while trauma CTs were most common in men at 42%. Vascular CTs accounted for 19% of the datasets, and thoracic CTs were very rare at 2%. Contrast agents were used in all cases, most frequently in the pulmonary arterial phase (44%), mixed arterial/venous (28%), or purely arterial (22%).
Table 1. AI-based IPN detection at the three study sites.
| Study site | AI1 (Siemens): IPN | AI1 (Siemens): Patients | AI2 (contextflow): IPN | AI2 (contextflow): Patients |
| --- | --- | --- | --- | --- |
| MHH (n = 766) | | | | |
| IPN positive | 2,000 | 534 (70%) | 1,741 | 495 (65%) |
| ≥5 mm | 1,325 (66%) | 450 (59%) | 1,172 (67%) | 542 (71%) |
| ≥8 mm | 489 (24%) | 254 (33%) | 542 (31%) | 293 (38%) |
| No findings | - | 232 (30%) | - | 271 (35%) |
| UKHD (n = 201) | | | | |
| IPN positive | 422 | 121 (60%) | 241 | 99 (49%) |
| ≥5 mm | 318 (75%) | 109 (54%) | 171 (71%) | 75 (37%) |
| ≥8 mm | 169 (40%) | 63 (31%) | 102 (42%) | 50 (25%) |
| No findings | - | 80 (40%) | - | 102 (51%) |
| RDTM (n = 585) | | | | |
| IPN positive | - | - | 1,843 | 482 (82%) |
| ≥5 mm | - | - | 1,357 (74%) | 385 (66%) |
| ≥8 mm | - | - | 568 (31%) | 221 (38%) |
| No findings | - | - | - | 103 (18%) |
AI: artificial intelligence; IPN: incidental pulmonary nodule.
3.2. AI-Based IPN Detection
Within the 1552 CT datasets from the three study sites, the two AI-based detection systems detected IPN of any size in approximately two-thirds of the cases, with the lowest value of 49% by the contextflow system (AI2) at UKHD and the highest value of 82% also by AI2 at RDTM (Table 1). The Siemens system (AI1) detected IPN in about two-thirds of the cases (between 60% at UKHD and 70% at MHH). On average, 2 IPN per CT were detected by both systems. AI2 achieved the lowest value of 1.2 IPN/CT at UKHD and the highest value of 3.2 IPN/CT at RDTM, whereas AI1 achieved 2.1 IPN/CT at UKHD and 2.6 IPN/CT at MHH. Approximately two-thirds of the IPN fell into the size category ≥5 mm, and roughly one-third into the size category ≥8 mm. Detection results were more similar between the two systems within a center than for the same system across centers.
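The IPN/CT values quoted above follow from dividing the number of detected IPN by the number of analyzed CT scans per site. A minimal sketch, using the counts from Table 1 (the dictionary layout and function name are our own choices for illustration):

```python
# Counts from Table 1: (site, system) -> (detected IPN of any size, analyzed CT scans).
TABLE1_COUNTS = {
    ("MHH", "AI1"): (2000, 766),
    ("MHH", "AI2"): (1741, 766),
    ("UKHD", "AI1"): (422, 201),
    ("UKHD", "AI2"): (241, 201),
    ("RDTM", "AI2"): (1843, 585),  # no AI1 detection results are available for RDTM
}


def ipn_per_ct(n_ipn: int, n_ct: int) -> float:
    """Mean number of AI-detected IPN per analyzed CT scan."""
    return n_ipn / n_ct


# Round to one decimal, as in the text.
rates = {key: round(ipn_per_ct(*counts), 1) for key, counts in TABLE1_COUNTS.items()}

for (site, system), rate in rates.items():
    print(f"{site} {system}: {rate} IPN/CT")
```

Note that the denominator is the full number of CT scans at each site, including scans without any AI finding, so the values describe the average burden per examination rather than per IPN-positive patient.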
3.3. Verification of AI-Based IPN Detection
At MHH, IPN review was based on the guideline-based IPN definition, and only CT datasets with previous CT examinations were manually reviewed, which applied to 93 out of 766 cases (Table 2). Importantly, processes that fell under the IPN definition but did not require follow-up were considered true positives (TP) at MHH. In this subgroup with pre-examinations, a slightly higher rate of IPN was detected than in the overall cohort (for comparison, see Table 1), with approximately 4 IPN/CT of any size, 2 IPN/CT ≥ 5 mm, and 0.8 IPN/CT ≥ 8 mm. In the radiological review at MHH, 62% (AI1) and 72% (AI2) of AI-detected IPN with a mean diameter ≥ 5 mm were confirmed (= precision), corresponding to a relatively low rate of false-positive (FP) findings per CT of 0.9 FP/CT (AI1) and 0.6 FP/CT (AI2), respectively.
The precision of both systems was lower in the IPN size category ≥ 8 mm at MHH, with only half of the IPN being confirmed by radiologists as true positive (45% (AI1) and 52% (AI2)). Nevertheless, the rate of false positive findings per CT was still low with approximately 0.4 FP/CT for both systems.
Table 2. Manual review of AI-detected IPN by guideline-based IPN definition (MHH).
| MHH (n = 93) | AI1 (Siemens): IPN | AI1 (Siemens): Patients | AI2 (contextflow): IPN | AI2 (contextflow): Patients |
| --- | --- | --- | --- | --- |
| IPN positive | 380 | 70 (75%) | 393 | 63 (68%) |
| ≥5 mm | 214 (56%) | 59 (63%) | 206 (52%) | 47 (51%) |
| ≥5 mm, review: TP | 133 (62%) | 40 (68%) | 149 (72%) | 33 (70%) |
| ≥5 mm, review: FP | 81 (38%) | 19 (32%) | 57 (28%) | 14 (30%) |
| ≥8 mm | 73 (19%) | 39 (42%) | 71 (18%) | 34 (37%) |
| ≥8 mm, review: TP | 33 (45%) | 19 (49%) | 37 (52%) | 20 (59%) |
| ≥8 mm, review: FP | 40 (55%) | 20 (51%) | 34 (48%) | 14 (41%) |
| No findings | - | 23 (25%) | - | 30 (32%) |
AI: artificial intelligence; FP: false-positive findings; IPN: incidental pulmonary nodule; TP: true-positive findings.
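The two metrics reported for the MHH review, precision and false positives per CT, can be recomputed from the IPN-level TP and FP counts in Table 2. The following is a minimal sketch; the function and variable names are illustrative and not part of the study.

```python
def precision(tp: int, fp: int) -> float:
    """Share of AI-detected IPN that radiologists confirmed: TP / (TP + FP)."""
    return tp / (tp + fp)


def fp_per_ct(fp: int, n_ct: int) -> float:
    """False-positive IPN per reviewed CT scan."""
    return fp / n_ct


N_CT_MHH = 93  # reviewed CT datasets with previous examinations at MHH

# IPN-level review counts from Table 2: (system, size category) -> (TP, FP).
TABLE2_REVIEW = {
    ("AI1", ">=5 mm"): (133, 81),
    ("AI2", ">=5 mm"): (149, 57),
    ("AI1", ">=8 mm"): (33, 40),
    ("AI2", ">=8 mm"): (37, 34),
}

# (precision in percent, FP per CT), rounded as in the text.
summary = {
    key: (round(100 * precision(tp, fp)), round(fp_per_ct(fp, N_CT_MHH), 1))
    for key, (tp, fp) in TABLE2_REVIEW.items()
}

for (system, size), (prec_pct, fp_ct) in summary.items():
    print(f"{system} {size}: precision {prec_pct}%, {fp_ct} FP/CT")
```

The FP/CT denominator here is the number of reviewed CT scans (93), not the number of IPN-positive patients. Under that assumption the results agree with the values given in the text, and the same helpers applied to the Table 3 counts with n = 201 (UKHD) and n = 585 (RDTM) appear to reproduce the FP/CT values reported for those sites.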
At UKHD and RDTM, the review of IPNs relied on expert judgment, and nodules deemed not to require follow-up were classified as false positives (Table 3). In this case, only 16% (AI1) and 23% (AI2) of IPN ≥ 5 mm and 6% (AI1) and 11% (AI2) of IPN ≥ 8 mm could be confirmed as true positive by radiologists at the UKHD. This resulted in a slightly higher rate of false-positive findings per CT, ranging from 0.8 FP/CT (AI1) and 0.5 FP/CT (AI2) of IPN ≥ 8 mm to 1.3 FP/CT (AI1) and 0.7 FP/CT (AI2) of IPN ≥ 5 mm.
At RDTM, the precision of both detection systems was slightly better than at UKHD but generally still low, with 23% (AI1) and 21% (AI2) of IPN ≥ 8 mm confirmed as true positive by radiologists, at 0.7 and 0.6 FP/CT, respectively.
Again, the radiological review of AI-based detection findings yielded quite comparable results between the two systems at the site level, whereas comparability was reduced between the study sites.
Table 3. Manual review of AI-detected IPN by experts’ interpretation-based IPN judgement (UKHD & RDTM).
| Study site | AI1 (Siemens): IPN | AI1 (Siemens): Patients | AI2 (contextflow): IPN | AI2 (contextflow): Patients |
| --- | --- | --- | --- | --- |
| UKHD (n = 201) | | | | |
| IPN positive | 422 | 121 (60%) | 241 | 99 (49%) |
| ≥5 mm | 318 (75%) | 109 (54%) | 171 (71%) | 75 (37%) |
| ≥5 mm, review: TP | 51 (16%) | 28 (26%) | 40 (23%) | 24 (32%) |
| ≥5 mm, review: FP | 267 (84%) | 81 (74%) | 131 (77%) | 51 (68%) |
| ≥8 mm | 169 (40%) | 63 (31%) | 102 (42%) | 50 (25%) |
| ≥8 mm, review: TP | 10 (6%) | 9 (14%) | 11 (11%) | 9 (18%) |
| ≥8 mm, review: FP | 159 (94%) | 54 (86%) | 91 (89%) | 41 (82%) |
| No findings | - | 80 (40%) | - | 102 (51%) |
| RDTM (n = 585) | | | | |
| IPN positive | - | - | 1,843 | 482 (82%) |
| ≥8 mm | 521 | 149 (25%) | 461 (25%) | 186 (32%) |
| ≥8 mm, review: TP | 122 (23%) | 31 (21%) | 97 (21%) | 44 (24%) |
| ≥8 mm, review: FP | 399 (77%) | 118 (79%) | 364 (79%) | 142 (76%) |
| No findings | - | - | - | 103 (18%) |
AI: artificial intelligence; FP: false-positive findings; IPN: incidental pulmonary nodule; TP: true-positive findings.
4. Discussion
This study aimed to evaluate the rate and precision of automated IPN detection under practical conditions by commercially available AI systems. Despite the heterogeneity of the datasets used, the rates of detected IPN by both AI systems were largely comparable both with each other and between the three study centers. Overall, IPN of any size were detected in approximately 50 to 80 percent of the patients. This range seems quite plausible, as it is supported by data from two large Danish clinics with a detection rate of 50 percent in 2019 [33]. IPN of relevant size categories requiring further follow-up were detected in 37 to 71 percent (IPN ≥ 5 mm) and 25 to 38 percent (IPN ≥ 8 mm) of the patients. These detection rates, however, are higher than expected, as in the current Danish study, IPN ≥ 5 mm were only found in 23.4 percent of the cases.
The manual review of AI-based detection results by experienced radiologists varied between the study centers and by the approach to classifying true positives and false positives. Consequently, the rate of true-positive findings was highest at MHH with approximately 60 to 70 percent for IPN ≥ 5 mm. IPN ≥ 8 mm achieved a lower precision at MHH of about 50 percent. At UKHD and RDTM, the precision of AI-detected IPN was strikingly lower than at MHH, ranging from 20% for IPN ≥ 5 mm at UKHD and IPN ≥ 8 mm at RDTM down to 10% for IPN ≥ 8 mm at UKHD. The precision of IPN ≥ 8 mm was generally lower than that of IPN ≥ 5 mm in all study sites.
It is noteworthy that the results of both systems were consistently comparable at the center level, while the results varied between the centers and with the type of manual IPN review. For this reason, it is very likely that the low precision observed at UKHD and RDTM is mainly caused by the strict interpretation-based verification of IPN at these sites. Since we did not correct for site-specific differences in technical details but used unselected real-world data instead, such center-specific differences as well as inter-reader variability represent additional explanations for differences between the sites. However, since all three study sites are experienced and specialized radiological centers, we are convinced that the data included represent actual best practice throughout Germany.
AI-based detection systems have so far been trained on relatively homogeneous datasets for application within LCS programs without the use of contrast agents [20]-[22]. The performance of such detection systems is crucially dependent on the scope, selection, and quality of the training datasets [24] [25]. Differences between the training datasets and the datasets on which the models are later applied affect the models’ sensitivity and selectivity. Hence, another very likely explanation for the relatively low precision of the two detection systems tested in this study is data heterogeneity and divergence from training datasets. This could be due to technical imaging parameters (e.g., differences in the devices/CT protocols and the use of contrast agents) as well as the composition of the patient cohorts. For example, while participants in LCS programs are typically asymptomatic and largely healthy, this study’s cohort was quite different. Moreover, the study encompassed CTs reflecting various pathologies including inflammatory processes, especially as the observation period coincided with the COVID-19 pandemic. Therefore, CTs for the clarification of SARS-CoV-2-mediated pneumonia as well as other inflammatory pathologies including lung fibrosis cases were included in the evaluation. Since this was not planned beforehand, only documentation at RDTM enabled a respective analysis. The mean rate of AI-based IPN detection in all CTs was 3.2 IPN/CT. We separated the 186 patients with IPN ≥ 8 mm into non-inflammatory cases (n = 161) and inflammatory cases (n = 25). Within CTs of non-inflammatory cases, the mean detection rate was 5.0 IPN/CT, of which 34% fell into the size category ≥ 8 mm and 67% of these were classified as false-positive during radiological review.
In contrast, in CTs of inflammatory pathologies the mean IPN detection rate was 13.6 IPN/CT, of which 55% fell into the size category ≥ 8 mm and 91% of these were classified as false-positive during radiological review. Although this was not specifically analyzed at the other two sites, personal reports from the participating radiologists indicate that the rate of false-positive IPN was indeed notably higher in inflammatory CTs than in CTs of other, non-inflammatory pathologies.
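To get a feeling for what these RDTM figures mean per examination, the three reported quantities can be multiplied into an approximate number of false-positive IPN ≥ 8 mm per CT. This is a back-of-the-envelope sketch, not a result of the study: it assumes that the reported shares refer to nodules and can simply be applied to the mean detection rate, and the function name is our own.

```python
def fp_large_per_ct(ipn_per_ct: float, share_large: float, share_fp: float) -> float:
    """Approximate false-positive IPN >= 8 mm per CT.

    Assumption (not stated in the study): the share of IPN >= 8 mm and the
    share of those judged false positive both apply to the mean IPN/CT rate.
    """
    return ipn_per_ct * share_large * share_fp


# Figures reported for RDTM among patients with IPN >= 8 mm.
non_inflammatory = fp_large_per_ct(ipn_per_ct=5.0, share_large=0.34, share_fp=0.67)
inflammatory = fp_large_per_ct(ipn_per_ct=13.6, share_large=0.55, share_fp=0.91)

# How many times more false-positive large IPN per CT in inflammatory cases.
ratio = inflammatory / non_inflammatory

print(f"non-inflammatory: {non_inflammatory:.2f} FP >= 8 mm per CT")
print(f"inflammatory:     {inflammatory:.2f} FP >= 8 mm per CT")
print(f"ratio:            {ratio:.1f}x")
```

Under these assumptions, inflammatory CTs would carry roughly six times as many false-positive IPN ≥ 8 mm per examination as non-inflammatory CTs in this subgroup. Because the subgroup only contains patients with at least one IPN ≥ 8 mm, these values are not comparable to the cohort-wide FP/CT rates.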
When interpreting the results of this study, it is crucial to consider the primary goal of AI-based detection systems in radiology: enhancing reader sensitivity to avoid missing potential IPNs. These systems often prioritize sensitivity over specificity, which is acceptable as long as the effort to rule out false positives does not offset the benefits of quick and efficient image analysis. Expert verification remains essential in this workflow. Despite the low precision of AI-detected IPNs at UKHD and RDTM, the false-positive rate per CT was low across the study sites, ranging from 0.4 to 1.3 false positives per CT. For software designed to assist radiological assessments as a second read, a rate of approximately 1 false positive per CT is acceptable and does not significantly increase workload, particularly benefiting less experienced readers who gain from the system’s high sensitivity. Both systems in this study demonstrated high sensitivity, although false-negative rates were not systematically evaluated. At RDTM, a review of the respective medical reports for IPNs requiring further investigation revealed that none were missed by the AI systems. However, for experienced readers, the low precision observed in practical applications may outweigh the high sensitivity, underscoring the need for further advancements in this area.
The observations of our study are supported by a very recent systematic review by Julia Geppert and colleagues on the test accuracy of AI-based nodule and cancer detection within LCS programs [37]. It was found that while AI-based detection systems were able to increase the readers’ sensitivity and to reduce the reading time, specificity was generally reduced. Since false positive findings carry the danger of increasing unnecessary interventions on subjects who will not develop cancer, these observations, together with our study results, highlight the need to increase the specificity of commercially available detection systems.
In fact, AI-based software solutions are subject to continual development regarding the models and the diversity of data used for training and validation. To our knowledge, both AI systems have been updated since this study was conducted, and the training datasets were complemented by more heterogeneous CT data of patients with comorbidities to better reflect clinical reality.
5. Conclusions
In this study, we found that both AI-based systems for detecting IPNs under real-world conditions showed comparable detection rates, although their precision was generally low and varied significantly between study sites. This variance was partly due to different review processes at MHH and UKHD/RDTM. Inflammatory conditions, such as pneumonia and fibrosis, likely contributed to several false positives. Additionally, site-specific technical differences, extensive use of contrast agents, and inter-reader variability may have affected the systems’ precision.
Despite these challenges, the false-positive rate per CT was approximately 1, which is considered acceptable if used as a supportive radiological assessment tool, especially for less experienced readers. AI-based detection systems play to their strengths by reducing otherwise missed IPNs as long as they are integrated into existing workflows and tailored to specific needs.
Taken together, this study indicates that commercially available systems are not yet capable of functioning autonomously and still require expert oversight. Enhancing training datasets with more diverse, routine CT data covering the CT applications of daily practice, including contrast agent usage, comorbid patients, and inflammatory pathologies, is crucial for improving the effectiveness of AI-based IPN detection in the future. The representativeness of such datasets would be improved by large multicenter collaborations and/or multinational initiatives, highlighting the need for proper data-sharing practices.
Acknowledgements
We thank Dr. Johannes Gerlach (Alcedis GmbH, Gießen, Germany) for medical writing assistance.