A Validation Study of the Deep-Learning-Based Prostate Imaging Reporting and Data System Scoring Algorithm
Open Journal of Radiology

Purpose: The Prostate Imaging Reporting and Data System (PI-RADS) was introduced to standardize prostate cancer diagnosis by MRI. However, inter-reader agreement in PI-RADS scoring is not always high. The purpose of this study was to validate a deep-learning-based diagnostic algorithm for PI-RADS scoring. Methods: We applied a Siemens Healthineers Prostate Artificial Intelligence (AI) prototype (work in progress) for fully automated prostate lesion detection, classification and reporting. More than 2000 bi-parametric MRI studies, along with their PI-RADS reports, were used as training, validation, and test data. This prospective validation study enrolled 101 consecutive patients suspected of having prostate cancer, of whom 100 were included in the analysis. All subjects underwent non-contrast-enhanced bi-parametric MRI including T2-weighted and diffusion-weighted imaging. Two board-certified radiologists independently scored the PI-RADS. Conclusion: Our deep-learning-based algorithm has been validated and shown to help score PI-RADS.


Introduction
Prostate cancer is the second most frequently diagnosed cancer in males worldwide, and it is the most frequently diagnosed cancer among men in developed countries [1] [2].
The difference in prostate cancer diagnosis rates between regions is largely due to differences in the prevalence of prostate-specific antigen (PSA) testing [3]. PSA testing is widely used in screening for prostate cancer, but it yields both false positives and false negatives [4]. The definitive diagnosis is made pathologically by needle biopsy, which is highly invasive [5]. MRI has come to be used as a noninvasive technique supporting the diagnosis and localization of prostate cancer [6]. The Prostate Imaging Reporting and Data System (PI-RADS) was introduced to standardize prostate cancer diagnosis by MRI [7]. However, image interpretation by PI-RADS scoring requires experience, and it has been reported that even when this scoring system is used, inter-reader agreement is not always high [8] [9] [10].
In recent years, artificial intelligence (AI) has been actively used in the field of diagnostic imaging [11] [12]. In particular, deep learning can potentially discriminate suspicious and nonsuspicious images with very high accuracy. There are many reports using AI in the field of prostate cancer, such as computer-aided diagnosis of the Gleason score from pathological images [13].
Siemens Healthineers has developed a system that detects and segments prostate lesions and outputs PI-RADS scores using bi-parametric MRI including T2-weighted images (T2WI) and diffusion-weighted images (DWI) as input.
Use of the AI model is expected to contribute to quick and accurate diagnosis of prostate cancer. Before the developed AI model can be put into operation, it must be validated in an actual clinical setting. The purpose of this study was to validate a deep-learning-based diagnostic algorithm for PI-RADS scoring by comparing it with the interpretation of radiologists.

AI Model Development
We applied an AI prototype (Prostate AI Prototype, version of December 21, 2019, work in progress, Siemens Healthcare, Erlangen, Germany) for fully automated prostate lesion detection, classification and reporting. The prototype consists of a web-based reading platform for viewing and interpreting the image data and AI-based results, as well as the actual AI preprocessing pipeline and a component for lesion detection and classification, based on deep learning [14] [15]. The preprocessing stage begins with a fully automated segmentation of the prostate gland and peripheral zone on T2WI using a 3D convolutional neural network (CNN). Then, T2WI and DWI are co-registered, and an apparent diffusion coefficient (ADC) map and a calculated DWI at b = 2000 s/mm² are computed. Using 2D CNNs, Prostate AI automatically detects clinically relevant lesions (PI-RADS 3 or above) within the prostate gland based on the T2WI, ADC and b = 2000 s/mm² images, followed by a false-positive reduction step using a 2.5D multi-scale neural network. Finally, an independently trained 2.5D CNN predicts the PI-RADS v2 category of each lesion. In total, 2170 bi-parametric MRI studies from seven different clinical institutions were used during model training, testing and validation.
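The ADC map and the calculated high-b-value image mentioned above can be illustrated with a minimal per-voxel sketch. It assumes the standard mono-exponential diffusion model, S(b) = S0 · exp(−b · ADC), and two acquired b-values. The prototype's actual fitting method and acquired b-values are not stated here, so the function names, the b-values (50 and 800 s/mm²) and the signal intensities below are hypothetical.

```python
import math

def adc_from_two_b(s_low, s_high, b_low, b_high):
    """Estimate ADC (mm^2/s) for one voxel from signals at two b-values.

    Assumes the mono-exponential model S(b) = S0 * exp(-b * ADC).
    """
    if s_low <= 0 or s_high <= 0:
        # Background or noise voxel: no meaningful estimate is possible.
        return 0.0
    return math.log(s_low / s_high) / (b_high - b_low)

def calculated_dwi(s_low, adc, b_low, b_target=2000.0):
    """Extrapolate the voxel signal to b_target (s/mm^2) with the same model."""
    return s_low * math.exp(-(b_target - b_low) * adc)

# Hypothetical voxel: b-values and signal intensities are illustrative only.
b_low, b_high = 50.0, 800.0
s_low, s_high = 1000.0, 400.0

adc = adc_from_two_b(s_low, s_high, b_low, b_high)
s_2000 = calculated_dwi(s_low, adc, b_low, b_target=2000.0)
print(f"ADC = {adc:.6f} mm^2/s, calculated signal at b=2000: {s_2000:.1f}")
```

Because tissue with restricted diffusion has a low ADC, its signal decays slowly with increasing b and stays comparatively bright on the calculated b = 2000 s/mm² image. This is why that image is a useful input for lesion detection.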

Sample Selection of the Validation Study
The present prospective analysis was approved by the Institutional Review Board, and 101 consecutive patients suspected of having prostate cancer from March to July 2019 were included. The mean age ± standard deviation was 67.0 ± 10.2 years. The mean PSA for all patients was 10.4 μg/mL (range 0.018 to 203 μg/mL).
Two board-certified radiologists (R. I. and M. A.) independently assigned a PI-RADS score to each case; in cases of disagreement, a third radiologist (S. O.) made the final decision and confirmed the diagnosis. When multiple lesions were detected in a single patient, the lesion with the highest category was adopted. We compared the results of the AI model with the radiologists' interpretation.
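The reference-standard rules described above (agreement between two readers, adjudication by a third reader on disagreement, and the highest lesion category per patient) can be written as a short sketch. The function names and the example scores are hypothetical and serve only to make the rules explicit.

```python
def consensus_category(reader_a, reader_b, adjudicator=None):
    """Return the reference PI-RADS category for one case.

    If the two independent readers agree, their score is used; otherwise
    the third reader's score decides.
    """
    if reader_a == reader_b:
        return reader_a
    if adjudicator is None:
        raise ValueError("readers disagree; an adjudicated score is required")
    return adjudicator

def patient_level_category(lesion_categories):
    """Highest PI-RADS category among a patient's lesions (None if no lesion)."""
    return max(lesion_categories) if lesion_categories else None

# Hypothetical examples
print(consensus_category(4, 4))                 # readers agree
print(consensus_category(3, 4, adjudicator=4))  # third reader decides
print(patient_level_category([3, 5, 4]))        # highest category is adopted
```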

Results
Of the 101 patients, one was excluded because gross body movement between the image series caused such severe misalignment of the T2WI and DWI that the case could not be analyzed by the AI model. In total, 100 patients were included in this study. For the cases with PI-RADS ≥ 4, the AI model correctly identified 29 of those as category ≥ 4. The sensitivity of our AI model for PI-RADS ≥ 4 was 0.76, and the specificity was 0.76. For the cases with PI-RADS ≥ 3, the AI model correctly diagnosed 40 cases as category ≥ 3. The sensitivity for PI-RADS ≥ 3 was 0.69, and the specificity was 0.76 (Table 1). In the lesion-based analysis, 7 PI-RADS 3, 16 PI-RADS 4, and 10 PI-RADS 5 lesions were identified as PI-RADS ≥ 3 in the peripheral zone, with AI detection rates of 43%, 63%, and 100%, respectively.
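For reference, the patient-level metrics can be reproduced from a 2 × 2 confusion table. In the sketch below, only the true-positive counts (29 and 40) and the total of 100 patients come from the text. The remaining counts are an assumption: they were back-calculated from the reported sensitivity and specificity rounded to two decimals and are not taken from Table 1.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-table counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# tp values are reported in the text; fn, tn and fp are back-calculated
# assumptions consistent with the reported rates for 100 patients.
sens4, spec4, acc4 = diagnostic_metrics(tp=29, fn=9, tn=47, fp=15)   # PI-RADS >= 4
sens3, spec3, acc3 = diagnostic_metrics(tp=40, fn=18, tn=32, fp=10)  # PI-RADS >= 3

print(f"PI-RADS >= 4: sensitivity {sens4:.2f}, specificity {spec4:.2f}, accuracy {acc4:.2f}")
print(f"PI-RADS >= 3: sensitivity {sens3:.2f}, specificity {spec3:.2f}, accuracy {acc3:.2f}")
```

Under these assumed counts, the threshold of PI-RADS ≥ 4 gives 38 reference-positive and 62 reference-negative patients, and the threshold of PI-RADS ≥ 3 gives 58 and 42, respectively.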

Discussion
For lesions of category 4 and above, the AI model achieved an accuracy of 76%, with a sensitivity of 76% and a specificity of 76%.
The AI model correctly diagnosed lesions larger than 15 mm in size (Figure 1), except for one case (Figure 2). Even in this one miscategorized case, the lesion itself was detected. The reason one PI-RADS 5 lesion was diagnosed as category 3 may be that the lesion was too large for the AI model to recognize its boundary.
More than half (62%) of the PI-RADS 4 lesions smaller than 15 mm were correctly detected (Figure 3), though 8 of them were classified as PI-RADS 5. Some cases with small lesions could not be detected correctly by the AI model (Figure 4). Small but clinically significant cancers should not be overlooked.
The detection rate of lesions in PI-RADS 3, especially in the transition zone, was low. In our institution, radiologists tended to recognize areas with faint […] so future studies will be needed to assess the actual cancer detection rates based on histopathological samples. Prostate cancer is often not detected pathologically in lesions of PI-RADS 3 [9] [10]. Therefore, it is considered important to correctly diagnose lesions of PI-RADS 4 or higher, and the present result was considered to be acceptable.
False positives in AI diagnosis were caused by BPH nodules, chronic prostatitis, and rectal gas artifact. These conditions cannot be diagnosed by signal intensity alone and require careful consideration of morphology and image properties, which the AI is trained to perform but still does not always get right.
In future clinical practice, the radiologist will make the final diagnosis after AI presents the lesion. If there are many false positives, the confirmation work of radiologists will increase, but if there are many false negatives, there is a possibility that oversights will increase. It is necessary to use AI diagnosis support wisely depending on the situation.
This validation study has several limitations. First, no comparison with histopathological diagnosis was made. This was done on purpose, as we wanted to reflect a clinical, prebiopsy scenario as accurately as possible. The PI-RADS category does not indicate the definite existence of prostate cancer, so the algorithm was trained to detect radiological lesions, and the purpose of the software is to support radiologists during their work. Nevertheless, it would also be clinically important to evaluate performance against pathology-based truth. Second, we used PI-RADS v2 and not v2.1. During the truthing process, v2 was the most recent reference system, and all subsequent steps were designed based on this system. Third, the validation study was done on one MR device in one institution. Further studies at more institutions, using MRI scanners from different vendors, are desired.

Conclusion
Our deep-learning-based algorithm has been validated and shown to help score PI-RADS.