A Deep Learning Interpretable Model for Novel Coronavirus Disease (COVID-19) Screening with Chest CT Images

In this article, we propose a convolutional neural network (CNN)-based model, a ResNet-50 based model, for discriminating coronavirus disease 2019 (COVID-19) from Non-COVID-19 using chest CT. We adopted the use of wavelet coefficients of the entire image without cropping any parts of the image as input to the CNN model. One of the main contributions of this study is to implement an algorithm called gradient-weighted class activation mapping to produce a heat map for visually verifying where the CNN model is looking at the image, thereby, ensuring the model is performing correctly. In order to verify the effectiveness and usefulness of the proposed method, we compare the obtained results with that obtained by using pixel values of original images as input to the CNN model. The measures used for performance evaluation include accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and Matthews correlation coefficient (MCC). The overall classification accuracy, F1 score, and MCC for the proposed method (using wavelet coefficients as input) were 92.2%, 0.915%, and 0.839%, and those for the compared method (using pixel values of the original image as input) were 88.3%, 0.876%, and 0.766%, respectively. The experiment results demonstrate the superiority of the proposed method. Moreover, as a comprehensible classification model, the interpretability of classification results was introduced. The region of interest extracted by the proposed model was visualized using heat maps and the probability score was also shown. We believe that our proposed method could provide a promising computerized toolkit to help radiologists and serve as a second eye for them to classify COVID-19 in CT scan screening examination. Open Access


INTRODUCTION
A coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has widely spread all over the world and has become a pandemic. The outbreak of COVID-19 has brought effects on many aspects, like daily lives, public health and the global economy. As of June 28, 2020, The World Health Organization has announced that there are more than 10 million confirmed cases of COVID-19 in the world, and more than 499,000 people have died. In addition, the basic reproduction number (R0), defined as the average number of secondary cases produced by one infected individual, is about 6.47 (range 1.66 -10) in China, 2.6 in South Korea, and 4.7 in Iran [1][2][3], indicating that the spread of COVID-19 is getting seriously. Due to unavailability of specific therapeutic drugs or vaccines for COVID-19 [4], it is the top priority to halt the spread of COVID-19 by screening a large number of suspicious cases and isolating the infected individuals from the community. According to the latest guidelines issued by the Chinese government, the diagnosis of COVID-19 should be confirmed by a reverse transcription polymerase chain reaction (RT-PCR) test. However, RT-PCR might not be high enough in terms of sensitivity. Also, false negatives can occur if the sample contains insufficient quantities of the virus; therefore, the test may need to conduct several times before finally confirmed [5][6][7]. Thus, fast and accurate diagnostic methods or tools are urgently and essentially necessary to fight against SARS-CoV-2.
Chest CT is a routine imaging tool for pneumonia diagnosis, thereby providing benefit for diagnosis of COVID-19. The majority of COVID-19 patients demonstrate similar features on CT images, including ground-glass opacities, pulmonary consolidation, and/or interstitial changes with a peripheral lung distribution [8,9]. Although chest CT could serve as a practical approach for early screening of COVID-19, it may show some similar imaging features between COVID-19 and other types of infectious and inflammatory lung diseases. Thus, it is not easy for differentiating COVID-19 from other viral pneumonia. Also, radiologists may take a long time to recognize the features. Moreover, manual reading of CT images is a time-consuming task and subject to fatigue, in turn resulting in human error. Therefore, techniques using artificial intelligence (AI) based automated analysis have the potential to help radiologists analyze COVID-19 from CT images.
Deep learning (DL) is an important breakthrough in AI. One of the typical DL architectures is the convolutional neural network (CNN). The CNN has been widely used in the medical field due to its powerful feature representation [10][11][12][13][14]. Application of CNN techniques together with radiological imaging can be helpful for the accurate detection and classification of COVID-19 [15]. Recent works using the CNN approach for classifying CT images of COVID-19 and Non-COVID-19 are reported [16][17][18][19]. Generally, a region of interest (ROI) from a CT image is cropped and used as input to the CNN models. These studies have achieved a satisfactory performance; however, there is ample room for improvement in terms of sensitivity and accuracy of classification.
In our previous studies, we used wavelet coefficients of original images as input to the CNN systems to histologically classify lung diseases [20] and to discriminate different breast densities [21]. We have obtained satisfactory results in terms of classification accuracy. In the present study, we propose a wavelet-based CNN system for automatically discriminating COVID-19 pneumonia from Non-COVID-19 pneumonia. The inputs to the network are wavelet coefficients of the entire image without cropping any parts of the original image. The present work mainly focuses on further improving performance of classification between COVID-19 and Non-CIVID-19. In this work, a well-known pre-trained CNN model, ResNet-50 was used [22,23]. ResNet, a short name for residual network, is a pre-trained model that has been trained on more than one million images in the ImageNet database [24] and was the winner of Im-ageNet challenge in 2015. ResNet can have a very deep network of up to 152 layers. There are 5 versions of ResNet models, which contains 5, 34, 50, 101, 152 layers respectively. ResNet-50 corresponds to a 50 layer residual network.
While DP has achieved satisfactory accuracy in image classification, one of its problems is model interpretability, a key component in model understanding. Understanding an accurate classification model could provide us more confidence that the model really captures the correct patterns in the target region.
Other than adopting the use of wavelet coefficients instead of raw image data as input to the CNN model, one of the main contributions of this study is that we implemented an algorithm called gradient-weighted class activation mapping (Grad-CAM) [25] to produce a heat map to visually verify where in the image the CNN model is looking at and to ensure the model is performing correctly. In order to demonstrate the effectiveness and usefulness of the proposed method, the results obtained by using pixel values of the original images as input to the CNN model are compared.
The remainder of this paper is organized as follows: In Section 2, we describe the image data set, the proposed CNN model, and the model interpretability of CNN. In Section 3, we present the experimental results. In Section 4, we bring the discussion of the results. In Section 5, we draw the conclusion of this work.

MATERIALS AND METHODS
The CNN environments for implementing the CNN model are as follows. Hardware: Windows10, graphics: NVIDIA Quadro, framework: MATLAB. A ResNet-50-based pre-trained CNN was used and fine tuning operation was conducted. Input data to the network were the wavelet coefficients obtained from COVID-19 and Non-COVID-19 pneumonia CT images. For comparison and verification, raw data of original images (pixel values) were also used as input (hereafter referred to as compared method). The performance of the proposed method and the compared method were evaluated using the 10-fold cross-validation procedure. After obtaining the classification results, the localized region of the chest CT image that determined the final classification was identified by Grad-CAM [25]. Furthermore, the important feature region that influences the probability score of the classification class was visualized using occlusion sensitivity technique.

Image Datasets
The image datasets used in this study were the COVID-19 CT datasets publicly published by a team of researchers at the University of San Diego [26, 27] and the image set published by Joseph Cohen at the University of Montreal [28]. Thus, ethics issues do not arise in this work and the requirement to obtain informed consent was waived. A total of 720 images selected from the above described databases consist of 345 COVID-19 CT images and 375 Non-COVID-19 CT images. The collected images varied in matrix size ranging from 153 × 124 to 1853 × 1485. Since these CT images were collected from COVID-19-related papers from bioRxiv, medRxiv, Lancet, NEJM, JAMA, etc., there is a concern that the Hounsfield unit (HU) values were lost and the resolution of images was reduced. However, a senior radiologist at Tongji Hospital in Wuhan, China (a doctor who has been diagnosing and treating COVID-19 patients) said that the concern did not significantly affect the accuracy of diagnosis decision-making [27]. The difference in the number of images between the two categories was due to the fact that the maximum number of images for the same patient was 3, also, inappropriate images, such as the inclusion of markers in the objects were excluded. Figure 1 shows an example of CT images of COVID-19 and Non-COVID-19 pneumonia.

Wavelet Transforms
The two-dimensional (2D) wavelet transform (WT) has been widely used as an image processing method. Applications to medical image processing include image data compression, image enhancement, and noise removal [29]. In the wavelet analysis, an image is initialized at level 0. The image is decomposed into 4 components of level 1: a low frequency component called low-low (LL) component and 3 detailed components called low-high (LH), high-low (HL) and high-high (HH) components, respectively. Decomposition is further performed on the LL component. More details about the WT can be found in [30].
We implemented a 2D redundant discrete WT (RDWT) method. The RDWT, unlike the conventional WT, does not perform down-sampling operations. Thus, the four components at each level remain the same size as the original image of level 0. In this way, the problem of shift invariance and disappearance of the contour of the decomposed images could be solved. In this study, Daubechies order 2 (db2) was used as the wavelet basis function. The reason of using it is that db2 is a compactly supported orthogonal wavelet. As a result, the coefficient values, which might be able to distinguish features of interest shown in chest CT images, can be captured. Figure 2 shows two examples of four decomposition components of RDWT at level 1 corresponding to Figure 1(a) and Figure 1(c), respectively.

Architecture of the Proposed CNN Model
In this study, fine tuning on ResNet-50 model was implemented. The outline of the network configuration used is shown in Figure 3. ResNet-50 model consists of 16 processing blocks and incorporates two types of shortcut modules (Figure 3(b)). One is a module called identity (ID) block (Figure 3(c)) that there is no convolutional layer in the shortcut path (the input has the same dimension as the output). The other is a module called convolutional block (Figure 3(d)) that there is a convolutional layer in the shortcut path (the dimension of the input is smaller than that of the output). Both modules contain bottleneck structures consisting of 1 block with 3 layers (1 × 1, 3 × 3, and 1 × 1 convolutional layers). It is possible to reduce the number of parameters without significant decline in model performance.   We retrained all layers of the network with CT images obtained from the datasets. In other words, two categories, COVID-19 and Non-COVID-19, were classified using fine-tuning network without frozen layers. The last fully connected layer and the final classification layer of the network were replaced with a new layer which can classify the input images into two categories. The input data to the proposed model were wavelet coefficients obtained from RDWT of the original chest CT images. Of the LL, LH, HL, and HH components, a highly accurate combination of LL, LH, and HH components were selected as 3-channel input to the network [20,21]. In comparison to the proposed method, the pixel values of 3 identical, original CT images were also used as inputs to the same network (the compared method).
Since Resnet-50 requires input images to be of size 224 × 224, the input images were resized with the bi-cubic interpolation. As a pre-processing, the resizing was automatically performed prior to proceeding to the input image layer (Figure 3(a)). We applied 10-fold cross-validation for the network re-training. Of the total 720 images, 648 images were used for re-training and the remaining 72 images for validation. The mini batch size was 81 and optimization algorithm chosen for re-training was stochastic gradient descent with momentum. During the re-training phase, in order to improve the accuracy, the learning speed was made faster in the newly replaced fully connected layer, on the contrary, the learning speed was made slower in the transfer layer. Also, parameters were adjusted so that the learning rate decreased every 5 epochs. Furthermore, an L2 norm regularization was applied to the cost function (also referred to loss function) in order to prevent overfitting. Regarding the epoch setting, accuracy was verified at each iteration cycle, and re-training stops after 5 consecutive iterations when the accuracy has stopped improving.

Gradient-Weighted Class Activation Mapping (Grad-CAM)
The CNN model combines the feature extraction and classification modules into one integrated system. In general, the classification module contains a fully connected neural network model, and the extracted features are converted into a probability score of each class at the softmax layer. The final prediction (classification result) of the network is the category with the highest probability score. Grad-CAM [25] is class-discriminative and localizes the relevant image regions and it uses the gradient (derivative) of the feature map of the final convolutional layer of the network to highlight the significant region in the image for final prediction. Regions with high gradients are the areas that have great effects on the classification result. Figure 4 depicts the flowchart of how to implement Grad-CAM. More details about Grad-CAM can be found in [25].

Occlusion Sensitivity Approach
Occlusion sensitivity is an approach for understanding which parts of an image are most important for classification. Occlusion sensitivity helps us understand the learning behavior of the underlying task by determining whether the network is actually categorizing based on task-specific features [31]. The procedure of the approach can be divided into six steps as follows.
step 1: Classify the target image using the fine-tuned network, and confirm the probability score of the classification category. step 2: Mask (Block) one part of the target image. step 3: Input the masked image created in step 2 to the fine-tuned network . step 4: Calculate the probability score of the classification category. step 5: Move the masked region to another position and implement steps 3 and 4. step 6: Repeat steps 2 -5 until the mask has moved over the entire image. Note that the smaller the mask size and stride, the higher the resolution of the occlusion map will be. However, in this study, taking into account of shortening of computation time, the mask size and the stride were set at integer values closest to 20% and 30% of the input size, respectively. From the results of the above described steps 1 -6, we could consider the following two possibilities. As a result of masking (blocking) one part of the target image and inputting to the network; 1) classified as a Non-target class (or the probability score of the target class dropped significantly). In this case, it is highly possible that the masked (blocked) region is a very important feature for determining the target class; 2) the probability score of the target class increased significantly. In this case, the masked (blocked) region is likely to be a cause area that might be misclassified as a Non-target. In this study, the region that has the greatest influence on the probability score is specified using a heat map.

RESULTS
In this study, we attempted to construct a CNN model for discriminating COVID-19 pneumonia and to make the model interpretable and explainable. The image input layer of the network was replaced with the wavelet transform layer, and the redundant wavelet coefficients of the CT image were used as input data, and all layers of the network were re-trained. For comparison, learning and classification were also conducted using the pixel values of the image as input data (we call it as the compared method). The entire image without cropping it was used as input for both methods. Table 1 indicates the classification performance of the proposed and the compared methods. The measures used for performance evaluation include accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1 score, and Matthews correlation coefficient (MCC). The PPV and NPV are the proportions of positive and negative results in diagnostic tests that are true positive and true negative results, respectively [32]. They describe the performance of a diagnostic test. The higher the value for the PPV/NPV, the more accurate the diagnostic test is. F1 score is an overall measure of a model's accuracy that combines precision and recall. A good F1 score means that a model (or a classifier) being evaluated has low false positive and low false negative. The model is considered perfect when the F1 score is 1, while the model is a total failure when the score is 0. The MCC is used as a measure of the quality of binary and multiclass classifications. The MCC is in essence a correlation coefficient value between −1 and +1. A coefficient of +1 indicates a perfect prediction, 0 an average random prediction and −1 a completely inverse prediction. Figure 5 shows the receiver operating characteristic (ROC) curve for the proposed and compared methods. The area under the ROC curve (AUC) is also shown for each method as an overall measure of classification performance.
In this study, the region being focused on by the network at the time of final judgment (classification) was specified by Grad-CAM. Meanwhile, occlusion sensitivity was used to identify the feature regions that have the strongest influence on the probability score for prediction. Figure 6 illustrates examples of flowcharts for probability scores of classification categories. The upper lows of Figures 6(a)-(c) show COVID-19 CT images. The middle rows of that are results after using occlusion sensitivity. The red region J. Biomedical Science and Engineering of the heat map is the feature region that has the greatest effect on the probability score of the classification category. That is, when an image is input with the red region occluded, the score of the target class decreases and the probability score of the different category increases. The 6 figures in the lower rows of Figure 6 are the results of Grad-CAM. The red regions are the regions that the network paid the most attention to when making the final decision. Figure 6(a) is an example of correct classification (true positive) when wavelet coefficients is used as input, while misclassified (false negative) when original image is used as input. Figures 6(b) shows the result of correct classification (true positive) for the both methods and Figure 6(c) shows that of misclassification (false negative) for both methods, respectively.

DISCUSSION
As shown in Table 1, the overall accuracy of the classification using the proposed method (fine tuning with wavelet-coefficient input) is 0.922, and that using the compared method (fine tuning with the pixel values of the original images) is 0.883 (p < 0.05). A higher accuracy is obtained by the proposed method. Sensitivity and specificity for the proposed method are 0.904 and 0.933, and that for the compared Table 1. Performance results obtained from the proposed method (using wavelet coefficients as input) and the compared method (using pixel values of the original image as input).   method are 0.864 and 0.901, respectively. The results show that the proposed method is superior to the compared method. From the table, our proposed method shows a PPV of 92.6% and a NPV of 91.4%, and that of the compared method shows 89% and 87.8%, respectively. The results suggest that the proposed method perform better as compared to the compared method. The F1 score of the proposed method and that of the compared method are 0.915 and 0.876, respectively, which shows the superiority of the proposed method. The MCCs are 0.839 and 0.766 for the proposed method and the compared method, respectively. The results indicate that the proposed method has a higher correlation with the correct label as compared to the compared method. It is clear from Figure 5, the proposed method achieves better result (AUC = 0.976) as compared to the compared method (AUC = 0.959).
In the present study, we used the whole CT image without cropping the ROIs, such as the portions of lesions. As a general perception, there is a problem of where the network is looking at for judgement. Figure 6 is an example to illustrate the ROI where the network is paying attention to. The red regions in the middle rows (occlusion sensitivity map) of the 3 figures show the most important feature regions that influence the probability scores of the classification classes. In all the cases of Figure 6, the important features extracted by the network are strongly focused in the thoracic cavity, rather than in the background or body figure. This suggests that the network has learned the task-specific features. In addition, by using the proposed method, the network has a strong tendency to determine COVID-19 with confidence that the probability score is ranging from 99% to 100%. On the other hand, in the case of the compared method, the probability score tends to vary from 50.2% to 100%. This demonstrates the instability of learning due to the use of the compared method. Figure 6(a) and Figure 6(b) are examples showing correct classification of COVID-19 with a probability score of 100% by using the proposed method. Here, if you make the red region of the input image unclear, the probability score decreases. It means that the red region is an important feature for judging COVID-19. The blue region is generally a region that negatively affects the score of the class. However, in the case of the proposed method shown in Figure 6(a) and Figure 6(b), the probability score for each case is 100%, which means that the probability score is not affected by the blue region. This implies that there is no necessity to crop the image when the proposed method is used. The red regions at the lower rows (Grad-CAM) of Figures 6(a)-(c) are the regions where the network pays the most attention to at the time of the final judgement and has a great influence on the classification result. Figure 6(b) is an example of successful classification for both methods, while their corresponding Grad-CAM regions do not match each other. In the proposed method, the judgment is made with emphasis on the consolidation of the posterior basal segment of the left lung. In contrast, in the compared method, not only the consolidation of the posterior basal segment of the both lungs and that of the medial segment of the left lung, but also the latissimi dorsi is regarded as the ROI in the process of judgement. In fact, the latissimi dorsi is unrelated to COVID-19. That is, the judgement made by the compared method was influenced by a bias irrelevant to the classification category. Figure 6(c) is an example showing a case of misclassification judged by both methods. It can be seen from the red region of occlusion sensitivity map that the network of the proposed method focuses on the pneumonia feature, i.e., ground-glass opacity. In contrast, for the compared method, the anterior mediastinum of the heart is also shown in slightly red. This might be due to learning the wrong features. As a result, the respective probability scores are 55.5% and 57.9% and misclassified as Non COVID-19, respectively. As shown in red color on Grad-CAM, the ROIs focused by both methods are ground-glass opacity of the posterior basal segment of the right lung. The region localization is considered to be correct, however, the judgment is incorrected. In other words, a CT image with a wide mediastinum range, a narrow lung field area, and frosted glass shadows in both lungs might confuse the judgment made by the network. In order to solve this problem, it would be necessary to increase similar learning data for re-training.
There are several limitations in this study. First, a high sensitivity is considered to be a significantly important factor in the screening of COVID-19, however, in our experiments, specificity was higher than sensitivity (see Table 1). To deal with this issue, further improvement in the proposed method is required. Thus, adjusting the parameters of the network might be necessary. Second, the image data set used in this study contained some images with reduced resolution and loss of CT values. Appropriate selection of learning data is considered important in the subsequent study. Third, we did not confirm the treatment outcomes of the analyzed patients, because they are beyond the purpose of this study. In addition, the wavelet basis function used was db2. However, it is undeniable that the use of other basis functions or optimization of the combination of wavelet coefficients might yield better results. Further investigation on it will be conducted in our future study. Moreover, we employed ResNet-50 in this study. Thus, it would be possible that the network layers are too deep for the number of training data used in the experiments. To verify this issue is also one of our future tasks.

CONCLUSIONS
In this study, a pre-trained network based on ResNet-50 model was employed. Transfer learning was performed using wavelet coefficients of CT images as input to the fine tuning CNN. The network was used to classify COVID-19 pneumonia and Non-COVID-19 pneumonia from chest CT images. As a comprehensible classification model, the interpretability of classification results was introduced. The region of interest extracted by the network was visualized using heat maps and the probability score was also shown. For comparison, the case of using pixel values of the original image as the input of the fine tuning CNN was also shown. The overall accuracy of the classification of the proposed method was as high as 92.2% as compared to 88.3% obtained from the compared method. Other than overall accuracy, in all the calculated performance, measures obtained from the proposed method were higher than that obtained from the compared method. The experiment results demonstrated the superiority of the proposed method over the method that used the pixel values of the original images as input to the CNN.
Furthermore, by visualizing the extracted features from the region of interest created by the CNN, it is obvious that by using the proposed method, i.e., using the wavelet coefficients of the entire image without cropping way any parts of the image as input to the CNN, the network could correctly learn the extracted features. In addition, the basis of the judgment, i.e., what the CNN is looking, was apparent by visualizing the ROI at the final classification stage. This kind of comprehensible classification model could give insights to users about important predictive relationships in the image data. It is considered that model comprehensibility is very important for the model's acceptance by users in classification & prediction applications. We believe that our proposed method will provide a promising computerized toolkit to help radiologists and serve as a second eye for them to classify COVID-19 in CT scan screening examination.