Multi-Label Chest X-Ray Classification via Deep Learning

In this era of pandemic, the future of the healthcare industry has never been more exciting. Artificial intelligence and machine learning (AI&ML) present opportunities to develop solutions that cater for very specific needs within the industry. Deep learning in healthcare has become incredibly powerful for supporting clinics and for transforming patient care in general. Deep learning is increasingly being applied to the detection of clinically important features in images beyond what can be perceived by the naked human eye. Chest X-ray imaging is one of the most common clinical methods for diagnosing a number of diseases such as pneumonia and lung cancer, and many other abnormalities such as lesions and fractures. Proper diagnosis of a disease from X-ray images is often a challenging task even for expert radiologists, and there is a growing need for computerized support systems due to the large amount of information encoded in X-ray images. The goal of this paper is to develop a lightweight solution to detect 14 different chest conditions from an X-ray image. Given an X-ray image as input, our classifier outputs a label vector indicating which of the 14 disease classes the image falls into. Along with the image features, we are also going to use non-image features available in the data such as X-ray view type, age, gender etc. The original study conducted by the Stanford ML Group is our baseline. The original study focuses on predicting 5 diseases. Our aim is to improve upon previous work, expand prediction to 14 diseases and provide insight for future chest radiography research.


How to cite this paper: Pillai, A.S. (2022) Multi-Label Chest X-Ray Classification via Deep Learning. Journal of Intelligent Learning Systems and Applications.

Introduction
Radiology is a branch of medicine that can be divided into diagnostic radiology and interventional radiology [1]. Diagnostic radiology involves examining medical images to diagnose diseases and abnormalities. Chest X-ray radiography is the most common imaging examination, and it demands correct and immediate interpretation to avoid life-threatening diseases. The challenge arises because these images have to be interpreted by radiologists, who are limited by speed and experience, and because of the cost involved in engaging a certified radiologist. Therefore, the health care industry has turned towards deep learning algorithms to automate radiology reporting and make it accurate.
Deep learning in health care is bursting with possibility and remarkable innovation by providing the ability to analyze vast quantities of data at exceptional speed without compromising on accuracy [2]. It enables the creation of algorithms that can learn and make predictions. In contrast to rules-based algorithms, machine learning takes advantage of increased exposure to large data sets and possesses the ability to improve performance with such exposures and learn with experience.
Deep learning is a subset of machine learning built on artificial neural networks, which are statistical and mathematical methods inspired by the way the biological nervous system processes information with a large number of highly connected neurons, nodes or cells.
Neural networks are structured as one input layer, one or more hidden layers and an output layer. Every hidden layer consists of a set of neurons that are fully connected to all neurons in the previous layer. The strength of each connection is determined by a weight that associates inputs (variables or features) with outputs. For a neural network model to perform efficiently and accurately, these weights must be set to suitable values, which are estimated through training.
From a deep learning perspective, radiology images need to be pre-processed differently due to variations in processors and memory restrictions. X-ray images in general are 2D images of a 3D human body. DL algorithms, especially convolutional neural networks (CNNs), have proved more successful in training models with 2D images than with 3D images, which add an extra dimension to the problem. Convolution is a mathematical operation that employs a type of filtering to determine the most useful features in a dataset, and so has applications in finding patterns in signals or filtering signals.

Related Works
In recent years, large sets of radiology images have been made public. The availability of such datasets has helped crowd-source the development and evaluation of deep learning models [3] [4] [5] [6]. In 2017, the NIH Clinical Center released over 100,000 chest X-ray images to the scientific community, comprising 108,948 frontal-view X-ray images of 32,717 unique patients. Several studies have been based on this data.
Wang et al. [7], in their paper "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases", demonstrated that commonly occurring thoracic diseases can be detected and even spatially located via a unified weakly supervised multi-label image classification and disease localization framework.
In 2019, MIT published the MIMIC-Chest X-Ray Database (MIMIC-CXR), a collection of more than 350,000 chest X-ray images associated with 227,943 studies. These images include both frontal and lateral views and were compiled at Beth Israel Deaconess Medical Center in Boston from 2011 to 2016. Images are provided with 14 labels derived from free-text radiology reports using NLP tools [8].
CheXpert is another such dataset, consisting of 224,316 chest radiographs of 65,240 patients who underwent a radiographic examination at Stanford University Medical Center between October 2002 and July 2017, in both inpatient and outpatient centers. Based on the associated radiology reports, these X-ray images were labeled as positive, negative, or uncertain for the presence of 14 common chest radiographic observations. CheXpert data has attracted strong attention for building pre-trained models to address the challenges in X-ray image processing, classification and segmentation. CheXNet is one prominent project by the Stanford ML Group, considered to have state-of-the-art performance in classifying diseases, even better than expert radiologists.

Baseline
In this paper, we take inspiration from the state-of-the-art CheXNet model to train similar CNN-based models that perform multi-class, multi-label classification with transfer learning from models trained on ImageNet data. Irvin et al. [1] developed models to predict five lung conditions: Atelectasis, Cardiomegaly, Consolidation, Edema and Pleural Effusion. Their study achieved AUCs ranging from 0.85 to 0.93. Our goal is to expand the predictions from 5 diseases to 14 diseases while achieving comparable AUCs, and also to compare various pre-trained CNN models and their performance.

Approach and Methods
Our approach primarily focuses on developing a CNN classifier to predict the labels. During research, it is good practice to evaluate models on different datasets, since deep learning algorithms are often validated on historical data, where they perform well, but fail to attain the same level of accuracy when operating outside the range of the training data. This is called overfitting. We compare the performance of different models based on their success metrics on a test set of images and determine the best performing model under different parameter settings.

Convolutional Neural Network
A CNN is a deep learning algorithm that takes an input image, assigns importance (as weights) to various features in the image and is able to differentiate one from the other. The main advantage of a convolutional neural network is its capability to capture the temporal and spatial dependencies in an image by applying relevant filters [9]. A CNN architecture is composed of a convolutional layer, a pooling layer and a fully connected layer. The function of a convolutional layer is to extract features from images. Each convolutional layer can have multiple convolution kernels, each of which is slid across the input to produce a feature map.
All the experiments in our work involve millions of parameters for multiple features and multiple layers of neural networks. This requires a fast machine with graphics processing units (GPUs) and a cloud architecture that has pre-configured drivers and comes with popular Python packages for building neural networks.
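The feature extraction performed by a single convolution kernel can be illustrated with a minimal sketch. This is not the code used in the experiments; it is plain Python that assumes a stride of 1, no padding and a single input channel, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map):
    """Element-wise ReLU activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# A 2x2 kernel that responds to dark-to-bright transitions from left to right.
edge_kernel = [[-1, 1], [-1, 1]]

features = relu(conv2d_valid(image, edge_kernel))
print(features)  # each row is [0.0, 2.0, 0.0]: the edge is detected in the middle column
```

In a trained CNN the kernel values are not hand-picked as here; they are the weights learned during training.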

Experimental Setup
We used the AWS Deep Learning AMI (Ubuntu 18.04) for training and prediction. These AMIs are pre-configured with PyTorch 1.7.1 and Python 3.7 (CUDA 11.1 and Intel MKL) along with other basic data science software such as pandas, numpy and scipy. We used a g4dn.2xlarge instance with one Tesla T4 GPU.

Dataset
We used a version of the CheXpert data with downsampled resolution to train the model. The original data is 439 GB in size, whereas the downsampled version is just 11 GB. We chose the downsampled version to save training time and computing power. Sample images are shown in Figure 1.
Each individual X-ray radiograph is represented by the image and a feature vector containing patient id, sex, age, study number, X-ray view type, and a vector of fourteen expert-labeled observations. We clean the data to omit the sex and age features because we evaluate solely on the image.

Train Data Distribution
This subset of the dataset is split into train and validation sets for model evaluation in an 80:20 proportion. Figure 2 and Figure 3 show the distribution of the train data and the train labels respectively. An unseen test dataset is used to evaluate the class predictions of the trained models.
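An 80:20 random split of this kind might be sketched as follows. The text does not say whether the split was made per image or per patient, so this sketch simply shuffles a list of items; the function name and the fixed seed are assumptions made for illustration.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle `items` reproducibly and hold out `val_fraction` of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical image identifiers standing in for the real file list.
image_ids = [f"image_{i:03d}" for i in range(100)]
train_ids, val_ids = train_val_split(image_ids)
print(len(train_ids), len(val_ids))  # 80 20
```

If several images belong to the same patient, splitting by patient id instead of by image would avoid the same patient appearing in both sets.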

Data Uncertainty
The chest X-ray dataset contains images for which it is hard to decide whether a certain disease is present, so radiologists often leave an uncertain label (value = −1) on such images. We followed an approach similar to Irvin et al. [4] to handle uncertainties. Labels were split into u-zero and u-one categories based on previous studies. Category u-one means that uncertain labels are treated as positive, and u-zero means that they are treated as negative. We treated the labels Atelectasis and Edema as u-ones and the rest of the labels as u-zeros.
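The u-one / u-zero policy described above can be sketched as a small mapping function. The names below are hypothetical, and the sketch only handles the three values mentioned in the text (1, 0 and −1); how missing labels were treated is not stated and is not modeled here.

```python
# Labels whose uncertain value (-1) is mapped to positive, as stated in the text.
U_ONES_LABELS = {"Atelectasis", "Edema"}


def resolve_uncertain(label_name, value):
    """Map an uncertain label (-1) to 1 for u-one labels and to 0 for u-zero labels."""
    if value == -1:
        return 1 if label_name in U_ONES_LABELS else 0
    return value


def resolve_row(row):
    """Apply the policy to a dict of {label_name: value} for one image."""
    return {name: resolve_uncertain(name, value) for name, value in row.items()}


example = {"Atelectasis": -1, "Edema": 0, "Pneumonia": -1, "Cardiomegaly": 1}
print(resolve_row(example))
# {'Atelectasis': 1, 'Edema': 0, 'Pneumonia': 0, 'Cardiomegaly': 1}
```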

Data Pre-Processing
The datasets used for training and testing are pre-processed through augmentation methods. This was done to increase the size and quality of the dataset. Augmentation helps reduce overfitting and enhances the model's generalization ability during training and prediction of class labels.

Modeling
Based on our research, the CNN architecture performs better on multi-class, multi-label classification of image datasets due to the reduction in the number of parameters involved, without losing features that are critical for getting a good prediction. We therefore investigated multiple models based on the CNN architecture, which are discussed in detail below. Each of these models is run on the train and test data with a batch size of 96 for up to 40 epochs, with binary cross entropy as the loss function and the Adam optimizer with an initial learning rate of 0.001, which is decayed by a factor of 10 each time the validation loss plateaus after an epoch.
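Two pieces of this training setup can be illustrated in plain Python: the binary cross entropy averaged over the labels of one image, and a learning-rate schedule that reacts to a plateau in validation loss. This is a sketch, not the training code. It assumes the usual reduce-on-plateau convention, in which the rate is divided by 10 whenever the validation loss fails to improve after an epoch. The class and function names are hypothetical.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over the labels of one multi-label example."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


class PlateauDecay:
    """Divide the learning rate by 10 when the validation loss does not improve."""

    def __init__(self, lr=0.001, factor=0.1):
        self.lr = lr
        self.factor = factor
        self.best = math.inf

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss      # improvement: keep the current rate
        else:
            self.lr *= self.factor    # plateau: decay the rate
        return self.lr


schedule = PlateauDecay(lr=0.001)
for epoch, val_loss in enumerate([0.50, 0.40, 0.40, 0.35]):
    print(epoch, schedule.step(val_loss))
```

In the run above the rate stays at 0.001 while the loss improves and drops to roughly 0.0001 at the third epoch, where the loss stalls.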

Custom Net (Simple Baseline Model)
We started with a simple custom-made CNN model. We trained the model from scratch: the weights are randomly initialized, the input is passed through 4 convolutional layers, and the weights of all the layers are tuned during training. Each convolutional layer has a max pooling layer and ReLU activations. The pooling layers are used to reduce the spatial volume. Figure 4 lists the parameters for each layer. The output of these layers is then passed to a fully connected layer with a sigmoid function to classify a collection of chest X-ray images. The sigmoid function converts the raw output values of the classifier to the corresponding probability values. We considered a probability greater than 0.50 as a positive detection.
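The final step, turning raw classifier outputs into per-label decisions, can be sketched as follows. The function names and the example values are illustrative only; the 0.50 threshold is the one stated in the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a raw score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(raw_outputs, threshold=0.5):
    """Return 1 for every label whose probability is strictly greater than the threshold."""
    return [1 if sigmoid(z) > threshold else 0 for z in raw_outputs]


raw_outputs = [2.0, -1.0, 0.0]       # hypothetical raw scores for three labels
print(predict_labels(raw_outputs))   # [1, 0, 0]
```

Because each label gets its own sigmoid, several labels can be positive for the same image, which is what makes the classifier multi-label.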

DenseNet121
The core idea of DenseNet is to ensure maximum information flow between layers in the network by connecting all layers directly with each other. It has a stack of dense blocks followed by transition layers. Dense blocks contain different units such as convolutions, batch normalization and ReLU activations. Each dense block generates a fixed number of feature vectors, called the growth rate, i.e., the amount of information that layers can transmit. We trained DenseNet121 with initial weights from a network pre-trained on ImageNet data [10].

ResNet-50
ResNet-50 is a convolutional neural network that is 50 layers deep. We initialize the network with pre-trained weights, where knowledge is transferred from ImageNet data. The pre-trained weights of the first 6 layers, which are used for feature extraction, are frozen, and only the weights of the last layer or layers are re-trained with samples from the X-ray dataset [11].

Inception_V3
This network is 48 layers deep and introduces several improvements, including label smoothing, factorized convolutions and an auxiliary classifier that propagates label information lower down the network. We initialize the network with pre-trained weights, where knowledge is transferred from ImageNet data, and freeze these weights for the first 8 layers [12].

Vgg16
The VGG network is a neural network that has already been pre-trained on over a million images from the ImageNet database. The network has 41 layers, of which 16 have learnable weights: 13 convolutional layers and 3 fully connected layers. One of the major disadvantages of the VGG16 network is its huge number of trainable parameters, more than 134 million. We froze the first 6 layers to limit the trainable parameters to 57 k [11] [13]. Figure 5 shows the trainable parameters for each of these models.
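The parameter figures quoted above can be checked with simple arithmetic on the standard VGG16 layout. The sketch below makes two assumptions that the text does not spell out: the input is 224 × 224, so the last convolutional block outputs 512 × 7 × 7 features, and the final fully connected layer is replaced by one with 14 outputs. All names are hypothetical.

```python
# (in_channels, out_channels) of the 13 convolutional layers; all use 3x3 kernels.
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def vgg16_param_counts(num_labels=14):
    conv = sum(conv_params(i, o) for i, o in VGG16_CONV_CHANNELS)
    hidden_fc = linear_params(512 * 7 * 7, 4096) + linear_params(4096, 4096)
    head = linear_params(4096, num_labels)  # assumed 14-label output layer
    return {"conv": conv, "hidden_fc": hidden_fc, "head": head,
            "total": conv + hidden_fc + head}


counts = vgg16_param_counts()
print(counts["total"])  # 134317902
print(counts["head"])   # 57358
```

Under these assumptions the total exceeds 134 million, and the output layer alone holds about 57 k parameters, which is consistent with the 57 k trainable parameters reported if only that layer is left trainable.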

Success Metrics
Accuracy: The ratio of the number of correct predictions to the total number of input samples.
Area Under Curve: AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example [14].
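Both definitions translate directly into code. The sketch below computes AUC exactly as worded above, as the fraction of positive/negative pairs in which the positive example receives the higher score, counting ties as one half. It is written for one label at a time and is quadratic in the number of samples, so it serves as an illustration of the definition and not as an efficient implementation. The function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))                      # 0.75
print(accuracy(y_true, [0, 0, 0, 1]))           # 0.75
```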

Model Training
We used a random 20% split of the train data as validation data and trained each model for a maximum of 40 epochs. AUROC was used as the early stopping criterion. Most of the models achieved their best performance at 20 to 25 epochs. Figure 6 shows the variation in training metrics per epoch. The DenseNet model achieved the highest training AUROC (0.78) and the highest training accuracy (87%), as indicated in Figure 7.

Results
To evaluate the performance of the models, we used unseen test data to predict the multi-label classifications. The validation set contains 200 studies from 200 patients randomly sampled from the full dataset, with no patient overlap with the train set. The test data consisted of 234 images. The distribution of test data and labels is shown in Figure 8.

Test Metrics
Model performance was validated against the success metrics. Details are available in Figure 9.
Based on overall test metrics, DenseNet121 achieved the best performance, with a ROC score of 0.78 and an accuracy of 87%. The other models' ROC values ranged from 0.69 to 0.75 and their accuracy from 83% to 86%.

Test Metrics for Labels-AUROC
DenseNet121 also performed better on individual labels. Figure 10 shows AUROC values for various labels. The DenseNet121 model achieved AUROC ranging from 0.82 to 0.93 for the 5 labels that were part of the study conducted by the Stanford ML Group [1]. Performance was also good for some other labels, such as Pleural Other (AUROC: 0.97) and Lung Opacity (AUROC: 0.91). The worst performance was for Enlarged Cardiomediastinum (AUROC: 0.49).

Test Metrics for Labels-Accuracy
DenseNet121 gave the best accuracy for individual labels. Figure 11 shows accuracy values for various labels. Fracture, Lung Lesion, Pleural Other, Pneumonia and Pneumothorax achieved more than 95% accuracy. Enlarged Cardiomediastinum had the lowest accuracy, 53%.
Figure 10. Test labels comparison-AUROC.
Figure 11. Test labels comparison-Accuracy.

Confusion Matrix
Accuracy results may be misleading when the dataset is unbalanced. We used confusion matrices to visualize true positives, false positives, true negatives, and false negatives. Figure 12 shows the confusion matrices for CustomNet and for the best model, DenseNet121.
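A small constructed example shows why accuracy alone can mislead on unbalanced data. The numbers below are invented for illustration and are not results from the experiments; the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for one binary label."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Illustrative data: 5 positive cases out of 100, and a model that always predicts negative.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts)            # {'tp': 0, 'fp': 0, 'tn': 95, 'fn': 5}
print(accuracy, recall)  # 0.95 0.0
```

The classifier reaches 95% accuracy without detecting a single positive case, which only the confusion matrix (or a metric such as recall) reveals.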

Conclusion
The models were able to achieve ROC of about 0.78 and overall accuracy of about 87 percent. The pre-trained DenseNet121 model gave the best performance for test data prediction. However, all models failed to accurately predict positive cases of certain diseases in spite of a higher overall prediction accuracy. After examining the results, we hypothesize that these poor results are primarily due to the lack of balanced training data. The number of positive cases in the training data was much smaller than the number of negative cases. Considering the uncertainty label as positive is not an effective approach for handling uncertainty in the dataset and is particularly ineffective on certain diseases. Overall, for 5 diseases, the best model was able to achieve AUROC comparable with baseline studies. The model also successfully predicted several other labels, such as Fracture, Lung Lesion, Pleural Other, Pneumonia and Pneumothorax, with more than 95% accuracy.

Optimization
We realize that the class imbalance affected the model's ability to predict positive cases of certain labels. For example, Enlarged Cardiomediastinum has only 4% positive cases in the train data and Pneumonia has only 3%. In the future, we wish to address this by experimenting with oversampling or undersampling techniques. The bias towards the dominant class can be reduced by altering the training data in order to decrease the imbalance.
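One of the proposed directions, random oversampling, might look like the sketch below. It was not part of the experiments reported here. It treats a single binary label in isolation, whereas the multi-label setting would need a strategy that accounts for all 14 labels at once. The function name and the example data are hypothetical.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [(s, l) for s, l in zip(samples, labels) if l == 1]
    negatives = [(s, l) for s, l in zip(samples, labels) if l == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot oversample a class with no examples")
    # Draw extra minority examples with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]


# Illustrative data with 4% positive cases, similar in spirit to the rarest labels.
samples = list(range(100))
labels = [1] * 4 + [0] * 96

new_samples, new_labels = oversample_minority(samples, labels)
print(new_labels.count(1), new_labels.count(0))  # 96 96
```

Oversampling repeats the same few positive images many times, so it is usually combined with augmentation to limit overfitting to those images.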

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.