Segmentation of Diabetic Retinopathy Lesions: The Common Fallacy and Evaluation of Real Segmenters

In the context of automated analysis of eye fundus images, it is an important common 
fallacy that prior works achieve very high scores in segmentation of lesions, 
and that fallacy is fueled by some reviews reporting very high scores, and 
perhaps some confusion with terms. A simple analysis of the detail of the few 
prior works that really do segmentation reveals scores between 7% and 70% in 
sensitivity for 1 FPI. That is clearly sub-par with medical doctors trained to 
detect signs of Diabetic Retinopathy, since they can distinguish well the 
contours of lesions in Eye Fundus Images (EFI). Still, a full segmentation of 
lesions could be an important step for both visualization and further automated 
analysis using rigorous quantification or areas and numbers of lesions to 
better diagnose. I discuss what prior work really does, using evidence-based 
analysis, and confront with segmentation networks, comparing on the terms used 
by prior work to show that the best performing segmentation network outperforms 
those prior works. I also compare architectures to understand how the network 
architecture influences the results. I conclude that, with the correct 
architecture and tuning, the semantic segmentation network improves up to 20 
percentage points over prior work in the real task of segmentation of lesions. 
I also conclude that the network architecture and optimizations are important 
factors and that there are still important limitations in current work.

distinguish sufficiently well not only coarse regions around lesions but also the contours and areas of many individual lesion instances. Likewise, the better the automated segmentation procedure is at classifying each pixel, the more precise automated analysis and quantification will be (e.g. areas, contours, number of instances). Difficulties segmenting lesions include the size of those lesions, frequent lack of contrast and varying shapes. Most prior works frequently said to be related to this task do not segment at all. The subtle difference is that to segment an approach needs to receive as input the whole image and needs to trace the contours of each lesion. Most "lesion detection" approaches in related work do not even find locations of lesions in the image. Instead, they receive as input small squares and classify those squares as a certain lesion or as background. In other words, they do the easy work and leave out the difficult work. Worse still, surveys such as [1] [2] [3] can report scores of 90% to 100% in tasks that are segmentations of lesions but are not in reality, and I look at the details of the prior work to show that. There are three main reasons for the confusion: first of all, most works do not segment at all, as just explained before and review in related work; secondly, for those that do segment, when looking into the details of their experiments one discovers that those works only report very high scores for identification of lesions at image level (whether there is a lesion or not in an image), reporting much lower scores in sensitivity versus FPI in true segmentation of lesions. The other problem that was also mentioned for instance in [4] is that many scores are artificially high, because in the context of eye-fundus images, the background is huge and hence the term TN (true negatives) is also huge, making specificity, ROC and AUC inviable as scores. Sensitivity is also inviable if reported alone because it lacks recognition of FP (false positives), and in the context of segmentation of lesions the default classification threshold of 0.5 is associated with too many FP. That is why authors reporting segmentation scores correctly use the sensitivity when FP varies between 1 to 10 false positives per image instead of just sensitivity. The last problem also frequent is that the approaches are evaluated based on the degree of overlap between segments, which means that, for instance, a large region matching only 20% with the true segment can be considered a match if the overlap threshold is 20%. An overlap of 20% or even 50% between regions, no matter the shape of size, is a very bad tracing of contours that is accounted for as 100% correct in those works.
The focus of this work is on explaining what prior works propose, expose the common fallacy and then build and evaluate semantic segmentation networks that really segment the lesions, comparing to prior work, to show that a good-performing segmentation network is better than prior work. This destroys a common fallacy and shows the way for future improvements in segmentation of lesions in eye-fundus-images.
This article is very relevant because it identifies an important common fallacy concerning the perception of what prior works achieve in segmentation of diabetic retinopathy lesions, and compares true segmenters with what they do in terms of segmentation.

1) The Lesions segmentation problem
Lesions that are characteristic of Diabetic Retinopathy (DR) in different stages include micro-aneurisms (MA), which are small red and rounded regions resulting from augmented capillaries, exudates (hard and soft HE and SE), which are yellowish deposits of lipids and proteins, and hemorrhages (HA), larger blood stains that are a serious signal of advancing conditions. Proliferative Diabetic Retinopathy also exhibits neo-vascularization and other affections [5]. Figure 1(a) shows the original Eye Fundus Image (EFI) and Figure 1(b) shows the corresponding groundtruth pixelmap with hard and soft exudates (EX, HE, SE), microaneurysms (MA) and hemorrhages (HA). It also shows the optic disc (OD). The segmentation Deep Convolution Neural Network (DCNN) is given a large dataset with images similar to the one shown in Figure 1(a) and pixelmaps similar to Figure 1(b) and learns how to classify each pixel of the image to obtain a pixelmap as close as possible to the pixelmap shown in Figure 1 is an example of a segmentation result, showing the lesions and optic disk that were detected by the deep learning network. In summary, semantic segmentation is the classification of each pixel as one of the possible classes that includes each type of lesion plus the background (BK, the background is everything that is not a lesion and covers more than 93% of all the EFI area). Evaluation is based on assessing the quality of segmentation of all pixels or pixels of a certain class (e.g. lesion), using a typical set of metrics. Figure 2 shows a slightly different definition of groundtruths and evaluation most used in related work. The images are taken from DIARET-DB1 dataset [7]. In this case the groundtruth segments are shapes with varied sizes that enclose lesions or sets of lesions. The quality of segmentation is assessed by analyzing the amount of overlap between the segments each pixel belongs to and the groundtruth segments. A match is found if the overlap is above a certain threshold σ between 0 and 1, in which case all pixels in the segment are considered a hit (e.g. TP or TN depending on the case).
Semantic segmentation DCNNs can be applied and evaluated with both types of groundtruth and evaluation approach. Since prior work evaluates using the second approach and DIART-DB1 is one of the most frequent datasets used in those evaluations, it is used here as well to compare with prior works, while the first dataset and approach is used to evaluate and compare the quality among segmentation networks because it is more pixel-wise.

2) Deep Learning Approaches to Segment Lesions
Most proposals described as segmentation and/or identification of lesions in related work follow two main approaches shown in Figure 3(a) and Figure 3(b). Figure 3(a) shows the generation of lesions CAM/heatmaps from the inner coefficients of a Diabetic Retinopathy Classification network (e.g. [8] [9]).  DCNNs, which consist of a set of convolution layers with pooling and ReLU that encode the image or window into features, followed by a fully connected neural network that produces the classifications. However, the classifier of small windows is not by itself a real segmentation approach, because all it does is classify windows given to it directly, it does not process the image to segment contours.
The most difficult part, actually localizing the lesions and the contours, is completely avoided, and up to hundreds of thousands of small windows (as many as the pixels) would have to be classified to segment the image using such an approach. Processing hundreds of thousands of windows would take a huge amount of time. It will be shown that the only small window classifier that processes the whole image uses trivial thresholding for hand-crafted segmentation of MA lesions, which is sub-par compared with end-to-end learning and produces lower scores. On the contrary, the segmentation network shown in

Related Work
Lesions detection and localization from eye fundus images using either classical Machine Learning (ML) or Deep Learning (DL) was surveyed most recently in works such as [1] [2] [3], where most tasks report very high performance scores (90% and 100%) measures in metrics such as sensitivity, ROC/AUC and others. However, if we look closer only at approaches that are reviewed and claim to "segment" lesions, the list is already filtered to a much smaller number, including Prentasic et al. [10], Gondal et al. [8], Quellec et al. [9] (exudates, hemorrhages and microaneurisms), Haloi et al. [11], van Grinsven et al. [12], Orlando et al. [14] and Shan et al. [13] (microaneurisms, hemorrhages or both), and if we look at their performance results in detail and using the adequate metrics, it is possible to see that scores are not as high. Also importantly, there are other details that should be taken into account when analyzing related work, which we discuss next. For a work to propose a way to segment, it should input an EFI image and output all segments of its lesions or at least of a specific type of lesion that exists in the EFI image, and we should expect experiments in those papers to include a setup that does and evaluates that. Prentasic et al. [10], Haloi et al. [11], Van Grinsven [12] and Shan et al. [13] do not fit this definition of segmentation.
Those works are classification DCNNs that classify small square windows as being of a specific type of lesion or not. They can also be seen as classifiers of a single pixel, the center pixel of the squared window since the classification they output does not intend to guarantee that all the other pixels in the window have that same type. Although an image typically has hundreds of thousands of pixels, those works do not clarify how the classifier would be applied to the whole image or how segmentation of the whole image would work based on the classifier. Not only those works in general involve no segmentation, as the test windows for evaluating the approaches use windows picked based on the groundtruths with the lesions perfectly centered. Such experiments are totally unrealistic from the perspective of segmentation, where the input should be the EFI image and there is no prior knowledge of the location of lesions. The authors make no attempt in the papers at applying this as a building block to segment lesions, and there is also no good solution provided or experimented to scale this to real-time operation classifying all pixels of an EFI in a way that could enable segmentation. For instance, [10] is a pixel classifier that takes a window around a pixel to be classified, achieving 77% sensitivity classifying exudates, but the small test squares used for evaluation, as well as the train ones, were picked based on the groundtruth to get squares centered on a lesion (and also negative squares with background only). All that is done is the classification of those squares, the authors do not propose or experiment with scaling this to classify or segment all pixels in an EFI image. [12] and [13] have exactly the same limitations. [12] reports, at 1 false positive per image (FPI), a maximum sensitivity of 0.786 detecting Hemorrhages. Open Journal of Medical Imaging Haloi [15] is another pixel/window classification approach, but there the authors propose to preprocess the image with a color threshold to leave out trivial non-MA pixels to reduce computation time. To detect MA in an unseen image, they first apply a mask to get all pixels of interest, removing the usual black region, and apply a color threshold to leave out trivial non-MA pixels to reduce computation time. Then a window of size 129 × 129 centered at each image pixel is extracted. The classifier is used to classify every of the extracted pixels, resulting in a probability map that is then post-processed to remove false detections by analyzing connected regions using the concept of convexity and area. The authors report sensitivity for 1 false positive per image (FPI) around 50% (70% for 10 FPI). This is an interesting work that, contrary to the previous ones, deals with the whole EFI image, but it also has some limitations. First of all, the colour threshold is a very basic hand-crafted solution that easily returns too many false positives and also easily misses many MA, and the time cost of processing all remaining image pixels one-by-one, each one centered on a 129 × 129 square window around it can be large if the threshold is set to keep more pixels. A tradeoff would need to be studied between the colour threshold, precision of the approach and runtime. The main advantage of the segmentation network compared to this threshold is that it is optimized end-to-end. The backpropagation learning learns to extract MA and other lesions based not on a fixed manually-defined colour threshold, but rather on learning adjustments to convolution filter coefficients. Another limitation of [15] is that it is only dealing with MA, the procedure would have to be defined for other lesions as well, and it is not clear how the colour threshold and post-processing details would be defined for those.
The remaining related works Gondal et al. [8], Quellec et al. [9] and Orlando [14] all segment the EFI. [8] and [9] are based in DR classifier networks from which CAM/heatmaps are extracted to get the positions of lesions. Orlando [14] uses a different approach that combines DL with image processing to find can- This related work section is ended by briefly reviewing deep segmentation networks. The segmentation network has two well-distinguished parts, the encoder, most frequently an existing Convolution Neural Network (CNN) without the final fully connected layers, a decoder that reinstates the full image size, and the pixel classifier layer that assigns a score for each class to each pixel. The Fully Convolutional Network (FCN) [16] uses a CNN for encoding (e.g. VGG16 [17]), replacing all the fully connected layers by convolutional layers with large recep-tive fields and adds up-sampling layers based on simple interpolation filters. The U-Net [18] is another segmentation network especially designed for segmentation of biomedical images (around 75 layers). The architecture consists of "a contracting path to capture context" and a "symmetric expanding path that enables precise localization". Finally, the DeepLabV3 network [19] is used in experiments, using Resnet-18 encoder and applies some new techniques to improve the quality of segmentation, including Atrous Spatial Pyramid Pooling (ASPP) [20] that is implemented to better capture objects at multiple scales, and Conditional Random Fields (CRF) for improved localization of object boundaries using probabilistic graphical models. From the mentioned networks, best results are obtained with DeepLabV3 [21], which I hypothesize to be due both to the ASPP and CRF and to the use of Resnet-18 as the encoder CNN.

DCNN Segmentation Architectures
The DCNN (deep convolution neural network) networks used in segmentation are distinguished by different architectural choices and innovations. Our purpose is to choose the best possible architecture and to understand the influence of the architecture in the quality of the results. Next, the main characteristics and layers of the four architectures tested are summarized, and Figure 4 shows a rough sketch of those architectures.
Simple: "Simple" is a basic encoder-decoder CNN architecture that was built from scratch to compare with deeper, more complex architectures. With only four convolution layers accompanied by max pooling and ReLU operations on each, and another four deconvolution layers accompanied by ReLU plus a softmax and pixel classification layer for output. The convolution layers apply 64 3 × 3 filters with stride [11], and the deconvolution layers apply 64 4 × 4 filters. Table 1 summarizes the layers of "Simple".  Figure 4(b) and whose layers are summarized in Table 2, uses a DCNN classification network for feature extraction or encoding (VGG-16 with 7 stages corresponding to 41 layers), plus a much smaller sequence of up-sampling layers (decoding stages) for a total network size of 51 layers. FCN also forwards feature maps (the pooled output of coding stage 4 is fused with an output of the first up-sampling layer, and the pooled output of coding stage 3 is fused with the output of the second up-sampling layer. Finally, the image input is also fused with the output of the third up-sampling layer, all this followed by the final pixel classification layer. Segnet (and U-Net): Segnet shares a similar architecture with U-Net, with encoder stages and symmetric decoding stages. The Segnet uses VGG-16 for feature extraction (encoding), while the decoder up-samples using pooling indices computed in the max-pooling step of the corresponding encoder, to perform non-linear up-sampling (while the U-Net forwards cropped feature maps directly after ReLu regularization at each stage, which are concatenated with the corresponding stage outputs at the destination, Segnet forwards max-pooled outputs and unpools at the destination). Segnet can be configured with any number of stages, defined for this work 5 encoding layers and the symmetric decoding layers, for a total of 73 layers. Segnet (and U-Net) architecture is sketched in Figure 4(c).
DeepLabV3: it is the deepest network tested in this work, with 100 layers listed in Table 3. The general layout of layers in DeepLabV3 is shown in Figure  4(a). DeepLabV3 uses Resnet-18 as feature extractor, with 8 stages totaling 71 layers, the remaining stages being ASPP plus the final stages. Forwarding connections are also added from encoding stages to the Atrous Spatial Pyramid

P. Furtado
Pooling (ASPP) layers, for enhanced segmenting of objects at multiple scales. The outputs of the final DCNN layer are combined with a fully connected Conditional Random Field (CRF) for improved localization of object boundaries using mechanisms from probabilistic graphical models.

Experimental Setup
In experimental work two datasets are used, one to compare between segmentation networks that are very useful to evaluate semantic segmentation quality pixel-wise, the other one to compare with prior work because most prior works use it and are evaluated based on the connected components model (overlap between found segments and groundtruth segments). The first dataset is IDRID [6], a dataset that is publicly available for the study of automated detection of for 300 epochs, after initial tests in which it was observed that they would stabilize much before that number of epochs. The minibatch size was configured to 32, and the momentum was 0.9. I also experimented with data augmentation to improve the robustness and quality of the network predictions. These consisted with the random transformations (rotations, small translations and scaling) and adding the two datasets. In terms of hardware, a machine running windows 10 was used. The hardware was an intel i5, 3.4 GHz, 16 GB of RAM 1TB SSD disk.
A GPU was added to the PC, consisting of an NVIDEA GForce GTX 1070 GPU (the GTX 1070 has a Pascal architecture and 1920 cores, 8 GB GDDR5, with memory speed of 8 Gbps).
In terms of reporting the results of segmentation networks on IDRID, two scenarios are distinguished, one without data augmentation and one with data augmentation, defined in Table 4.
The same dataset, metrics and results that were used in prior works are also used in the comparison of the segmentation networks to those prior works, and the segmentation network with the best performance was chosen for those comparisons (deepLabV3). The main metric used for the comparison is sensitivity vs false positives per image (FPI), plus image-level sensitivities to compare lesion detection at image level (I also show the ROC curve for the segmentation network). Prior works compared used DIARET-DB1 [7].
For comparing the various segmentation networks, semantic segmentation evaluation metrics and the dataset IDRID [6] are used, which contains per-pixel classification groundtruths. In that case, the main metrics used are, in order of importance for the characterization of segmentation performance, are mean IoU, mean BFScore and then mean accuracy, weighted IoU and global accuracy. Metrics using the mean take the mean value over all classes (classes are each lesion plus the background), while weighted and global metrics take the average over all pixels directly. The values over all pixels are less informative than mean over classes because more than 90% of all pixels are background, therefore they characterize the quality of segmentation of the background mostly.
Training evolution curves Figure 5 briefly shows plots of validation accuracy curves when training the IDRID images. It is apparent that Simple has more difficulties to converge to high accuracy, DeepLabV3 converged well and with fewer fluctuations than FCN. Table 5 compares lesion-level sensitivities between the segmentation network and the prior work that actually segments the lesions, i.e. [8] [9] and [14]. These  results were obtained using the connected components model of evaluation (described for instance in [4]) with similar conditions as used in the compared works. The sensitivities are measured against the number of false positives per image (FPI), and both should be considered in the analysis of results. Note that we used the approach in [9], where FPIs can differ because they are obtained against the class classification threshold (0 to 1). Table 6 compares image-level detection of lesions for referral, where one can see again that DeepLabV3 ranks first in HA and SE and also ranks well in HE and MA when compared with the alternatives tested. Table 7 compares the segmentation network sensitivities to those of the small-window classifiers reviewed in the related work section. To finalize this first experiment Figure 6 shows the ROC curves of each lesion using DeepLabV3 semantic segmentation network.

1) Comparing segmentation network to prior work
2) Comparison between segmentation networks, improvements and limitations Table 8 shows the comparison between different architectures of the segmentation DCNN networks using IDRiD dataset, 5-fold cross validation (80%/20%). These results were obtained without data augmentation. Table 9 shows the comparison of results with two scenarios: an initial scenario with no data augmentation and a second scenario with data augmentation (DA) that improves the results in general. Table 10 shows the quality of segmentation of each lesion measured as IoU and sensitivity. Figure 7 shows an example visualization of segmentation network results in IDRID, with the groundtruth on the left and the segmentation output on the right. Figure 8 shows an example for the dataset DIRET-DB1, with the labels

Discussion of Results
1) Comparing segmentation network to prior work The comparison of lesion-level sensitivities between the segmentation network and prior work, in particular Quellec et al. [8], Gondal et al [9] and Orlando et al. [14] ( Table 5) reveals that the segmentation network improves the sensitivity for most lesions, i.e. for HA, HE and SE, while for MA the results seem worse than [14] but in line with the results of the remaining prior works compared to. The prior works scored 50% to 72% in HA, 47% to 80% in HE, 71 to 90% in SE and 21% to 61% in MA, the segmentation network scored 87% in HA, 94% in HE, 87.5% in SE and 48% in MA. This means that the deep segmentation network with around 100 layers, Resnet-18 as encoder network and the innovations we discussed previously that include ASPP and CRF proved better than prior works for segmentation of lesions. Note also that segmentation net-works are most often evaluated using Jaccard Index (JI), also known as intersect-over-the-union (IoU), or the Dice coefficient, and we include, at the end of the experimental section, an evaluation using IoU, but in this experiment we compared using exactly the same conditions and evaluation approaches that the prior works use.
The segmentation network was also better in terms of image-level detection of lesions for referral (   [15], but, as we discussed in related work section, except for Haloi [15], The ROC curves in Figure 6 complement the information for the segmentation network. It shows that MA is the hardest lesion to segment, with sensitivity around 58% for FPR 0.1, followed by SE (73% FPR 0.1), HA (85% FPR 0.1), and HE (97% FPR 0.1).

2) Comparison between segmentation networks, improvements and limitations
For the remaining analysis we switched dataset (to IDRID) and evaluation approach, to consider semantic segmentation instead of overlap of coarse segments in the evaluation of the segmentation networks. Using this approach, Table 8 shows that, without any training optimizations, DeepLabV3 [19] and FCN [16] compare favorably with the other segmentation network architectures that we described in the setup (UNET [18], SegNet and Simple) for the task of segmentation of eye fundus lesions. For instance, in terms of weighted IoU, FCN scored 87.5% and DeepLabv3 scored 78.5%, while the remaining ones scored between 46.5% and 56.2%. These results were obtained without data augmentation. Table 9 shows the effects of data augmentation. Looking in particular at weighted IoU, DeepLabv3 with DA improved from 78.5 to 94%, while FCN did not improve.
Even though the segmentation networks outperform prior work, Table 9 and Table 10 also show that there is still a long way to go in terms of optimizing semantic segmentation quality. DeepLabV3 achieves 94% weigthed IoU, but its mean IoU shown in Table 9 is 52%, and the quality of segmentation of each lesion (Table 10) is even lower for some lesions (MA and SE). We noticed that, while background, HA and HE have high scores (e.g. sensitivity of 74% to 99%), MA and SE have much lower scores (31% and 55% respectively). Additionally, significant difference between IoU and sensitivity signals a significant amount of false positives (e.g. background marked as lesions). These results show that segmentation networks could still benefit from further research in future to better deal with FP and also FN.

3) Visualizations and Conclusions from Experiments
The visualizations shown in Figure 7 for IDRID and in Figure 8 for DIRET-DB1 help illustrate the capacity of the segmentation network, since we see that the lesions are reasonably well recognized in those figures, but they also show that there are still many false positives.
Our experimental results succeeded to show that the deep segmentation network improves the quality of lesions segmentation from Eye Fundus Image (EFI) when compared with prior work that also segments those lesions. We have also compared segmentation network architectures, showing that the simplest architectures (e.g. Simple, with 5 simple encoder and decoder layers) were unable to reach the segmentation performance of deeper and more complex networks. Our best results were achieved with DeepLabV3, a deep segmentation network with 100 layers using Resnet-18 as encoder CNN and integrating important innovations that include atrous spatial pooling (ASPP) and conditional random fields (CRF). FCN, another well-known deep segmentation network using VGG-16 also obtained quite reasonable results. Finally, we also revealed some limitations of current state-of-the-art in segmentation of lesions.

Conclusion
In this work, we have shown that most prior works on segmentation of diabetic retinopathy lesions only detect lesions approximately and do not segment them.
We proposed the use of deep semantic segmentation networks to segment the lesions. Our main objective was twofold: to compare the segmentation networks solution with prior works, showing that they improve performance, and to compare between segmentation network architectures, to investigate whether the characteristics of those networks were important to improve the performance, and to see their limitations. We found out that the architecture is very important, achieving top performance the DeepLabV3 design with Resnet-18 as the encoder. Comparing with prior work segmenting EFI lesions, the advantage of the top-performing deep segmentation network reached 10 to 20 percentage points in HA, HE and SE lesions, although it was slightly worse than some of the others segmenting micro-aneurisms. But we also revealed the limitations of current best performing networks. Future work needs to deal with the still existing imperfections, including many false positives.