
Digital pathology is a major revolution in pathology and is changing the clinical routine for pathologists. We work on providing a computer-aided diagnosis (CAD) system that automatically and robustly provides the pathologist with a second opinion for many diagnosis tasks. However, inter-observer variability prevents thorough validation of any proposed technique for any specific problem. In this work, we study the variability and reliability of breast cancer proliferation rate estimation (PRE) from digital pathology images. We also study the robustness of our recently proposed CAD system for PRE. Three statistical significance tests showed that our automated CAD system was as reliable as the expert pathologist in both brown and blue nuclei estimation on a dataset of 100 images.

The development and continued growth of cancerous cells involve various changes at both macro and micro levels of the body. Cell proliferation is usually among the major indicators of the growth of cancerous cells. Specifically, breast cancer proliferation rate estimation (PRE) is a crucial step for determining the cancer level and is used as a prognostic indicator [

Traditionally, pathologists perform proliferation rate estimation for breast cancer by examining the whole slides via a microscope. Over the past two decades, digital pathology enabled the usage of high resolution digitizers to provide high resolution images that replace the microscope as shown in our previous work [

There are many clinically approved techniques to estimate the proliferation rate, including: mitotic index, S-phase fraction, nuclear antigen immunohistochemistry (IHC) including Ki-67 and PCNA staining, cyclins, and PET [

In our work, we use Ki-67-stained biopsy images for PRE. In this technique, PRE is estimated by counting the number of brown nuclei and the number of blue nuclei as shown in

Manual PRE is time-consuming and laborious for pathologists. An expert pathologist requires an average of six minutes per image. Our expert pathologist required over 10 hours to estimate the proliferation rate for our dataset of 100 images. Many authors target automation of PRE including our recent work [

In this paper, we study the statistical inter-pathologist variability of manual PRE among four expert pathologists. Moreover, we investigate the reliability of our proposed automated PRE compared to the opinions of the four pathologists for the 100 images in our dataset.

Manual ground truth estimation is a major area of interest due to the various human factors that influence the experts. Specifically for breast cancer PRE [

We study three statistical significance tests to show the inter-observer variability. Moreover, we study the manual vs automated [

The value of the correlation coefficient

where x and y are two random variables,
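As a minimal sketch of how such a coefficient can be computed, the snippet below assumes the Pearson product-moment correlation, which is the usual choice for paired counts; the text does not name the variant. The function name `pearson_r` and the observer counts are hypothetical and serve only as an illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: the correlation coefficient meant in the text is the
    Pearson product-moment coefficient.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical brown-nuclei counts from two observers on five images.
observer_a = [12, 30, 45, 8, 60]
observer_b = [14, 28, 47, 9, 58]
r = pearson_r(observer_a, observer_b)
```

A value of r close to 1 indicates that the two observers rank and scale the images almost identically; it does not by itself show that their absolute counts agree.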

Student's t-test (t-test for short) is one of a number of hypothesis tests. It uses the t statistic, the t distribution, and the degrees of freedom to determine a significance probability (p-value) that indicates whether the underlying distributions of the two random variables differ, as shown in Equation (2):

where n_{T} and n_{C} are the numbers of samples in the two groups. Moreover, the degrees of freedom (df) for the test should be determined. In the t-test, the degrees of freedom equal the total number of samples in both groups minus 2. Given the alpha level, the df, and the t value, the significance can be looked up in a standard table of significance. Typically, when p > 0.05, the difference between the two random variables (the two underlying data samples) is said to be statistically insignificant, i.e., there is no evidence that the two samples differ [
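The following sketch illustrates the two-sample t statistic and its degrees of freedom. It assumes one common form of the statistic, the difference of group means divided by the square root of var_T/n_T + var_C/n_C, with df = n_T + n_C − 2 as stated above; the exact form of Equation (2) may differ. The function name `two_sample_t` is hypothetical.

```python
import math


def two_sample_t(sample_t, sample_c):
    """Return (t, df) for two independent samples.

    Assumption: t = (mean_T - mean_C) / sqrt(var_T / n_T + var_C / n_C),
    using unbiased sample variances, and df = n_T + n_C - 2.
    """
    n_t, n_c = len(sample_t), len(sample_c)
    if n_t < 2 or n_c < 2:
        raise ValueError("each sample needs at least 2 values")
    mean_t = sum(sample_t) / n_t
    mean_c = sum(sample_c) / n_c
    var_t = sum((v - mean_t) ** 2 for v in sample_t) / (n_t - 1)
    var_c = sum((v - mean_c) ** 2 for v in sample_c) / (n_c - 1)
    std_err = math.sqrt(var_t / n_t + var_c / n_c)
    if std_err == 0:
        raise ValueError("t is undefined when both samples are constant")
    return (mean_t - mean_c) / std_err, n_t + n_c - 2


# Hypothetical counts for two small groups.
t_value, df = two_sample_t([1, 2, 3], [2, 3, 4])
```

The p-value is then obtained from the t distribution with `df` degrees of freedom, for example from a printed table or a statistics library.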

Chi-square (X^{2}) is a statistical test commonly used to compare observed data with the data expected under a specific hypothesis, as in Equation (3):

where O_{ij} is the observed frequency in the i^{th} row and j^{th} column, E_{ij} is the expected frequency in the i^{th} row and j^{th} column, r is the number of rows and c is the number of columns. The number of degrees of freedom (df) is (r − 1) multiplied by (c − 1). If X^{2} is greater than the critical value, then the two samples are dependent.
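The chi-square statistic described above can be sketched as follows. The expected frequency of each cell is taken as (row total × column total) / grand total, which is the standard choice for a test of independence and is an assumption here, and the degrees of freedom use the standard (r − 1)(c − 1). The function name `chi_square` and the example table are hypothetical.

```python
def chi_square(observed):
    """Return (statistic, df) for an r x c table of observed frequencies.

    Assumption: expected frequencies come from the row and column totals,
    E_ij = row_total_i * col_total_j / grand_total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / grand_total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 table of frequencies.
stat, df = chi_square([[10, 20], [20, 10]])
```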

Our dataset contains 100 Ki-67-stained histopathology digital images for breast cancer. The blue nuclei are the negative cells while the brown nuclei are the positive ones. Our collaborating pathologist provided us with the ground truth from four different pathologists including herself as the most senior pathologist. We provided each pathologist with anonymized images labeled in sequence along with a sheet to score the blue and brown nuclei. None of the four pathologists knew about the others and they scored independently. Our most senior pathologist (coauthor) spent over 10 hours scoring the 100 cases, which means an average of 6 minutes per case. Moreover, we ran our automated PRE system proposed in [

The inter-observer reproducibility is first measured by using the correlation coefficient [

From

Expert observers | Correlation coefficient (brown count) | Correlation coefficient (blue count) |
---|---|---|
Observer 1 vs Observer 2 | 0.987 | 0.969 |
Observer 1 vs Observer 3 | 0.953 | 0.807 |
Observer 1 vs Observer 4 | 0.965 | 0.806 |
Observer 2 vs Observer 3 | 0.959 | 0.968 |
Observer 2 vs Observer 4 | 0.965 | 0.806 |
Observer 3 vs Observer 4 | 0.977 | 0.973 |

observer 1 vs observer 2 and observer 3 vs observer 4, respectively.

On the other hand, we study the correlation coefficients between the manual estimates of each of the four experts and our proposed automated system as shown in

We performed a two-tailed paired t-test on all the pairs among the four observers and the automated system. Our null hypothesis is that there is a difference between the observers on the one hand and the automated system on the other. All of the reported significance probability values in

We computed the chi-square test for all pairs and compared each result with the critical chi-square value for df = 1 at the 99% confidence level (significance level = 1 − 0.99 = 0.01). In all pairs (including inter-observer and our automated method), the calculated chi-square value is greater than the critical value, which means that each pair of samples is dependent. In other words, it is statistically reliable to consider any of the expert scoring or the automated scoring values.
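The decision rule for df = 1 can be sketched as below. The critical value 6.635 is the standard tabulated chi-square value for df = 1 at significance level 0.01, and the p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable. The function and constant names are hypothetical.

```python
import math

# Tabulated chi-square critical value for df = 1 at significance level 0.01.
CRITICAL_DF1_ALPHA_001 = 6.635


def chi_square_df1_p_value(statistic):
    """Upper-tail p-value of a chi-square statistic with df = 1.

    For df = 1 the statistic is a squared standard normal variable,
    so P(X^2 > s) = erfc(sqrt(s / 2)).
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))


def is_dependent(statistic, critical=CRITICAL_DF1_ALPHA_001):
    """Apply the rule from the text: dependent if X^2 exceeds the critical value."""
    return statistic > critical
```

For example, `is_dependent(10.0)` is true while `is_dependent(3.0)` is false, and the p-value at the critical value itself is approximately 0.01.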

Expert observers | Correlation coefficient (brown count) | Correlation coefficient (blue count) |
---|---|---|
Observer 1 vs automated | 0.974 | 0.847 |
Observer 2 vs automated | 0.984 | 0.886 |
Observer 3 vs automated | 0.959 | 0.661 |
Observer 4 vs automated | 0.963 | 0.686 |

Expert observers | p (brown count) | p (blue count) |
---|---|---|
Observer 1 vs observer 2 | 8.14 × 10^{−5} | 7.2 × 10^{−6} |
Observer 1 vs observer 3 | 1.5 × 10^{−4} | 1.5 × 10^{−15} |
Observer 1 vs observer 4 | 8.7 × 10^{−5} | 2.2 × 10^{−15} |
Observer 2 vs observer 3 | 8.9 × 10^{−9} | 1.4 × 10^{−11} |
Observer 2 vs observer 4 | 4.3 × 10^{−9} | 7.1 × 10^{−12} |
Observer 3 vs automated | 2.9 × 10^{−7} | 6.9 × 10^{−11} |
Observer 4 vs automated | 4.0 × 10^{−7} | 1.5 × 10^{−11} |
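The p-values of a paired comparison such as the one tabulated above derive from a paired t statistic. A minimal sketch, assuming the standard definition (mean of the per-image differences divided by its standard error, with n − 1 degrees of freedom), is given below; the function name `paired_t` and the counts are hypothetical.

```python
import math


def paired_t(x, y):
    """Return (t, df) for paired samples x and y.

    Standard paired t statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d_i = x_i - y_i, with df = n - 1.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("t is undefined when all differences are equal")
    return mean_d / math.sqrt(var_d / n), n - 1


# Hypothetical counts from one observer and an automated system on four images.
t_value, df = paired_t([1, 2, 3, 4], [2, 2, 5, 5])
```

Converting `t_value` into a two-tailed p-value requires the t distribution with `df` degrees of freedom; a library routine such as SciPy's `scipy.stats.ttest_rel` performs both steps.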

Significance probability p-value | Interpretation |
---|---|
p < 0.01 | Strong evidence to reject H_{0} |
0.01 < p ≤ 0.05 | Significant evidence to reject H_{0} |
0.05 < p ≤ 0.10 | Weak evidence against H_{0} |
p > 0.10 | Insignificant evidence to reject H_{0} |
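The interpretation scale above can be expressed as a small lookup function. This is only a sketch: the function name `interpret_p_value` is hypothetical, and the boundary case p = 0.01, which the scale leaves unassigned, is placed in the "significant" band as an assumption.

```python
def interpret_p_value(p):
    """Map a p-value to the interpretation bands of the scale above.

    Assumption: p == 0.01 is not covered by the scale and is treated
    here as belonging to the 0.01 < p <= 0.05 band.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p < 0.01:
        return "Strong evidence to reject H0"
    if p <= 0.05:
        return "Significant evidence to reject H0"
    if p <= 0.10:
        return "Weak evidence against H0"
    return "Insignificant evidence to reject H0"
```

For instance, the reported value 8.14 × 10^{−5} falls in the first band.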

We presented a detailed statistical study of breast cancer proliferation rate estimation. We studied the inter-observer variability between four expert pathologists on a set of 100 cases. We also studied the reliability of our recently proposed automated PRE system. On the 100 cases, we found that the variability of brown nuclei estimation was statistically insignificant between the pathologists. We also found that the brown nuclei estimation of our proposed system was statistically reliable. Likewise, our three statistical significance tests showed fairly high reliability between pathologists for blue nuclei estimation. The same conclusion applies to the blue nuclei estimation of our proposed automated system.