Parametric and Non-Parametric Survival Analysis of Patients with Acute Myeloid Leukemia (AML)

Background: Acute Myeloid leukemia (AML) is the most prominent acute leukemia in adults. In the United States, we experience over 20,000 cases per year. Over the past decade, improvements in the diagnosis of subtypes of AML and advances in therapeutic approaches have improved the outlook for patients with AML. However, despite these advancements, the survival rate among patients who are less than 65 years of age is only 40 percent. Purpose: The purpose of the paper is to study if there exists any significant difference in the survival probabilities of male and female AML patients. Also, we want to investigate if there is any parametric probability distribution that best fits the male and female patient survival and compare the survival probabilities with the non-parametric Kaplan-Meier (KM) method. Methods: We used both parametric and non-parametric statistical methods to perform the survival analysis to assess the survival probabilities of 2015 patients diagnosed with AML. Results: We found evidence of a statistically significant difference between the mean survival time of male and female patients diagnosed with AML. We performed parametric survival analysis and found a Generalized Extreme Value (GEV) distribution best fitting the data of the survival time for male and female patients. We then estimated the survival probabilities and compared them with the frequently used non-parametric Kaplan-Meier (KM) survival method. Conclusion: The comparison between the survival probability estimates of the two methods revealed a better survival probability estimate by the parametric method than the Kaplan-Meier. We also compared the median survival time of male and female patients individually with descriptive, parametric, and non-parametric methods of analysis. The parametric survival analysis is more robust and efficient because it is based on a well-defined parametric probabilistic distribution, hence preferred over the non-parametric Kaplan-Meier estimate. 
This study offers therapeutic significance for further enhancement to treat patients with Acute Myeloid Leukemia.

How to cite this paper: Chakraborty, A. and Tsokos, C.P. (2021) Parametric and Non-Parametric Survival Analysis of Patients with Acute Myeloid Leukemia (AML). Open Journal of Applied Sciences, 11, 126-148. https://doi.org/10.4236/ojapps.2021.111009

Received: December 12, 2020; Accepted: January 26, 2021; Published: January 29, 2021

Copyright © 2021 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access


Introduction
Leukemias are certain types of cancers that start in the cells that naturally develop into different types of blood cells. Most commonly, leukemia starts in early forms of white blood cells, but there are some leukemias that grow in other blood cells. There are different types of leukemia that are divided primarily based on whether the leukemia is acute (rapidly growing) or chronic (slower growing), and whether it starts in myeloid cells or lymphoid cells. Acute myeloid leukemia (AML) develops in the bone marrow (the soft inner part of certain bones, where new blood cells are formed), but most often, it rapidly moves into the blood, as well. It can sometimes spread to other organs that include the lymph nodes, liver, spleen, central nervous system (brain and spinal cord), and testicles. Acute myeloid leukemia (AML) [1] is the most common acute leukemia in adults, accounting for almost 80 percent of the cases in this group. Within the United States, the incidence of AML ranges from three to five cases per 100,000 population. In 2015 alone, an estimated 20,830 new cases were diagnosed, and over 10,000 patients died from this disease. To realize how and why leukemia affects a patient, it is essential to understand how blood cells are made.
The body manufactures blood cells in the bone marrow (the soft inner part of bones). The blood cells are produced in a controlled way, as the body needs them. Every blood cell starts as the same type of cell called a stem cell. This stem cell then develops into the following.
• Myeloid stem cells, which become white blood cells called monocytes and neutrophils (granulocytes), red blood cells, and platelets.
• Lymphoid stem cells, which become white blood cells called lymphocytes.
In the case of acute myeloid leukemia, the bone marrow produces an excess of monocytes or granulocytes. These cells are often not fully developed and are not able to function normally. Figure 2, obtained from [2], illustrates a possible path of development of AML from a stem cell.
In the United States, AML increases progressively with age, to a peak of 12.6 per 100,000 adults 65 years of age or older [3]. Until the 1970s, the diagnosis was based solely on the pathological and cytologic examination of bone marrow and blood.
The five-year survival rate during this period was less than 15 percent. Over the past decade, improvements in the diagnosis of subtypes of AML and advances in therapeutic approaches have improved the outlook for patients with AML. However, despite these advancements, the survival rate among patients who are less than 65 years of age is only 40 percent. Although AML remains incurable in most cases, most research into AML has concentrated on how to improve the survival time of patients diagnosed with the disease. The Kaplan-Meier (KM) method has been widely used for analyzing cancer survivorship data in recent times because of its simplicity. It is often used to compare the survival of several groups of patients based on the log-rank test of the null hypothesis that there is no significant difference among the groups. Our study presents a parametric and non-parametric survival analysis of the survival time of patients diagnosed with AML. We believe that finding the probability distribution that characterizes the probabilistic behavior of the survival time is essential so that we can proceed to obtain the survival function that is driven by the given data. Such an analysis is more powerful than the non-parametric approach when the distribution is correctly identified. Feigl and Zelen [4] have pointed out that assuming an exponential distribution works well for studying the survival of cancer-related cases [5] [6]. However, assuming such a probability distribution without justification can lead to misleading results and incorrect decisions. Thus, it is important to identify the probability distribution of the survival time within each group under study (for example, male/female or age < 65/age > 65). In the present study, we identify the probability distribution that best fits the survival time and proceed to obtain the survival function.
We also compare our results with the commonly used Kaplan-Meier (KM) method. The structure of the paper will be as follows: In Section 2, we provide the data discussion and perform the log-rank test [7] [8]. In Section 3, we discuss in detail the parametric survival analysis of male and female AML patients. Section 4 talks about the KM estimate and compares the median survival time of male and female patients using the descriptive, parametric, and non-parametric methods. In Section 5, we compare the survival probability estimates of male and female patients using parametric GEV distribution and non-parametric KM estimate. Section 6 and Section 7 provide results & discussion, and conclusion, respectively.

Data Description
The data for our study were extracted from the Surveillance, Epidemiology and End Results (SEER) database. The data contain information on patients diagnosed with AML from 2004 to 2015. We are concerned with the survival time (in months) and cause-specific death (deaths due to AML cancer) for each patient. The survival time of patients is one of the most crucial factors used in all cancer research. It is necessary for evaluating the severity of cancer, which helps to decide the prognosis and identify the correct treatment methods. We considered a random sample of 2015 patients diagnosed with Acute Myeloid Leukemia (AML), which accounts for almost 80% of acute leukemia cases [9]. A schematic diagram of the data used in this study, with additional details, is shown in Figure 3. As the schematic diagram illustrates, our dataset contains survival-time information on 1103 male and 912 female patients diagnosed with AML.
Before performing the parametric analysis of the survival time of patients with AML, we need to investigate whether the survival time differs by gender, i.e., between male and female patients. For this purpose, we use the log-rank test with the following two hypotheses.
$H_0$: There is no difference between the survival distributions of male and female patients diagnosed with AML.
$H_1$: There is a difference between the survival distributions of male and female patients diagnosed with AML.
The log-rank test produced a p-value of 0.011 (<0.05), implying that there is sufficient sample evidence to reject $H_0$, which means the distributions of survival time of male and female patients diagnosed with AML are significantly different. Figure 4 illustrates the behavior of the survival curves of male and female patients. The male and female survival curves are highlighted in blue and yellow, respectively.
As Figure 4 illustrates, the survival curve of males (blue) lies below the survival curve of females (yellow), which means males have lower survival compared to females diagnosed with AML. In the following section, we describe the parametric analysis of survival time for both genders.
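The log-rank comparison described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the software used in the study: the function name `logrank_test`, the event coding (1 = death observed, 0 = censored), and the toy survival times are all hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one death was observed.
    for t in sorted({t for t, e, _ in data if e == 1}):
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        d_a = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        # Observed deaths in group A minus the number expected under H0.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical survival times in months (all deaths observed), not SEER data.
chi2, p = logrank_test([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8,
                       [20, 21, 22, 23, 24, 25, 26, 27], [1] * 8)
print(f"chi-square = {chi2:.2f}, p-value = {p:.5f}")
```

A p-value below the chosen significance level (0.05 in our analysis) leads to rejecting the hypothesis of equal survival distributions.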

Descriptive Statistics of the Survival Time of AML Patients
We plotted the histogram and probability density function (pdf) to investigate the distribution of the survival time of male and female patients, as shown in Figure 5 and Figure 6. We can see that the probability distribution of the survival time of AML for both males and females is right skewed. [...] 17.61 months on average. Also, the median survival times for male and female patients are four months and five months, respectively, which implies that the probability of a male or female AML patient surviving beyond 4 and 5 months, respectively, is approximately 50%. A negative skewness value implies that the data distribution is left (negatively) skewed, and a positive value suggests that the data are right (positively) skewed. Thus, the positive skewness values of 2.2 and 1.98 for male and female patients, respectively, shown in Table 1, are further evidence of the right-skewed behavior of the data seen in Figure 5 and Figure 6. Kurtosis, interpreted here relative to the normal distribution (excess kurtosis), supports the assessment of the extreme values of the data: a positive value indicates leptokurtic behavior of the distribution, whereas a negative value indicates platykurtic behavior. Thus, the kurtosis values of 4.21 and 2.99 for males and females, respectively, in Table 1 attest to the leptokurtic behavior of the AML survival time data.
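To make the interpretation of skewness and kurtosis concrete, the following short Python sketch computes both from the sample central moments. It is only an illustration: the function name and the toy data are hypothetical, and statistical packages often apply small-sample corrections, so the values in Table 1 need not have been obtained with exactly these formulas.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (kurtosis minus 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical right-skewed survival times in months (not the SEER sample).
toy_months = [1, 1, 2, 2, 3, 4, 5, 8, 15, 40]
skew, kurt = skewness_and_excess_kurtosis(toy_months)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A positive skewness together with a positive excess kurtosis corresponds to the right-skewed, leptokurtic pattern described above.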

Generalized Extreme Value (GEV) Probability Estimation of the Survival Time of Patients with AML
We perform a parametric analysis of the survival time of patients diagnosed with AML to identify the underlying probability distribution that characterizes the probabilistic behavior of the survival time of AML patients (both genders). In the attempt to obtain the best-fitting probability distribution, a number of classical distributions were fitted to the subject data. Three commonly used goodness-of-fit (GOF) tests [10] [11], the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Chi-Square test, were used to identify the probability distribution that best characterizes the survival time of male and female patients; we chose these tests because they are widely used non-parametric GOF tests. We also estimate the expected survival time and median survival time under the identified probability distribution. The best-fitting probability distribution for the survival time of both male and female patients is the Generalized Extreme Value (GEV) distribution. Table 2 shows the GOF results for the GEV distribution.
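As an illustration of how one of these tests works, the following minimal Python sketch computes the Kolmogorov-Smirnov statistic, i.e., the largest gap between the empirical distribution function of a sample and a candidate cdf. The function name `ks_statistic` and the toy uniform example are our own assumptions for illustration and are not taken from the study.

```python
def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between a sample and a candidate cdf."""
    x = sorted(sample)
    n = len(x)
    distance = 0.0
    for j, value in enumerate(x, start=1):
        f = cdf(value)
        # The empirical cdf jumps from (j - 1)/n to j/n at the j-th order statistic.
        distance = max(distance, j / n - f, f - (j - 1) / n)
    return distance


# Toy check against a Uniform(0, 1) cdf (hypothetical data, not the SEER sample).
toy = [(j - 0.5) / 10 for j in range(1, 11)]
print(ks_statistic(toy, lambda u: min(max(u, 0.0), 1.0)))
```

A small value of the statistic indicates close agreement between the data and the candidate distribution. For a fully specified distribution and a large sample of size n, the approximate 5% critical value is 1.36/√n; when the parameters are estimated from the same data, as in a fitting exercise, this threshold should be treated only as a rough guide.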
The results in Table 2 show that we fail to reject the null hypothesis that the subject data (survival time for males and females) follow a GEV distribution. In this section, we define the probability density function (pdf) of the GEV distribution and the statistical approach used to estimate its parameters. In probability theory and statistics, the GEV distribution is a family of continuous probability distributions developed within extreme value theory [12]. It combines three probability distribution families, namely Gumbel, Fréchet, and Weibull, also known as type I, II, and III extreme value distributions. The GEV distribution was first introduced by Jenkinson [13]; however, in some fields of application, it is known as the Fisher-Tippett distribution [14], named after Ronald Fisher and L. H. C. Tippett, who recognized the three different forms of the distribution. Let T be a random variable following the GEV distribution with location parameter $\xi$, scale parameter $\alpha > 0$, and shape parameter $k$. That is,
$$T \sim \mathrm{GEV}(\xi, \alpha, k).$$
Then, the probability density function (pdf) is given as follows:
$$f(t) = \frac{1}{\alpha}\left[1 - \frac{k(t-\xi)}{\alpha}\right]^{1/k - 1}\exp\left\{-\left[1 - \frac{k(t-\xi)}{\alpha}\right]^{1/k}\right\}, \quad k \neq 0. \qquad (1)$$
The corresponding cumulative distribution function (cdf) is given as follows:
$$F(t) = \exp\left\{-\left[1 - \frac{k(t-\xi)}{\alpha}\right]^{1/k}\right\}, \quad k \neq 0, \qquad (2)$$
for values of t satisfying $1 - k(t-\xi)/\alpha > 0$; in the limit $k \to 0$ the cdf reduces to the Gumbel form $F(t) = \exp\{-\exp[-(t-\xi)/\alpha]\}$. There are several methods to estimate the parameters $\xi$, $\alpha$, and $k$ of the GEV distribution. Some of these methods include Jenkinson's (1969) method of sextiles and the method of maximum likelihood (Jenkinson 1969; Prescott and Walden 1980, 1983). Neither of these methods is completely satisfactory [15] [16].
We use the Probability-Weighted Moments (PWM) method [15], introduced by Greenwood et al. (1979), which generalizes the method of moments, to estimate the set of parameters.
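As an illustration, the GEV pdf and cdf can be evaluated numerically as in the sketch below. It assumes the parameterization of Hosking et al. [15], in which the cdf is exp{-[1 - k(t - ξ)/α]^(1/k)}; many texts and software packages use a shape parameter with the opposite sign, so the sign convention should be checked before reusing parameter estimates. The function names and the parameter values in the usage line are hypothetical.

```python
import math


def gev_cdf(t, xi, alpha, k):
    """GEV cdf F(t) with location xi, scale alpha > 0 and shape k."""
    if k == 0.0:
        return math.exp(-math.exp(-(t - xi) / alpha))  # Gumbel limit
    z = 1.0 - k * (t - xi) / alpha
    if z <= 0.0:
        # Outside the support: below the lower bound (k < 0) or above the upper bound (k > 0).
        return 0.0 if k < 0 else 1.0
    return math.exp(-z ** (1.0 / k))


def gev_pdf(t, xi, alpha, k):
    """GEV pdf f(t), the derivative of gev_cdf with respect to t."""
    if k == 0.0:
        y = (t - xi) / alpha
        return math.exp(-y - math.exp(-y)) / alpha
    z = 1.0 - k * (t - xi) / alpha
    if z <= 0.0:
        return 0.0
    return z ** (1.0 / k - 1.0) * math.exp(-z ** (1.0 / k)) / alpha


# Hypothetical parameter values chosen only to exercise the functions.
print(gev_cdf(20.0, 5.0, 8.0, -0.2), gev_pdf(20.0, 5.0, 8.0, -0.2))
```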

Parameter Estimation of GEV Distribution Using the Method of Probability Weighted Moments (PWM)
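In general, the probability-weighted moments of a random variable X with cumulative distribution function F are defined as
$$M_{p,r,s} = E\left[X^{p}\{F(X)\}^{r}\{1 - F(X)\}^{s}\right], \qquad (3)$$
where p, r, and s are real numbers. Probability-weighted moments [16] [17] are most useful when the inverse distribution function $x(F)$ can be written in closed form, in which case
$$M_{p,r,s} = \int_0^1 \{x(F)\}^{p} F^{r} (1-F)^{s}\, dF. \qquad (4)$$
The two special cases of $M_{p,r,s}$ are
$$\alpha_s = M_{1,0,s} = E\left[X\{1 - F(X)\}^{s}\right], \qquad \beta_r = M_{1,r,0} = E\left[X\{F(X)\}^{r}\right]. \qquad (5)$$
To estimate the parameters of the GEV distribution, we use $\beta_r$ from (5), following the approach of Hosking et al. [15]. Given a random sample of size n from the cdf F, the estimate of $\beta_r$ is based on the ordered sample $x_1 \le x_2 \le \dots \le x_n$. The statistic
$$b_r = \frac{1}{n}\sum_{j=1}^{n} \frac{(j-1)(j-2)\cdots(j-r)}{(n-1)(n-2)\cdots(n-r)}\, x_j \qquad (6)$$
is an unbiased estimator of $\beta_r$ and will be used to estimate it.
Instead of $b_r$, one might use the estimate
$$\hat\beta_r[p_{j,n}] = \frac{1}{n}\sum_{j=1}^{n} p_{j,n}^{\,r}\, x_j,$$
where $p_{j,n}$ is a plotting position, that is, a distribution-free estimate of $F(x_j)$. From (2) we can solve for X to obtain the inverse cdf, $x(F)$. The inverse distribution function is given by
$$x(F) = \xi + \frac{\alpha}{k}\left\{1 - (-\log F)^{k}\right\}, \quad k \neq 0. \qquad (7)$$
Now we proceed to derive the analytical form of $\beta_r$ for the GEV distribution using expressions (4) and (7). From (5), we have
$$\beta_r = \int_0^1 x(F)\, F^{r}\, dF = \int_0^1 \left[\xi + \frac{\alpha}{k}\left\{1 - (-\log F)^{k}\right\}\right] F^{r}\, dF. \qquad (8)$$
Thus, for $k \neq 0$, the probability-weighted moments of the GEV distribution are given by
$$\beta_r = \frac{1}{r+1}\left[\xi + \frac{\alpha}{k}\left\{1 - (r+1)^{-k}\,\Gamma(1+k)\right\}\right], \quad k > -1. \qquad (9)$$
Setting r = 0, 1, and 2 in (9), we can obtain explicit expressions of $\beta_0$, $\beta_1$, and $\beta_2$ in terms of $\xi$, $\alpha$, and $k$. That is,
$$\beta_0 = \xi + \frac{\alpha}{k}\left\{1 - \Gamma(1+k)\right\}, \qquad (10)$$
$$2\beta_1 - \beta_0 = \frac{\alpha}{k}\,\Gamma(1+k)\left(1 - 2^{-k}\right), \qquad (11)$$
and
$$\frac{3\beta_2 - \beta_0}{2\beta_1 - \beta_0} = \frac{1 - 3^{-k}}{1 - 2^{-k}}. \qquad (12)$$
The PWM estimates of the parameters, $(\hat\xi, \hat\alpha, \hat k)$, are the solutions of (10)-(12) for $\xi$, $\alpha$, and $k$ when the $\beta_r$ are replaced by their estimators $b_r$. The exact solution requires iterative methods. Hosking et al. [15] proposed the approximate estimator
$$\hat k = 7.8590\,c + 2.9554\,c^{2}, \qquad c = \frac{2b_1 - b_0}{3b_2 - b_0} - \frac{\log 2}{\log 3}.$$
Once we have obtained $\hat k$, the scale and location parameters $\alpha$ and $\xi$ can be estimated successively from Equations (11) and (10), that is,
$$\hat\alpha = \frac{(2b_1 - b_0)\,\hat k}{\Gamma(1+\hat k)\left(1 - 2^{-\hat k}\right)}, \qquad \hat\xi = b_0 + \frac{\hat\alpha}{\hat k}\left\{\Gamma(1+\hat k) - 1\right\}.$$
Table 3 shows the approximate parameter estimates of the GEV distribution for male and female survival time.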
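The PWM estimation steps can be sketched in Python as follows. The sketch assumes the unbiased sample moments b_r and the closed-form approximation of Hosking et al. [15] for the shape parameter (k ≈ 7.8590c + 2.9554c²) instead of an iterative solution; the function names and the synthetic sample are hypothetical, and we do not claim that this reproduces the estimates in Table 3.

```python
import math


def sample_pwm(sorted_x, r):
    """Unbiased estimator b_r of beta_r = E[X F(X)^r] from an ordered sample."""
    n = len(sorted_x)
    total = 0.0
    for j, x in enumerate(sorted_x, start=1):
        weight = 1.0
        for i in range(1, r + 1):
            weight *= (j - i) / (n - i)
        total += weight * x
    return total / n


def fit_gev_pwm(sample):
    """Approximate PWM estimates (xi, alpha, k) of the GEV parameters."""
    x = sorted(sample)
    b0, b1, b2 = (sample_pwm(x, r) for r in range(3))
    c = (2.0 * b1 - b0) / (3.0 * b2 - b0) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c * c  # undefined downstream if k is exactly 0
    gamma_term = math.gamma(1.0 + k)
    alpha = (2.0 * b1 - b0) * k / (gamma_term * (1.0 - 2.0 ** (-k)))
    xi = b0 + alpha * (gamma_term - 1.0) / k
    return xi, alpha, k


# Synthetic sample: GEV quantiles at evenly spaced probabilities, generated
# with hypothetical parameters xi = 5, alpha = 8, k = -0.2.
n = 5000
synthetic = [5.0 + 8.0 * (1.0 - (-math.log((j + 0.5) / n)) ** (-0.2)) / (-0.2)
             for j in range(n)]
print(fit_gev_pwm(synthetic))
```

On this synthetic sample the estimates should land close to the generating values, which gives a quick sanity check of the formulas before applying them to real survival times.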
Substituting the parameter estimates $\hat\xi$, $\hat\alpha$, and $\hat k$ from Table 3 into the GEV pdf gives the analytical form of the pdf for male survival time and, similarly, for female survival time. These probability density functions characterize the probabilistic behavior of the survival time of male and female patients with AML cancer.
We now proceed to calculate the expected survival time of male and female patients. Using the estimates given in Table 3, we can find the expectation and median survival time for both male and female patients. For a survival time T that follows $\mathrm{GEV}(\xi, \alpha, k)$, the expectation is given by
$$E(T) = \xi + \frac{\alpha}{k}\left\{1 - \Gamma(1+k)\right\}, \quad k > -1,\ k \neq 0,$$
and the median by
$$\mathrm{Median}(T) = \xi + \frac{\alpha}{k}\left\{1 - (\log 2)^{k}\right\}, \quad k \neq 0.$$
Substituting the estimates for each gender into the GEV cdf gives
$$\hat F_g(t) = \exp\left\{-\left[1 - \frac{\hat k_g\,(t - \hat\xi_g)}{\hat\alpha_g}\right]^{1/\hat k_g}\right\}, \quad g \in \{M, F\},$$
which yields the analytical form of the GEV cdf for male survival time, Equation (20), and, similarly, for female survival time, Equation (21), once the values in Table 3 are inserted.
Figure 7 and Figure 8 illustrate the cdf plots of the male and female survival time.
As the figures illustrate, the cdf plots are very helpful for estimating the probability that a male or female patient diagnosed with AML will not survive beyond a particular point in time. For example, from Figure 7, the probability that the survival time of a male patient is at most t = 20 months is approximately 0.8. This probability is slightly lower for a randomly selected female patient, as is evident from Figure 8. In the next section, we present the parametric survival analysis of the survival time of male and female AML patients, which is one of the most important aspects of this study.
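The expected and median survival times implied by a fitted GEV distribution can be computed directly from its parameters, as in the following sketch. It assumes the parameterization of Hosking et al. [15], for which the quantile function is x(F) = ξ + α{1 - (-log F)^k}/k and the mean is ξ + α{1 - Γ(1 + k)}/k for k > -1. The function names and the parameter values in the usage lines are hypothetical and are not the estimates of Table 3.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, used when k = 0


def gev_quantile(p, xi, alpha, k):
    """Inverse cdf x(F) of the GEV distribution for 0 < p < 1."""
    if k == 0.0:
        return xi - alpha * math.log(-math.log(p))
    return xi + alpha * (1.0 - (-math.log(p)) ** k) / k


def gev_median(xi, alpha, k):
    """Median survival time: the quantile at probability 0.5."""
    return gev_quantile(0.5, xi, alpha, k)


def gev_mean(xi, alpha, k):
    """Expected survival time; finite only when k > -1."""
    if k <= -1.0:
        raise ValueError("the mean does not exist for k <= -1")
    if k == 0.0:
        return xi + alpha * EULER_GAMMA
    return xi + alpha * (1.0 - math.gamma(1.0 + k)) / k


# Hypothetical parameters, chosen only for illustration.
print("median:", gev_median(5.0, 8.0, -0.2))
print("mean:  ", gev_mean(5.0, 8.0, -0.2))
```

For a right-skewed fit such as this one, the mean exceeds the median, which matches the pattern seen in the descriptive statistics.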

Parametric Survival Analysis
Estimation of a parametric survival function is a process to evaluate the survival probabilities of male and female AML patients as a function of the survival time.
Having determined the cdf of the survival time for male and female patients diagnosed with AML in Equations (20) and (21), we can proceed to estimate the survival function of male and female AML patients.
Thus, the parametric survival function of male patients diagnosed with AML is given by
$$S_{PM}(t) = P(T > t) = 1 - F(t), \qquad (22)$$
where F is the fitted GEV cdf of the male survival time in Equation (20). The survival function can be used to estimate the probability that a male patient diagnosed with AML would survive beyond time t, which is denoted by $P(T > t)$. For example, we can compute the probability that a male patient diagnosed with AML would survive beyond 20 months. That is, for t = 20 in Equation (22), we estimate the probability as 0.2, which agrees with the cdf value of approximately 0.8 read from Figure 7. Thus, we can infer that a randomly chosen male AML patient has a 20% chance of survival beyond 20 months. Figure 9 shows the parametric survival plot for male AML patients generated using the GEV distribution and confirms that the survival probability at t = 20 months is approximately 0.2 for a male AML patient. As expected, the survival function decreases with time and is approximately zero beyond time t = 100.
The parametric survival function of female patients, Equation (23), is obtained in the same way from the female cdf in Equation (21). From this survival function, we can compute the probability that a female patient diagnosed with AML would survive beyond 20 months. By inserting t = 20 in Equation (23), we compute a probability of approximately 0.25, which is greater than the survival probability of a male AML patient. Thus, we can infer that a randomly chosen female AML patient has an approximately 25% chance of survival beyond 20 months. Figure 10 shows the parametric survival plot for female AML patients generated using the GEV distribution and confirms that the survival probability at t = 20 months is approximately 0.25 for a female patient diagnosed with AML. Thus, a randomly chosen female AML patient has better survival than a male patient diagnosed with AML. In the next section, we briefly discuss the non-parametric Kaplan-Meier survival function for AML cancer.
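The calculation of a parametric survival probability at a given time can be sketched as follows. The sketch assumes the GEV parameterization of Hosking et al. [15] with a non-zero shape parameter; the function name and the two parameter sets are hypothetical placeholders, not the fitted values of Table 3, so the printed numbers are not the study's estimates.

```python
import math


def gev_survival(t, xi, alpha, k):
    """Parametric survival function S(t) = P(T > t) = 1 - F(t), for k != 0."""
    z = 1.0 - k * (t - xi) / alpha
    if z <= 0.0:
        # Below the lower bound (k < 0) everyone survives; above the upper
        # bound (k > 0) no one does.
        return 1.0 if k < 0 else 0.0
    # -expm1(-x) evaluates 1 - exp(-x) accurately when x is small.
    return -math.expm1(-z ** (1.0 / k))


# Hypothetical parameter sets for two groups (placeholders for illustration).
group_a = (5.0, 8.0, -0.2)
group_b = (6.0, 9.0, -0.2)
for label, params in (("group A", group_a), ("group B", group_b)):
    print(label, "S(20) =", round(gev_survival(20.0, *params), 3))
```

Comparing the two printed values at the same time point mirrors the male versus female comparison at t = 20 months made above.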

Kaplan-Meier Estimation of Survival Probability of the Survival Time of Patients with AML
Figure 10. Parametric survival plot of female AML patients.
The most frequently used parametric estimation methods for distributions of lifetimes are probably the fitting of a normal probability distribution to the observations or their logarithms by calculating the mean and variance, and the fitting of an exponential distribution by estimating the mean alone. Such assumptions about the form of the distribution are naturally advantageous insofar as they are correct; the estimates are simple and relatively efficient, and a complete distribution is obtained even though the observations may be restricted in range. However, non-parametric estimates have the important functions of suggesting or confirming such assumptions and of supplying the estimate itself in case suitable parametric assumptions are not known. The Kaplan-Meier (KM) estimator [18] [19], also known as the product-limit estimator, is a non-parametric statistic used to estimate the survival function from data related to survival time. In health science, it is generally used to measure the fraction of patients living for a certain amount of time after treatment. It was developed by Edward L. Kaplan and Paul Meier (1958). It is defined as the product, over the failure times, of the conditional probabilities of surviving to the next failure time. Formally, it is given by
$$\hat S(t) = \prod_{i:\, t_i \le t}\left(1 - \frac{d_i}{n_i}\right),$$
where $n_i$ is the number of patients at risk at time $t_i$, and $d_i$ is the number of patients who fail (die) at that time. Figure 11 shows the survival curves with a risk table for both male and female patients diagnosed with AML. The figure provides information about how many people are at risk at a specific time t for both male and female patients. For example, at time t = 0, the numbers of male and female patients at risk are 1103 and 912, respectively, which are the total numbers of male and female patients in the data set with which we started our analysis. At time t = 60 (months), the numbers of male and female patients at risk are 102 and 92, respectively.
It is important to note that, with the passage of time, the number of people at risk gradually decreases for both categories, which is also evident from Figure 11; the KM survival estimate $\hat S(t)$ is a function of the number of patients at risk ($n_i$).
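The product-limit calculation can be illustrated with a short Python sketch that also returns the number at risk at each death time, analogous to a risk table. The function name, the event coding (1 = death observed, 0 = censored), and the toy data are hypothetical assumptions made for this illustration.

```python
def kaplan_meier(times, events):
    """Return a list of (t_i, n_i, d_i, S(t_i)) for each distinct death time."""
    pairs = list(zip(times, events))
    survival = 1.0
    table = []
    for t in sorted({t for t, e in pairs if e == 1}):
        n_i = sum(1 for ti, _ in pairs if ti >= t)              # at risk just before t
        d_i = sum(1 for ti, e in pairs if ti == t and e == 1)   # deaths at t
        survival *= 1.0 - d_i / n_i                             # product-limit update
        table.append((t, n_i, d_i, survival))
    return table


# Hypothetical follow-up times in months; the third patient is censored at 2 months.
for row in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 1]):
    print(row)
```

Censored patients contribute to the number at risk up to their censoring time but never trigger a drop in the estimated survival curve.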

Median Survival and a Confidence Interval for the Median Using KM Estimate
Median survival time is a statistic that indicates how long a group of patients will survive with an illness in general or after a specific treatment has been applied. It is usually expressed in months or years. Median survival time is the time at which half of the patients with a certain disease are expected to still be alive; that is, the probability of surviving beyond that time is 50 percent. It gives an approximate indication of survival and the prognosis of a group of patients with cancer. Median survival is reported in almost every cancer treatment study. Generally, the median survival time [20] is defined as
$$\hat t_{0.5} = \min\{t : \hat S(t) \le 0.5\}.$$
That is, it is the smallest t such that the estimated survival function $\hat S(t)$ is less than or equal to 0.5. To compute a $100(1-\alpha)\%$ confidence interval for the median (here $\alpha$ denotes the significance level, not the GEV scale parameter), we consider the following inequality:
$$-z_{1-\alpha/2} \le \frac{g(\hat S(t)) - g(0.5)}{\sqrt{\widehat{\mathrm{Var}}\left[g(\hat S(t))\right]}} \le z_{1-\alpha/2},$$
where the variance, for $g(u) = u$ and for $g(u) = \log(-\log u)$, respectively, is given by the following equations:
$$\widehat{\mathrm{Var}}\left[\hat S(t)\right] = \hat S(t)^{2} \sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)}, \qquad \widehat{\mathrm{Var}}\left[\log(-\log \hat S(t))\right] = \frac{1}{\left[\log \hat S(t)\right]^{2}} \sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)}. \qquad (26)$$
A confidence interval for the survival probability computed with the first variance formula in (26) might extend below zero or beyond 1. A more realistic approach is to use the log-log transformation of $\hat S(t)$, as in the second formula of (26). To compute a 95% confidence interval for the median, we look for the smallest value of t such that the middle expression of the inequality above is at least −1.96 (the lower limit) and the largest value of t such that the middle expression does not exceed 1.96 (the upper limit). The median survival times, computed using the non-parametric KM estimator, for male and female patients diagnosed with AML are four months and six months, respectively. The corresponding 95% confidence intervals for the median survival time are [4, 5] and [5, 8]. It is interesting to note that the median survival times we obtained by the descriptive method (Table 1) are very close to those obtained by the non-parametric method.
However, the median survival time we obtained using the parametric method (implementing the GEV distribution) is slightly greater than those from the descriptive and non-parametric methods. Table 4 compares the median survival time for both male and female patients diagnosed with AML, computed using the three methods.
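The search for the KM median and its confidence limits can be sketched as follows. This is a minimal illustration that assumes the log-log transformation g(u) = log(-log u) with a Greenwood-type variance, as one common way to carry out the procedure described above; the function name, the event coding (1 = death observed, 0 = censored), and the toy data are hypothetical, and the sketch is not the software used to obtain the intervals reported in this section.

```python
import math


def km_median_ci(times, events, z=1.96):
    """KM median and a log-log based confidence interval for it."""
    pairs = list(zip(times, events))
    g_half = math.log(-math.log(0.5))
    survival, greenwood_sum = 1.0, 0.0
    median = lower = upper = None
    for t in sorted({t for t, e in pairs if e == 1}):
        n_i = sum(1 for ti, _ in pairs if ti >= t)
        d_i = sum(1 for ti, e in pairs if ti == t and e == 1)
        survival *= 1.0 - d_i / n_i
        if median is None and survival <= 0.5:
            median = t  # smallest t with S(t) <= 0.5
        if n_i == d_i:
            break  # S(t) has reached zero, so the variance is undefined
        greenwood_sum += d_i / (n_i * (n_i - d_i))
        std_err = math.sqrt(greenwood_sum) / abs(math.log(survival))
        statistic = (math.log(-math.log(survival)) - g_half) / std_err
        if lower is None and statistic >= -z:
            lower = t  # smallest t with statistic >= -z
        if statistic <= z:
            upper = t  # largest t so far with statistic <= z
    return median, (lower, upper)


# Hypothetical data: nine patients dying at months 1 through 9.
print(km_median_ci(list(range(1, 10)), [1] * 9))
```

With so few patients the interval is wide; with samples of the size analyzed here, the limits tighten around the median.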

Comparison of GEV Distribution with the Kaplan-Meier Estimation of the Survival Function
In the parametric analysis (Section 3.3), we found that patients' survival time (both male and female) with acute myeloid leukemia follows a Generalized Extreme Value (GEV) distribution. In Section 3.4, we performed a non-parametric analysis using the Kaplan-Meier to estimate the AML patients' survival probability. We compare the survival probability estimates of the GEV distribution with the Kaplan-Meier survival estimates of the survival time of the AML patients.
The purpose of the survival function in both methods is to estimate the probability that a patient diagnosed with AML survives beyond a given time. The survival probabilities corresponding to specific times (in months) are shown in Table 5 for comparison. We see that the probability estimates computed by the GEV survival function are higher than those of the Kaplan-Meier estimate in most cases. However, there are times at which the KM method estimates higher survival probabilities than the GEV survival function. Since parametric methods are more powerful, robust, and efficient than non-parametric methods when the fitted distribution describes the data well, as the goodness-of-fit results in Table 2 indicate for the GEV distribution, we accept the parametric estimates of the probabilities as the more accurate. In Table 5, $S_{PM}(t)$ is the parametric survival probability estimated for male AML patients using the GEV distribution. The other three quantities reported in the table are the non-parametric survival probability estimated for male AML patients using the KM estimate, the parametric survival probability estimated for female AML patients using the GEV distribution, and the non-parametric survival probability estimated for female AML patients using the KM estimate.

Results and Discussions
Given the risk posed by AML cancer in the past few years, it is imperative to
• We compared the median survival time of male and female AML patients using descriptive, parametric, and non-parametric methods.
• We compared the estimated survival probabilities of male and female patients diagnosed with AML by parametric method (driven by GEV probability distribution) and non-parametric method (driven by KM estimate) beyond a given survival time.
At the first stage of our analysis, we tried to investigate if there is any statistically significant difference between the survival time of male and female AML patients using the Log-Rank test. We found that there exists a significant difference between the survival time of both males and females diagnosed with AML.
We therefore analyzed male and female AML patients separately. We found that a GEV distribution best characterizes the probabilistic behavior of the survival time for both male and female AML patients, separately. We believe that finding the most accurate probability distribution that represents the probabilistic behavior of the survival time for a given cancer can lead to estimating the survival probability with much more accuracy and efficiency. The fact that we identified a specific probability distribution for the survival time of patients diagnosed with AML argues against routinely assuming an exponential distribution (Feigl and Zelen (1965), p. 835, and other authors) or relying on the non-parametric Kaplan-Meier method for the majority of cancer survivorship studies. We found that the GEV distribution most often estimates higher survival probabilities than the KM survival function, as shown in Table 5. KM estimates are a very commonly used tool for analyzing cancer survivorship data, but they are not necessarily the best estimates. Statistically, the parametric technique is considered to be more robust and efficient than its non-parametric counterpart.
Therefore, our finding of the parametric GEV probability distribution gives better results in estimating the survival probability of the patients diagnosed with AML than the Kaplan-Meier. The KM technique is most frequently used to compare the difference between the estimated survival probabilities of the survival time of two or more entities or categories, typically based on the log-rank test. However, by obtaining the best parametric probability distribution that characterizes the survival time, we can find the survival function and estimate the survival rate and compare the results of two or more entities with a high degree of accuracy. One of the most useful results that we have obtained from our data analysis is that the survival probabilities for female AML patients are significantly higher than the survival probabilities for male AML patients by both parametric and non-parametric methods, which is evident from Table 5 and also Figure 11.

Conclusions
We have determined the survival probability of patients diagnosed with Acute Myeloid Leukemia (AML) using two different statistical methods: the parametric Generalized Extreme Value (GEV) distribution and the non-parametric Kaplan-Meier (KM) estimation. We found that the parametric method often gives higher estimates of the survival probabilities than the non-parametric KM method. Although there are instances in which the non-parametric survival probability estimates are the same or higher, all important arguments favor the parametric approach. The difficulty of parametric survival analysis is its fundamental assumption that the survival time of the population under study follows a specific probability distribution.
But if we can overcome this restriction, we can obtain a more robust and efficient result from the parametric analysis, which has greater statistical power. After finding the right parametric distribution, we can also evaluate the hazard function, which determines the rate at which patients with AML die. Based on the two methods used for estimating the probability of survival of patients diagnosed with AML, we offer the following important recommendations.
• Given the information regarding male and female cancer patients' survival time, it is customary to investigate first if there exists any statistically significant difference between male and female patients' survival time. If the difference is significant, we must perform a separate analysis for each of the two groups. In the present study, we found that there is a significant difference between the survival time of male and female patients diagnosed with AML.
• If the only information provided about the patient is the survival time, then estimating the survival probability using the parametric technique will yield more accurate, robust, and efficient results than the commonly used non-parametric Kaplan-Meier survival estimate.
• However, if no unique or well-defined parametric probability distribution can be estimated, we still propose using the Kaplan-Meier (KM) technique to estimate the survival probabilities. Although the use of non-parametric Kaplan-Meier survival analysis may, in certain circumstances, result in a similar or higher probability estimate of the survival rate (such as in our case), the parametric analysis remains more powerful, robust, and efficient. Hence, the parametric analysis must be considered the first stage of data analysis of any given cancer survivorship data. This study provides a more effective and plausible method for estimating the survival probability and analysis of cancer survivorship data to further enhance the therapeutic/treatment process of AML cancer.