Survival Analysis of Lung Cancer Patients from TCGA Cohort

Lung cancer is one of the leading causes of death worldwide, accounting for an estimated 2.1 million cases in 2018. To analyze the risk factors behind lung cancer survival, this paper employs two main models: the Kaplan-Meier estimator and the Cox proportional hazard model [1]. The log-rank test and the Wald test are used to test whether each factor is associated with survival, as discussed in detail in later parts of the paper. The aim is to identify the factors that most influence the survival probability of lung cancer patients. In summary, stage of cancer is consistently a significant factor for lung cancer survival, and time has to be taken into account when analyzing the survival of patients in our data sample, which comes from TCGA. Further study is required to improve the treatment of lung cancer, as our data sample might not represent the overall population of patients diagnosed with lung cancer; more appropriate and advanced models should also be employed to capture in detail the factors that affect the survival of these patients.


Introduction
Lung cancer, also called lung carcinoma, is a cancer characterized by uncontrolled cell growth in lung tissues, and it causes more deaths than any other type of cancer [2]. The two major types of lung cancer are small cell lung cancer and non-small cell lung cancer; each is divided into stages according to the severity of the disease. About 85% of lung cancer cases are non-small cell lung cancer [3]. Risk factors for lung cancer include air pollution, personal characteristics, and genetics. The dominant and best-known cause is cigarette smoking, which accounts for about 85% of lung cancer cases [4], because cigarettes contain hazardous chemical components such as nicotine, which speeds up cell growth and eventually results in tumors and potentially malignant lung cancer [5].
To explore and better understand lung cancer, the study of genes is extremely important. Cancer genomics aims to improve treatment through structural genomics, which "measures the activity of genes encoded in our DNA in order to understand which proteins are abnormally active or silenced in cancer cells" [6].
With the large amount of genomic data now available, new drugs can be more effective and specific, since they can target abnormal genes or proteins precisely instead of killing cells indiscriminately as chemotherapy does [7]. As a result, the survival probability of lung cancer patients could be substantially improved.
In this paper, we study the impact of several risk factors, including genomic factors, on survival in a cohort of lung cancer patients. The data come from TCGA (The Cancer Genome Atlas), a program that gathers large volumes of cancer genome sequencing data with the goal of contributing to cancer treatment. TCGA data are comparatively reliable, because the teams in TCGA classify cancers, or tumors, into subgroups that can be better analyzed by experts and investigators in the field of lung cancer [8].

Methodology
The dataset covers 1145 patients, all diagnosed with lung cancer of different types. Important variables in the data include diagnosis age, sex, smoking history, stage of lung cancer, fraction genome altered, and mutation count. Specifically, diagnosis age, fraction genome altered, and mutation count are continuous variables; smoking history, sex, and stage are categorical variables [9]. However, to better capture the association between these variables and survival, some continuous variables are transformed into categorical variables in some analyses. For example, fraction genome altered is split at the cohort average or into quartiles; smoking history could likewise be divided into several categories based on the duration of smoking. To get a better sense of the data, we first examine tables and histograms before the analysis.
Two main statistical models are used to analyze the dataset.

Kaplan-Meier Estimator
To understand the relationship between categorical covariates and survival, we use the Kaplan-Meier estimator, one of the most widely used non-parametric methods in survival analysis and in medical research. The formula used is:
$$\hat{S}(t) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{r_j}\right)$$
where $t_j$ is the time of the $j$-th event; $d_j$ is the number of deaths at $t_j$; and $r_j$ is the number of individuals at risk at $t_j$.
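The product-limit calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function name `kaplan_meier` and the toy data are hypothetical, and `r_j` is taken to be the number of patients whose follow-up time is at least `t_j`.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_j, S(t_j))] at each distinct death time.

    times  : observed follow-up time for each patient
    events : 1 if a death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t_j in sorted(deaths):
        r_j = sum(1 for t in times if t >= t_j)  # still at risk at t_j
        d_j = deaths[t_j]                        # deaths observed at t_j
        survival *= 1.0 - d_j / r_j              # one factor of the product
        curve.append((t_j, survival))
    return curve

# Toy data (hypothetical, not taken from the TCGA cohort)
toy_times = [2, 3, 3, 5, 8, 8, 9]
toy_events = [1, 1, 0, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for t_j, s in km_curve:
    print(f"t = {t_j}: S(t) = {s:.3f}")
```

Censored patients never trigger a drop in the curve, but they still count toward `r_j` until their follow-up ends, which is why the estimator can use incomplete follow-up.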

Cox Proportional Hazard Model (Cox PH Model)
On the other hand, to study the effect of multiple factors simultaneously, the Cox PH model is a better approach. The formula to measure the hazard ratio between the two groups is:
$$\frac{\lambda_i(t)}{\lambda_0(t)} = \exp\left(\beta^{\top} Z\right)$$
where $\lambda_0(t)$ is the hazard rate for the control group and $\lambda_i(t)$ is the hazard rate for the treatment group. $Z$ is a vector of covariates, including continuous factors, indicators for categorical factors, and possible interactions (e.g. age by sex interaction), and $\beta$ is the corresponding vector of regression coefficients.
To formally draw inference about these relationships, we use the log-rank test and the Wald test for hypothesis testing. Hypothesis testing compares the p-value (the probability of observing data at least as extreme as ours if the null hypothesis is true) with the alpha level (the probability of rejecting the null hypothesis when it is in fact true). We use an alpha level of 0.05 in our data analysis, because 0.05 corresponds to the 95% confidence level (the most common one). If the p-value is smaller than the alpha level, we have enough statistical evidence to reject the null hypothesis; otherwise, we fail to reject it. Hence, a p-value < 0.05 is required to show a statistically significant effect of a variable on survival [12] [13] [14].
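The log-rank test mentioned above can be illustrated for the two-group case as follows. This is a sketch under stated assumptions rather than the procedure actually run for this paper: the function name `logrank_test` and the toy data are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value)."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_a = expected_a = variance = 0.0
    for t_j in event_times:
        n = sum(1 for t, _, _ in pooled if t >= t_j)                # at risk, both groups
        n_a = sum(1 for t, _, g in pooled if t >= t_j and g == 0)   # at risk, group A
        d = sum(1 for t, e, _ in pooled if t == t_j and e == 1)     # deaths, both groups
        d_a = sum(1 for t, e, g in pooled if t == t_j and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n  # expected deaths in A under the null
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Toy data (hypothetical, not taken from the TCGA cohort)
chi2, p = logrank_test([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1],
                       [1, 2, 3, 4, 5, 8], [1, 1, 1, 1, 1, 1])
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A p-value below the chosen alpha level (0.05 here) would lead to rejecting the null hypothesis that the two groups share the same survival function.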

Kaplan-Meier
First, we used the Kaplan-Meier estimator to examine the effect of cancer type, cancer stage, patient sex, and smoking history on survival. A summary of the data and results is shown in Figures 1-3.

Cancer Type
There are two types of cancer involved: LUAD (lung adenocarcinoma) and LUSC (lung squamous cell carcinoma). In the Kaplan-Meier survival plot of the two groups, there seems to be no large difference between the survival probability (the y axis) of LUAD and that of LUSC. The log-rank test gives a p-value of 0.5, which is far greater than the alpha level (0.05). Hence, we do not have the statistical evidence to reject the null hypothesis that patients with the two types of lung cancer have the same survival.

Stage of Cancer
Cancer is classified into four stages, I, II, III, and IV, where IV is the most advanced stage, in which cancer cells have spread to other organs. As shown in Figures 4-6, the survival probability is lowest for stage IV, followed by III, II, and I, meaning that patients in stage IV have shorter survival than patients in the other three stages, even with treatment. This is reasonable given how the four stages are defined.
Here, the p-value equals 2e-6, which is far smaller than the alpha level of 0.05, indicating that survival probability differs significantly among the stages of cancer.

Sex
For the sex of patients, there seems to be no large difference in survival probability between females and males, as their survival point estimates and confidence intervals largely overlap. Moreover, the p-value of the log-rank test is 0.7, which is far greater than the alpha level, so there is no evidence of a difference in survival between females and males in this cohort.

Smoking
For smoking, we categorized patients into two main groups: smoking = 1 represents patients who have ever smoked during their lifetime, and smoking = 0 represents patients who have never smoked. Contrary to the common understanding of the harm of smoking to health, patients who smoked have a better survival probability in our data than those who never smoked, based on Figure 7 and Figure 8. However, since the p-value (0.3) is greater than the alpha level (0.05), the difference in survival between smokers and non-smokers is not significant. The wide confidence interval of "smoking = 0", which covers that of "smoking = 1", points to the same conclusion. We may lack the power to detect an underlying difference because of the small number of non-smokers in the sample.

Fraction of Genome Mutated
For fraction of genome altered (fga), we created two categorical variables. For the first, "fga_binary = 1" corresponds to patients whose fraction of genome altered is greater than the average level of this cohort, while "fga_binary = 0" corresponds to those whose fraction is smaller than the average level. Based on Figures 9-11, patients with a higher fraction of genome altered seem to have survival close to that of patients with a lower fraction. As shown in Figure 12 and Figure 13, and as with the previous variables, the difference between the two groups is not significant, since the p-value (0.9) is greater than the alpha level.

Cox Proportional Hazard Model
To analyze the impact of multiple variables on the survival of patients, the Cox proportional hazard model is used. Figure 14 presents the regression output of a baseline Cox model including the six variables we are interested in: three categorical variables (sex, smoking, and stage) and three continuous variables (diagnosis age, fraction genome altered, and mutation count).

Diagnosis Age
Based on this output, the hazard ratio of age is 1.02, meaning that each additional unit of age at diagnosis is associated with a hazard about 2% higher, so older patients face a higher hazard. The p-value (Pr > |z| in the diagram) is 0.003, which is smaller than 0.05, meaning that this association is significant. Stages II, III, and IV also have a significant relationship with survival: their hazard ratios are 1.52, 2.24, and 2.66 respectively, and all the associated p-values are smaller than 0.05. The higher hazard ratio at higher stages is reasonable, because patients at higher stages tend to have more advanced lung cancer and therefore a higher hazard rate. The other variables do not show a significant effect on the hazard, since their p-values are all greater than 0.05. We should nevertheless be cautious when interpreting these results, because we may fail to capture a true association due to a lack of statistical power.
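To make the interpretation of a hazard ratio and its Wald test concrete, the short sketch below converts a coefficient and its standard error into a z statistic and a two-sided p-value, and compounds a per-unit hazard ratio over several units. The function names are hypothetical, and the standard error `age_se` is an assumed value chosen only for illustration; the actual standard error is in Figure 14 and is not reproduced here.

```python
import math

def wald_test(coef, std_err):
    """Return (z, two-sided p-value) for the null hypothesis coef = 0."""
    z = coef / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

def hazard_ratio_over(hr_per_unit, units):
    """Hazard ratio implied by a change of `units` in a continuous covariate."""
    return hr_per_unit ** units

age_coef = math.log(1.02)  # coefficient implied by the reported hazard ratio
age_se = 0.0067            # ASSUMED standard error, for illustration only
z, p = wald_test(age_coef, age_se)
print(f"z = {z:.2f}, p = {p:.4f}")

# A hazard ratio of 1.02 per unit compounds multiplicatively:
print(f"HR over 10 units of age: {hazard_ratio_over(1.02, 10):.2f}")  # 1.22
```

Because the model is multiplicative on the hazard scale, a per-unit ratio that looks small (1.02) still implies a hazard roughly 22% higher across a 10-unit difference in age.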

Fraction of Genome Altered
Since we failed to detect a relationship between fraction of genome altered and the hazard in continuous form, we explored the association further by transforming the variable from continuous to categorical. The four groups are delimited by the 25th, 50th, and 75th percentiles: 0-0.157, 0.157-0.326, 0.326-0.479, and 0.479-0.937, as shown in Figure 15. However, there is still no statistical evidence of an association between fraction of genome altered and the hazard in categorical form, as the p-values are still greater than 0.05 (0.81, 0.55, and 0.22).
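The quartile-based grouping described above can be sketched with the Python standard library. The function name `quartile_groups` and the toy values are hypothetical, and the exact percentile convention used in the study is not stated, so the cut points produced here may differ slightly from those of other software.

```python
import bisect
import statistics

def quartile_groups(values):
    """Return (groups, cuts): a group 0-3 per value and the three cut points."""
    cuts = statistics.quantiles(values, n=4)  # [25th, 50th, 75th percentile]
    # bisect_right counts how many cut points are <= v, giving a group 0..3
    groups = [bisect.bisect_right(cuts, v) for v in values]
    return groups, cuts

# Hypothetical fga values, not taken from the TCGA cohort
toy_fga = [0.02, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]
toy_groups, toy_cuts = quartile_groups(toy_fga)
print(toy_cuts)
print(toy_groups)
```

In a Cox model the resulting groups would typically enter as indicator variables with one group as the reference, which would explain why three p-values are reported for four groups.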

Mutation Count
As shown in Figure 16, as with the variable "fraction of genome altered", we divided the total mutation count into four categories by quartiles, where the maximum mutation count is 2360. Nevertheless, since the p-values are still greater than 0.05, categorizing mutation count does not help to detect a significant association between mutation count and the hazard.

Stage and Smoking
Apart from analyzing the effects of individual variables on the hazard, we add interaction terms to the model to study effect modification between covariates.
We first focus on the interaction between stage and smoking, shown in Figure 17. We add a variable denoting the product of the two covariates to the model.
However, the model output shows no significant interaction between the two variables, as the p-value is greater than 0.05.
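As an illustration of what adding a product term means, the model with a stage by smoking interaction can be written as below. The coding is an assumption, since it is not spelled out in the text: smoking is taken as a 0/1 indicator $x_{\text{smoke}}$ and stage as indicators $x_s$ for stages II to IV, with the symbols $\beta$ and $\gamma_s$ introduced here for illustration.

```latex
\lambda(t \mid Z) = \lambda_0(t)\,
  \exp\Big( \beta_{\text{smoke}}\, x_{\text{smoke}}
          + \sum_{s \in \{\mathrm{II},\mathrm{III},\mathrm{IV}\}} \beta_s\, x_s
          + \sum_{s \in \{\mathrm{II},\mathrm{III},\mathrm{IV}\}} \gamma_s\, x_{\text{smoke}}\, x_s \Big)
```

Under this coding, $\exp(\gamma_s)$ is the factor by which the smoker versus non-smoker hazard ratio at stage $s$ differs from that at stage I, so testing $\gamma_s = 0$ tests for effect modification. If stage is instead entered as a single numeric score, the second sum collapses to one product term with a single p-value.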

Sex and Smoking
We then study the interaction between sex and smoking by adding a new variable denoting the product of the two covariates. As shown in Figure 18, we again fail to detect an interaction between sex and smoking, since the p-value is 0.70, greater than 0.05.

Time Added to Different Variables
The Cox PH model assumes that the hazard ratio does not depend on time (t), i.e. the hazards of the two groups remain proportional over time, so that the hazard ratio from t1 to t2 is the same as that from t2 to t3 in the sample. However, this assumption seems to be violated based on the Kaplan-Meier analysis and the Cox models above. For example, because the two lines "fga_binary = 0" and "fga_binary = 1" cross in the diagram below, the hazard ratio between the two groups changes, and even reverses, over time. The impact of some other covariates on survival also seems to change with time. As a result, time should be taken into account in modeling.
Thus, we revise the Cox PH model by adding new covariates representing the interaction between time and the other covariates. In the output in Figure 19 and Figure 20, all variables except diagnosis age have an extra covariate with time. The diagnosis age effect is no longer significant, because of collinearity between time and age. For sex, smoking, and stage, we obtained both a significant main effect and a significant interaction with time. For example, "smoking_time" captures how the hazard ratio between smokers and non-smokers changes as time increases. Given that its exp(coef), i.e. the hazard ratio, is 0.763, the longer the time since diagnosis, the smaller the difference in hazard rate between smokers and non-smokers. The p-value is far less than 0.05, so the evidence for an interaction between smoking and time is convincing.
On the other hand, the time interactions for fraction genome altered (fga_time) and mutation count (mc_time) still do not have a significant effect on the hazard, given their large p-values (0.224 and 0.551 respectively). However, this does not mean that the effect of these two variables on survival does not change with time.
Again, we may have failed to capture their association with survival because of the limited cohort and sample size of our data. More sophisticated statistical methods and a larger sample might help to detect the true association in the future.
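The role of a covariate-by-time term can be made explicit with a short derivation. The notation is an assumption made for illustration: $x$ is a single 0/1 covariate such as smoking, $\beta$ its main effect, $\gamma$ the coefficient of its time interaction (e.g. "smoking_time"), and $g(t)$ the function of time used in the product, which is not stated in the text (it could be $t$ itself or a transform such as $\log t$).

```latex
\lambda(t \mid x) = \lambda_0(t)\, \exp\big( \beta x + \gamma\, x\, g(t) \big)
\qquad\Longrightarrow\qquad
\mathrm{HR}(t) = \frac{\lambda(t \mid x = 1)}{\lambda(t \mid x = 0)}
              = \exp(\beta)\, \exp(\gamma)^{\, g(t)}
```

The hazard ratio is therefore constant only when $\gamma = 0$, which is the proportional hazards assumption. With $\exp(\gamma) = 0.763$, each unit increase in $g(t)$ multiplies the hazard ratio by 0.763, so a ratio that starts above 1 moves toward 1 as time passes and falls below 1 if $g(t)$ grows large enough.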

Conclusion
To sum up, certain variables influence the survival rate of patients with lung cancer in our data sample from TCGA. Specifically, the stage of cancer, which is