Identification of Significant Genes and Pathways Related to Lung Cancer via Statistical Methods

Cancer genomic research is a relatively new method. It has shown great potential but faces certain challenges. Researchers often have to deal with tens of thousands of genes with a relatively small sample size of patient cases—a dilemma referred to as the “Curse of Dimensionality” [1]—and it makes it hard to learn the data well because of relatively sparse data in high dimensional space. To deal with the dilemma, this study uses p-values of individual genes for pathway enrichment to find statistically significant pathways. The aim of this study is to find significant genes and biological pathways that are related to lung cancer by statistical method and pathway enrichment analysis. Several significant genes, such as WNT2B, VAV2, and significant pathways, such as Metabolism of xenobiotics by cytochrome P450-Homo sapiens (human) and Fatty acid degradation-Homo sapiens (human), are found to be both statistically significant and biological studies supported. Significant genes-including TESK2, C5orf43, and ZSCAN21—and significant pathways such as Pentose and glucoronate interconversions-Homo sapiens (human), are found to be new cancer-related genes and pathways that worth laboratory studies. The idea and method used in this research can be applied to find more significant genes and pathways that worth study experimentally.


Introduction
In the 21 st century, cancer research, integrated with biology, genetics, cytology and statistics, continues to be a hot spot.Since last century, many researchers have been working in this field on clinical observations and theoretical deduction.Among many researches, generic aspects of such, a relatively new method for learning causes and preventions for cancer, have begun to show its potential.
The study of cancer genomics can reveal abnormalities in certain genes that drive the development and growth of lung cancer [2].Over the past decade, large-scale research projects have been launched but faced certain challenges.
Acquiring high-quality biological samples needed for genomic studies, managing and analyzing the vast amounts of data involved, and finally having a converged genetic abnormality result all add challenges to genomic method of research [3] [4].
Lung cancer is one of the cancers that caused most deaths around the world each year.About 1 out of 4 cancer deaths are from lung cancer.Each year, more people died of lung cancer than of colon, breast, and prostate combined [5].
Unsurprisingly it has been studied intensively.People have come to accept that smoking is the major cause of lung cancer as early as 1930s [6], but recent studies have shown that genetic aberrations can also raise cancer risks substantially [7], though how specific genes affect lung cancer remains vastly understudied.
In this paper, we focus on lung cancer and gene expressions associated with it.
Our dataset is retrieved from GSEA, which collects data in collaboration with National Cancer Institute, National Institutes of Health, National Institute of General Medical Sciences etc. Normalized RNA sequencing data of 868 lung cancer patients is studied statistically with patients' clinical observations.On the basis of clinical observations and gene expressions of lung cancer patients, significant genes and gene pathways are investigated.Two specific clinical data, recurrence free survival (RFS) and whether the patient has a new tumor event after initial treatment (cancer recurrence), are chosen to run regression on.RFS is a continuous variable, and we apply linear regression to identify relevant genes and pathways [8]; cancer recurrence is a binary variable, so logistic regression was applied [9].The resulting p-values of genes are then analyzed by Gene Set Enrichment Analysis (GSEA), an online pathway enrichment tool, to find statistically significant gene pathways [10].

Multivariate Linear Regression
Linear regression is a linear approach for modeling the relationship between a scalar dependent variable and one or more independent variables.Multiple linear regression takes a vector of independent variables and find each independent variables' associations with the dependent variable.Here for each patient, let Y denotes the dependent variable (RFS) for each patient, β denote a vector of associations between dependent and independent variables, and let X denotes the vector of independent variable (expressions of genes): ( ) So a multiple linear model is fit under the assumption: Where ε is natural noise with following distribution: ( ) where N stands for normal distribution, the distribution that is assumed for real random values, and i.i.d.stands for "independently and identically distributed", as natural noises have the same probability distribution and are mutually independent.
When β is positive, x and y has a positive relationship, and as x increases y increases; vice versa.Note that the explanatory variable is also given an intercept β 0 .β and β 0 are found by Least Square Estimation, rigorously defined by: ( ) Under such assumption, β is found using the closed form solution: ( ) Then a linear model is fit to describe the relationship between dependent variable and a vector of independent variables.A p-value is calculated for each β to test its statistical significance.More analysis of p-values will be discussed in Section 2.2.

Multivariate Logistic Regression
Logistic regression is a regression model dealing with categorical dependent variables, which can only take two values, "0" or "1".It estimates the probability of the binary response based on one or more explanatory variables.There are only two possible outcomes: having (y = 1) or not having (y = 0) a new tumor.The sum of the two probabilities is assumed to be 1, as stated: Multiple logistic regression takes a vector of independent variables and find each independent variables' contribution to the classification result of the dependent variable.A linear model could still be used for logistic regression except one problem: the probabilities would be allowed to go smaller than 0 and bigger than 1.So the joint probability takes a logistic transformation: where X denotes the vector of independent variables (expressions of genes) and β denotes the contribution an independent variable makes in determining the classification of the dependent variable: ( ) Unlike linear regression, a Least Square Estimation cannot be found in logistic regression, so instead a Maximum Likelihood Estimation is sought.The following likelihood function is maximized: Also unlike linear regression, there is no closed form solution for the optimized β, so the β that maximizes the likelihood function can only be found using iterative algorithm such as the Newton Raphson algorithm [11].A logistic model is then fit to describe the contribution each independent variable makes in deciding the category of the dependent binary variable.A p-value is then calculated for each β to test its statistical significance.More analysis of p-values will be discussed in Section 2.3.

Correction for Multiple Testing
Multiple testing is when a set of statistical inferences is tested simultaneously.In this study, each gene is given a p-value through multiple testing of all models fit.
When a model is fit for each gene, a hypothesis test is performed.Null hypothesis (H0) states that the real coefficient (β) is zero (the value of this specific independent variable does not have association with the value of the dependent variable), and alternative hypothesis (H1) states the opposite.
From each null hypothesis, an T population distribution is assumed.Then by using a t-test, a p-value-how likely the T statistic, resulted from the alternative hypothesis, is to occur under F distribution-is calculated.In other words, p-value measures the probability of obtaining a result at least as extreme as the sample data, assuming the null hypothesis is true.So when the p-value is small, it is reasonable to reject the null hypothesis.In this study, p-value is used as the metric for selecting top genes and pathways.
In the case of multiple testing, a general rule of 0.05 as cutoff for p-values for statistical significance is not sufficient.Consider a large test number such as 10,000.If 0.05 is used as cutoff, it is expected that 500 true hypothesis to be rejected even though all null hypotheses are true.Also, because p-value cannot be larger than 1, using 0.05 cutoff would yield many "statistical significant" cases just by chance.As a result, an accommodated cutoff would be applied for multiple testing.
There are multiple ways to calculate the cutoff for p-value.For example, Bonferroni cutoff is a rigid correction for multiple comparison that controls the overall probability of a single false discovery.It adjusts the significance cutoff to α/m where α is the normal 0.05 cutoff and m is the number of hypothesis.In this study, a Bonferroni cutoff would be extremely small because there are tens of thousands of genes, thus an m value of tens of thousands [12].gent than the Bonferroni cutoff [13].According to Benjamini-Hochberg (BH)'s method, the cutoff is calculated as: max : Both corrections are used in this study, and the results will be discussed in Section 3.

Pathway Enrichment Analysis
With the complexity of traits of genes, it is sometimes difficult to interpret the p-values of all genes.Therefore, gene set pathway enrichment analysis is also used.It is useful for finding out if the differentially expressed genes are associated with a certain biological process or molecular function, so that an association between cancer and certain gene pathways can be drawn.Here, a list of genes and their p-values is analyzed using Gene Set Enrichment Analysis (GSEA) through KEGG pathway.

Results and Discussions
The variables of interest in this study, RFS and Cancer Recurrence, are both of great importance in cancer study.RFS, as known as DFS (disease-free survival), is the length of time after primary treatment for a cancer ends that the patient survives without any signs or symptoms of that cancer [14].Cancer Recurrence is the primary way of judging whether a treatment works.Both variables provide a way to examine how well a treatment works for the patient.Specific to this study, these two variables provide a standard to determine whether certain differently expressed genes or certain pathways have high correlation with whether a treatment is useful.

Gene Level Results
The sorted p-values (ascending order) for both regressions are plotted to help visualization.There were three to four thousand p-values that passed the rule-of-thumb 0.05 cutoff in both regressions, so we proceed to use more rigorous corrections of cutoff: FDR corrections and Bonferroni cutoff.

Logistic Regression
The largest Pi found in FDR correction is that of 528 th p-values: 1.280e−3; the Bonferroni cut off is 2.435e−6 (Figure 1).
The top 10 genes are shown in Table 1 as following (Note that there are 528 genes that passed the FDR corrections, and p-values passed Bonferroni cutoff are highlighted): After studying existing literatures, we find some medical and laboratory evidence for correlation between some genes above and lung cancer.
WNT2B, the very top gene on the list, is a protein that encodes a member of the wingless-type MMTV integration site (WNT) family of highly conserved, Advances in Bioscience and Biotechnology  secreted signaling factors.WNT family members function in a variety of developmental processes including regulation of cell growth and differentiation [15].
A study of WNT2B expression in human cancer suggested that WNT2B (and mostly only WNT2B but not WNT2 or WNT2B, which also belong to WNT Family), was differently expressed in breast cancer, teratocarcinoma, and gastric cancer [16].Another study that intended to find effects of WNT2B in GLUT1 overexpressing also concluded that WNT2B plays a role in tumorigenesis and chemotherapy resistance [17].Also mentioned by this study was that GLUT1 Researches have also concluded that WNT2B is oncogenic and is especially related to non-small cell lung cancer [18].
OPN3 is a protein coding gene, a member of the guanine nucleotide-binding (G protein)-coupled receptor superfamily [19].A to find correlation between OPN3 and 5-fluorouracil treatment found that therapeutic strategies targeting OPN3 may improve chemotherapy function [20].Another study pointed directly to the role OPN3 regulatory sequences play in oncogenesis and drug resistance [21].
Other genes on the list, such as TESK2 and C5orf43, have not been studied in laboratory or medical researches thoroughly but have been connected to cancer indirectly in other existing literatures.Genes further down the lists are also worthy of studying.

Linear Regression
The largest Pi found in FDR correction is that of 374 th p-values: 9.106e−4; the Bonferroni cut off is 2.435e−6 (Figure 2).
Top 10 genes are shown in Table 2 as following (Note that there are 374 genes that passed the FDR corrections, and p-values passed Bonferroni cutoff are highlighted): After studying existing researches, we find some medical and laboratory evidence for correlation between some genes above and lung cancer.
LSM2 is a Protein Coding Gene, a member of LSm family of RNA-binding proteins.It binds specifically to the 3-terminal U-tract of U6 snRNA, and it may have a correlation with pre-mRNA splicing [22].LSm family has been found to have a strong impact on breast cancer [23], and it has also been reported that overexpression of LSm1 exhibits in many lung cancer cases [24].VAV2 is a Protein Coding Gene, a member of VAV guanine nucleotide exchange factor family of oncogenes.Its transcripts were found in most tissues, and it plays an important role in angiogenesis [25].A research of a well-known cancer-related protein suggests that VAV2 facilitates cancer cell motility and a molecular level.Vimentin is an intermediate filament protein whose expression correlates with cancer diseases.In exploring the molecular role of vimentin in cancer cell motility, they find that VAV2 localizes to vimentin-positive focal adhesions in lung cancer cells.Based on their study, the group built a model of vimentin-VAV2 pathway and proposed it as a potential novel regulator of lung cancer motility [26].
Both NCOR1 and NCOR2 are Protein Coding Genes that belong to family of nuclear receptor corepressors.Both of them encode proteins that mediate transcriptional repression by impeding the access of basal transcription factors [27] [28].In a study of transcription factor corepressors, the group investigated the roles of these corepressors in cancers, and pointed NCOR1 and NCOR2 as examples of how features of transcriptional rigidity affect cancer cells [29].
BUD31 is a Protein Coding Gene that is associated with transcriptional factor activity and sequence-specific DNA binding [30].MYC overexpression is one of the most common drivers of human cancer, but it is so far recalcitrant to therapeutic inhibition.To tackle this problem, Hsu and her team conducted a research and found that spliceosome could be a vulnerability of MYC.BUD31, as identified by the team, is a component of core spliceosome required for its assembly and catalytic activity, thus a MYC-synthetic lethal gene in human cells [31].
Interestingly, top two genes of the regression result cannot be related to many existing researches.In fact, few researches have been focused on the correlation between these genes and cancer whatsoever (besides NACC2 was distantly mentioned to be associated with bladder cancer survival [32]).According to the sig-

Pathway Enrichment Results
Results of analyses of the two variables appear to be very similar in terms of the top pathways and genes.Five pathways appear in the top six of both results.(Table 3).
After studying existing researches, we find some medical evidence for correlation between some pathways above and cancer.
Drug metabolism-cytochrome P450-Homo sapiens (human), id hsa00982, belongs to class of Xenobiotics biodegradation, Metabolism [33].A 1991 research related drug metabolism with cancer predictions, and proposed to use this pathway of genes to predict individual risk of cancer [34].
Metabolism of xenobiotics by cytochrome P450-Homo sapiens (human), id hsa00980, belongs to class of Xenobiotics biodegradation, Metabolism [33].A research to find a new way of cancer diagnosis found about how a specific expression of cytochrome P450 is related to multiple cancers including lung cancer.The team found that a cloned dioxin-inducible form of the cytochrome P450 family, CYP1B1, expresses at a high frequency in a wide range of cancers, so a test on this pathway can provide a new way of diagnosis [35].
Fatty acid degradation-Homo sapiens (human), id hsa00071, belongs to class of Lipid metabolism, Metabolism [33].A study of cancer from lipid perspective suggests that through control of fatty acid pathway, cancer cell proliferation can be controlled since it requires fatty acids for synthesis of membranes and signaling molecules [36].
The other two of the five pathways on the lists, though appeared top, have not been researched for cancer association.There is a great chance that new perspectives of cancer will be found through these two pathways.

Discussion
By now, extensive researches have been conducted about prevention, diagnosis and therapy of lung cancers.Genetics provide a wide range of opportunities for cancer research, and so far the potential still exists for much further studies.In

Figure 1 .
Figure 1.P-value-gene number of cancer recurrence.

Table 1 .
Top 10 genes and its P-value.

Table 2 .
TOP 10Genes and its P-value.

Table 3 .
Analysis of the two variables in terms of the top pathways and genes.