A Novel Treatment Optimization System and Top Gene Identification via Machine Learning with Application on Breast Cancer

Traditional treatment selection of cancers mainly relies on clinical observations and doctor’s judgment, but most outcomes can hardly be predicted. Through Genomics Topology, we use 272 breast cancer patients’ clinical and gene information as an example to propose a treatment optimization and top gene identification system. This study faces certain challenges such as collinearity and the Curse of Dimensionality within data, so by the idea of Analysis of Variance (ANOVA), Principal Component Analysis (PCA) is implemented to resolve this issue. Several genes, for example, SLC40A1 and ACADSB, are found to be both statistically significant and biological-studies supported; the model developed can precisely predict breast cancer mortality, recurrence time, and survival time, with an average MSE of 3.697, accuracy rate of 88.97%, and F1 score of 0.911. The result and methodology used in this study provide a channel for people to further look into the more precise prediction of other cancer outcomes through machine learning and assist in the discovery of targetable pathways for next-generation cancer treatment methods.


INTRODUCTION
As much as we have researched about cancer, it remains a huge issue for human health. What is even crueler than cancer itself is the process of making decisions for treatment selections and going through them. While many researches have been done in cancer early diagnosis and prevention, fewer have been looked into facilitation with post-diagnosis stage. Here we propose a novel treatment optimization system, using breast cancer as an example.
Breast cancer is the most occurred cancer for women worldwide and one of the most common causes Open Access of death from cancer. In the past few years, breast cancer has been studied substantially, increasing prognosis rate and decreasing death rate. Nevertheless, further research is still needed to achieve full understanding of its mechanism and corresponding efficient treatment. Traditionally, doctors mainly rely on biological method to diagnose cancer, such as B-Scan Ultrasonography and Fine Needle Aspiration Cytology (FNAC) [1]. Some more recent researches also support this process, enhancing precision and patient experience. However, after diagnosis, it is hard to get to know what the optimistic treatment plan is for individual patient. Without a commonly applied standard to determine the treatment plan, most cancer patients' treatments tend to be too general, since the doctors normally give patients of different characteristics similar suggestions. A systematic method is missing to provide more information for both the doctors and patients after the diagnosis.
Genomics analysis is a relatively new way to study cancer. Since it only became feasible as the capacity of calculation improves with development of computation power, much potential is worthy to explore. Applying statistical methods to understand cancer has shown great potential in two main ways: prediction of cancer outcomes-such as diagnosis and survivability-and identifying significant genes-helping predetermine certain clinical outcomes.
The data is collected by NKI (Netherlands Cancer Institute), cleaned by Gene Expression Omnibus (GEO) [2], and finally downloaded from Data World website [3]. The dataset included information of 272 breast cancer patients and 1567 attributes, which include 10 clinical attributes and 1554 gene attributes, and 3 patient general attributes.

Pre-Processing
The dataset is examined and no missing data is found. Three response variables-"event death", "survival", "time recurrence"-are taken out as responses and investigated separately from the independent variables. "event death" is a binary variable indicating whether the patient dies of breast cancer, and both "survival"-indicating survival time-and "time recurrence"-indicating recurrence time-are continuous variables. Upon investigation, recurrence time is generally the same as survival time when the patient did not die from cancer. This case would probably affect the result accuracy, so upon investigation of the range for recurrence time, we decided to double the recurrence time if found to be the same as survival time.
All gene attributes are continuous variables and are scaled before further exploration. The following attributes are binary: "chemo" indicating whether the patient has received a chemotherapy, "hormonal" indicating whether the patient has received hormonal therapy, "amputation" indicating whether forequarter amputation has been used as a treatment, "histtype" indicating the histological type. "diam" is a discrete variable indicating the diameter of the tumor size, and so is "posnodes" which indicates the number of nodes. But both are considered as continuous variables because they have many levels. Several attributes are categorical variables: "grade" with three levels indicating cancer grade, "angioinv" with three levels indicating the extent to which the cancer has invaded blood vessels or lymph vessels, "lymphinfil" with three levels indicating the level of lymphocytic infiltration. These three categorical variables are each transformed into three binary variables so that they can feed into more analysis methods. Figure 1 is a heat map to show the correlation: After attributes investigation, the dataset was further evaluated by different classifiers and prediction methods under 10-fold cross validation scheme [4]. The average squared error, error rate, f1 scores are used as the final metric for performance evaluation.

Principal Components Analysis (PCA) [5]
PCA is a statistical analysis method that can be used for dimension-reduction. Our data is very high-dimensional with more than fifteen hundred variables compared to only fewer than three hundred patient samples, which would make it very hard to find statistical patterns from the sparse data comparing to the high-dimensional space. PCA, based on the principles of Analysis of Variance (ANOVA), would convert the data into a set of directions called Principal Components (PC), ranking by their level of explanation of original data's variance. The transformation is made by multiplication of a loading vector (usually the same dimension as the number of attributes) and the original date.
To maximize the variance explained, the first loading vector can be calculated using: where X is the original data and w is the loading vector. Then, to calculate the k th , first subtract the first (k − 1) PCs from X: Then k th loading vector can be calculated: Using PCA, we successfully reduced more than fifteen hundred attributes to 130 attributes while maintaining 90% of the original variance. The reduced dataset is now available for us to perform various machine learning methods. Figure 2 and Figure 3 show visually how PCA analyzes the data. Both plots are showing how principal components help represent the data variance. Figure 2 is PCA performed on the entire data, and Figure  3 is PCA performed solely on clinical attributes for visualization purpose: J. Biomedical Science and Engineering

Regression
Regression models are fit for survival time and recurrence time, and each method is tried separately with clinical-only data, gene-only data and combined data. The results are shown in 6. Model Evaluation.

Linear Regression [6]
Linear regression is a linear approach for modeling the relationship between a scalar dependent variable and one or more independent variables. It is one of the most basic methods in statistical learnings but still has great significance in its application. To achieve the best result possible by linear regression, we used data transformed by PCA and made some adjustments in model selection.
1) Information Criterion and Stepwise Selection A problem with model selection of linear regression is that as the number of attributes increases, the overall result usually seems to increase as well, but often we require a balance between model simplicity and goodness of fit. In this study, therefore, some information criterions are adopted to find the optimal model for linear regression.
Akaike Information Criterion (AIC) [7] is defined as: where k is the number of estimated parameters and L is the maximum value of the likelihood function.
At the same time, Bayesian Information Criterion (BIC) [8] is defined as: where k and L represent same value in AIC and n is the number of data points in X. BIC works in a similar way as that of AIC, and they each work better in some cases, so both of them are tried in this study. For model selection, we used stepwise selection, which is an iterative method that at each iteration, a certain number of attributes are selected and decided to be kept or removed based on the information criterion. There are multiple directions for stepwise selection, and in this study we only used forward and backward.
Forward selection starts with no variables in the model and tests the result when each variable is added, finally keeping the variables that improve the statistical significance of the model by the judgment of AIC and BIC. Backward selection, on the other hand, starts with all variables in the model and tests the result when each variable is left out, removing variables that increases AIC or BIC.
2) Shrinkage Method Shrinkage methods add penalties on parameters in the loss function. It introduces some bias, but could significantly improve the test result. A shrinkage method that can be used as an alternative of model selection is the Lasso method [9]. It solves the following optimization: where λ is a tuning on the penalty function that balances shrinkage and goodness of fit. With appropriate chosen lambda, the Lasso method can shrink some parameters all the way to zero, thus providing model selection.

Support Vector Machine Regression [10]
Support vector regression is adapted from support vector classifier. It is based on distance between data points, thus is sensitive to high dimensions. Therefore, we used PCA transformed data for this method. After grid search on different kernels, we found that nonlinear kernels work better for both response variables. A nonlinear SVM regression finds that coefficient that minimizes: Parameters ε and C (cost) are tuned before finding the best result.

Bagging and Random Forest Regression [11]
Bagging and random forest are bootstrap aggregating methods applied to decision trees (details in 4.4. Decision Tree) to mainly solve the problem of over-fitting. In both methods, multiple samples of certain variables are taken into consideration to be split in nodes, and the result is the average of these trees. The difference between bagging and random forest is that bagging consider all variables in each bootstrap, while random forest picks a portion of variables. Both methods handle high-dimension comparatively well, thus we feed in the full data.
There are multiple parameters that require tuning to achieve optimal random forest models. Using 10-fold cross validation, we tuned the number of variables selected at each iteration ( Figure 4) and the maximum of nodes generated ( Figure 5). Figure 6 and Figure 7 can roughly represent the models (it is impractical to visualize all fifteen hundred attributes).

Classification
Classification models are fit for mortality and each method is tried with clinical-only data, gene-only data and combined data. Whether data transformed by PCA or full data is used depended on the specific method used. The results are in 6.1. Model Selection.

Bayesian Logistic Regression [12]
Bayesian logistic regression works in a similar way of traditional logistic regression. However, it starts with an assumption of distribution of ( ) 0 1 , p β β , usually normal distribution, then maximize the likelihood function: where y i is the response variable (mortality in this case) and the probability p i is defined as: where β denotes the contribution an independent variable makes in determining the classification of the dependent variable. Data transformed by PCA is used for this method.

Linear Discriminant Analysis (LDA) [13]
LDA is an unsupervised learning method that can separate data into two or more classes. It is closely related to analysis of variance (ANOVA) and is similar to PCA in this aspect. However, it focuses more on the difference between classes.
LDA assumes that the conditional probability density functions ( ) , Σ µ  . We predict that the class of one point is different from that of another if: where T, here for two classes without specified weights, is 0.5. Full data is used for this method.

K-Nearest Neighbor (KNN) [14]
KNN is another popular method that uses distance between points to achieve classification, thus making it sensible to high dimensions, so data transformed by PCA is used here. The class of a certain data point is determined by a majority vote of k nearest neighbors" classes. Parameter k is tuned in Figure 8 (the mse here is 10-fold cross validation result).

Decision Tree [15]
Decision tree uses tree-like graph to represent each decision that leads to the last classification or regression result. It uses an iterative logarithm that at each iteration, the tree is split at where the smallest RSS would be produced. The parameter that limited the minimum amounts of leaves produced is tuned to be optimal first. The tree based methods generally handle high-dimensions better, thus we feed the full data. Figure 9 can roughly represent the tree built.

Random Forest Classification
A random forest classifier works in a very similar way as a random forest regression model described in 3.3. Bagging and Random Forest Regression. Same parameters that are tuned in random forest regression are tuned here as well. Because it is a tree based method, we feed in the full data and will also use this method for top gene selection. Figure 10 can roughly represent the model.

Adaptive Boosting [16]
Adaptive boosting is another ensemble method that evaluates the classification error during the process of readjusting weights. In this study, the adaptive boosting is based on decision trees, so we feed full data into it. We tuned the number of iteration.

Support Vector Machine Classification (SVM)
SVM classifier puts data points into p-dimensional space where p is the number of attributes. Then it creates two hyperplanes to separate the space into two regions, thus achieving classification. The argument of SVM is: ξ is the slack variable and C is the penalty of error term i.e. cost. ∅ is the function that maps the original feature space to a higher feature space, thus achieving non-linear decision bounds (in this case, we found radial kernel better than linear). Because of its use of distance among data points, we feed in data transformed by PCA. A plot can represent the first two attributes. J. Biomedical Science and Engineering

Gene Selection
Random forest method, or rather decision tree, provides an insight into feature selection. Trees are built from root to leaves, and the closer a variable is to the root, generally, the more important it is. For all three response variables, optimal parameters for a random forest model are already found. Three separate models are fit using the entire data, and by setting the "importance" parameter to be true, top attributes can be found for each response variable.

Model Evaluation
The regression models are evaluated through squared error and classification methods through error rate and f1 score. All the results below are 10-fold cross validation result, each representing the optimal result of that method. Table 1 shows the regression performance of each method. For recurrence time models, support vector regression on gene data only proves to be the best with a squared error of 76. For survival time, bagging regression on gene data only proves to be the best with a square error of 14. In general, it seems like for regression, gene data produces a better result than that of clinical data or combined data.  Table 2 shows the classification performance of each model. For classification, the best model comes by K-Nearest Neighbor on clinical data only, with an overall accuracy of 88.97% and f1 score of 0.7518. A couple of ROC curves are drawn for a couple of methods for visualization (Figures 11-14). ROC curves show how a classification method works. The larger the area under the curve (AUC), the better is the classification i.e. the closer the curve is to the top left corner, the better the model is. Plots below show the comparisons between the use of only gene data, only clinical data, and combined data. For classification methods such as random forest (Figure 13), for example, using only gene data or combined data proved to be better than using only clinical data. Table 3 shows a case example of our treatment optimization method. Given a patient with specific clinical attributes, represented by "posnodes", and unique gene information, we start by assuming that the patient does not take any treatment, so the inputs for "chemo," "hormonal," and "amputation" are all zero. Then, we test if chemotherapy would work by changing "chemo" to one while keeping other parameters the same. Then using our model, we find the survival time more than doubles. This indicates the effect of chemotherapy on this specific patient without actual clinical attempts. To further investigate, we also change amputation to one, but the survival time drops. This result suggests that combing chemotherapy and amputation may not work the best for this specific patient.

Application
Suppose there are n treatments available, then, by comparing the results of 2 n possible combinations, an individualized optimal treatment plan can be determined for a specific patient at the cost of mere computation power. Table 4 shows the top genes selected using random forest method and its importance parameter. Several genes are found to be biological-studies supported. The most significant ones are: 1) NM_014585 Homo sapiens solute carrier family 40 member 1 (SLC40A1), mRNA. Studies show that SLC40A1 is a family that plays a vital role in causing breast cancer. PIK3CA mutations are closely related with over-expression of several genes involved in the Wnt signaling pathway (WNT5A, TCF7L2, MSX2, TNFRSF11B), regulation of gene transcription (SEC14L2, MSX2, TFAP2B,      NRIP3) and metal ion binding (CYP4Z1, CYP4Z2P, SLC40A1, LTF, LIMCH1). And "PIK3CA, encoding the PI3K catalytic subunit, is the oncogene exhibiting a high frequency of gain-of-function mutations leading to PI3K/AKT pathway activation in breast cancer" [17].
A research studying breast cancer in Chinese women concluded that "Of the studied SNPs, only rs12570116 in ACADSB, rs10902845 in C10orf88, rs4760658 in VDR, and rs6091822, rs8124792 and rs6097809 in CYP24A1 had a nominal association" [18].
SCUBE2 was found to be expressed in invasive breast carcinomas: "In this report, we show by anti-SCUBE2 immunostaining that SCUBE2 is mainly expressed in vascular endothelial and mammary ductal epithelial cells in normal breast tissue. In addition, we observed positive staining for SCUBE2 in 55% (86 of 156) of primary breast tumors." The entire report above shows the role of SCUBE2 in breast cancer and the specific mechanisms of how SCUBE2 influence the outcome of the breast cancer [19].

5) NM_001565
Homo sapiens C-X-C motif chemokine ligand 10 (CXCL10), mRNA. CXCL10 is found to lead to breast cancer: "activation of Ras plays a critical role in modulating the expression of both CXCL10 and CXCR3-B, which may have important consequences in the development of breast tumors through cancer cell proliferation" [21]. 6) NM_003890 Homo sapiens Fc fragment of IgG binding protein (FCGBP), mRNA.
FCGBP is a gene family that has been found to have an independent influence on the progression of carcinoma: "Immunochemistry and clinic pathological results showed that the expression of NT5E and FCGBP in gallbladder adenocarcinoma is an independent marker for evaluation of the disease progression, clinical biological behaviors and prognosis" [22]. 7) NM_003891 Homo sapiens protein Z, vitamin K dependent plasma glycoprotein (PROZ), transcript variant 2, mRNA.
PROZ is found to be in support of the spread of tumor cells: "In line with our functional findings, PROZ expression has been observed in several human cancers, suggesting that the PROZ/ZPI complex might support the invasion and metastasis of tumor cells" [23]. 8). NM_002426 Homo sapiens matrix metallopeptidase 12 (MMP12), mRNA. MMP family members usually represent the host response to the tumor of breast cancer: "These results indicate that there is very tight and complex regulation in the expression of MMP family members in breast cancer that generally represents a host response to the tumor and emphasize the need to further evaluate differential functions for MMP family members in breast tumor progression" [24].

DISCUSSION
This study stands out in its comprehensive understanding of cancer genomics. Not only could the result help precisely predict breast cancer survivability, but it can also predict recurrence and survival time. Aside from modeling, the study identifies certain genes that could further help with clinical prognosis. The methodology used in this study can be applied to more disease learning, especially those with high dimensional data such as genomics, to not only achieve better clinical prediction but identification of significant attributes of the patient that worth biological researches.
This research, most importantly, demonstrates an application of the core idea of big data to the medical field. For more diseases, even those beyond cancer scope, such methodology of modeling and exploration can be applied. For example, given a lung cancer data in a format similar to what we used in this study, we could develop such treatment optimization and gene selection system using the methods discussed. Choices of clinical attributes, such as gender of the patient and some basic physical measurements, may be specified according to lung cancer; along with gene attributes, a gigantic scale of parameters can be provided for a single patient. The training dataset should also include several possible treatments for lung cancer as attributes, and several clinical outcomes as response variables. Once the best model is developed using methods discussed in this research, it can be used to determine the best combination of treatments to achieve optimal clinical outcomes (6.3. Application).
In the recent years, similar efforts were made to develop customized cancer treatments. The idea, basically, is to investigate the patient's gene and how it is correlated with the cancer. Then, they develop medicines that targeted these genes specific for this certain patient. Nevertheless, there are certain limitations to this method. The idea requires meticulous lab work and investigation, which makes it time consuming and expensive. Also, though we have learnt some biomarkers and cancer-related pathways, cancer genomics are not fully understood yet, let alone disease genomics for many other diseases. At current stage of cancer research, this method inevitably introduces error.
Our research, however, is more feasible. With a larger data, ideally a sample of at least tens of thousands patients (which should not be hard to gather given that millions of people suffer from cancer every year) that is collected over several years, our model can be more reliable and consistent in its prediction, thus reducing error as the machine learns. Also, the dependent variables (the metric we use to evaluate a treatment) can include, for example, side effects. Hormonal therapy can increase vein clot, so if we use such result as a dependent variable, the patient can have an insight into how much side effect certain combination of treatments would bring. Most importantly, once a comprehensive model is built, all it costs is computation power. Therefore, it is cheap and accessible for everyone. This system can give patients an optimal combination of treatment seconds upon diagnosis, helping them make better decisions and avoid unnecessary cost.
There are three main ways through which we may increase the applicability of the research and finally achieve the goal described above. First, more advanced machine learning or deep learning method are always possible to improve the model result. Second, the bigger the data is and the longer period over which the data is collected, the more reliable the model will be since less bias will be generated. Finally, several attributes (gene or clinical) and treatments may have correlations among themselves. If they can be understood thoroughly, the attributes representing them can be pre-processed to avoid collinearity. For instance, if a gene is biologically proved to be closely related to certain disease, the weight of that gene input can be increased; if certain medical treatment targets certain gene, some transformation of the two attributed could be processed before modeling. Through this vertical improvement of the result of certain application (such as the one discussed in this study), and horizontal expansion that includes more and more medical areas (generalization of the idea proposed by this study), future hospitals can provide a comprehensive and systematic procedure to present the patient with crystal clear choices and possibilities.