Correlation of selected molecular markers in chemosensitivity prediction

Finding effective cancer treatment is a challenge, because the sensitivity of the cancer stems from both intrinsic cellular properties and acquired resistances from prior treatment. Previous research has revealed individual protein markers that are significant to chemosensitivity prediction. Our goal is to find correlated protein markers which are collectively significant to chemosensitivity prediction to complement the individual markers already reported. In order to do this, we used the D’ correlation measurement to study the feature selection correlations for chemosensitivity prediction of 118 anti-cancer agents with putatively known mechanisms of action. Three datasets on the NCI-60 were utilized in this study: two protein datasets, one previously studied for chemosensitivity prediction and another novel to this topic, and one DNA copy number dataset. To validate our approach, we identified the protein markers that were strongly correlated by our analysis with the individual protein markers found in previous studies. Our feature analysis discovered highly correlated protein marker pairs, based on which we found individual protein markers with medical significance. While some of the markers uncovered were consistent with those previously reported, others were original to this work. Using these marker pairs we were able to further correlate the cellular functions associated with them. As an exploratory analysis, we discovered feature selection correlation patterns between and within different drug mechanisms of action for each of our datasets. In conclusion, the highly correlated protein marker pairs as well as their functions found by our feature analysis are validated by previous studies, and are shown to be medically significant, demonstrating D’ as an effecive measurement of correlation in the context of feature selection for the first time.


INTRODUCTION
The success of cancer treatment as well as the severity of the side effects of said treatment is heavily dependent on the sensitivity of the cancerous tissue to chemical treatment.Clinics face a great challenge in predicting treatment success, because chemosensitivity is determined by both intrinsic genomic and proteomic characteristics of the cancer as well as resistances induced through prior treatment.When trying to choose a therapy that will work best for a patient, it is important to evaluate their physical responses to different drugs.Because of this, many studies have been done to improve drug response prediction accuracy.
Data profiling of cancer cells at genomic, proteomic, chromosomal and functional levels has long been used in the analysis of pharmacological sensitivity of the cancer cells [1,2,3,4].A primary source of cancer data in this field is a set of 60 human cancer cell lines provided by the National Cancer Institute (NCI-60) [5].These cell lines have been in use since 1990 and over 100,000 chemical compounds have been tested on them [6].The NCI-60 includes melanomas, leukemias and samples of ovarian, prostate, renal, breast, colon, lung and central nervous system cancers.

Related Works on the NCI-60
One study [7] used protein expression profiles to predict responses to a set of 118 anti-cancer agents with known or experimentally supported mechanisms of action [8].Well known machine learning algorithms such as Random Forest, Nearest Neighbor and Relief were used to make chemosensitivity predictions.One Random Forest based classifier was built for each of the 118 drugs.To measure the significance of their predictions, this study compared the computed predictions against random pre-dictions, which can be measured by a standard P-value.The P-value was the percentage of 1000 random predictions with higher accuracy than the calculated predictions.The study found chemosensitivity prediction accuracies ranging from 50 to 90%, with the vast majority being between 50 and 70%.Every prediction had P-values less than 0.019, and 97 of the predictions had P-values equal to 0.00.
A subsequent study by the same research group used a combination of the previously used proteomic data and new transcriptional data [9].This integrative approach demonstrated its advantage, achieving higher accuracy and statistical significance, with P-values for all 118 drugs less than 0.001, calculated in the same manner as in [7].
A separate study [10] analyzed the correlation between DNA copy number variations, gene expression levels, and chemosensitivies to the same 118 drugs as in [7,9].The analysis indicated that the correlations of gene expression and DNA copy number are particularly evident among leukemias and ovarian cancers.
An additional study [6] used four gene expression datasets, two of which were original to the paper, and one proteomic dataset.These data sets were used to observe the effectiveness of transcript profiling for the prediction of different protein expression levels.In addition, a consensus set selected from the four gene expression datasets was constructed.This consensus set was found to have a correlation to the protein dataset of 65%; a notable percentage that was higher than most reports done with mammalian cells.Further, this consensus dataset was used to predict tissue origin with a higher accuracy than any of its parent datasets.

Feature Selection and Motivation
New technologies in biomedical studies, such as microarrays, have made the analysis of large volumes of complex data a necessity [11].Frequently, a majority of these data contain noise, i.e., features not relevant to a particular task at hand, such as classification of cancer types with gene expression data.
Both studies conducted by [7,9] used Random Forest as a feature selection technique to improve the accuracy of chemosensitivity predictions and to single out protein markers that were particularly important to this task.
In studying the effects of feature selection on chemosensitivity prediction, we observed disparity between expected and observed results.We ranked and ordered all features in the smaller protein dataset used in this study according to the Relief algorithm provided by Weka.We used Random Forest to make predictions based on incrementing feature subsets, using the top two ranked features, then three, four, etc. up to 40 features, as in Figure 1.We observed that contrary to our expectations, some higher ranked features decreased prediction accuracy, while some lower ranked features increased accuracy.
This led us to hypothesize that features contribute to the prediction accuracy collectively, rather than independently.To test this hypothesis, we developed a new technique using the D' measure [12] in order to study the correlations between feature pairs.As a demonstration of the utility of this technique, we apply it to those protein markers found to be significant in [9].

Datasets
Three datasets derived from the NCI-60 were used in our study: two sets of protein data, and one of DNA data.
Protein expression data.The first protein expression dataset had 162 protein markers, hereafter referred to as Protein162, and was created by Shankavaram et al [6] and can be found at http://discover.nci.nih.Gov/datasets.jsp.The second dataset, which contains 52 protein markers (Protein52), available at http://discover.nci.nih.gov/host/2003_profilingtable7.xls., was generated by a study of the proteomic profiles of the NCI-60 [13], and was also used by two studies on chemosensitivity prediction [7,9].
DNA copy number.The DNA copy number variation dataset was presented in a study of the correlation between mRNA and DNA copy number [10].It is available at http://discover.nci.nih.gov/datasets.jsp.
Drug activity data.Our drug resistance information contained activity data from 118 anti-cancer agent activity profiles.They were screened by Scherf et al [8] and recorded using the NCI-60 cancer cell lines.The file containing this data can be found at http://discover.nci.nih.gov/nature2000/data/selected_data/dataviewer.jsp?ba seFileName=a_matrix118&&nsc=2&dataStart=3.
Defining drug sensitivity and resistance.As in [7,9] we used a threshold to define sensitivity to a drug into Figure 1.Random forest prediction accuracy.This plot shows the prediction accuracies of Random Forest using the same protein dataset used in [7,9].The drug on which the prediction was performed was Bisentrene (NSC # 337766).

SciRes Copyright © 2009
JBiSE three categories.A log 10 (GI 50 ) was taken for each cell line to determine sensitivity.Cell lines with sensitivities at least 0.5 standard deviations above the average were given the label 'resistant.'Those with sensitivities at least 0.5 standard deviations below the average were 'sensitive.'The remaining cell lines were defined as 'intermediate' [7,9].

D' Formula
A standard measurement for the correlation between pairs of events i and j in a set of sequences is D', which can be defined by the following formulae: where x ij is the frequency at which both event i and event j occur in a single sequence, p i is the frequency of event i and q j is the frequency of event j.
The D' formula was introduced by Richard Lewontin as a measurement of linkage disequilibrium of alleles at two or more loci on the same chromosome [12].The D' formula has been shown to be a more reliable measurement than other measurements of correlation between pairs of events [14], but this study is the first to use it to correlate pairs of selected features.

Markov Blanked-Embedded Genetic Algorithm (MBEGA)
Genetic Algorithms have been used as a strategy for feature selection [15] due to their ability to generate better feature subsets than other feature selection algorithms.
In some cases, these genetic algorithms are combined with memetic operations in order to fine tune results beyond what would be produced by classical genetic algorithms alone.
One particular implementation of these memetic algorithms is the Markov blanket-embedded genetic algorithm (MBEGA), which uses an approximation of a Markov blanket to reduce redundancy in selected features.Pseudocode for the MBEGA can be found in Figure 2.
In each generation of the algorithm, the MBEGA uses add and delete operations to add and delete features from some of the elite feature subsets in the population; the elite feature subsets are improved by adding important features and removing those that are less important.After the memetic operations, standard genetic algorithm techniques such as linear ranking, crossover and mutation methods occur to generate the next population [16].
The MBEGA was selected in our study for two reasons: First, the MBEGA generates a population of fea-

Markov Blanket Embedded Genetic Algorithm (MBEGA) BEGIN
(1) Initialize: Randomly generate an initial population of feature subsets encoded as binary strings (2) For the number of iterations to run (3) Evaluate all feature subsets in the population based on prediction accuracy (4) Select a number of elite feature subsets from the population to undergo the Markov blanket memetic operations (5) For each feature subset create a set of all present features X and all absent features Y Add operation BEGIN 1) Rank the features in Y according to their correlation to the class label.
2) Select a feature Yi in Y so that the larger the correlation of a feature in Y the more likely it will be picked.
3) Add Yi to X.

Delete operation BEGIN:
1) Order the features in X according to their correlation to the class label.
2) Select a feature Xi in X so that the larger the correlation of a feature in X the more likely it will be picked.
3) Eliminate all features in X which are less correlated than Xi.If no feature is eliminated, remove Xi. thms.The feature subsets from each generation are represented as binary strings, with a 1 representing a present feature and 0 representing an absent feature, to calculate the D' values of our correlation analysis.Second, the MBEGA does not require a predefined number of features to be selected.Rather, the MBEGA gradually optimizes both the size of the feature subset as well as the accuracy of the classifier.

Correlation Analysis
We used the D' formula to calculate correlation bet pairs of features selected in each generation of the MBE-GA.Because the MBEGA begins with a randomized feature subset and becomes more selective as the algorithm progresses, we decided to use only the last 20% of the feature subsets generated.We calculated the D' values for every pair of selected features, using the presence of one feature within the encoded binary sequence as event i and the presence of the other as event j.

Feature Selection Using Weka's Relief Algorithm
are three prim thms: filter, wrapper and embedded algorithms.Filter algorithms have advantages in their speed and scalability, ture subsets in each generation, rather than generating able 1. Highly correlated protein marker pairs in protein52 based on significant chemosensitivity protein markers.The protein Relief, a filter feature selection algorithm implemented in WEKA, was used to assess the features pairs found by our correlation analysis.Relief ranks features by assigning them weights according to their ability to T markers in column one and associated drugs, expressed as their NSC drug numbers, in column two were found in [9].The remaining columns are the ten protein markers with the highest correlation to the protein marker in the first column.The markers notated with a * are those also selected by Weka's Relief algorithm.iscrim ate betw n neig ring pat ns.In iteration of the algorithm, an instance x containing features (x 1 ,x 2 ,…,x n ) is selected randomly, and one nearest neighbor from the same class (called NH) along with one nearest neighbor from a different class (called NM) are found.The weights of the features in x are updated such that they will be greater if x is similar to the NH and dissimilar to the NM, and less if the opposite is true.

RESULTS AND DISCUSSION
To generate sequences for D' analysis, we ra cross validation of the MBEGA on all three datasets.Each fold of the MBEGA ran for 100 generations, with a population of 51.Each fold generated 5100 sequences, of which we used the last 20% generated, or 1020 from each fold.The final number of sequences used for the D' analysis was 10200 for each dataset.

Correlation Analysis of Indi Protein Markers from Previous Study
evious study [9] discovered 18 individual p markers from Protein52 along with their functions, including transcriptional factoring, tumor suppressing, DNA repair, cell adhesion, and apoptosis, among others, that are significant to the prediction of chemosensitivity to 33 of the 118 anti-cancer agents.These drugs represent 12 out of the 15 total mechanisms of action present in the 118 anti-cancer agents, with a large number of them being tubulin active antimitotic agents.In order to investigate the protein markers highly correlated with those found in [9], for each protein marker/drug combination identified there, we found the ten protein markers with the highest D' value.We also sought to validate these pairs using a ten-fold cross validation of the Relief feature selection algorithm provided by Weka, which measures feature significance individually.As seen in Table 1, the highly correlated protein marker pairs our analysis discovered are validated not only by the protein markers reported in [9], but also by Relief.
In er to d over an atterns the corr ation of e protein markers selected in Table 1, we took a frequency count of them, as illustrated in Figure 3.While most protein markers had frequencies within the same range, mostly between 4 and 10, there were some which clearly stood out.In particular, the protein marker CDH2, a cell adhesion protein, is highly correlated with 19 out of the 40 protein markers in Table 1.CDH2 was not selected in the previous study, but is very similar in both function and family to CDH1, which was selected.Another protein with a high frequency, 16 out of 40, was TP53 whose function is tumor suppression and apoptosis.We found that in most occurrences TP53 was paired with protein marker KRT18.Both of these protein markers are involved in protein death, and both were found to be strong chemosensitivity predictors for the drug Taxol in the previous study [9].Lastly, we noticed STAT5A is both from the same family as and is highly correlated with the protein markers STAT1 and STAT6, both of which were highlighted in previous study [9].
We were also interested in observing how the functions of the individual protein markers from [9] correlated with the functions of the highly correlated protein markers found in Table 1.We grouped the previously reported protein markers according to their function, and then selected the protein markers that were most frequently correlated with them.We only included those protein functions which had three or more protein markers associated with them, as in Table 2.
Because we used two protein datasets in this study, we wanted to conduct the same analysis on the Protein162 dataset in order to explore the possibility of discovering new protein markers highly correlated with those found in [9].All but 2 of the previously reported protein markers from Protein52, G22P1 and CCNE, were also present in Protein162, so we used the same protein marker/drug combinations as in Table 1 when generating Table 3 for Protein162.
We also created a selection frequency histogram for   d on signif sensitivity protein markers.The pro markers in column one and associated drugs, expressed as their NSC drug numbers, in column two were found in [9].The remaining columns are the ten protein markers with the highest correlation to the protein marker in the first column.The protein markers G22P1 and CCNE present in Table 1 are excluded here because neither is present in Protein162.The markers notated with a * are those also selected by Weka's Relief algorithm.is dataset when compared to Protein52.We believe this is because the number of unique protein markers in Pro-tein162 was roughly twice that of the unique protein markers in Protein52; however we chose the top ten most correlated pairs in both instances.
Many of the most selected protein markers from Table 1, including CDH2, TP53, and STAT requenci s were ower b oughly factor for th 5A, had only an average or even low frequency in Table 3.We selected 8 protein markers from Table 3 whose average frequencies were above 4.These were KRT18, KLK3, CCNA2, ADNP, MVP, RIPK1, SMARCB1, and ENAH.Because the Protein162 dataset contains protein markers not present in Protein52, we found 5 protein markers which were not reported in the previous study [9].These proteins, as well as their cellular functions and associated drugs can be seen in Table 3.The most frequently selected protein marker from Table 3 was KRT19, a structural protein from the same family as KRT18, a protein marker found to be significant in [9], and KRT20, a protein marker frequently selected in Table 1.KLK3 had and monitor prostatic carcinoma.Members of the KLK family are also thought to be biomarkers for cancers and diseases.CCNA2 has a functional relationship with CDC2, another protein marker with an above-average selection frequency in Table 3. ADNP affects both normal cell growth and cancer proliferation.In addition, ADNP is a transcription factor, a trait held in common with six of the eighteen significant protein markers in [9].MVP is a protein which is over-expressed in multi-drug resistant cancer cells, and is potentially useful as a signal for drug resistance.MVP also bears a functional relation with STAT1, one of the important protein markers reported in [9].RIPK1 is an apoptosis protein related to cell death, much like the KRT18, TP53 and CASP2 found in both the previous study [9] and in Table 1.SMARCB1 functions as a tumor suppressor, but mutations within the protein are associated with rhabdoid tumors.ENAH is a cell adhesion protein which is present in some breast cancers, and may be used as a marker for such.3.This plot shows the frequency with which protein markers from the Protein162 dataset were selected in Table 3.Only those protein markers with an averag quency Tab to th evious study [9].In addition to simply calculating the D' values o pairs, we for each feature based on the D' values of all pairs associated with that feature.We performed this analysis for both protein datasets as well as the DNA166 dataset, using the protein markers as features for the protein datasets, and the DNA copy number variations as features for the DNA166 dataset.As an exploratory analysis, we attempted to use the average D' values to find the trends of feature correlation within and between the mechanisms of action of the 118 anti-cancer agents.
Each dataset generated unique patterns of feature correlation in each of the 118 drugs.We did observe, hower, similar patterns of feature correlation in drugs with poisomerase I inhibitors and topoisomerase II inhibitors have very similar trends of feature correlation to one another in all three datasets.Drugs which alkylate at positions N7 (24 drugs) and O6 (7 drugs) of guanine also have very similar trends of feature correlation, as shown in Figure 5.This implies that related drug mechanisms tend to produce similar patterns of correlation between feature pairs.Our analysis indicated that this is not necessarily true of drugs with similar chemical structure.We also grouped drugs with similar mechanisms into three larger categories: drugs which alkylate at specific positions of guanine (Alkylating), drugs which inhibit topoisomerase (TIM Inhibitors), and all other drugs (Other).In Protein52, we observed that while each of the larger categories carries sms, the averaged D' values of all three of these categories were very similar, with the averages being 0.053 for the Alkylating category, 0.054 for the TIM Inhibitor category, and 0.06 for the Other category.All averaged D' values were positive.
The same analysis of the Protein162 dataset revealed that while the averaged D e Other categories where very similar, with average D' values -0.0348 and -0.0345 respectively, the TIM In-For DNA166, all three curves have distinct averaged D' values, with the Alkylating category having an average D' of -0.0382, the Other category an average of -0.0286 and the TIM Inhibitors category an average of .0240.Again, all D' values were negative.These plots, available in Figure 6, illustrate the benefit of using multiple datasets in this type of study.While all of our datasets are based on the NCI-60, they each provide unique insight into physical res e cell lines to the 118 anti-cancer drugs.If we were only using the DNA data, we might be tempted to claim that these three mechanism categories produce distinctly different D' values, whereas if we were only using the data from Protein52, we might claim the opposite.It is only when a wide range of data are used in study that a holistic understanding of the effects of these 118 drugs becomes possible.

CONCLUSIONS
We found that each of our datasets provides a unique insight into the analysis of feature correlations in the ity of the NCI-60 cancer lines.52, DNA166) have been u K., Tanab in previous studies for the prediction of chemosensitivity of cancerous cells, and though Protein162 was novel to this topic, we have shown that both Protein162 and Pro-tein52 contain a number of protein markers that are both medically significant and highly correlated to individual protein markers highlighted in previous study [9].
In addition, we have shown D' to be an accurate measure of correlation in the context of feature selection for the first time.

Figure 4 .
Figure 4. Frequency of protein162 protein markers present in Table3.This plot shows the frequency with which protein markers from the Protein162 dataset were selected in Table3.Only those protein markers with an averag quency

e fre above 2
are shown due to limited space.

le 4 .Figure 5 .
Figure 5. D' patterns categorized according to mechanism of action.The drugs which alkylate at N7 and O6 produce similar patterns of D' values in each of the three datasets.For readability, the curves displayed are m average
They also do not interact with classifiers, which is both an advantage, because they can select features independently, and a disadvantage, because they are unable to take the classifier into account when determining the feature subset.Wrapper algorithms, on the other hand, do interact with the classifier, and are therefore able to produce more informative feature subsets.They are also less prone to local optima.They are, however, computationally inten-se, and have a higher risk of over fitting.Embedded algorithms are built directly into a classifier.As such, they are able to interact with the classifier in the same manner as wrapper algorithms, but are far less computationally intense.
JBiSEhowever they ignore feature dependencies.

Table 1 . SciRes Copyright © 2009 JBiSETable 2 . Corre Protein Markers Functions of Correlated Protein Markers lation
of the functions of protein markers in Table1.

Table 3 .
Highly correlated protein marker pairs in protein162 base icant chemo tein