Predicting DNA methylation status using word composition

Background: DNA methylation will influence the gene expression pattern and cause the changes of the genetic functions. Computational analysis of the methylation status for nucleotides can help to explore the underlying reasons for developing methylations. Results: We present a DNA sequence based method to analyze the methylation status of CpG dinucleotides using 5bp (5-mer) DNA fragments – named as the word composition encoding method. The prediction accuracy is 75.16% when all 5bp word compositions are used (totally 4 = 1024). Furthermore, 5-bp DNA fragments/words having the most impact on the methylation status are identified by mRMR (Maximum-Relevant-Minimum-Redundancy) feature selection method. As a result, 58 words are selected, and they are used to build a compact predictor, which achieves 77.45% prediction accuracy. When the word composition encoding method and the feature selection strategy are coupled together, the meaning of these words can be analyzed through their contribution towards the prediction. The biological evidence in the literature supports that the surrounding DNA sequence of the CpG dinucleotides will affect the methylation of the CpG dinucleotides. Conclusions: The main contribution of this paper is to find out and analyze the key DNA words taken from the neighborhood of the CpG dinucleotides that are inducing the DNA methylation.


BACKGROUND
DNA methylation is an epigenetic modification that typically occurs on cytosine of CpG dinucleotide, during which the cytosine is transferred to 5-methylcytosine by DNA methyltransferase.In human genome, unmethylated CpG dinucleotides are normally clustered together in a region called CpG island which is highly associated with gene promoters [1].Methylation of CpG inhibits the expression of the downstream gene by firstly preventing some DNA-binding factors from recognizing their binding sites [2,3], and secondly by captivating some proteins that recognizing the methyl-CpG [4,5] to elicit the DNA's repressive capacity.When CpG dinucleotides are methylated, it may affect the transcription and cause diseases [6].DNA methylation patterns are altered in many cancer cells [7][8][9] since the tumor suppression genes are silenced by the DNA methylation.A study in [10] shows that CpG methylation may veil the presence of virus from cytotoxic T-cell immune surveillance and cause viral tumorigenesis.Berretta esophagus, Helicobacter pylori gastritis, inflammatory bowl disease and viral hepatitis are also thought to be associated with CpG methylation [11].It is not possible to predict the DNA methylation status purely from the DNA sequence because the DNA methylation status is affected by many outer factors, i.e., the same sequence may have different methylation status due to these factors.A computational approach will be very hard to be devised to analyze the DNA methylation status affected by the outer factors, e.g.DNA methylation caused by heat stress [12], by overproduction of DNA cytosine methyltransferases [13] and hepatitis B virus [14].However, using computational tools, we are able to analyze what DNA sequence environment is more feasible for inducing methylation caused by those factors.
The last three authors contributed equally to this work.
Machine learning method is used to predict DNA methylation status in our study.The DNA sequence, containing the CpG dinucleotides, is truncated into a 1000bp DNA sequence with the CpG dinucleotides being the center of the sequence.Instead of raw binary encoding approach [15], a 5bp (5-mer) word composition method, which slices the 1000bp sequence into 996 5bp-words, is used as the input data for the predictor.Nearest neighbor classification algorithm [16] is adopted as the predictor, which achieves an accuracy rate of 75.16%, evaluated by 10-fold cross validation.These 5bp DNA fragments (words) are analyzed by a feature analysis method -the mRMR algorithm [17], and 58 most important words are identified.The literature evidence shows that these 5bp words have other significant roles in their biological functions, e.g.some of them are modifying sites and binding sites of enzymes and some are binding motifs of some transcription factors.Since these 5bp words are important for DNA methylation, they are probably associated with the binding of DNA methyltransferase.Using these 58 words instead of all the words, the prediction accuracy increases from 75.16% to 77.45%.This indicates that the predictor suffers from over-fit problem when all words are used for the prediction.
An online web server for the predictor used in this study is freely available at http://pcal.biosino.org/.

RESULTS
Firstly, all 1024 features were used to distinguish the methylated CpG dinucleotides from unmethylated ones.Based on the 10-fold cross-validation test, we got 75.16%prediction accuracy rate.The accuracy rate is shown in Figure 1 as the red dashed line.The top features, generated by the MaxRel (Max-Relevance) algorithm, are input into the predictor.The prediction accuracy curve, in increasing number of features, is shown in Figure 1 as the grey curve.Compared with the prediction accuracy using the total 1024 features, no significant improvement is achieved from the mRMR features.Thus a backward sequential feature selection is applied to extract a compact feature subset to improve the prediction accuracy.The black curve in Figure 1 shows prediction accuracy curve obtained from the backward feature searching algorithm.The prediction curve is smooth indicating that the prediction performance is stable using backward feature searching method.The highest prediction accuracy occurs when 58 features are included, achieving an accuracy rate of 77.45% (see Table 1).
The result of mRMR feature selection program contains two lists of features (see supplemental material l).The first part lists 500 MaxRel features in a descending order, and the second part lists the mRMR features in the feature selection order.Since the MaxRel features indi-The highest accuracy 77.40% is achieved using backward feature selection when 58 features are included.The red dashed line indicates the 75.16% prediction accuracy obtained by using all 1024 features, and the blue dashed line indicates the 73.23% prediction accuracy obtained by the BLAST method.cate how well each 5bp word contributes to the prediction, these words may have other significant biological functions besides the DNA methylation prediction.By manually searching over the internet, we found that more than half of the top MaxRel features have other significant roles in biological function.This may indicate that these 5bp words are important in binding with many enzymes including the methyltransferase.For example, it is reported that both methylases (LlaDII and Bsp6I R/M) have two recognition sites (5'-GCGGC-3' and 5'-GCCGC-3') [18].DNA methyltransferase (methylase) FauIA (of the restriction-modification system FauI from Flavobacterium aquatile)'s recognization site is 5'-CCCGC-3' [19].Our study offers a perspective to find a connection between the DNA sequence fragments and the methylation mechanism.

CONCLUSIONS
We introduce a DNA word encoding method for the analysis of the DNA methylation status.The most important words inducing the methylation are found using mRMR and backward feature selection methods.Some of these words are the recognition sites for methylases, while most words' biological role in methylation still remains unknown.These words should be paid with more attention when the biologists investigate the mechanism of methylation.The length of the words is set to be 5 as the length is a bit longer than the 3-length DNA words for the translation of the proteins, and the number of the variation of the words can be coped easily with by the feature selection methods.A future research could be

Dataset
The DNA methylation status data, used in this study, were originated from the Human Epigenome Project (HEP) website (http://www.sanger.ac.uk/PostGenomics/ epigenome/,Release26thJun2006) [20].These data were determined through experiments [6,20].Current data release (26 th Jun 2006) contains about 1.9 million CpG methylation values, assigned by analyzing the 2,524 amplicons from 4 chromosomes and 12 different tissues.According to [20], in more than 80% cases the methylation status of the same locus is consistent among the 12 different tissues.Especially in cell types like CD4 + and CD8 + lymphocytes, the difference level of methylation status is as low as 5%.Therefore we investigate methylation data derived from CD4 + lymphocytes in this contribution and ignore the minor variances across the tissues and cell types.The methylation scores of CpG sites reported by HEP range from 0 to 100, indicating the degree of methylation from null to full scale.CpG sites with the high scores (between 90 and 100) and low socres (between 0 and 10) were assigned to be the methylated and unmethylated ones respectively.As a result, 26397 CD4 + lymphocytes specific CpG methylation records were collected, including 11345 methylated CpG sites and 15052 unmethylated ones.
Why does DNA methylation occur on some CpG sites whereas not take place on other ones?From the perspective of DNA sequence, flanking sequences of various CpG sites are thought to encode hints of this problem.An early study [21] showed that a flanking sequence size of 800bp is optimal for methylation status determination.We focus on flanking sequences around CpG sites with total length 1000bp, exactly 499bp nucleotides upstream and 499bp nucleotides downstream of the CpG site.Comethylation occurs over short distance (≤ 1000bp) [20], which may cause the flanking sequences of the neighboring methylation sites overlapped.To avoid this, similar sequences were filtered using the homologous sequence alignment program CD-HIT [22].Finally, a sequence set with no more than 80% sequence similarity between any pair of them is obtained.The dataset comprises 1994 flanking sequences of methylated CpG sites and 1585 unmethylated sequences.The human genome data of chromosome 6, 20 and 22 were downloaded from Ensembl ftp site (ftp://ftp.ensembl.org/pub/release-25/human-25.34e/data/fasta/dna).

Word Composition Based Encoding Approach
Given a piece of DNA sequence without other transcendent knowledge such as the genome location and the gene context, how can the DNA sequence provide information to infer the CpG methylation status?An intuitive thought would be to use the difference between the DNA sequences to determine the methylation status.However, though the differences can be located relatively easy, the analysis afterwards will be a rather complex task.In this study, we encode the long sequence into short pieces so as to easily analyze the DNA pieces using feature analysis method.We fix the length of each piece to be 5 bp, thus combining the 4 different nucleotides into a length of 5 will give 1024 (= 4 5 ) variations/compositions. Notice that a 1000-length sequence will result in 996 words by sliding the sequence from the first nucleotide to the 996 th .The prediction data is built by counting the occurrence of each composition among the 996 words.Thus, each DNA sequence will result in a 1024-dimensional vector.The encoding approach can be summarized as follows: 1) each 1000bp DNA sequence is sliced into 5bp nucleotide fragments by the shift-byone cut; 2) a 1024 dimensional vector is built by counting the occurrence of each composition of the 5-length words appearing in the sliced 5bp fragments.For example, if "AAGTT" is the th i component of the 1024D vector and the 996 fragments contain n "AAGTT" fragments, the th i component of the corresponding vector is set to be n ( n can be null).
The next step is to apply the mRMR (Maximum-Relevant-Minimum-Redundancy) [17] and backward feature selection methods for the searching of the prominent words.

The mRMR and Backward Feature Selection Wrapper
The mRMR (minimum-redundancy-maximum-relevance) framework aims at maximizing the relevance between the to-be-selected feature set S and the target class c and at the same time minimizing the redundancy between the to-be-selected features [23].The maximum relevance criterion (Max-Relevance) drives to search for features that are maximizing the mutual information between S and c : where ( ; ) i I x c is the mutual information between feature i x and c , and 1024 m  is the total number of features.
The minimum redundancy criterion (Min-Redundancy) is formulated as: which is used to minimize the mutual information of all couples of the features.The mRMR method combines the Max-Relevance and Min-Redundancy as: to optimize D and R simultaneously.
An incremental search procedure is adopted to implement the mRMR algorithm.Suppose we already have a feature set The first feature is selected through the Max-Relevance criterion using Formula 1.The rest features are gained using Formula 4.
Since mRMR method only performs pre-selection role in the feature selection, these pre-selected features need to be refined using a more accurate yet more timeconsuming feature selection method.Backward sequential selection scheme is adopted here to do the refinement.It works by excluding one feature each time from the current feature set k S .Initially, k S is the preselected feature set resulted from the mRMR method.Each feature is in turn excluded from k S and the prediction accuracy is obtained using the rest 1 k  features by the KNN predictor, i.e., there are k different configurations in obtaining the 1 k  features.The con- figuration that achieves the highest prediction rate is chosen as the next selected feature set 1 k S  .If multiple configurations lead to the same accuracy, one of them is chosen randomly.This decremental selection procedure is repeated until one feature is left.The optimal feature subset can be chosen by selecting the feature subset that performs the best prediction.

COMPETING INTERESTS
No competing of interests.

AUTHORS' CONTRIBUTIONS
LL, KL, HL, YC and YL proposed the research topics, developed the methods, designed the experiments, analyzed the data, and wrote the paper.LL and KL implemented the experiments.

Table 1 .
The final 58 features selected.