A distribution pattern assisted method of transcription factor binding site discovery for both yeast and filamentous fungi

Transcription factors (TFs) are the core sentinels of gene regulation functioning by binding to highly specific DNA sequences to activate or repress the recruitment of RNA polymerase. The ability to identify transcription factor binding sites (TFBSs) is necessary to understand gene regulation and infer regulatory networks. Despite the fact that bioinformatics tools have been developed for years to improve computational identification of TFBSs, the accurate prediction still remains changeling as DNA motifs recognized by TFs are typically short and often lack obvious patterns. In this study we introduced a new attribute-motif distribution pattern (MDP) to assist in TFBS prediction. MDP was developed using a TF distribution pattern curve generated by analyzing 25 yeast TFs and 37 of their experimentally validated binding motifs, followed by calculating a scoring value to quantify the reliability of each motif prediction. Finally, MDP was tested using another set of 7 TFs with known binding sites to in silico validate the approach. The method was further tested in a non-yeast system using the filamentous fungus Magnaporthe oryzae transcription factor MoCRZ1. We demonstrate superior prediction reranking results using MDP over the commonly used program MEME and the other four predictors. The data showed significant improvements in the ranking of validated TFBS and provides a more sensitive statistics based approach for motif discovery.


INTRODUCTION
Transcription factors (TFs) are proteins containing one or more DNA-binding domains, which bind to specific DNA sequences to activate or repress the recruitment of RNA polymerase, thereby up or down regulate transcription of downstream genes [1,2].In fungi, TFs bind to either enhancer or promoter regions, usually 1 -1500 bp upstream to the ORFs they regulate [3][4][5].
Understanding networks of transcriptional regulation is one of the most challenging yet important tasks in genome analysis.Transcription factors function by binding to recognition sites in gene regulatory regions, which are generally degenerate motifs of 5 -15 base pairs [6].Extensive research has focused on identifying transcription factor binding sites (TFBSs) by biological validation.Nevertheless, experiments identifying TFBSs are usually time-consuming and laborious, which made it difficult to de novo discover TFBSs without any candidate motifs and thus left the binding sites of most transcription factors unclear [7].Therefore, prediction of potential TFBSs utilizing bioinformatics approaches has become an essential tool to explore gene regulation networking.
Different approaches have been tried and applied to discover novel TFBSs over years, which could be put into two categories.Prediction tools in the first category search for the over-represented motifs in a given set of sequences-usually promoter regions of co-regulated genes or ChIP-chip/ChIP-seq identified binding regions.This strategy has been used widely since it does not require additional information other than the sequences.In the searching process, TFBSs could be treated as either a Position Weight Matrix as in MEME [8] and GLAM [9], or consensuses as in SMILE [10] or Weeder [11], when heuristic approach was applied and exhaustive enumeration was avoided.Top candidates after scoring and sort-ing are then represented in the form of a position frequency matrix (PFM) [12], or a position weight matrix (PWM).However, application of this approach was limited by the computational speed and scale of input data, thus could hardly be applied on the thousands of sequences generated by ChIP-seq.As a result, new tools such as DREME [13] and WordSeeker [14] were designed specifically for large input dataset, which utilize new mapping algorithms and multi-nodes computation platform to significantly increase speed.Another way to search TFBSs focuses on biological information such as evolutionary conservation.This strategy is based on the assumption that the conserved non-coding regions across related species are likely to be under negative selection force and thus contain functional motifs.Due to the fast increase of available genome sequences, this method has been developed quickly in recent years, and generate lower rate of false-positive compared to other methods.[15,16].To date, various software have been developed to analyze possible binding motifs, however, the multifaceted biochemical interactions between proteins and DNA may easily lead to false-positive results and make theoretical predictions of TFBSs error prone [17].
Yeast transcription factors were used in this study as training and testing data sets.Saccharomyces cerevisiae is the most widely used and most well studied yeast species, whose genome serves as one of the most thoroughly analyzed to date [18].Recently, multiple yeast databases have been built and published, including two transcription factor databases used in this study [19,20].
To expand this method to filamentous fungi, the MoCRZ1 transcription factor from Magnaporthe oryzae, a fungal pathogen that cause severe rice blast disease worldwide, was included [21].MoCRZ1 is a C2H2 zincfinger type transcription factor activated by calcineurin dephosphorylation and functions as a mediator of calcineurin signaling [22].In 2010, Kim et al. identified the binding sequences of MoCRZ1 by applying both ChIPchip and microarray approaches, reporting three binding motifs predicted from the bound sequences [23].
In this study, we developed a strategy to improve the predictions made by MEME, using the transcription factor binding motif distribution pattern (MDP) information.Twenty-five well-studied TFs with their previously validated binding motifs were selected to form a training dataset.Although similar spatial distribution of yeast TFBSs has been reported [24], we focused on TFBS prediction improvement by using a novel MDP approach.

General Distribution Frequency Curve
As showed in the analysis pipeline (Figure 1), we checked all the 113 TFs with their 301 documented binding motifs from the yeast transcription factor databases "YEASTRACT" [19].Then in the filtering step 32 TFs which have more than 50 documented regulated genes were chosen.Some TFs had multiple binding motifs but similar to each other, so we referred to the positional-weight matrix provided by the database "JAS-PAR" [20] and merged those similar motifs to generate a consensus motif sequence (details in methods).In total, 32 TFs with 63 target motifs were included in the analysis dataset (Table 1).25 TFs with 37 validated binding motifs were randomly chosen to build the training dataset (Table 2), and the other 7 TFs formed the testing dataset.For each motif, their occurrence locations in the validated regulated gene models were scanned in 1000 bp upstream of transcription start site (TSS).Next, a general distribution curve was drawn from the average distribution of all motifs (Figure 2).It was observed from the curve that the lowest frequency was 2.3% at a region −50 bp to 0 bp, while the highest frequency was 9.2% at a region −200 bp to −150 bp.A peak was observed from the region −275 bp to −100 bp, with center at about −200 bp.This distribution pattern was similar to those from previous studies in yeast [24,25] and human [26].

Estimating Reliability of Motif Prediction
It was assumed from the pattern of the distribution curve, that majority of transcripti n factors worked as short o  distance cis-elements binding specifically at −300 bp to −100 bp-a region referred to as the "PR" (peak region), but not at the other two regions: −150 bp to 0 bp regarded as the "NBR" (non-binding region) where most transcription initiation complexes bind; and −1000 bp to −250 bp regarded as the "DR" (distal region).To quantify pattern fitness, a DP (distribution pattern) value was introduced to estimate fitness of any TF binding motif to the general frequency curve.The DP value was generated from shape of the general distribution curve and an assumption: a "true" binding motif should occur more often in the PR, but not NBR or DR, while a random over-represented motif sequences may not have any specific distribution preference in the 1 KB upstream region.So if you compare the average occurrence/frequency of PR and that of NBR and DR, the former should be higher in a "true" binding motif and a larger difference represents a more reliable prediction.We thus proposed a formula to calculate the DP value which was expected to be close to zero in a random motif.A motif with DP value zero or negative then has lower possibility to be the biologically "true" TF binding motif.The DP value was calculated as the following formula: DP value = (Average PR Occurrence × 2) − (Average NBR Occurrence) − (Average DR Occurrence).

Testing
Seven TFs were involved to test if the utilization of DP value could assist in improving TFBS predictions.Four public motif finding tools were included in the performance comparison: MEME (published in 1994), MDscan (published in 2002), WordSeeker (published in 2010), and DREME (published in 2011).These four tools all search for statistically over-represented motifs in a given sequences set.MEME uses the expectation maximization to fit a two-component finite mixture model to the input sequences, and multiple motifs are found by probabilistically erasing the occurrences of the top motif and then repeating the process [8].MDscan combines the advantages of two motif search strategies: position-specific weight matrix updating and word enumeration to enhance the success rate [27].DREME [13] and Word-Seeker [14] were developed in recent years and specifically designed to process large size of ChIP-chip/ChIP-Seq datasets on scalable analysis platforms.
For each TF, the 1000 bp upstream sequences of their documented regulated genes were firstly selected as input into MEME for consensus motif search with the target motif length parameter set from 5 bp to 9 bp.The top ten consensus motifs in the results were then processed to calculate their OR and DP value.
The OR (over-represent) value, or the observed: expected frequency ratios (O/E) descripted and utilized in a previous study [28] reflects the statistical over-represent of these consensus motifs.OR value was calculated as following formula, where O refers to the overall occurrence of a motif across the 1KB upstream sequences set, "ln" is the natural logarithm, and Eo represents the expected occurrence of that motif:  [30].The remaining 6 consensus motifs were not Fkh1 documented binding motifs (Figure 3, Table 3).average and standard deviation of both value is close to each other.The Re-rank value was calculated as following formula:

Re-rank value = OR value + DP value × 1000
The MDP approach utilize both the over-representation information-measured by OR value, and the distribution pattern information-measured by DP value.Since both values represent the reliability of the motif prediction-higher value represents higher reliability-a re-rank value was introduced to combine both values so the "true" motifs with over-represented occurrence and distribution pattern fitting the general curve will obtain higher value and thus be picked up from all the candidates.To even the contribution of the two values, we checked the OR and DP value in all the training dataset, and decide to amplify DP value by 1000 times so the After re-ranking based on the Re-rank value, the motif originally in the 1st place ranked by MEME showed a negative DP value and thus dropped to 5th in the MDP rank, since its distribution pattern showed little similarity to the general TF frequency curve, while the two documented Fkh1 target motifs were raised from 2nd/3rd to 1st/2nd.Since these two target motifs are reverse complimentary to each other, we recorded the rank change as 2nd in MEME and 1st in MDP.
Same as descripted in "Fkh1", the upstream sequences of the 7 TFs were input into MDscan, WordSeeker and   4.Among the 5 target motifs, MDP predicted four motifs as the 1st and one motif as the 2nd in rank.While MEME and DREME only predicted two target motifs as the 1st in rank; MDscan and WordSeeker each failed to predict two target motifs in the top ten results.Some repeat-like sequences were noticed in MDscan and WordSeeker results, indicating their detections were somehow disturbed.Overall, MDP generated a better rank for the target motifs compared to other four tools.MoCRZ1, a transcription factor involved in Ca 2+ /Calcineurin signaling in Magnaporthe oryzae, was also used to estimate this MDP approach.Recently, three binding motifs (TTGNTTG, CAC[AT]GCC, TAC[AC]GTA) of MoCRZ1 in M. oryzae were predicted from 106 binding sequences discovered by ChIP-chip and microarray methods [23].We tested if the MDP approach could identify the "true" MoCRZ1 binding motifs without the need of ChIP-chip data.From published microarray data [23], 190 genes were picked as predicted MoCRZ1 regulated genes as they all showed a 2 fold or greater expression change between the control and libraries of Ca 2+ deficiency, MoCRZ1 inhibitor added, and the MoCRZ1 deletion.Results of the top ten consensus motifs predicted by MEME from 1000 bp upstream of MoCRZ1 regulated genes with their distribution pattern curves were shown in Figure 4 and Table 5.Three target motifs were originally ranked by MEME as the 1st, 5th, and 6th, after two simple repeats being removed.After re-ranking, these three target motifs went up to 2nd, 4th, and 5th.It was observed that another two motifs ([GT]CTTGGC and TGCCAAG ) which originally ranked at the 3rd and 8th moved up to the 1st and 3rd, also showed a significant rank improvement.These two motifs were next searched in the TOMTOM [31] database and identified as "Rim101" binding motifs, which was reported as a transcription factor involved in a pathway acting in parallel to Crz1 via calcineurin [32].

CONCLUSIONS
In this study, we developed the MDP approach to improve TFBS prediction.Genome-wide TFBS identification is generally challenging with both experimental validation and computational analysis required to refine TFBS predictions.The use of TFBS distribution profiles improves the accuracy of predictions by estimating both the over-representative level of the candidate motifs and their distribution pattern as well.The major originality here is that we are focusing on improving TFBS prediction by utilizing distribution pattern.

METHODS
To select transcription factors, all the 113 TFs from the yeast transcription factor database "YEASTRACT" were checked and the number of their documented regulated genes was counted.Those TFs having less than 50 documented regulated genes were filtered out.We noticed that some TFs recorded in "YEASTRACT" had multiple documented binding motifs, however, regular expression sequences of some binding motifs from the same TF showed high similarity, and thus could be clustered into a single motif.In those cases, the position weight matrix was checked from the "JASPAR" database.If the clustering was supported by the PWM, then the different motifs were merged into a new one.For example, TF "Aft1" had 4 recorded binding motif in "YESTRACT" database: s   YRCACCCR, TGCACCC, GGCACCC, and TGCACCCA, while only one matrix is recorded in "JASPAR".So in the clustering process, consensus binding motif sequences of Aft1 were generated as "TRCACCY".After clustering, each of the target motifs was scanned in the 1 KB upstream TSS region of all genes in the yeast genome to check for number of occurrence.Any motif with less than 100 or more than 2000 occurrence were removed.The remaining 32 TFs were randomly divided into two groups: 25 TFs in training group and 7 in testing group.
List of documented regulated genes of the 32 TFs was downloaded from "YEASTRACT" database.Their 1 KB upstream sequences were extracted from the yeast genome sequence and were used to scan for the binding motif sequences.The 1 KB upstream region was divided into twenty 50 bp windows and motif occurrence in each window was counted.Then the average occurrence of each window was calculated and a general frequency curve was generated.
For each TF in the testing group, the 1 KB upstream sequences of their regulated genes were input into MEME running on a local cluster, to search for consensus motifs with their expected length around 8 bp, as well as into MDscan, WordSeeker, and DREME.Then the top ten consensus motifs reported from MEME were used as queries to define the MDP.

Figure 2 .
Figure 2. Distribution frequency curve of training group.Each block coded by colors represents the frequency of one training motif in each scanning window.The curve was generated as average of all blocks.
OR value = O × ln (O/Eo) Taking transcription factor "Fkh1" for example, the top ten consensus motifs from MEME were originally ranked by P-value which represents the possibility of obtaining this motif solely by chance.The first two motifs AAA[AG]A[AG]AAA and TT[TC][TC]T[TC]TT [CT], were likely to be simple sequence repeats and thus being removed from the ranked results.The 2nd motif [GT]GTAAACAA and the 3rd motif [TCG]TTGTTTAC were reverse complimentary to each other and matched documented Fkh1 motifs [AG][CT]AAACA[AT][AT] [29] and [AG]TAAA[CT]AA

Table 1 .
List of yeast TFs involved in training and testing groups.

Table 2 .
TFs in training group.

Table 3 .
Fkh1 consensus motifs testing result.The two motifs in bold type represent two documented Fhk1 binding motifs.Xbp1 and Gzf3) failed to find validated target binding motifs in the top 10 consensus motifs predicted by MEME or other predictors.Summary of the remaining 5 TFs were showed in Table

Table 4 .
Test results of all testing group."N/A" means the target motif was not found in the top ten candidate motifs returned.

Table 5 .
MoCrz1 consensus motifs testing result.The five motifs in bold type are three MoCRZ1 binding motifs and two Rim101 binding motifs.