In silico tests on sequence motif significances for human tissue specific genes

Identification and analysis of tissue-specific (TS) genes and their regulatory activities play an important role in understanding the mechanisms of the organism, disease diagnosis and drug design. Although these mechanisms are not yet fully understood, the sequence features of TS genes are becoming an important clue. In this paper we used an integrated pipeline to discover sequence motifs in the promoter regions of TS genes. To test the significance of those motifs in a specific tissue, we used three hypothesis testing approaches: a Bayesian hypothesis test, a binomial distribution test and the traditional z-test. From 3954 TS genes across 83 human tissues, the three tests identified 2784, 1204 and 703 significant motifs respectively out of the 3244 motifs obtained in the discovery phase. 52.7% of those motifs can be found in publicly available databases.


INTRODUCTION
Identification and analysis of tissue-specific (TS) genes and their regulatory activities play an important role in understanding the mechanisms of the organism, disease diagnosis and drug design [1]. In recent years, many research projects have studied the expression and regulatory mechanisms of TS genes, including transcription factors and their binding sites, sequence features of promoter regions [2], alternative splicing [3] and epigenetic features [4] of those genes.
Although the mechanisms of gene tissue specificity are not yet completely clear, the sequence features of TS genes are becoming an important clue [2]. P. FitzGerald et al. calculated the statistics of Simple Sequence Repeats (SSR) and found that SSRs could be an important factor in tissue specificity [5]. F. Song et al., using mouse genome data, pointed out that methylation changes during development are dynamic, involve both demethylation and methylation, and may occur at late stages of embryonic development or even postnatally [6]. C. Heber et al. showed that nucleosome rotational setting is associated with transcriptional regulation in promoters of tissue-specific human genes [4].
With the completion of the Human Genome Project, various algorithms have been developed for discovering patterns or motifs in large volumes of genome sequence. These algorithms typically include three phases: motif searching, redundant motif pruning and motif significance testing. The methods for motif discovery may be grouped into two categories [7]: enumerative methods and alignment-based methods. Enumerative methods typically involve exhaustive enumeration of words up to some maximum size in a dataset, and are thus best suited to consensus sequence motif models; examples are Consensus, PROJECTION and PDEM. Alignment methods take a wide variety of forms, but often involve the development of a probabilistic model of the observed sequence data and optimization to find motifs common to all input sequences; examples are the MEME program [8], the expectation-maximization (EM) algorithm and Gibbs sampling [9]. Each algorithm has its own advantages on particular species or datasets. Tompa et al. [7] compared the performance of 13 different motif finders on a variety of real and synthetic sequence sets covering a range of genomes. A common practice is to apply several such algorithms simultaneously to improve coverage at the cost of increased redundancy [10].
In this paper, we first applied an integrated motif searching approach to find motifs for TS genes. To our knowledge, this is the first search for sequence motifs in tissue-specific genes. We then merged similar motifs using the method in [7]. To test the significance of those motifs in each tissue, we used three hypothesis testing methods: a Bayesian hypothesis test, a binomial distribution test and the traditional z-test. We also distinguish two kinds of significant motifs: tissue rich motifs (TRM) and tissue even motifs (TEM). The former are motifs that show significance in only a few tissues, and the latter are motifs that show significance in most of the tissues. From 3954 TS genes across 83 human tissues, the three tests identified 2784, 1204 and 703 significant motifs respectively out of the 3244 motifs obtained in the discovery phase. 52.7% of those motifs can be found in publicly available databases.

Data Preparation
Tissue-specific genes were obtained mainly by querying the tissue-specific gene expression database TiGER [11] by tissue name. Some of them came from the TisGED [12] database. All of the TS genes with PubMed IDs were used in the experiment. We finally obtained 3954 human tissue-specific genes across 83 human tissues. The promoter sequences of these genes were downloaded from DBTSS [13] and EPD [14]. A promoter region of 1500 bp (−499 bp to 1000 bp around the TSS) was used for motif searching.

Motif Searching
In this phase, we integrated three motif searching programs: MEME, AlignACE and Gibbs Sampler. The length of candidate motifs was restricted to 6-12 bp, and all other parameters were left at their default settings. This phase yielded 6794 motifs.

PWM Representations of Motifs
Since different motif search programs output motifs in their own formats, we have to define a uniform motif format in order to compare motif similarities in the motif merging phase. A commonly used representation is the Position Specific Weight Matrix (PWM or PSWM) [15], which is a matrix of nucleotide frequencies at each position of the motif (i.e. the frequencies of the nucleotides A, C, G and T at each position). We transformed all the motifs to the PWM representation.
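As an illustration of the PWM representation described above, the following minimal sketch builds a frequency matrix from a set of aligned binding sites. The function name `build_pwm`, the optional `pseudocount` parameter and the example sites are hypothetical and are not taken from the paper's pipeline.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=0.0):
    """Return one {base: frequency} dict per motif position.

    `sites` is a list of equal-length aligned sequences (an assumption of
    this sketch; each motif finder has its own output format).
    """
    if not sites:
        raise ValueError("need at least one aligned site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(width):
        counts = Counter(site[i].upper() for site in sites)
        total = sum(counts[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        if total == 0:
            raise ValueError(f"no A/C/G/T observed at position {i}")
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

# Hypothetical aligned sites, for illustration only.
sites = ["ACGT", "ACGA", "ACCT", "TCGT"]
pwm = build_pwm(sites)
for position, column in enumerate(pwm):
    print(position, column)
```

Each column sums to 1, so motifs reported by different programs become directly comparable once they are converted to this form.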

Motif Merging
In the motif merging phase, we used a method similar to that in [16] to remove motif redundancies. Because this step is not the focus of this paper, we omit the details of the merging process. After motif merging, 3244 motifs were obtained.

Motif Tissue Significance Testing
To identify whether a motif is really related to tissue specificity, we statistically distinguish two kinds of motifs: tissue rich motifs (TRM) and tissue even motifs (TEM). The former are motifs that show statistical significance in fewer than 3 tissues, and the latter are motifs that show statistical significance in more than 70 tissues. We used hypothesis testing approaches to test the significance of motifs in each tissue. To carry out a hypothesis test, the distribution of motifs in a given sequence must be estimated. Therefore, a key step is to calculate the statistic of a motif in a given sequence.
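The TRM/TEM distinction above can be written as a small rule. This is only a sketch: the thresholds 3 and 70 come from the text, while the function name and the labels for motifs that fall outside both classes are assumptions made here for illustration.

```python
def classify_motif(n_significant_tissues, rich_below=3, even_above=70):
    """Classify a motif by the number of tissues in which it is significant.

    TRM: significant in fewer than `rich_below` tissues (but at least one).
    TEM: significant in more than `even_above` tissues.
    The other two labels are hypothetical names for the remaining cases.
    """
    if n_significant_tissues <= 0:
        return "not significant"
    if n_significant_tissues < rich_below:
        return "TRM"
    if n_significant_tissues > even_above:
        return "TEM"
    return "unclassified"

# Hypothetical counts of significant tissues for four motifs.
for count in (1, 2, 40, 75):
    print(count, classify_motif(count))
```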
For a given motif $m$ of length $w$ discovered in tissue $T_0$, our purpose is to judge whether its occurrence in tissue $T_1$ is significant. We therefore need a measure of motif occurrence. To meet the requirements of the different hypothesis tests, we applied the following scoring schemes.
Definition 1: for a given motif $m$, its matching score with a promoter sequence segment $x$ of a gene from tissue $T_1$, PMS1, is defined from the scores between $m$ and $x$ at each position $i$, which can be calculated from the PWM of the motif.
Definition 2: for a given motif $m$, its matching score with a promoter sequence $S$ of a gene from tissue $T_1$, PSS1, is defined over the segments $s_i$ of $S$, where each $s_i$ is obtained by sliding a window of length $w$ along $S$ and has a PMS1 above a predefined threshold, and $n$ is the number of such $s_i$. PSS1 is used in the classical z-test and the binomial test.
Definition 3: for a given motif $m$, its matching score with a promoter sequence segment $x$ of a gene from tissue $T_1$, PMS2, is defined as in [16]. It uses the frequency of residue $B$ at position $i$, taken from the PWM, together with min and max terms.
The sequence-level score PSS2 (Definition 4) uses the same sliding-window construction: each $s_i$ is a segment of $S$ obtained by sliding a window of length $w$ and has a PMS2 above a predefined threshold, and $n$ is the number of such $s_i$. PSS2 is used in the Bayesian hypothesis test.
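The sliding-window construction in these definitions can be sketched as follows. The exact formulas for PMS1 and PSS1 are not recoverable from the text, so this sketch makes two explicit assumptions: the segment score is the sum of the PWM frequencies of the observed bases, and the sequence score is the mean of the segment scores that exceed the threshold. All function names are hypothetical.

```python
def window_score(pwm, segment):
    """ASSUMED segment score: sum over positions of the PWM frequency
    of the base observed at that position (unknown bases score 0)."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, segment.upper()))

def sequence_score(pwm, sequence, threshold):
    """Slide a window of length w = len(pwm) along `sequence`.

    Returns (score, n), where n is the number of windows whose score
    exceeds `threshold` and score is their mean (ASSUMED aggregation).
    """
    w = len(pwm)
    scores = [window_score(pwm, sequence[i:i + w])
              for i in range(len(sequence) - w + 1)]
    hits = [s for s in scores if s > threshold]
    n = len(hits)
    return (sum(hits) / n if n else 0.0), n

# Hypothetical 3-bp motif that always reads "ACG".
pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
score, n = sequence_score(pwm, "TTACGTTACGAA", threshold=2.5)
print(score, n)
```

In the hypothetical example the motif occurs twice in the sequence, so two windows pass the threshold and both carry the maximum score of 3.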

Classical Z-Test
In the classical z-test, we estimated the mean and variance of the match score PSS1 in tissue $T_1$, and then calculated the z-value from them together with $\mu_0$ and $\sigma$, the mean and variance of PSS1 in tissue $T_0$.
In the experiment, we set the significance level to 0.05.
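A one-sided z-test of this kind can be sketched with the Python standard library. The paper's exact z formula is not recoverable from the text, so the standard form $z = (\bar{x}_1 - \mu_0) / (\sigma_0 / \sqrt{n_1})$ is assumed here, with $\sigma_0$ taken as the standard deviation of PSS1 in $T_0$. The function name and the example scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def z_test(scores_t0, scores_t1, alpha=0.05):
    """One-sided z-test: is the mean score in T1 higher than in T0?

    ASSUMPTION: z = (mean(T1) - mu0) / (sigma0 / sqrt(n1)), where mu0 and
    sigma0 are estimated from the T0 scores.
    """
    mu0 = mean(scores_t0)
    sigma0 = pstdev(scores_t0)
    if sigma0 == 0:
        raise ValueError("T0 scores have zero variance")
    z = (mean(scores_t1) - mu0) / (sigma0 / sqrt(len(scores_t1)))
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha

# Hypothetical PSS1 values for promoters in T0 and T1.
z, p_value, significant = z_test([1, 2, 3, 4, 5], [5, 5, 5, 5])
print(round(z, 3), round(p_value, 5), significant)
```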

Bayes Hypothesis Test
Assume that the PSS2 of a motif in tissue $T_0$ follows a Gaussian distribution $N(\mu, \sigma^2)$. To test whether the motif is significant in tissue $T_1$, we constructed two hypotheses about $\bar{x}_1$, the mean of PSS2 in tissue $T_1$. The parameters of the Gaussian distribution are assumed to be known.

Binomial Distribution Test
In the binomial distribution test, instead of the PSS1 value, we need the number of matches between the motif and the promoter sequence of a gene. A match between a motif and a sequence occurs when the PMS1 of the motif with a segment of the sequence is larger than a predefined value. We counted all the matches in tissues $T_0$ and $T_1$, and denote the numbers of matches by $K_0$ and $K_1$ respectively. The binomial distribution test then seeks a value, the K-value, determined by these counts, where $n_0$ and $n_1$ are the numbers of promoter sequences in tissues $T_0$ and $T_1$ respectively, and $p$ is fixed to 0.5 in the experiment.
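One common way to turn match counts into a binomial significance value is an upper-tail probability, sketched below. The paper's actual K-value condition is not recoverable from the text, so this is an assumed form: each of the $n$ promoters is treated as one trial that either contains a match or not, and the tail $P(X \ge k)$ is computed with $p = 0.5$. The function name and the counts are hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p).

    ASSUMPTION: k is the number of promoters (out of n) that contain at
    least one match, so that 0 <= k <= n.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 8 of 10 promoters in T1 contain a match.
tail = binomial_upper_tail(8, 10)
print(tail)
```

A small tail probability indicates that the observed number of matching promoters would be unlikely if matches occurred with probability 0.5.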

Data Sources
Gene expression datasets such as GNF, SAGE and EST are widely used as data sources for identifying TS genes. However, because of noise in expression datasets and human involvement in defining thresholds, the reliability of such identifications is often not high. In this paper, we used TS genes obtained mainly by querying the tissue-specific gene expression database TiGER by tissue name. Some of them came from the TisGED database. All of the TS genes with PubMed IDs were used in the experiment. We obtained 3954 TS genes across 83 human tissues. Because of page limits, the gene lists for all the tissues are available from the authors on request. The promoter sequences of these genes were downloaded from DBTSS and EPD. A promoter region of 1500 bp (−499 bp to 1000 bp around the TSS) was used for motif discovery.

Motifs Discovered by Three Test Methods
After the merging phase, we obtained a total of 3244 motifs. The number of motifs in each tissue is shown in Table 1.
With the Bayes hypothesis test, we obtained 1534 TRMs and 1270 TEMs. With the classical z-test, we obtained 539 TRMs and 164 TEMs. With the binomial distribution test, the numbers of the two kinds of motifs are 270 and 925 respectively. For the details, see in

Overlap Motifs in Three Test Methods
Among all the TRMs, 5 are found by all three methods and 150 are found by two methods. Among all the TEMs, 39 are found by all three methods and 264 are found by two methods. For the details, see Figure 3.
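Overlap counts of this kind follow directly from set operations on the motif lists produced by each test. In the sketch below, "found by two methods" is interpreted as "found by exactly two methods", which is an assumption; the function name and the motif identifiers are hypothetical.

```python
def overlap_counts(bayes, z_test, binomial):
    """Return (found by all three methods, found by exactly two methods)."""
    all_three = bayes & z_test & binomial
    at_least_two = (bayes & z_test) | (bayes & binomial) | (z_test & binomial)
    exactly_two = at_least_two - all_three
    return len(all_three), len(exactly_two)

# Hypothetical motif identifiers reported by each test.
bayes = {"m1", "m2", "m3", "m4"}
z_test = {"m2", "m3", "m5"}
binomial = {"m3", "m4", "m6"}
print(overlap_counts(bayes, z_test, binomial))
```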
We also compared the 5 overlapping TRMs and the 39 overlapping TEMs with JASPAR [17]. 4 of the 5 TRMs (see Table 2) are found in JASPAR. For example, [CCCCNCCCCC] is a previously reported motif with JASPAR ID MA0079.2_SP1, and [GGGGAATCCCC] has JASPAR ID MA0105.1_NFKB1. 19 of the 39 TEMs are found in JASPAR. For example, the motif [NGNNGCRSCG] has JASPAR ID MA0123.1_abi4. For the details, see Table 3.
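The consensus strings above use IUPAC nucleotide codes (for example N for any base, R for A or G, S for C or G). The text does not say how the comparison with JASPAR was carried out, so the following is only a sketch of how such a consensus can be matched against a concrete sequence; the function name and the test sequences are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in consensus.upper()))

pattern = iupac_to_regex("NGNNGCRSCG")
print(bool(pattern.fullmatch("AGTTGCACCG")))  # R=A and S=C are allowed
print(bool(pattern.fullmatch("AGTTGCATCG")))  # T is not allowed at the S position
```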

CONCLUSIONS
Tissue specificity is the foundation on which cells form specific tissues and functional organs. Identification and analysis of tissue-specific genes and their regulatory activities play an important role in understanding the mechanisms of the organism, disease diagnosis and drug design. Finding accurate and meaningful motifs with tissue specificity still remains a big challenge.
In this paper we used an integrated pipeline to discover sequence motifs in the promoter regions of TS genes. To test the significance of those motifs in a specific tissue, we used three hypothesis testing approaches: a Bayesian hypothesis test, a binomial distribution test and the traditional z-test. From 3954 TS genes across 83 human tissues, the three tests identified 2784, 1204 and 703 significant motifs respectively out of the 3244 motifs obtained in the discovery phase. 52.7% of those motifs can be found in publicly available databases.
Definition 4: for a given motif $m$, its matching score with a promoter sequence $S$ of a gene from tissue $T_1$, PSS2, is defined from the PMS2 values of the segments of $S$.

In the Bayes hypothesis test, the posterior distribution of the mean is then used.

Figure 3. (a) Venn diagram of the numbers of tissue rich motifs found by the three methods; (b) Venn diagram of the numbers of tissue even motifs found by the three methods.
Figure 1. Numbers of TEMs by three test methods (green represents the binomial distribution method, red represents the Bayes hypothesis test method, purple represents the classical hypothesis test method).
Figure 2. Numbers of TEMs by three test methods (green represents the binomial distribution method, red represents the Bayes hypothesis test method, purple represents the classical hypothesis test method).

Table 2. 4 matches of 5 TRMs and 19 matches of 39 TEMs found in JASPAR.

Table 3. TRMs (the Motif Match item is the number of motifs in the database).