The application of hidden Markov model in building genetic regulatory network

The research focus of the post-genomic era has shifted from sequence to function. Building a genetic regulatory network (GRN) helps to understand the regulatory mechanisms between genes and the function of organisms. Probabilistic GRNs have recently received increasing attention. This paper discusses the Hidden Markov Model (HMM) as a tool for building a GRN. When training the HMM, different genes with similar expression levels are treated as different states. The probable regulatory genes of each target gene can be identified from the resulting state transition matrix, and the determinate regulatory functions can be predicted using a nonlinear regression algorithm. Experiments on artificial and real-life datasets show the effectiveness of HMM in building GRNs.


INTRODUCTION
To understand how living organisms function, it must be known which genes are expressed, when and where, and to what extent. The regulation of gene expression is achieved through the interactions between DNA, RNA, proteins, and small molecules. This regulatory system can be described by a network structure called a genetic regulatory network (GRN). Building a GRN is the reverse engineering of real gene expression data. It helps to study the relationships between genes systematically, to understand the essential rules of biological phenomena, and to provide valuable ideas for treating some complex diseases [1,2]. Many mathematical models have been proposed in GRN research, such as the Boolean model, the linear combination model, the weighted matrix model, the differential equations model, and so on [3,4]. In recent years, probabilistic GRN models have received more attention, because real biological systems are stochastic and determinate models cannot infer such complex processes. Bayesian networks and Markov chains have been studied [5][6][7]. The final result of a probabilistic GRN is represented by a graph consisting of vertices (genes) and edges (relationships), and the relationships between genes are described through probability. Because current probabilistic GRNs are simple and cannot describe dynamic behavior, this paper studies the application of the Hidden Markov Model (HMM) and a nonlinear regression algorithm to GRN building.
This paper is organized as follows: Section 2 gives a theoretical representation of HMM. Section 3 focuses on the application of HMM in building GRN. Section 4 shows the experimental results and discussion. Section 5 is the conclusion.

Basic Theory
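The classical Hidden Markov Model is a stochastic state machine based on statistical signals. It is a doubly stochastic process in which the sequence of states cannot be observed directly. Let $S = \{s_1, s_2, \ldots, s_N\}$ be the set of states, $q_t$ the state at time $t$, $O = (O_1, O_2, \ldots, O_T)$ the observed value sequence, and $v_k$ ($k = 1, 2, \ldots, M$) an observed value. The three parameters of an HMM $\lambda = (\pi, A, B)$ are defined as follows.
$\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = s_i)$ is the probability that the initial state equals $s_i$.
$A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$ is the probability of a transition from state $s_i$ to state $s_j$. $A$ is an N × N matrix.
$B = \{b_j(k)\}$, where $b_j(k) = P(O_t = v_k \mid q_t = s_j)$ means the observed probability of generating $v_k$ under state $s_j$. $B$ is an M × N matrix.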
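As a concrete illustration of the three parameters, the following minimal sketch builds a toy HMM with NumPy and checks that each parameter is a valid probability distribution. The numbers, the sizes (N = 2 states, M = 3 observed values) and the helper name `is_valid_hmm` are hypothetical and chosen only for illustration. For convenience the emission matrix is stored with one row per state, that is, as the transpose of the M × N layout.

```python
import numpy as np

# Hypothetical toy model: N = 2 hidden states, M = 3 observed values.
pi = np.array([0.6, 0.4])                 # pi[i]   = P(q_1 = s_i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])           # B[j, k] = P(O_t = v_k | q_t = s_j)


def is_valid_hmm(pi, A, B, tol=1e-9):
    """Check that pi and every row of A and B are probability distributions."""
    n = len(pi)
    ok_shapes = A.shape == (n, n) and B.shape[0] == n
    nonneg = (pi >= 0).all() and (A >= 0).all() and (B >= 0).all()
    sums = (abs(pi.sum() - 1.0) < tol
            and np.allclose(A.sum(axis=1), 1.0, atol=tol)
            and np.allclose(B.sum(axis=1), 1.0, atol=tol))
    return bool(ok_shapes and nonneg and sums)
```

Calling `is_valid_hmm(pi, A, B)` on the toy model is a quick sanity check before any of the three fundamental problems is attempted.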

Fundamental Problems in HMM
Three problems need to be solved when building an HMM.

1) Evaluation
For a given HMM $\lambda = (\pi, A, B)$ and an observed sequence $O$, the Forward-Backward algorithm proposed by Baum is used to solve the evaluation problem, that is, to compute $P(O \mid \lambda)$. In practical applications the resulting probabilities are very small, so they are usually normalized or handled in logarithmic form during the calculation.

2) Decoding
Decoding means determining an optimal state transition sequence for the given HMM and observed sequence.
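The remark about very small results can be made concrete with a short sketch of the forward pass, which is the part of the Forward-Backward algorithm needed for evaluation. The sketch normalizes the forward variable at every step and accumulates the logarithms of the normalizers, so it returns the logarithm of the probability instead of the raw probability. The function name and the array layout (emission matrix with one row per state, observations given as integer indices) are assumptions made for this example.

```python
import numpy as np


def forward_log_likelihood(pi, A, B, obs):
    """Return log P(O | lambda) using the scaled forward recursion.

    pi:  (N,)   initial state probabilities
    A:   (N, N) state transition matrix
    B:   (N, M) emission matrix, one row per state (assumed layout)
    obs: sequence of observed-value indices in 0..M-1
    """
    alpha = pi * B[:, obs[0]]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()          # normalizer for this time step
        log_prob += np.log(scale)    # the product of normalizers equals P(O | lambda)
        alpha = alpha / scale
    return log_prob
```

Working in the log domain keeps the computation stable for long expression profiles, where the raw probability would underflow.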
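The text does not name a decoding algorithm. A common choice is the Viterbi algorithm, and the sketch below is a minimal log-domain version of it, offered as an illustration only and not as the method used in this paper. The function name and the array layout (emission matrix with one row per state) are assumptions.

```python
import numpy as np


def viterbi(pi, A, B, obs):
    """Return the most probable state sequence for obs (Viterbi, log domain)."""
    with np.errstate(divide="ignore"):        # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    delta = log_pi + log_B[:, obs[0]]         # best log score of paths ending in each state
    back = np.zeros((T, N), dtype=int)        # back[t, j]: best predecessor of state j at time t
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[i, j]: best path ending in i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow the back-pointers from the end
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```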

3) Learning
Learning obtains the optimal parameters $\lambda^{*}$ of the HMM through a training algorithm, where $\lambda^{*}$ maximizes the value of $P(O \mid \lambda)$. There are two approaches to the learning problem. One uses gradient techniques; the other is based on iteration or recursion, such as the Baum-Welch algorithm, which is often used to train the parameters of an HMM.
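Written as formulas, and using the standard Baum-Welch notation, the learning objective and the re-estimate of a transition probability take the form below. The symbols $\gamma_t$, $\xi_t$ and $q_t$ (the state at time $t$) are not defined in this paper and are introduced here only for illustration.

```latex
\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda), \qquad
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
\quad\text{where}\quad
\gamma_t(i) = P(q_t = s_i \mid O, \lambda), \quad
\xi_t(i,j) = P(q_t = s_i,\, q_{t+1} = s_j \mid O, \lambda).
```

Each iteration replaces $a_{ij}$ by $\bar{a}_{ij}$ (and updates the other parameters analogously), which never decreases $P(O \mid \lambda)$.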

Constructing HMM
A fundamental assumption is that genes sharing similar expression levels are commonly regulated and are involved in related biological functions. Most GRNs are built on the basis of clustering. In this paper, the HMM is constructed from the genes clustered into the same class.
1) States: Among the genes clustered into the same class, different genes are considered as different states, so the size of the state transition matrix equals the number of genes. The state transition probability corresponds to the regulatory probability between genes. One gene may be regulated by any other gene, or even by itself, so a fully connected structure is used.
2) Observed sequence: The expression profiles of the genes are considered as the observed sequence. Since these data are easily affected by noise, smoothing is applied first to reduce its influence.
3) Training steps:
Step 1: Initialize the parameters $\lambda_0$ of the HMM. The number of states equals the number of genes, and each value of the state transition matrix is initialized to the average value $1/N$, where $N$ is the number of states; $P(O \mid \lambda_0)$ can then be computed.
Step 2: Re-estimate the HMM parameters $\lambda_0$. The Baum-Welch algorithm is used to train the HMM and acquire $\lambda$.
Step 3: Compute $P(O \mid \lambda)$ under the obtained model $\lambda$ using the Forward-Backward algorithm.
Step 4: Check the convergence criterion. If it is not satisfied, set $\lambda_0 = \lambda$ and return to Step 2. Otherwise, the training process is finished and the final HMM, which is close to the observed sequence, is acquired.
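The four training steps can be sketched for a discrete-observation HMM as follows. This is a minimal sketch under several assumptions that the text does not specify: the expression profile has already been smoothed and discretized into integer symbols, the emission matrix is stored with one row per state and starts from a random row-stochastic matrix, and the convergence criterion is a small change in the log-likelihood. All function names are hypothetical.

```python
import numpy as np


def forward_backward(pi, A, B, obs):
    """Scaled forward-backward pass; returns alpha, beta and log P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha, beta, scale = np.zeros((T, N)), np.ones((T, N)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    return alpha, beta, np.log(scale).sum()


def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation of (pi, A, B) from a single sequence."""
    alpha, beta, _ = forward_backward(pi, A, B, obs)
    gamma = alpha * beta                                   # gamma[t, i] = P(q_t = s_i | O)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O)
    xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]
    return gamma[0], new_A, new_B


def train_hmm(obs, n_states, n_symbols, tol=1e-6, max_iter=200, seed=0):
    """Follow Steps 1-4; tol and max_iter are assumed stopping parameters."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    # Step 1: every transition probability starts at the average value 1/N.
    pi = np.full(n_states, 1.0 / n_states)
    A = np.full((n_states, n_states), 1.0 / n_states)
    B = rng.random((n_states, n_symbols)) + 0.1            # assumed random start
    B /= B.sum(axis=1, keepdims=True)
    log_prob = forward_backward(pi, A, B, obs)[2]          # log P(O | lambda_0)
    for _ in range(max_iter):
        pi, A, B = baum_welch_step(pi, A, B, obs)          # Step 2
        new_log_prob = forward_backward(pi, A, B, obs)[2]  # Step 3
        converged = new_log_prob - log_prob < tol          # Step 4 (assumed criterion)
        log_prob = new_log_prob
        if converged:
            break
    return pi, A, B, log_prob
```

The random start for the emission matrix is a design choice of this sketch: if every parameter were uniform, all states would be interchangeable and the re-estimation could not separate them.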

Building Probabilistic GRN
Regulatory genes for each target gene can be identified from the state transition matrix after training the HMM. The structure of the regulatory network is then built according to the following steps:
Step 1: For a target gene $x_i$ ($i = 1, 2, \ldots, N$), find the genes whose transition probability in the trained state transition matrix $A$ is larger than the initial average probability; these genes are regarded as the parental regulatory genes of the target gene.
Step 2: Repeat Step 1 until the global information has been found for every target gene.
Step 3: Predict the determinate regulatory function $f_i$ between each target gene and its parental regulatory genes using multiple nonlinear regression and the least squares algorithm.
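The three steps above can be sketched as follows. Two assumptions are made that the text does not settle: row i of the trained matrix is taken to hold the transition probabilities for target gene i, and the nonlinear regulatory function is taken to be a quadratic without cross terms, fitted by least squares. The function names are hypothetical.

```python
import numpy as np


def parental_regulators(A, gene_names):
    """Steps 1-2: for every target gene, list the genes whose trained
    transition probability exceeds the initial average value 1/N.

    Assumption: row i of A belongs to target gene i, column j to candidate j.
    Self-regulation (j == i) is allowed.
    """
    n = A.shape[0]
    threshold = 1.0 / n
    return {gene_names[i]: [gene_names[j] for j in range(n) if A[i, j] > threshold]
            for i in range(n)}


def fit_regulatory_function(target, regulators):
    """Step 3: least-squares fit of a target profile from its regulators.

    Assumed form: x_i = c0 + sum_j (c1_j * x_j + c2_j * x_j ** 2).
    target: (T,) array; regulators: (T, k) array, one column per regulator.
    Returns the coefficients [c0, c1_1..c1_k, c2_1..c2_k] and the fitted values.
    """
    X = np.column_stack([np.ones(len(target)), regulators, regulators ** 2])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, X @ coef
```

Other basis functions could be substituted in `fit_regulatory_function` without changing the least-squares step.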

Experiment with Artificial Data
To evaluate the efficiency of our algorithm, a group of networks with known structure is required. However, the real structures of GRNs are not completely known, because GRN research is still at an early stage. The artificial data reported in [8] are therefore used in this study. The adopted ALARM network contains 37 discrete variables and 46 edges, and each variable takes between 2 and 4 values. The network with known structure is called the target network $N_t$ and the result of our algorithm is called the deduced network $N_d$. Three indices, sensitivity, specificity and F-factor, are used to evaluate our algorithm. Sensitivity tests the inference ability, specificity reflects the degree of accuracy, and the F-factor balances the two. A larger F-factor means higher accuracy.
1) Experimental results
The initial elements of the state transition matrix are set to 1/15. Table 2 is the final trained state transition matrix; each row shows the transition probabilities corresponding to a target gene, from which the probable regulatory genes of the target genes can be found. Because the transition probabilities of MBP1, SWI6, MCM1 and CLN3 to gene SWI4 are larger than the initial probability, these genes can be regarded as the regulatory genes of target gene SWI4. Table 3 lists the regulatory genes of target genes SWI4, SWI5, CLN2, and CLB2.
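The exact formulas for the three indices are not available in the text. The sketch below uses one common convention as an assumption: sensitivity is the share of target-network edges that are recovered, specificity is the share of deduced edges that are correct, and the F-factor is their harmonic mean. The function name is hypothetical, and edges are represented as (regulator, target) pairs.

```python
def evaluate_network(target_edges, deduced_edges):
    """Compare a deduced network with the known target network.

    Assumed definitions (the source's own formulas are not available):
      sensitivity = shared edges / edges in the target network
      specificity = shared edges / edges in the deduced network
      F-factor    = 2 * sensitivity * specificity / (sensitivity + specificity)
    """
    target, deduced = set(target_edges), set(deduced_edges)
    shared = len(target & deduced)
    sensitivity = shared / len(target) if target else 0.0
    specificity = shared / len(deduced) if deduced else 0.0
    total = sensitivity + specificity
    f_factor = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f_factor
```

Because edges are directed pairs, an edge deduced in the reverse direction counts as a miss under these definitions.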

JBiSE
$N_s$ means the total number of the edges in […].
Table 1 compares our algorithm with the standard Simulated Annealing algorithm, called BANJO, developed by Hartemink. Our result is better than the general Simulated Annealing algorithm in all items.
2) Determinate regulatory relationships
Predicting the determinate regulatory relationships can offset the drawback of the probabilistic GRN, which cannot describe specific dynamic behavior. The obtained regulatory relationship between target gene CLB2 ($x_{12}$) and its regulatory genes MCM1 ($x_5$) and SIC1 ($x_{13}$) is:

Experiment with Real-Life Data
The real-life experimental data in this paper come from the yeast cell cycle expression datasets created by Spellman [9], which carry regulatory information about the yeast cell cycle. However, the indices mentioned above cannot be used to evaluate the result, because the real biological regulatory network is not completely known. Therefore, existing regulatory relationships that have already been proved are used to evaluate our results.
0.1503 0.0943 0.0075 0.0039 0.0026 0.17
As listed in Futcher's paper [10], 15 main transcriptional factors are discussed: MBP1, SWI4, SWI6, FKS1, MCM1, FKH1, NDD1, SWI5, ACE2, CDC28, CLN3, CLB2, SIC1, CLN2 and HHT1. It has been verified that interactions exist among these genes. For convenience, these genes are numbered 1, 2, …, 15.
Figure 2 gives a local structure of our resulting GRN. Black connecting lines represent the verified existing edges [11]. Blue connecting lines represent edges whose direction is reversed relative to the known relationships. Red connecting lines represent regulatory relationships predicted by our algorithm that remain to be verified.
3) Discussions
It has been verified in Reference [12] that gene CLN3 activates gene SWI4 and gene SWI4 regulates gene CLN2; meanwhile, genes CLN1 and CLN2 are both influenced by gene CLB2, and CLB2 regulates CLN2. The expression pattern of gene SWI5 is similar to that of SIC1, and it has been proved that SWI5 regulates SIC1. These conclusions support the effectiveness of our algorithm.
Moreover, it has been verified that CLN3 is regulated by gene SWI4 and that CLB2 is regulated by gene SIC1, which agrees with the results predicted by our algorithm.

CONCLUSIONS
This paper discusses the application of HMM in building GRN. The regulatory genes of each target gene can be found through the state transition matrix, and the global structure of the GRN can then be determined. The simulation experiment shows that this algorithm is more effective than the Simulated Annealing algorithm it was compared with, and the results on real-life data also show that it is reasonable. Compared with determinate models, HMM has the advantage that it describes the degree of transcriptional regulation between genes through probability. In particular, the present algorithm can find self-regulatory relationships of genes.
Many problems still need to be considered in GRN research using HMM, for example, how to choose the initial model. Since the biological GRN is a time-continuous, complicated dynamic system that is not yet completely known, how to effectively evaluate the GRN in terms of its biological meaning is the subject of future research.

For the observed sequence $O = (O_1, O_2, \ldots, O_T)$, evaluation means calculating the probability $P(O \mid \lambda)$ that the model generates the observed sequence, which measures the similarity of the observed sequence to the given model.

Figure 1. The real expression profile and regression result of gene CLB2.

Figure 2. The local structure of GRN.

Table 1. The comparison of BANJO and HMM.

Table 2. The state transition matrix.

Table 3. The regulatory genes of several target genes.