Novel Quantitative Approach for Predicting mRNA/Protein Counts in Living Cells

One of the most complex questions in quantitative biology is how to manage noise sources and their consequences for cell functions. Noise in genetic networks is inevitable, as chemical reactions are probabilistic and genes, mRNAs and proteins are often present in variable numbers per cell. Previous research has focused on counting these numbers either experimentally, using complex fluorescence techniques, or theoretically, by characterizing the probability distribution of mRNA and protein numbers in cells. In this work, we propose a modeling-based approach: we build a mathematical model that predicts the number of mRNAs and proteins over time, and develop a computational method to extract noise-related information from such a biological system. Our approach contributes to answering the question of how the numbers of mRNAs and proteins change in living cells over time and how these changes induce noise. Moreover, we calculate the entropy of the system; this turns out to be important information for prediction, and could allow us to understand how noise information is generated and expanded.


Introduction
Randomness, or noise, in biological systems has long been predicted from basic physical principles [1]-[8] and was later supported by observations of phenotype heterogeneity [7]. Confirmation came later with [9], [10] and [11], who showed that mRNA and protein variability may be an important source of noise in biology. Studies in [12], [13], [14] and [15] have reported that the number of proteins translated from an mRNA obeys a geometric distribution, but that the distribution describing the number of proteins remaining once the mRNA is degraded is no longer geometric. Various techniques have so far been used to monitor and capture those numbers, among them fluorescent probes and green fluorescent protein variants, which allow the quantification of protein levels in living cells by flow cytometry or fluorescence microscopy [16] [17]. The first quantitative study collectively examined the noise associated with the principal steps of the central dogma of molecular biology (replication, gene activation, transcription and translation) and with the enslaving intracellular environment, and suggested that autorepression of replication and transcription suppresses noise. This led to the examination (by analysis, modelling and simulation) of the role of noise in biology, relying on the similarity between biological and engineering systems; see [7], [10] and [18]. In general, noise may be considered either intrinsic or extrinsic to a specific gene circuit, and within a specific gene circuit noise can have three different effects: i) noise is negligible, with little or no influence over function; ii) noise is detrimental to function and to the gene circuit; iii) noise is important for circuit function. Using simple assumptions, it is possible to evaluate these effects. The assumption we use in this paper is a dynamic correlation between the noise level of molecules (mRNA/protein) and the change in the probability of having those molecules in a given interval of time.
Our paper is organised as follows. In Section 2, we introduce our model of the dynamics of the numbers of mRNAs and proteins, after a brief review of previous models. In Section 4, we present our method and algorithm for solving the (mRNA and protein) prediction problem. In Section 5, we present the simulation results, followed by a discussion of those results, and we end this work in Section 6 with a short conclusion.

Birth-Death Model
To understand noise in biological systems, measured noise properties are often used to elucidate the structure and the function of the underlying biochemical circuits and genetic networks [6] [8]. Recent studies [13] and [14] have also clearly established the existence of a dynamic correlation between genetic networks and mRNA/protein variability. In what follows we present previous models with their strengths and weaknesses. The preliminary model used was a simple birth-death Markov process, which captures noise in a biochemical process. This model showed that noise in the population was a consequence of the change in the parameters of the system over time, and it was used to explore the temporal change of the number of proteins in a biological system. The time course of the number of proteins n(t) was accordingly modelled by Equation (1), with parameters α representing the rate of production and γ the rate of decay. However, such a continuous-time formulation neglects the discrete nature of proteins and the random timing of molecular transitions [17]: the actual time evolution may follow any one of a number of trajectories, and hence sufficiently many trajectories have to be examined to obtain statistics that converge. In the next section a probabilistic approach using extended versions of Kolmogorov's equations is used to explore randomness in the system.
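As a reading aid, the following is a minimal sketch of a continuous-time birth-death rate equation matching the description above. It assumes a constant production rate α and a per-molecule decay rate γ; this is the standard textbook form and is an assumption here, not necessarily the exact form of Equation (1).

```latex
% Assumed standard form: constant production, first-order decay
\frac{dn(t)}{dt} = \alpha - \gamma\, n(t),
\qquad
n(t) = \frac{\alpha}{\gamma} + \left( n(0) - \frac{\alpha}{\gamma} \right) e^{-\gamma t}
```

Under this assumption n(t) relaxes smoothly to the steady state α/γ, which illustrates why a deterministic description carries no information about fluctuations around that level.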

Kolmogorov's Equations Based Model
The probability p(n, t) that the system evolves into the state n(t) = n at time t is described by a partial differential equation, Equation (2), written in terms of the coefficients a(n) and the probabilities p(n, t). This equation makes sense only if we assume that the probability for two or more reactions to occur in the time interval dt is negligible compared to the case when only one reaction occurs. In addition, (2) can only be solved numerically for relatively simple systems. In a recent work [15], a similar mathematical model was used for gene expression and an approximate solution to the PDE was proposed; the model was based on the assumption that gene expressions are Brownian motions. The authors considered a two-stage model of gene expression, assuming that the promoter was always active, and so had two stochastic variables (the number of mRNAs and the number of proteins). The probability of having m mRNAs and n proteins at time t was given by a master equation, Equation (3). The meanings of the rates in (3) [...]
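For orientation, master equations of this kind are commonly written as in the sketch below. The first line is the standard master equation for a single-species birth-death process with production rate α and per-molecule decay rate γ. The second is the usual textbook form of a two-stage (mRNA and protein) model with an always-active promoter. The symbols v_0 (transcription rate), v_1 (translation rate per mRNA), d_0 (mRNA degradation rate) and d_1 (protein degradation rate) are names assumed for this illustration, and neither line is claimed to reproduce Equations (2) or (3) exactly. Probabilities with a negative count are taken to be zero.

```latex
% Single-species birth-death master equation (assumed standard form)
\frac{\partial p(n,t)}{\partial t}
  = \alpha\, p(n-1,t) + \gamma\,(n+1)\, p(n+1,t) - (\alpha + \gamma n)\, p(n,t)

% Two-stage model: mRNA count m, protein count n (assumed standard form)
\frac{\partial p(m,n,t)}{\partial t}
  = v_0 \left[ p(m-1,n,t) - p(m,n,t) \right]
  + v_1\, m \left[ p(m,n-1,t) - p(m,n,t) \right]
  + d_0 \left[ (m+1)\, p(m+1,n,t) - m\, p(m,n,t) \right]
  + d_1 \left[ (n+1)\, p(m,n+1,t) - n\, p(m,n,t) \right]
```

Each bracket pairs a gain term (probability flowing into the state) with a loss term (probability flowing out). This is valid only when at most one reaction occurs in a sufficiently short interval dt.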

The New Model
The new model describes the time evolution of the probabilities p(m, t) and p(n, t), with coefficients a(m) and a(n), through Equations (4)-(6). The parameters of the model have an autoregressive form in a(m) and a(n), with parameters ϑ and θ. The transcription, translation and degradation rates are assumed to vary from one cell to another, where T is the total time, N_0 the total number of points in the simulation and N the total number of mRNAs or proteins in a single cell. In the next section we introduce our method and algorithm for solving Equations (4)-(6).

Method and Justification
We propose a straightforward method of solving the above problem, based on numerical approximation. As the analytical solution to Equation (4) is hard to obtain, even for a "reasonable" number of cells, a numerical algorithm using an adapted stochastic simulation approach is proposed in this paper. In our algorithm, two random variables [...] the existence of a one-to-one relation between the dynamics and the distribution for predictable dynamic systems. We present our algorithm in the next section.
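To make the idea of a stochastic simulation approach concrete, here is a minimal sketch of the standard Gillespie direct method applied to a simple birth-death process. It is not the adapted algorithm of this paper: the function name, the constant production rate `alpha` and the per-molecule decay rate `gamma` are assumptions chosen for illustration. In the standard method, the two random variables drawn at each step are the waiting time to the next reaction and the choice of which reaction fires.

```python
import random


def gillespie_birth_death(alpha, gamma, n0, t_end, seed=None):
    """Simulate one birth-death trajectory with the Gillespie direct method.

    alpha: production propensity (assumed constant)
    gamma: decay rate per molecule (assumed first-order)
    Returns the jump times and the molecule count after each jump.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        birth = alpha          # propensity of producing one molecule
        death = gamma * n      # propensity of losing one molecule
        total = birth + death
        if total <= 0.0:
            break              # no reaction can occur any more
        # First random variable: exponential waiting time to the next event.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Second random variable: which reaction fires.
        if rng.random() * total < birth:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts
```

Running the function many times with different seeds gives an ensemble of trajectories, from which the distribution of counts at a given time can be estimated empirically.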

Our Algorithm
Input: initial data m_0, n_0; coefficients a(m), a(n); parameters θ, ϑ.

Simulations and Results
The initial data here is a matrix of randomly generated numbers between one and fifty for mRNAs and between one and forty for proteins. The rows represent
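A minimal sketch of how such initial data could be generated is given below. The value ranges (1 to 50 for mRNAs, 1 to 40 for proteins) come from the text. The matrix shape, the seed and the helper name `make_initial_counts` are hypothetical choices made only for illustration.

```python
import random


def make_initial_counts(n_rows, n_cols, low, high, seed=None):
    """Return an n_rows x n_cols matrix of random integers in [low, high]."""
    rng = random.Random(seed)
    return [[rng.randint(low, high) for _ in range(n_cols)]
            for _ in range(n_rows)]


# Hypothetical shape: 50 rows by 10 columns (not specified in the text).
mrna0 = make_initial_counts(50, 10, 1, 50, seed=0)      # mRNA counts, 1..50
protein0 = make_initial_counts(50, 10, 1, 40, seed=1)   # protein counts, 1..40
```

Fixing the seed makes the initial matrices reproducible between runs, which helps when comparing simulation results.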

Results
Our results are presented as a series of figures related to our solutions. We first plot the variability of the number of proteins in cells over time for a sample of 50.
Next, we plot the solutions of (4) over time and explain their relevance for our work. We plot the pmf (probability mass function) of the mRNAs and proteins in separate graphs for each sample, then the histograms of the distribution, and finally the scatter plot of P_n against P_m. Our observations are presented in the caption of each figure.
It can be seen that all probability values are between 0.1 and 0.9 and do not overlap in most cases; this is an indication that mRNA and protein numbers may be dynamically dependent, and therefore correlated. Next, we predict the number of mRNAs and proteins [...]

Entropy Distribution
To measure the uncertainty associated with each sample of mRNA or protein counts, we introduce the concept of entropy over a population, which is calculated [...]. Computational results are shown in the figures in the Discussion section.
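As an illustration of entropy computed over a population, the sketch below evaluates the Shannon entropy of the empirical distribution of counts across cells at each time point. The use of Shannon entropy in bits, the function names and the data layout (`population[t]` holds the per-cell counts at time index t) are assumptions made for this example; the exact definition used in this work may differ.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution of counts."""
    total = len(counts)
    if total == 0:
        raise ValueError("counts must be non-empty")
    freq = Counter(counts)  # how many cells show each molecule count
    return -sum((k / total) * math.log2(k / total) for k in freq.values())


def entropy_over_time(population):
    """Entropy at each time index; population[t] lists per-cell counts."""
    return [shannon_entropy(snapshot) for snapshot in population]
```

For example, a population in which every cell holds the same count has zero entropy, while four cells with four distinct counts give an entropy of 2 bits.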

Discussion
We have shown (Figure 2) that one may calculate the distribution of the number of mRNAs and proteins during gene expression, according to our model in Section 3. Based on these distributions (Figure 3 and Figure 4) we were able to predict the number of proteins and mRNAs over time. We use two main assumptions: i) the initial numbers of mRNAs and proteins must be known; and ii) all cells must be similar (functionally, structurally, architecturally and/or dynamically). Our results show that both the protein and mRNA distributions are typically non-symmetric and may not be unimodal (Figure 5, Figure 6 and Figure 7). Consequently the mean and the mode are significantly different, and the standard deviation is clearly not constant over time. Such distributions are poorly described by Gaussian statistics.
It can be seen that the maximum entropy is reached earlier for proteins and that the right tail is also longer for proteins than for mRNA. This may suggest that proteins have a longer lifetime than mRNA, which is in line with biological knowledge.
This paper was primarily designed to promote a modelling culture among noise biologists and modellers, and to address noise sources and their consequences in cell development.
H. C. Jimbo et al.

Conclusion
The advantage of counting single molecules (mRNAs or proteins) is that one obtains the probability distribution of molecules corresponding to each stage of the "central dogma" of molecular biology for each single gene. The mathematical model developed here differs from those that cellular biologists are accustomed to encountering [3] [5]. Instead of a continuous and deterministic model of kinetic behavior, the mathematics of gene expression may be described by discrete stochastic models that take into account the variability in the numbers of molecules involved at both the mRNA and protein levels. Figure 7 shows the plot of entropy distributions over time in a chosen cell. We have found that the maximum entropy is reached earlier for proteins than for mRNAs, and that the right tail of the density is also longer for proteins than for mRNA. This result suggests that proteins have a longer lifetime than mRNA.
This evidence is in line with biological principles.