Novel Quantitative Approach for Predicting mRNA/Protein Counts in Living Cells ()
1. Introduction
Randomness, or noise, in biological systems has long been predicted from basic physical principles [1] - [8] and later on by observations of phenotype heterogeneity [7] . But the confirmation came later with [9] , [10] and [11] who showed that mRNA and protein variability may lead to important a source of noise in biology. Researches in [12] , [13] , [14] and [15] have reported that the number of proteins translated from an mRNA obeys a geometric distribution but the distribution describing the number of protein remaining once mRNA is degraded will no longer be geometric. Various techniques have so far been used to monitor and capture those numbers among which fluorescent probes or green fluorescent protein variants which allow the quantification of protein levels in living cells by flow cytometry or fluorescence microscopy [16] [17] . The first quantitative study collectively examines the noise associated with the principal step of central dogma of molecular biology in replication, gene activation, transcription, translation and the enslaving intracellular environment, and suggested that autorepression of replication and transcriptions suppresses noise. This then leads to examination (by analysis, modelling and simulation) of the role of noise in biology relying on the similarity between biological and engineering systems―see [7] , [10] and [18] . In general, noise may be considered either intrinsic or extrinsic to a specific gene circuit, and within a specific gene circuit there are three different effects of noise: i) noise is negligible with little or no influence over function; ii) noise is detrimental to function and gene circuit; iii) noise is important for circuit function, and by using simple assumptions, it is possible to evaluate these effects. The assumption we use in this paper is dynamic correlation between the noise level of molecules (mRNA/protein) and the change in the probability of having those molecules in given interval of time. Our paper is organised as follows. In Section 2, we introduce our model of the dynamic of the number of mRNA and proteins after a brief review of previous models. In Section 4, we present our method and algorithm for solving the (mRNA and protein) prediction problem. In Section 5, we present the simulation results, followed by a discussion of those results, and end this work in Section 6 with a short conclusion.
2. Somes Examples and Motivations
2.1. Birth-Death Model
To understand noise in biological systems, biochemical circuits and genetic networks are often used as the measured noise properties to elucidate the structure and the function of the underlying gene circuit [6] [8] . Also recent researches [13] and [14] have clearly established the existence of dynamic correlation between genetic network and mRNA/protein variability. In the next section we will present previous models with their strengths and weaknesses. The preliminary model used was a simple birth-death Markov process which captures noise in a biochemical process. This model showed that noise in the population was a consequence of the change in the parameters of the system over time and was used to explore the temporal change of the number of proteins in a biological system. The time course of the number of proteins was modelled consequently by the equation
(1)
with parameters
representing the rate of production and
the rate of decay of number of proteins
. However, such continuous time formulation neglects the discrete nature of proteins and the random timing of molecular transition [17] because the actual time evolution may follow any one of a number of trajectories, and hence sufficiently many trajectories have to be examined to obtain statistics that converge. In the next section a probabilistic approach using the extended versions of Kolmogorov’s equations is used to explore randomness in the system.
2.2. Kolmogorov’s Equations Based Model
In general, the Kolmogorov’s equations are used as master equation to capture the distribution of chemical components of the gene circuit over time. The state of the system is defined by a vector
, where
represents the i-th component of molecule n at time t;
and
are internal parameters representing respectively the propensity of the dynamic and the actual change in
, resulting from the change in the previous state The probability,
, that the system evolves into the state
at time t is described by the following partial differential equation:
(2)
This equation makes sense only if we assume that the probability for two or more reactions to occur in the time interval
is negligible compared to the case when only one reaction occurs. In addition, (2) can only be solved numerically for relatively simple systems. In a recent work by [15] , a similar mathematical model was used for gene expression and an approximate solution was proposed to the
PDE
; the model was based on the assumption that gene expressions are Brownian motions. They considered a two-stage model of gene expression, assuming that the promoter was always active and so had two stochastic variables (the number of mRNA and the number of proteins). The probability of having m mRNA and n proteins at time t was given by the following master equation:
(3)
The meanings of the rates in (3) are:
is the probability per unit time of transcription,
the probability per unit of translation,
the probability per unit time of degradation of an mRNA, and
the probability per unit time of degradation of a protein. The authors use a particular generating function and transform (3) into a first order
PDE
which is solved using a simple approximation. However, this model works only on a single cell, and all rates
,
,
and
are fixed over time. Further, by assuming that the protein synthesis occurs in bursts
, the authors derive the Kolmogorov (master) equation for gene expression that considers only proteins, by implicitly including mRNAs (since n and m seem to be correlated over time). In the next section, we shall re-examine this model and propose a new one in order to overcome the above limitations.
3. The New Model
Our setup is motivated by the necessity to overcome the limitations from the previous models by increasing the cell numbers and relaxing the restriction on constant parameters. We propose a new, flexible, and more general, model for a population of N cells. This model is an extended version of the previous Kolmogorov’s equation with additional cell-dependent constraints.
(4)
The parameters of the model have an autoregressive form:
(5)
The transcription, translation and degradation rates are assumed to vary from one cell to another as
and
(6)
We assume for
, the first
are sequences of transcription rates and the late
are sequences of translation rates with
being the fixed initial rate. Our model, which is composed of the Equations (4)-(6), is well adapted to various real biological promoter change. We shall notice that Equation (4) is a system of 200 equations with 100 by 2 unknowns, which is likely to be only numerically solvable after some good approximations. To efficiently predict the number of mRNAs and proteins over time, we shall rely on the following assumptions.
Proposition 1. Over time the number of mRNAs/Proteins is perfectly correlated with the probability mass functions of mRNAs m(t) and proteins n(t) respectively. That is,
and
, where
and
are initial measurements.
Proof: The proof follows from our algorithm and solution in this paper. ,
Proposition 2. Let
be the number of proteins and
be the noise generated by
proteins (or m for mRNAs) in the same time interval
. Then there exists a unique constant C such that
which means that noise is cells is proportionally correlated to the probability distribution of protein and mRNA numbers.
Proof:
Let
,
be respectively the noise and the number of proteins in a cell. By the simple decomposition of numbers of mRNA/ proteins,
and
, (by the additivity property of probability distribution. We also have, using the
definition, that
,
and
. This implies that
multiplying the right side of above with
and we obtain
.
since
and
thus
leading to
Finally we conclusion that
. ,
Here we put
where
is the total time,
the total number of
points in the simulation and
is the total number of mRNA or proteins in a single cell. In the next section we introduce our method and algorithm for solving Equations (4)-(6).
4. Method and Justification
We propose a straightforward method of solving the above problem based on numerical approximation via the following algorithm. As the analytical solution to Equation (4) is (at least) hard to obtain, even for a “reasonable” number of cells, a numerical algorithm using an adapted stochastic simulation approach is proposed in this paper. In our algorithm, two random variables
and
determine the temporal evolution of the system. The variable
is the time for the next event to occurs, the probability density of an event (appearance of m(t) or n(t)) is evaluated based upon our model (4), so as to give a better flexibility and applicability to the approach in comparison with previous ones. The main purpose of creating such an algorithm is to simultaneously simulate the process noise, while predicting the online probability mass function (»probability density) of each event over time. An important assumption here is that the hypothetical probability distribution functions (p.m.fs) of the translation and transcription rates are of the form
and the mRNA and protein degradation rates are
. This is in line with the existence of a one-to-one relation between the dynamic and distribution for predictable dynamic systems. We will present our algorithm in the next section of our work.
Our Algorithm
Input: Initial data
Outputs:
1. Set
.
2. For j = 1:k do [k = number of iterations]
a. Let
be the changes associated to a single event;
b. Compute
;
c. Compute
;
d. Compute
, where
;
e. Normalize
.
3. Output
.
End
5. Simulations and Results
The initial data here is a matrix of randomly generated numbers between one and fifty for mRNAs and between one and forty for proteins. The rows represent the cell numbers and the columns are the number of mRNAs/proteins counted at each time interval. Therefore we have 100 cells (population) and 50 samples taken at a time interval of one unit, and the total time of 50 time units in the entire population; (a unite could be second, minute or hour depending on the experiment). The bar and image pots of the initial data are shown in Figure 1.
![]()
Figure 1. These subfigures give plots of the initial randomly generated data with 50 samples for proteins. A sample is the number of protein in cells for a fixed interval of time. This image plot support the presence of various level of noise in the biodynamic system.
5.1. Results
Our results will show various figures related to our solutions. We first plot the variability of the number of protein in cells over time for a sample of 50.
Next, we plot the solutions of (4) over time and explain their relevance for our work. pmf (probability mass function) of the mRNA and proteins in separate graphs for each sample, and further we plot the histograms of the distribution and finally the scatter plot of
against
. Our observations are presented in the caption of each figure.
It can be seen that all probability values are between 0.1 and 0.9 and do not overlap in most of the cases; this is an indication that mRNAs and proteins number may be dynamically dependent, and therefore correlated. Next, we predict the number of mRNAs and proteins
using a straightforward probabilistic concept which states that “a good value of m (or n) depends on a good guess of p”. The prediction for the number of mRNA and Protein
(for iteration
) are then given by the following Markov equations.
(7)
(8)
Leading to the following results for mRNA
5.2. Entropy Distribution
To measure the uncertainty associated with each sample of mRNA or proteins count, we introduce the concept of entropy over a population, which is calculated as follows:
(for mRNAs)
(9)
(for Proteins)
(10)
Computational results are shown in the figures below in the discussion section.
6. Discussion
We have shown (Figure 2) that, one may calculate the distribution of the number of mRNAs and proteins during gene expression, according to our model in Section 3. Based on these distributions (Figure 3 and Figure 4) we were able to predict the number of proteins and mRNAs over time. We use two main assumptions: i) The initial number of mRNA and proteins must be known; and ii) all cells must present similarity (functional, structural, architectural and/or dynamical). Our results show that both the protein and mRNA distributions are typically non-symmetric and may not be unimodal (Figure 5, Figure 6 and Figure 7). Consequently the mean and the mode are significantly different, and
![]()
Figure 2. These figures are a recorded solution of the main
PDE
, showing a continuous random change in the probability distribution of mRNA (left) and proteins (right) in all 100 cells. It can be seen that most probability values are between 0.3 and 0.7, this indicates that very low and/or very high probability values are rare and our prediction approach is suitable to this problem.
![]()
Figure 3. Shows scatter plots of the probability distribution of both number of mRNA and proteins over time for four different samples, plotted on the same graph. Each column represents the number of mRNA or proteins in all cells at a specific time period.
![]()
Figure 4. Four steps ahead prediction of mRNA numbers in all cells using Equations ((7) and (8)), for j = 1, 2, 3, 4. This result shows a fast decrease of probability values of mRNAs in cells iteration, indicating that mRNAs have a short life time, which is in accordance with biological evidence.
![]()
Figure 5. Shows the frequency of proteins related to Pn values in four generations. The optimal for proteins is around 0.4 (except generation 3), this indicates that on average 40% of probability level will give a better proteins count over time. The frequency distribution shows in all cases an asymmetric distribution, which indicates protein numbers are not normally distributed.
![]()
Figure 6. Shows the online scatter plots of
against
in four different cells. These figures confirm that both processes are strongly correlated over time in each of the cells, indicating that mRNA and protein dynamics (count) are depend over time. This will also hold for m and n since probabilities and mRNA/proteins are correlated.
![]()
Figure 7. Shows the plot of entropy distributions over time in a chosen cell. It can be seen that the maximum entropy reached earlier for proteins and that the right tail is also longer in protein compared to that of mRNA. This may suggest that proteins have a longer life time, compared to mRNA. This evidence is in line with biological knowledge. The next section gives some discussion of the results.
the standard deviation is clearly not constant over time. Such distributions are poorly characterized by Gaussian characteristics. This paper was primarily designed to promote a modelling culture among noise biologists, modellers and to cope with the noise source and consequences in cell development.
7. Conclusion
The advantage of counting single molecules (mRNAs or proteins) is that, one obtains the probability distribution of molecules corresponding to each stage of the “central dogma” of molecular biology for each single gene. The mathematical model developed here differs from those that cellular biologists are accustomed to encountering [3] [5] . Instead of having a continuous and deterministic model of kinetic behavior, the mathematics of gene expression may be described by discrete stochastic models that take into account the numbers of molecules involved at both the mRNA and protein levels variability. Figure 7 shows the plot of entropy distributions over time in a chosen cell. We have found that the maximum entropy reached earlier for proteins in comparison to mRNAs, the right tail density is also longer in protein in comparison of that of mRNA. This result clearly suggests that proteins have a longer life time, compared to mRNA. This evidence is in line with biological principles.
Acknowledgements
The authors would like to thank Dr. Sakumura for dedicating his precious time, giving many insightful comments and suggestions. This work was supported by the GCOE International Senior Research Fellowship, NAIST and the Grant of Aid from the Ministry of Education, Culture, Science and Technology (MEXT) Japan. We also thank the Universities of Sorbonne (France) and AUAF for their generous support and collaboration.