A MCMC strategy for group-specific 16S rRNA probe design

Revealing biodiversity in microbial communities is essential in metagenomics researches. With thousands of sequenced 16S rRNA gene available, and advancements in oligonucleotide microarray technology, the detection of microorganisms in microbial communities consisting of hundreds of species may be possible. Many of the existing strategies developed for oligonucleotide probe design are dependent on the result of global multiple sequences alignment, which is a time-consuming task. We present a novel program named OligoSampling that uses MCMC method to design group-specific oligonucleotide probes. The probes generated by OligoSampling are group specific with weak cross-hybridization potentials. Furthermore a high coverage of target sequences can be obtained. Our method does not need to globally align target sequences. Locally aligning target sequences iteratively based on a Gibbs sampling strategy has the same effect as globally aligning sequences in the process of seeking group-specific probes. OligoSampling provides more flexibility and speed than other software programs based on global multiple sequences alignment.


INTRODUCTION
Metagenomics is a new field combining molecular biology and genetics in an attempt to reveal the vast scope of biodiversity in a wide range of environment, as well as new functional capacities of individual cells and communities, and the complex evolutionary relationships between them [1,2,3,4].Apparently, revealing biodiversity in microbial communities is the first step [1,5].The vast majority of microbial diversity had been missed by cultivation-based methods [2].The analysis of 16S rRNA gene sequences is the most common approach to determine microbial diversity [6].
A critical issue for oligonucleotide microarray design is to find appropriate oligonucleotide probes specific to their target sequences.To improve efficiency in probe design, many softwares or databases have been developed.Kaderali et al. proposed a combination of suffix trees and dynamic programming based alignment algorithms to compute melting temperature ( T ), and presented an efficient algorithm to select probes with high specificity in detecting the target [24].Loy et al. built a comprehensive database containing more than 700 published rRNA-targeted oligonucleotide probe sequences with supporting bibliographic and biological annotation [25].Kumar et al. provided a software package ARB to evaluate sequence alignments and oligonucleotide probes with respect to three-dimensional structure of ribosomal RNA [26].DeSantis et al. proposed an alignment compression algorithm, NAST (Nearest Alignment Space Termination), to find Operational Taxonomic Units (OTUs) for automated design of effective probes [27,28].m Global multiple sequences alignment plays an important role in comprehensive analysis of group-specific oligonucleotide probe for those methods mentioned above.It is a challenge for personal computer to align a large amount of sequences.Here we present a novel program named OligoSampling that uses MCMC method to design group-specific oligonucleo-tide probes.OligoSampling does not need to globally align target sequences.This MCMC method is more flexible and efficient.

Algorithm Overview
Figure 1 summarizes the global overview of the Oligo-Sampling algorithm.We assume that we are given a set of N sequences and that those N sequences are divided into m group.We seek a group-specific oligonucleotide probe for each group.For a particular probe of specified width W segments in probe binding sites within each sequence are mutually similar.Conversely we seek within each sequence mutually similar segments.The segments can be regarded as binding sites for a particular probe.If the probe is specific to group A, segments within sequences in group A are more mutually similar.And to avoid cross-hybridization, there need to be more sequence variants between segments within sequences of group A and segments within sequences in any other group.In the process seeking mutually similar segments, patterns shared by multiple sequences in each group are obtained.
The algorithm maintains an evolving data structures.The data structure is a set of positions , for k from 1 to N. For a particular set of positions we obtain the pattern descriptions of mutually similar segments in each , ,  In the process of seeking group-specific probe for group i the algorithm is initialized by choosing starting positions   0 k a within the various sequences.A Gibbs sampling-based local multiple alignment algorithm [29] is applied to update a .After alignment the algorithm then proceeds through many iterations to execute the following three steps: 2 Otherwise, positions resume to   1 k a .Then, go back to step 1.
In the objective function sensitivity of the probe is measured by entropy, and specificity is measured by Kullback-Leibler divergence.By selecting a set of that maximizes the objective function, the algorithm finds the most group-specific oligonucleotide probe for group i.

DISCUSSION
The secondary structure model of 16S rRNA consists of a conserved core that is interspersed with a number of variable areas.16S rRNA evolves slowly and is often not very convenient to resolve bacterial strains at the species level.Therefore, multiple copies of 16S rRNA gene in different species share high sequence similarity.In jump step each aligned position is updated by jumping a same distance in the same direction.Convergence will be achieved rapidly in the following alignment due to high sequence similarity between sequences (Figure 3).In the process of updating positions , Gibbs sampling-based local multiple alignment algorithm is applied to keep sequences locally aligned.And then, sensitivity and specificity of probe is evaluated based on entropy and Kullback-Leibler divergence.This process is functionally equivalent to oligonucleotide probe design based on global multiple sequences alignment, but has less calculation amount.Sets of target sequences are highly similar to each other.We choose initial starting positions by letting the distance from 5' termini to initial starting position in each target sequence be proportional to sequence length.As shown in Figure 3, Local alignment beginning with proportional starting positions can achieve faster convergence speed than that beginning with random starting positions.In this method, the algorithm is initialized by choosing a random proportion.And the proportion is used to find starting position in each target sequence.
It is noted that if a probe can match well with several segments in a 16S rRNA, local multiple alignment can not analyze this situation comprehensively.We have analyzed the possibility that a probe matches well with several segments in a 16S rRNA.At present, the Ribosomal Database Project [30] (RDP) has collected around 280,952 16S rRNA sequences in bacteria superkingdom.Among the 280,952 16S rRNA sequences we picked out a 16S rRNA sequence and chose a segment in this 16S rRNA randomly.And then in this 16S rRNA sequence another segment which is most similar to the original segment was found.Mismatches of the two segments were counted.The procedure mentioned above was iterated 10,000 times.As shown in Figure 4, there is little possibility that a probe can match well with several segments in a 16S rRNA.
The problem of group-specific probe selection is further complicated as all probes must work under the same hybridization condition.To ensure the closer optimum 16S rRNA sequences in bacteria superkingdom collected by RDP, we picked out a 16S rRNA sequence and chose a segment in this 16S rRNA randomly.And then in this 16S rRNA sequence another segment which is most similar to the original segment was found.Mismatches of the two segments were counted.The procedure mentioned above was iterated 100,000 times.hybridization condition in one chip including all oligos, probe lengths must be variable.By adjusting probe lengths (increase or decrease base-pairs in 5' and 3' ends), probe whose melting temperatures ( ) are closer to a predefined optimum hybridization condition will be generated.As shown in Table 1, probe lengths are selected based on a predefined melting temperature.Melting temperatures of all probes can be close enough to work well under the same hybridization condition.m T

CONCLUSIONS
OligoSampling provides an efficient alternative for group-specific oligonucleotide probe design.Using this method we do not need to globally align target sequences.Locally aligning target sequences iteratively based on a Gibbs sampling strategy has the same effect as globally aligning sequences in the process of seeking group-specific probes.OligoSampling provides more flexibility and speed than other software programs based on global multiple sequences alignment.Furthermore, search for multiple local optimal solutions beginning with multiple initializations can improve the effect of this algorithm in group-specific probe design.This algorithm can be parallelized conveniently.

A Gibbs Sampling Strategy for Local Multiple Alignment
Lawrence et al. have described a Gibbs sampling algorithm for local multiple alignment of protein sequences [29].We applied Lawrence's method to locally align target sequences.Pattern shared by multiple sequences is described in the form of a probabilistic model of base pair frequencies for each position i from 1 to W, and consisting of the variables .This pattern description is accompanied by an analogous probabilistic description of the "background frequencies" with which base pairs occur in sites not described by the pattern., , Through many iterations to execute two steps of the Gibbs sampler an objective function (Eq.2) evaluating the alignment is maximized.In the study, the two steps of the Gibbs sampler were executed to locally align target sequences.
For the ith position of the pattern we have 1 N  observed nucleotides, because sequence z has been excluded; let be the count of nucleotide j in this position.These are supplemented with nucleotide-dependent "pseudocounts" to yield pattern probabilities.
B is the sum of the , we accept the update of positions z a .Otherwise, we reject the update.Then go back to first step.

Selection of Weight Set in Sensitivity and Specificity Evaluation
Pozhitkov et al. examined the effects of single-base pair mismatch (all possible types and positions) on signal intensities of hybridization through a series of calibration experiments [31].The results of experiments indicated that the most optimal discrimination of mismatch from perfect match probe-target duplexes is provided with the mismatch in the middle of the duplex.To suppress non-specific binding of probe to target, the position of the mismatch should be moved from the 5' or 3' termini to the center of the probe.Therefore, in evaluation of sensitivity and specificity of group-specific oligonucleotide probes the weight coefficients j 1 to m, in the form of a probabilistic model of base pair frenquencies for each position j from 1 to W, and consisting of the variables .

Figure 1 .
Figure 1.Algorithm overview: The algorithm is initialized by choosing starting positions , for k from 1 to N, within the various sequences.A Gibbs sampling-based local multiple alignment algorithm [29] is applied to update .After an identical jump for each position , the Gibbs sampling-based local multiple alignment algorithm [29] is applied again to update .An objective function defined to evaluate sensitivity and specificity of probes is calculated based on positions before jump and after alignment respectively.The ratio of objective values is compared to a random number

k a 2 . 2 .
Design Group-Specific Oligonucleotide Probes for 5 Bacterial SpeciesTo examine the performance of OligoSampling on group-specific probe design, we select 5 bacteria species (Afipia sp., Bordetella pertussis, Brucella sp., Escherichia coli O157:H7, and Mycobacterium tuberculosis) with a total of 40 sequences.For OligoSampling, the input dataset is 5 clusters of homologous sequences.The objective of this algorithm is to find a group-specific oligonucleotide probe for each cluster.As shown in Figure2, the probes generated by OligoSampling are group specific with weak cross-hybridization potentials.Oligo-Sampling also obtains a high coverage of target sequences.

Figure 3 .
Figure 3. Convergence of local alignment: Optimal objective values converge in 10000 iterations under three conditions: after jump, beginning with starting positions proportional to sequence length, and beginning with random starting positions.To make the comparison of convergence speed independent of initialization, ten alignment procedures beginning with different sets of starting positions are executed under each condition.Average optimal objective values in iterations 1 to 10000 were normalized to that in last iteration.

Figure 4 .
Figure 4. Mismatch in similar segments: Among 280,952 16S rRNA sequences in bacteria superkingdom collected by RDP, we picked out a 16S rRNA sequence and chose a segment in this 16S rRNA randomly.And then in this 16S rRNA sequence another segment which is most similar to the original segment was found.Mismatches of the two segments were counted.The procedure mentioned above was iterated 100,000 times.

p
one of the N sequences, z, is chosen in specified order.The pattern description and backare calculated, as described in Eq.3, from the current positions in all sequences excluding z.
Second step, the probability x Q of generating segment x in position z a according to the current pattern description are calculated, as are the probability , current segment x.And then draw a sample d from a normal distribution.Positions z a are updated to .In the updated position weight

 (in Eq. 1 )
should dependent on the signal intensity values of mismatches in different positions.Based on normalized signal intensity values of mismatch duplexes at position 1 to 20 provided by Pozhitkov et al., the weight coefficients j  are assigned.Position in the middle has the highest weight.
Local multiple alignment step.Gibbs samplingbased local multiple alignment algorithm is applied re-garding positions  