Improvements in the score matrix calculation method using parallel score estimating algorithm

The increasing amount of sequences stored in genomic databases has become unfeasible to the sequential analysis. Then, the parallel computing brought its power to the Bioinformatics through parallel algorithms to align and analyze the sequences, providing improvements mainly in the running time of these algorithms. In many situations, the parallel strategy contributes to reducing the computational complexity of the big problems. This work shows some results obtained by an implementation of a parallel score estimating technique for the score matrix calculation stage, which is the first stage of a progressive multiple sequence alignment. The performance and quality of the parallel score estimating are compared with the results of a dynamic programming approach also implemented in parallel. This comparison shows a significant reduction of running time. Moreover, the quality of the final alignment, using the new strategy, is analyzed and compared with the quality of the approach with dynamic programming.


INTRODUCTION
The biologists need the help of Bioinformatics because everyday they generate a huge amount of data from their experimental results and they need to analyze these results aligning the sequences, searching for some patterns on them and identifying some hot spots [1,2].However, they can not perform this action in time without com-puters [3].
When considering multiple sequence alignments (MSA), the dynamic programming algorithms are efficient only for a limited number of multiple lengthy sequences [4].The MSA algorithms, using stochastic methods, have become a feasible alternative for the situations in which the dynamic programming approach is not convenient [5].
Nevertheless, with the increase of problems' complexity, the number of sequences to be analyzed has grown from several dozens to several thousands [6], demanding more computing power to perform this analysis [3].
With the demand for more computing power, parallel and distributed computing were used to improve the performance of task execution in Bioinformatics, specially stochastic algorithms, thus putting together high performance computing and Bioinformatics.
Stochastic approaches can not arrive at the exact solution, but they try to obtain the best optimality degree of the solution.The progressive MSA is one of the most used stochastic algorithms for aligning sequences with a good performance and a reasonable quality.
The progressive MSA algorithm is divided into three stages: the first stage is the score matrix calculation, the second is the phylogenetic tree construction and the last is the multiple alignment.The score matrix calculation is the most computationally complex of the three, because it performs the pairwise comparisons among sequences, generally using dynamic programming.However, some strategies can be used to reduce the computational complexity of first stage and score estimating is one of them [7,8].
This work shows some results obtained by an implementation of a parallel score estimating technique (which we call the new approach) in the score matrix calculation stage and the comparison of this method with traditional dynamic programming also implemented in parallel (which we call the standard approach).The execution time and the quality of the obtained alignments were compared between the approaches.This new approach can reduce the computational complexity of this step from O(mn) to O(m + n), considering two sequences X and Y of lengths m and n, respectively [8].
This paper is organized as follows: Section II reviews the main concepts of the multiple sequence alignments and discusses some related works; in Section III the score estimating algorithm is described; in Section IV, the obtained results are presented and discussed, and in Section V, the conclusions are presented.

SEQUENCE ALIGNMENT
The sequence alignment is not an easy task and the biologists constantly need evaluations over genes and the characteristics of proteins of species.
The alignment among sequences of DNA/RNA and proteins of different species is a hypothesis of homology among the components of genes and proteins.The alignments can be used as models to propose and test evolutionary hypothesis which are also important to the studies of phylogeny [1,3].
The use of MSA algorithms, the parallelization of them and the optimization techniques for these algorithms have been improved in the last years.Some methods to improve the execution time of MSA algorithm based on pairwise comparison, where the goal is to find an optimal alignment with some restrictions, were proposed [9].However, they did not achieve the efficiency of parallel solutions [10][11][12].
Otherwise, with the growing size of the sequences, the parallel solutions began to lose their high efficiency, and the addition of optimization techniques in the algorithms of parallel solutions became necessary [13,14].
Our work explores all the parallelism power in the solution of multiple sequence alignment problems, developing a parallel version over the sequential score estimating optimization technique proposed by Chen et al. [8], and applying it in the first stage (the score matrix calculation) of a progressive multiple sequence alignment (MSA).We performed comparisons in the execution time of this method with traditional dynamic programming implemented in parallel and also applied in this first stage.

SCORE ESTIMANTING
Usually, the standard progressive MSA algorithm uses in its first step the dynamic programming algorithm.However, from our research experience we realized that there are different strategies to calculate this score matrix (a matrix containing the final score of aligned pairs) which work better.More specifically, the standard approach is based on the progressive algorithm of Clustal [6,15], which is a well known, largely used and conescrated strategy in Bioinformatics.The source code of the tool can be obtained in the web page of the European Bioinformatics Institute 1 .Basically, we have taken the first stage of this algorithm, the pairwise alignment, and parallelized it (standard approach) as can be seen in Figure 1.The sequence pairs are distributed among the processors of the parallel machine and then each processor calculates the score of each pairwise alignment.When the processor finishes the task, it returns the score value to be written in the score matrix.
In the new approach, we used the estimating score technique to obtain the score matrix results.However we did not use it in a sequential way, as it is reported in the literature [8]: we parallelized the stages of this technique to improve the performance of the algorithm.To the extent of our knowledge, this approach was not previously reported in the literature.
Each one of the four stages of the estimating score algorithm performs a scan in the sequences which are placed in pairs.Each pair of sequences has to perform the four stages, necessarily.The stages are classified as: Right-Upper, Right-Lower, Left-Upper and Left-Lower.The classifications Upper and Lower are related to the position of the sequences in the analysis.The Right and Left movements are related to the scanning directions.The maximum score among four stages is used to the matrix score.
Figure 2 shows an illustration of the developed parallel score estimating algorithm.To illustrate, it will be explained the Right-Upper execution, which is executed in one node.The other steps (Left-Upper, Left-Lower and Right-Lower) work likewise, the differences are the sequences scanning directions and the starter character.
In the Right-Upper execution, the last character on the right side of the upper sequence is chosen and set as the   starter character.In this case, it is the character A. Departing from it, a series of comparisons with the characters of lower sequence are performed, going from right to left (this is the direction of scanning).If there is no equal character (a match) in the lower sequence, the algorithm moves to the next left character in the upper sequence, (i.e.considering it the new starter character) and repeats the complete scanning process again.Otherwise, if an equal character is found in the lower sequence (this is the case in our example, when we find a match), one point is scored and the scanning starts again, now taking the new starter character in the lower sequence.The match is with the character A (third from right to left in the lower sequence), as it can be seen in the Figure 2. The new starter character, now in the lower sequence, is the next left character from where is the match.Then, taking character T as the new starter character (fourth character from right to left in the lower sequence), we perform comparisons with upper sequence, starting from its second character, which is also character T, from right to left in this sequence.This process is repeated until the two sequences are covered.
The square around the illustration of sequence pairs indicates that each task is performed in a different processor unit.The distribution of the stages is done as soon as the processor is or becomes available.This approach is possible, because the stages are totally independent.
The algorithm might be executed in any amount of processors, because the distribution of the stages is done through an order queue, where the four stages of the pair of sequences in time are distributed for the processors and, if there are more processors available, the stages of the next pair of sequences in the queue are distributed too, until all the processors are working (busy).This process is repeated until the end of the pairs of sequences and their stages.Below, we present an algorithm for this control: While (pair_of_sequences_has_not_yet_executed > 0) { waiting(); //waiting message of processor's availability check_the_order_queue(); allocate_the_correct_stage_of_time(); decrement_counter_pair(); }

RESULTS
In this section we report the results that demonstrate the improvement in the performance of the new approach, implemented by the score estimating algorithm, when compared to the standard multiple progressive alignment, implemented with parallel dynamic programming.It is important to emphasize that the performance results showed here are related only to the execution time of the first stage of the algorithm.
The tests were performed with 550 residues on average for nucleotides and with 180 residues on average for amino acids, with different amounts of sequences for both approaches.They were run under a Linux Debian Beowulf cluster of Athlon XP 2100 + with 9 operational nodes.The front-end node has 2 GB of memory and 2 disks of 80 GB each, and the other 8 nodes have 1 GB of memory and 1 disk of 80 GB for each node.The communication interface is based on Fast Ethernet 10/100 and uses MPICH as a communication library.
Figure 3 shows the execution time results for tests with nucleotides, for the standard and new approaches.
It can be seen in Figure 3 that the new approach, using the score estimating, has better performance than the standard approach.The improvement in the execution time is around 15%.
Figure 4 shows the execution time results for tests  with amino acids for the standard and new approaches.
Analyzing Figure 4, it can be seen an improvement in the execution time of the new approach when both approaches are compared.The reduction of the execution time is around 12%.
From the results showed previously, the execution time reduction was significant and, consequently, the speedup of the new approach is improved too.
Besides the performance results showed earlier, the quality of the final alignments produced were analyzed, to show that the results produced in score matrix of new approach are correct to be used in the other stages, when compared with the results of the standard approach, and produces results with biological significance.Since as both approaches are the first stage of a MSA algorithm, the quality tests of the final alignments were performed by completing the other two stages of the MSA algorithm, that are the phylogenetic tree construction and the multiple alignment, with the same algorithm and rules.
On average, the new approach showed very close results (slightly better, some times) when compared with the standard approach, as presented in the Table 1.It shows some results about the quality of the final alignments for some amino acids sequences.The two first columns are, respectively, the name of the analyzed gene (Gene) and its number of sequences (N.Seq).The other two columns (New Approach and Standard Approach) are the score values of the tools ranging from 0 (worst) to 1 (best).These score values were obtained by the bali_score software 2 the choice of sequences was completely randomic, and it was made on the BaliBase website 3 [16].The BaliBase is an aligned protein database, and each set of amino acids sequence alignments is made

CONCLUSIONS
The present work reports some improvements in progressive MSA through the application of the score estimating as an optimization technique.The results show that the new approach when compared with the standard one can achieve better results in performance evaluation when we compare the execution times.Besides, the new approach assured the quality results of the final alignments when compared with the standard approach.
The obtained results are relevant because the improvement of the performance with the assurance of the alignments quality makes this approach interesting and very useful for computer scientists and biologists.Moreover, it might be portable and implementable in a Grid or GPU environments to work with even larger data sets.

Figure 2 .
Figure 2. Illustration of the parallel score estimating al gorithm.

Figure 3 .
Figure 3. Performance evaluation of standard approach (SA) and new approach (NA) algorithms for sequences with nucleotides.

Figure 4 .
Figure 4. Performance evaluation of standard approach (SA) and new approach (NA) algorithms for sequences with amino acids.

Table 1 .
Quality comparison between the new approach and the standard approach.
manually through a structural study of the protein.This ensures that alignments of the BaliBase are 100% reliable.The bali_score software compares the BaliBase alignment with the alignment obtained by different tools.