Condensed Matrix Descriptor for Protein Sequence Comparison

The present paper develops a novel way of reducing a protein sequence of any length to a real symmetric condensed 20 × 20 matrix. This condensed matrix can serve as a protein sequence descriptor: with such a representation, comparison of two protein sequences reduces to comparison of two 20 × 20 matrices. As each square matrix has a unique ALE index (and normalized ALE index), this index is used to obtain the distance matrix from which phylogenetic trees of different protein sequences are constructed. Protein sequences are then compared on the basis of these phylogenetic trees. In this paper, three types of protein sequences, namely NADH dehydrogenase subunit 3 (ND3), subunit 4 (ND4) and subunit 5 (ND5), from nine species (Human, Gorilla, Common Chimpanzee, Pygmy Chimpanzee, Fin Whale, Blue Whale, Rat, Mouse and Opossum) are used for comparison.


Introduction
A protein is a linear chain built from 20 kinds of amino acids. Its coding sequence starts with the start codon ATG, which corresponds to the amino acid methionine, followed by a sequence of amino acids, and ends with a stop codon. The amino acid sequence that makes a protein is called its primary structure. Protein sequence analysis means analysis of this primary structure. It provides important insight into the structure of proteins, which in turn greatly facilitates the understanding of their biochemical and cellular function. Efforts to use computational methods in predicting pro- [...] acids, so in this case, the final reduction is a 9 × 9 matrix.

First Step: To Calculate Distance of Each Label from the Neighboring Labels
In the first step we construct the "distance" of each label from the neighboring labels of the same and of different kinds of amino acid. It is calculated by numbering the amino acids in the protein sequence, starting from 0 (zero) for the first amino acid and from 1 (one) for the other amino acids. Thereby we get a 12 × 12 matrix, where 12 is the length of the protein sequence (as shown in Table 2). The entries of the matrix represent frequencies of occurrences of amino acids.

Second Step: To Group Together All Similar Amino Acids
The second step involves grouping together all similar amino acids. First, the amino acids are taken alphabetically as D, E, K, L, M, P, R, V, W. Next, the amino acids of the same kind are grouped together. The elements of the matrix correspond to the serial distance of the first 12 amino acids. The rearranged 12 × 12 matrix is shown in Table 3.
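The regrouping in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 12-residue string `MEPVDPRLEPWK` is a stand-in chosen because its distinct residues are exactly D, E, K, L, M, P, R, V, W (the contents of Table 1 are not reproduced in this text), and the matrix entry for positions i and j is assumed to be the serial distance |i − j|. The function names are hypothetical.

```python
# Illustrative stand-in for the 12-residue sample (assumption, not Table 1 itself).
SAMPLE = "MEPVDPRLEPWK"

def serial_distance_matrix(seq):
    """Assumed entry definition: |i - j| for sequence positions i and j."""
    n = len(seq)
    return [[abs(i - j) for j in range(n)] for i in range(n)]

def grouped_order(seq):
    """Positions sorted alphabetically by residue, keeping sequence order within a group."""
    return sorted(range(len(seq)), key=lambda i: (seq[i], i))

def rearrange(matrix, order):
    """Apply the same permutation to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]

order = grouped_order(SAMPLE)
labels = "".join(SAMPLE[i] for i in order)
rearranged = rearrange(serial_distance_matrix(SAMPLE), order)
print(labels)
```

Because rows and columns are permuted together, the rearranged matrix stays symmetric, and equal residues end up in contiguous blocks that can later be treated as sub-matrices.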

Fourth Step: To Obtain Final Reduced 9 × 9 Condensed Matrix
Not all the sub-matrices are square, so we cannot obtain eigenvalues for all of them. Mathematically it may be shown that the average of all the elements of a square matrix approximates the highest eigenvalue of that matrix. So in this step we take the average of all the elements of each sub-matrix and finally get the 9 × 9 condensed matrix given in Table 7.
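The block-averaging idea can be sketched as follows. The code rests on the same assumptions as before and is not the paper's exact procedure: the entry for positions i and j is taken to be |i − j|, the sample `MEPVDPRLEPWK` is an illustrative stand-in, and a residue pair that never occurs is given the value 0.0. The name `condensed_matrix` is hypothetical.

```python
def condensed_matrix(seq, alphabet=None):
    """Average the assumed |i - j| entries over each residue-pair sub-matrix."""
    if alphabet is None:
        alphabet = sorted(set(seq))
    # Positions at which each residue type occurs.
    positions = {aa: [] for aa in alphabet}
    for i, aa in enumerate(seq):
        positions[aa].append(i)
    result = []
    for a in alphabet:
        row = []
        for b in alphabet:
            pa, pb = positions[a], positions[b]
            if pa and pb:
                total = sum(abs(i - j) for i in pa for j in pb)
                row.append(total / (len(pa) * len(pb)))
            else:
                row.append(0.0)  # assumption: absent pair contributes zero
        result.append(row)
    return list(alphabet), result

alphabet, cond = condensed_matrix("MEPVDPRLEPWK")
print(alphabet)
print(len(cond), "x", len(cond[0]))
```

The sub-matrix for a residue pair has as many rows as occurrences of the first residue and as many columns as occurrences of the second, which is why it is generally rectangular and an average is used in place of an eigenvalue.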

Construction of 20 × 20 Condensed Matrix for (HIV 1) Tat Protein
We consider the sequence of the Human Immunodeficiency Virus 1 (HIV 1) Tat protein, which has 86 amino acids. By following steps one to three as above, we get the 86 × 86 rearranged matrix and calculate the sub-matrices. Two such sub-matrices, AA and CG, are given in Table 8 and Table 9 respectively. The final 20 × 20 condensed matrix is given in Table 10.

Sensitivity of the 20 × 20 Condensed Matrix
Now we change the protein sequence of Human Immunodeficiency Virus 1 slightly. We interchange the 5th and 56th amino acids, i.e. we take the 5th amino acid as R instead of D and the 56th amino acid as D instead of R, and we get the sequence shown in Table 11.
The final 20 × 20 condensed matrix of the sequence of Table 11 is given in Table 12. To test for sensitivity, we generate a 20 × 20 matrix that contains the cell-by-cell differences of the contents of Table 10 and Table 12. This is given in Table 13.
The result shows that our method of constructing the 20 × 20 condensed matrix is highly sensitive: even a small change in the sequence of a protein affects the content of the final 20 × 20 condensed matrix.
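A rough way to reproduce this kind of sensitivity check is sketched below. Several things here are assumptions. The 86-residue string is the one quoted in this paper; it already has R at position 5 and D at position 56, so it is treated as the modified sequence, and the swap is undone to recover the original. The matrix entry is again assumed to be |i − j|, so the numbers will not necessarily match Tables 10, 12 and 13. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Sequence as quoted in the paper (assumed to be the modified sequence).
QUOTED = ("MEPVRPRLEPWKHPGSQPKTACTNCYCKKCCFHCQVCFITKALGISYGRK"
          "KRRQRDRPPQGSQTHQVSLSKQPTSQSRGDPTGPKE")

def swap_residues(seq, i, j):
    """Interchange the residues at 1-based positions i and j."""
    s = list(seq)
    s[i - 1], s[j - 1] = s[j - 1], s[i - 1]
    return "".join(s)

def condensed_matrix(seq, alphabet=AMINO_ACIDS):
    """20 x 20 matrix of block averages of the assumed |i - j| entries."""
    positions = {aa: [] for aa in alphabet}
    for i, aa in enumerate(seq):
        positions[aa].append(i)
    def block_mean(pa, pb):
        if not pa or not pb:
            return 0.0  # assumption: absent pair contributes zero
        return sum(abs(i - j) for i in pa for j in pb) / (len(pa) * len(pb))
    return [[block_mean(positions[a], positions[b]) for b in alphabet]
            for a in alphabet]

modified = QUOTED
original = swap_residues(modified, 5, 56)

m_orig = condensed_matrix(original)
m_mod = condensed_matrix(modified)
diff = [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(m_orig, m_mod)]

# Residue pairs whose cell changed after the swap.
changed = [(AMINO_ACIDS[r], AMINO_ACIDS[c])
           for r in range(20) for c in range(20) if diff[r][c] != 0]
print(len(changed), "cells differ")
```

Under these assumptions only the rows and columns for D and R can change, because the swap alters the position lists of those two residues alone.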

Comparison of Protein Sequences
Having illustrated how to reduce a protein sequence to a condensed 20 × 20 matrix, the problem of comparing two protein sequences reduces to the problem of comparing two 20 × 20 matrices. In this paper we solve this problem using the ALE index [17].
The ALE index χ of a matrix M is defined as in [17].

Table 10.
20 × 20 condensed matrix of the protein sequence of Human Immunodeficiency Virus 1 (HIV 1) Tat protein.

MEPVRPRLEPWKHPGSQPKTACTNCYCKKCCFHCQVCFITKALGISYGRKKRRQRDRPPQGSQTHQVSLSKQPTSQSRGDPTGPKE
The ALE index is very simple to calculate, so it can be used directly to handle long sequences. If desired, one can introduce a weighting procedure that normalizes the magnitudes of the ALE indices to reduce variations caused by comparing matrices of different sizes. For instance, instead of χ one can consider a normalized ALE index χ' = χ/n, where n is the length of the sequence and also the order of the corresponding matrix.
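As a minimal sketch of the computation, the code below assumes the form of the ALE index commonly quoted in the literature, χ = ½(‖M‖_m1 / n + sqrt((n − 1)/n) · ‖M‖_F), where ‖M‖_m1 is the sum of the absolute values of all entries and ‖M‖_F is the Frobenius norm. The defining equation is not legible in this text, so this form is an assumption and should be checked against [17]. The function names are hypothetical.

```python
import math

def ale_index(m):
    """ALE index under the assumed form 0.5 * (m1/n + sqrt((n-1)/n) * Frobenius)."""
    n = len(m)
    m1_norm = sum(abs(x) for row in m for x in row)
    frobenius = math.sqrt(sum(x * x for row in m for x in row))
    return 0.5 * (m1_norm / n + math.sqrt((n - 1) / n) * frobenius)

def normalized_ale_index(m):
    """chi' = chi / n, with n the order of the matrix."""
    return ale_index(m) / len(m)

example = [[0.0, 1.0],
           [1.0, 0.0]]
chi = ale_index(example)
chi_norm = normalized_ale_index(example)
print(chi, chi_norm)
```

For the 2 × 2 example the two terms are both close to 1, so χ is close to 1, which is also the leading eigenvalue of that matrix. Only sums and one square root are needed, which is why the index scales easily to large matrices.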

Sequences for Comparison
We have used the NADH dehydrogenase subunit 3 (ND3), subunit 4 (ND4) and subunit 5 (ND5) protein sequences of nine species for comparison as shown in Table 14.

Measures of Comparison of Sequences from Reduced Matrices
First we construct the 20 × 20 matrices for the nine protein sequences of ND3, ND4 and ND5. Then we calculate the difference for each pair of protein sequences. For example, the difference of the 20 × 20 matrices of Human and Gorilla for the ND5 protein sequences is shown in Table 15. In this way we get 36 matrices for each type of protein (ND3, ND4 and ND5). Then we calculate the χ' values of the 36 matrices for each type of protein (ND3, ND4 and ND5).

Table 13.
20 × 20 matrix of the differences of Table 8 and Table 10.
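The pairwise comparison described in this section can be sketched as below. Nine species give 9 · 8 / 2 = 36 unordered pairs, which matches the 36 difference matrices mentioned in the text. The ALE index form is the assumed one, ½(‖M‖_m1 / n + sqrt((n − 1)/n) · ‖M‖_F), and the small 2 × 2 matrices are placeholders used only to make the example run; they are not data from the paper. The function names are hypothetical.

```python
import math
from itertools import combinations

SPECIES = ["Human", "Gorilla", "Common Chimpanzee", "Pygmy Chimpanzee",
           "Fin Whale", "Blue Whale", "Rat", "Mouse", "Opossum"]

def ale_index(m):
    """Assumed ALE index form; see the hedged definition in the lead-in."""
    n = len(m)
    m1_norm = sum(abs(x) for row in m for x in row)
    frobenius = math.sqrt(sum(x * x for row in m for x in row))
    return 0.5 * (m1_norm / n + math.sqrt((n - 1) / n) * frobenius)

def pairwise_normalized_ale(matrices):
    """chi' of the difference matrix for every unordered pair of names."""
    result = {}
    for a, b in combinations(list(matrices), 2):
        diff = [[x - y for x, y in zip(ra, rb)]
                for ra, rb in zip(matrices[a], matrices[b])]
        result[(a, b)] = ale_index(diff) / len(diff)
    return result

# Placeholder matrices (NOT real condensed matrices): [[0, k], [k, 0]].
placeholder = {name: [[0.0, float(k)], [float(k), 0.0]]
               for k, name in enumerate(SPECIES)}

distances = pairwise_normalized_ale(placeholder)
print(len(distances), "pairs")
```

The resulting dictionary plays the role of the upper triangle of a distance matrix, which is the input a tree-building method would need.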

Discussion
In this paper we introduce a novel characterization of protein sequences using condensed matrices that are based on average distances for pairs of amino acids, obtained as quotients of sequential numbers and serial numbers in the primary sequences. Such matrices not only offer some insight into the nature of the protein sequence but also allow one to make qualitative and quantitative comparisons between different protein sequences, whether within the same species or between different species.
The method of construction of the 20 × 20 condensed matrix of the protein sequence reveals that:
• The representation of the protein sequence as a 20 × 20 condensed matrix is unique.
• The condensed form of representation may help in comparing two protein sequences of unequal lengths.
• It is applicable to a sequence of any finite length, however large it may be.
• The phylogenetic trees (Figures 1-3) of nine species for three different types of proteins (ND3, ND4 and ND5) agree with the standard phylogenetic tree of the same species.
• It is a comparatively easier form of comparison of protein sequences.

Conclusion
Condensed matrix representation of protein sequences is a useful tool. It is applicable to the comparison of protein sequences of equal or unequal lengths and of any finite size, however large it may be. It is also accurate in comparing the protein sequences of the aforesaid types.

Table 1.
Small sample of protein sequences.

Table 2.
12 × 12 matrix obtained by considering the 12 amino acids from the beginning of the sequence of Table 1.

Table 12.
20 × 20 condensed matrix of the modified protein sequence of Human Immunodeficiency Virus 1 (HIV 1).

Table 14.
List of nine species with their versions and lengths.

Table 15.
Difference of 20 × 20 matrices of human and gorilla (ND5).

The results for ND3, ND4 and ND5 are shown in Tables 16-18 respectively. Then we construct the phylogenetic trees for each type of protein (ND3, ND4 and ND5) for the nine species (Human, Gorilla, Common Chimpanzee, Pygmy Chimpanzee, Fin Whale, Blue Whale, Rat, Mouse and Opossum). The results are shown in Figures 1-3 respectively.

Table 16.
ALE index for pairs of the nine species (ND3).

Table 17.
ALE index for pairs of the nine species (ND4).

Table 18.
ALE index for pairs of the nine species (ND5).