Favorable and unfavorable amino acid residues in water-soluble and transmembrane proteins

We analyzed the amino acid residues present in the water-soluble and transmembrane proteins of 6 thermophilic and 6 mesophilic species of the domains Archaea and Eubacteria, and characterized them as favorable or unfavorable. The characterization was performed by comparing the observed number of each amino acid residue with the expected number calculated from the percentage of nucleotides present in each gene. Amino acids that were more or less abundant than expected were considered favorable or unfavorable, respectively. Comparisons of amino acid compositions indicated that the water-soluble proteins were rich in charged residues such as Glu, Asp, Lys, and His, whereas hydrophobic residues such as Trp, Phe, and Leu were abundant in transmembrane proteins. Interestingly, although the Trp residue was abundant in transmembrane proteins, it was not defined as favorable by our calculations, indicating that a high abundance of a particular amino acid does not necessarily mean that it is a favorable residue. Amino acids encoded by G + C-rich codons, such as Ala, Gly, and Pro, were frequently observed as favorable in species with low G + C content. Conversely, amino acids encoded by G + C-poor codons, such as Phe, Tyr, Lys, Ile, and Met, were frequently observed as favorable in species with high G + C content. These are examples of deviations from expectation that increase the supply of an amino acid when its expected level is low. Amino acids with neutral G + C content in their codons, i.e., Glu and Asp, were favorable in water-soluble proteins from all species analyzed, and Cys was unfavorable in both water-soluble and transmembrane proteins. These results indicate that amino acid compositions are essentially determined by the nucleotide sequence of the genes, and that the amino acid content is further adjusted by deviations from expectation.


INTRODUCTION
Proteins can be roughly classified into 2 types: water-soluble proteins and transmembrane proteins. The transmembrane proteins have membrane-spanning regions, which contact the hydrophobic environment of the lipid bilayer and are largely composed of amino acids with nonpolar side chains [1][2][3]. In contrast, water-soluble proteins have more charged residues than transmembrane proteins, and therefore, the amino acid compositions differ between the 2 types of proteins. We recently reported that the dinucleotide composition of the genes coding for water-soluble proteins differs from that of the genes encoding transmembrane proteins [4]. The genes encoding water-soluble proteins are rich in the purine dimers AA, AG, and GA, whereas those encoding transmembrane proteins are rich in the pyrimidine dimers TT, CT, and TC. This trend was observed in thermophilic and mesophilic species of Archaea and Eubacteria. The AA, AG, and GA dinucleotides are components of the codons of the charged residues Glu, Asp, Lys, and Arg, whereas the TT, CT, and TC dinucleotides are components of the codons of the hydrophobic residues Leu, Ile, and Phe. Because the AA, AG, and GA dinucleotides are complementary to TT, CT, and TC, this finding revealed a simple strategy for producing water-soluble and transmembrane proteins with distinct characteristics: using the DNA sequences on opposing strands.
The primary structure of a protein depends on the nucleotide composition of the protein-coding gene. Therefore, if the order of the coding nucleotides were random, the amino acid content would correlate with the values calculated from the nucleotide composition. The G + C content of bacterial genomes varies from 25% to 75% between species, but it is relatively constant within a bacterial genome [5,6]. The nucleotide sequences of bacterial genes have species-specific dinucleotide compositions [7][8][9]. Previous studies identified correlations between the nucleotide composition of genes and the amino acid content of proteins on a genome-wide scale [10][11][12]. However, as water-soluble and transmembrane proteins have different amino acid and nucleotide compositions, it is necessary to analyze them separately, as in Lobry's study [13]. Studies of amino acid compositions from various species have revealed that the proteins of thermophiles have more charged amino acids than the proteins of mesophiles [14][15][16][17][18], whereas halophilic proteins contain more Asp residues [19].
In this study, we analyzed amino acid compositions in water-soluble and transmembrane proteins, taking into account the different nucleotide compositions of their coding sequences. We characterized amino acids as favorable or unfavorable depending on whether they were observed more or less often than expected. The favorable and unfavorable residues were used to understand the relationship between G + C content and protein composition in thermophilic and mesophilic Archaea and Eubacteria species spanning a wide range of G + C content.

Selection of Water-Soluble and Transmembrane Proteins
The proteins were classified as water-soluble or transmembrane proteins according to the annotations in the genome to protein structure and function (GTOP) database [31]. The SOSUI program [3] was used in the GTOP database to predict the transmembrane regions. Proteins with no transmembrane regions were considered water-soluble proteins. Proteins with ≥2 transmembrane regions were used to calculate the amino acid composition of transmembrane proteins. The transmembrane proteins were divided into 100 groups and one protein was randomly selected from each group. The water-soluble proteins were selected in the same way. The selected water-soluble and transmembrane proteins were examined for amino acid sequence similarity by using the BLAST program [32]. Proteins that had ≥30% sequence identity with other selected proteins were replaced to keep the sequence identity below 30%. The amino acid sequences used in this study correspond to the genes in our previous study [4]. All selected proteins were longer than 100 residues.
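The selection procedure can be illustrated with a short sketch. This is not the authors' code: the way the proteins were split into 100 groups is not stated, so the interleaved grouping below is an assumption, the replacement step is approximated by trying another random member of the same group, and `toy_identity` is a hypothetical stand-in for a BLAST-derived percent identity.

```python
import random

def select_representatives(proteins, identity, n_groups=100,
                           max_identity=30.0, min_length=101, seed=0):
    """Pick at most one protein per group, keeping pairwise identity below the cutoff."""
    rng = random.Random(seed)
    # Keep only proteins longer than 100 residues.
    eligible = [p for p in proteins if len(p) >= min_length]
    # Assumption: groups are formed by interleaving; the source does not say how.
    groups = [eligible[i::n_groups] for i in range(n_groups)]
    selected = []
    for group in groups:
        candidates = list(group)
        rng.shuffle(candidates)
        # Take the first random candidate that is dissimilar to all chosen so far.
        for candidate in candidates:
            if all(identity(candidate, s) < max_identity for s in selected):
                selected.append(candidate)
                break
    return selected

def toy_identity(a, b):
    """Percent of matching positions over the shorter sequence (stand-in for BLAST)."""
    n = min(len(a), len(b))
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n

proteins = ["A" * 120, "C" * 120, "A" * 120, "D" * 130, "E" * 50]
chosen = select_representatives(proteins, toy_identity, n_groups=2)
```

In a real analysis the `identity` argument would be replaced by a lookup into pairwise alignment results, and a group with no acceptable candidate would simply contribute no protein.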

Ratios of Observed and Calculated Compositions
The expected amino acid composition was calculated from the products of the mononucleotide frequencies of each gene. For example, a gene consisting of 31.4% adenine, 20.0% cytosine, 26.4% guanine, and 22.2% uracil would have an expected frequency of the Lys residue (AAA and AAG) of 0.314 × 0.314 × 0.314 + 0.314 × 0.314 × 0.264 = 0.0570. The 3 stop codons were not included in the calculation of expected values; therefore, the values were adjusted by a correction factor of 1.062. Thus, in this example, the expected frequency of Lys residues was 6.05%. The expected amino acid compositions for 100 water-soluble and 100 transmembrane proteins were calculated and averaged. These values were then compared with the average observed values, and the ratios of the observed values to the expected values were calculated. The expected dinucleotide composition was calculated as the product of the mononucleotide frequencies of each gene. The averages of the expected dinucleotide compositions for 100 genes encoding water-soluble proteins and 100 genes encoding transmembrane proteins were calculated. Subsequently, the ratios of the observed values to the expected dinucleotide compositions were calculated.
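The worked Lys example can be reproduced with a minimal sketch. It assumes the standard genetic code and interprets the correction factor as a renormalization over the 61 sense codons, i.e., 1 / (1 − total stop-codon probability); this interpretation is not stated explicitly in the text, but it yields the quoted factor of 1.062 for the example composition. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def expected_composition(freq):
    """Return (expected amino acid fractions, correction factor).

    freq maps each base (A, C, G, U) to its fraction in the gene.
    """
    raw = {}
    stop = 0.0
    for codon, aa in CODON_TABLE.items():
        # Probability of the codon under random ordering of nucleotides.
        p = freq[codon[0]] * freq[codon[1]] * freq[codon[2]]
        if aa == "*":
            stop += p
        else:
            raw[aa] = raw.get(aa, 0.0) + p
    # Assumption: rescale so that the 20 amino acid fractions sum to 1.
    correction = 1.0 / (1.0 - stop)
    return {aa: p * correction for aa, p in raw.items()}, correction

example = {"A": 0.314, "C": 0.200, "G": 0.264, "U": 0.222}
expected, correction = expected_composition(example)
print(f"correction factor: {correction:.3f}")
print(f"expected Lys: {100 * expected['K']:.2f}%")
```

Averaging the dictionaries returned for each of the 100 genes in a group would give the group-level expected composition described above.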

Amino Acid Composition
The average amino acid compositions of 100 water-soluble and 100 transmembrane proteins from 12 species are listed in Table 1 with the G + C content of their genes. In T. maritima, Glu, Leu, Lys, Val, and Ile residues were enriched in the water-soluble proteins, whereas in the transmembrane proteins, the Leu, Val, Ile, Phe, and Gly residues were enriched. To show the differences in amino acid content, the ratios of the content of each amino acid in the water-soluble proteins to that in the transmembrane proteins were calculated. In T. maritima, the 3 highest ratios were observed in Cys (3.29 = 0.92/0.28), Glu (2.16 = 10.04/4.64) [...] In contrast, the hydrophobic Trp, Phe, Leu, and Met residues were frequently observed in transmembrane proteins. These results are not surprising, as charged residues are suitable for water-soluble proteins and hydrophobic residues are suitable for transmembrane proteins. In addition, the frequency of some amino acid residues was dependent on the G + C content. For example, the frequency of Ala, Gly, and Pro residues, which are encoded by G + C-rich codons, was increased in genes with high G + C content, whereas the frequency of Ile, Lys, and Asn residues, which are encoded by A + T-rich codons, was decreased in genes with high G + C content. This tendency was observed in both water-soluble and transmembrane proteins, and it is consistent with previous findings [10][11][12][13,33]. The percentages of Lys and Ala residues in the water-soluble proteins plotted against the G + C content of the genes demonstrated an almost linear relationship for proteins from both thermophiles and mesophiles (Figure 1). The Lys content was higher in the thermophilic proteins than in the mesophilic proteins, whereas the opposite was true for the Ala content. This is consistent with our previous finding that at higher temperatures, DNA stability is enhanced by AA and decreased by GC [34]. The Lys content showed an almost linear relationship with the dinucleotide AA content, while the Ala content showed an almost linear relationship with the dinucleotide GC content. The first and second nucleotides are AA in Lys codons, and GC in Ala codons. The genes encoding water-soluble proteins showed slightly higher G + C content than those encoding transmembrane proteins in all species, except P. aeruginosa (Table 1).
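The water-soluble to transmembrane ratios are simple quotients of the percentage compositions. The sketch below uses only the T. maritima values for Cys, Glu, and Phe quoted in the text; the function name is illustrative, and the remaining residues would come from Table 1.

```python
def composition_ratios(water_soluble, transmembrane):
    """Ratio of water-soluble to transmembrane content for each shared residue."""
    return {aa: water_soluble[aa] / transmembrane[aa]
            for aa in water_soluble
            if transmembrane.get(aa, 0) > 0}

# Percent compositions in T. maritima, as quoted in the text.
water_soluble = {"Cys": 0.92, "Glu": 10.04, "Phe": 4.42}
transmembrane = {"Cys": 0.28, "Glu": 4.64, "Phe": 8.49}

ratios = composition_ratios(water_soluble, transmembrane)
# Residues sorted from most enriched in water-soluble proteins to least.
ranked = sorted(ratios, key=ratios.get, reverse=True)
for aa in ranked:
    print(f"{aa}: {ratios[aa]:.2f}")
```

A ratio above 1 marks a residue that is more frequent in water-soluble proteins, and a ratio below 1 marks one that is more frequent in transmembrane proteins.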

Favorable and Unfavorable Amino Acid Residues
The ratios of the observed to the expected amino acid compositions were calculated. Ratios of ≥1.3 were considered favorable and ratios of ≤0.7 were considered unfavorable. The favorable and unfavorable residues are listed in Table 3.
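The threshold rule can be written as a small helper. This is only a sketch: the label "neutral" for ratios between the two cutoffs is our own term, and the ratio of 1.00 in the usage example is a made-up value for illustration, whereas 3.54 and 0.04 are ratios reported in this study.

```python
def classify_ratio(ratio, favorable_cutoff=1.3, unfavorable_cutoff=0.7):
    """Classify an observed/expected ratio using the cutoffs given in the text."""
    if ratio >= favorable_cutoff:
        return "favorable"
    if ratio <= unfavorable_cutoff:
        return "unfavorable"
    return "neutral"  # label assumed here; the text names only the two extremes

examples = {
    "Asp, Halobacterium water-soluble": 3.54,
    "Cys, T. thermophilus transmembrane": 0.04,
    "hypothetical residue": 1.00,
}
labels = {name: classify_ratio(r) for name, r in examples.items()}
```

Because the cutoffs are keyword arguments, the same helper would cover the 1.1 and 0.9 thresholds used for dinucleotides.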
In T. maritima, Glu, Phe, and Lys were favorable residues in both water-soluble and transmembrane proteins.

The percentage of Glu in water-soluble and transmembrane proteins was 10.04% and 4.64%, respectively, whereas that of Phe was 4.42% and 8.49%, respectively. Thus, although the Glu and Phe contents differed between the 2 protein groups, both residues were regarded as favorable in both groups. This is because the expected amino acid compositions differed between the 2 types of proteins owing to the different nucleotide compositions of the 2 types of genes. In T. maritima, Cys, Gln, Arg, and His were estimated as unfavorable in both water-soluble and transmembrane proteins. Generally, the Glu, Asp, Lys, Ile, and Phe residues were favorable in the water-soluble proteins, and the Ile, Met, Phe, Ala, and Lys residues were favorable in the transmembrane proteins. A comparison of the amino acid compositions of the water-soluble and transmembrane proteins revealed that the Trp residue is abundant in the transmembrane proteins. However, Trp was not estimated as favorable. This result indicates that a high proportion of an amino acid does not necessarily mean that it is favorable. Glu and Asp were observed as favorable residues in the water-soluble proteins of all species, whereas Cys and Arg were observed as unfavorable in both water-soluble and transmembrane proteins. No significant difference was observed between thermophiles and mesophiles with respect to favorable and unfavorable residues, with the exception of Gln. Consistent with a previous study [35], Gln was often observed as unfavorable in thermophiles. The 3 highest ratios of observed/calculated composition were obtained for Asp in Halobacterium water-soluble proteins (3.54), Met in transmembrane proteins of P. aeruginosa (3.37), and Lys in transmembrane proteins of T. thermophilus (3.32); the first result was in agreement with a previous study [19]. The 3 lowest ratios were calculated from the Cys content in the transmembrane proteins of the thermophilic Eubacteria T. thermophilus (0.04), T. maritima (0.07), and T. tengcongensis (0.09). This result is attributed to the very low observed Cys content in the transmembrane proteins of the thermophilic Eubacteria (Table 1).
The number of favorable residues in both water-soluble and transmembrane proteins increased with the G + C content, whereas the number of unfavorable residues did not. The Ala, Gly, and Pro residues, which have G + C-rich codons, were frequently observed as favorable in species with low G + C content. Conversely, the Phe, Tyr, Lys, Ile, and Met residues, which have G + C-poor codons, were frequently observed as favorable in G + C-rich species. The positive correlation between the number of favorable residues and the G + C content may arise because more residues have G + C-poor codons than G + C-rich codons. The Pro residue was observed as favorable in G + C-poor species, but was unfavorable in G + C-rich species. This result suggests that species maintain the Pro content within a certain range, increasing the supply when it is low and decreasing it when it is high. Some species do not have aminoacyl tRNA synthetases for all 20 amino acids. For example, Halobacterium does not possess aminoacyl tRNA synthetases for Asn and Gln [27]. Generally, the Gln content was reduced in species lacking the aminoacyl tRNA synthetase for Gln; therefore, in these species, the Gln residue was regarded as unfavorable. Interestingly, the abundance of the Asn residue was not affected.

DISCUSSION
The amino acid sequences and compositions of water-soluble proteins differ from those of transmembrane proteins. The amino acid composition of a protein depends on the nucleotide sequence of the protein-coding gene. The average nucleotide composition of protein-coding genes from 3 animal mitochondria was A = 31%, C = 28%, G = 13%, and T = 28%. The proteins translated from the mitochondrial genes using the mitochondrial codon table [36] contain significantly higher numbers of hydrophobic amino acid residues; therefore, they are considered appropriate for transmembrane proteins. The observed amino acid composition correlated with the calculated amino acid content [37], indicating that proteins with specialized amino acid compositions can be produced by adopting a specific nucleotide composition. In double-stranded DNA, the amount of adenine is equal to that of thymine, and the amount of guanine is equal to that of cytosine. This is known as Chargaff's first parity rule [38,39]. A similar relationship approximately holds within a single DNA strand, which is known as Chargaff's second parity rule [40,41]. The second parity rule was confirmed by using over 3400 genomic sequences from Archaea, Eubacteria, eukaryotes, and viruses [42]. Species have to produce various kinds of proteins to survive under the constraints that Chargaff's first and second parity rules place on the DNA sequence. However, Chargaff's second parity rule does not hold true for mitochondrial DNA [42].
The water-soluble and transmembrane proteins were obtained from the genes investigated in our previous study [4]. We also examined the amino acid compositions of proteins from other species, and obtained similar trends corresponding to the G + C content. This result indicates that the characteristics of amino acid composition are maintained in proteins from various species. We selected species covering a wide range of G + C content, as it represents the mononucleotide composition. The amino acid composition is thought to be controlled by 2 factors, namely, the mononucleotide composition and the deviation from the expected values calculated using the mononucleotide composition. The frequency of some amino acids is primarily dependent on the G + C content, as shown in Figure 1. Amino acids encoded by G + C-rich (or G + C-poor) codons were frequently observed as favorable in species with low (or high) G + C content. These are examples of deviations from expectation that increase the supply of an amino acid when its expected level is low.
Amino acids with a compositional ratio (observed/calculated) of ≥1.3 were considered favorable and those with a compositional ratio of ≤0.7 were considered unfavorable. The ratios ranged from 0.04 to 3.54. In our previous study, ratios of 1.1 and 0.9 were used to determine the favorable and unfavorable dinucleotides [4].
Both Arg and Cys residues were unfavorable in all the proteins from the Archaea and Eubacteria species. The depletion of Arg residues in the amino acid sequences of mammals was identified more than 40 years ago [43]. In mammals, the cytosine of the dinucleotide CG is methylated to 5-methylcytosine, which is more susceptible than cytosine to deamination, yielding thymine. In addition, some of the resulting T-G mismatches are poorly repaired; therefore, CG/CG tends to become TG/CA, which leads to a reduction in CG and an increase in TG and CA [44]. CG is a component of the Arg codon CGN, and therefore, these repair-related errors lead to the depletion of Arg. To test this idea, we examined the amino acid sequences of both water-soluble and transmembrane proteins from mice.
The 100 water-soluble and 100 transmembrane proteins were selected according to the annotations of the GTOP database. Amino acid sequences having ≤30% sequence identity with other selected sequences were used. The nucleotide sequences corresponding to those protein sequences were retrieved from the NCBI web site, ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/RNA/rna.gbk.gz. The genes encoding water-soluble proteins were rich in AA, AG, and GA dinucleotides, whereas those encoding transmembrane proteins were rich in CT, TC, and TT. This trend was similar to that observed with the 12 species in our previous study [4]. The average amino acid compositions of the 100 water-soluble and 100 transmembrane proteins from mice are listed in Table 1, with the G + C content of the genes. The mouse genes encoding transmembrane proteins exhibited slightly higher G + C content than those encoding water-soluble proteins. The ratios of the amino acid composition of water-soluble proteins to that of transmembrane proteins were calculated. We observed a bias for Asp, Lys, and Glu residues in water-soluble proteins, and for Trp, Phe, and Leu in transmembrane proteins (Table 2). This result was similar to that observed in the mesophilic species. Furthermore, Glu, Phe, Lys, and Met residues were favorable in both water-soluble and transmembrane proteins. However, only the Arg residue was deemed unfavorable in the mouse proteins (Table 3). The ratios of the observed to the expected dinucleotide compositions of CG, TG, and CA were 0.46, 1.39, and 1.23, respectively, for the genes encoding water-soluble proteins, and 0.48, 1.39, and 1.29, respectively, for the genes encoding transmembrane proteins. The CG ratios (<1) indicate that CG is less abundant than expected, whereas the TG and CA ratios (>1) indicate that TG and CA are more abundant than expected. This result suggests that CG/CG dinucleotides may have been converted to TG/CA in the mouse genes. This trend was not seen in Archaea and Eubacteria, with the exception of M. stadtmanae. Therefore, the depletion of the Arg residue in Archaea and Eubacteria might be due to reasons different from those responsible in mammals.
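The observed/expected dinucleotide ratios discussed above can be sketched as follows. The sketch assumes that dinucleotides are counted at every overlapping position of a single sequence and that the expected frequency is the product of the two mononucleotide frequencies; whether the original analysis counted overlapping positions is not stated, and in the study the expected values were averaged over 100 genes rather than computed per sequence. The function name and the toy sequence are illustrative.

```python
from collections import Counter
from itertools import product

def dinucleotide_ratios(seq):
    """Observed/expected frequency for each dinucleotide in one DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    # Assumption: count dinucleotides at all overlapping positions.
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n) * (mono[y] / n)
        if expected > 0:  # skip dinucleotides whose bases never occur
            observed = di[x + y] / (n - 1)
            ratios[x + y] = observed / expected
    return ratios

toy = dinucleotide_ratios("AATT")
for pair in sorted(toy):
    print(pair, round(toy[pair], 2))
```

Applied to a coding sequence, a CG ratio well below 1 together with TG and CA ratios above 1 would correspond to the pattern reported for the mouse genes.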

Table 2.
In addition to Cys, the charged residues Glu, Asp, Lys, and His were frequently observed in water-soluble proteins.

Table 2.
List of amino acids frequently observed in the water-soluble and transmembrane proteins.

Table 3.
Favorable and unfavorable amino acids in the water-soluble and transmembrane proteins based on the ratios of observed/calculated composition.