J. Biomedical Science and Engineering, 2010, 3, 868-883
doi:10.4236/jbise.2010.39117 Published Online September 2010 in SciRes (http://www.scirp.org/journal/jbise/).
Identification of the interactive region by the homology of the
sequence spectrum
Masatoshi Nakahara1, Masaharu Takeda2*
1Department of Computer and Information Sciences, Sojo University, Ikeda, Kumamoto, Japan;
2Department of Materials and Biological Engineering, Tsuruoka National College of Technology, Tsuruoka, Yamagata, Japan.
Email: mtakeda@tsuruoka-nct.ac.jp
Received 4 June 2010; revised 9 July 2010; accepted 12 July 2010
ABSTRACT
The base sequence in genome was governed by some
fundamental principles such as reverse-complement
symmetry, multiple fractality and so on, and the analytical
method of the genome structure, the “Sequence Spectrum
Method (SSM)”, based on the structural features of genomic
DNA, faithfully visualized these principles. This paper
reported that the sequence spec-
trum in SSM closely reflected the biological phe-
nomena of protein and DNA, and SSM could identify
the interactive region of protein-protein and DNA-
protein uniformly. To investigate the effectiveness of SSM,
we analyzed several published protein-protein and DNA-protein
interactions, primarily in the genome of Saccharomyces cerevisiae. The
method proposed here was based on the homology of
sequence spectrum, and it advantageously and sur-
prisingly used only base sequence of genome and did
not require any other information, even information
about the amino-acid sequence of protein. Eventually
it was concluded that the fundamental principles in
genome governed not only the static base sequence
but also the dynamic function of protein and DNA.
Keywords: Spectrum of Genome Base Sequence; Homology of
Sequence Spectrum; Interactive Region;
Reverse-Complement Symmetry; Multiple Fractality;
Analytical Method of Genome
1. INTRODUCTION
As described previously [1,2], it was very impor-
tant to investigate the structure of the entire genome be-
cause the four bases should be arranged in a sophisti-
cated fashion in the genome, and essentially the base
sequences might reflect the conformations of protein,
RNA and DNA. DNA sequences were deeply affected
by the adjoining sequences. In other words, the non-cod-
ing sequences might play some important roles to express
each gene (the coding sequences) in genome. That is, not
only the coding region, but also the non-coding region
might be necessary to transmit and to transform the bio-
logical information precisely, rapidly, and stably. There-
fore, if we would find meaningful structure in the ge-
nome, we might also obtain important information about
the functions of protein, RNA and DNA from their
structure.
Previously, we showed that the four bases in genomic
DNA were organized based on the generation-rules in all
organisms by analyzing the appearance frequency of the
bases, and we proposed three generation-rules of the base
sequences in a single-strand of DNA: 1) reverse-com-
plement symmetry of the 1 ~ 9 successive base sequen-
ces, 2) multiple fractality of each base distribution de-
pending on the distance, and 3) bias of four bases, A, T,
G and C. These rules were universally observed regardless
of species [1]. Further, we also defined the sequence
spectrum by the appearance frequency of the base se-
quence in genome, and we have developed the powerful
method “Sequence Spectrum Method (SSM)” in order to
visualize and analyze the generation-rules in the entire genome
explained above. As one of the important results, we revealed
by using SSM that there was a remarkable homology of sequence
spectrum between proteins and tRNAs [2]. This fact suggested
that the sequence spectrum
could be closely associated with the function of protein,
and the homology of sequence spectrum could be related
to the mutual interactive region. Identification of mutual
interactive region of protein, RNA and DNA was defi-
nitely important to figure out their functions, and usually
the homology of base sequence or amino-acid sequence
was used for it.
To investigate the effectiveness of SSM, in this paper,
we showed that SSM could identify the interactive region
of the protein-protein and the protein-DNA by the ho-
mology of the sequence spectra. The advantages of the
proposed method were as follows.
1) It used only base sequence of genome and did not
require any other information, even information about
amino-acid sequence of protein. As SSM faithfully reflected
the biological information, the conservation of the base
sequences of genomic DNA was also reflected in the translated
amino-acid sequences of the proteins [1,2].
2) It could identify the interactive region of both pro-
tein-protein and protein-DNA in completely the same
manner.
3) It could be executed fully on a personal computer
and did not require a special high performance computer.
Moreover the identification was done in a few seconds.
2. MATERIALS AND METHODS
2.1. Sequence Spectrum Method (SSM)
SSM was carried out in the same way as the published
procedures [2]. The outline of the proposed method was
as follows. The base sequence of interest was sectioned
by a small number of bases from the top (5’-end). The
key sequences of nine successive bases (d = 9) numbered
262,144 (= 4^9, Reference [2]). The
appearance frequency of the key sequence was counted
in the entire genome, and was plotted at the position of
the first base of the key sequence as described in the
next paragraph. These procedures were carried out for
the entire base sequence of interest with one base shift (p
= 1). The next step was to average the appearance fre-
quencies so that a recognizable pattern of appearance
frequency was obtained for the base sequence. This pat-
tern of the averaged appearance frequency was called the
“sequence spectrum”. Finally, the homology factor be-
tween two sequence spectra was calculated to determine
the degree of homology. The exact procedure was ex-
plained below in a mathematical way.
Let S be an entire set of base sequences, and B = [bi]
be a partial set of interest in S. A base element was deno-
ted by bi (i = 1..M), and M was the base sequence size of
B. Each base element bi was A (adenine), T (thymine),
G (guanine) or C (cytosine). The key sequence ki and the
appearance frequency fi were defined for bi as follows.
Key sequence ki : base sequence comprised of sequen-
tial base elements bi~bi+d-1 (d : base size of the key se-
quence).
Appearance frequency fi : appearance count of ki in S.
The key sequence ki was compared with the base se-
quence of the entire set S, and the appearance frequency
fi was increased by one every time the key sequence ki
matches the partial base sequence of the entire set S.
This procedure was iterated for all key sequences ki to
obtain fi (i = 1..M). In practice all fi were counted and
tabulated in advance by scanning all base sequence in S.
Consequently, the appearance frequency vector F = [fi] (i
= 1..M) was determined (actually, the appearance fre-
quencies for the last (d-1) base elements of B could not
be calculated; however, this was neglected because M >>
d-1).
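The tabulation described above could be sketched in Python as follows. This is only an illustrative sketch and not the authors' implementation: the function names kmer_counts and frequency_vector and the toy genome are assumptions introduced here, and d = 4 is used only to keep the toy example small (the paper used d = 9).

```python
from collections import Counter

def kmer_counts(genome, d=9):
    """Tabulate in advance how often each length-d key sequence occurs in S.

    `genome` is a list of base-sequence strings standing in for the entire set S.
    """
    counts = Counter()
    for seq in genome:
        for i in range(len(seq) - d + 1):   # one-base shift
            counts[seq[i:i + d]] += 1
    return counts

def frequency_vector(b, counts, d=9):
    """Return F = [f_i] for the sequence of interest B.

    The last d-1 positions have no complete key sequence and are omitted,
    as the text notes.
    """
    return [counts[b[i:i + d]] for i in range(len(b) - d + 1)]

# Toy data (hypothetical, for illustration only).
genome = ["ATCGATCGATCG"]
counts = kmer_counts(genome, d=4)
F = frequency_vector("ATCGAT", counts, d=4)
print(F)  # [3, 2, 2]
```

Tabulating all counts once and then looking up each key sequence corresponds to the remark that all fi were counted in advance by scanning S.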
Next, the appearance frequency fi was averaged as
follows:
$$ fs_i = \frac{1}{2m+1} \sum_{j=i-m}^{i+m} f_j $$
where the parameter m was average width. This aver-
aged appearance frequency Fs = [fsi] (i = 1..M) was
called the “sequence spectrum”.
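A minimal sketch of this averaging step is given below. The text does not state how positions closer than m to either end of the sequence were treated; the sketch assumes that the window is simply truncated there and the mean is taken over the values that exist. The function name sequence_spectrum is hypothetical.

```python
def sequence_spectrum(F, m=10):
    """Average the appearance frequencies over a window of 2m+1 positions.

    Assumption (not stated in the text): near the ends the window is
    truncated and the mean is taken over the available values only.
    """
    n = len(F)
    Fs = []
    for i in range(n):
        lo = max(0, i - m)
        hi = min(n, i + m + 1)
        window = F[lo:hi]
        Fs.append(sum(window) / len(window))
    return Fs

print(sequence_spectrum([1, 2, 3, 4, 5], m=1))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

For interior positions the result equals the sum of 2m+1 neighboring frequencies divided by 2m+1.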
The next step was to calculate the homology factor to
determine the degree of homology. The homology factor
determines the homologous region of a target base se-
quence with respect to a reference base sequence. In order
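to derive the homology factor, the mutual correlation
function MF within the window width of homology was
calculated as
$$ MF_{ij}(Fsr, Fst) = \frac{Fsr_i * Fst_j}{\sqrt{(Fsr_i * Fsr_i)\,(Fst_j * Fst_j)}} $$
$$ Fsr_i * Fst_j = \sum_{k=1}^{w} (fsr_{i+k} - \overline{fsr}_i)(fst_{j+k} - \overline{fst}_j) $$
$$ Fsr_i * Fsr_i = \sum_{k=1}^{w} (fsr_{i+k} - \overline{fsr}_i)(fsr_{i+k} - \overline{fsr}_i) $$
$$ Fst_j * Fst_j = \sum_{k=1}^{w} (fst_{j+k} - \overline{fst}_j)(fst_{j+k} - \overline{fst}_j) $$
$$ \overline{fsr}_i = \frac{1}{w} \sum_{k=1}^{w} fsr_{i+k}, \qquad \overline{fst}_j = \frac{1}{w} \sum_{k=1}^{w} fst_{j+k} $$
where
Fsr: sequence spectrum of the reference base sequence
Fst: sequence spectrum of the target base sequence
w: window width of homology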
The mutual correlation function MF ranges from -1 to
1, and then the homology factor HF was defined as
$$ HF_{ij}(Fsr, Fst) = \frac{MF_{ij} + 1}{2} \times 100\ [\%] $$
The higher the homology factor, the more similar the
sequence spectra were. The similar regions of the target
base sequence with respect to the reference base se-
quence were obtained by calculating the homology fac-
tors HFij for all i (i = 0..Mr-w, Mr: size of reference se-
quence) and j (j = 0..Mt-w, Mt: size of target sequence).
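The homology calculation could be sketched as follows, assuming that MF is the usual normalized (Pearson-type) correlation of the two windows, which is consistent with its stated range from -1 to 1. The names homology_factor and best_match are hypothetical, and returning 50% for a completely flat window (where the correlation is undefined) is an assumption of this sketch.

```python
import math

def homology_factor(ref_win, tgt_win):
    """Homology factor HF [%] of two equally long spectrum windows."""
    w = len(ref_win)
    if w == 0 or w != len(tgt_win):
        raise ValueError("windows must be non-empty and of equal length")
    mean_r = sum(ref_win) / w
    mean_t = sum(tgt_win) / w
    cross = sum((r - mean_r) * (t - mean_t) for r, t in zip(ref_win, tgt_win))
    auto_r = sum((r - mean_r) ** 2 for r in ref_win)
    auto_t = sum((t - mean_t) ** 2 for t in tgt_win)
    if auto_r == 0 or auto_t == 0:
        return 50.0  # assumption: treat a flat window as MF = 0
    mf = cross / math.sqrt(auto_r * auto_t)
    return (mf + 1) / 2 * 100

def best_match(Fsr, Fst, w):
    """Scan all window offsets (0-based i, j) and return (HF, i, j) of the best pair."""
    best = (-1.0, None, None)
    for i in range(len(Fsr) - w + 1):
        for j in range(len(Fst) - w + 1):
            hf = homology_factor(Fsr[i:i + w], Fst[j:j + w])
            if hf > best[0]:
                best = (hf, i, j)
    return best

# Toy spectra (hypothetical): the reference pattern reappears at offset 2.
hf, i, j = best_match([1, 2, 3], [5, 1, 1, 2, 3, 0], w=3)
```

The exhaustive scan costs on the order of Mr x Mt x w operations; for gene-sized sequences this stays small, but the sketch makes no attempt at optimization.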
When the base sequence was very large, elements of
the sequence spectrum were skipped by the size factor p
to reduce the size as follows.
$$ fs_i \leftarrow fs_{(i-1) \cdot p + 1} $$
For instance, when p = 2
$$ fs_1, fs_2, fs_3, \ldots \leftarrow fs_1, fs_3, fs_5, \ldots $$
This operation reduced the size to 1/p.
The base sequences of the genomes were obtained
from the databases listed below.
Saccharomyces Genome Database. (2010) (http://www.yeastgenome.org/).
NCBI genome data base. (2010) (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome).
2.2. Appearance Frequencies of Bases
For nine successive bases, the appearance frequency was
counted for the entire genome by matching from the start
of the base sequence in a genome with one base shift (p
= 1) as follows.
Ex. Nine successive bases: AATAAAGAA
AATA AAGAA (one base shift)
Base Sequence:
5’-ATCGAATAAAGAACCGTTCGGTAAGTCGAATAAAGAAT-CTGGCATTT-3’
1 2
Count of AATAAAGAA: 2
In the case of a genome composed of multiple chromosomes,
such as S. cerevisiae, we calculated the
sum of the base frequencies of the 16 chromosomes (in
numeric order) plus mtDNA [1].
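The counting example above can be reproduced with a short sketch. The helper names count_key and genome_count are hypothetical, and the example sequence is entered without the hyphen that appears in the printed sequence, on the assumption that it is a line-break artifact.

```python
def count_key(sequence, key):
    """Count occurrences of `key` in `sequence` with a one-base shift (overlaps allowed)."""
    d = len(key)
    return sum(1 for i in range(len(sequence) - d + 1) if sequence[i:i + d] == key)

def genome_count(chromosomes, key):
    """Sum the counts over all chromosomes (and mtDNA) of a genome."""
    return sum(count_key(c, key) for c in chromosomes)

example = "ATCGAATAAAGAACCGTTCGGTAAGTCGAATAAAGAATCTGGCATTT"
print(count_key(example, "AATAAAGAA"))  # 2
```

For a genome with several chromosomes, genome_count would be called with one string per chromosome plus the mtDNA sequence.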
2.3. The Parameters “d”-, “m”- , “p”-, and
“w”-Values of SSM Analysis for the
Interaction
Controllable parameters in the sequence spectrum were
the base size “d” of the key sequence, the average width
“m”, the skip base number (the size factor) “p” and the
window width “w” of homology. The parameter “d” de-
termined the highest resolution for extracting the structural
feature of the base sequence. Therefore this parameter
should be chosen to be as large as possible to
extract the exact feature. Large “m” values were usu-
ally used to obtain the overall features of the structure,
and smaller “m” values were applied to investigate the
structure in detail. The value of “m” normally ranges from
1/10 to 1/100 of the base sequence size [2]. This parame-
ter was adjusted to the base sequence size especially when
the homology factor between a small reference and a large
target was calculated [2]. The window width of homology,
“w”, determined the width of the similar region to be identified. In
this paper the values of “d”, “m”, “p” and “w” were 9, 10,
1 and 200, respectively, to identify the interactive region
of protein and DNA.
In figures of the sequence spectrum the horizontal pa-
rameter was the base size of sequence, M of each gene
or genomic DNA, and the vertical parameter was the se-
quence spectrum. These parameters were appropriately
scaled to show the similar region clearly.
2.4. Procedure of Identification of the Interactive
Region by SSM
To simplify the procedure, it was assumed that the inter-
active region of one protein was given (shown in pur-
ple-blue), and SSM identified the interactive region of
the other protein (shown in red). The procedure to iden-
tify the interactive regions of two proteins by SSM was
as follows. In the following procedure one of two pro-
teins was replaced by DNA when the protein-DNA in-
teraction was investigated.
[Step 1] One protein with the given interactive region
(shown in purple-blue) was designated as a reference
protein, and the other protein with the interactive region
(shown in red) which SSM identified was designated as
a target protein.
[Step 2] The sequence spectra of both the reference
and target proteins were calculated.
[Step 3] The similar regions between the sequence spec-
tra of the reference and target proteins were calculated.
[Step 4] The pair of similar regions (red/purple-blue)
with the highest homology factor (HF) was selected as a
candidate of interactive regions.
[Step 5] The base sequence of the reference protein
was converted to its reverse complement, and
steps [2-4] were repeated because of the reverse-complement
rule in the genome.
[Step 6] Of the two candidates obtained in steps [4] and
[5], the similar region of the target protein with the higher
HF was called the first identified region, and the other was
called the second identified region.
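Steps 5 and 6 could be sketched as follows. The function reverse_complement follows the standard base pairing (A with T, G with C). The function rank_candidates and its score argument are hypothetical: score stands for any routine that performs steps 2 to 4 and returns the best homology factor together with the start position of the similar region in the target.

```python
_COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(seq):
    """Return the reverse complement of a DNA base sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def rank_candidates(reference, target, score):
    """Run the search for the reference and its reverse complement, then rank by HF.

    `score(ref_seq, target_seq)` is a hypothetical callable returning
    (hf, target_start) for the most similar region (steps 2 to 4).
    Returns (first identified candidate, second identified candidate).
    """
    fwd_hf, fwd_pos = score(reference, target)
    rev_hf, rev_pos = score(reverse_complement(reference), target)
    candidates = [
        {"strand": "forward", "hf": fwd_hf, "target_start": fwd_pos},
        {"strand": "reverse-complement", "hf": rev_hf, "target_start": rev_pos},
    ]
    candidates.sort(key=lambda c: c["hf"], reverse=True)
    return candidates[0], candidates[1]

print(reverse_complement("AATAAAGAA"))  # TTCTTTATT
```

Passing the scoring routine as an argument keeps the ranking logic independent of how the sequence spectra and homology factors are actually computed.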
3. RESULTS AND DISCUSSION
This section demonstrates that the homology of the se-
quence spectrum was closely associated with the mutual
interaction of proteins or DNA. The identified interac-
tive regions of the proteins were all the first identified
regions in the examples below. We showed some of the
interactive regions analyzed by SSM in this section.
3.1. Mutual Interaction of Protein-Protein
1) MAS1 and MAS2
Figure 1 showed the interactive region (in purple-blue)
of MAS1 [Mas1p (β-MPP), Reference [3]] - MAS2 [Mas2p
(α-MPP), Reference [4]]. These proteins formed a
complex to cleave the mitochondrial targeting signal of
precursors. In Figure 1(a) the active region (in purple-blue)
around the key amino acid E73 of MAS1
(Mas1p) was the reference, and the whole coding region
of MAS2 (Mas2p) was the target (Figure 1(b)). Previous
reports proposed a model in which the glycine-rich region
of MAS2 (Mas2p, in red) cooperated with the active
region of MAS1 (Mas1p, in purple-blue). Our results
strongly supported this model because the region of MAS2
most similar (in red; HF = 90.5%) to the active
region of MAS1 (in purple-blue) was completely identical
to the reported glycine-rich region [5,6, in red].
Moreover, the positions of the key amino acids in both
proteins (E73 in Mas1p and K296 in Mas2p) were also
identical.
Figure 1. Sequence spectra of MAS1 and MAS2 (d = 9, m = 10, p = 1). (a)
Coding region of MAS1 (Mas1p, M = 1,386). The active region of MAS1 (Mas1p,
reference: M = 200, in purple-blue). This region (corresponding to E46 –
E106) carries the characteristic metal-binding motif associated with the
catalytic activity [5, 6]. (b) Coding region of MAS2 (Mas2p) containing the 5’-
and 3’- non-coding region (target: M = 1,446). The region most similar to the
reference is shown in red (HF = 90.5%). The most similar region is gly-
cine-rich and closely related to the catalytic function (I261 – G327 of Mas2p).
E73 (shown in red letter) of Mas1p presumably interacts with K296 (shown in
red letter) of Mas2p (position of arrowhead). The scales of the axes for the
sequence spectra of the similar regions were the same. The amino acid se-
quences of Mas1p and Mas2p neighboring the interactive regions were shown
in figures, respectively.
2) PHO4 and PHO80
Figure 2 showed the sequence spectra of PHO4 (a,
Pho4p, reference: the interactive region around the key
amino acid P174, in purple-blue) and PHO80 (b, Pho80p,
target: the whole coding region). PHO4 (Pho4p) was a
transcription factor, and PHO80 (Pho80p) inhibited the
transcriptional function of PHO4 (Pho4p). Ogawa & Oshima
[7] and Okada & Toh-e [8] reported that there was
interaction between P174 in Pho4p and M42 in Pho80p,
respectively. The red region in (b) in which M42 (Figure
2(b), arrow head) of Pho80p was located was the region
most similar to the reference region of Pho4p, in which
P174 (Figure 2(a), arrow head) was located (HF =
89.1%). The interactive regions between Pho4p and Pho80p
are also discussed later in the Pho2p results (example 6).
Figure 2. Sequence spectra of PHO4 and PHO80 (d = 9, m = 10, p = 1). (a) Coding
region of PHO4 (Pho4p, M = 936, the active region was shown in purple-blue).
(b) Coding region of PHO80 (Pho80p, target: M = 880). The region most similar to
the reference is shown in red (HF = 89.1%). It has been shown that P174 (shown in
red letter) of Pho4p interacts with M42 (shown in red letter) of Pho80p [7, 8]. The
arrowhead in each spectrum respectively indicates the position of the amino acid
P174 of Pho4p, and M42 of Pho80p. The scales of axes in (a) and (b) are the same.
The amino acid sequences of Pho4p and Pho80p neighboring the interactive regions
were shown in figures, respectively. The red letters indicate amino acids
reported as functional.
3) RPB2 and RPB12
Figure 3 showed the sequence spectra of RPB2 and
RPB12. The RPB protein family forms DNA-directed
RNA polymerase II [9]. RPB2 (Rpb2p encoding gene)
and RPB12 (Rpb12p) were members of the family, and
RPB12 (Rpb12p) combined with RPB2 (Rpb2p). Rpb12p
was a very small protein with 70 amino acids whereas
Rpb2p was a large one with 1224 amino acids. Therefore
in this case the whole coding region of RPB12 (Rpb12p)
was suitable for the reference (a) and the coding region
of RPB2 (Rpb2p) for the target (b). The result was shown
in Figures 3(a-b). The red region is the region of RPB2
(Rpb2p) most similar to RPB12 (Rpb12p, HF =
87.1%). The literature [9] revealed that the interaction
between RPB2 (in red) and RPB12 (in purple-blue) occurred
at two regions of RPB2, and Figure 3 showed one
of these two interaction regions. This result was unlikely
to be a coincidence because the target size was about 18
times larger than the reference size. In addition, interestingly,
the other interacting region was very close to the
second identified region in the coding region (not shown),
although it was not completely identical (a previous
report [9] specified the region around the 900th amino
acid of Rpb2p, but our results specified the region around
the 940th amino acid).
Figure 3. Sequence spectra of RPB12 and RPB2 (d = 9, m = 10, p = 1). (a) Coding
region of RPB12 (Rpb12p, reference: M = 210). (b) Coding region of RPB2 gene
containing the 5’- and the 3’- non-coding region (Rpb2p, target: M = 3,672). The
region most similar to the reference is shown in red (HF = 87.1%). The scales of
axes in (a) and (b) are the same. The amino acid sequences of Rpb12p and Rpb2p
neighboring the interactive regions were shown in figures, respectively.
4) GCR1 and GCR2
The interactive region of GCR1 [Gcr1p,10] and GCR2
[Gcr2p,11] was very interesting. In Figure 4 the red region
of GCR1 (Gcr1p, leucine zipper) was the first identified
region (HF = 92.9%) with respect to the reference
region (in purple-blue) of GCR2 (Gcr2p). The sequence
spectra suggested that the leucine-zipper region of GCR1
(Gcr1p) might interact with the C-terminus of GCR2
(Gcr2p, purple-blue region), although considerable con-
troversy still existed concerning the interaction between
Gcr1p and Gcr2p [12,13]. This case was quite interesting
for the following reasons: a) the identified region was
derived from the reverse-complement reference region of
GCR2, that is, the reverse-complement base sequence of
GCR2 was also useful for the analysis of the interactive
region by SSM (we designated this the reverse-complement
rule), and b) a portion of the reference region extended
into the downstream region. This meant that in this
case the proposed method identified two different
objects in the reference region, the protein region of GCR2
(Gcr2p) and the DNA region of GCR2. That is,
the sequence spectrum of a given gene might reflect the
information of both protein and DNA, and SSM could be
applied to analyze both of them.
Figure 4. Sequence spectra of GCR1 and GCR2 (d = 9, m = 10, p = 1). (a) The
reverse-complement sequence of whole region of GCR2 (Gcr2p) containing the
5’- and the 3’- non-coding region was used as the reference (M = 3,157, the active
region was shown in purple-blue). (b) The functional region (K266 – R300, leucine
zipper) of GCR1 (Gcr1p, ref.10-13). The region most similar to the reference (HF
= 92.9%). This region (leucine zipper, ref. 12, 13) of Gcr1p might interact with
the reference region of Gcr2p. The scales of axes in (a) and (b) are the same. The
arrowhead of black and red were the start codon (M1) and the stop codon (TGA)
of GCR2, respectively. The bold black arrowhead of GCR1 was the position of
E262 (red letter in the amino acid sequence of Gcr1p). The amino acid sequences
of Gcr2p and Gcr1p neighboring the interactive regions were shown in figures,
respectively. The red letter indicated to report as a functional amino acid.
5) SLA1 and SLA2
This example showed that SSM could be applied to large
proteins. The sizes of the proteins Sla1p (coded by SLA1)
and Sla2p (coded by SLA2) were 1244 and 968 amino
acids respectively, and Figure 5 showed the interactive
regions of these proteins. In Figure 5 the red region of
SLA1 (Sla1p) was the first identified region (HF = 94.3%)
with respect to the reference region (in purple-blue) of
SLA2 (Sla2p), which was converted to its reverse
complement. The literature [14] showed that this result
was valid.
The three examples 6) ~ 8) below were results of pre-
dicting the interactive regions by SSM. In these exam-
ples one of the interactive regions was known and the
other was unknown, and SSM predicted the unknown
interactive region.
Figure 5. Sequence spectra of SLA2 and SLA1 (d = 9, m = 10, p = 1). In the
interaction SLA2 (Sla2p)/SLA1 (Sla1p), the reverse-complement base sequence
gave higher homology than the normal base sequence. (a)
The reverse-complement sequence of coding region of SLA2 (Sla2p) was used as
the reference (M = 2,904, the active region was shown in purple-blue). (b) The
sequence spectrum region of SLA1 (M = 3,732. Sla1p, ref.14). The amino acid
sequences of Sla2p and Sla1p neighboring the interactive regions were shown in
figures, respectively. The region most similar to the reference (HF = 94.3%).
6) PHO2, PHO4 and PHO80 [15-17]
The identification of the interactive regions might be
applied to the characterization of the molecular mechanism
of metabolism. For instance, the example focusing
on the interactive regions of PHO2 (Pho2p) - PHO80
(Pho80p) - PHO4 (Pho4p) was very suggestive. PHO2
was a gene coding for a transcription factor, Pho2p, which
regulated several genes such as PHO5 together with another
transcription factor, Pho4p [15-17]. It was well known
that Pho2p had a cooperative interaction with Pho4p, and
the literature [15] reported that the amino acids around
S230 of Pho2p played an important role concerning the
interaction with Pho4p. In this connection SSM predicted
the target interactive region of Pho4p with the reference
region around S230 of Pho2p. The predicted region of
Pho4p was located very close to or overlapped partially
with the interactive region with Pho80p, and the posi-
tions of the key amino acids, S230 of Pho2p and P174 of
Pho4p were identical (Figure 6).
As described in the above section 2) PHO4 and PHO80,
P174 of Pho4p and M42 of Pho80p functioned
in the interaction of these proteins (Figure 2). Namely,
the positions of the three key amino acids P174 of Pho4p,
M42 of Pho80p, and S230 of Pho2p were identical
in the interactive regions identified by SSM. This fact
suggested that Pho80p might interfere with the cooperation
between Pho4p and Pho2p, and this result was
very reasonable [15-17] although more experimental
confirmation would be necessary.
Figure 6. Sequence spectra of PHO2 and PHO4 genes (d = 9, m = 10, p = 1). (a)
Coding region of PHO2 (Pho2p, reference: M = 1677). The region most similar to
the reference is shown in purple-blue. (b) The reverse-complement sequence of
coding region of PHO4 (Pho4p, M = 936). The active region was shown in red
(HF = 93.7%). It has been shown that P174 (shown in red letter) of Pho4p interacts
with S230 (shown in red letter) of Pho2p [15-17]. The arrowhead in each spectrum
respectively indicates the position of the amino acid S230 of Pho2p, and P174 of
Pho4p. The scales of axes in (a) and (b) are the same. The amino acid sequences
of Pho2p and Pho4p neighboring the interactive regions were shown in figures,
respectively. The red letters indicate amino acids reported as functional.
7) PHO2 and SWI5 [18]
SWI5 was a gene encoding a transcription factor, Swi5p,
that activates transcription of genes expressed at the
M/G1 phase boundary and in G1 phase, such as PHO2,
which encodes Pho2p, a regulatory protein cooperatively
involved in phosphate metabolism. The base numbers of the
interactive region were known in SWI5 and unknown in
PHO2 [18]. We predicted the unknown interactive region
of Pho2p by SSM (Figure 7).
Figure 7. Sequence spectra of SWI5 and PHO2 genes (d = 9, m = 10, p = 1). (a)
Coding region of SWI5 (Swi5p, M = 2127, the active region was shown in pur-
ple-blue). (b) Coding region of PHO2 (Pho2p, target: M = 1677). The region most
similar to the reference is shown in red (HF = 95.0%). It has been shown that the
amino acids sequences (shown in red letter) of Swi5p interacts with the amino
acids sequences (shown in red letter) of Pho2p [18]. The arrowhead in each spec-
trum respectively indicates the position of the functional amino acid N471 of
Swi5p, and N3305 of Pho2p. The scales of axes in (a) and (b) are the same. The
amino acid sequences of Swi5p and Pho2p neighboring the interactive regions
were shown in figures, respectively. The red letter indicated to report as functional
amino acids sequences.
8) ATP3 and ATP15 [19-21]
ATP3 and ATP15 were genes encoding the F1F0-ATPase
complex γ and ε subunits respectively, which participated
in a rotation of the complex [19-21]. In this example
the interactive regions of both ATP3 and ATP15 were
unknown. However we could choose the entire coding
region of ATP15 as the reference because the gene
size of ATP15 was small (186 nt). Therefore, we used
w = 186 in SSM in this case. The other values, m, d, and p,
were the same as before: 10, 9, and 1, respectively. In
addition, the reverse-complement base sequence of
ATP15 was used because HF was higher in this analysis.
We predicted the unknown interactive region of ATP3 by
SSM (Figure 8).
In X-ray crystallography of the γ - ε complex of ATP
synthase in E. coli and bovine, the 200th amino
acid and the adjacent amino acids of the γ-subunit (Atp3p),
located at the foot position, could presumably interact with
the ε-subunit (Atp15p) [19,20]. The prediction by SSM might
be in accord with these X-ray crystallography
results. The experiment to confirm the interactive
regions of Atp15p and Atp3p analyzed by SSM is
in progress.
SSM was the analytical method to identify the base
numbers (position from 5’-ATG = the start codon) of the
interactive regions (sites) of the reference- and the target-
protein. However there were not many examples where the
interactive regions with the base numbers were identi-
fied for the reference and target proteins in the yeast
genome databases such as SGD etc. Therefore we could
not select many examples for the SSM analyses and
showed all examples we have in this manuscript.
Figure 8. Sequence spectra of ATP15 and ATP3 genes (d = 9, m = 10, p = 1).
In the interaction ATP15 (Atp15p)/ATP3 (Atp3p), the reverse-complement base
sequence gave higher homology than the normal base sequence.
(a) Coding region of ATP15 (Atp15p, M = 186, the
active region was shown in purple-blue). (b) Coding region of ATP3 (Atp3p,
target: M = 933). The region most similar to the reference is shown in red (HF
= 88.6%). It has been shown that the amino acids sequences (shown in red
letter) of Atp15p interacts with the amino acids sequences (shown in red re-
gion) of Atp3p [19-21]. The scales of axes in (a) and (b) are the same. The
amino acid sequences of Atp15p and Atp3p neighboring the interactive re-
gions were shown in figures, respectively. The arrowhead and the red letter
indicate amino acid residue N200 of Atp3p, which might interact with Atp15p
according to X-ray crystallography [19,20].
The results in this paper could be sufficient to confirm
the validity of SSM because the probability of identifying
the interactive regions by coincidence was very small.
For instance, in the case of MAS1 (Mas1p)/MAS2
(Mas2p), MAS2 was composed of about 1,400 nt, which
meant that the probability of identification by coincidence
was lower than 1/7 (= 200 / 1400) under the condition of
the homology window width w = 200 nt. The probabilities
for the other examples in this manuscript were as follows.
PHO4/PHO80, lower than 2/9 (= 200/900);
RPB12/RPB2, 1/20 (= 200/4000);
GCR2/GCR1, 1/15 (= 200/3000);
SLA2/SLA1, 1/20 (= 200/4000);
PHO2/PHO4, 1/15 (= 200/3000);
PHO2/SWI5, 1/10 (= 200/2000);
ATP15/ATP3, 1/5 (= 200/1000);
GAL1/GAL4, 1/15 (= 200/3000);
GAL4/GAL10, 2/7 (= 200/700);
GAL4/GAL2, 1/7 (= 200/1000);
GAL4/GAL7, 1/4 (= 200/800);
Therefore the results in this paper made sense statisti-
cally to confirm the validity of the proposed method. In
addition the positions of the key amino acids were iden-
tical in the identified interactive regions in case of the
examples of MAS and PHO proteins. This fact definitely
reinforced the proposed method.
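The ratios listed above follow from dividing the window width by the approximate target length. A minimal sketch of that arithmetic is given below; the function name chance_bound is hypothetical, and the ratio w/M is the simple approximation used in the text rather than a rigorous probability model.

```python
from fractions import Fraction

def chance_bound(window, target_length):
    """Approximate bound on a coincidental hit: window width over target length."""
    return Fraction(window, target_length)

print(chance_bound(200, 1400))  # 1/7  (MAS1/MAS2)
print(chance_bound(200, 900))   # 2/9  (PHO4/PHO80)
print(chance_bound(200, 4000))  # 1/20 (RPB12/RPB2)
```

Using exact fractions reproduces the reduced ratios as they are printed in the list.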
Finally we predicted the interactive regions of many
proteins which were chosen randomly from 16 different
chromosomes of S. cerevisiae [22], and summarized the
prediction results in Table 1 to demonstrate the effectiveness
of SSM. For the examples in Table 1 we used
the same analytical conditions, m = 10, d = 9, p = 1 and
w = 200, and predicted the interactive regions both of
the reference and target proteins. However the proposed
method in this paper was based on the condition that the
interactive region of the reference protein was known
and that of the target protein was unknown. Therefore
some of these prediction results might be revised in our
future work because the identification ability of SSM
was not strong at present when the interactive regions
both of the reference and target proteins were unknown.
We are now improving SSM so that it can be applied to these cases.
Table 1. Possible interactive region. The upper column indicated the 1st, and the lower column indicated the 2nd interactive region,
respectively. *1) Conditions, m = 10, d = 9, p = 1, w = 200; *2) Reference gene; *3) Chromosome located the reference gene; *4)
Amino acid residues of the reference protein; *5) Interactive region of the reference protein predicted by SSM; *6) Target gene; *7)
Chromosome located the target gene; *8) Amino acid residues of the target protein; *9) Interactive region of the target protein; *10)
Homology factor between the target to the reference protein; *11) Either protein was used as the reverse-complement base sequence.
Reference*2  Chromosome*3  Amino     Interactive  Target*6  Chromosome*7  Amino     Interactive  HF (%)*10  Complement*11
                           acids*4   region*5                             acids*8   region*9
GDH3         1             457       272-338      GDH1      15            454       116-182      93.7
                                     52-118                                         52-118       92.3
CDC24        1             854       183-249      ACT1      6             478       83-149       94.7
                                     234-300                                        94-160       94.2
PHO11        1             467       374-440      PHO5      2             467       374-440      94.3
                                     88-154                                         144-210      93
ATP2         10            511       170-236      ATP3      2             311       57-123       93.1
                                     300-366                                        20-86        92.6
SUP45        2             437       311-377      RPS12     15            143       4-70         92.1
                                     188-254                                        (-7)-59      91.3
YDJ1         14            409       292-358      PRD1      3             712       508-574      94
                                     67-133                                         552-618      93.6
GCD2         7             651       550-616      GCD7      12            381       274-341      94.9
                                     221-287                                        (-21)-45     91.4
PHO87        3             923       433-499      SPL2      8             148       24-90        92.1
                                     564-630                                        60-126       92
HXT15        4             567       323-389      GAL2      12            574       52-118       96.9
                                     483-549                                        399-465      96.8
NAB2         7             525       69-135       SNF3      4             884       480-544      95.4
                                     243-309                                        175-241      95.2
ECM10        5             644       140-206      SSA1      1             642       237-303      94
                                     70-136                                         395-461      93.4
HEM1         4             548       16-82        LCB2      4             561       296-362      95
                                     269-335                                        447-513      94.1
POL4         3             582       169-235      CCA1      5             546       218-284      97
                                     67-133                                         103-169      93.9
GUT1         8             709       99-165       XKS1      7             600       24-90        94.5
                                     649-(715)                                      (-12)-54     93.7
YAP1         13            650       335-401      CAD1      4             409       96-162       93.9
                                     531-597                                        217-283      93.3
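Most of the regions listed in Table 1 span 67 amino acid residues (for example 272-338 for GDH3), which is roughly what a window of w = 200 bases covers when it is mapped onto codons. The following Python sketch illustrates this arithmetic only. It assumes, as an interpretation that is not stated in the text, that w is a window width in bases, that positions are 1-based, and that a residue is counted whenever its codon overlaps the window. The function name window_to_residues and the start position 814 are hypothetical choices made for illustration.

```python
def window_to_residues(start_base: int, w: int = 200) -> tuple:
    """Map a window of w bases in a coding sequence to residue numbers.

    Assumed convention (not taken from the paper): bases and residues are
    1-based, and a residue belongs to the window if its codon overlaps it.
    """
    end_base = start_base + w - 1
    first_residue = (start_base - 1) // 3 + 1
    last_residue = (end_base - 1) // 3 + 1
    return first_residue, last_residue


# Hypothetical start position chosen so that the window begins at codon 272.
first, last = window_to_residues(start_base=814)
print(first, last, last - first + 1)  # 272 338 67
```

Under these assumptions a 200-base window covers 67 or 68 residues, depending on where it starts within a codon.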
3.2. Mutual Interaction of Protein-DNA
This section clarifies that the homology of sequence
spectra was also related to the mutual interaction between
protein and DNA. The interactions between the
transcription factor GAL4 [23] and the promoters of the GAL
genes (UASGal signal; GAL1, GAL10, GAL2 and GAL7) [24-26]
were taken as an example. Figure 9 shows the sequence
spectra of the upstream region of GAL1 as the reference
(a) and the reverse-complement base sequence of the
coding region of GAL4 as the target (b). We employed the
upstream region of GAL1 to demonstrate the effectiveness
of the method, although its size of 668 bases was a
little large for a reference region. In Figure 9 the red
region was the first region identified in GAL4.
Surprisingly, this red region coincided completely with the
DNA binding region of GAL4 containing the zinc finger
motif, and the purple-blue region was the promoter region
of GAL1. In this case, therefore, the proposed method
identified both the interactive reference region (in
purple-blue) and the interactive target region (in red)
at the same time, even though the two objects differed:
a protein region for GAL4 and a DNA region for GAL1.
Thus the same interaction analysis might be applied to
the other GAL genes, GAL10, GAL2 and GAL7, whose promoter
regions also interacted with the N-terminal DNA
binding domain (zinc-finger domain) of GAL4 (Gal4p).
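The reverse-complement base sequence used for GAL4 above can be obtained with a few lines of code. The Python sketch below is a generic illustration rather than the authors' implementation; the function names and the short example sequence are hypothetical.

```python
# Complement table for the four DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement of a DNA base sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def position_on_reverse_strand(pos: int, length: int) -> int:
    """Map a 1-based position to the 1-based position of the paired base
    on the reverse-complement sequence of the same length."""
    return length - pos + 1


example = "ATGAAGCTA"  # hypothetical 9-base sequence, for illustration only
print(reverse_complement(example))  # TAGCTTCAT
print(position_on_reverse_strand(1, len(example)))  # 9
```

Applying reverse_complement twice returns the original sequence, which is a convenient check when coordinates are converted between the two strands.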
Figure 10 shows all the promoter regions identified
by SSM, with the DNA binding region of Gal4p (the
reverse-complement base sequence) in Figure 9 as the
reference region (in purple-blue). In this figure the
reference region of GAL4 was fixed in order to align the
identified promoter regions. It was clear from this
figure that the promoter sites in the red regions
overlapped with each other. We obtained similar results for
the PHO genes (data not shown).
Figure 9. Sequence spectra of the GAL1 and GAL4 genes (d = 9, m = 10, p = 1).
(a) The upstream region of GAL1 (668 nt) was used as the reference (in
purple-blue). The arrowheads indicate several promoter sequences. (b) DNA
binding region of GAL4. The reverse-complement sequence of GAL4 (Gal4p,
M = 2,643) was used for comparison with the GAL1 gene. The first 107 amino
acids at the N-terminus of Gal4p, which are involved in DNA binding (shown in
red [23]), were used as the target. The bold arrowhead on Gal4p indicates
the position of L64.
[Enlargement of the spectrum of the interactive region
of GAL promoter region with Gal4p DNA binding region.]
Figure 10. Sequence spectra of other GAL genes (d = 9, m = 10, p = 1). (a) The DNA binding region of GAL4 (reverse-complement
sequence of GAL4; Gal4p, M = 200) was used as the reference (shown in purple-blue), and the upstream regions of the other GAL genes,
GAL10, GAL2 and GAL7, were used as targets to search for their promoter regions (the arrowheads indicate several promoter sequences).
(b) Upstream region of GAL10 (target: M = 668; HF = 89.8%). (c) Upstream region of GAL2 (target: M = 964; HF = 85.8%). (d) Upstream
region of GAL7 (target: M = 728; HF = 84.9%). The bracket in each GAL gene indicates the promoter region (upstream activator sequences,
UASGal) that binds the zinc finger motif of Gal4p [23-26]. The UASGal signals (arrowheads) of each GAL gene were concentrated in
the similar region shown in red. The red regions in (b), (c) and (d) were the most similar regions. The base numbers on the abscissa
were matched in each panel to either the coding or the upstream region. The bold arrowhead on Gal4p indicates the position of L64.
3.3. Crucial Problems and Discussions
Our results raised several crucial problems, listed
below, which were closely related to the fundamental
principles of life. However, we had to admit that we did
not have complete answers to these problems at the
moment. Our discussion below therefore included some
unverified hypotheses.
[Question 1] Why was the sequence spectrum
associated with the functions of protein and DNA?
Originally the sequence spectrum was devised to
examine the generation-rules in the genome, and it
succeeded in visualizing rules such as reverse-complement
symmetry and multiple fractality. Therefore the
observation that the sequence spectrum was associated
with the functions of protein and DNA suggested that
the generation-rules could govern not only the static
base sequence in the genome, as the blueprint of life,
but also the dynamic phenomena of proteins and DNAs, as
the principle of the mechanism of life.
[Question 2] Why was the homology of the sequence
spectrum closely associated with the interaction of
proteins?
A possible answer to this problem was that the
sequence spectrum could reflect the higher order
structure of proteins. The interacting region was
considered to consist of a specific sequence of amino
acids. This specificity of the amino acid sequence could
be reflected in the appearance frequency of the base
sequence corresponding to the amino acid sequence. The
homology of the sequence spectrum could thus be
interpreted as an affinity between the interactive
regions of the proteins.
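The idea that a region's base sequence is reflected in appearance frequencies, and that two regions can be compared through those frequencies, can be sketched in code. The Python example below is a simplified stand-in and not the SSM procedure itself. It assumes that d is the length of the base word whose appearance frequency is counted and that the compared window has the width of the reference profile; it ignores the parameters m and p; and it uses cosine similarity in place of the homology factor (HF), whose definition is not reproduced here. All function names are hypothetical.

```python
import math
from collections import Counter


def word_counts(genome: str, d: int) -> Counter:
    """Count how often each d-base word appears in the genome sequence."""
    return Counter(genome[i:i + d] for i in range(len(genome) - d + 1))


def frequency_profile(region: str, counts: Counter, d: int) -> list:
    """Genome-wide appearance frequency of the d-base word at each position."""
    return [counts[region[i:i + d]] for i in range(len(region) - d + 1)]


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length profiles (a stand-in for HF)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_window(reference, target):
    """Slide the reference profile along the target profile and return
    (start index, similarity) of the most similar target window."""
    w = len(reference)
    scores = [(cosine_similarity(reference, target[s:s + w]), s)
              for s in range(len(target) - w + 1)]
    score, start = max(scores)
    return start, score


# Toy data only: a six-base "genome" and hand-written profiles.
counts = word_counts("ACGACG", d=3)
print(frequency_profile("ACGAC", counts, d=3))  # [2, 1, 1]
start, score = best_window([1, 5, 2, 8], [3, 3, 1, 5, 2, 8, 4])
print(start)  # 2
```

In this toy case the target profile contains the reference profile exactly at index 2, so that window receives the highest similarity. A real comparison would use profiles computed from genomic sequence and the authors' own homology measure.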
[Question 3] Why was the homology of sequence
spectrum closely associated with the interaction of
protein and DNA?
As with [Question 2], a possible answer to this problem
was that the sequence spectrum could reflect the higher
order structure of protein and DNA. However, this would
raise another crucial problem: why could the sequence
spectrum reflect, in the same manner, the higher order
structure of both protein and DNA, which were totally
different objects? To answer this problem, it was
necessary to examine the relation between the higher
order structures of protein and DNA (or RNA). Our results
implied that a close structural relation could exist
between them. For instance, it was well known that a
domain of the EF-G factor protein mimicked
aminoacyl-tRNA [27]. It could even be possible that the
structure of a protein inherited the structure of its
original DNA in the genome, because inheritance would be
the simplest answer to this problem. SSM basically
detected the interacting regions of gene DNAs through the
homology of the sequence spectrum, and this could
automatically lead to the detection of the interacting
regions of the proteins translated from those gene DNAs
through this structure inheritance. We suspected that
tRNA and the codon table gave an important clue on this
issue, because tRNAs were directly associated with both
the amino acids of proteins and the triplet codons of
DNA. Moreover, the sequence spectra of tRNA and protein
possessed a similar relation. For instance, the GTP
binding protein RAS2 [28,29] and Gly(GGG)-tRNA, which
were both related to guanine (G), were similar in the
sequence spectrum [2].
4. CONCLUSIONS
The conclusions obtained in this study are summarized
as follows.
1) The homology of the sequence spectrum was closely
associated with the interaction of protein and DNA.
2) SSM was a suitable prediction method for identifying
interacting regions regardless of the type of biological
macromolecule: DNA, RNA or protein.
3) SSM was fast enough to run on a personal computer
and did not require a supercomputer.
4) The generation-rules in the genome could govern not
only the static base sequence in the genome but also the
dynamic phenomena of proteins and DNAs.
5) The sequence spectrum could reflect the higher
order structure of protein and DNA.
6) There could be a close relation between the
structures of protein and DNA.
The proposed method should be improved so that it can
identify or predict both the reference and the target
regions at the same time in all cases. This project is
now ongoing in our laboratory, and we will report on it
in our next paper.
REFERENCES
[1] Takeda, M. and Nakahara, M. (2009) Structural features
of the nucleotide sequences of genomes. Journal of
Computer Aided Chemistry, 10, 38-52.
[2] Nakahara, M. and Takeda, M. (2010) Characterization of
the sequence spectrum of DNA based on the appearance
frequency of the nucleotide sequences of the genome: A
new method for analysis of genome structure. Journal of
Biomedical Science and Engineering, 3, 340-350.
[3] Geli, V., Yang, M., Suda, K., Lustig, A. and Schatz, G.
(1990) The MAS-encoded processing protease of yeast
mitochondria. Overproduction and characterization of its
two nonidentical subunits. Journal of Biological
Chemistry, 265(31), 19216-19222.
[4] West, A.H., Clark, D.J., Martin, J., Neupert, W., Hartl,
F.U. and Horwich, A.L. (1992) Two related genes
encoding extremely hydrophobic proteins suppress a lethal
mutation in the yeast mitochondrial processing
enhancing protein. Journal of Biological Chemistry, 267(34),
24625-24633.
[5] Ito, A. (1999) Mitochondrial processing peptidase:
multiple-site recognition of precursor proteins. Biochemical
and Biophysical Research Communications, 265(3),
611-616.
[6] Nagao, Y., Kitada, S., Kojima, K., Toh, H., Kuhara, S.,
Ogishima, T. and Ito, A. (2000) Glycine-rich region of
mitochondrial processing peptidase α-subunit is essential
for binding and cleavage of the precursor proteins.
Journal of Biological Chemistry, 275, 34552-34556.
[7] Ogawa, N. and Oshima, Y. (1990) Functional domains of
a positive regulatory protein, PHO4, for transcriptional
control of the phosphatase regulon in Saccharomyces
cerevisiae. Molecular and Cellular Biology, 10(5),
2224-2236.
[8] Okada, H. and Toh-e, A. (1992) A novel mutation
occurring in the PHO80 gene suppresses the PHO4c mutations
of Saccharomyces cerevisiae. Current Genetics, 21(2),
95-99.
[9] Cramer, P., Bushnell, D.A. and Kornberg, R.D. (2001)
Structural basis of transcription: RNA polymerase II at
2.8 Angstrom resolution. Science, 292(5523), 1863-1876.
[10] Baker, H.V. (1991) GCR1 of Saccharomyces cerevisiae
encodes a DNA binding protein whose binding is
abolished by mutations in the CTTCC sequence motif.
Proceedings of the National Academy of Sciences of the United
States of America, 88(21), 9443-9447.
[11] Uemura, H. and Jigami, Y. (1992) Role of GCR2 in
transcriptional activation of yeast glycolytic genes.
Molecular and Cellular Biology, 12(9), 3834-3842.
[12] Deminoff, S.J., Tornow, J. and Santangelo, G.M. (1995)
Unigenic evolution: A novel genetic method localizes a
putative leucine zipper that mediates dimerization of the
Saccharomyces cerevisiae regulator Gcr1p. Genetics,
141(4), 1263-1274.
[13] Deminoff, S.J. and Santangelo, G.M. (2001) Rap1p
requires Gcr1p and Gcr2p homodimers to activate ribosomal
protein and glycolytic genes, respectively. Genetics,
158(1), 133-143.
[14] Gourlay, C.W., Dewar, H., Warren, D.T., Costa, R.,
Satish, N. and Ayscough, K.R. (2003) An interaction
between Sla1p and Sla2p plays a role in regulating actin
dynamics and endocytosis in budding yeast. Journal of
Cell Science, 116(12), 2551-2564.
[15] Liu, C., Yang, Z., Yang, J., Xia, Z. and Ao, S. (2000)
Regulation of the yeast transcription factor PHO2 activity
by phosphorylation. Journal of Biological Chemistry,
275(41), 31972-31978.
[16] Yang, J. and Ao, S.Z. (1996) Interaction of the yeast
PHO2 protein or its mutants with the PHO5 UAS in vitro.
Sheng Wu Hua Xue Yu Sheng Wu Li Xue Bao (Shanghai),
28(3), 316-320.
[17] Shimizu, T., Toumoto, A., Ihara, K., Shimizu, M.,
Kyogoku, Y., Ogawa, N., Oshima, Y. and Hakoshima, T.
(1997) Crystal structure of PHO4 bHLH domain-DNA
complex: Flanking base recognition. EMBO Journal,
16(15), 4689-4697.
[18] Bhoite, L.T. and Stillman, D.J. (1998) Residues in the
Swi5 zinc finger protein that mediate cooperative DNA
binding with the Pho2 homeodomain protein. Molecular
and Cellular Biology, 18(11), 6436-6446.
[19] Rodgers, A.J. and Wilce, M.C. (2000) Structure of the
gamma-epsilon complex of ATP synthase. Nature
Structural Biology, 7(11), 1051-1054.
[20] Montgomery, G.C., Leslie, A.G. and Walker, J.E. (2000)
The structure of the central stalk in bovine F(1)-ATPase
at 2.4 A resolution. Nature Structural Biology, 7(11),
1055-1061.
[21] Tsumuraya, M., Furuike, S., Adachi, K., Kinoshita, K. Jr.
and Yoshida, M. (2009) Effect of ε subunit on the
rotation of thermophilic Bacillus F1-ATPase. FEBS Letters,
583(7), 1121-1126.
[22] Saccharomyces Genome Database (2010)
http://www.yeastgenome.org/
[23] Ding, W.V. and Johnston, S.A. (1997) The DNA binding
and activation domains of Gal4p are sufficient for
conveying its regulatory signals. Molecular and Cellular
Biology, 17(5), 2538-2549.
[24] Johnston, M. and Davis, R.W. (1984) Sequences that
regulate the divergent GAL1-GAL10 promoter in
Saccharomyces cerevisiae. Molecular and Cellular Biology,
4(11), 1440-1448.
[25] Lorch, Y. and Kornberg, R.D. (1985) A region flanking
the GAL7 gene and binding site for GAL4 protein as
upstream activating sequences in yeast. Journal of
Molecular Biology, 186(4), 821-824.
[26] Tajima, M., Nogi, Y. and Fukazawa, T. (1986) Duplicate
upstream activating sequences in the promoter region of
the Saccharomyces cerevisiae GAL7 gene. Molecular
and Cellular Biology, 6(1), 246-256.
[27] Nissen, P., Kjeldgaard, M., Thirup, S., Polekhina, G.,
Reshetnikova, L., Clark, B.F. and Nyborg, J. (1995) Crystal
structure of the ternary complex of Phe-tRNAPhe, EF-Tu,
and a GTP analog. Science, 270(5241), 1464-1472.
[28] Kataoka, T., Powers, S., McGill, C., Fasano, O.,
Strathern, J., Broach, J. and Wigler, M. (1984) Genetic analysis
of yeast RAS1 and RAS2 genes. Cell, 37(2), 437-445.
[29] Mabuchi, T., Ichimura, Y., Takeda, M. and Douglas, M.G.
(2000) ASC1/RAS2 suppresses the growth defect on
glycerol caused by the atp1-2 mutation in the yeast
Saccharomyces cerevisiae. Journal of Biological Chemistry,
275(14), 10492-10497.