J. Software Engineering & Applications, 2009, 2: 206-208
doi:10.4236/jsea.2009.23028 Published Online October 2009 (http://www.SciRP.org/journal/jsea)
Copyright © 2009 SciRes JSEA
MicrobIdentifier: A Microbial Identification Software
Based on Mass-Spectrometry
Feng LIU, Lu LI, Chi ZHANG, Lingbing WANG, Pei LI
International School of Software, Wuhan University, Wuhan, China.
Email: wolflf@126.com, {lulu.li1989, chzhcn88}@gmail.com
Received May 18th, 2009; revised July 5th, 2009; accepted July 16th, 2009.
Microbial identification by mass cataloging is now widely used. We have developed MicrobIdentifier,
a microbial identification software package that integrates and automates the steps of rapid
species identification based on mass-spectrometry. The software is written in Java so that it
runs across platforms.
Keywords: Microbial Identification, Mass-Spectrometry
1. Introduction
Microbial identification by mass cataloging has attracted considerable
attention because of its efficiency and its potential for automation. To
improve both further, we have developed microbial identification
software based on the spectral coincidence function proposed in [1].
The software has two major functions. First, it searches for all
possible primer pairs among the given genes of different species and
evaluates each candidate pair by assigning it a score, which serves as
a useful reference during primer design. Second, it uses the spectral
coincidence function to compare mass spectrometric observables with
theoretical fragmentation patterns, and thereby determines the genetic
affinity between the sample gene and the genes of known species in the
database. This frees researchers from comparing fragmentation patterns
manually.
2. Algorithm
The core algorithm on which our work is based is the
spectral coincidence function proposed in [1]:

\[ C(M, M') = \frac{\langle M, M' \rangle}{\sqrt{\langle M, M \rangle \, \langle M', M' \rangle}} \]

The dot-product in the coincidence function is defined as

\[ \langle M, M' \rangle = \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \delta(m_i - m'_j) \]

where M is the mass vector of one sample's fragmentation,
which has N1 elements with mi standing for the ith
element, while M' is the mass vector of the other sample,
which has N2 elements with m'j standing for the jth
element. The discrete delta function is:

\[ \delta(k) = \begin{cases} 1, & k = 0 \\ 0, & \text{otherwise} \end{cases} \]
Based on these formulas, the inner product grows as the two
samples share more fragments of the same mass. The coincidence
function normalizes the inner product to a value between zero
and one, and a higher value indicates greater similarity between
the two genes being compared. This function can therefore be
used to score similarity in both the primer search process and
the identification process.
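The coincidence function can be illustrated with a minimal Java sketch. It assumes that the normalization divides the inner product by the square root of the product of the two self inner products, and that masses are compared for exact equality as in the discrete delta function. The class and method names are hypothetical and are not taken from MicrobIdentifier itself.

```java
final class Coincidence {
    /** Inner product <M, M'>: counts the pairs (i, j) whose masses are exactly equal. */
    static long innerProduct(double[] m, double[] mPrime) {
        long count = 0;
        for (double mi : m) {
            for (double mj : mPrime) {
                if (mi == mj) { // discrete delta: contributes 1 only when m_i - m'_j == 0
                    count++;
                }
            }
        }
        return count;
    }

    /** C(M, M') = <M, M'> / sqrt(<M, M> <M', M'>); defined here as 0 if either vector is empty. */
    static double score(double[] m, double[] mPrime) {
        double norm = Math.sqrt((double) innerProduct(m, m) * innerProduct(mPrime, mPrime));
        return norm == 0.0 ? 0.0 : innerProduct(m, mPrime) / norm;
    }
}
```

For two vectors of three distinct masses each that share two masses, the inner product is 2 and both self inner products are 3, so the score is 2/3.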
The algorithm of the primer search process is as follows:
1) Align all the gene sequences with the ClustalW algorithm [3].
2) Find the conserved regions, i.e., regions where all the
sequences have more than N nucleotides at the same place and in
the same order. If fewer than two such regions are found, exit.
3) Take two conserved regions and check whether the number of
nucleotides is more than M; otherwise, take another pair of
regions.
4) Cut the region spanned by the two conserved regions
(conserved regions included) after every “G”, and discard
the fragments that have fewer than L nucleotides.
5) Calculate the mass of every fragment of each sequence to
form that sequence's mass vector.
6) Take the mass vectors of one pair of gene sequences and
calculate a score indicating their similarity using the
coincidence function.
7) Repeat Step 6 until every pair of gene sequences has been
compared, and calculate the average of all the scores from
Step 6. This average is the final score of the primer pair
chosen in Step 3.
8) Repeat Steps 3 to 7 until all combinations of the
conserved regions have been considered.
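Steps 2 and 4 can be sketched in Java as follows. This is an illustration under stated assumptions rather than the actual MicrobIdentifier code: the aligned sequences are assumed to be equal-length uppercase strings with '-' marking gaps, a conserved region is taken to be a maximal run of identical, gap-free columns longer than n, and regions are returned as half-open index ranges. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PrimerSearchSketch {
    /** Step 2: maximal runs of alignment columns, longer than n, in which every sequence agrees. */
    static List<int[]> conservedRegions(List<String> aligned, int n) {
        List<int[]> regions = new ArrayList<>();
        int length = aligned.get(0).length();
        int runStart = -1; // -1 means "not currently inside a conserved run"
        for (int col = 0; col <= length; col++) {
            boolean conserved = col < length && isConservedColumn(aligned, col);
            if (conserved && runStart < 0) {
                runStart = col;
            } else if (!conserved && runStart >= 0) {
                if (col - runStart > n) {
                    regions.add(new int[] {runStart, col}); // half-open range [start, end)
                }
                runStart = -1;
            }
        }
        return regions;
    }

    private static boolean isConservedColumn(List<String> aligned, int col) {
        char first = aligned.get(0).charAt(col);
        if (first == '-') {
            return false; // a gap column is never conserved
        }
        for (String s : aligned) {
            if (s.charAt(col) != first) {
                return false;
            }
        }
        return true;
    }

    /** Step 4: cut after every 'G' and drop fragments shorter than minLength. */
    static List<String> cutAfterG(String sequence, int minLength) {
        List<String> fragments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < sequence.length(); i++) {
            boolean atEnd = i == sequence.length() - 1;
            if (sequence.charAt(i) == 'G' || atEnd) {
                String fragment = sequence.substring(start, i + 1);
                if (fragment.length() >= minLength) {
                    fragments.add(fragment);
                }
                start = i + 1;
            }
        }
        return fragments;
    }
}
```

For example, cutting "AAGCGTTTG" with a minimum length of 3 yields the fragments "AAG" and "TTTG"; the two-nucleotide fragment "CG" is discarded.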
Optimal primer pairs are conserved regions that enclose a
highly variable region. A primer pair with a lower score is
better than one with a higher score: a lower score means that
the fragmentation patterns of the different species are less
similar to one another, so test samples can be identified more
easily in the identification process.
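The scoring in Steps 6 and 7 and the preference for low scores could be sketched as below. The similarity function is passed in as a parameter so that the sketch does not depend on any particular implementation of the coincidence function; the class name, method names and primer pair labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class PrimerRanking {
    /** Steps 6-7: average the similarity over every unordered pair of mass vectors. */
    static double averagePairwiseScore(List<double[]> massVectors,
                                       ToDoubleBiFunction<double[], double[]> similarity) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < massVectors.size(); i++) {
            for (int j = i + 1; j < massVectors.size(); j++) {
                sum += similarity.applyAsDouble(massVectors.get(i), massVectors.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs; // fewer than two sequences: nothing to compare
    }

    /** Orders primer pair labels by ascending score, so the preferred candidates come first. */
    static List<String> rankAscending(Map<String, Double> scoreByPrimerPair) {
        List<String> labels = new ArrayList<>(scoreByPrimerPair.keySet());
        labels.sort(Comparator.comparingDouble(scoreByPrimerPair::get));
        return labels;
    }
}
```

With three sequences there are three unordered pairs, so the final score of a primer pair is the mean of three coincidence values.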
The algorithm of the identification process is almost the
same as Steps 3 to 6 of the primer search process, with one
exception: the comparison is made between the experimental
data and the computed mass vectors in the database. A higher
score indicates closer genetic affinity and suggests a higher
probability that the sample belongs to the same species.
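Given inevitable experimental inaccuracy, the discrete
delta function is further modified to be:

\[ \delta(k) = \begin{cases} 1, & |k| \le \text{tolerance} \\ 0, & \text{otherwise} \end{cases} \]

Thus, a difference between masses that falls within the tolerance is ignored.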
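A tolerance-aware variant of the coincidence score might look like the following Java sketch. It assumes that two masses match when their absolute difference does not exceed the tolerance, and that the same modified delta function is used in the numerator and in both normalizing terms. The names are hypothetical.

```java
final class TolerantCoincidence {
    /** Inner product with the modified delta: counts pairs with |m_i - m'_j| <= tolerance. */
    static long innerProduct(double[] m, double[] mPrime, double tolerance) {
        long count = 0;
        for (double mi : m) {
            for (double mj : mPrime) {
                if (Math.abs(mi - mj) <= tolerance) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Coincidence of observed and theoretical mass vectors; 0 if either vector is empty. */
    static double score(double[] observed, double[] theoretical, double tolerance) {
        double norm = Math.sqrt((double) innerProduct(observed, observed, tolerance)
                * innerProduct(theoretical, theoretical, tolerance));
        return norm == 0.0 ? 0.0 : innerProduct(observed, theoretical, tolerance) / norm;
    }
}
```

With a nonzero tolerance, the score is no longer guaranteed to stay at or below one if masses within the same vector lie closer together than the tolerance, so the tolerance should be small relative to the spacing of the expected fragment masses.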
3. Software
The software accepts a .fasta file as input and then invokes a
new process running clustalw, which also takes the .fasta file.
As long as the .fasta file is valid in format, an .aln file
containing the result of clustalw's alignment is created and then
read back. By parsing both the .fasta file and the .aln file, the
software constructs a data group. In the software, a data group
is a pool of sequences together with the user configuration that
makes it ready for identification. Typically users need to assign
four thresholds: the minimum length of a sequence fragment after
simulated cutting; the minimum length of a primer; and the minimum
and maximum length of the variable region between primer pairs.
The same sequence pool with different configurations forms
different data groups. The software ensures that users work on
only one data group at a time, since the data group concept is
flexible and reusable enough for users to handle microbial
identification with a single data group in most situations.
During this preprocessing phase, the software stores the user
configuration as well as the data group sequences in the database
in order to 1) enable access to previously processed data groups
later and 2) provide the threshold reference for the
identification process.
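The FASTA parsing step can be illustrated with a short Java sketch. It relies only on the general FASTA convention that a header line starts with '>' and that the following lines, up to the next header, hold the sequence. Converting the sequence to uppercase and the class and method names are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FastaSketch {
    /** Maps each header (the text after '>') to its concatenated sequence, in file order. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, StringBuilder> builders = new LinkedHashMap<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no data
            }
            if (line.startsWith(">")) {
                current = new StringBuilder();
                builders.put(line.substring(1).trim(), current);
            } else if (current != null) {
                current.append(line.toUpperCase()); // sequence lines may be wrapped
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        builders.forEach((name, sequence) -> result.put(name, sequence.toString()));
        return result;
    }
}
```

The lines would typically come from something like java.nio.file.Files.readAllLines applied to the user's .fasta file.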
Figure 1. MicrobIdentifier screenshot
The user interface shows the sequences in the pool. If the
current data group is loaded from the database, its primer
selection thresholds and primer pair candidates, which were
already computed after configuration in a previous session,
are shown as well. The more usual case, however, is that the
user sets up the basic configuration after a new pool has been
supplied, parsed and shown in the UI, and then calculates the
potential primer pairs. The list of primer pairs is sorted by
score in ascending order. The configuration is saved to the
database in association with the working data group.
To perform microbial identification, the software uses an
ASCII spectrometry .txt file exported from DataExplorer, which
contains the mass spectrometry result from MALDI-TOF. Users
are free to choose a subset of the proposed primer pair
candidates, but they must provide some parameters describing
the conditions of their mass-spectrometry experiment: the in
vitro transcription enzyme, either SP6 or T7; the mass
tolerance and minimum intensity threshold; and whether the
electric charge is positive or negative during the MALDI-TOF
experiment. The software parses the input file, generates the
peak list after filtering out peaks below the intensity
threshold, accounts for experimental inaccuracy by applying
the tolerance, and finally reports the identification result.
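The intensity filtering can be sketched in Java as follows. The exact layout of the DataExplorer export is not described here, so the sketch assumes one data point per line with the mass in the first whitespace-separated column and the intensity in the second, and it skips any line that does not fit this pattern. It filters data points by intensity only and does not attempt peak detection. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PeakListSketch {
    /** Returns the masses of all data points whose intensity is at least minIntensity. */
    static double[] parsePeaks(List<String> lines, double minIntensity) {
        List<Double> masses = new ArrayList<>();
        for (String raw : lines) {
            String[] columns = raw.trim().split("\\s+");
            if (columns.length < 2) {
                continue; // blank or single-column line
            }
            try {
                double mass = Double.parseDouble(columns[0]);
                double intensity = Double.parseDouble(columns[1]);
                if (intensity >= minIntensity) {
                    masses.add(mass);
                }
            } catch (NumberFormatException e) {
                // Assumed to be a header or comment line: skip it.
            }
        }
        return masses.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```

The resulting array can serve as the observed mass vector that is compared against the theoretical mass vectors stored in the database.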
Figure 1 shows the interface of MicrobIdentifier.
4. Acknowledgements
This paper is sponsored by the National Science and Technology
Major Project 2009ZX10004-107 and The Natural Science Founds of
Wuhan University F020504.
References
[1] G. W. Jackson, R. J. McNichols, G. E. Fox, and R. C. Willson,
“Bacterial genotyping by 16S rRNA mass cataloging,” BMC
Bioinformatics, Vol. 7, pp. 321–335, June
[2] Z. D. Zhang, G. W. Jackson, G. E. Fox, and R. C. Willson,
“Microbial identification by mass cataloging,” BMC
Bioinformatics, Vol. 7, pp. 117–135, March 2006.
[3] J. D. Thompson, D. G. Higgins, and T. J. Gibson,
“CLUSTAL W: Improving the sensitivity of progressive
multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice,”
Nucleic Acids Research, Vol. 22, pp. 4673–4680,
September 1994.
[4] C. Honisch, Y. Chen, C. Mortimer, C. Arnold, O.
Schmidt, D. van den Boom, C. R. Cantor, H. N. Shah,
and S. E. Gharbia, “Automated comparative sequence
analysis by base-specific cleavage and mass spectrometry
for nucleic acid-based microbial typing,” Proceedings of
the National Academy of Sciences, Vol. 104, pp.
10649–10654, June 2007.
[5] H. Steen and M. Mann, “The abc’s (and xyz’s) of Peptide
Sequencing,” Molecular Cell Biology, Vol. 5, pp. 699–711,
September 2004.