MicrobMatcher: a microbial comparison software based on matrix-assisted laser desorption/ionization with time-of-flight mass spectrometry

Matrix-assisted Laser Desorption/Ionization with Time-of-flight Mass Spectrometry (MALDI-TOFMS) has been investigated as a method for the rapid identification of species. The current demand in microbial identification is to compare unknown strains with known ones quickly, semi-automatically and accurately. In this paper, we present a software tool that allows flexible microbial matching in a user-friendly way by letting users customize comparison parameters, including the in vitro transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights. We provide three spectral scoring functions to compute the affinity between species, thereby improving the precision of microbial comparison. To test and verify this tool, we employed experimental spectral data based on MALDI-TOFMS and the gene sequences of E. coli and Salmonella. The software is written in Java so that it can run across platforms.


INTRODUCTION
MALDI-TOFMS is an analytical technique that measures the mass-to-charge ratio of charged particles. It is used for determining the masses of particles, for determining the elemental composition of a sample or molecule, and for elucidating the chemical structures of molecules such as peptides and other chemical compounds. With the development of this technology, microbial identification by mass cataloging has attracted considerable attention owing to its high efficiency and automation. Meanwhile, there is a current demand to compare mass spectrometric observables with theoretical fragmentation patterns, and further to determine the genetic affinity between the sample gene and the genes of known species in the database quickly, semi-automatically and accurately.
Within this context, our paper presents a software tool that allows flexible microbial matching in a user-friendly way. For matching speed and accuracy, the software provides three spectral scoring functions to compute the coincidence between species. For semi-automation, the tool allows users to customize comparison parameters, including the transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights.
To test and verify this tool, we employed experimental spectral data based on MALDI-TOFMS and the gene sequences of E. coli and Salmonella.
The remainder of the document is structured as follows. We present three algorithms for computing the coincidence between the sample gene and the genes of known species in Section 2, followed by the description and the verification of the software in Section 3 and Section 4, respectively. Subsequently, related work is discussed in Section 5. Finally, Section 6 concludes.

Overall Algorithm
The overall algorithm of the comparison process is as follows:
1) Amend the gene sequence of the known reference species according to the transcription enzyme to form the theoretical gene sequence. If the promoter is T7, the nucleotide sequence "TTCTATAGTGTCACCTAAAT" is added to the original sequence. If the promoter is Sp6, the original gene sequence is reversed and complemented (A-T, G-C), and the nucleotide sequence "CCCTATAGTGAGTCGTATTAC" is then added to the result.
2) Cut the theoretical gene sequence after every base 'G', omitting the fragments that have fewer than L nucleotides. L is determined by the user.
3) Calculate the mass of all fragments (also referred to as 'fingerprint biomarkers') from the sequence and then form the sequence's mass vector. The mass of every fragment is: -329.2 305.2 361.2 18.0148 1.0072, where A, G, C and T stand respectively for the total numbers of adenylic acid, guanylic acid, cytidylic acid and thymidylic acid in each fragment.
4) Take the mass vector of the gene sequence and the experimental mass data, and calculate the score indicating their similarity by using one of the spectral scoring functions introduced below.
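Steps 1 and 2 above can be sketched in Java, the language the software is written in. This is a minimal illustration under stated assumptions, not the tool's actual code: the class and method names are hypothetical, the trailing piece after the last 'G' is kept as a fragment (the text does not say how it is handled), and the promoter-dependent sequence is not attached because its position is not specified. The per-fragment base counts returned by `countBases` are the A, G, C and T quantities that the mass formula in step 3 takes as input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating steps 1-2 of the overall algorithm.
final class FragmentCutter {

    // Reverse and complement a DNA sequence (A-T, G-C), as in the Sp6 case.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            char c = Character.toUpperCase(seq.charAt(i));
            switch (c) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'G': sb.append('C'); break;
                case 'C': sb.append('G'); break;
                default:
                    throw new IllegalArgumentException("Unexpected base: " + c);
            }
        }
        return sb.toString();
    }

    // Cut after every 'G' and drop fragments shorter than minLength (the user-defined L).
    // Assumption: a trailing piece without a terminal 'G' is kept if it is long enough.
    static List<String> cutAfterG(String sequence, int minLength) {
        String seq = sequence.toUpperCase();
        List<String> fragments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < seq.length(); i++) {
            if (seq.charAt(i) == 'G') {
                if (i + 1 - start >= minLength) {
                    fragments.add(seq.substring(start, i + 1));
                }
                start = i + 1;
            }
        }
        if (start < seq.length() && seq.length() - start >= minLength) {
            fragments.add(seq.substring(start));
        }
        return fragments;
    }

    // Count the bases of one fragment; the result is ordered {A, G, C, T}.
    static int[] countBases(String fragment) {
        int[] counts = new int[4];
        for (char c : fragment.toUpperCase().toCharArray()) {
            switch (c) {
                case 'A': counts[0]++; break;
                case 'G': counts[1]++; break;
                case 'C': counts[2]++; break;
                case 'T': counts[3]++; break;
                default:
                    throw new IllegalArgumentException("Unexpected base: " + c);
            }
        }
        return counts;
    }
}
```

For example, `FragmentCutter.cutAfterG("AAGTCGGTT", 2)` yields the fragments AAG, TCG and TT; the single-base fragment G is dropped because it is shorter than L = 2.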

The First Spectral Scoring Function
The first spectral scoring function [1] in our work is as follows. Let N denote the total number of fingerprint biomarkers in the given theoretical gene sequence. A vector u of length N is constructed, whose elements are 0's and 1's. The ith element of u is 1 if the mass of the ith fragment is observed within tolerance in the blinded sample, and 0 if the ith fingerprint peak is not observed. The number of 1's in u (that is, the sum of all elements of u) is the number of fingerprint biomarkers observed in the blinded sample.
For each blinded sample and each reference species, a likelihood is computed based on the number of fragments observed in the blinded sample. This likelihood is a value between 0 and 1. If the likelihood is close to 1, the reference bacterium is determined to be present. If the likelihood is close to 0, the blinded sample does not contain the significant fingerprint biomarkers, and the reference is determined to be absent.
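The construction of the vector u can be sketched as follows. The names are hypothetical, and masses are assumed to be plain arrays of doubles in the same unit as the tolerance. The text does not give the likelihood formula, so `observedFraction` simply returns the share of biomarkers observed, sum(u)/N, as a stand-in that also lies between 0 and 1; it should not be read as the likelihood of [1].

```java
// Hypothetical sketch of the binary match vector u used by the first scoring function.
final class BinaryMatchScore {

    // u[i] is 1 if the i-th theoretical mass is observed within tolerance, else 0.
    static int[] matchVector(double[] theoreticalMasses, double[] observedMasses,
                             double tolerance) {
        int[] u = new int[theoreticalMasses.length];
        for (int i = 0; i < theoreticalMasses.length; i++) {
            for (double observed : observedMasses) {
                if (Math.abs(observed - theoreticalMasses[i]) <= tolerance) {
                    u[i] = 1;
                    break; // one matching peak is enough
                }
            }
        }
        return u;
    }

    // Number of fingerprint biomarkers observed in the blinded sample (sum of u).
    static int countObserved(int[] u) {
        int sum = 0;
        for (int value : u) {
            sum += value;
        }
        return sum;
    }

    // Assumption: fraction of biomarkers observed, used here in place of the
    // likelihood, whose exact formula is not given in the text.
    static double observedFraction(int[] u) {
        return u.length == 0 ? 0.0 : (double) countObserved(u) / u.length;
    }
}
```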

The Second Spectral Scoring Function
Building on the first method, the second spectral scoring function [2] in our work allows the user to define two intensities that partition the experimental peaks into three lists: the first peak list, whose intensities are higher than the larger defined intensity; the second peak list, whose intensities lie between the two defined intensities; and the third peak list, whose intensities are lower than the smaller defined intensity. Furthermore, users can assign a credibility to each of the three intervals of peaks by giving each a weight, with the constraint that the weights sum to one. This method considers the reliability of the intensities and draws on the users' experience. The scoring function is defined in terms of the following quantities:
MP1 is the number of matched fragments between the theoretical fragments and the experimental peaks whose intensities are higher than the larger defined intensity.
MP2 is the number of matched fragments between the theoretical fragments and the experimental peaks whose intensities are lower than the larger defined intensity and higher than the smaller defined intensity.
MP3 is the number of matched fragments between the theoretical fragments and the experimental peaks whose intensities are lower than the smaller defined intensity.
W1, W2 and W3 are the credibilities of the three intervals of peaks corresponding to MP1, MP2 and MP3, respectively.
N is the total number of fingerprint biomarkers in the given theoretical gene sequence.
A higher score indicates greater genetic affinity and therefore a higher possibility of being the same species.
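One plausible way to combine these quantities is sketched below. It is an assumption, not the formula of [2]: the score is taken to be (W1·MP1 + W2·MP2 + W3·MP3) / N. The class name, the peak representation (`{mass, intensity}` pairs) and the tie-breaking rules are also hypothetical: a fragment matching several peaks is assigned to the most intense one, and an intensity exactly equal to a bound falls into the middle interval.

```java
// Hypothetical sketch of an intensity-weighted score built from MP1..MP3, W1..W3 and N.
final class WeightedIntensityScore {

    // peaks[k] = {mass, intensity}; intensities are assumed to be non-negative.
    static double score(double[] theoreticalMasses, double[][] peaks, double tolerance,
                        double lowIntensity, double highIntensity,
                        double w1, double w2, double w3) {
        if (theoreticalMasses.length == 0) {
            return 0.0;
        }
        int mp1 = 0, mp2 = 0, mp3 = 0;
        for (double mass : theoreticalMasses) {
            // Assumption: use the most intense peak that matches within tolerance.
            double best = -1.0;
            for (double[] peak : peaks) {
                if (Math.abs(peak[0] - mass) <= tolerance && peak[1] > best) {
                    best = peak[1];
                }
            }
            if (best < 0) {
                continue; // fragment not observed
            }
            if (best > highIntensity) {
                mp1++;
            } else if (best >= lowIntensity) {
                mp2++;
            } else {
                mp3++;
            }
        }
        // Assumed form of the score: weighted matches divided by N.
        return (w1 * mp1 + w2 * mp2 + w3 * mp3) / theoreticalMasses.length;
    }
}
```

With weights that sum to one, this form rewards matches on intense peaks more than matches on weak ones, which is the behavior the text describes.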

The Third Spectral Scoring Function
The third spectral scoring function [2,3] in our work is built on the scalar product (often referred to as a 'dot product') of two mass vectors, M and M'. Here M is the mass vector of one sample's fragmentation, which has N1 fragments with mi standing for the mass of the ith fragment, while M' is the mass vector of the other sample's fragmentation, which has N2 fragments with m'j standing for the mass of the jth fragment. The scalar product is computed with a discrete delta function δ, which equals one when two masses are equal and zero otherwise. Given inevitable experimental inaccuracy, δ is further modified so that two masses are treated as equal when they agree within the tolerance. Based on these definitions, the inner product is greater if the two samples have more fragments of the same mass. The spectral scoring function normalizes the inner-product value to a range between zero and one, and a high value of the spectral scoring function indicates a higher possibility of being the same species.
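A minimal sketch of such a tolerance-based scalar product is given below. The names are hypothetical, and the normalization is an assumption: a cosine-style ratio, M·M' divided by the square root of (M·M)(M'·M'), is used because the text does not state the exact form. With tolerance matching, one mass can match several masses in the other vector, so the ratio can exceed one; the sketch therefore clamps the result to one.

```java
// Hypothetical sketch of a dot-product score with a tolerance-based delta function.
final class DotProductScore {

    // Scalar product: number of pairs (i, j) whose masses agree within tolerance.
    static int dot(double[] m, double[] mPrime, double tolerance) {
        int count = 0;
        for (double a : m) {
            for (double b : mPrime) {
                if (Math.abs(a - b) <= tolerance) {
                    count++; // modified delta function evaluates to 1
                }
            }
        }
        return count;
    }

    // Assumed cosine-style normalization, clamped to the range [0, 1].
    static double normalized(double[] m, double[] mPrime, double tolerance) {
        double selfM = dot(m, m, tolerance);
        double selfMPrime = dot(mPrime, mPrime, tolerance);
        if (selfM == 0 || selfMPrime == 0) {
            return 0.0; // at least one vector is empty
        }
        double ratio = dot(m, mPrime, tolerance) / Math.sqrt(selfM * selfMPrime);
        return Math.min(1.0, ratio);
    }
}
```

Two identical mass vectors score one under this sketch, and two vectors with no masses in common within tolerance score zero.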

SOFTWARE
To perform microbial comparison, the software takes two inputs: the ASCII spectrum .txt file exported from DataExplorer (Figure 1), which contains the mass-intensity results from MALDI-TOFMS, and the theoretical gene sequence of the known reference species, either imported as a .txt file from the local file system or pasted directly into the text box. The software offers the three spectral scoring functions described above, and users choose one of them to calculate the coincidence between the experimental data and the theoretical DNA sequence. In all three methods, users are free to customize the parameters of their mass spectrometry experiment, including the in vitro transcription enzyme (either T7 or Sp6), mass tolerance, minimum fragment length and intensity threshold. In addition, in Method 2, users can customize the intensity ranges and corresponding weights according to their previous experience of the importance of peaks within each relative intensity range. Subsequently, the software parses the input file, generates peak lists after filtering out peaks below the intensity threshold, accounts for experimental inaccuracy by applying the tolerance, and finally provides the comparison result of the selected method. For further research, users can save the comparison result as a .txt report file. Figures 2, 3 and 4 show the user interfaces of the three scoring methods, respectively.
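The parsing and filtering step can be illustrated with the sketch below. The layout of the exported file is not described in the text, so the sketch assumes one peak per line with the mass in the first column and the intensity in the second, separated by whitespace, commas or semicolons; lines that do not parse as two numbers (for instance a header) are skipped. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build a peak list from the lines of an exported spectrum file.
final class PeakListParser {

    // Returns {mass, intensity} pairs whose intensity is at or above the threshold.
    static List<double[]> parse(List<String> lines, double intensityThreshold) {
        List<double[]> peaks = new ArrayList<>();
        for (String line : lines) {
            String[] columns = line.trim().split("[\\s,;]+");
            if (columns.length < 2) {
                continue; // blank or single-column line
            }
            try {
                double mass = Double.parseDouble(columns[0]);
                double intensity = Double.parseDouble(columns[1]);
                if (intensity >= intensityThreshold) {
                    peaks.add(new double[] {mass, intensity});
                }
            } catch (NumberFormatException e) {
                // Assumption: non-numeric lines are headers or comments and are ignored.
            }
        }
        return peaks;
    }
}
```

The lines of a file could be obtained with `java.nio.file.Files.readAllLines` before being passed to `parse`.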

VERIFICATION
This paper presents two sets of experiments, a negative control and a positive control, to verify the accuracy and utility of the software.
The negative control is divided into two parts. In each part, we use five sets of data from five separate experiments on one species, together with the DNA sequence of another species, as input to test how the software handles a mismatch. For example, we calculate the coincidence between the theoretical sequence of E. coli and each set of experimental data of Salmonella. To ensure a fair comparison, the same parameters are used in all experiments. Table 1 shows the results of these negative control experiments. The coincidence values are all too low for the microbe to be classified as the species of the theoretical sequence. In other words, the results indicate that the experimental species is probably not the same as the theoretical species, which accords with our expectation.
The positive control is likewise divided into two parts, and in each part we use five sets of data from five separate experiments on one species, together with the DNA sequence of the same species, as input. For instance, the coincidence between the theoretical sequence of E. coli and its own experimental data is calculated. Again, the conditions of each experiment are kept the same to ensure fairness. Table 2 shows the results of the positive control experiments.
Given the allowed tolerance during the experiment and previous experience, the coincidence values are all within the acceptable range, which reflects a high probability of similarity between the two species in the comparison and indicates that our software is robust and accurate.

RELATED WORK
The software in this paper performs the comparison between the known species in the databases and an unknown species for which mass-intensity data have been generated by MALDI-TOFMS. In the next phase, we will perform statistical analysis on a number of spectra from one species, with the aim of comparing the affinity among unknown species. Furthermore, we will try to build models of species and to search for the range of possible species for an unknown species based on its MALDI-TOFMS data.

CONCLUSIONS
To allow flexible microbial matching in a user-friendly way, we designed the software "micromatcher". To perform microbial comparison, the software takes as inputs the ASCII spectrum .txt file exported from DataExplorer, which contains the mass-intensity results from MALDI-TOFMS, and the theoretical gene sequence of the known species in the database. The software offers three spectral scoring functions, and users choose one of them. Users are then free to customize the comparison parameters, including the in vitro transcription enzyme, mass tolerance, minimum fragment length, intensity threshold and corresponding weights. The software parses the input file, generates peak lists after filtering out peaks below the intensity threshold, accounts for experimental inaccuracy by applying the tolerance, and finally provides the comparison results.
The software computes the genetic affinity between the sample gene and the genes of known species in the database quickly, semi-automatically and accurately.

Figure 3. The user interface of Method 2.

Figure 4. The user interface of Method 3.

Table 1. The results of the negative control experiments.

Table 2. The results of the positive control experiments.