A New Tag Index Scheme Enables Fast Peptide Retrieval for Protein Identification

Sequence tag index in the field of computational proteomics can be used to facilitate faster open-search-based identification of modified peptides and in-depth analysis of mass spectrometry data. In protein-identification search engines, sequence tag index are playing a prominent role in recent ten years due to fast searching speed. However, in pursuit of less index space consumption, some protein search engines design excessively concise index schemes which lead to higher computational burden. We proposed a new tag index scheme named TIIP with a better balance between space and time complexity. TIIP has a unique two-level hierarchical index structure which allows rapid retrieval of all peptide sequences and their corresponding masses. Theoret-ically, the index space consumption of TIIP is not much higher compared to the typical tag index schemes, but the time complexity of sequence retrieval can be reduced to O(1), and practically, TIIP has about one million fold im-provement in searching speed compared with brute force approach.


Sequence Tag
The concept of sequence tag was introduced by Mann et al. in 1994 [1]. It refers to the partial sequence of amino acids derived from a series of continuous fragment ions as shown in Figure 1 [2] [3] [4] [5].
As an analysis strategy of mass spectrometry (MS) data in proteomics, it is an intermediate method between database search [6]- [19] and de novo sequencing [2] [3] [4] [5] [10]. The sequence tag method can be used to prune exponentially larger search space in discovery proteomics to realize the effective filtering of candidate peptide sequences. Because of the characteristics above, search engines embrace the possibility of open search (expanding the search scope) without taking too much time [7] [9] [10] [13] [18]. In recent years, sequence tag method has been adopted by many modern protein search engines, such as Open-pFind, MODplus, and TagGraph, as an essential speeding up technique and become a fundamental peptide identification algorithms [7] [13] [18].

Tag Index Design
Sequence tag index, or tag index for short, is a hash table that can quickly search related peptide sequences [20] [21]. It is also an inverted index for search engines to accomplish rapid retrieval. The tag index, in essence, is a set of key-value pairs with a substring as the key, and the position of this substring in the original string as the value. In search engine technique, querying with inverted index is a commonly used approach. Compared to brute force with at least O(N) time complexity, inverted index method could transform a sequence tag to a hash code, which is related to an index entry and helps retrieve tag-containing sequences in O(1). As shown in Figure 2, exhaustively sequential substring matching can be avoided by using tag index, and search time of sequence tag can be reduced [7] [9] [10] [13] [18].
In order to facilitate the implementation of string matching algorithm, Tag-Graph constructs a database index structure by suffix array, in which the structured protein sequence information is called FM-indexed protein [18]. The other two search engines, Open-pFind and MODplus, adopt the scheme of k-length tag index, so they can find the corresponding peptide sequence information according to the k-length tags. In the index entries of Open-pFind, it only records the protein ID and starting position of tag in protein sequence to compress the memory space [7]. MODplus goes further in this direction. In MODplus, all protein sequences are concatenated into one linearized sequence delimited with Journal of Computer and Communications the character "$". Therefore, tag index of MODplus only needs to store tag's start position at the linearized sequence string, so as to further compress the memory space occupied by inverted index [13]. The offset addresses of tags in Open-pFind and MODplus are encoded by dictionary order.
Both Open-pFind and MODplus compress the index information as much as possible in order to save memory space and complete the rapid search of tags.
However, when retrieving sequences containing any of extracted tags, they all have unnecessary operations of amino acid traversal and mass calculation because of the design problem of tag index, which undoubtedly increases the cost of time. Nowadays, with the advances of computer hardware technology, memory usage of ordinary mass spectrometry data analysis is no longer a problem even for personal computer. Therefore, we should focus on how to retrieve the sequence more quickly.
Here, we introduce the new scheme named TIIP (Tag Index of Intact Proteins), a novel tag index design scheme. TIIP has a unique two-level index structure, which makes tag index and protein database cooperate effectively in search.
Based on the design of TIIP, we can rapidly retrieve all peptide sequences that satisfy user-specified parameters, and get the theoretical masses information of peptide sequences quite fast.

Analysis of Sequence Retrieval Approach
Index design in search engine will affect the workflow design and efficiency of peptide sequence retrieval. The open search strategy in Open-pFind is to retrieve all the peptide sequences containing at least one extracted tag by extending this tag sequences from its both ends. This strategy can be compatible with semi-specific and non-specific searches.
The search strategy of MODplus will change with user-specified parameters.
When searching for non-specific peptide sequences with MODplus, the search space of sequence is similar to the one considered by Open-pFind.
However, it should be noted that, as shown in Figure 3, it is very uneconomic to directly retrieve a large number of non-specific peptide sequences at the beginning  The design and implementation of index structure and search flow will affect each other, so considering the needs of search can help index design. If retrieving all peptide sequences within the mass tolerance, lots of non-specific peptide sequences would reduce the efficiency of identification. From another point of view, the difference between fully specific and non-specific sequence from one protein sequence is several terminal amino acids, so we could regard terminal mass shifts as special-mass modifications, whose masses equal to the cumulative masses of terminal amino acids. Based on that, we can consider retrieving only fully specific sequences and dividing the search flow into two stages. The first one is using tag index to obtain candidate fully specific peptide sequence set and score to filter, while the second one includes precursor mass difference calculation, semi-specific/non-specific sequence detection and other operations for candidate peptide sequences. In view of previous analysis, we propose that TIIP is designed as a scheme for fast retrieving full-specifically digested peptide sequences, and moving the semi-specific and non-specific detection to the later stage, transferring the identification pressure of semi-specific and non-specific digested peptide sequences from the sequence retrieval stage to the subsequent stage.

Tag Index of Intact Proteins
In a word, the tag index and protein database structure we designed can rapid retrieve peptide sequences and fast calculate the difference between monoisotopic masses of precursor ions and theoretical masses of peptide sequences.

Refinement of TIIP
After finishing the initial design of tag index, further refinement is required. As shown in Figure 5, we consider refining the format of sequence records as follows, including cleavage sites, theoretical masses etc.
Firstly, the information of cleavage sites is stored in protein database, and each protein sequence corresponds to a list of all cleavage sites. Among them, in order to acquire peptide sequences, we need to store the offset addresses, which include zero and length of protein sequence, into the lists of cleavage sites. This refinement makes checking missing sites number become convenient, because any two closest sites of one protein sequence are adjacent in one list so their offset addresses are also adjacent.
Secondly, according to the list of cleavage sites belonging to each protein sequence, we calculate the cumulative mass of amino acid residues between two adjacent cleavage sites, and similarly store this masses information as a list so that we can calculate theoretical sequence masses so fast. Obviously, the length of cumulative mass list is equal to the length of related cleavage sites list minus one.
Finally, we only record protein ID, and offset addresses of the cleavage sites at the N-terminus and C-terminus of a peptide, and these two sites are the closest two to the tag in a protein sequence. Each index entry points to offset addresses of cleavage sites pair in list, so this design makes the tag index and protein database form a two-level index.

Sequence Retrieval of TIIP
As shown in Figure 6, when retrieving all possible peptide sequences, a sequence tag can be used to immediately acquire the shortest full-specifically digested se-    Theoretically, compared with the tag index schemes in Open-pFind and MODplus, TIIP scheme avoids unnecessary enumeration of amino acids when retrieving sequences. If the number of missing cleavage sites is fixed, the time complexity of retrieving sequence can be reduced to O(1), and the corresponding memory space consumption is acceptable. The database in test with 20,350 human proteins was downloaded from Uni-Prot on March 29, 2020. From the downloaded database, we randomly chosen 10,000 non-redundant peptide sequences and generated corresponding simulated mass spectra as the test dataset. Figure 7 shows an example of simulated spectrum.

Baseline, Dataset and Parameters
Some parameter settings, software and hardware environment during the test are shown in Table 1 and Table 2.

Memory Space and Time Cost of Tag Index
In the environment and parameter settings mentioned above, we tested the space consumption of TIIP index design scheme and the time cost of index generation.
The memory space consumption of 5-tag index (index with 5-length tag) is acceptable, while index generation has fast speed. The details are shown in Figure   8 and Figure 9.

Time Cost of Peptide Sequences Retrieval
This test is in full enumeration method which means that all full-specifically digested sequences are retrieved. We compared the time consumption of full traversal (brute force method) and TIIP scheme (searching with inverted index) using 5-length sequence tags. The result details are shown in Figure 10. If using brute force method, the average time cost of only one spectrum is 2156.59 s. If we use TIIP scheme, the total time consumption for 10,000 spectra is only 60.73 s which includes the time (31.09 s) consumed in the process of database and index generation. The average time cost for a single spectrum is only 0.006 s. If the scale of spectra is larger, the database building time will be further diluted so that the average single spectrum time cost will be shorter.   It can be found that the search based on TIIP is actually very fast, which thanks to the enumeration of only full-specifically digested peptide sequences, and compared with brute force method, TIIP has about one million fold improvement in searching speed.

Conclusion
We designed TIIP scheme using two-level index structure with a tag hash table and a pre-digested protein database. When searching with TIIP, users can fast acquire candidate peptide sequences only restricted by the number of missed cleavage sites and open search mass window. Compared with the index design of Open-pFind and MODplus, TIIP scheme can avoid a large number of unnecessary enumeration of non-specifically digested peptide sequences, and thus save search time. TIIP scheme is more applicable to the cases of full-specifically digested peptides identification, while semi-specifically and non-specifically digested peptides are supported, too. For TIIP, after scoring and pruning, the candidate peptide sequences' number will be considerably reduced. Thus, the identification of semi-specific and non-specific peptide sequences only need to be determined by the flanking masses of hitting sequence tags. Therefore, the computational costs of finding semi-specific and non-specific peptides are reduced.
According to the current research progress, we will continue further development in tag index design for better performance of any protein search engines.