An Algorithm for Generation of Attack Signatures Based on Sequences Alignment

This paper presents a new algorithm for generating attack signatures based on sequence alignment. The algorithm is composed of two parts: a local alignment algorithm, GASBSLA (Generation of Attack Signatures Based on Sequence Local Alignment), and a multi-sequence alignment algorithm, TGMSA (Tri-stage Gradual Multi-Sequence Alignment). Inspired by sequence alignment as used in Bioinformatics, GASBSLA replaces global alignment and the constant weight penalty model with local alignment and the affine penalty model to improve the generality of attack signatures. TGMSA introduces a new pruning policy that makes signature generation less sensitive to noise. In this paper, GASBSLA and TGMSA are described in detail and validated by experiments.


Introduction
Network worms, viruses and malicious code remain the top threat to the Internet and to enterprise security, causing losses of hundreds of millions of dollars every year [1]. Intrusion detection based on attack signatures is currently the most effective countermeasure, but the continuous emergence of new types of attacks and of polymorphic engines such as PHolyP [2] poses a great challenge to existing intrusion detection technologies. To address this problem, automatic generation of attack signatures has attracted growing research attention and has been an active topic in intrusion detection since 2003 [3].
Algorithms for generating attack signatures fall into two categories: those based on string patterns and those based on semantics. The latter rely on prior semantic analysis of a particular type of attack, so they cannot automatically generate signatures for unknown attacks. Current research therefore focuses mainly on string-based algorithms, which include: algorithms based on the LCS (longest common substring), algorithms based on tokens (strings of more than one character that appear frequently in suspicious data) [4], algorithms based on sequence alignment, algorithms based on finite automata, and algorithms based on protocol fields and lengths [5].
Token-based algorithms are currently considered the most effective and widely accepted method. However, the authors of [3] point out that signatures generated by this kind of algorithm are not precise, and they propose an algorithm based on sequence alignment instead. In this paper, we present a new algorithm for generating attack signatures based on sequence alignment, obtained by analyzing the algorithms presented in [3] and drawing on sequence alignment as used in Bioinformatics. The algorithm is composed of two parts: the GASBSLA algorithm and the TGMSA algorithm. GASBSLA replaces global alignment and the constant weight penalty model with local alignment and the affine penalty model to improve the generality of attack signatures. TGMSA introduces a new pruning policy that makes signature generation less sensitive to noise.
The rest of the paper is organized as follows. Section 2 reviews related research: it describes the algorithms for generating attack signatures in [3] and analyzes their weaknesses. Section 3 presents the design of the GASBSLA and TGMSA algorithms together with their analysis. Section 4 reports experiments on the effectiveness and the noise resistance of the algorithms. Section 5 concludes the paper and outlines future work.

Related Research
Sequence alignment is divided into pair-wise alignment and multi-sequence alignment, and most multi-sequence alignment methods build on pair-wise alignment. This section first introduces and analyzes a pair-wise sequence alignment algorithm, CMENW (Contiguous-Matches Encouraging Needleman-Wunsch), and a multi-sequence alignment algorithm, HMSA (Hierarchical Multi-Sequence Alignment) [3]. They are the most representative sequence alignment algorithms applied to attack signature generation, and they are the foundation of this paper. We then introduce the most representative pair-wise local alignment algorithm, the Smith-Waterman algorithm [6].

CMENW Algorithm
The CMENW algorithm is a pair-wise alignment algorithm based on global alignment. It improves on the Needleman-Wunsch algorithm [7], the classical pair-wise alignment algorithm. The main difference between the two is that the Needleman-Wunsch algorithm easily produces fragments. To reduce the influence of fragments during alignment, the CMENW algorithm introduces a contiguous-matches encouraging function $enc(x)$, where $x$ is the number of contiguous bytes in the alignment, which encourages contiguous bytes to be aligned together. The score function of CMENW algorithm is as follows:

HMSA Algorithm
The HMSA algorithm is a hierarchical multi-sequence alignment algorithm built on the pair-wise CMENW algorithm and suited to attack signature generation. It has three main features [3]: (1) hierarchical pair-wise alignment; (2) support for wildcard characters; (3) a pruning function.
The pruning function accelerates the convergence of HMSA and enhances its noise resistance. However, its effectiveness rests on two assumptions: (1) the alignment result of any two noise sequences will be pruned because it is a trivial solution; (2) the alignment result of any two samples will not be pruned and yields a precise attack signature. In reality, the alignment result of two noise sequences may not be pruned, because the input sequences of a signature generation algorithm have often been processed by clustering algorithms and may therefore resemble one another. When such an unpruned noise alignment is then aligned with an alignment of samples, the result easily degenerates into a trivial solution and is pruned, so that in the end no result is returned.

Smith-Waterman Algorithm
The Smith-Waterman algorithm is a pair-wise local alignment algorithm proposed by Smith and Waterman in 1981. It finds and compares regions of local similarity while taking the whole of both sequences into account, and it is still a common basic algorithm in bioinformatics today. Given sequences x and y as inputs, the Smith-Waterman algorithm outputs a local alignment that is globally optimal, that is, one whose similarity value is maximal according to formula (2). The symbols in this formula have the same meanings as those in formula (1) in Section 2.1.
The Smith-Waterman algorithm finds the largest similarity value and the best alignment by dynamic programming. Its process includes two major steps: 1. Calculate the similarity values of the two given sequences to obtain a similarity matrix; 2. Obtain the best alignment by backtracking through the similarity matrix obtained in step 1.
The Smith-Waterman algorithm improves on the Needleman-Wunsch algorithm. The main differences are that the Smith-Waterman algorithm replaces all negative values in the similarity matrix with 0, and that backtracking stops and the result is output once the similarity value no longer increases as the alignment grows longer. Given these differences, the idea behind the Smith-Waterman algorithm can help the CMENW algorithm overcome its problem of insufficient generalization.
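The two steps of the Smith-Waterman algorithm (filling the similarity matrix, then backtracking from its maximum) can be illustrated with a minimal sketch. This is the common textbook form with a constant gap penalty, not necessarily formula (2) of this paper; the function name and the scoring values `match`, `mismatch` and `gap` are illustrative assumptions.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_x, aligned_y) for one best local alignment.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    m, n = len(x), len(y)
    # Step 1: similarity matrix; negative values are replaced by 0.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,     # gap in y
                          H[i][j - 1] + gap)     # gap in x
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 2: backtrack from the maximum until a zero cell is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Hypothetical inputs: only the shared region "XYZ" is aligned.
print(smith_waterman("abcXYZdef", "qqXYZrr"))
```

Because cells never drop below zero, unrelated prefixes and suffixes do not drag the score down, which is the property that makes local alignment attractive when only part of each sequence is shared.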

GASBSLA Algorithm and TGMSA Algorithm
Based on the analysis of the CMENW and HMSA algorithms, we present a new algorithm for generating attack signatures based on sequence alignment. The algorithm is composed of two parts: a local alignment algorithm, GASBSLA (Generation of Attack Signatures Based on Sequence Local Alignment), and a multi-sequence alignment algorithm, TGMSA (Tri-stage Gradual Multi-Sequence Alignment).

GASBSLA Algorithm
In Bioinformatics, local alignment has more practical significance than global alignment because two sequences are often highly similar only in some local regions [8]. For example, two long DNA sequences are often related to each other only in a few regions (coding regions), and proteins belonging to different families often share regions with the same structure and function. The situation in attack signature generation is very similar, so the GASBSLA algorithm replaces global alignment with local alignment to improve the generality and precision of attack signatures when only a small sample is available. In addition, to further reduce the number of fragments, the GASBSLA algorithm replaces the constant weight penalty model with the affine penalty model [9].
The differences between the affine penalty model and the constant weight penalty model are as follows. In the constant weight penalty model the penalty for each gap is independent: the penalty for one gap is always $d$, and the penalty for $n$ gaps is $nd$. In the affine penalty model, the penalty for $n$ contiguous gaps is $q + (n-1)r$, where $q$ is the penalty for the first of the $n$ contiguous gaps, $r$ is the penalty for each of the remaining gaps, and $r \ll q$. Because the first gap is penalized more heavily than the subsequent ones, the affine penalty model reduces the number of isolated gaps and fragments in the attack signatures.
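A small numeric sketch makes the difference between the two models concrete. It assumes, following the description of $q$ and $r$ above, that a run of $n$ contiguous gaps costs $q + (n-1)r$ under the affine model; the function names and the values of `d`, `q` and `r` are illustrative assumptions.

```python
def constant_gap_penalty(n, d):
    """Constant weight model: every gap costs d, regardless of its neighbours."""
    return n * d


def affine_gap_penalty(n, q, r):
    """Affine model: the first gap of a contiguous run costs q, each further gap r."""
    return 0 if n == 0 else q + (n - 1) * r


# Hypothetical values with r much smaller than q.
d, q, r = 4, 4, 1

# Six gaps as one contiguous run versus six isolated single gaps.
one_run_constant = constant_gap_penalty(6, d)                         # 24
isolated_constant = sum(constant_gap_penalty(1, d) for _ in range(6))  # 24
one_run_affine = affine_gap_penalty(6, q, r)                          # 9
isolated_affine = sum(affine_gap_penalty(1, q, r) for _ in range(6))   # 24

print(one_run_constant, isolated_constant, one_run_affine, isolated_affine)
```

Under the constant model the two layouts cost the same, so the aligner has no reason to prefer one long gap. Under the affine model the single run is much cheaper, which pushes the alignment toward fewer, longer gaps and hence fewer fragments.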
The general idea of the GASBSLA algorithm, which is based on dynamic programming, is as follows. First, calculate the similarity values of the two sequences and store them in a matrix (called the similarity matrix or DP matrix). Second, use dynamic programming backtracking to find the optimal alignment on the basis of the DP matrix. Both the time complexity and the space complexity of the GASBSLA algorithm are $O(mn)$, where $m$ and $n$ are the lengths of the two sequences.
$\sigma(x, y)$ is the similarity value of the alignment of $x$ and $y$, where $x$ and $y$ are any two characters.
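The combination of local alignment with an affine gap penalty can be sketched with the standard Gotoh-style recurrences. This is a hedged illustration of the general technique, not the authors' Algorithm 1: it computes only the best similarity value (no backtracking), it omits any contiguous-match encouragement, and the function name and the values of `match`, `mismatch`, `q` and `r` are assumptions.

```python
def local_affine_score(x, y, match=2, mismatch=-1, q=4, r=1):
    """Best local alignment score with affine gaps (q opens a run, r extends it).

    Uses three (m+1) x (n+1) matrices, so time and space are both O(mn).
    """
    m, n = len(x), len(y)
    neg = float("-inf")
    H = [[0] * (n + 1) for _ in range(m + 1)]    # best score ending at (i, j)
    E = [[neg] * (n + 1) for _ in range(m + 1)]  # alignment ends with a gap in x
    F = [[neg] * (n + 1) for _ in range(m + 1)]  # alignment ends with a gap in y
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Either open a new gap run (cost q) or extend the current one (cost r).
            E[i][j] = max(H[i][j - 1] - q, E[i][j - 1] - r)
            F[i][j] = max(H[i - 1][j] - q, F[i - 1][j] - r)
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            # Local alignment: never let a cell drop below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + sigma, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Hypothetical inputs: two shared blocks separated by two inserted bytes.
print(local_affine_score("ABCDEFGH", "ABCDxxEFGH"))
```

With these illustrative values, bridging the two inserted bytes costs 4 + 1 = 5, so joining both blocks scores 16 - 5 = 11, which beats keeping a single block (8). A constant penalty of 4 per gap would cost 8 and give no advantage over a single block.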

TGMSA Algorithm
The TGMSA algorithm introduces a new pruning policy to avoid the situation in which no signature is output because the alignment of two noise sequences was not pruned. The general idea is to modify the pruning policy in the nth (n>1) layer of alignment according to the alignment similarity value.
If the alignment similarity value is below the threshold (that is, it falls outside the confidence interval), the alignment result is not pruned; instead, the two sequences are set aside and later each aligned with the signature sequence, which is the alignment result of the other sequences. If the resulting alignment does not meet the pruning conditions, it replaces the original signature sequence; otherwise it is discarded.
Algorithm 2. TGMSA algorithm
Input: sequence set S
Output: multi-sequence alignment result
Second-stage test on an alignment result Ali: Ali falls in the confidence interval (the calculation of the similarity value confidence interval is specified in Section 3.3), Ali ≥ 3 [10], and there exist at least two fragments whose length ≥ 3 [11,12].

The Selection of Alignment Similarity Confidence Interval
The central limit theorem states that, whatever distribution the population follows, the distribution of the sample mean is close to a normal distribution whose mean equals the population mean and whose variance equals the population variance divided by the sample size. Therefore, we can estimate the average signature alignment similarity for a given attack by the average of the sampled similarity values. We use all the alignment similarity values calculated in the first stage as a sample to calculate the similarity value confidence interval, which is the pruning criterion in the second stage. Assume $(F_1, F_2, \ldots, F_n)$ is a sample of the alignment similarity value population $F$, so the sample mean and sample standard deviation are as follows:

$$\bar{F} = \frac{1}{n}\sum_{i=1}^{n} F_i, \qquad S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(F_i - \bar{F}\right)^2}$$

According to the small probability event principle of the normal distribution, most of the data of a normal population (99.7%) fall in the range $\mu \pm 3\sigma$, and cases outside this range are called small probability events. In statistics, small probability events are regarded as practically impossible and can be ignored. The confidence interval of the alignment similarity value is as follows:

$$[\mu - 3\sigma,\ \mu + 3\sigma]$$

That is:

$$[\bar{F} - 3S,\ \bar{F} + 3S]$$
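The interval computation can be sketched in a few lines. This sketch assumes that the interval is the sample mean plus or minus three sample standard deviations (with the usual n-1 denominator), following the three-sigma argument above; the function names and the similarity values are hypothetical.

```python
import statistics


def similarity_confidence_interval(values):
    """Return (low, high) = mean -/+ 3 * sample standard deviation.

    Assumption: the n-1 (sample) standard deviation is used.
    Requires at least two similarity values.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return mean - 3 * std, mean + 3 * std


def in_confidence_interval(value, interval):
    """True if a similarity value lies inside the closed interval."""
    low, high = interval
    return low <= value <= high


# Hypothetical first-stage similarity values.
first_stage = [10, 12, 14]
interval = similarity_confidence_interval(first_stage)
print(interval)
print(in_confidence_interval(13, interval), in_confidence_interval(5, interval))
```

A second-stage alignment whose similarity value falls outside this interval would then be treated as suspicious (for example, as possibly involving noise) rather than accepted directly.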

Experimental Results
In this section we verify the effectiveness and the noise resistance of the algorithms experimentally. The CMENW and HMSA algorithms were also implemented so that the improvements introduced by the GASBSLA and TGMSA algorithms can be verified by direct comparison.

Experimental Environment
Hardware environment: Dawning Server (Intel® Xeon® CPU, 4 GB memory). Software environment: Red Hat Linux 9.0 operating system (kernel version 2.4.20-8).

Algorithm Validity Verification
For the purpose of comparison, we adopted the same experimental method as [3]. We generate signatures for polymorphic versions of four real-world exploits: Apache-Knacker [13], CodeRed Ⅱ [14], IISPrinter [15] and TSIG [16]. The Apache-Knacker, CodeRed Ⅱ and IISPrinter exploits use the text-based HTTP protocol, while the TSIG exploit uses the binary-based DNS protocol. We use a polymorphic engine to generate 150 samples for each exploit: 50 samples used to generate signatures and 100 samples used to detect false negatives. To simulate an ideal polymorphic engine, we fill the wildcard and code bytes of each exploit with values chosen uniformly at random. In addition, we select 10,000 attack-free data samples from the MIT Lincoln Laboratory intrusion detection evaluation data set DARPA99 (the third week data sets) [17] to detect false positives.
In our experiments, we set the matching score [...].
Table 1 and Table 2 show the signatures of the four exploits introduced above, as generated by the CMENW and HMSA algorithms and by the GASBSLA and TGMSA algorithms. The two tables also give the false positive and false negative rates of the detection experiments using these signatures. Comparing Table 1 and Table 2 shows that the signatures generated by the GASBSLA and TGMSA algorithms have better generality and perform better when used to detect polymorphic attacks. Taking TSIG as an example, the reason is as follows: signatures generated by the CMENW and HMSA algorithms include exact position relations, but polymorphic attacks are effective attack code processed by a polymorphic mechanism (methods that insert useless code into the effective attack code), and the length of the useless code is variable. This leads to false negatives when signatures generated by the CMENW and HMSA algorithms are used to detect polymorphic attacks. For the Apache-Knacker exploit, the effective attack code contains a distance restriction, so the false negatives of the CMENW and HMSA algorithms are zero, while the false positives of the GASBSLA and TGMSA algorithms are 0.08. Nowadays, however, most polymorphic attacks contain no distance restriction in their effective attack code.

Noise Resisting Ability Verification
We used the same experimental method as for the HMSA algorithm: the noise resistance is tested using the CodeRed Ⅱ exploit and the IISPrinter exploit as attack samples and compared with that of the HMSA algorithm. The sample set contains 20 samples for each attack, and the number of noise sequences included in the sample set is increased gradually to observe how often the HMSA and TGMSA algorithms generate signatures and how often they generate precise signatures. Generating a precise signature means that both the false negatives and the false positives obtained from the sample set are zero.
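The notion of a precise signature can be illustrated with a short sketch. It assumes a simplified signature model, an ordered list of byte fragments that must all occur in a payload in that order with arbitrary bytes in between; this model, the function names and the sample payloads are hypothetical and are not taken from the experiments.

```python
def matches(signature, payload):
    """True if every fragment occurs in payload, in order and without overlap."""
    pos = 0
    for fragment in signature:
        idx = payload.find(fragment, pos)
        if idx < 0:
            return False
        pos = idx + len(fragment)
    return True


def error_rates(signature, attacks, benign):
    """Return (false_negative_rate, false_positive_rate) on the two sample sets."""
    fn = sum(not matches(signature, p) for p in attacks) / len(attacks)
    fp = sum(matches(signature, p) for p in benign) / len(benign)
    return fn, fp


def is_precise(signature, attacks, benign):
    """A signature is precise when it has no false negatives and no false positives."""
    return error_rates(signature, attacks, benign) == (0.0, 0.0)


# Hypothetical payloads, for illustration only.
attacks = [b"GET /a.ida?XXXX%u9090", b"GET /zz.ida?N%u9090%u6858"]
benign = [b"GET /index.html HTTP/1.0", b"POST /x.ida?"]

print(is_precise([b"GET ", b".ida?", b"%u9090"], attacks, benign))
print(error_rates([b"GET "], attacks, benign))
```

The second call shows why an overly general signature is not precise: a single short fragment matches every attack but also matches benign traffic.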
The results in Figure 1 and Figure 2 show that when the SNR is below or equal to one, both the HMSA and TGMSA algorithms possess strong noise immunity, but when the SNR is more than one, the noise resistance of the TGMSA algorithm is better than that of the HMSA algorithm. The reason is that when the SNR is more than one, there must be cases in which two noise sequences are aligned with each other. The HMSA algorithm assumes that any alignment of two noise sequences will be pruned, but in practice this assumption often cannot be met, which leads to no result or no precise result when the alignment of noise sequences is aligned with the alignment of samples. The TGMSA algorithm improves the pruning policy to address this, and the experimental results show that our improvement enhances the noise resistance of the algorithm.
Algorithm 1. GASBSLA algorithm
Input: sequences a and b
Output: the similarity value and optimal sequence alignment of a and b

Figure 1. Rates of generating signatures and generating precise signatures for CodeRed Ⅱ exploit attack in different SNR
Figure 2.
"Catch me, if you can: Evading network signatures with web-based polymorphic worms," Boston, MA: 2007.
[3] Y. Tang, X. C. Lu, et al., "An automatic generation of attack signatures based on multi-sequence alignment [J],"
single sequence from W in order, then align it with the alignment result Ali of the second stage to generate a new alignment result Ali'