Generating Rule-Based Signatures for Detecting Polymorphic Variants Using Data Mining and Sequence Alignment Approaches

Antiviral software systems (AVSs) have problems in detecting polymorphic variants of viruses without specific signatures for such variants. Previous alignment-based approaches for automatic signature extraction have shown how signatures can be generated from consensuses found in polymorphic variant code. Such sequence alignment approaches required variable length viral code to be extended through gap insertions into much longer equal length code for signature extraction through data mining of consensuses. Non-nested generalized exemplars (NNge) are used in this paper in an attempt to further improve the automatic detection of polymorphic variants. The important contribution of this paper is to compare a variable length data mining technique using viral source code to the previously used equal length data mining technique obtained through sequence alignment. This comparison was achieved by conducting three different experiments (i.e. Experiments I-III). Although Experiments I and II generated unique and effective syntactic signatures, Experiment III generated the most effective signatures with an average detection rate of over 93%. The implications are that future, syntactic-based smart AVSs may be able to generate effective signatures automatically from malware code by adopting data mining and alignment techniques to cover for both known and unknown polymorphic variants and without the need for semantic (run-time) analysis.


Introduction
Computer worms and viruses continue to grow despite improved intrusion de-
The significance of this paper is to continue a purely syntactic exploration of the possibility of generating signatures automatically from malware source code without the need for semantic analysis. Syntactic techniques for signature extraction based on structural detection of malware are relatively unexplored in comparison to semantic techniques (i.e. techniques based on analyzing the execution behavior of malware). The primary benefit of a syntactic or structural technique is that new and previously unknown variants can be generated from the extracted syntactic or structural rules of existing variants (see [13] for more detail). For a semantic approach, an actual variant instance is required so that it can be run to create an execution trace. This execution trace can be compared with other execution traces from previous instances to determine whether a new signature is required and, if so, how effective that signature is in detecting the family of which this instance is a variant. For a syntactic approach, on the other hand, the set of actual instances found so far is a subset of the possible instances of the language derivable using a grammar. The effectiveness of signatures can be determined by generating numerous possible instances even if they have not occurred.
Previous work used sequence alignment to extract consensuses (the calculated order of the most frequent symbols found in each position) from malware code variants for the purpose of generating the minimum possible number of signatures for detecting those variants and previously unseen variants. However, no attempt was made to exploit a by-product of the alignment for data mining purposes, namely the equal length malware code of the variants that the alignment outputs.
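As a small illustration of the consensus idea described above, the following sketch takes the most frequent symbol in each position of a set of equal length strings. The function name and the sample strings are hypothetical, and the sketch ignores gaps and substitution scores, so it should be read as a simplified picture rather than the procedure used in the cited work.

```python
from collections import Counter


def consensus(sequences):
    """Return the most frequent symbol at each position of equal length sequences."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    result = []
    for column in zip(*sequences):
        # most_common(1) returns the (symbol, count) pair with the highest count;
        # ties are resolved in favour of the symbol seen first in the column.
        symbol, _count = Counter(column).most_common(1)[0]
        result.append(symbol)
    return "".join(result)


# Hypothetical hex fragments standing in for aligned variant code.
variants = ["4a6f3b", "4a6e3b", "4a6f3c"]
print(consensus(variants))  # 4a6f3b
```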
Our task in this paper is to compare signatures produced from the outcomes of data mining the variable length malware code before alignment with the outcomes of data mining the equal length malware code after alignment to determine which method produces better signatures automatically.
Malware is typically a script or program written first in a high-level language (e.g. C, Java) and then compiled into hex code. The source code will contain instructions for the infector part (how to spread), the payload part (what action to take) and methods for encryption/decryption to hide the malware intent. The infector part also usually contains instructions on how to change the code so that new variants are produced on infection. This leads to many "variants" of the same family where the infector and payload are the same but differently coded.
The run-time behavior of the variant is used by human experts to generate signatures (short strings of hex code) for storage in libraries of AVSs to scan incoming packets and the contents of memory to detect the variant and its family.
One of the main problems for AVSs is that polymorphic techniques that change the order of the malware code can evade signatures that assume a constant left-to-right ordering in malware code variants. As will be seen below, some very old and well-known viruses still evade modern AVSs because their variants adopt simple code sequence changes that cannot be detected by the latest signatures.
The task of a syntactic learning system for signature generation of polymorphic malware using hex code only (i.e. no execution traces) is specified below (see Figure 1):
a) From the code of a set of seen variants P_s, automatically generate signatures to identify and detect unseen variants P_u, where P_s and P_u form currently known variants P_k.
b) From the code of a set of known variants P_k, automatically generate signatures to identify and detect unknown variants P_x for cross-validation. In this case, P_x are code variants that have not been seen before for either training or testing purposes.
The learning task is to maximize true positive rates, and minimize false positive and false negative rates in both cases above. As will be seen below, previous work has addressed a) through sequence alignment techniques that use insertion operations as well as substitution matrices for matching malware code. It is currently not known whether matching techniques that work well for a) will continue to work well for b), or whether data mining techniques that look for patterns in underlying structure are required to allow generalization to unknown variants.

Figure 1. Our method comprising eight steps. (Diagram labels: Unknown Polymorphic Variants P_X; Known Polymorphic Variants P_K; Unseen Polymorphic Variants P_U; Seen Polymorphic Variants P_S.)
Roadmap: Section 2 and Section 3 discuss the background and limitations of previous work. Section 4 discusses previous related work relevant to this paper.

Background
A key development in syntactic approaches has been the adoption of string-based algorithms in bioinformatics for identifying structural matches in malware code. Such algorithms do not just look for the presence or absence of characters in specific positions but also manipulate the strings to allow for insertion of characters to expand the number of matching characters. Importantly, the results of such string manipulation are a set of equal length strings from an initial set of variable length strings. Earlier work [21] has demonstrated that string matching and sequence alignment algorithms taken from bioinformatics perform best with biologically represented strings (DNA or protein) rather than non-biological character sets, possibly due to being optimized for chemistry-based mutations between characters. We follow previous approaches in transforming malware code to an appropriate biological string representation before sequence alignment, with transformation of consensuses (i.e., those parts of the malware strings that are common) back to hexadecimal (hex) code for signature generation (see [21] for more detail).
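The round trip between hex code and a DNA representation can be pictured with the sketch below. The mapping used here (two bits per base, so each hex digit becomes two bases, with 00, 01, 10, 11 mapped to A, C, G, T) is an assumption chosen for illustration; the actual transformation is the one described in [21], and the function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def hex_to_dna(hex_string):
    """Encode each hex digit (4 bits) as two bases (2 bits each)."""
    dna = []
    for digit in hex_string.lower():
        value = int(digit, 16)                   # raises ValueError on non-hex input
        dna.append(NUCLEOTIDES[value >> 2])      # high two bits
        dna.append(NUCLEOTIDES[value & 0b11])    # low two bits
    return "".join(dna)


def dna_to_hex(dna_string):
    """Decode pairs of bases back into hex digits (gap symbols are not handled)."""
    if len(dna_string) % 2:
        raise ValueError("DNA string must contain an even number of bases")
    digits = []
    for i in range(0, len(dna_string), 2):
        high = NUCLEOTIDES.index(dna_string[i])
        low = NUCLEOTIDES.index(dna_string[i + 1])
        digits.append(format((high << 2) | low, "x"))
    return "".join(digits)


print(hex_to_dna("13ad3"))       # ACATGGTCAT
print(dna_to_hex("ACATGGTCAT"))  # 13ad3
```

A fixed-width mapping like this keeps the conversion reversible, which is what allows a consensus found in DNA form to be turned back into a hex signature.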
A sequence-based approach to signature extraction was previously proposed and illustrated utilizing the Smith-Waterman algorithm (SWA) without gap penalties [10]. The method adopted in [10] was further fine-tuned [11] by selecting SWA with six different substitution matrices. Results demonstrated that it was possible to extract signatures for pairs of malware strings and meta-signatures for a family of malware after applying data mining rule-extraction methods (PRISM [22]) to the extracted signatures. That is, underlying patterns were mined after two rounds of matching: the first round dealt with pairwise matching of variable length malware variants to produce level 1 consensuses, where each consensus was of different length to other consensuses; the second round dealt with multiple sequence alignment of these variable length level 1 consensuses to produce level 2 equal length consensuses, or signatures. The two-stage alignment process (pairwise followed by multiple) was required because of the computational difficulty in running multiple sequence alignment directly on strings of vastly different lengths, as malware variants of a family tend to be. The initial pairwise alignment allowed pairwise recurring similarities to be first identified in consensuses before these consensuses were themselves multiply aligned to produce level 2 consensuses (signatures). These level 2 consensuses of equal length were then mined using PRISM to find underlying patterns, resulting in meta-signatures. Another relevant enhancement in syntactic methods was also recently published [12]. Two different dynamic programming techniques, namely, Needleman-Wunsch and SWA, were explored for matching purposes, and it was found that SWA gave the best results with 100% of unseen P_u variants in the test set P_k being detected. Recent work [13] adopted ten different combinations of gap open and gap extend penalties in conjunction with dynamic programming. It was found that changes in these parameters helped to generate effective signatures for detecting unseen P_u (test set P_k) polymorphic variants.
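For readers unfamiliar with local alignment, the sketch below shows the core of a Smith-Waterman style dynamic program. It is a simplified illustration: the match, mismatch and linear gap scores are assumed values, whereas the cited work used substitution matrices and, in [13], separate gap open and gap extend penalties. The function name and the sample sequences are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")   # gap inserted in b
            i -= 1
        else:
            out_a.append("-")   # gap inserted in a
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GGACGTCC", "TTACGTAA"))  # (8, 'ACGT', 'ACGT')
```

The two returned strings always have equal length, which is the property the two-stage process relies on when turning variable length input into equal length consensuses.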

Limitations of Previous Work
Previous work using a sequence alignment approach [10] [11] [12] [13] had two limitations. First, the string matching search using the SWA found only the most optimally-conserved meta-signatures using left-to-right matching techniques. It was not known how successful these meta-signatures would be when used against unknown P_x variants where code has been moved and restructured (i.e. case b) above), thereby reducing the number of left-to-right matches. A rule-based or top-down approach that tries to find underlying patterns may overcome the limitation of signatures generated in left-to-right order, thereby reducing or nullifying the false positive and false negative rates [23] [24]. Rule-based signatures obtained in this way might potentially capture knowledge which makes the identification and detection of unknown P_x variants possible. Thus, the rule-based NNge approach (more details in Section 5) is explored in this research and detailed in this paper.
A second limitation, as noted above, was that the alignment using SWA was "pairwise" and only allowed alignment of two viral sequences at a time in the first round of alignment. Multiple sequence alignment was then used on all pairwise consensuses to generate equal length sequences for rule-based data mining using PRISM. However, in the first round, only those regions of similarity in the

Related Work
The main body of research over the last fifteen years has concentrated on malware detection adopting semantic-based approaches, with only a few adopting syntactic-based approaches. A list of approaches to automatic signature generation is presented in Table 1. Practically all previous approaches deal with only a restricted set of variants belonging to the same malware family and it is currently not known how generalizable these approaches are for detecting other variants of the same family, either unseen (P_u) or unknown (P_x). In our approach, new P_u and previously unknown P_x structural variants belonging to the JS.Cassandra polymorphic viral family are provided by one of the most respected grey hat hackers.
Some other related and selected previous work that primarily focuses on malware detection using data mining and bioinformatics approaches is shown in Table 2. Very little research has been undertaken using data mining and bioinformatics approaches for the detection of a polymorphic virus and its unseen P_u variants, let alone its unknown P_x variants. The syntactic approach most closely related [44] adds nothing new to what was published by Chen et al. in 2012 [16], and replicates the structural sequence alignment and data mining approaches adopted in that paper and subsequently refined by [10] [11] [12] [13].

Table 1. Related research to the automatic signature generation in malware detection.
Researchers/Application | Type of Malware | Type of Approach | Description
Wespi et al. [25] | Intrusions | Semantic | Variable length patterns from training data consisting of system call traces of commands under normal execution were analyzed by a sequence-based algorithm called Teiresias for intrusion detection.
Polygraph [29] | Polymorphic worms | Syntactic | Generates an array of tokens, a subsequence of tokens and Bayes signatures based on probabilistic methods to detect polymorphic worms.
Nemean [30] | Worms | Semantic | Focus on generating signatures that defend against worms.
PAYL [31] | Worms | Semantic | Produces subsequence signature tokens that associate ingress/egress payload notifications to detect the initial replication of worms.
Hamsa [32] | Polymorphic worms | Semantic | Produces a set of signature tokens that can deal with polymorphic worms by investigating their invariant activity.
ShieldGen [33] | Worms | Semantic | Generates network signatures for unseen vulnerabilities (worms) that are based on protocol-aware for instance.
AutoRE [34] | Botnets | Semantic | Produces a spam signature creation architecture from spam emails that use botnets to detect them.
Coull and Szymanski [35] | Masquerade | Semantic | Sequence alignment was used to identify masquerade detection by comparing "audit data" with legitimate user signatures extracted from their actual command line entries.
Scheirer et al. [36] | Polymorphic worms | Syntactic and Semantic | Detection of many polymorphic worms and uses intrusion detection techniques such as sliding window schemes and instruction semantics.
Wurzinger et al. [37] | Botnets | Semantic | Detects botnets that are under the influence of botmaster (malicious body) using network signatures by examining the response from a compromised host to a received command and by generating detection models.
Botzilla [38] | Malware binaries | Semantic | Produces signatures for the malicious activities (traffic) created by a malware binary executed several times within a controlled domain.
 | | Semantic | Generates signatures through an inverse transcoding method by converting the malware sequential information, such as system call sequences, propagation dataflow, etc. into amino acid sequences and then aligning them using multiple sequence alignment tool.
ProVex [40] | Botnets | Semantic | Generates signatures to detect botnets that use encrypted command and control (C & C) systems after being given the keys and decryption routine employed by the malware be derived using binary code reuse strategy.
FIRMA [41] | Botnets | Semantic | Detects C & C systems but does not produce signatures for those.
Ki et al. [42] | Worms, Trojans, etc. | Semantic | Generates sequences that are typical API call sequence motifs of malicious activities belonging to several malware samples and employed multiple sequence alignment tool to align those malware samples to extract signatures.
MalGene [43] | Evasive malware samples | Semantic | Uses sequence alignment techniques on two sequences of system call events belonging to two different analysis environments: one environment in which the malware evades the AVS, and the other in which it exhibits the malicious activities. These events are used to construct an "evasion signature" using sequence alignment.
Previous use of sequence alignment and data mining has for the most part been semantic in nature, depending on system behavior patterns or using n-grams of bytes instead of code or structural patterns for the detection of malware. Also, because of their semantic nature, the generalizability of the results to new P_u variants generated through polymorphism is unknown. A purely syntactic-oriented approach, on the other hand, is based on the intuition that most new P_u (polymorphic) variants are simple syntactic variations of existing versions. The complicating aspect is variable length variations. The "expressive power" of signatures can be estimated by detecting how well these signatures generalize to unseen P_u and unknown P_x variants of the same family, all obtained through polymorphic (structural) alterations to the code. The benefit of a syntactic approach is that no semantics is needed. More importantly, as will be shown below, the number of malware training instances required to extract signatures for use against unseen P_u test instances is exceptionally small given the sequence alignment and data mining approaches adopted in the experiments.

Data Mining
Previous work [11]
As an instance of a polymorphic string-based technique, consider the structurally-related set of sentences [11]:
The cat saw the mouse (Class 1)
The mouse was seen by the cat (Class 2)
We see that the cat saw the mouse (Class 1)
We see that the mouse was seen by the cat (Class 2)
PRISM and NNge were applied on the four structurally-related set of se-
Class 1: the we cat see saw that the cat mouse saw the mouse
Class 2: the we mouse see was that the seen mouse by the was cat seen by the cat
The results on this example string set show that NNge generated rules with 100% accuracy, compared with 75% accuracy for PRISM. One of the aims of this paper is to determine whether this result is generalizable to many more instances of strings (variants) belonging to different classes (families).
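To make the position-based view of the four example sentences concrete, the sketch below pads each sentence to the length of the longest one and collects, per class, the set of words observed in each position. The padding letter "x" follows the equal length conversion used in the paper; the helper names are hypothetical, and the per-class union is only an illustration of what a rule with internal disjunction looks like, not an implementation of NNge or PRISM.

```python
PAD = "x"  # padding letter, as used for the equal length conversion in the paper


def pad_tokens(tokens, width, pad=PAD):
    """Extend a token list with the padding letter until it has the given width."""
    return tokens + [pad] * (width - len(tokens))


def position_value_sets(labelled_sentences):
    """Map each class label to a list of sets: the words seen at each position."""
    tokenised = [(text.lower().split(), label) for text, label in labelled_sentences]
    width = max(len(tokens) for tokens, _ in tokenised)
    rules = {}
    for tokens, label in tokenised:
        sets = rules.setdefault(label, [set() for _ in range(width)])
        for position, token in enumerate(pad_tokens(tokens, width)):
            sets[position].add(token)
    return rules


sentences = [
    ("The cat saw the mouse", 1),
    ("The mouse was seen by the cat", 2),
    ("We see that the cat saw the mouse", 1),
    ("We see that the mouse was seen by the cat", 2),
]
rules = position_value_sets(sentences)
for label in sorted(rules):
    print(label, [sorted(values) for values in rules[label]])
```

Positions are zero-based in the code, so `rules[1][4]` corresponds to pos5 for Class 1.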
NNge, first introduced by Martin (1995), is a nearest neighbor algorithm and an expansion of Nge [46], which generalizes by merging exemplars [47] and forming hyperrectangles in feature space that represent conjunction rules (if-then rules) with internal disjunction. The learning is incremental; each example is first classified and then generalized by joining the example to its nearest neighbor, either a single instance or a hyperrectangle, in the same class. Each hyperrectangle is converted into a production rule. When a hyperrectangle covers just one instance it is regarded as a non-generalized exemplar [48]. An instance of a hyperrectangle is shown below [49]:
This hyperrectangle covers strings "42210b" and "62231c" but not "3118b", for instance. Within the NNge algorithm [49]
In Equation (1), min
The constant w_k signifies weights corresponding to attributes and can be regulated throughout the training procedure [46] or can be assigned to mutual information [48] [50].
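The coverage test for a hyperrectangle can be sketched as follows, using the four-attribute class B rule given in the paper (p1 in {2, 4, 6}, p2 = 22, 9 ≤ p3 ≤ 32, p4 in {b, c}). How the example strings are split into four attribute values is an assumption made here for illustration, as are the function name and the choice to represent numeric sides as (low, high) tuples and categorical sides as sets.

```python
def covers(hyperrectangle, instance):
    """Return True if every attribute value falls inside the matching side."""
    if len(hyperrectangle) != len(instance):
        raise ValueError("instance and hyperrectangle must have the same arity")
    for side, value in zip(hyperrectangle, instance):
        if isinstance(side, tuple):       # numeric side: closed interval (low, high)
            low, high = side
            if not (low <= value <= high):
                return False
        elif value not in side:           # categorical side: set of allowed values
            return False
    return True


# class B if p1 = (2 or 4 or 6) AND p2 = (22) AND (p3 >= 9 AND p3 <= 32) AND p4 = (b or c)
class_b = [{2, 4, 6}, {22}, (9, 32), {"b", "c"}]

# Assumed split of the example strings into (p1, p2, p3, p4).
print(covers(class_b, (4, 22, 10, "b")))  # "42210b" -> True
print(covers(class_b, (6, 22, 31, "c")))  # "62231c" -> True
print(covers(class_b, (3, 11, 8, "b")))   # "3118b"  -> False
```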
The adjustment stage is implemented when a previously created hyperrectangle covers an instance associated with a different class. To circumvent the creation of nested hyperrectangles, NNge regulates the current hyperrectangle so that the inconsistent instance is eliminated. This is accomplished by splitting the hyperrectangle into two or more hyperrectangles and potentially into a few isolated variants/instances. The generalization stage comprises modifying the "border" of the nearest hyperrectangle possessing the same class as the training case in order to cover it. The extension is obtained only when the newly split hyperrectangle does not overlap with hyperrectangles possessing a separate class. If the overlap is detected the training case is included in the model as a non-generalized exemplar [48].
The experiments are intended to check whether data mining using DNA code produces better results than using hex code. Once viral code is converted to DNA code, sequence alignment using publicly validated and provably tested alignment software becomes possible.
Also, in the experiments below, "padding" was required to convert variable length viral strings into equal length strings for two of the experiments (Experiments I and II).For example, given hex strings "13ad3" and "245335623f", pad-

Systems and Methods
Methods Overview (Experiments I-III): The method in Experiment I consists of six steps, summarized as follows.
Step-1 deals with virus code variant generation P_k and separating the training set P_s from the test set P_u.
Step-2 deals with the process of variable length data mining on a small percentage of the training P_s and test P_u sets using NNge classifier to generate rules for string extraction. Step

Comparison of Three Sets of Experiments in Detail
As can be seen from Table 3, the length of sequences, and therefore the number of attributes (each position in a sequence represents an attribute value), varies from over 20,000 to over 90,000, making both sequence alignment and data mining computationally heavy and memory-intensive tasks.

1) Comparison of the Data Mining Results Obtained from Three Sets of Experiments as Well as from Other Related and Selected Previous Work
Table 4 presents the results of Experiments I-III and compares those results with the virus detection results presented in previously published works (see Table 2). In the case of the work by Chen et al. [16] only the percentages of correctly detected and incorrectly detected instances were reported (as for J48 method) and in the case of Prabha et al. [45] no performance metrics were reported. In the case of Srakaew et al. [18] other overall performance metrics such

Feature | Experiment I | Experiment II | Experiment III
 | | For the processes of data mining and pairwise sequence alignment. | For the processes of multiple sequence alignment, data mining and pairwise sequence alignment.
Multiple sequence alignment for the process of data mining | No | No | Yes
Conversion of variable length sequences into equal length sequences | By adding the letter "x" towards the end of each sequence until all the variable length sequences were of equal lengths. | By adding the letter "X" towards the end of each sequence until all the variable length sequences were of equal lengths. | By the process of multiple sequence alignment. All the gaps introduced by the process of alignment were substituted by "X".

We used "Gary's Hood" online tool [51] as it allows multiple files to be scanned at the same time adopting the four existing AVS products/scanners (i.e. AVG, AntiVir, ClamAV and F-Prot). The ESET AVS product was installed on a private machine running a Windows-based operating system, and the Clamscan antivirus scanner was installed on a private machine running a Linux-based operating system (Linux Mint) [52], using both its own ClamAV database and our own generated (.ndb) databases [10] containing the corresponding malicious or non-malicious meta-signatures (C1_HEX and C2_HEX). The databases of all the AVS products were up-to-date with the latest updates. In total, 71 meta-signatures (9 meta-signatures from Experiment I, 14 meta-signatures from Experiment II and 48 meta-signatures from Experiment III) were generated from malicious and non-malicious sequences. All 71 meta-signatures (C1_HEX and C2_HEX) were individually scanned/tested against the 352 known (P_k) JS.Cassandra malicious variants, 43 JS.Cassandra non-malicious (P_u) variants and 352 random JavaScript files by placing these meta-signatures inside their own generated (.ndb) database [10]. The testing process was conducted using the Clamscan antivirus scanner. None of the scans took longer than a second.
Table 5 shows the scan results of some of the effective meta-signatures tested against the malicious, non-malicious and random datasets. Non-malicious C2_HEX7 (I) and C2_HEX11 (II) 5 shows that none of the existing AVSs fully detected these non-malicious (P_u) variants as malicious.
The same batch of 71 meta-signatures (C1_HEX and C2_HEX) was once again tested against the 100 unknown (P_x) JS.Cassandra malicious variants using our own generated (.ndb) database [10]. The testing process was conducted using the Clamscan antivirus scanner. The uniqueness of these 100 new (P_x) malware variants was cross-checked by generating a CRC32b hash value for each variant, and no duplicates were found. Clamscan had an overall accuracy of 100% across all three experiments (see Table 6). Table 6 shows that all 100 (accuracy of 100%) JS.Cassandra unknown (P_x) variants were successfully detected by Clamscan using the .ndb database augmented with our meta-signatures.
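The two bookkeeping steps mentioned above, checking variant uniqueness by checksum and placing a hex meta-signature in an .ndb database, can be sketched as follows. The sketch assumes that CRC32b corresponds to the standard CRC-32 computed by Python's `zlib.crc32`, and that an .ndb entry has the form `Name:TargetType:Offset:HexSignature` as described in the ClamAV signature documentation (target type 0 for any file, offset `*` for any position). The signature name, variant names and helper functions are hypothetical.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of the data as eight lowercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")


def find_duplicates(variants):
    """Return (first name, later name) pairs whose contents share a checksum."""
    seen, duplicates = {}, []
    for name, data in variants.items():
        digest = crc32_hex(data)
        if digest in seen:
            duplicates.append((seen[digest], name))
        else:
            seen[digest] = name
    return duplicates


def ndb_line(name, hex_signature, target_type=0, offset="*"):
    """Format one body-based signature entry for an .ndb database file."""
    return f"{name}:{target_type}:{offset}:{hex_signature.lower()}"


# Hypothetical variant contents; real variants would be read from files.
variants = {"v1.js": b"var a=1;", "v2.js": b"var b=2;", "v3.js": b"var a=1;"}
print(find_duplicates(variants))             # [('v1.js', 'v3.js')]
print(ndb_line("Sketch.MetaSig-1", "4A6F"))  # Sketch.MetaSig-1:0:*:4a6f
```

A database file built from such lines would then be passed to the scanner through its database option; the exact invocation used in the experiments is not given in the text.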
The 71 meta-signatures (C1_HEX and C2_HEX) were tested for false positives. First, any duplicate meta-signatures (C1_HEX and C2_HEX)   detected as false positives (0.159% false positive rate) using the 45 meta-signatures (C1_HEX and C2_HEX), thereby satisfying the false positive rate requisite of 0.1%.

Discussions
It non-malicious (15/48)) detected seen (P_s) and unseen (P_u) variants belonging to malicious and non-malicious groups (see Table 5). Only 11 out of the 30 effective meta-signatures (C1_HEX and C2_HEX) obtained from Experiment III are shown in Table 5.
As Experiments I and II were performed using two different representational approaches (i.e. hex/DNA) along with Experiment III containing aligned DNA sequences, all with the same (unchanged) instances each time, some of the meta-signatures (C1_HEX and C2_HEX) obtained from the three sets were identical to each other. Malicious C1_HEX1 (I), C1_HEX3 (II), non-malicious C2_HEX41 (III) and C2_HEX43 (III) share an identical meta-signature. Likewise, malicious C1_HEX4 (I), C1_HEX9 (II) and non-malicious C2_HEX37 (III) share an identical meta-signature.
Although Experiment II generated rules with 100% inaccuracy, the overall combined percentage of effective meta-signatures (C1_HEX and C2_HEX) generated from all three sets of experiments was 57.75%. On the other hand, the overall combined percentage of non-effective meta-signatures (C1_HEX and C2_HEX) generated from all three sets of experiments was 42.25%.
The key differences between previous related work [10] [11] [12] [13] and the work presented here are as follows:
1) Previous work adopted left-to-right string matching techniques to find the most optimally-conserved meta-signatures. The work presented in this paper adopts a rule-based or top-down approach that attempts to find underlying patterns.
2) Previous work generated equal length consensuses using sequence alignment techniques, whereas the current work generates variable length consensuses adopting a variable length data mining technique (NNge).
3) Previous work adopted pairwise alignment techniques for extracting signatures which only allowed alignment of two viral sequences at a time, taking into account only the information available in the sequence pair. This work allows all sequences to be used to extract signatures and so takes into account all the information in all the sequences at the same time, including both family generic and variant specific information.

Conclusions
In this paper, some of the limitations (discussed in Section 3) of previous work [10] [11] [12] [13] were addressed. The learning task of maximizing true positive rates and minimizing false positive and false negative rates was satisfied. A syntactic approach was investigated and three sets of experiments were conducted which involved various approaches to automatic signature generation using the NNge classifier to generate rules that distinguish between malicious and non-malicious files. The results show that this string-based syntactic approach using NNge rule generation and subsequent extraction and sequence alignment using SWA can successfully generate signatures (C1_HEX and C2_HEX) which are capable of detecting the known (P_k) (i.e. seen and unseen) as well as unknown (P_x) polymorphic variants of the JS.Cassandra virus (see Table 5, Table 6 and Figure 2). Remarkably, this research demonstrated that it is possible to detect seen (P_s) (training set), unseen (P_u) (test set) as well as unknown (P_x) variants using the training signatures obtained from a very small proportion (typically 3% and below) of training variants of that test family. A minimal number of training variants was deliberately chosen because the need to detect large numbers of test variants from a minimal number of training variants accurately represents the syntactic malware signature generation approach in the real world.
The use of newly generated novel (P_x) variants differentiates our approach from all previous research that adopts existing malware samples from an online repository. In comparison to the semantic-based approaches as shown in Table

Section 5 and Section 6 discuss the data mining technique and sequence representations adopted in this paper. In Section 7, we describe our systems and methods. Section 8 summarizes the key features and steps by comparing the three different sets of experiments conducted in Section 7. Section 9 discusses the results. That is, Section 9-1) compares the data mining results obtained from three different sets of experiments against other related work and Section 9-2) evaluates signatures generated through the three different sets of experiments against state of the art AVS products, and on the detection of the JS.Cassandra polymorphic virus and its known and unknown variants. Section 10 and Section 11 contain the discussions and conclusions. The paper concludes with references and an Appendix section. Appendix Sections A1-A3 explain the three different sets of experiments (Experiments I-III) that were individually performed with these methods.
used PRISM on the consensuses derived after two rounds of alignment to generate rule-based signatures by performing several train/test (P_s/P_u) iterations with an overall accuracy of 62%. Although PRISM and NNge are both rule induction algorithms, the theoretical advantages of choosing NNge over PRISM are due to its potential for improved accuracy and production of extensive or verbose rules. Optimizing rules to produce minimal redundancy is counter-productive in malware signature generation, especially when trying to deal with P_x instances and to keep false positive and negative rates low. Moreover, in NNge, frequent removal of data instances and restoration of the training dataset are not required, unlike in PRISM. These steps are avoided in NNge by joining each instance to its nearest neighbor (more details below).
quences by categorizing them into two classes, namely: Class 1-cat saw the mouse and Class 2-mouse was seen by the cat. The variable length strings were converted into equal length strings by expanding the shorter strings to have a length equal to the longest string by adding the letter "x" at the end of each short string. PRISM gave the following rules with 75% accuracy after four iterations ("pos" = position):
If pos1 = the, pos2 = cat, pos3 = saw, pos4 = the, pos5 = cat, pos7 = the, pos8 = mouse, pos9 = x and pos10 = x then Class 1
If pos2 = mouse, pos3 = was, pos4 = seen, pos6 = was, pos7 = seen, pos8 = by, pos9 = the, pos9 = x and pos10 = x then Class 2
NNge gave the following rules with 100% accuracy ("^" = conjunction; "{}" signifies disjunctive options):
Class 1 IF: pos1 in {the, we} ^ pos2 in {cat, see} ^ pos3 in {saw, that} ^ pos4 in {the} ^ pos5 in {cat, mouse} ^ pos6 in {saw, x} ^ pos7 in {the, x} ^ pos8 in {mouse, x} ^ pos9 in {x} ^ pos10 in {x}
Class 2 IF: pos1 in {the, we} ^ pos2 in {mouse, see} ^ pos3 in {was, that} ^ pos4 in {the, seen} ^ pos5 in {mouse, by} ^ pos6 in {the, was} ^ pos7 in {cat, seen} ^ pos8 in {by, x} ^ pos9 in {the, x} ^ pos10 in {cat, x}
The strings were extracted from the above-mentioned PRISM and NNge rules and are shown as follows in their corresponding classes:
PRISM:
Class 1: the cat saw the cat the mouse
Class 2: mouse was seen was seen by the
NNge:
class B if p1 = (2 or 4 or 6) AND p2 = (22) AND (p3 ≥ 9 AND p3 ≤ 32) AND p4 = (b or c)
(see below), creating the collection of hyperrectangles starting from the training collection is an accumulative procedure where, for every instance I_n, the subsequent three stages are consecutively enforced, i.e. classification, model adjustment and generalization. The classification stage locates the hyperrectangle G_b which is nearest to I_n. The model adjustment stage divides the hyperrectangle G_b if it covers an inconsistent instance. The generalization stage extends G_b to cover I_n only if the generalized instance does not overlap/cover an inconsistent instance/hyperrectangle [48].
NNge Algorithm:
For each instance I_n in the training collection do:
    Locate the hyperrectangle G_b which is nearest to I_n /* Classification Stage */
    IF D(G_b, I_n) = 0 THEN
        IF class(I_n) ≠ class(G_b) THEN Divide/Split(G_b, I_n) /* Adjustment Stage */
    ELSE
        G' := Extend(G_b, I_n) /* Generalization Stage */
        IF G' overlaps with inconsistent hyperrectangles THEN add I_n as a non-generalized exemplar
        ELSE G_b := G'
The classification stage is formulated based on the distance D(I, G) between an instance I = (I_1, I_2, …, I_n) and a hyperrectangle G as shown in Equation (1) (Classification Stage).
In Equation (1), each attribute's contribution is normalized by the range of numerical values across the training collection which correspond to attribute k. For categorical (i.e. nominal) attributes, the size of this range is always 1. G_k is the interval [G_k^min, G_k^max] if I_k is a quantitative attribute, and is a list of values if I_k is a categorical attribute. The distance between an attribute value and the corresponding "side" of the hyperrectangle is formulated based on the type of the attribute, as illustrated in Equation (2) (Distance between the Corresponding Hyperrectangle).
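The distance computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments. It assumes the standard per-attribute rule: zero when the value falls inside the side; otherwise, for a quantitative attribute, the gap to the nearest interval bound; and 1 for a categorical mismatch. The function names, the tuple/set representation of sides, and the example ranges are hypothetical.

```python
import math

def side_distance(value, side):
    """Distance between one attribute value and one side of a hyperrectangle.

    Assumed representation: a quantitative side is a (low, high) tuple,
    a categorical side is a set of allowed values.
    """
    if isinstance(side, tuple):            # quantitative attribute
        low, high = side
        if value < low:
            return low - value             # below the interval
        if value > high:
            return value - high            # above the interval
        return 0.0                         # inside the interval
    return 0.0 if value in side else 1.0   # categorical attribute

def nnge_distance(instance, rect, ranges, weights=None):
    """Weighted, range-normalized Euclidean distance D(I, G)."""
    weights = weights or [1.0] * len(instance)
    total = 0.0
    for value, side, span, w in zip(instance, rect, ranges, weights):
        total += (w * side_distance(value, side) / span) ** 2
    return math.sqrt(total)

# Hyperrectangle modelled on the rule fragment quoted earlier:
# p1 in {2, 4, 6}, p2 in {22}, 9 <= p3 <= 32, p4 in {b, c}.
rect = [{2, 4, 6}, {22}, (9, 32), {"b", "c"}]
# Categorical ranges are 1; the quantitative range of 40 is an assumed value.
ranges = [1.0, 1.0, 40.0, 1.0]

print(nnge_distance([4, 22, 20, "b"], rect, ranges))  # instance inside the hyperrectangle
print(nnge_distance([4, 22, 36, "a"], rect, ranges))  # outside on p3 and p4
```

An instance at distance zero triggers the class check in the algorithm listing; any other instance is a candidate for extending its nearest hyperrectangle.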
Step-3 deals with the extraction of common training sequences (i.e. strings, or first-level rule-based consensuses) using the NNge rules. Step-4 deals with converting the hex code of the training P_s and test P_u sets (obtained from Step-1) as well as first-level consensuses (obtained from Step-3) into a form (in this case, DNA) acceptable for sequence alignment. Step-5a deals with the process of pairwise (local) sequence alignment between the first-level consensuses and some variants of the training set P_s (both obtained from Step-4) using the SWA to produce equal length sequences (i.e. second-level consensuses). Step-5b deals with the extraction of meta-signatures, or common substrings, from these second-level consensuses. Step-6 deals with the conversion of meta-signatures back into viral hex code for the purpose of signature testing against P_k and P_x viral sets. More details concerning each step are supplied in Figure A1 in the Appendix section.

The method in Experiment II consists of six steps. The same procedure as Experiment I was used along with the same training P_s and test P_u sets, with the only difference being that some variants of the training set P_s were converted into DNA format prior to the process of variable length data mining. More details concerning each step are supplied in Figure A2 in the Appendix section.

The method in Experiment III consists of seven steps. The same procedure as Experiments I and II was adopted and the same training P_s and test P_u sets were used, with the only difference being an additional step of multiple sequence alignment on the training set P_s to produce equal length sequences prior to the process of equal length data mining. More details concerning each step are supplied in Figure A3 in the Appendix section.
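The conversion and alignment steps (Step-4, Step-5a and Step-6) can be illustrated with a short Python sketch. It is not the tooling used in the experiments and rests on two assumptions: a hypothetical 2-bit encoding in which each hex digit becomes two nucleotides (the text does not specify the actual hex-to-DNA mapping), and an illustrative Smith-Waterman scoring scheme of +2 for a match and -1 for a mismatch or gap.

```python
NUCLEOTIDES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T

def hex_to_dna(hex_code):
    """Encode a hex string as DNA letters, two letters per hex digit."""
    out = []
    for digit in hex_code:
        value = int(digit, 16)
        out.append(NUCLEOTIDES[value >> 2])  # high two bits
        out.append(NUCLEOTIDES[value & 3])   # low two bits
    return "".join(out)

def dna_to_hex(dna):
    """Invert hex_to_dna (lowercase hex); expects an even-length, gap-free string."""
    digits = []
    for i in range(0, len(dna), 2):
        value = (NUCLEOTIDES.index(dna[i]) << 2) | NUCLEOTIDES.index(dna[i + 1])
        digits.append(format(value, "x"))
    return "".join(digits)

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment: return (best score, aligned a, aligned b), '-' marks a gap."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Hypothetical data: a short consensus aligned against a longer variant.
consensus = hex_to_dna("4a6f")
variant = hex_to_dna("ff4a6f00")
score, top, bottom = smith_waterman(consensus, variant)
print(score, top, bottom)
print(dna_to_hex(top))  # only valid here because the aligned region has no gaps
```

The two aligned strings returned by `smith_waterman` always have equal length, which is the property the pairwise alignment step relies on. A gap-free aligned region can be converted back to hex directly, whereas regions containing gaps would first need the common substrings extracted as in Step-5b.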

Figures 2(a)-(c) are the screenshots of the scan results indicating that 352 of the 352 known (P_k) malicious variants, 43 of the 43 non-malicious (P_u) variants and 100 of the 100 unknown (P_x) malicious variants were successfully identified as infected by the Clamscan antivirus scanner using the 45 meta-signatures (C1_HEX and C2_HEX). Figure 2(d) shows that only 29 of the 18,123 clean files were

Figure 2. Screenshot of the scan results obtained from Clamscan antivirus scanner for JS.Cassandra variants and clean files using the 45 meta-signatures (C1_HEX and C2_HEX).

Table 2. Some related and selected previous work in malware detection using data mining and bioinformatics approaches.
Algorithms, i.e. J48, KNN (K-Nearest Neighbours), Naïve Bayes. 100 binaries, out of which 90 were benign and 10 were malware binaries. 15 subfamilies, with a total of 1056 malicious viral samples. Extraction of hex dumps/Extraction of byte sequences in terms of n-grams of different sizes. Journal of Information Security. Naidu et al.
Set NM in this paper is defined as malware variants generated by eliminating their key polymorphic functions; these variants are partially functional, with no payload properties. M_HEX and NM_HEX are then converted to DNA. Multiple sequence alignment is then applied to M_DNA and NM_DNA to produce equal length sequences M_E and NM_E. NNge is then applied to M_E and NM_E to produce variable length strings N1_DNA and N2_DNA, which are then pairwise aligned against P_s. This produces C1_DNA (between N1_DNA and P_s) and C2_DNA (between N2_DNA and P_s), and these consensuses become the meta-signatures for use against P_k and P_x after conversion back into hex (i.e. C1_HEX and C2_HEX). Therefore, the viral code remains in hex format until just before pairwise sequence alignment. The difference between Experiment II and Experiment III is that the viral strings are multiply aligned first to produce equal length strings before NNge is applied.

Table 3. Comparison of three sets of experiments.

Table 4. Comparison of the results of Experiments I-III with those reported previously for data mining approaches to malware detection reported in the related work section (see Table 2).

(~55%) and II (43%) (see Section 9-2). The fact that the meta-signatures (C1_HEX and C2_HEX) in DNA format performed better if the DNA sequences were aligned prior to rule mining and extraction (Experiment III vs. Experiment II) is also reflected in the results of the work reported by Chen et al. [16], where improved classification was observed if J48 classification was performed after a double alignment process.

2) An Evaluation of the State of the Art AVS Products and Meta-Signatures (C1_HEX and C2_HEX) on the Detection of JS.Cassandra Virus and Its Known P_k

Table 5. Detection ratio using five state of the art AVSs and the 14 most effective malicious and 8 non-malicious meta-signatures (C1_HEX and C2_HEX) from Experiments I to III with Clamscan scanner.

) malicious variants of the JS.Cassandra polymorphic virus, where (I), (II) and (III) represent the meta-signatures (C1_HEX and C2_HEX) generated from Experiments I, II and III. None of the five state of the art AVSs fully detected all known (P_k) JS.Cassandra variants. Scan results for AVG, AntiVir and F-Prot AVS products were obtained from an open source online website known as "Gary's Hood"

k detected 339 out of the 352 (with 96.31% accuracy) JS.Cassandra malicious (P_k) variants, whereas non-malicious C2_HEX41 (III) and C2_HEX43 (III) detected 340 out of the 352 (with 96.59% accuracy) JS.Cassandra
u) variants were still executable. The results presented in Table

Table 6. Detection ratio using two state of the art AVSs and the 71 meta-signatures (C1_HEX and C2_HEX) obtained from Experiments I to III with Clamscan antivirus scanner.
It was found from the experiments conducted in this paper that Experiment III (equal length data mining technique) gave the highest number of successful meta-signatures (C1_HEX and C2_HEX) in comparison to Experiments I and II (variable length data mining technique). Experiment II gave the lowest number of successful meta-signatures (C1_HEX and C2_HEX). Not only did Experiment III give the highest number of meta-signatures (C1_HEX and C2_HEX), but it also gave the highest number of effective meta-signatures (C1_HEX and C2_HEX). Moreover, Experiment III generated meta-signatures (C1_HEX and C2_HEX) that were not generated in Experiments I and II. Multiple sequence alignment prior to data mining significantly improved both the quality and quantity of meta-signatures (C1_HEX and C2_HEX) in comparison to Experiments I and II. In comparison to previously reported work (see Section 4 and Section 5), the syntactic approach to automatic signature generation using NNge has successfully addressed the limitations of previous work by generating signatures in the quickest, simplest and most accurate manner.

In total, 45 out of the 71 overall meta-signatures (C1_HEX and C2_HEX), i.e. around 63.38% (33.80% malicious (24/71) and 29.58% non-malicious (21/71)), were effective, i.e. detected seen (P_s) and unseen (P_u) variants from the two different types of groups (i.e. malicious and non-malicious). Specifically, six out of the nine meta-signatures (C1_HEX and C2_HEX) generated from Experiment I (i.e. around 66.66% of meta-signatures: 44.44% malicious (4/9) and 22.22% non-malicious (2/9)) detected seen (P_s) and unseen (P_u) variants belonging to malicious and non-malicious groups (see Table 5). Seven out of the 14 meta-signatures (C1_HEX and C2_HEX) generated from Experiment II (i.e. 50% of meta-signatures: 28.57% malicious (4/14) and 21.43% non-malicious (3/14)) detected seen (P_s) and unseen (P_u) variants belonging to malicious and non-malicious groups (see Table 5). Additionally, 32 out of the 48 meta-signatures (C1_HEX and C2_HEX) generated from Experiment III (i.e. 66.66% of meta-signatures: 35.41% malicious (17/48) and 31.25%