Exploring the Effects of Gap-Penalties in Sequence-Alignment Approach to Polymorphic Virus Detection

Antiviral software systems (AVSs) have problems in identifying polymorphic variants of viruses without explicit signatures for such variants. Alignment-based techniques from bioinformatics may provide a novel way to generate signatures from consensuses found in polymorphic variant code. We demonstrate how multiple sequence alignment supplemented with gap penalties leads to viral code signatures that generalize successfully to previously known polymorphic variants of the JS.Cassandra virus and previously unknown polymorphic variants of the W32.CTX/W32.Cholera and W32.Kitti viruses. The implications are that future smart AVSs may be able to generate effective signatures automatically from actual viral code by varying gap penalties to cover both known and unknown polymorphic variants.


Introduction
The automatic extraction of virus and other malware signatures for use in antiviral software systems (AVSs) is of paramount importance due to the need to find effective solutions to defend systems against the increasing number and severity of attacks [1]. It is generally accepted that these attacks now pose a global risk [2]. Early work on automatic signature extraction focused on simulating the way that human experts analyzed viruses and generated signatures for use in AVSs [3].
Typically, suspicious code is identified due to anomalous behavior of a computer system. Human experts then manually analyze the suspicious code to identify invariant code portions (syntactic analysis) or code portions that are regularly executed (semantic analysis). Such analysis leads to the generation of unique signatures for use by AVSs when scanning network packets, user files or memory. Before such signatures can be released, they must be checked against non-malware to ensure that the number of false positives is kept acceptably low.
For instance, signatures based only on malware encryption/decryption information are likely to lead to unacceptably high false positives due to the large proportion of normal Internet traffic that also carries encryption/decryption information for integrity (e.g. hash algorithms) and authentication (e.g. certified public keys). But relying on human expertise alone to provide manually extracted signatures is becoming increasingly difficult with the growing volume of malware. As a result, interest continues to grow in methods to improve automatic signature extraction. Semantic approaches [4] [5], in addition to standard dynamic and execution behavior analysis [6] [7], now include methods such as control flow analysis [8] [9], behavior model checking [10] [11], executable graph mining [12] and formal semantic models of analysis [13]. The main problem with a semantic approach is that an infection must occur to produce anomalous behavior. Several execution traces may be required before signatures can be extracted manually, and there is always the risk that such signatures may not be effective for different execution paths of the same viral code. Syntactic or static approaches [14] [15] [16], on the other hand, while possibly preferable because of their ability to extract signatures that may apply to different variants of the same malware family and to generate signatures irrespective of differences in execution paths, have not managed to keep pace with the latest polymorphic and metamorphic techniques used by virus writers to obfuscate their malware [17] [18]. Static signature extraction methods must also disassemble or reverse engineer executable code so that structural analysis of the source code is possible.
Such analysis includes: statistical analysis of parameter values and searching for repeating strings [19] [20]; code feature selection [21]; feature extraction [22]; and n-gram analysis [23] [24] [25]. The mapping of executable code to a suitable level of program representation that allows such structural analysis is problematic, however, because such code is deliberately constructed to hide its functionality, for example through the use of redundant control instructions and variable assignments.
Predicting future metamorphic and polymorphic viral forms to prepare AVSs for as yet unknown variants has remained a distant research goal for both semantic and syntactic techniques. The key to a successful syntactic approach would appear to lie in analyzing malware code directly and without execution, and so removing the need for reverse engineering. By comparing different structural variants of the same virus, a successful structural/static approach may be able to identify common code patterns despite attempts to obfuscate through polymorphism because, if the virus is to perform its designated payload or function and remain a variant of a virus family, common code must be present even if it is deliberately obscured. A purely syntactic approach, such as the one proposed in this paper, should detect new polymorphic viral variants independently of semantic knowledge based on execution traces, command and control channels, deduplication and propagation vectors. That is, a purely syntactic approach to new variants should not require prior infection by those variants.
In this study, we focus on a sequence-based automatic signature extraction method for identifying polymorphic malware using syntactic analysis of hex code. Theoretically, malware with polymorphism changes its code and keeps the functions intact, whereas malware with metamorphism changes sub-functionality and code while preserving overall functionality [26]. The implications of this theoretical division are unknown for automatic signature extraction. It is not even known if any metamorphic malware actually exists [27]. For that reason, we confine our approach to polymorphic malware capable of mutating into a potentially infinite number of functionally equivalent but structurally different variants (details below).
Previous work in syntactic signature extraction [28] introduced the idea of using basic pairwise sequence alignment techniques from bioinformatics to identify "consensuses" (common occurrences of hex code) in pairs of variants, each of which served as a signature for that pair. These consensuses were in turn multiply aligned with each other to generate a common consensus (i.e. a meta-signature) for all variants [29] [30]. A by-product of alignment is that variable-length viral sequences become fixed-length and longer through the introduction of gaps.
Gaps are the segments that are generated when aligning amino acid or nucleotide sequences so that similar and analogous residues in two or more sequences are paired with each other in the same column. Gaps can also be placed where one or more sequences have some additional residues (produced by an insertion) or are missing some residues (produced by a deletion). Gaps are generally represented by gap symbols such as blanks, asterisks or hyphens so that a gapped sequence pairs up with sequences that have no gaps. If insertions and deletions never occurred, then sequences could simply be paired by shifting them along each other and only considering the alignment that best paired the existing residues.
In previous work, the evaluation of these consensus-based techniques was restricted to all known, already identified, polymorphic variants. The signatures extracted were therefore "variant-fit" rather than "variant-predictive".
The aim of this paper is to examine whether string searching algorithms of greater sophistication than those investigated previously by Naidu and Narayanan [29], such as the Smith-Waterman algorithm applied, unlike in previous work [29], with different combinations of gap open and gap extend penalties, can lead to the automatic generation of signatures not just for known variants but also for unknown (future), or newly generated, variants. In Section 2 and Section 3, we discuss the background of syntactic techniques and previous related work. In Section 4, we describe the problem statement. We then demonstrate our systems and methods in Section 5. Section 6 compares the results against state-of-the-art AVS products. Section 7 contains the conclusion.

Background
Consider four structurally related sentences:
The cat saw the mouse
The mouse was seen by the cat
We see that the cat saw the mouse
We see that the mouse was seen by the cat
Signature extraction is similar to finding the two patterns "cat saw mouse" and "mouse seen cat" that will help to detect all four sentences as variants despite the variable length of the sentences, the movement of tokens within the sentences and the introduction of extra material. If options and alternatives are allowed, "{we see} [cat|mouse] [saw|seen] [cat|mouse]" is an approximate regular expression (rule-based signature) for all four sentences that also allows for derivations of new structural variants not so far encountered (e.g. "that the cat was seen by the mouse was seen by us"). These signature examples are of course simplistic when compared to the real task of automatic signature extraction.
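The rule-based signature for the four example sentences can be sketched as an ordinary regular expression. The pattern below is only an illustration written with Python's standard re module; the names SIGNATURE, SENTENCES and matches are hypothetical and are not part of the method described in this paper.

```python
import re

# Hypothetical pattern: an optional "we see" prefix, then an animal,
# a verb form and an animal, in that order, with any filler in between.
SIGNATURE = re.compile(
    r"(?:\bwe see\b.*?)?\b(cat|mouse)\b.*?\b(saw|seen)\b.*?\b(cat|mouse)\b",
    re.IGNORECASE,
)

SENTENCES = [
    "The cat saw the mouse",
    "The mouse was seen by the cat",
    "We see that the cat saw the mouse",
    "We see that the mouse was seen by the cat",
]


def matches(sentence: str) -> bool:
    """Return True if the sentence contains the token order in SIGNATURE."""
    return SIGNATURE.search(sentence) is not None
```

Under these assumptions the same pattern also accepts the unseen variant "that the cat was seen by the mouse was seen by us", which is the sense in which a rule-based signature can be predictive rather than merely fitted to known variants.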
Viral signatures must also take into account dependencies between non-adjacent code in order to deal with specific polymorphic features as well as possible rearrangements of code that alter the left-to-right order of signatures. In reality, the first four sentences above would be in hex (machine code) format and require accurate disassembly to a language amenable to structural analysis, and the signature then converted back to hex for real-time scanning of network packets and cached files. Signatures must also be checked for their uniqueness. That is, before the generated signature can be released it must be able to distinguish its source malware from all other malware as well as be consistent with as many variants of that malware as possible. It is generally believed that in 2017 a contemporary AVS may contain between a quarter of a million and half a million signatures due to the increasing rate of release of new malware. Updates to AVSs may require removing old and no longer effective signatures as well as adding new signatures, and this can be expected to become more time-consuming with the growth in occurrences of new malware.
A sequence-based approach to signature extraction was previously proposed and demonstrated using the Smith-Waterman algorithm (SWA) without gap penalties [29]. SWA is used extensively in bioinformatics for sequence alignment (finding common subsequences or consensuses among a set of variable length sequences), and previous work demonstrated the feasibility of using such consensuses in viral hex code as signatures. The approach was further refined [30] by adopting SWA with six different substitution matrices. Results showed that it was possible to extract signatures/meta-signatures after applying data mining rule-extraction techniques to the extracted signatures.
Another related advancement in a syntactic approach was also recently reported [31]. Two different dynamic programming methods, namely Needleman-Wunsch and SWA, were investigated. However, this work was limited to a single polymorphic malware family (JS.Cassandra) and used fixed parameters which were not tuned [31]. It was found that SWA gave the best results.
What has improved considerably since the historical view that only semantic analysis will reveal viral signatures is the growth in our knowledge of sequence-based syntactic and structural search algorithms in bioinformatics. Such algorithms do not just search for the presence or absence of characters in certain positions but also use pre-loaded substitution matrices that give substitution probabilities and/or allow such substitution matrices to be generated using probabilistic techniques. Of greater importance to this paper is that such algorithms manipulate (shift) the strings/sequences to allow for insertion and deletion of characters to maximize the number of matching characters. Previous work [32] showed that such string manipulation algorithms from bioinformatics work best with biologically represented strings (amino acids, nucleotide bases) rather than arbitrary character sets. This is due to the possible inclusion of heuristic biological information in the algorithms that determines to some extent the matching process (e.g. built-in information concerning mutation rates between amino acids or nucleotide bases). The implications of rewriting already well-understood and publicly available sequence-based bioinformatics algorithms to work on hex code (numeric data) are not known. For these reasons and to allow comparison with previous work, conversion of hex code to an appropriate biological representation is required before sequence matching, with conversion back to hex code for signature generation.
We used a simple identity (ID) substitution matrix for our alignment experiments instead of other well-known biological substitution/mutation matrices, such as BLOSUM (Block Substitution Matrix) and PAM (Point Accepted Mutation). ID provides the most parsimonious method in that no assumptions are made as to how symbols may be related to each other. Also, the use of ID allows the effects of gap opening and closing to be accurately assessed without being compromised by probabilistic substitution matrices.

Related Work
Previous research related to this work has primarily focused on worms. Syntactic approaches include Autograph [33], Honeycomb [34] and Early Bird [35], all of which generate signatures that constitute individual adjoining byte strings (tokens). Another syntactic approach is Polygraph [36], which identifies an array of tokens, a subsequence of tokens and Bayes signatures based on probabilistic methods to detect polymorphic worms. Semantic approaches include PAYL [37], which produces subsequence signature tokens that associate ingress/egress payload notifications to detect the initial replication of worms. Other semantic approaches include: Nemean [38], which focuses on identifying signatures that defend against worms; Hamsa [39], which produces a set of signature tokens that can deal with polymorphic worms by investigating their invariant activity; and Botzilla [40], which produces signatures for the malicious activities (traffic) created by a malware binary executed several times within a controlled domain.
... malicious activities (polymorphic engines). In our approach, new structural variants were generated by us in the laboratory using the information included in documents concerning the corresponding polymorphic viral family (more details in Subsection 5.2). This use of newly generated novel variants differentiates our approach from all previous research that exclusively uses existing malware samples from an online repository.
Other semantic-based research exists for different types of malware, including: ShieldGen [41], which generates network signatures for unseen vulnerabilities that are protocol-aware (for instance, the protocol mode with which an invasive message can be posted); AutoRE [42], which produces a spam signature creation architecture from spam emails that use botnets to detect them; and the approach of Wurzinger et al. [43], which identifies botnets that are under the influence of a botmaster (malicious body) using network signatures by examining the response from a compromised host to a received command and by generating detection models. ProVex [44] is also a semantic-based approach, which generates signatures to identify botnets that use encrypted command and control (C&C) systems after being given the keys and decryption routine employed by the malware using a binary code reuse strategy, and is based on the research proposed by Caballero et al. [45]. FIRMA [46], also a semantic-based approach, can be employed to detect similar C&C systems but does not produce signatures for these. A number of syntactic and semantic-based strategies were proposed by Scheirer et al. [47] for the identification of many polymorphic worms; these use intrusion detection techniques such as sliding window schemes and instruction semantics, with further refinements by Scheirer et al. In comparison to these semantic-based approaches, we propose a purely syntactic approach which generates variable-length syntactic viral signatures that identify known and unknown variants belonging to a polymorphic viral family, independently of execution traces, and, critically for a syntactic approach, without needing numerous infections for the purpose of malware association.
There has also been some related research on sequence alignment approaches using a semantic approach in other security areas. For instance, sequence alignment was used for masquerade detection by comparing "audit data" (actual examples of attempted malicious activity via command line entry using authenticated accounts) with legitimate user signatures extracted from their actual command line entries [48]. Another example is intrusion detection [49], where variable length patterns from training data consisting of system call traces of commands under normal execution were analyzed by a sequence-based algorithm called Teiresias. Other sequence alignment approaches that are based on semantics include the approach of Zhao et al. [50], which generates signatures through an inverse transcoding method by converting the malware sequential information, such as system call sequences, propagation dataflow, etc., into amino acid sequences and then aligning them using a multiple sequence alignment tool. The approach of Ki et al. [51] generates sequences that are typical API call sequence motifs of malicious activities belonging to several malware samples and employs a multiple sequence alignment tool to align those malware samples to extract signatures. They then used data mining and machine learning algorithms to calculate statistical measures, such as accuracy, precision, etc., to test the extracted signatures but did not test the signatures against new variants. MalGene [52] uses sequence alignment techniques on two sequences of system call events belonging to two different analysis environments: one environment in which the malware evades the AVS, and the other in which it exhibits the malicious activities. These events are used to construct an "evasion signature" using sequence alignment. However, this semantic approach requires system call sequences from both analysis environments, which in turn requires the use of system monitoring, which adds an overhead. In contrast, our syntactic approach is independent of any prior semantic knowledge. The syntactic approach most closely related to ours [53] adds nothing new to what was reported by Chen et al. in 2012 [28], and repeats the structural sequence alignment and data mining approaches adopted in that paper and subsequently enhanced by [29] [30] [31].
To conclude this section, previous use of sequence alignment has for the most part been semantic in nature, relying on system behavior patterns rather than code or structural patterns for the identification of malware or fraudulent activity. Also, because of their semantic nature, the generalizability of the results to new variants created through polymorphism is unknown, as is the generalizability, if any, of signatures to malware of different families. Our syntactic-driven approach, on the other hand, is based on the intuition that most new (polymorphic) variants are simple syntactic alterations of existing malware. The "expressive power" of signatures can be evaluated by identifying how well these signatures generalize to new and unseen variants of the same family, all derived through polymorphic (structural) changes to the code, as well as across different families. The advantage of a syntactic approach is that no semantics is required. That is, there is no need for an infection before a signature is generated.
Finally, most semantic approaches in the literature do not address the problem of false positive rates. This is because there are many different ways that a program can run and false positive rates may be impossible to quantify for signatures extracted from a limited number of execution traces on one variant of malware.
With a syntactic approach, on the other hand, signatures can be checked against static code and objects, including files, without needing to execute any code. For instance, one method of distributing malware is to generate new polymorphic variants and store them undetected in user files until triggered, and syntactic signatures may be effective in catching such variants before execution. The advantages of a syntactic approach are obvious for future smart AVS technology, but so far there has been very little attempt to analyze the effectiveness of a purely syntactic approach systematically and across different malware families.
For instance, the signatures generated from our approach are able to satisfy the false positive rate requirement of 0.1%. More importantly, as will be shown below, the number of malware training examples needed to extract a signature for use against unseen test examples is surprisingly small given the sequence alignment approach adopted in our experiments.

Problem Statement
Building on our previous work [29], we explore different combinations of gap open and gap extend penalties to determine whether changes in these penalty parameters can help to identify signatures for known as well as unknown polymorphic variants, which we generate in the laboratory, thereby extending the ability of future AVSs to identify variants not previously encountered.
Win32.Cholera/W32.Cholera/W32.CTX is a polymorphic virus which attacks executable PE (Portable Executable) files and was first identified in 2010. This virus is programmed in assembly language, and it employs an EPO (Entry Point Obfuscation) approach, which makes its identification difficult [58] [59]. The original source files were downloaded from the "VX Heaven" [60] website. A total of 198 new polymorphic variants of the "W32.Cholera" virus were generated by executing one of the original virus files (in this case, a file named "Virus.Win32.CTX.10853").

Hex Dump Extraction
Win32.Kitti/W32.Kitti is a polymorphic virus which uses overlapping code as an obfuscation technique and was first identified in 2011. This virus modifies its instructions to create new instructions with the same semantics but a different structure using an overlapping code process [61] [62] [63]. The original virus file along with its source code in assembly language was downloaded from the "Second Part to Hell" [64] website. A total of 1105 new polymorphic variants of the "W32.Kitti" virus were generated by executing the original virus file (in this case a file named "oc.exe").
The method consists of eight steps. In summary, sequence alignment works on variable-length viral hex strings to produce equal-length hex strings through opening and closing gaps. These equal-length strings can be analyzed to produce first-level consensuses (signatures), which represent common subsequences at specific locations for the pairwise alignments. These consensuses/signatures can themselves be analyzed using multiple sequence alignment to produce second-level raw consensuses that can be further analyzed to identify similarities with each other to produce meta-signatures for the six variants in that test family. These meta-signatures are then used to test against all existing variants.
Step-1 (Virus code variant generation): The JS.Cassandra virus and all its known variants were written in the JavaScript programming language, and their source code was readily available. Five of the 351 known variants were taken for training, plus the original "JS.Cassandra.js" virus (a total of six variants). In the case of the W32.CTX virus, five of the 198 newly generated polymorphic variants were taken for training, plus the original "Virus.Win32.CTX.10853" virus (a total of six variants). In the case of the W32.Kitti virus, five of the 1105 newly generated polymorphic variants were taken for training, plus the original "oc.exe" virus (a total of six variants). New variant generation was achieved by using information ... previously encountered set of known variants to a potentially infinite set of new variants.

Hex to Amino Acid Conversion
Step-2 (Converting the viral code into a form acceptable for sequence alignment): In this step, the 18 extracted hex dump sequences belonging to the three polymorphic malware families were converted into amino acid sequences.
Conversion of hexadecimal into amino acid sequences for input to JAligner [67] was performed using a fixed set of conversion rules.
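A conversion of this kind can be sketched as a reversible character mapping. The mapping below is purely hypothetical (each hex digit is assigned one of 16 standard one-letter amino acid codes) and should not be read as the conversion rules actually used in this study; the function names are likewise illustrative.

```python
# Hypothetical one-to-one mapping: each hex digit gets one amino acid letter.
HEX_DIGITS = "0123456789ABCDEF"
AMINO_LETTERS = "ACDEFGHIKLMNPQRS"  # 16 of the 20 standard one-letter codes

HEX_TO_AA = dict(zip(HEX_DIGITS, AMINO_LETTERS))
AA_TO_HEX = {aa: digit for digit, aa in HEX_TO_AA.items()}


def hex_to_amino(hex_string: str) -> str:
    """Convert a hex dump string into an amino-acid-style string."""
    return "".join(HEX_TO_AA[d] for d in hex_string.upper())


def amino_to_hex(aa_string: str) -> str:
    """Convert an amino-acid-style string back into hex."""
    return "".join(AA_TO_HEX[a] for a in aa_string.upper())
```

Because the mapping is one-to-one, a consensus found in the amino acid representation can be converted back to hex without loss. Gap symbols introduced by alignment are not handled in this sketch and would need to be removed or treated separately before converting back.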

First Pairwise (Local) Sequence Alignment and Signature Extraction
The string-matching SWA was used to perform pairwise local alignment and to extract the most common substring/pattern from the three different families of polymorphic variants. Signature and meta-signature in this section are defined as follows. A signature is a single string (or a common substring/pattern) that can identify a single or (in some cases) a few known and unknown variants, whereas a meta-signature is a string (or a common substring/pattern) that can identify most or all known variants as well as some or all unknown (or new) variants.
Step-3 (First pairwise (local) sequence alignment using the SWA): In this step, a pairwise (local) alignment was performed on all six training strings for each family using the SWA with an ID substitution matrix (i.e. alignment was performed through matching in particular positions rather than preloaded biologically informed mutation rates) between two sequential converted amino acid sequences using JAligner [67].
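To make the role of the gap open and gap extend penalties concrete, the following is a minimal sketch of Smith-Waterman local alignment scoring with affine gaps (the Gotoh recurrences) and identity-style scoring. It is not JAligner's implementation. The match and mismatch values, and the convention that the first gap position costs gap_open and each further position costs gap_extend, are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=1.0, mismatch=-1.0,
                         gap_open=10.0, gap_extend=1.0):
    """Best local alignment score of a and b with affine gap penalties."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of a local alignment ending at (i, j)
    # E: best score ending with a gap in `a` (a character of b is unpaired)
    # F: best score ending with a gap in `b` (a character of a is unpaired)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            # Identity scoring: reward only exact symbol matches.
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0.0 floor is what makes the alignment local.
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With the hypothetical inputs "ACDEFG" and "ACDXXEFG", low penalties (gap_open=1.0, gap_extend=0.5) allow a two-position gap to bridge the inserted symbols, so one longer alignment is found. With the higher defaults the gap is too expensive and the best alignment is a shorter gap-free match. Varying the two penalties therefore changes which common substrings emerge as candidate signatures.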

Multiple Sequence Alignment and Consensus Extraction
Step-5 (Multiple sequence alignment on signatures): In this step, a multiple alignment was performed on the signatures (i.e. common substrings) obtained in Step-4 using T-Coffee [70], available on the EMBL-EBI website, with alignment being constrained to the ID matrix. In total, three separate multiple alignments were performed (i.e. on 10, 17 and 30 signatures, respectively), one for each of the three polymorphic malware types. The main purpose of alignment here is to produce second-level consensuses (more details in Step-6).
Step-6 (Extraction of consensuses after multiple sequence alignment): T-Coffee [70], similar to other alignment tools, produces a consensus sequence that represents the most common residues (amino acid representations) in each position of multiple sequences after alignment. In this step, the consensus was stored and the process was repeated three times, once for each polymorphic malware. Three consensuses were extracted in this step. One of these consensuses, for the JS.Cassandra virus, had a sequence length of 203.
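The idea of a consensus as the most common residue in each column can be sketched as follows. This is a simplified majority vote over equal-length aligned strings, not T-Coffee's own consensus procedure; the function name and the use of "-" as the gap symbol are assumptions.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Column-wise majority consensus of equal-length aligned sequences.

    A column where the gap symbol is most frequent keeps the gap symbol.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        # most_common(1) returns the (symbol, count) pair with the top count.
        residue, _count = Counter(column).most_common(1)[0]
        out.append(residue)
    return "".join(out)
```

For the hypothetical aligned rows "AC-EF", "ACDEF" and "GCDEF", every column has a clear majority and the sketch yields "ACDEF". Columns with ties would need an explicit tie-breaking rule, which is left out here.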

Results and Evaluation of State-of-the-Art AVS Products
Table 3 provides the results of the pairwise local alignments that were performed in Step-3. Only the desired pairwise local alignment results with the highest percentage of identities and similarities are shown in Table 3.
From Table 3, it can be seen that the percentages of identities and similarities were higher than 85%, indicating that a high percentage of the code was conserved in the sequences. In the case of the W32.Kitti virus, the percentage of identities and similarities was 100%. In the case of the W32.CTX virus, the percentages of identities and similarities were over 94% and in some cases 100%. As expected, Table 3 indicates that the amount of gaps increases with lower gap open penalties (see Columns "Gap Open Penalty" and "Gaps Percentage"), indicating that more insertions or deletions were introduced to maximize the number of matches. In previously adopted methods [29] [30] [31] a fixed combination of gap open (i.e. 10) and gap extend (i.e. 1) penalty was used. The work reported here has instead explored various combinations of gap open and gap extend penalties (conducted in Step-3) to explore the effect of these penalties on variant detection. It can be seen from the results in Table 3 that the percentages of identities and similarities were higher (i.e. over 97%) when the gap open and gap extend penalties were higher, indicating that the (pairwise local) alignments were compact, thereby restricting the amount of gaps (with lower gap percentages) and increasing their importance (see Columns "Gap Open Penalty", "Gap Extend Penalty" and "Gaps Percentage" in Table 3).
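The identity and gap percentages discussed for Table 3 can be illustrated with a small calculation over one gapped pairwise alignment. The sketch assumes that both percentages are taken over the full alignment length and that "-" is the gap symbol; the exact definitions used by the alignment tool may differ.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Identity and gap percentages for two aligned, equal-length rows."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    length = len(row_a)
    pairs = list(zip(row_a, row_b))
    # An identity is a column where both rows carry the same non-gap symbol.
    identities = sum(1 for x, y in pairs if x == y and x != gap)
    # A gap column is one where either row carries the gap symbol.
    gaps = sum(1 for x, y in pairs if x == gap or y == gap)
    return {
        "identity_pct": 100.0 * identities / length,
        "gap_pct": 100.0 * gaps / length,
    }
```

With an identity substitution matrix only identical symbols score positively, so the similarity percentage coincides with the identity percentage in this simplified view.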
Tables 4-6 provide the detection rate results for the three malware types along with their known and unknown variants. The detection was carried out using Clamscan and the most effective signatures obtained in Step-4. The most effective signatures were determined to be the signatures that detected over 90% of the variants. These signatures were placed inside our own generated (.ndb) database [29], which is used by Clamscan as a recommended database file format for signature testing purposes. Detection performance for each of the three viruses was also compared against the AVS products reported by the "TopTenReviews" [71] website. The top five AVS products in this listing were tested using the same three viruses along with their known and unknown variants, and the results are presented in Tables 4-6.
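As an illustration of how a hex signature might be placed in a .ndb file, the sketch below formats one line in ClamAV's extended signature layout, Name:TargetType:Offset:HexSignature. The signature name and hex bytes are invented for the example, the validation is deliberately narrow (plain hex digits plus the "?" and "*" wildcards), and the helper name ndb_line is hypothetical.

```python
import re


def ndb_line(name, hex_signature, target_type=0, offset="*"):
    """Format one extended-signature line for a ClamAV .ndb database.

    target_type=0 means any file type and offset="*" means any offset.
    """
    if not re.fullmatch(r"[0-9A-Fa-f?*]+", hex_signature):
        raise ValueError("signature must be hex digits with optional ?/* wildcards")
    if ":" in name:
        raise ValueError("signature name must not contain ':'")
    return f"{name}:{target_type}:{offset}:{hex_signature}"
```

A file made of such lines could then be supplied to clamscan through its --database option when scanning a directory of variants.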
From Tables 4-6, it can be seen that most of our signatures obtained in Step-4 detected the polymorphic variants, except for two of the 57 signatures that detected none of the variants (not shown in Tables 4-6). In the case of the W32.Kitti virus, for 26 out of the 28 most effective signatures the detection rates were 100% and for the remaining two, the detection rates were over 99% (Table 6). In the case of the W32.CTX virus, for four out of the eight most effective signatures the detection rates were 100% and for the remaining four, the detection rates were over 91% (Table 5). For the JS.Cassandra virus, the detection rates were above 92% using seven of the 12 signatures (Table 4). From Tables 4-6 (based on the detection ratio, accuracy and statistical measures, such as sensitivity, specificity, etc., needed for malware detection), it can also be seen that none of the top five AVS products fully detected the polymorphic variants except for Kaspersky Anti-Virus, which successfully detected all of the new polymorphic variants of the W32.Kitti virus. In some cases, the top five AVS products could only successfully detect the original virus and none of its variants (either known or unknown).
The eleven meta-signatures, obtained from Step-7, were tested on the three viruses along with their known and unknown variants using Clamscan by placing these meta-signatures inside our own generated (.ndb) database [29]. Figure 2 shows that all 352 (accuracy of 100%) JS.Cassandra variants (including the original virus) were successfully detected by the Clamscan antivirus scanner using our .ndb database. One of the three meta-signatures obtained for JS.Cassandra in Step-7 detected all 352 JS.Cassandra variants (output is shown in Figure 2). The other two meta-signatures detected 340 out of 352 (with an accuracy of 96.59%) and 15 out of 352 (with an accuracy of 4.26%) JS.Cassandra variants, respectively.
Figure 3 shows that all 200 of the W32.CTX variants (including the two original viruses) were successfully detected by the Clamscan antivirus scanner. One of the three meta-signatures obtained for the W32.CTX virus in Step-7 detected all 200 W32.CTX variants (as shown in Figure 3) while another detected 189 of the 200 variants (94.5% accuracy). However, the final meta-signature detected only 19 of the 200 W32.CTX variants (9.5% accuracy).
Figure 4 shows that all 1106 of the W32.Kitti variants (including the original virus) were successfully detected by one of the three successful (with 100% accuracy) meta-signatures. The remaining two out of the overall five meta-signatures detected none of the 1106 variants.
None of the scans (as shown in Figures 2-4) took ... (0.024% false positive rate) using the six signatures, satisfying the false positive rate requirement of 0.1%.
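The accuracy figures quoted for the meta-signatures follow directly from the detected and total counts. A short check, assuming accuracy here simply means detected variants divided by all variants of the family, rounded to two decimal places (the function name is illustrative):

```python
def detection_rate(detected, total):
    """Percentage of variants detected, rounded to two decimal places."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * detected / total, 2)


# Counts reported in the text for the partially successful meta-signatures.
cassandra_rates = [detection_rate(340, 352), detection_rate(15, 352)]
ctx_rates = [detection_rate(189, 200), detection_rate(19, 200)]
```

These reproduce the 96.59% and 4.26% figures for JS.Cassandra and the 94.5% and 9.5% figures for W32.CTX.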

Conclusions
The aim of our research was to test whether increasingly sophisticated gap open and extend penalties help to produce signatures capable of capturing new polymorphic variants. The results indicate that relatively sophisticated gap penalties captured known variants (training set) of the JS.Cassandra virus (see Figure 2).
Furthermore, the increasingly sophisticated gap penalties captured unknown variants (test set) of the W32.CTX and W32.Kitti viruses, respectively, indicating the feasibility of more sophisticated gap open and gap extend facilities (see Figure 3 and Figure 4). Remarkably, our research demonstrated that it is possible to detect known (training set) as well as unknown (test set) variants using the training signatures obtained from a very small proportion (typically 3% and below) of training variants of that test family. Detection of test variants using the training signatures could revolutionize our understanding of the detection and generation of polymorphic variants. The three virus families selected are 5-11 years old. But as our analysis shows, current AVS products still cannot successfully and consistently identify all their known variants (see Table 1, Table 4, Table 5 and Table 6).
As can be seen from our research, significant concerns exist as to whether modern AVS software systems can or will identify new/unknown (future) variants of polymorphic malware.The ultimate goal for any future, smart AVS would be to identify all potential new/unknown (future) polymorphic variants

V. Naidu et al. DOI: 10.4236/jis.2017.84020, Journal of Information Security

the same viral function can appear in many different physical code forms, it has been posited that only semantic analysis will reveal commonalities among variants of the same virus for effective signature generation. As a result, syntactic techniques for signature extraction based on structural detection of malware are relatively unexplored in comparison to semantic techniques, and so there is very little in the way of related literature. What literature there is is discussed in Section 3. In order to understand syntactic-based polymorphism detection techniques, it is useful to consider a simple example of linguistic signature extraction. Consider the following structurally-related sentences, where the first sentence is the original sentence, and the other three are polymorphic versions of it:

mining rule-extraction techniques to the extracted signatures. Such signatures/meta-signatures can, in turn, be employed as rule-based string templates for creating more specific, variant-oriented polymorphic malware signatures for detecting known variants belonging to the same virus family. In other words, previous work has shown how to progress syntactically (i.e. without execution traces) from viral code consensus identification for a set of variants of the same virus family (training set) to generation of signatures in either a regular expression or rule format for identification of other known variants of the same virus family (test set).
Hex (hexadecimal) dump extraction (Step-1) and testing (Step-8) were undertaken on a stand-alone system to prevent possible unintended infection of other systems. Downloading of polymorphic malware (and known variants) as well as the generation of unknown variants was performed using "Oracle VM VirtualBox" [54] (an x86 software package with virtualization capability) with a pre-installed Linux-based (Ubuntu) operating system image. Due to possible security sensitivity, some of the methods below (Step-1 and Step-8) are not described in detail; in particular, details concerning the generation of hex dumps from polymorphic malware are omitted. Interested readers are requested to contact the corresponding author, using their academic email addresses, for further information. Our method consists of eight steps (see Figure 1 below).
Step-1 deals with virus code variant generation and separating the training set from the test set. Step-2 deals with converting the hex code into a form acceptable for sequence alignment. Because variant generation leads to variable-length code, Step-3 deals with the first pairwise (local) sequence alignment on the training set using the SWA to produce equal-length sequences for consensus extraction. Gap open and gap extend penalties are introduced in this step. Step-4 deals with the extraction of common training subsequences (i.e. consensuses, or signatures) using a similarity measure. Step-5 deals with multiple sequence alignment on these training signatures. Step-6 deals with the extraction of consensuses after multiple sequence alignment. Step-7 deals with the second pairwise (local) sequence alignment between the consensuses (obtained from Step-6) and the training set (obtained from Step-2) using the SWA, and with the extraction of meta-signatures. Lastly, Step-8 deals with converting signatures back into viral hex code for the purposes of signature and meta-signature testing. More details concerning each step are provided below.
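The pairwise local alignment used in Step-3 and Step-7 can be sketched as follows. This is a minimal Python illustration of Smith-Waterman local alignment with affine gap penalties (Gotoh's recurrences), not the JAligner implementation used in this work. The function name, the +1/-1 identity-style scoring, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are assumptions made for the example.

```python
def smith_waterman_affine(a, b, gap_open=10, gap_extend=1, match=1, mismatch=-1):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    n, m = len(a), len(b)
    neg = float("-inf")
    # H: best local score ending at (i, j); E: ending with a gap in a; F: gap in b.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Trace back from the best cell, tracking which matrix we are in.
    out_a, out_b = [], []
    i, j, state = bi, bj, "H"
    while True:
        if state == "H":
            if H[i][j] == 0:
                break  # a local alignment starts where the score resets to zero
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
            elif H[i][j] == E[i][j]:
                state = "E"
            else:
                state = "F"
        elif state == "E":  # gap in a, consuming one symbol of b
            out_a.append("-")
            out_b.append(b[j - 1])
            if E[i][j] == H[i][j - 1] - gap_open:
                state = "H"
            j -= 1
        else:  # state "F": gap in b, consuming one symbol of a
            out_a.append(a[i - 1])
            out_b.append("-")
            if F[i][j] == H[i - 1][j] - gap_open:
                state = "H"
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Low penalties let the alignment bridge a one-symbol insertion.
print(smith_waterman_affine("AAAATTTT", "AAAACTTTT", gap_open=1, gap_extend=1))
# With the fixed combination from Step-7 (10, 1) the same gap is too costly to open.
print(smith_waterman_affine("AAAATTTT", "AAAACTTTT", gap_open=10, gap_extend=1))
```

Comparing the two calls shows how the choice of gap open and gap extend penalties changes which common regions the alignment reports, and this is the effect that the different penalty combinations are meant to explore.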

5.7. Step-7 (Second pairwise (local) sequence alignment using the SWA and extraction of meta-signatures): In this step, a pairwise (local) alignment between the consensus and the sequence of the original virus/variant was performed using the SWA with an ID matrix using JAligner [67]. In total, three separate pairwise local alignments were performed, one for each type of polymorphic malware. The fixed combination of gap open (i.e. 10) and gap extend (i.e. 1) penalty (as used in [29] [30] [31]) was used in this step. The outcome of this alignment is a common substring, or meta-signature, that will be used to detect all the known (and the unknown/new) polymorphic variants of that family. In total, three meta-signatures for the JS.Cassandra virus, three meta-signatures for the W32.CTX/Cholera virus and five meta-signatures for the W32.Kitti virus were extracted in this step. One of the eleven common substrings (i.e. the meta-signatures), of sequence length 56, obtained from this step for the JS.Cassandra virus is shown below in hex representation:

28272b4d6174682e726f756e64284d6174682e72616e646f6d28292a

Amino Acid to Hex Conversion and Meta-Signature (and Signature) Testing
Step-8 (Converting the sequences back into viral hex code and signature testing): In this final step, the eleven meta-signatures from Step-7 (and the 57 signatures obtained in Step-4) in their amino acid sequence representation were converted back to hexadecimal format for testing purposes. The eleven hex meta-signatures and the 57 signatures obtained in Step-4 were tested against the three polymorphic malware types, along with their known and unknown variants, using ClamAV (i.e. the Clamscan antivirus scanner) software. One of the eleven hex meta-signatures, with a sequence length of 76, obtained from this step for the JS.Cassandra virus is shown below:

393939292b273d3d272b4d6174682e726f756e64284d6174682e72616e646f6d28292a393939

5.8. Summary
By downloading the JS.Cassandra polymorphic virus and its known variants in its original JavaScript coding, as well as generating new (unknown) variants of the other two viruses, the authenticity of the variants has been assured. By checking all 18 training variants against a number of AVS systems, we have provided assurance that these variants are genuinely malicious. The first pairwise alignment was conducted using ten different combinations of gap open and gap extend penalties, and the second pairwise alignment was conducted using a fixed combination of gap open (i.e. 10) and gap extend (i.e. 1) penalties. There were no gap open and gap extend penalty options available for the process of multiple sequence alignment. After signature extraction, all biologically-represented signatures and meta-signatures were converted back to hex code for evaluation (details below). All the signature/meta-signature testing against the polymorphic variants was conducted using the latest version of the Clamscan antivirus scanner [66].
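As an illustration of how a hex meta-signature of this kind can be packaged for Clamscan, the Python sketch below decodes the 56-character JS.Cassandra meta-signature quoted above and formats it as a line in ClamAV's extended (.ndb) signature format, MalwareName:TargetType:Offset:HexSignature. The signature name, the target type 0 (any file) and the offset * (any position) are assumptions chosen for the example; the settings actually used in the database are not stated here. The matches helper is only a naive stand-in for what a scanner does with such a body signature.

```python
META_SIG_HEX = "28272b4d6174682e726f756e64284d6174682e72616e646f6d28292a"


def ndb_entry(name: str, hex_sig: str, target_type: int = 0, offset: str = "*") -> str:
    """Format one line of a ClamAV .ndb database (name and defaults are hypothetical)."""
    bytes.fromhex(hex_sig)  # raises ValueError if the signature is not valid hex
    return f"{name}:{target_type}:{offset}:{hex_sig.lower()}"


def matches(hex_sig: str, data: bytes) -> bool:
    """Naive check: does the byte pattern occur anywhere in the data?"""
    return bytes.fromhex(hex_sig) in data


# The signature bytes are printable ASCII taken from the JavaScript source.
print(bytes.fromhex(META_SIG_HEX).decode("ascii"))
print(ndb_entry("Meta.JS.Cassandra.1", META_SIG_HEX))
```

A file containing such lines could then be passed to the scanner with its database option, for example `clamscan -d meta.ndb <directory>`, where `meta.ndb` is a hypothetical file name.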

Figure 2. Screenshot of the scan result obtained from Clamscan antivirus scanner for 352 JS.Cassandra viral variants using the meta-signature.

Figure 3. Screenshot of the scan result obtained from Clamscan antivirus scanner for 200 W32.CTX viral variants using the meta-signature.

Figure 4. Screenshot of the scan result obtained from Clamscan antivirus scanner for 1106 W32.Kitti viral variants using the meta-signature.
[31] has shown that sequence alignment techniques supplemented with the Smith-Waterman algorithm lead to signatures that generalized successfully to unseen but previously known variants of polymorphic viruses. This prior work adopted a fixed combination of gap open and gap extend penalties for the automatic generation of virus signatures. However, it is not known how well this method generalizes to new, unknown variants or what the effect of gap penalties is. In this paper, we use ten different combinations of gap

Table 1. Detection ratio based on the 55 state-of-the-art AVS products obtained from the VirusTotal website for the 18 malicious variants.
obtained from various sources concerning polymorphic versions (details available on request). The percentage of training to test ratio of training variants [...] polymorphic viruses, respectively. Hex dumps were then extracted from the 18 variants using "sigtool" (available on the ClamAV ("Clam AntiVirus") [66] website). A severely reduced proportion of training to test samples was used to reflect the current difficulty in identifying signatures that generalize from a small,

Table 2. Rules for converting hexadecimal into amino acid.
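The conversion between hex code and an amino acid alphabet (Step-2, reversed in Step-8) amounts to a one-to-one symbol substitution. The Python sketch below shows the idea with a hypothetical mapping of the 16 hex digits onto 16 amino acid letters; it is not the rule set of Table 2, which is not reproduced here, and the function names are illustrative.

```python
# Hypothetical one-to-one mapping; the rules actually used are those of Table 2.
HEX_DIGITS = "0123456789abcdef"
AMINO_LETTERS = "ACDEFGHIKLMNPQRS"
HEX_TO_AA = dict(zip(HEX_DIGITS, AMINO_LETTERS))
AA_TO_HEX = {aa: h for h, aa in HEX_TO_AA.items()}


def hex_to_amino(hex_code: str) -> str:
    """Rewrite a hex dump as a protein-like sequence for alignment tools."""
    return "".join(HEX_TO_AA[ch] for ch in hex_code.lower())


def amino_to_hex(sequence: str) -> str:
    """Convert an extracted signature back into hex for scanner testing."""
    return "".join(AA_TO_HEX[ch] for ch in sequence.upper())


encoded = hex_to_amino("28272b4d6174682e")
print(encoded)
print(amino_to_hex(encoded))
```

Because the mapping is a bijection on the hex digits, any signature extracted in the amino acid representation converts back to exactly one hex string.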
Ten different combinations of gap open and gap extend penalties were used while conducting the pairwise local alignments. A gap penalty of zero means no penalty for any gaps introduced in the alignment.

Table 3. Results of the pairwise local alignments that were performed in Step-3.

Table 4. Detection rates for JS.Cassandra polymorphic malware and its known variants, obtained by testing the top five 2016 [71] state-of-the-art AVS products and our top signatures obtained in Step-4 using Clamscan.

Table 5. Detection rates for W32.CTX/W32.Cholera polymorphic malware and its new/unknown variants, obtained by testing the top five 2016 [71] state-of-the-art AVS products and our top signatures obtained in Step-4 using Clamscan.

Table 6. Detection rates for W32.Kitti polymorphic malware and its new/unknown variants, obtained by testing the top five 2016 [71] state-of-the-art AVS products and our top signatures obtained in Step-4 using Clamscan.