Identifying Cancer Disease through Deoxyribonucleic Acid ( DNA ) Sequential Pattern Mining

This paper aims to propose the sequential pattern discovery method of Deoxyribonucleic Acid (DNA) sequence database in order to identify cancer disease. The DNA which is composed of amino acids of gene P53 is mutated. It effects to change of P53 formation. Sequential pattern discovery is a process of extracting data to generate knowledge about the series of events that has the sequences in a certain frequency so that it creates a pattern. PrefixSpan is to propose method to find a pattern of DNA sequence database. As a result, there are various selected patterns of DNA sequence. The pattem which has high similarity is used as biomarker to identify the breast cancer disease. The performance measure of support value average is 0.8. It means that the frequent sequence pattern is high. Another measure is confidence. All of the confidence values are 1. Then, the last performance measure is lift ratio at average more than 1. It means that the composed sequence items in the pattern has high dependency and relatedness. Futhermore, the selected patterns are applied as biomarker with accuracy as 100%.


Introduction
Cancer is classified as malignant and deadly disease.The cancer disease is effective to uncontrolled growth of cells and gene mutations, i.e. gene of P53.This disease changes P53 protein sequence [1].This protein consists of a combination of 20 amino acids which are synthesized by ribosomes and are performed based on the genetic code of the Deoxyribonucleic Acid (DNA).If the DNA is mutated, then the protein composition will be incorrect.Continuously, it is effective to a variety of diseases and disorders such as cancer.Therefore, early detection can be conducted by analyzing of protein sequence through the blood test.The most frequently altered gene of P53 or TP53 mutations is found in human cancers.There are 30,000 somatic mutations of various cancer types of TP53 database which is collected over 20 years.Recently, the methodology of sequencing cancer genome impacts on the healing and data management [2].According to Soussi, P53 mutation analysis of the pattern has become essential to investigate the cause of cancer.His test result shows that infrequent of gene mutation is associated to the normal activity of the P53 protein [2].
A field that is expected to provide his role is bioinformatics.This field is a knowledge discipline that combines the study of molecular biology, mathematics and information technology (IT).It is defined as the application of computational tools and analysis to capture and interpret molecular biology data.Molecular biology itself is also an interdisciplinary field in molecular level of life sciences [3].Bioinformatics has a very important role, including the data management for molecular biology, especially DNA sequences and has huge volume of genetic information.
Many researches are conducted to find genome mutation using DNA sequence approach, identifying cancer patients through DNA micro-array to classify class of cancer using Genetic algorithm and K-means [4].Another research is to find Ebola disease through DNA sequence pattern discovery [5].Therefore, this research is to propose method of sequence pattern mining to identify cancer disease through DNA sequence discovery.Sequential pattern discovery is a part of data mining task that generates knowledge about the series of events that have frequent occur with specified threshold value [6].The pattern is expected to use as biomarker of cancer disease.In this research is used PrefixSpan algorithm to discovery its pattern of DNA sequence database.It is a method of Sequential Pattern Discovery which has high performance of computational time [7].This paper is organized as follows.In Section 2, it consists of literature studies including sequential genome, Sequential Pattern Discovery and performance measure evaluation.The design and implementation method of Prefix span is presented in this section.Then experimental result and analysis is provided in Section 3. The last section contains conclusion of this paper.

Research Method
In this research is started on data collection which is taken from NCBI.The related literature is sequential pattern discovery, prefix span algorithm and evaluation of the performance measure.

Sequencing Genome
DNA sequencing is a process of determining three million nucleotide bases order which consist of adenine, guanine, cytosine and thymine (A, T, G, C) in a DNA molecule.However, sequencing genome is the determination of the nucleotide sequence of DNA bases in the genome or in the body of an organism.
Sequencing results are expressed in the form of a sequence of letter nucleotide bases in specific DNA sequence, for example AGTCCGCAGGCTCGGT.Sequencing genome is always compared to coding process, whereas the sequencing process is not only defining a code, but also analogous genome sequence of letter from a mysterious language.It has an important and specific meaning.
There is a few mutations of DNA sequence that can effect to disorder gene and then it can be diseases.Hence, in medicine, sequencing genome is used to diagnose cancer diseases.A cancer is caused of abnormal DNA sequence construction in body cell.

Sequential Pattern Discovery
Sequential pattern mining is a data mining process that generates knowledge about the sequence of events that have frequent occur exceeds the threshold value [6].Sequential pattern is a derived pattern from the association rules, because of the both indicate events relationship.The difference is that the sequential pattern of events focused on finding patterns of event that appear after another event.It is consider to time sequence.However, association rules is a pattern of events that appear with other events together.

PrefixSpan Algorithm
This research is conducted to find the pattern of DNA sequence using Prefix Span for patient having breast cancer.PrefixSpan (Prefix-Projected Sequential Pattern Growth) is a development approach that uses an algorithm to search for sequential pattern sequences.PrefixSpan will seek frequent sequences of the elements and then develop the sequences by adding elements one by one.As a result, the additional sequence is still the previous sequence.This way is not necessary to generate and test for candidates.There are two steps to find the candidates, including Prefix and projected database.

Prefix
3) all items in ( ) n m a a − ′ as alphabetic is appear after m a′

Projected Database
Given the sequence A and B so that B is subsequence of A. A sequence A' is subsequence of A is projected of prefix B, iff: 1) A' has refix B 2) There is no supersequence A" of A' so that A" is sub-sequence of A and has prefix B The pseudocode of PrefixSpan algorithm is shown in Figure 1 [7] The pseudocode of Prefix Span algorithm in Figure 1 is designed and applied to the system as shown in Figure 2. The initial step is input for DNA sequence database and minimum support as constraint.The second step is sequence numbering.It is based on transaction identifier to be sorted by length of string.The next step is scanning sequence database.It is to get the DNA sequential patterns and to count their frequency based on minimum support.The next step is divide search space to find the projected database and sequence pattern.In this step, it is retrieved a subsequence of their patterns and frequencies.If the number of frequencies is greater than or equal to the minimum support, then it is added as a postfix.These steps are repeated recursively until there is no subsequence of pattern.

Evaluation
There are several measurements to know the performance of system including support, confidence, lift ratio and accuration rate.
• Support It is probability of frequent itemset in whole transaction.It is ratio of itemset transaction as in Equation ( 1) [6].
( ) support number of transactions containing both A and B total number of transactions • Confidence It is measurement that shows relation between two items conditionally as in Equation ( 2) [6]. where: ( ) = the number of antecedent in transaction

• Lift ratio
Lift ratio is a measure to know the strength of constructed rule of sequential pattern mining algorithm.The value of lift ratio is between 0 and unlimited.It has definition as below: [8] -If lift ratio value is less than 1, then it means that rule antecedent will be effect to negative of rule consequence.-If lift ratio value is equal to 1, then the rule is frequent but it is independent.
-The others, if lift ratio value is more than 1, then it means that the rule antecedent will be effect to positive of rule consequent.It is recommended value.The formula of lift ratio is stated in Equation (3) and Equation ( 4).

(
) ( ) where:  ( ) = the number of consequence in transaction  m = the number of transaction

• Accuracy Rate
The accuracy rate in this research is purposed to find a motif (pattern) of DNA sequences.If the pattern (without consider to contiguous sequence) is match then it indicates that the accuration rate is 100%.
In this research, it is used modified method of string matching algorithm approach.The pattern of string is not exactly the same but it is considered to the same sequence pattern.It is evaluated using threshold minimum support.As illustration, the algorithm can be described as the below example: SequenceID1: gttggctctg The pattern: attca Output: NotMatching SequenceID2: actgtaccac The pattern: attca Output: Matching (accuracy rate = 100%)

Results and Analysis
The data set of this experiment is taken from DNA sequence of patient's breast cancer.The DNA sequence database which is taken from 7 th exon of ten human as is shown in Table 1.The Exon is a subset of DNA sequence to construct the code of protein.
As ilustration, the interface of application is shown at Figure 3.The DNA sequence database as input is taken from NCBI with minimum support and sequence length as threshold.Then the sequence data is performed as transaction data before applying Prefix Span method.

Experimental Result
Several thresholds such as minimum support, maximum sequence length is given to the system and as a result, the sequence pattern of DNA is constructed.As illustration, it is given minimum support count = 7 and sequence length = 10.It is applied to the 7th exon of human#1, then the patterns are found as in Figure 4.
A pattern, <aac> or a  a  c, which is shown in Figure 4 means that it is composed from sequence a is followed by a and the latest is c.This pattern occurred minimum seven times per ten sequences.
After finding several patterns in each human, it is retrieved a certain pattern of DNA sequence database which has high similarity.In this experiment, the data sets are the 7th of exon five humans(human #1, human #2, human #3, human #4, and human #5) for pattern retrieval.By giving various minimum support, the selected patterns are applied to five others DNA sequences database of human (human #6, human #7, human #8, human #9, human #10).Then, the selected pattern and accuracy rate are shown in Table 2.

Analysis
In order to know performance of the system, it is evaluated based on performance measure of support, confidence, lift ratio and accuracy.First, the experiment is applied to the training data with various minimum support and sequence length of 7 th exon for human #1, human #2, human #3, human #4, and human #5.
The five graphs at Figure 5 up to Figure 9 show that the performance measurement including lift ratio and support of the constructed pattern tends to be   stable.The best performance is at minimum support in range 6 and 11 at the sequence length 10.The both measurements almost close to 1. Also, for the confidence is always at value 1.It means that the selected patterns have high strengthness and correctness.Then, the second experiment is used various sequence length with minimum support 7. As a result, the patterns are constructed with performance measure as shown in Figure 10 up to Figure 14.Almost all graphs on the Figure show that lift ratio and support value are stable.They have the best performance within interval range 6 -12 of sequence length.However, the confidence value is not depend on the sequence length.It is always at 1.
In general, the experimental result is presented that the more minimum support, the less number of the selected pattern is as shown in Table 2. Also, the maximum minimum support is ten for selecting pattern of five human.All accuration rates are 100% (match within pattern sequence) as in Table 3.This shows that the selected patterns are appear more frequently than and equal to     the threshold as minimum support and sequence length.Also, the maximum selected pattern is ten times.As illustration, one of the threshold is minimum support = 7 and the sequence length = 10.Then, there are two resulted patterns, such as <agtc> and <actc>.The both patterns, a  g  t  c and a  c  t  c, are found minimum seven times per length 10 of DNA sequences.
Furthermore, the computational time of proposed method for DNA sequential pattern mining is evaluated as at Table 4.The trend of this performance is related to the number of constructed rule (pattern) as in Table 5.It has similar characteristic to the selected pattern.The more minimum support value is given, the less time computation is required.

Conclusion
Sequential pattern discovery is a method in data mining task to find the pattern    on DNA sequence database of patient's cancer disease.By giving the threshold such as various minimum support and sequence length, different sequence patterns are obtained.In order to know reliability of the system, it is evaluated using three performance measures, including support, confidence, and lift ratio.The experimental results show that almost all the support value is closed to 1.It means that the probability of co-occurrence for sequence-items in the pattern is high.Also, the confidence value is always 1.This shows that the pattern has believable value at 100%.Furthermore, the lift ratio is almost always more than 1.This means that the patterns which consist of items are dependent each other.
After applying to other dataset, i.e. five sequence DNA database from different human, it is achieved the accuration rate as 100%.It means that the pattern selected is valid with maximum appearance of items for ten times in each 10 sequential items.Therefore, the patterns can be used to define the DNA sequence characteristic (motif) of the patient's cancer diseases.

Figure 3 .
Figure 3. Interface of sequential pattern mining for cancer patient's DNA sequence.

Figure 6 .
Figure 6.Performance measure of DNA sequence for human #2 with various minimum supports.

Figure 10 .
Figure 10.Performance measure of DNA sequence for human #1 (by various sequence length).

Figure 11 .
Figure 11.Performance measure of DNA sequence for human #2.

Figure 12 .
Figure 12.Performance measure of DNA sequence for human #3.

Figure 13 .
Figure 13.Performance measure of DNA sequence for human #4.

Figure 14 .
Figure 14.Performance measure of DNA sequence for human #5.

Table 3 .
Accuracy rate of the selected pattern.

Table 4 .
The running time of sequential pattern mining for five humans.

Table 5 .
The number of sequence pattern for five humans.