
We analyzed DNA sequences using a new measure of entropy. The general aim was to find interesting sections of a genome using a new formulation of Shannon-like entropy. We developed this measure for any non-trivial graph or, more broadly, for any square matrix whose non-zero elements represent probabilistic weights assigned to connections or transitions between pairs of vertices. The new measure, called the graph entropy, quantifies the aggregate indeterminacy effected by the variety of unique walks that exist between each pair of vertices. The tool is shown to be uniquely capable of revealing CRISPR regions in bacterial genomes and of identifying tandem repeats and direct repeats in a genome. We ran experiments on 26 species and found many tandem repeats and direct repeats (CRISPRs in bacteria and archaea). Several separate CRISPR finders and tandem repeat finders already exist, but our entropy can detect both features when they are present in a genome.

Deciphering the enormously long nucleotide sequences being uncovered in the human genome is one of the major challenges of our day. Along with serious ethical issues, we encounter a series of tremendously hard scientific problems. These problems mainly arise from the fact that, although sequencing techniques are almost completely automated, the analysis of the sequenced data is not. Hence, the major goal of the Human Genome Project is the extraction of biologically and medically relevant information from almost automatically sequenced DNA and RNA molecules. In principle, biochemical methods are able to do this job, but since they are extremely expensive and time-consuming, there is a high demand for alternative approaches to extract the information hidden in the genome [

The motivation for this study is to analyze DNA sequences with an information-theoretic tool in order to locate interesting sections of a genome that have repeating features.

In many organisms, the genomic DNA is highly repetitive, accounting for close to 5% of the genome size [

In our study we also focus on a family of repeats known as Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) [

Several software applications are available for identifying various forms of repeats in [

A graph is an object that consists of a non-empty set of vertices and a set of edges. Vertices are often called nodes, and edges are referred to as connections. The set of edges may be empty, in which case the graph is just a collection of points.

We say that two vertices i and j of a directed graph are connected if there is an edge from i to j or from j to i. Suppose we are given a directed graph with n vertices. We construct an n × n adjacency matrix A associated with it as follows: if there is an edge from vertex i to vertex j, we put 1 as the entry in row i, column j of the matrix A; if there is no edge, we put 0.

If one can walk from vertex i to vertex j along the edges of the graph, then we say that there is a path from i to j. If the walk uses k edges, then the path has length k. For matrices, we denote by A^{k} the product of k copies of A. The entry in row i, column j of A^{2} is the number of paths of length 2 from vertex i to vertex j in the graph.

Let us consider a directed graph and a positive integer k. Then the number of directed walks from vertex i to vertex j of length k is the entry on row i and column j of the matrix A^{k}, where A is the adjacency matrix.
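The walk-counting property can be checked on a small example. The following Python sketch uses a hypothetical 3-vertex directed graph (not one from this paper) and plain list-of-lists matrices; the helper names `mat_mul` and `mat_pow` are illustrative only.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, k):
    """Return A^k for a positive integer k by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = mat_mul(result, A)
    return result


# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]

A2 = mat_pow(A, 2)  # entry (i, j) counts walks of length 2 from i to j
A3 = mat_pow(A, 3)  # entry (i, j) counts walks of length 3 from i to j

# Example: the only walk of length 2 from vertex 0 to vertex 2 is 0->1->2.
print(A2[0][2])  # 1
```

For instance, `A3[0][0]` equals 1 because the only closed walk of length 3 starting at vertex 0 is 0->1->2->0.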

In this section, we discuss the entropy of such an adjacency matrix A. Let

Let

Note that

We noticed that the sum of all entries of the matrix

where

ATGCCTGATGCGACGC

Taking 2-letter nodes with one overlap, we can create a graph as follows:
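One plausible reading of this construction can be sketched in Python. The sketch assumes that "2-letter nodes with one overlap" means consecutive 2-letter words that share one letter, and that an edge joins each word to the word that follows it in the sequence. The function names are hypothetical, and a 0/1 adjacency matrix is used; weighting edges by their counts would be an equally reasonable assumption.

```python
from collections import Counter


def kmer_nodes(seq, k=2, overlap=1):
    """Split seq into k-letter words; consecutive words share `overlap` letters."""
    step = k - overlap
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def transition_counts(words):
    """Count how often each word is immediately followed by each other word."""
    return Counter(zip(words, words[1:]))


seq = "ATGCCTGATGCGACGC"
words = kmer_nodes(seq)            # AT, TG, GC, CC, CT, TG, GA, ...
edges = transition_counts(words)   # e.g. ('AT', 'TG') occurs twice

# Assumed 0/1 adjacency matrix over the distinct words, in sorted order.
nodes = sorted(set(words))
index = {w: i for i, w in enumerate(nodes)}
A = [[0] * len(nodes) for _ in nodes]
for (u, v) in edges:
    A[index[u]][index[v]] = 1

print(len(words), len(nodes), len(edges))  # 15 8 12
```

Under these assumptions the example sequence yields 15 words, 8 distinct nodes and 12 distinct directed edges.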

We draw a graph as in the

For our sequence, graph entropy

We downloaded a wide range of genome data, covering eukaryotes (animals, plants, insects, fungi) and prokaryotes (bacteria, archaea), from GenBank: ftp://ftp.ncbi.nlm.nih.gov/genomes/.

We implemented the Graph Entropy Algorithm in MATLAB and converted the data to MATLAB format. We then computed the graph entropy by scanning the data with a typical sample size of 512 base pairs (bp) and a step size of 10 bp, taking 3-letter nodes with 1 overlap. We have drawn graphs of entropy versus genome length for Acidovorax bacteria in
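The scanning step can be sketched in Python as follows. The window size (512 bp), step size (10 bp) and 3-letter nodes with 1 overlap come from the text; the function names are hypothetical. The score used here is the ordinary Shannon entropy of the observed node-to-node transition frequencies, a stand-in chosen only for illustration and not the graph entropy defined in this paper.

```python
import math
from collections import Counter


def window_starts(n, window=512, step=10):
    """0-based start offsets of all full windows over a sequence of length n."""
    return range(0, n - window + 1, step)


def standin_entropy(segment, k=3, overlap=1):
    """Shannon entropy (bits) of word-to-word transition frequencies.

    Assumption: this is NOT the paper's graph entropy, only a placeholder
    score that is likewise low for highly repetitive segments.
    """
    step = k - overlap
    words = [segment[i:i + k] for i in range(0, len(segment) - k + 1, step)]
    counts = Counter(zip(words, words[1:]))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def scan(genome, window=512, step=10):
    """Return (start, score) pairs for every window along the genome."""
    return [(s, standin_entropy(genome[s:s + window]))
            for s in window_starts(len(genome), window, step)]
```

Plotting the second element of each pair against the first would give a curve of the same kind as the entropy-versus-position graphs discussed here; repetitive stretches show up as dips.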

In

The following is the sequence in the interval taken. The colored string is repeating.

ATAAAAAAACCCGGTGCATGCACCGGGTGGGACCAGCCCCGCGGGCGGGGCGGCTGGCTGCTGTCGTCGCTCAGGGCTTGGTGCCCGTCGGGAAGGGCCATGCGGCCTGCGGGTTCAGCGTGGTCTGTGCTGCGGGTGCAGGCGCAGGGGCAGAGGCCTTGGAGGCCGCCTTTTTCGGGGCAGCCTTCTTCGGTGCAGCGGCCTTGGTCGTGCCGGTGGCCTTCTTCGCCGGTGCAGCTGCCTTCTTGGTGGAGGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTCGCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCAGGAGCTGCGGCCTTCTTTGCCGGTGCAGCTGCCTTCTTGGCGGGGGCTGCAGCCTTCTTCGCCGGAGCGGCCTTCTTCGTCGTGGCGGCGGCCTTCTT

The command strfind(g,'GCCGGTGCAGCTGCCTTCTTGG') gave us the following positions of those repeats in the sequence.

871227 871269 871311 871353 871395 871437 871479 871521

The spacers are almost identical. These are tandem repeats.
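The regular spacing of these positions can be verified directly. The sketch below is a Python analogue of the search, with a hypothetical helper `find_all` that returns 1-based start positions of all (possibly overlapping) matches, in the spirit of MATLAB's strfind; the reported positions are copied from the list above.

```python
def find_all(text, pattern):
    """1-based start positions of every (possibly overlapping) match."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def spacing(positions):
    """Differences between consecutive start positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


motif = "GCCGGTGCAGCTGCCTTCTTGG"
reported = [871227, 871269, 871311, 871353, 871395, 871437, 871479, 871521]

periods = spacing(reported)         # every entry is 42
gap = periods[0] - len(motif)       # 42 - 22 = 20 bp between copies

# Toy check of find_all on a synthetic string (not genome data).
toy = (motif + "N" * 20) * 3
print(find_all(toy, motif))  # [1, 43, 85]
```

The eight copies of the 22 bp motif are thus spaced exactly 42 bp apart, leaving 20 bp between consecutive copies.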

Similarly in the

We looked at the DNA sequence in the interval (2926000:2926650) around x = 2926000. The following is the sequence in the interval taken. The colored string is repeating.

AAAAATGCATCCTTCCCGAACGGCAATAGCTGGCACGACGTACGGCTTGATAATCAACAGCATATAGACAAGGCGCTGCCAGGGCGGATTGAGCGCCGTAGCCGCGATGTAGTGCGGATAATGCTGCCGTTGGTAAAAGAGCTGGCGAAGGCGGAAAAAACGTCCTGATATGCTGGTGAAACGTGTTTATCCCCGCTGGCGCGGGGAACACGGACAGCAACCCGTGTCGGATATCAGACAGATCGGTTTATCCCCGCTGGCGCGGGGAACACACGCGAATCGCCAATCGCCGCCGCGTGAATTGCGGTTTATCCCCGCTGGCGCGGGGAACACCCACGATGTATGCCGACCGTGATTTTTACCGCCGGTTTATCCCCGCTGGCGCGGGGAACACAGATACGCCTTTACGTCGCCCTCTTTGGCGCGCGGTTTATCCCCGCTGGCGCGGGGAACACTAAAACACCGGTTGCGCAACCTCCGCGGGGATCGGTTTATCCCCGCTGGCGCGGGGATCGGTTTATCCCCGCTGGCGCGGGGATCGGTTTATCCCCGCTGGCGCGGGGAACACTCTAAATCTACCCAATTGAATTTAAATACTTTTTTAGCGCACAAAAAACCCACCAACTTTTCCTAATTTTTAAAGATCTCTAA

We used strfind(g,'CGGTTTATCCCCGCTGGCGCGGGGAACAC') in MATLAB and found more repeats outside the interval.

2926243 2926304 2926365 2926426 2926539 2943184 (the last position does not belong to this region).

length('CGGTTTATCCCCGCTGGCGCGGGGAACAC') = 29

strfind(g,'GTGTTTATCCCCGCTGGCGCGGGGAACAC'): 2926182

strfind(g,'CGGTTTATCCCCGCTGGCGCGGGGATCGG'): 2926487 2926513

Starts: 2926182 Ends: 2926567.

In the interval (2926182, 2926513) we find three strings differing by 2 to 4 letters.

These repeats form a CRISPR. This is the only CRISPR known so far for this strain of the bacterium.
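Separating the hits that belong to one repeat array from isolated hits elsewhere in the genome can be automated. The Python sketch below groups sorted positions whenever consecutive hits are closer than a threshold; the function names and the 200 bp threshold are assumptions made for illustration, while the positions and the repeat length of 29 are those reported above.

```python
def cluster_positions(positions, max_gap):
    """Group positions so that neighbours within a group differ by <= max_gap."""
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def span(cluster, motif_length):
    """First base of the first repeat and last base of the last repeat."""
    return cluster[0], cluster[-1] + motif_length - 1


exact_hits = [2926243, 2926304, 2926365, 2926426, 2926539, 2943184]
variant_hits = [2926182, 2926487, 2926513]   # the two near-identical strings

clusters = cluster_positions(exact_hits + variant_hits, max_gap=200)
array_span = span(clusters[0], 29)

print(clusters[1])   # [2943184]
print(array_span)    # (2926182, 2926567)
```

With this threshold the hit at 2943184 falls into its own group, and the remaining eight hits span 2926182 to 2926567, in agreement with the start and end reported above.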

Again, we studied the pattern of the DNA sequence of Caldicellulosiruptor kristjanssonii (Bacteria) in intervals around the points of low entropy and found repetitive patterns. In

TATTGCAATTATTGTCCTATGCACAGAGTTTGTAGCCTTCCCGTTGGGGATTGAAACATAGATTTCATTTCGCAGCCAATAGAGCGGTTTATAGTTTGTAGCCTTCCCGTTGGGGATTGAAACCTCAATTTCTGTTTCTCTTTTCTCAATTATTCTTGAGTTTGTAGCCTTCCCGTTGGGGATTGAAACTATAATAGCCCATTCATCAAAAACTTTTTCATCGAAGTTTGTAGCCTTCCCGTTGGGGATTGAAACTATAATAGCCCATTCATCAAAAACTTTTTCATCGAAGTTTGTAGCCTTCCCGTTGGGGATTGAAACCACAAAATTATAGTTTGGCGCAATGTAAACACGAACAGTTTGTAGCCTTCCCGTTGGGGATTGAAACTCTATGTCTTCTTCAAGATACATATCGAGCAGCTTATTGTTTGTAGCCTTCCCGTTGGGGATTGAAACATACTTTTTTTCTCACGGTCTGTATGGCCTGTTCAGT

We noticed repeats and used MATLAB to find the exact locations of that string.

strfind(g,'GTTTGTAGCCTTCCCGTTGGGGATTGAAAC')

Columns 1 through 61

2666352 2666419 2666484 2666551 2666616 2666682 2666748 2666813 2666879 2666947 2667014 2667081 2667147 2667213 2667278 2667344 2667410 2667476 2667544 2667611 2667676 2667741 2667805 2667872 2667939 2670817 2670882 2670949 2671016 2671081 2671147 2671214 2671279 2671345 2671411 2671476 2671544 2671610 2671675 2671740 2671805 2671871 2671936 2672001 2672070 2672135 2672201 2672267 2672333 2672399 2672466 2672534 2672599 2672666 2672733 2672798 2672864 2672930 2672996 2673523 2673590

length('GTTTGTAGCCTTCCCGTTGGGGATTGAAAC') = 30

61 repeats of length 30, with unique spacers

Starts: 2666352 Ends: 2673620 Period: 65/66/67 Total Length = 7268

These repeats are CRISPR.

In

CCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTATATCCACGCAGGCGTTTCCCCTTACCTGCACCGGGCCTGCCGCCCCGTTTACATCCACGCATGCGTTTCCCCTTACCTGCACTG

strfind(g,'TTTCCCCTTACCTGCACCGAGCCTCCATTCCCGTTTATATCCACGCAGGCG')

Columns 1 through 18

44007626 44008952 44009105 44009156 44009258 44009360 44009462 44009513 44009615 44009717 44009819 44009870 44009921 44010023 44010125 44010227 44010278 44010329

The spacers are almost identical to this string, differing in 4 letters (shown in purple). We also searched for the spacer string.

strfind(g,'TTTCCCCTTACCTGCACCGAGCCTCCCGCCCCGTTTACATCCACGCAGGCG')

Columns 1 through 23

44007575 44007677 44007728 44007779 44007830 44007881 44007983 44008034 44008289 44008493 44008646 44008697 44008799 44008850 44008901 44009207 44009309 44009564 44009666 44009768 44009972 44010074 44010176

This is a repeat of a string without any gap in the region (44007575, 44010329).

Discussion

The importance of identifying repetitive sequences is clear; however, the considerable size of many genomes makes fast and efficient repeat detection very challenging. In this paper, we have presented a new algorithm for finding repeats in DNA sequences. The algorithm is based on our new measure of entropy for any non-trivial graph. In [

We have studied the following species:

Eukaryotes: Homo sapiens chromosomes 19 and 21, Anopheles gambiae, Caenorhabditis elegans, Plasmodium falciparum, Saccharomyces cerevisiae.

Prokaryotes: Acidovorax, Ammonifex, Caldicellulosiruptor kristjanssonii, E. coli, Salmonella Typhi, Listeria monocytogenes, Bacillus clausii KSM, Chlamydia muridarum Nigg, Cyanobacterium aponinum, Gluconacetobacter diazotrophicus, Haemophilus influenzae R2866, Mycobacterium tuberculosis, Mycoplasma genitalium, Neisseria meningitidis, Streptococcus pneumoniae, Thermosipho africanus, Truepera radiovictrix (Bacteria), A. fulgidus (Archaea).

Viruses: HIV, Hepatitis B.

After analyzing the DNA sequence at the points of low entropy for all these species, we conclude that low entropy in a genome graph corresponds to high repetitiveness in the sequence. These repeats can be classified as CRISPRs, tandem repeats, or something else.

This paper was written while two of the authors were Summer Faculty Fellows at SPAWAR SYSCEN Atlantic, Charleston, SC, funded by the Office of Naval Research. The authors are thankful to their mentor for his assistance in the work.

Sengupta, D.C. and Sengupta, J.D. (2016) Application of Graph Entropy in CRISPR and Repeats Detection in DNA Sequences. Computational Molecular Bioscience, 6, 41-51. http://dx.doi.org/10.4236/cmb.2016.63004