
The novel coronavirus (SARS-CoV-2), generally referred to as the Covid-19 virus, has spread to 213 countries, with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification of the viral genomic sequence and identification of its origin for planning, containment, and treatment. Motivated by this need, we report two alignment-free methods that are combined with Chaos Game Representation (CGR) to perform clustering analysis and build a phylogenetic tree. To each DNA sequence we associate a matrix, and we define the distance between two DNA sequences to be the distance between their associated matrices. These methods are used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.

Deoxyribonucleic acid (DNA) is a molecule that encodes the genetic instructions used in the development and functioning of all known living organisms. As such, DNA has been a subject of both theoretical and applied studies for decades. DNA is a polymer of nucleotides, its building blocks. The four nucleotides of DNA are adenine (A), cytosine (C), guanine (G), and thymine (T).

DNA sequence analysis, one of the most important parts of bioinformatics and long regarded as a way to reveal the essence of life phenomena, has developed rapidly in recent years. Sequence comparison is crucial for understanding the evolutionary relationships among organisms. Many methods have been proposed to compare genetic sequences. Traditionally, most of these are alignment-based methods, in which molecular sequences are optimally aligned according to a selected scoring system. Alignment-based methods often give high accuracy and may reveal the relationships among sequences, and several such algorithms have been incorporated into software for sequence alignment. However, one of their main drawbacks is that they are very time-consuming and memory-intensive. As a result, alignment-free approaches such as in [

Chaos Game Representation (CGR) is an iterative method originally proposed by Jeffrey [

The motivation for writing this paper came from the recent outbreak of the novel coronavirus (SARS-CoV-2), now commonly known as the Covid-19 virus. SARS-CoV-2 is the third pathogenic novel coronavirus to emerge over the past two decades. The first, discovered in 2003 and named SARS-CoV, caused SARS, a serious and atypical pneumonia. The second, MERS-CoV, emerged a decade later in the Middle East and caused a similar respiratory ailment called Middle East respiratory syndrome (MERS). Since its identification, 2494 cases of MERS-CoV infection and nearly 900 deaths have been documented [

In this paper, we propose two methods, the probability matrix method and the centroid matrix method, which are combined with CGR to construct a distance matrix between genomes; a dendrogram is then created using Hierarchical Agglomerative Clustering (HAC) analysis. Our dendrogram can accurately identify the genetic relationships among different organisms, and the method is generally applicable to various organisms.

In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method and conclude with a description of the two distances that we considered.

Data acquisition: All viral sequences were downloaded in FASTA format from two databases for our analysis: NCBI (https://www.ncbi.nlm.nih.gov/) and GISAID (https://www.gisaid.org/).

For our experiment, we used only complete genomes of 15 coronaviruses, as given in

The method we used to analyze and classify the 15 sequences of the dataset has three steps: 1) generate a graphical representation (image) of each DNA sequence using CGR, and derive the FCGR probability matrix and the CGR centroids from the features of the CGR; 2) compute all pairwise distances to obtain two distance matrices; and 3) create the dendrogram of each distance matrix using Hierarchical Agglomerative Clustering (HAC) analysis.

Virus name | NCBI/GISAID Accession number |
---|---|
1) hCov-19/bat/Yunnan | EPI_ISL_412976 |
2) hCov-19/pangolin/Guangdong | EPI_ISL_410721 |
3) hCov-19/bat/Yunnan/RaTG13 | EPI_ISL_402131 |
4) hCov-19/India | EPI_ISL_431117 |
5) hCov-19/Italy | EPI_ISL_417446 |
6) hCov-19/Iran | EPI_ISL_437512 |
7) hCov-19/Spain | EPI_ISL_428684 |
8) hCov-19/USA | EPI_ISL_431086 |
9) hCov-19/Wuhan | EPI_ISL_412980 |
10) Human Coronavirus-229E | KF-514433 |
11) Human Coronavirus-HKU1 | KF-430201 |
12) Human Coronavirus-NL63 | KF-530114 |
13) Human Coronavirus-OC43 | KF-530099 |
14) SARS-Cov | NC_004718 |
15) MERS | KT-026456 |

CGR is an iterative method introduced by Jeffrey [

For step (1), we use a slightly modified version of the original CGR, introduced in [

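$$
FCGR_1(s) = \begin{pmatrix} N_C & N_G \\ N_A & N_T \end{pmatrix}
\quad \text{and} \quad
FCGR_2(s) = \begin{pmatrix}
N_{CC} & N_{GC} & N_{CG} & N_{GG} \\
N_{AC} & N_{TC} & N_{AG} & N_{TG} \\
N_{CA} & N_{GA} & N_{CT} & N_{GT} \\
N_{AA} & N_{TA} & N_{AT} & N_{TT}
\end{pmatrix}.
$$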

The $(k+1)$th-order $FCGR_{k+1}(s)$ can be obtained by replacing each element $N_X$ in $FCGR_k(s)$ with the four elements $\begin{pmatrix} N_{CX} & N_{GX} \\ N_{AX} & N_{TX} \end{pmatrix}$, where $X$ is a sequence of length $k$ over the alphabet {A, C, G, T}. For each $k \ge 1$, we define a probability matrix of $FCGR_k(s)$ by dividing each entry of $FCGR_k(s)$ by the total count of all $k$-mers. We denote the FCGR probability matrix by $(p_{ij})$, $1 \le i, j \le 2^k$. Note that $\sum_{i,j} p_{ij} = 1$, so the probability matrix can be interpreted as a probability distribution over $k$-mers.
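The FCGR construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results: the function names `fcgr_counts` and `fcgr_probability` are hypothetical, k-mers are assumed to be counted with overlap (a sliding window of step 1), and k-mers containing symbols other than A, C, G, T are assumed to be skipped. The cell of each k-mer follows the layout of $FCGR_1(s)$ and the replacement rule, so the last letter of the k-mer selects the coarsest quadrant and the first letter the finest.

```python
import numpy as np

# (row bit, column bit) for each nucleotide, matching
# FCGR_1 = [[N_C, N_G], [N_A, N_T]].
CORNER = {"C": (0, 0), "G": (0, 1), "A": (1, 0), "T": (1, 1)}


def fcgr_counts(seq, k):
    """Return the 2^k x 2^k matrix of k-mer counts of seq."""
    size = 2 ** k
    counts = np.zeros((size, size), dtype=np.int64)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(ch not in CORNER for ch in kmer):
            continue  # assumption: ignore k-mers with ambiguous bases
        row = col = 0
        # The first letter contributes the least significant bit,
        # the last letter the most significant bit (coarsest quadrant).
        for m, ch in enumerate(kmer):
            r, c = CORNER[ch]
            row += r << m
            col += c << m
        counts[row, col] += 1
    return counts


def fcgr_probability(seq, k):
    """Divide each FCGR entry by the total number of counted k-mers."""
    counts = fcgr_counts(seq, k)
    total = counts.sum()
    if total == 0:
        raise ValueError("sequence contains no valid k-mers of length k")
    return counts / total
```

For example, `fcgr_probability(genome, 7)` would return a 128 × 128 matrix whose entries sum to 1, where `genome` stands for any DNA string read from a FASTA file.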

Since the CGR captures the information of the whole genome, extracting only global features from the CGR may not be effective enough to distinguish the genomes. In the CGR centroid method, we concentrate on extracting local features, as shown in [

For the Chaos Centroid method, the CGR is partitioned into 10 × 10 equal subregions. The choice of 10 is made to minimize the computation time. For each subregion, we compute the centroid as follows. Let $(x_k, y_k)$ be the coordinates of a point in the CGR, and let $|a_{ij}|$ be the number of CGR points falling in subregion $(i, j)$. We define the centroid of each cell of the 10 × 10 grid as

$$
c_{ij} = \left( \frac{\sum_{k=1}^{|a_{ij}|} x_k}{|a_{ij}|},\ \frac{\sum_{k=1}^{|a_{ij}|} y_k}{|a_{ij}|} \right), \quad 1 \le i, j \le 10.
$$
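The centroid computation can be sketched as follows. The sketch assumes the standard CGR construction (start at the centre of the unit square and move halfway toward the corner of each successive nucleotide) with corners A = (0, 0), C = (0, 1), G = (1, 1), T = (1, 0); the modified CGR used in this paper may differ. The function names `cgr_points` and `grid_centroids`, and the choice to place the centroid of an empty subregion at its geometric centre, are illustrative assumptions, since the formula above is undefined when a subregion contains no points.

```python
import numpy as np

# Assumed corner layout of the standard CGR on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return an (n, 2) array of CGR coordinates for a DNA string."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for ch in seq.upper():
        if ch not in CORNERS:
            continue  # assumption: skip ambiguous bases
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # midpoint rule
        points.append((x, y))
    return np.array(points, dtype=float).reshape(-1, 2)


def grid_centroids(points, grid=10):
    """Return a (grid, grid, 2) array with the centroid of each subregion.

    Index i runs over bins of x and index j over bins of y.
    """
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    # Cell index of every point; clip so a coordinate of 1.0 stays in range.
    idx = np.clip((points * grid).astype(int), 0, grid - 1)
    sums = np.zeros((grid, grid, 2))
    counts = np.zeros((grid, grid))
    np.add.at(sums, (idx[:, 0], idx[:, 1]), points)
    np.add.at(counts, (idx[:, 0], idx[:, 1]), 1)
    # Assumption: an empty subregion gets its geometric centre.
    centres = (np.arange(grid) + 0.5) / grid
    centroids = np.stack(np.meshgrid(centres, centres, indexing="ij"), axis=-1)
    occupied = counts > 0
    centroids[occupied] = sums[occupied] / counts[occupied][:, None]
    return centroids
```

Calling `grid_centroids(cgr_points(genome))` then yields the 100 centroids of one genome, where `genome` is a placeholder for a DNA string.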

For step (2), after computing the FCGR probability matrix and the centroids for each sequence in the dataset, the goal was to measure the “distance” between two CGR images. Many distances are available, as given in [

In this section we formally define each of the two distances. For two FCGR probability matrices $(p_{ij})$ and $(p'_{ij})$ we define $d_{ij} = |p_{ij} - p'_{ij}|$. The distance between the two probability matrices is denoted by

$$
D_{PM} = \sum_{i=1}^{2^k} \sum_{j=1}^{2^k} d_{ij}.
$$

For two genomes, we calculate 100 centroids $c_{ij} = (x_{ij}, y_{ij})$ and $c'_{ij} = (x'_{ij}, y'_{ij})$, respectively, for $1 \le i, j \le 10$. We then compute the Euclidean distance between corresponding centroids,

$$
d_{ij} = \sqrt{(x_{ij} - x'_{ij})^2 + (y_{ij} - y'_{ij})^2},
$$

and the centroid distance between the two genomes is denoted by

$$
D_{cd} = \sum_{i=1}^{10} \sum_{j=1}^{10} d_{ij}.
$$
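Both distances reduce to short array expressions. The following is a minimal sketch with hypothetical function names; it assumes the probability matrices are NumPy arrays of equal shape and the centroids are stored as arrays of shape (10, 10, 2), with the last axis holding the x and y coordinates.

```python
import numpy as np


def probability_matrix_distance(p, p_prime):
    """D_PM: sum of absolute entrywise differences of two FCGR probability matrices."""
    p = np.asarray(p, dtype=float)
    p_prime = np.asarray(p_prime, dtype=float)
    if p.shape != p_prime.shape:
        raise ValueError("probability matrices must have the same shape")
    return float(np.abs(p - p_prime).sum())


def centroid_distance(c, c_prime):
    """D_cd: sum over all subregions of the Euclidean distance between centroids."""
    c = np.asarray(c, dtype=float)
    c_prime = np.asarray(c_prime, dtype=float)
    if c.shape != c_prime.shape or c.shape[-1] != 2:
        raise ValueError("centroid arrays must share a shape of the form (..., 2)")
    per_cell = np.sqrt(((c - c_prime) ** 2).sum(axis=-1))
    return float(per_cell.sum())
```

Because the entries of each probability matrix sum to 1, $D_{PM}$ always lies between 0 (identical k-mer distributions) and 2 (no k-mer in common).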

For our dataset we used k = 7, that is, each DNA sequence was represented as a $2^7 \times 2^7$ FCGR matrix. In [

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |
| 2 | 0.3079 |
| 3 | 0.4900 | 0.4606 |
| 4 | 0.5129 | 0.6301 | 0.6303 |
| 5 | 0.7076 | 0.7548 | 0.7506 | 0.7436 |
| 6 | 0.7342 | 0.7737 | 0.7602 | 0.7969 | 0.7858 |
| 7 | 0.8657 | 0.8700 | 0.8443 | 0.9420 | 0.8850 | 0.8406 |
| 8 | 0.8074 | 0.8299 | 0.8037 | 0.8828 | 0.8587 | 0.7247 | 0.7237 |
| 9 | 0.7578 | 0.7904 | 0.7744 | 0.8132 | 0.7894 | 0.7612 | 0.7067 | 0.7470 |
| 10 | 0.4920 | 0.7671 | 0.2929 | 0.6313 | 0.7441 | 0.7714 | 0.8531 | 0.8123 | 0.7846 |
| 11 | 0.4947 | 0.4750 | 0.0600 | 0.6408 | 0.7608 | 0.7614 | 0.8519 | 0.8029 | 0.7827 | 0.3143 |
| 12 | 0.4930 | 0.4677 | 0.0321 | 0.6341 | 0.7553 | 0.7602 | 0.8477 | 0.8028 | 0.7783 | 0.3024 | 0.0299 |
| 13 | 0.4905 | 0.4644 | 0.0180 | 0.6311 | 0.7529 | 0.7601 | 0.8456 | 0.8032 | 0.7757 | 0.2972 | 0.0492 | 0.0200 |
| 14 | 0.4901 | 0.4646 | 0.0179 | 0.6318 | 0.7524 | 0.7595 | 0.8451 | 0.8030 | 0.7748 | 0.2978 | 0.0530 | 0.0254 | 0.0168 |
| 15 | 0.4907 | 0.4623 | 0.0095 | 0.6306 | 0.7514 | 0.7599 | 0.8444 | 0.8037 | 0.7748 | 0.2953 | 0.0583 | 0.0320 | 0.0192 | 0.0192 |

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |
| 2 | 0.4531 |
| 3 | 0.5567 | 0.4439 |
| 4 | 0.5408 | 0.6281 | 0.6188 |
| 5 | 0.9029 | 0.9255 | 0.8784 | 0.7598 |
| 6 | 0.8845 | 0.8718 | 0.8409 | 0.8615 | 0.8762 |
| 7 | 1.4297 | 1.3203 | 1.2682 | 1.3924 | 1.300 | 1.2339 |
| 8 | 1.2246 | 1.0924 | 1.0161 | 1.2011 | 1.200 | 0.9157 | 0.9635 |
| 9 | 1.0256 | 0.9862 | 0.9295 | 0.9869 | 0.9310 | 0.8623 | 0.9538 | 0.9123 |
| 10 | 0.5581 | 0.4575 | 0.3303 | 0.6356 | 0.9271 | 0.8824 | 1.2759 | 1.0163 | 0.9912 |
| 11 | 0.5915 | 0.4816 | 0.1350 | 0.6525 | 0.9115 | 0.8667 | 1.2682 | 1.0432 | 0.9391 | 0.3694 |
| 12 | 0.5654 | 0.4591 | 0.0969 | 0.6312 | 0.8839 | 0.8518 | 1.2604 | 1.0403 | 0.9217 | 0.3446 | 0.0670 |
| 13 | 0.5607 | 0.4576 | 0.0702 | 0.6247 | 0.8837 | 0.8450 | 1.2644 | 1.0326 | 0.9291 | 0.3367 | 0.1156 | 0.0636 |
| 14 | 0.6113 | 0.5127 | 0.1596 | 0.6785 | 0.9097 | 0.8583 | 1.3064 | 1.0584 | 0.9613 | 0.3859 | 0.2254 | 0.1793 | 0.1558 |
| 15 | 0.5460 | 0.4416 | 0.0454 | 0.6167 | 0.8783 | 0.8332 | 1.2680 | 1.0221 | 0.9235 | 0.3290 | 0.1295 | 0.0943 | 0.0721 | 0.1586 |
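Step (3), building a dendrogram from a lower-triangular distance matrix such as the ones above, can be sketched with SciPy's hierarchical clustering routines. The example uses only the first four rows of the first distance matrix, to keep it short. The linkage criterion is not stated in the text, so average linkage (`method="average"`) is an assumption made for illustration, and other criteria may give a different tree.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ["1", "2", "3", "4"]

# Lower triangle of the first distance matrix, rows 1 to 4.
lower = [
    [],
    [0.3079],
    [0.4900, 0.4606],
    [0.5129, 0.6301, 0.6303],
]

# Expand the lower triangle into a full symmetric matrix with zero diagonal.
n = len(lower)
D = np.zeros((n, n))
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        D[i, j] = D[j, i] = value

# linkage expects the condensed (upper-triangular, flattened) form.
Z = linkage(squareform(D), method="average")  # assumed linkage criterion

# no_plot=True returns the tree layout without requiring matplotlib.
tree = dendrogram(Z, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```

Each row of `Z` records one merge: the two clusters joined, the distance at which they were joined, and the size of the resulting cluster. Dropping `no_plot=True` draws the dendrogram when matplotlib is available.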

From

All sequence data contain inherent information that can be measured by Shannon’s uncertainty theory. Measuring uncertainty may be used for rapidly screening sequences for matches in available databases, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis, as seen in [

Our methods are comparable to many other alignment-free methods as shown in [

In conclusion, the results show that our method can accurately classify different genomic sequences. In terms of classification accuracy, our method is essentially on par with the state-of-the-art Clustal X, and compared with the traditional Clustal X phylogenetic tree construction method [

This work was done while D.C.S. mentored two undergraduate students, M. D. H. and K. R. B. M. D. H. was partially funded by the NIH-Minority Access to Research Careers (MARC) program (Grant #T34-GM10083) in Spring 2020 and the NSF-Louis Stokes Alliances for Minority Participation (LSAMP) Program (Grant #NSF-1712724) in Summer 2020.

The authors declare no conflicts of interest regarding the publication of this paper.

Sengupta, D.C., Hill, M.D., Benton, K.R. and Banerjee, H.N. (2020) Similarity Studies of Corona Viruses through Chaos Game Representation. Computational Molecular Bioscience, 10, 61-72. https://doi.org/10.4236/cmb.2020.103004