Similarity Studies of Corona Viruses through Chaos Game Representation

The novel coronavirus (SARS-COV-2) is generally referred to as Covid-19 virus has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification and origin of the virus genomic sequence, for planning, containment, and treatment. Motivated by the above need, we report two alignment-free methods combing with CGR to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix then define distance between two DNA sequences to be the distance between their associated matrix. These methods are being used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to ClustalX algorithm which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding closest genetic relatives of coronaviruses.


Introduction
Deoxyribonucleic Acid (DNA) is a molecule that encodes the genetic instructions used in the development and functioning of all known living organisms. As such, DNA has become a subject of both theoretical and applied studies for the last decades. DNA is a polymer of nucleotides. Nucleotides are the building blocks of DNA. The four different nucleotides of DNA are: adenine (A), cytosine (C), guanine (G), and thymine (T).
DNA sequences analysis, as one of the most important parts of bioinformatics, which was considered to reveal the essence of all life phenomenon, has been developing rapidly in recent years. Sequence comparison is crucial to understand the evolutionary relationships among organisms. Many methods have been proposed to compare genetic sequences. Traditionally, most of these approaches are the widely used alignment-based methods. In these methods, molecular sequences are optimally aligned based on selected scoring systems. The alignment-based methods often give high accuracy and may reveal the relationships among sequences. Some algorithms have been established and incorporated into software for sequence alignments. However, one of the main drawbacks of these techniques is that they are very time-consuming and expensive in memory usage. As a result, alignment-free approaches such as in [1] [2] [3] have attracted more and more attention and have been applied to biological sequence comparison as well as phylogeny analysis.
Chaos Game Representation (CGR) is an iterative system method originally proposed by Jeffery [4]. CGR is a two-dimensional plot, where the primary sequence organization of DNA is mapping using iterative function. CGR patterns of DNA segments have been proposed as a method for the classification and Identification of genomic sequences [4]- [10]. The use of CGR has mostly been restricted to a visualization tool representing nucleotide sequences, in which patterns like over-or underrepresentation of nucleotides, dinucleotides, trinucleotides, etc. can be visually ascribed. Goldman concluded that the patterns exhibited by CGR are sufficient to evaluate word length composition of three, i.e., the frequencies of nucleotides, dinucleotides and trinucleotides [5]. However, it was shown later that longer oligonucleotide frequencies also influence the patterns seen in CGR [9]. Later, a spectrum of word lengths, in addition to nucleotide and dinucleotide, in CGRs were identified as factors that can differentiate between genomes of different species. Several distance measures were proposed to compare two or more CGRs and it was employed for studying phylogenetic relationships among diverse species [7] [8] [9] [10] [11]. However, it is not clear if intraspecies genomic variability, which is much less than between-species variation, can be resolved using CGRs with similar word lengths. Later it was found in [12], the value k = 7 achieved the highest accuracy scores for HIV-1 subtypes classification.
The motivation of writing this paper came from the recent outbreak of novel coronavirus (SARS-Cov-2) now known as Covid-19. SARS-CoV-2 is the third pathogenic novel coronavirus to emerge over the past two decades. The first, discovered in 2003 and named SARS-CoV, caused SARS, a serious and atypical pneumonia. The second, MERS-CoV, emerged a decade later in the Middle East and caused a similar respiratory ailment called Middle East respiratory syndrome (MERS). Since its identification, 2494 cases of MERS-CoV infection and nearly 900 deaths have been documented [13]. The SARS-CoV epidemic proved larger but less deadly, with approximately 8000 cases and nearly 800 deaths [14]. There are other four coronaviruses that cause colds in humans-known as HCoV-229E, HCoV-NL63, HCoV-OC43 and HCoV-HKU1 [15].
In this paper, we proposed two methods, i.e., probability matrix method and centroid matrix method combining with CGR to construct distancematrix between two genomes, and then create dendrogram using Hierarchical Agglomerative Clustering (HAC) analysis. Our dendrogram can accurately identify the genetic relationship of different biology, and this method is generally applicable to various organisms.

Methods
In this section we first describe the dataset used for our analysis, then present an overview of the three main steps of the method and conclude with a description of the two distances that we considered.
For our experiment, we used only complete genomes of 15 corona viruses as it is given in Table 1.

Overview
The method we used to analyze and classify the 15 sequences of the dataset has three steps: 1) generate graphical representations (images) of each DNA sequence using CGR and define FCGR probability matrix and CGR centroid method using the features of CGR; 2) compute all pairwise distance to obtain two distance matrices; and 3) create the dendrogram of the distance matrix using Hierarchical Agglomerative Clustering (HAC) analysis.
CGR is an iterative method introduced by Jeffery [4] to visualize the structure of a DNA sequence. A CGR associates an image to each DNA sequence as follows: starting from a square with corner labeled four nucleotides C, G, A and T, and the center of the square as the starting point, the image is obtained by successively plotting nucleotide as the middle point between the current point and the corner labeled by the nucleotide to be plotted. If the generated square image has a size of 2 k × 2 k pixels, then every pixel represents a distinct kmer: A pixel is color red if the k-mer it represents appears in the DNA sequence, otherwise it is white. CGR images of generating DNA sequences coming from various species show pattern such as squares, parallel lines, rectangles, triangles, and also complex fractal patterns. We have created CGR of all 15 virus genomes and visually they look similar (see Figure 1 below).
For step (1), we will use a slight modification version of the original CGR, introduced in [9]: a k-th order FCGR (Frequency Chaos Game Representation) is a 2 k × 2 k matrix that can be constructed by dividing the CGR plot into a 2 k × 2 k grid, and defining the element |a ij | is as the number of points that are situated in the corresponding grid square. A first-order FCGR and a second-order FCGR have the structure shown below, where N w is the number of occurrences of the k-mer w, in the sequence s is The (k + 1)th order FCGR k+1 (s) can be obtained by replacing each element N X in FCGR k (s) with four elements where X is a sequence of length k over the alphabet {A, C, G, T}. For each k ≥ 1, we can define a probability matrix of FCGR k (s) by taking each entry of FCGR k (s) dividing by the total counts of all k-mers. We denote the FCGR probability matrix by (P ij ),1 ≤ i, j ≤ 2 k . Note that ∑ i, j P ij = 1. Probability matrix can be interpreted as probability of distribution.
Since the CGR captures the information of the whole genome data, extracting the global features from the CGR may not be efficient enough to distinguish the genomes. In CGR Centroid method, we concentrate on extracting the local features as shown in [11]. We partition the CGR into sub-regions so that it reveals local information of the interested areas.
If two dots are within the same quadrant, they correspond to sequences with the same last mononucleotide; if they are in the same sub-quadrant, the sequences have the same last dinucleotides; and so on. This can demonstrate the structure of the sequences yielding the points in the CGR. Chaos Centroid method utilizes this biological significance by computing the centroid of the distributed points of each sub-region.
For Chaos Centroid method, the CGR is partitioned into 10×10 equal subregion. The choice of 10 is to minimize the computation time. For each partition, we compute the centroidas follows. Let (x k , y k ) be the coordinates of a point in the CGR. We define the centroid in each of the 10×10 grid as follows: For step (2), after computing FCGR probability matrices and computing centroid for each of the sequences in the dataset, the goal was to measure "distance" between two CGR images. There are many distances as it is given in [10] [11] that can be defined for our purpose. One of the goals of this study was to identify what distance is better able to differentiate the structural differences of various genomic DNA sequences. In this paper we use two different distances: FCGR Probability Matrix distance and CGR Centroid distance. Both use the Euclidean distance. For step (3), after computing all pairwise distances we obtained two different distance matrices. Then, we created the dendrogram of the distance matrices using Hierarchical Agglomerative Clustering (HAC) analysis.

Distances
In this section we formally define each of the two distances. For two FCGR probability matrices (p ij ) and (p′ ij ) we define d ij = p ij − p′ ij The distance between the two probability matrices denoted by D P M = ∑ i = 1 2 k ∑ j = 1 2 k d ij For two genomes, we calculate 100 centroids c ij = (x ij , y ij ) and c′ ij = (x′ ij , y′ ij ) respectively for 1 ≤ i, j ≤ 10. Then we found Euclidean

Author Manuscript
Author Manuscript

Author Manuscript
Author Manuscript distance between them d ij = (x ij − x′ ij ) 2 + (y ij − y′ ij ) 2 . Then calculated the centroid distance between two genomes denoted by D cd = ∑ i = 1 10 ∑ j = 1 10 d ij .

Results
For our dataset we used k = 7, that is, each DNA sequence represented as a 2 7 × 2 7 FCGR matrix. In [12], it was found highest accuracy in HIV-1 classification and this value is being used here as it is relevant for our viral analysis. Table 2 display the pairwise distance among 15-virus genomes in the dataset using probability matrix distance while Table 3 display the same using centroid distance. Figure 2 shows the phylogenetic tree obtained using Table 2 distances by python Hierarchical Agglomerative Clustering (HAC) analysis. Similarly Figure 3 shows the phylogenetic tree using Table 3. Figure 4 is the Neighbor Joining Phylogenetic tree using traditional Clustal X method.
From Figure 2 and Figure 4, we can see that the cluster results between Clustal X method and probability distance method are essentially same. Similar Phylogenetic analysis of bat coronaviruses with other coronaviruses and the phylogenetic tree was constructed using Clustal W also done in [15].
All sequence data contain inherent information that can be measured by Shannon's uncertainty theory. Measuring uncertainty may be used for rapid screening for sequences with matches in available database, prioritizing computational resources, and indicating which sequences with no known similarities are likely to be important for more detailed analysis as seen in [16]. We started with 57 genome sequences and then reduced to 15 based on the Shannon Entropy and Shannon Entropy of 7-mers of the sequences, see Figure 5 and Figure 6. All Covid-19 sequences have entropy close to 1.957. We choose only six Covid-19 sequences in the dataset along with all other sequences with deviated entropy from 1.957 for our analysis of corona viruses.

Discussion and Conclusions
Our methods are comparable to many other alignment-free methods as shown in [1] [2] [3] [11]. The proposed methods i.e. FCGR Probability and Chaos Centroid, are based on Chaos game representation, which provides a unique and scale-independent representation of DNA sequences through the statistical distribution of k-mers along DNA sequences. An advantage of CGR over alignment is that it has the potential to reveal the evolutionary and/or functional relationships between the sequences having no significant homology, as explained in [17]. Furthermore, it does not require prior knowledge of consensus sequences, nor does it involve exhaustive searches for sequences in databases. The limitation of CGR is that it takes a computational time to generate the representations from DNA sequences.
In conclusion, results show that our method can accurately classify different genomic sequences. In terms of classification accuracy, our method is basically the same as the stateof-the art Clustal X and compare with the traditional Clustal X phylogenetic tree construction method [18], our method is much faster. Furthermore, our dendrogram construct method can be widely applicable for various kinds of organisms. This research may contribute to reveal the biological evolution process to some extent, as well as promote the further development of bioinformatics. We may make efforts in our future work to provide a webserver for the methods presented in this paper. All the codes in this paper are written in python and can be available upon request. CGR images of all fifteen coronaviruses listed in Table 1.