Predicting the Underlying Structure for Phylogenetic Trees Using Neural Networks and Logistic Regression

Understanding the underlying structure of phylogenetic trees is very important because it informs the methods that should be employed during phylogenetic inference. The methods used for a structured population differ from those needed when a population is not structured. In this paper, we compared two supervised machine learning techniques, artificial neural network (ANN) and logistic regression models, for predicting the underlying structure of phylogenetic trees. We carried out parameter tuning to identify optimal models. We then performed 10-fold cross-validation on the optimal models for both logistic regression and ANN. We also applied an unsupervised technique, clustering, to identify the number of clusters that could be found in simulated phylogenetic trees. The trees were from both structured and non-structured populations. Clustering and classification were done using tree statistics such as the Colless, Sackin and cophenetic indices, among others. Results from 10-fold cross-validation revealed that logistic regression and ANN models had comparable results, with both models having average accuracy rates of over 0.75. Most of the clustering indices used gave 2 or 3 as the optimal number of clusters.


Introduction
A phylogenetic tree is defined by [1] [2] as a tree that represents evolutionary relationships.

Logistic regression is closely related to linear regression. Both have a dependent variable, say $Y$, which is predicted using independent variables, say $X_1, X_2, \ldots, X_p$, in the case where we have $p$ independent variables. For linear regression, $Y$ is a continuous variable, while for logistic regression $Y$ is a categorical variable that takes on two values (dichotomous). For example, logistic regression can be used to predict the presence or absence of a certain symptom in patients, using variables such as age, weight, race and others. Many studies have employed logistic regression to study various phenomena. For example, [8] used logistic regression to analyse 46 variable amino acid sites in reverse transcriptase for their effect on susceptibility. The other classification technique that we used was the artificial neural network.
An artificial neural network (ANN) model consists of input neurons, hidden layers (with hidden neurons) and output neuron(s) as described in [9] [10] [11] [12]. For a classification problem, input neurons are features that are used during the learning process of the network. These are the input variables for the network as pointed out by [10]. Hidden layers and hidden neurons connect input neurons with output neurons. The output neurons are classification targets, for example, presence or absence of a disease. Layers and neurons in ANN models are connected by weights that are determined during the learning algorithm. ANN models are applicable in many fields, including financial management, manufacturing, pattern recognition, control systems, environmental science, among others as noted by [13]. For example, [9] used ANN models to predict five-year mortality for patients who were diagnosed with breast cancer. [13] applied ANN models to study rainfall-runoff patterns and forecasting floods. [10] applied ANN models for eutrophication prediction, where water quality indicators of a certain lake were predicted with reasonable accuracy. In other ANN applications, [14] used back-propagation neural network on classification of multi-spectral remote sensing data.
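The flow from input neurons through hidden neurons to an output neuron can be illustrated with a minimal sketch. It assumes a logistic activation function and a single output neuron, and all weights below are hypothetical values chosen for illustration, not weights fitted in this study.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass through one hidden layer and a single output neuron."""
    # Each hidden neuron applies the activation to a weighted sum of the inputs.
    hidden = [
        sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    # The output neuron combines the hidden activations into one probability.
    z = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    return sigmoid(z)

# Hypothetical weights for 3 inputs and 2 hidden neurons (not fitted values).
hidden_weights = [[0.5, -1.0, 0.25], [-0.75, 0.5, 1.0]]
hidden_biases = [0.0, 0.1]
output_weights = [1.5, -2.0]
output_bias = 0.2

p = forward([1.0, 0.0, -1.0], hidden_weights, hidden_biases,
            output_weights, output_bias)
label = "structured" if p >= 0.5 else "non-structured"
```

In a trained network the weights would be determined by the learning algorithm rather than set by hand.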
In this paper, we used logistic regression and ANN models for classification. The two classes were structured and non-structured populations. The independent variables were the tree statistics. We investigated the predictive ability of the logistic regression and ANN models, which was assessed using the average accuracy rates. We also applied an unsupervised learning technique, clustering, to identify the number of clusters in the simulated phylogenetic trees.

Methods
A linear regression model is given as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon,$$

where the $\beta_j$'s are linear regression coefficients estimated using the least squares method, which minimises the sum of squared residuals.

A logistic regression model is given as:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}}.$$

The parameters used in the simulation sets were chosen following a similar study in [3]. We had three simulation sets. The first simulation set had 1000 trees in total, and therefore 500 for each of the structured and non-structured populations. It should be noted that the structured and non-structured populations in this study correspond to the asymmetry and symmetry models, respectively, used by [3].
For the second simulation set, the parameters for structured and non-structured populations remained the same as those in the first simulation set, except for the number of leaves of the trees. The total number of leaves was changed from 200 to 500. We therefore had $n_1 = n_2 = 250$. For the third simulation set, only the number of phylogenetic trees was doubled, giving 1000 trees for each of the structured and non-structured populations, while the other parameter values were the same as those for simulation set 1.
Using simulated trees obtained under structured and non-structured populations, we used eight tree statistics for classification and clustering. These were: number of cherries, Sackin, Colless and total cophenetic indices, ladder length, maximum depth, maximum width, and maximum width over maximum depth. A cherry is defined as two leaves (tips) that are adjacent to a common ancestor node, as described in [15]. The Sackin index adds the number of internal nodes between each leaf and the root in a tree. This index was proposed by Sackin in 1972. For the Colless index, the absolute difference between the numbers of left-hand and right-hand leaves subtended at each internal node is computed. This is done over all the internal nodes, and the sum gives the Colless index. Details of the Colless index can be found in [7]. The definition of the total cophenetic index is given by [16]. Definitions of ladder length, maximum depth of a tree, maximum width, and maximum width over maximum depth can be found in [17]. Phylogenetic tree simulation and computation of tree statistics were implemented in Python, version 3.7.3.
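The definitions of the number of cherries and the Sackin and Colless indices can be illustrated with a short sketch. It is not the code used in this study: it assumes a rooted binary tree encoded as nested Python tuples, where a leaf is any non-tuple value and an internal node is a `(left, right)` pair, and the function name `tree_stats` is hypothetical.

```python
def tree_stats(tree):
    """Return cherries, Sackin and Colless indices of a rooted binary tree.

    A leaf is any non-tuple value; an internal node is a (left, right) tuple.
    """
    stats = {"cherries": 0, "sackin": 0, "colless": 0}

    def visit(node, depth):
        # Returns the number of leaves below `node`.
        if not isinstance(node, tuple):
            # `depth` counts the internal nodes on the path from this leaf
            # up to and including the root.
            stats["sackin"] += depth
            return 1
        left, right = node
        n_left = visit(left, depth + 1)
        n_right = visit(right, depth + 1)
        # Colless: imbalance between the two subtrees of this internal node.
        stats["colless"] += abs(n_left - n_right)
        # Cherry: an internal node whose two children are both leaves.
        if n_left == 1 and n_right == 1:
            stats["cherries"] += 1
        return n_left + n_right

    visit(tree, 0)
    return stats

balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(tree_stats(balanced))     # {'cherries': 2, 'sackin': 8, 'colless': 0}
print(tree_stats(caterpillar))  # {'cherries': 1, 'sackin': 9, 'colless': 3}
```

The fully balanced four-leaf tree has a Colless index of zero, while the caterpillar tree on the same leaves is more imbalanced and scores higher on both the Sackin and Colless indices.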
We then performed standardization for all the eight variables using a formula given by Equation (4).
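As an illustration of the standardization step, the sketch below assumes the common z-score form, in which each value has the variable's mean subtracted and is divided by its sample standard deviation. This form is an assumption made for illustration, and the input values are hypothetical.

```python
import statistics

def standardize(values):
    """Z-score standardization: zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Hypothetical values of one tree statistic across five trees.
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
z = standardize(raw)
```

Standardizing puts tree statistics measured on very different scales, such as the number of cherries and the total cophenetic index, on a common scale before model fitting.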

Training Artificial Neural Network and Logistic Models
With the help of the R package neuralnet [11], we first trained artificial neural network (ANN) models using all eight standardized tree statistics as the input variables. These were: number of cherries, Sackin, Colless and total cophenetic indices, ladder length, maximum depth, maximum width, and width-to-depth ratio. We first used one hidden layer with one neuron and applied generalized weights, as described in [12], to identify the four most influential input variables for each of the three simulation sets. This was done to reduce the number of input variables for the ANN models, since reduced ANN models with fewer input variables converged faster.
For logistic regression, models were fitted using the glm function of the R package stats. The glm function fits generalized linear models. As pointed out by [19], these models comprise a dependent variable (z), a set of independent

Parameter Tuning for Neural Network and Logistic Models
Using an ANN model with one hidden neuron, we identified the four most influential input variables using generalized weights for each of the three simulation sets. [9] analysed the contributions of covariates (input variables) in ANN models using generalized weights. They point out that the distribution of generalized weights for a particular covariate indicates whether its effect is linear (small variance) or non-linear (large variance). We plotted the generalized weights for all eight inputs for each of the three simulation sets using the same range. Input variables whose generalized weights were distributed close to zero were deemed to contribute less to explaining the output variable, as pointed out by [11]. Parameter tuning was then performed on the reduced ANN models. The parameters that were tuned to obtain optimal models were the number of hidden layers and the number of hidden neurons.
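The link between the variance of generalized weights and linearity can be illustrated numerically. The sketch below assumes that the generalized weight of an input is the partial derivative of the log-odds of the model output with respect to that input, and approximates it by central differences. Both models and all coefficients are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(model, x):
    p = model(x)
    return math.log(p / (1.0 - p))

def generalized_weights(model, x, h=1e-5):
    """Numerical partial derivatives of the log-odds with respect to each input."""
    weights = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        weights.append((log_odds(model, up) - log_odds(model, down)) / (2.0 * h))
    return weights

def logistic_model(x):
    # Hypothetical logistic regression: the log-odds are linear in the inputs.
    return sigmoid(0.3 + 1.2 * x[0] - 0.7 * x[1])

def small_ann(x):
    # Hypothetical network with one hidden neuron: the log-odds are non-linear.
    hidden = sigmoid(0.5 * x[0] - 1.0 * x[1])
    return sigmoid(-1.0 + 2.0 * hidden)

gw_linear = [generalized_weights(logistic_model, x) for x in ([0.4, -1.0], [2.0, 0.5])]
gw_ann = [generalized_weights(small_ann, x) for x in ([0.0, 0.0], [2.0, -2.0])]
```

For the logistic model the generalized weights equal the regression coefficients at every input point, so their variance is zero. For the small network they change from point to point, which is the non-linear case with larger variance.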
For each simulation set, having identified the four most influential input variables, we ran reduced ANN models with two hidden layers. In each hidden layer, we varied the number of hidden neurons between one and two. Candidate models were compared using AIC, BIC and entropy error.

For logistic models, we tuned the number of input variables, reducing them from eight to four. We selected the four most significant variables for easy comparison with the ANN models, which had also been reduced to four input variables.

Cross-Validation of Classification Results for ANN and Logistic Models
Having obtained optimal ANN and logistic regression models for each simulation set, we performed 10-fold cross-validation for classification of simulated trees from both structured and non-structured populations.
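The 10-fold cross-validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the R code used in this study: the logistic regression is fitted by plain batch gradient descent rather than by R's glm, the data are synthetic stand-ins for tree statistics, AUC is omitted for brevity, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_predictor(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent (an assumption; glm uses IRLS)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, target in zip(X, y):
            err = sigmoid(linear_predictor(beta, x)) - target
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def ratio(a, b):
    return a / b if b else float("nan")

def cross_validate(X, y, k=10, seed=0):
    """Return mean sensitivity, specificity and accuracy over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets
    totals = {"sensitivity": 0.0, "specificity": 0.0, "accuracy": 0.0}
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        beta = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        tp = tn = fp = fn = 0
        for i in test_idx:
            pred = 1 if sigmoid(linear_predictor(beta, X[i])) >= 0.5 else 0
            if pred == 1 and y[i] == 1:
                tp += 1
            elif pred == 0 and y[i] == 0:
                tn += 1
            elif pred == 1:
                fp += 1
            else:
                fn += 1
        totals["sensitivity"] += ratio(tp, tp + fn)
        totals["specificity"] += ratio(tn, tn + fp)
        totals["accuracy"] += ratio(tp + tn, len(test_idx))
    return {name: total / k for name, total in totals.items()}

# Synthetic stand-in data: one informative and one noise feature per "tree".
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]
cv_scores = cross_validate(X, y)
```

Each observation is used for testing exactly once, and the reported measures are means over the ten held-out folds.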

Clustering of Phylogenetic Trees Using Tree Statistics
Since ANN and logistic regression models are supervised learning techniques, we wanted to compare the two with an unsupervised learning technique. We therefore performed k-means clustering. We were interested in finding the optimal number of clusters that could be obtained from the simulated tree sets. We first used all eight tree statistics and later reduced them to four for easy comparison with the ANN and logistic regression models. We used exactly the four tree statistics that were used for the reduced ANN and logistic regression models. Using the R package NbClust [23], we obtained the optimal number of clusters for both full simulation sets (when all eight tree statistics were used) and reduced simulation sets (when four tree statistics were used). NbClust gives the optimal number of clusters for a given data set using thirty indices.
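Choosing the number of clusters with a single index can be sketched as below. This is not NbClust, which combines thirty indices. The sketch uses only the average silhouette width with a basic k-means implementation, and the two-dimensional data are synthetic stand-ins for standardized tree statistics.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, seed=0, iters=50):
    """Basic Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Average silhouette width over all points."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Synthetic data: two well-separated groups of 30 points each.
rng = random.Random(42)
points = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(30)]
points += [[rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)] for _ in range(30)]

widths = {k: silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(widths, key=widths.get)
```

The number of clusters with the largest average silhouette width is taken as optimal. Different indices can disagree on real data, which is why a package such as NbClust reports the choice made by each index.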

Results for ANN and Logistic Regression Models
The full ANN model for simulation set 1 is shown in Figure 1. The model has one input layer with eight neurons. The entropy error was approximately 335, and the model required 46934 steps to converge. The corresponding generalized weights for the ANN model in Figure 1 are shown in Figure 2. These generalized weights are for all eight tree statistics for simulation set 1. From Figure 2, the four input variables with the largest variance, and hence the most influential in explaining the underlying structure for simulation set 1, are the Colless and Sackin indices, maximum width and width-to-depth ratio. We also plotted the generalized weights for simulation sets 2 and 3. For these two simulation sets, the four most influential input variables were the same: the Colless, Sackin and total cophenetic indices and maximum depth.
We obtained optimal ANN models for each of the three simulation sets using AIC, BIC and entropy error. Results for simulation sets 1 and 2 are shown in Figure 3. The optimal model for simulation sets 1 and 3 had 2 neurons in the first hidden layer and 1 neuron in the second hidden layer. For simulation set 2, the optimal model had 1 neuron in the first hidden layer and 2 neurons in the second hidden layer.

For logistic regression models, the most significant variables for simulation sets 1 and 2 were: number of cherries, Colless, Sackin and total cophenetic indices. For simulation set 3, the four most significant variables were: number of cherries, total cophenetic index, maximum width and maximum depth.
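The selection of ANN architectures by AIC and BIC described above can be sketched as follows. The sketch assumes the standard definitions, with the entropy error treated as the negative log-likelihood, so that AIC = 2E + 2k and BIC = 2E + k ln(n), where k is the number of weights and n the number of trees. The entropy errors below are made-up values for illustration, not results from this study.

```python
import math

def aic_bic(entropy_error, n_params, n_obs):
    """AIC and BIC, treating the entropy error as the negative log-likelihood."""
    aic = 2.0 * entropy_error + 2.0 * n_params
    bic = 2.0 * entropy_error + n_params * math.log(n_obs)
    return aic, bic

def n_weights(layer_sizes):
    """Weights in a fully connected network, with one bias per non-input neuron."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical entropy errors for the four candidate architectures, keyed by
# (neurons in hidden layer 1, neurons in hidden layer 2).
errors = {(1, 1): 360.0, (1, 2): 352.0, (2, 1): 330.0, (2, 2): 329.0}
n_obs = 1000  # trees in the data set

criteria = {}
for (h1, h2), err in errors.items():
    # Four inputs, two hidden layers, one output neuron.
    criteria[(h1, h2)] = aic_bic(err, n_weights([4, h1, h2, 1]), n_obs)

best_by_aic = min(criteria, key=lambda m: criteria[m][0])
best_by_bic = min(criteria, key=lambda m: criteria[m][1])
```

With these made-up errors, both criteria prefer the model with two neurons in the first hidden layer and one in the second: the slightly lower error of the largest model does not offset the penalty for its additional weights.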

Results for the 10-Fold Cross-Validation for ANN and Logistic Models
Having established the optimal models for both ANN and logistic regression, we performed 10-fold cross-validation. Table 1 shows the means for sensitivity, specificity, accuracy and AUC. Results for ANN and logistic regression are comparable. For both models, simulation set 1 had the lowest mean values and simulation set 3 had the highest mean values for the measures used.

Figure 4 shows the optimal number of clusters using average silhouette width and the gap statistic for simulation sets 1 and 2. For these two statistics, the optimal number of clusters was 2. We analysed both full simulation sets (when all eight tree statistics were used) and reduced simulation sets (when only four tree statistics were used). We had six reduced simulation sets: three based on the variables used for the reduced ANN models and three based on the variables used for the reduced logistic regression models. Together with the three full simulation sets, this gave nine simulation sets for which we investigated the optimal number of clusters. For five of the nine simulation sets, the optimal number of clusters was two, and for the rest it was three.

Conclusions
From the results obtained, it was evident that ANN and logistic regression models had comparable performance. A comparison of the reduced models with four input variables revealed that, for each of the three simulation sets, at least two input variables were common to the reduced ANN and logistic regression models. For simulation set 1, the four most significant input variables for ANN models were the Colless and Sackin indices, maximum width and width-to-depth ratio. For logistic regression, the four most significant variables were: number of cherries, Colless, Sackin and total cophenetic indices. For simulation set 2, three of the four significant input variables were common to both ANN and logistic regression models. These were the Colless, Sackin and total cophenetic indices. For simulation set 3, two of the four most significant input variables were common to both ANN and logistic models. These were the total cophenetic index and maximum depth.

For 10-fold cross-validation, in both ANN and logistic regression models, the mean values for sensitivity, specificity, accuracy and AUC were lowest for simulation set 1 and highest for simulation set 3, as shown in Table 1. The mean accuracy values for ANN and logistic regression models were comparable, with the highest value of 0.974 obtained by logistic regression for simulation set 3. The lowest value, 0.696, was also obtained by logistic regression, for simulation set 1. This was because simulation set 3 contained more phylogenetic trees. This implied more information during the training of the classification models, hence better classification results for simulation set 3 compared to simulation set 1.

We chose to compare logistic regression with ANN models because ANN models are considered complex and their internal mechanism is hard to understand, hence they are referred to in the literature as a black box classification technique. Logistic regression, in contrast, is one of the simplest regression models, with only regression coefficients to be estimated during model training. The fact that ANN models performed comparably with logistic regression models suggests that the tree statistics employed to predict the underlying population structure did well.
The results for clustering revealed that 2 or 3 clusters were optimal for most of the clustering indices that were used. The unsupervised learning results show that structure was detected fairly well by the clustering technique, though not as accurately as expected, since some indices reported 3 clusters. This is not surprising given that clustering is an unsupervised technique.