Inferring Multi-Type Birth-Death Parameters for a Structured Host Population with Application to HIV Epidemic in Africa

Human Immunodeficiency Virus (HIV) dynamics in Africa are purely cha-racterised by sparse sampling of DNA sequences for individuals who are infected. There are some sub-groups that are more at risk than the general population. These sub-groups have higher infectivity rates. We came up with a likelihood inference model of multi-type birth-death process that can be used to make inference for HIV epidemic in an African setting. We employ a likelihood inference that incorporates a probability of removal from infectious pool in the model. We have simulated trees and made parameter inference on the simulated trees as well as investigating whether the model distinguishes between heterogeneous and homogeneous dynamics. The model makes fairly good parameter inference. It distinguishes between heterogeneous and homogeneous dynamics well. Parameter estimation was also performed under sparse sampling scenario. We investigated whether trees obtained from a structured population are more balanced than those from a non-structured host population using tree statistics that measure tree balance and imbalance. Trees from non-structured population were more balanced basing on Colless and Sackin indices.


Introduction
Human Immunodeficiency Virus (HIV) is among the most infectious diseases epidemic.
Many African countries are battling with HIV epidemic. For example, the total burden of HIV in Uganda is increasing according to spectrum estimates from the Ministry of Health as documented in [4]. The number of individuals living with HIV in Uganda rose from 1.4 million in 2013 to 1.5 million in 2015. This was due to continuous spread of the disease and increased life expectancy for people living with HIV [4]. From [4], the number of AIDS related deaths was 28,000, while new HIV infections were 83,000 in 2015. In Uganda, HIV infects individuals disproportionally according to both their geographical and risk behaviours [5]. Some groups are more at risk compared to the general population in Uganda. Such groups have high incidence for HIV. These groups include sexual workers, fish folk community and truck drivers [6] [7] [8] [9]. It was observed that sex workers, their clients and regular partners of clients, contributed between 7% and 11% of new infections in Uganda, Swaziland and Zambia [8]. Dis-proportionality is also visualised among females and males. Although the general prevalence of HIV among adults in Uganda was 6.2% in 2016, it was 7.6% among females and 4.7% among males [10]. HIV prevalence differs in various regions in Uganda as well. It was stated in [10] that the prevalence of HIV in Central, Eastern, West Nile, Northern and Western Uganda was 8.0%, 5.1%, 3.1%, 5.7% and 7.9%, respectively in 2016. Therefore, HIV epidemic in Uganda is structured as well as generalised.
In order to study such a host population so as to make inference about HIV epidemic, viewing it as a structured population gives more reliable parameter inference due to heterogeneity manifested by individuals in terms of infectivity.
In this paper, we are interested in a model that depicts HIV dynamics in an H. W. Kayondo  African setting. Dis-proportionality in terms of infectivity is one of the characteristics significant for HIV dynamics in Africa. In order to account for this, we employ models that make inference for structured host population. Incorporating such models for HIV dynamics in Africa is still a challenge as the models should entail characteristics like sparse sampling and heterogeneity in infectivity among infected individuals. We incorporate such characteristics in the model for a structured host population. This aids in making parameter inference for HIV epidemic in an African setting. Molecular sequence data uncovers underlying disease dynamics in structured host populations much better than prevalence and incidence data. Molecular sequence data is increasingly becoming available even in Africa, for example Southern African Treatment and Resistance Network has a growing database of greater than 70,000 HIV sequences [11]. The availability of this genomic data has resulted in analysis of DNA HIV sequences, for example, The Phylogenetics and Networks for Generalized Epidemics in Africa consortium (PANGEA-HIV) utilises viral sequence analyses to establish transmission patterns of HIV in Africa [11]. Analysis of sequence data is explored by many researchers for making inference about epidemiological parameters of epidemics like in computation of basic reproductive number. The basic reproductive number of the pathogen for HIV-1 epidemic in Switzerland was estimated using DNA sequences [12]. Some researchers have integrated both sequence data and compartment models to make inference about disease dynamics. The use of sequence data with a compartmental susceptible-infected-removed (SIR) enabled the joint estimation of epidemiological parameters and establishing disease origin [13]. The method was applied to HIV-1 sampled data from the UK to estimate the basic reproductive ratios within clusters and on hepatitis C virus data set from Argentina to establish the origin of the disease [13]. Analysis of sequence data involves using phylogenetic trees.
We extend the multi-type birth death model of likelihood inference implemented by [14]. We incorporate an additional parameter which is the probability of removal from an infectious pool. In this paper, a sampled individual becomes non-infectious immediately after sampling with a probability r and remains infectious with probability 1 . r − The added parameter accounts for scenarios where some infected individuals remain infectious even upon diagnosis like in most of Africa's HIV epidemic. Most of individuals who are diagnosed with HIV are not necessarily removed from the infectious pool owing to risky behaviour practised by some individuals. We extend the likelihood inference due to its flexibility in varying other parameters like sampling proportions of different types or sub-groups when making parameter inference. We estimate parameters by varying the sampling proportions as well. The effect of sparse sampling on parameter inference is investigated as well. This is because, analysing parameter inference in such scenarios of sparse sampling is important since HIV epidemic in Africa is still characterized by limited DNA sequences for HIV positive individuals. The model employed in this paper was used to infer parameter The model used in this paper differs from the Bayesian inference that was implemented by [15] as it extends the likelihood techniques of a birth-death multi-type model developed by [14]. In the Bayesian framework, both epidemiology and fossil calibration of parameters were estimated. We extended the likelihood inference of [14] by adding an extra parameter that accounts for individuals that remain infectious even after being diagnosed with HIV. The extended model which is used for parameter inference in this paper is helpful in studying HIV epidemic in an African setting. Section 2 gives a detailed explanation of the method we are employing as well as explaining extension of the model in [14]. The results of our study are presented in Section 3, while discussion of the results and conclusions are presented in Section 4.

Modelling Framework
Using ideas developed in [14] on multi-type birth-death process (MTBD), we  Figure 1 shows a structured host population with two demes and the parameters for the model.  ψ is the rate at which an individual of deme 1 is sampled.
Sampling in this context is the way HIV sequences are included into the study.
4) Parameter r is the probability that an individual in either of the demes becomes non-infectious upon sampling (being removed from infectious pool). An individual therefore remains infectious with probability 1 . r − Parameter r is the added parameter in our model which is not in the model presented in [14]. In our model framework, an individual either remains infectious or is removed from the infectious pool upon sampling. Therefore, 1 1 rµ ψ is the rate at which an individual in deme 1 is removed from the infectious pool upon sampling. The rate at which an individual in deme 1 who remains infectious, transmits to an individual in deme 1 upon sampling is ( ) 11  For the derivation of the likelihood inference for parameter estimation, we still build on ideas used in [14]. Figure 2 illustrates the multi-type birth-death tree that we use in the derivation for likelihood of multi-type birth-death tree.
In Figure 2, black dots are sampled individuals, individuals either belong to deme 1 or 2. E is an edge which represents an individual. For parameters on the  E. T is the time taken for the process or epidemic. This is the time from the origin of the epidemic to present. The model illustrated in Figure 1 results in a multi-type tree as visualized in Figure 2. In our model, the multi-type tree can have a sampled descendant, while a sampled individual in the model presented in [14] had no sampled descendants.
In this derivation, From Equation (1), the first term on the right is for no birth, death and sampling events, second term represents birth of an individual of deme j while lineage j produces no samples in time t, third term signifies birth of an individual of deme j while lineage 1 produces no samples in time t. The last term on right is the probability that more than one event occurs in the time interval t ∆ .
After re-arranging the terms in Equation (1) and letting 0 t ∆ → , we obtain the differential equation: The initial condition at τ is given as: , for 1, and 0, for 1, The first term on the right of Equation (3) depicts individuals who become non-infectious upon sampling while the last term signifies those who are not removed from the infectious pool, i.e., they remain infectious even after being sampled. In typical African communities, many individuals infected with HIV remain infecting others either knowingly or unknowingly. When r is 1, the initial condition given in Equation (3) reduces to that used in [14]. ( ) 1 Z t is derived using similar steps as those used in the derivation for Equation (2). The differential equation for ( ) 1 Z t is given as: with the initial condition, The first term on the right of Equation (4) represents death without sampling, the second signifies no birth, death and sampling as well as an individual in deme 1 producing no samples. The last term depicts a birth of an individual j but both individuals 1 and j producing no samples in time t. The initial condition given in Equation (5) implies that an individual at time 0 t = is not sampled. This is the case because the model used accounts for sampling through time only. Contemporaneous sampling, which is sampling at only one time point especially the present is not modelled in this inference.
The differential equations (2) and (4) with initial condition given by (3) are for deme 1. The corresponding differential equations for deme 2 were analogously derived. The system of differential equations with their initial conditions that is integrated along a given edge for 2 demes is given as: Solving the system of equations in (6) numerically gives the calculations along an edge of a tree. This is then followed by pruning at the nodes and finally reaching the root of the tree. This results in obtaining either D T depending on whether the root was in deme 1 or 2.

Probability Density of the Tree
The probability density of the tree with the first individual at time T conditioned on observing at least an individual who is sampled for a structured population The probability that an individual at time T is of deme 1 is given by 1 f and its distribution must be specified. Equation (7) is the likelihood of the parameters given the data. This is used for parameter inference. It should be noted that Z T can be 1. This is because if any takes on 1, then the process will not give raise to the inferred tree. The conditioning is necessary since studies have shown that this conditioning gives more accurate parameter values [16]. The parameters that we estimated were: , ψ ψ ψ = , r and T .

Parameter Inference
The necessary changes suggested in Equation (3)

to reflect HIV dynamics in an
African setting were effected in R using an R package, Treesim of [17]. The parameters for simulated trees were then estimated using an R package, TreePar of [18] with changes to suit our model. In the first set of simulations, 50 trees with 100 tips were simulated under multi-type birth-death model with 2 demes (MTBD-2), with sampling probability of 0.2 for each of the two sub-populations to investigate the effect of the probability of removal on parameter inference. We then increased the number of simulated trees to 100 but kept 0.8.

Heterogeneous and Homogeneous Dynamics
For comparison between two models, we used likelihood ratios. From [19] In [20], they defined the likelihood ratio statistic as X H is the likelihood of the data (X) under hypothesis H. This likelihood statistic simplifies to the one given in Equation (8). Likelihood ratio statistic is asymptotically chi-squared distributed under the null hypothesis according to [19].
For heterogeneous dynamics, we assume that the population is structured into demes, say 2 as in our model. MTBD-1 and MTBD-2. Likelihood ratio tests were used to find out how many trees were rejected whose inference was made under MTBD-2.
Since the use of likelihood ratios requires that 0 H is derived from a full al- H are defined as null and alternative models, respectively.

Simulated DNA Sequences
In the absence of real DNA sequence data for a structured host population, DNA sequences were simulated using a stochastic agent based model called, the Discrete Spatial Phylo Simulator (DSPS). DSPS was also used in [21].

BEAST Study of Simulated DNA Sequences
We made a Bayesian study for a sampled set of sequences using Bayesian Evolution Analysis by Sampling Trees (BEAST) developed by [22]. This was done us- , and yet we obtained 0i R and i µ from the BEAST study, we therefore set From the simulated trees, we estimated the parameters to evaluate the performance of our model with parameters that were obtained from the simulated DNA sequences.

Tree Balance versus Structured Host Population
We investigated whether there was a relationship between some tree statistics and the structuring of the host population. This was done using tree statistics that measure balance and imbalance of a phylogeny. In [23], they state that Gamma ( γ ) statistic was defined in [24]. Let 2 3 , , , n g g g  be the inter-node distances of the reconstructed phylogeny with n taxa, the γ -statistic is defined as: ( )  Colless index is another tree statistic that assesses tree imbalance. It is defined as: ( )( ) ance of the true phylogeny affects the accuracy of its estimates as stated in [26].
For Sackin and Colless indices, the higher the values for these two indices, the more unbalanced the trees are [26].
We simulated phylogenetic trees in R using TreeSim package of [17]. For a structured host population, we set our parameters as: 11 Table 1, upper and middle parts. There was an overall poor estimation for the second death parameter ( 2 µ ). The lower part of Table 1 shows parameter estimates when the number of simulated trees was increased from 50 to 100. Better estimates were realized when the number of simulated trees was increased. Parameter estimation for 2 µ greatly improved when the number of sampled trees was increased to 100. A close look at the middle and lower parts of   Having observed that increasing the number of simulated trees to 100 improved parameter estimates, we increased the number of sampled tips from 100 to 200 in the next set of simulations. This is shown in  Table 2 are narrower than those in the lower part of Table 1. When we varied the probability of removal (r) from 0.8 to 0.5, there was no significant difference in the parameter estimates as seen in the lower part of Table 2.

Parameter Inference
To investigate whether sampling intensity had an effect on parameter estimates, we varied the sampling proportion from 0.2 to 0.05. Parameter estimates when the sampling intensity was fixed at 1 2 0.05 ψ ψ = = are shown in Table 3.
Slightly wider intervals were obtained for the parameter estimates for this setting. For the error bars shown in Figure 5 and Figure 6, the intervals for parameter estimates obtained when each simulated tree had 200 sampled tips were narrower than those when the simulated trees had 100 sampled tips. The error bars for two parameters ( 11 λ and 2 µ ) are shown. The two selected parameters gave a representation for the rest of the parameters that were estimated.

Heterogeneous and Homogeneous Dynamics
To establish whether our model distinguished between heterogeneous population (MTBD-2) from homogeneous population (MTBD-1), we performed likelihood ratio tests. In the likelihood ratio inference, we used a decision rule of rejecting the null-hypothesis if the likelihood ratio (test statistic) is greater than     Table 4 and Table 5. However, according to [28], he suggests using a 2 χ with 1 degree of freedom.
So at α of 0.1, we reject the null hypothesis if the likelihood ratio is greater than 2.706. With this rule, instead of 41 out of 50, it is 49 trees that rejected MTBD-1 model in favour of MTBD-2 in first scenario and results for the second scenario remain the same when we use the idea of [28].

Results from BEAST Study of Simulated DNA Sequences
We analysed 195 sequences from the sample. These sequences were structured into two groups, namely MSM and Bisexuals. For this sample, we run MCMC in   Table 6.

Tree Balance versus Structured Host Population
From results shown in Table 7   The densities for the three tree statistics are shown in Figure 8. From these densities, there was no significant difference between structured and non-structured populations. The only difference is in the densities for Colless index as the median value for a structured population is higher than that for a non-structured one.

Discussion and Conclusion
From the results shown in Tables 1-6 and Figures 3 simply because it had an extra parameter that affected the identifiability of the parameters. The estimates were however better when 0.8 r = because as r approached 1, the model approached that used in [14] and it was a parameter less compared to the model used in our study. Our model is handy in making inference that depicts an African setting which is characterized by both sparse sampling of DNA sequences and a large proportion of individuals that always remain infectious even after diagnosis with HIV.
Our proposed model distinguished between heterogeneous and homogeneous populations. This is because 41 trees out of 50 supported the heterogeneous model dynamics, rejecting the homogeneous type of model inference. For homogeneous dynamics, all the 50 trees accepted the homogeneous dynamics and rejected the heterogeneous type dynamics. We employed the likelihood test statistic for this model selection using a level of significance ( α ) of 0.1. This was close to that used in [14] which was 0.2. In their work, a 2 χ distribution at 0.8 level was used for the inference. It should be noted that even after incorporating a new parameter (probability of removal), the model still accounted for a structured host population. The probability of removal was included to reflect a situation where individuals fail to change behaviour upon diagnosis with HIV. Whereas studies from Europe have reported that individuals become non-infectious upon diagnosis with HIV as reported in [14], it is still a challenge in many African communities. It is therefore imperative to emphasize that structured populations characterize HIV epidemics in many African countries due to the fact that there are some groups who are more at risk compared to the general population in many African countries.
For the investigation between structuring of a population and tree balance, it was concluded from the results shown in Table 7 that simulated trees from a non-structured host population are more balanced compared to those from a structured host population. This conclusion was based on Colless and Sackin indices. It is not surprising that when we increased the number of simulated trees from 5000 to 15,000, we did not realise a significant change. This is because the definitions of Colless and Sackin indices use internal nodes of a tree. This is the reason we observed a change in the value of indices when we increased the number of sampled tips from 200 to 500. When tips of a tree are increased, the internal nodes are increased as well. For a structured population, we had only 2 demes, it might be interesting to analyse tree balance in a structured population with more than 2 demes. The relationship between timing of bifurcation events and tree structure was assessed using the gamma-statistic values. From densities  Figure 8, those for gamma-statistic are identical regarding to whether the underlying population is structured or not. A general conclusion from Table   7 is that the gamma-statistic values could not be used to uncover the underlying population structure. This could be because the timing of the bifurcation events may be having very less influence on the overall tree shape of a phylogeny.