A Comparison of Results from Two Sampling Approaches in the South African National HIV Prevalence, Incidence and Behavior Survey, 2012

Background: South Africa implements variations of second generation surveillance surveys to monitor the human immunodeficiency virus (HIV) epidemic. Objective: This paper compares HIV estimates from two design variations, the take all approach and the sub-sampling approach, to ascertain whether any changes in the HIV epidemic are due to methodological changes or to the inherent evolution of the epidemic. Methods: A multi-stage stratified cluster sample of 1000 census enumerator areas was implemented, with 15 households systematically sampled within each census enumerator area. In each household, every member was invited to participate (take all approach). To compare with the previous survey designs, a sub-sampling approach of at most four people from each household was implemented by randomly sampling one person from each age group: <2 years, 2 - 14 years, 15 - 24 years and 25 years and above. [...] The design effects in the take all approach were also slightly higher than those obtained in the sub-sampling. The overall synthetic measure of homogeneity for both methods was ρ = 0.10. Conclusion: As the household size increases, the number of people living with HIV in each household increases, thus increasing intraclass correlation. The similarity of the resulting HIV estimates is reassuring. However, the take all approach is preferable to the sub-sampling approach as it allows for detailed analyses of HIV data, such as estimating discordance between sexual partners and parent-child pairs.


Introduction
Reliable population-level data on human immunodeficiency virus (HIV) and its associated determinants are crucial to understanding the dynamics of HIV.
Countries with generalized HIV epidemics obtain estimates from surveillance systems such as antenatal care surveillance surveys. These surveillance systems have their limitations [1] [2]. Population based second generation surveillance surveys have been used repeatedly by a number of countries to monitor the HIV epidemic and are now considered a gold standard [2] [3]. These surveys address some of the weaknesses encountered in antenatal care surveillance surveys [2] [4] and thus greatly enhance surveillance systems [5]. Second generation surveillance surveys combine behavioural, socio-economic and biomedical data, which together provide greater explanatory power to assess the HIV epidemic in a country [6].
In South Africa, several small-scale focused HIV surveys have been conducted [7] [8] [9]. In the past 10 years the Human Sciences Research Council (HSRC) has conducted a series of population based second generation surveillance surveys [10] [11] [12] [13]. The sampling design of such surveys varies from country to country, with varying challenges [14] [15].
The design of the South African second generation surveys has varied from one wave of the survey to another over the years. In 2002 and 2005, one person was randomly selected in each sampled household from each of three age groups: 2 to 14 years, 15 to 24 years, and 25 years and above [10] [11]. In 2008, persons younger than two years were also included in the survey [12]. This resulted in a sample of at most four people in each household from four distinct age groups, that is, younger than 2 years, 2 to 14 years, 15 to 24 years, and 25 years and above. In 2012, all household members were eligible to participate in the survey [13].
In analyses of surveys where inference is drawn on trends over time, it is important to ascertain whether any differences in the results are true differences or an artifact of variations in methodological design. For example, inclusion or exclusion of high risk groups in some settings can lead to biased estimates [5]. The objective of this paper is to compare the HIV prevalence estimates obtained when all persons in the sampled households were invited to participate [13] with those obtained when one person is randomly sampled in each age group (younger than two years, 2 to 14 years, 15 to 24 years, and 25 years and above), as was done in the previous survey of 2008. The comparison uses the population-based data of the South African 2012 national HIV prevalence survey reported in Shisana, Rehle, Simbayi, et al. 2014 [13].

Design of Sampling Frame
Complex survey designs involve a combination of design components, including stratification, multistage sampling, and selection with unequal probabilities or weighting. The design of the South African national HIV household surveys is complex and based on a multi-stage stratified cluster sample design. A random sample of 1000 census enumeration areas (EAs), drawn from a national database of 86,000 EAs used during the 2001 census [16], served as the Master Sample of primary sampling units. The master sample was explicitly stratified by province and locality type of the EAs. Locality types were urban formal, urban informal, rural formal (including commercial farms) and rural informal (tribal authority areas). In the formal urban areas, race was used as an additional stratification variable [13]. In each sampled EA, a cluster of 15 households was randomly sampled to form the secondary sampling units. In each sampled household, all persons residing at the household, including visitors who had spent the previous night there, were invited to participate; this is referred to as the "take all" approach.
Because the "take all" approach captures every household member, the designs of the previous HSRC surveys can be reproduced from the same database. Using the captured data, individuals within each household were grouped by age group as presented in Figure 1. Within each age group in each household, a sampling scheme was implemented using PROC SURVEYSELECT in Statistical Analysis System (SAS) version 9.3 (SAS Institute) to randomly select one person to be considered in further analysis ("sub-sampling"). The sub-sampled data (three-stage sampling of EAs, dwellings and sub-sampled persons) mimic the previous HSRC survey designs for all practical purposes. Under this scheme, if the sampled person had refused to participate, the result was recorded as a refusal in the sub-sampled data.
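The sub-sampling step can be illustrated with a minimal Python sketch. It is not the authors' SAS code: the record keys ('household', 'person', 'age'), the seed, and the toy roster are hypothetical, and the age group boundaries assume age is recorded in completed years.

```python
import random
from collections import defaultdict

# Age groups used for sub-sampling: (label, lower bound inclusive, upper bound exclusive).
AGE_GROUPS = [("<2", 0, 2), ("2-14", 2, 15), ("15-24", 15, 25), ("25+", 25, 200)]


def age_group(age):
    """Return the label of the age group that contains `age` (completed years)."""
    for label, low, high in AGE_GROUPS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")


def sub_sample(roster, seed=2012):
    """Select one person at random from each (household, age group) cell.

    `roster` is a list of dicts with hypothetical keys 'household', 'person'
    and 'age'. Each selected record gains 'cell_size', the number of roster
    members in its cell, i.e. the inverse of its within-cell selection
    probability.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    cells = defaultdict(list)
    for rec in roster:
        cells[(rec["household"], age_group(rec["age"]))].append(rec)
    selected = []
    for key in sorted(cells):  # deterministic iteration order over cells
        members = cells[key]
        chosen = dict(rng.choice(members))  # copy so the roster is not mutated
        chosen["cell_size"] = len(members)
        selected.append(chosen)
    return selected


# Toy roster: household 1 has five members, household 2 has one.
roster = [
    {"household": 1, "person": "a", "age": 1},
    {"household": 1, "person": "b", "age": 9},
    {"household": 1, "person": "c", "age": 12},
    {"household": 1, "person": "d", "age": 31},
    {"household": 1, "person": "e", "age": 40},
    {"household": 2, "person": "f", "age": 19},
]
sample = sub_sample(roster)
for rec in sample:
    print(rec)
```

Household 1 contributes three people (one each from the <2, 2-14 and 25+ groups) and household 2 contributes one, so no household can contribute more than four.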
A consequence of implementing a complex survey design is that the sampling errors of the survey estimates cannot be computed using the formulae found in standard statistical texts, since those formulae assume independently and identically distributed random variables. Variance estimation methods appropriate for complex sample designs are used instead [17] [18]; the resulting variances are often larger than those obtained from standard formulae. A design effect, defined as the ratio of the complex variance estimate to the variance obtained from standard formulae, is computed to shed light on the precision of survey estimates under the two designs [19]. In this design, a household is considered a cluster, and a synthetic measure of homogeneity within clusters (ρ) is computed to measure the level of homogeneity within households for each determinant [20].
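The two quantities defined above can be sketched in Python as follows. The sketch assumes Kish's synthetic formula, ρ = (deff − 1) / (m̄ − 1), with m̄ the average number of sampled persons per household; the paper does not state the exact formula it used, and all input numbers are illustrative, not survey results.

```python
def srs_variance(p, n):
    """Variance of a proportion under simple random sampling (standard formula)."""
    return p * (1.0 - p) / n


def design_effect(var_complex, var_srs):
    """Ratio of the complex-design variance to the simple random sampling variance."""
    return var_complex / var_srs


def rate_of_homogeneity(deff, mean_cluster_size):
    """Synthetic measure of homogeneity, assuming Kish's formula (deff - 1) / (m - 1)."""
    if mean_cluster_size <= 1:
        raise ValueError("mean cluster size must exceed 1")
    return (deff - 1.0) / (mean_cluster_size - 1.0)


# Illustrative inputs only (hypothetical, not taken from the survey).
p_hat = 0.12               # estimated prevalence
n = 1000                   # number of respondents
var_complex = 0.0001584    # variance from a complex-design estimator
mean_household_sample = 3.0

var_srs = srs_variance(p_hat, n)
deff = design_effect(var_complex, var_srs)
roh = rate_of_homogeneity(deff, mean_household_sample)
print(f"deff = {deff:.2f}, roh = {roh:.2f}")
```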

Weighting and Benchmarking of the Sample
Owing to the multi-stage stratified sampling design of the survey, some individuals have a greater or lesser probability of being selected than others. To correct for potential bias due to unequal sampling probabilities, sample weights were introduced at the EA, household, and individual levels, and were also adjusted for non-response. In the take all approach, the final sampling weight was equal to the final EA weight multiplied by the final VP sampling weight, adjusted for individual non-response. In the sub-sampling approach, the final sampling weight was equal to the final EA weight multiplied by the final VP sampling weight and by the sampling weight of each person within their age group in the household, adjusted for individual non-response. Thus, the sampling weights corrected for the unequal number of household members within each age group.
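As a rough illustration of how the two final weights differ, the Python sketch below multiplies the components named above. It assumes that the within-household person weight equals the number of household members in the person's age group (the inverse of a 1-in-k selection) and that non-response enters as a multiplicative factor; the function names and numbers are hypothetical.

```python
def take_all_weight(ea_weight, vp_weight, nonresponse_adj=1.0):
    """Final weight in the take all approach: EA weight x VP weight x adjustment."""
    return ea_weight * vp_weight * nonresponse_adj


def sub_sample_weight(ea_weight, vp_weight, cell_size, nonresponse_adj=1.0):
    """Final weight in the sub-sampling approach.

    `cell_size` is the number of household members in the selected person's
    age group; selecting 1 of `cell_size` gives a person weight of `cell_size`
    (an assumption about how the person-level weight was formed).
    """
    return ea_weight * vp_weight * cell_size * nonresponse_adj


# Hypothetical component weights for one respondent.
w_all = take_all_weight(ea_weight=86.0, vp_weight=10.0, nonresponse_adj=1.25)
w_sub = sub_sample_weight(ea_weight=86.0, vp_weight=10.0, cell_size=3, nonresponse_adj=1.25)
print(w_all, w_sub)
```

A person selected from an age group with three members in the household therefore carries three times the weight they would carry under the take all approach, all else being equal.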
The final individual weights were benchmarked to 2012 mid-year population estimates by age, race, sex, and province [21]. This process produced a final sample representative of the population in South Africa for sex, age, race, and province.
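Benchmarking of this kind can be sketched as a simple post-stratification, in which the weights in each cell are rescaled so that they sum to the known population total for that cell. The sketch assumes fully cross-classified cells; the paper does not say whether cross-classification or an iterative method such as raking was used, and the cell labels and totals below are hypothetical.

```python
from collections import defaultdict


def benchmark(records, population_totals):
    """Rescale weights so that each cell's weights sum to its population total.

    `records` is a list of dicts with hypothetical keys 'cell' and 'weight';
    `population_totals` maps each cell label to a known population count.
    """
    weight_sums = defaultdict(float)
    for rec in records:
        weight_sums[rec["cell"]] += rec["weight"]
    benchmarked = []
    for rec in records:
        factor = population_totals[rec["cell"]] / weight_sums[rec["cell"]]
        benchmarked.append({**rec, "weight": rec["weight"] * factor})
    return benchmarked


# Hypothetical cells; in practice a cell would combine age, race, sex and province.
records = [
    {"cell": "F 15-24", "weight": 100.0},
    {"cell": "F 15-24", "weight": 300.0},
    {"cell": "M 15-24", "weight": 250.0},
]
totals = {"F 15-24": 500.0, "M 15-24": 200.0}
adjusted = benchmark(records, totals)
for rec in adjusted:
    print(rec)
```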

Data Capturing, Management and Analysis
Survey data from questionnaires were double entered and verified by the Data

Results
In total 15,000 households were sampled from 1000 EAs. The results in Table 1 show that there are no considerable differences in the questionnaire and HIV testing response rates between the take all and sub-sampling approaches. The differences in all other determinants such as age, race, sex and geotype are less than two percent.
In the take all approach, the household size ranged from 1 to 18 people, whilst in the sub-sample it ranged from 1 to 4 people since at most four people could be sampled from each household. The crude percentage of people infected with HIV per household size was practically similar between the take all approach and the sub-sampling approach (Figure 2). The percentages of households that had at least one person infected with HIV, however, differ consistently and systematically between the two approaches. As the household size increases, the likelihood that at least one person is infected with HIV increases consistently, and this was more pronounced in the take all approach than in the sub-sampling approach. This indicates an increased likelihood of similar HIV positive status among individuals from the same household in the take all approach compared to the sub-sampling approach.
Table 2 presents key comparison results between the two sampling approaches when assessing the validity of the HIV results. The HIV estimates from the two methods are very comparable, with no consistent pattern in any direction. These results agree with the consistently similar proportions of HIV positives between the two methods in Figure 2. However, the estimates from the sub-sampling approach are more variable than those from the take all approach. The design effects in the take all approach are also slightly higher than those obtained in the sub-sampling. The design effects vary proportionally with the synthetic measure of homogeneity (ρ), indicating a higher intraclass correlation within households with respect to HIV in the take all approach compared to the sub-sampling approach. The overall synthetic measure of homogeneity for both methods is ρ = 0.10.
The HIV prevalence estimate is slightly higher in the take all approach (12.2%) than in the sub-sampling approach (11.6%). All other estimates for age, sex, race, geotype and province are in the same direction and order of magnitude (Table 2).

Conclusions
This paper compares the results from the take all approach used in the 2012 survey [13] with those from the sub-sampling design implemented in the previous surveys. The calculated response rate was similar for both methods. The findings show that the estimates of HIV are comparable for all key determinants. However, the estimates from sub-sampling are more variable than those from the take all approach. This could be a function of the cluster (household) sample size and the intraclass correlation within each cluster, since high correlation reduces the effective sample size. In generalised epidemic settings like South Africa, the risk of HIV infection is likely to be clustered within households [22] due to both heterosexual transmission among sexual partners within households and vertical transmission to their children. The overall estimates of intraclass correlation and design effect are similar for both methods.
However, for various determinants, the estimates for intraclass correlation and design effects are moderately higher for the take all approach than sub-sampling.
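The link between household sample size, intraclass correlation and effective sample size can be made concrete with a short Python sketch. It assumes the usual Kish approximation deff = 1 + (m̄ − 1)ρ with households as the only clusters, uses the overall ρ = 0.10 reported here, and takes hypothetical values for the number of respondents and the average number of persons sampled per household.

```python
def kish_design_effect(mean_cluster_size, rho):
    """Approximate design effect from clustering: 1 + (m - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * rho


def effective_sample_size(n, deff):
    """Size of a simple random sample with the same precision as n clustered cases."""
    return n / deff


RHO = 0.10          # overall synthetic measure of homogeneity reported in the paper
N_PERSONS = 1000    # hypothetical number of respondents

# Hypothetical average numbers of persons sampled per household.
deff_take_all = kish_design_effect(mean_cluster_size=4.0, rho=RHO)
deff_sub = kish_design_effect(mean_cluster_size=2.0, rho=RHO)

n_eff_take_all = effective_sample_size(N_PERSONS, deff_take_all)
n_eff_sub = effective_sample_size(N_PERSONS, deff_sub)
print(round(n_eff_take_all), round(n_eff_sub))
```

For a fixed number of respondents and the same ρ, larger household samples give a larger design effect and hence a smaller effective sample size. In practice the take all approach also interviews more people in total, which this sketch does not model.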
The comparison of the two methods is subject to some limitations. The sub-sampling arm of the study is conditional on the household roster obtained under the take all alternative. In an actual implementation of the sub-sampling method, there could be non-coverage and differential non-response. Thus the simulated experiment might not exactly replicate the outcome that would be obtained under two real survey conditions.
In conclusion, the two approaches yield similar results for all practical purposes. However, even though its higher intraclass correlation results in a smaller effective sample size, the take all approach is preferable to the sub-sampling approach. The take all approach allows for further analyses of the data, such as estimating discordance between sexual partners and parent-child pairs.