Likelihood Methods for Basic Stratified Sampling, with Application to Von Bertalanffy Growth Model Estimation

DOI: 10.4236/ojs.2019.96040   PDF   HTML   XML   215 Downloads   446 Views   Citations


This paper mainly addresses maximum likelihood estimation for a response-selective stratified sampling scheme, the basic stratified sampling (BSS), in which the maximum subsample size in each stratum is fixed. We derived the complete-data likelihood for BSS, and extended it as a full-data likelihood by incorporating incomplete data. We also similarly extended the empirical proportion likelihood approach for consistent and efficient estimation. We conducted a simulation study to compare these two new approaches with the existing estimation methods in BSS. Our result indicates that they perform as well as the standard full information likelihood approach. Methods were illustrated using a growth model for fish size at age, including between-individual variability. One of our major conclusions is that the fully observed BSS data, the partially observed data used for stratification, and the sampling strategy are all important in constructing a consistent and efficient estimator.

Share and Cite:

Zheng, N. and Cadigan, N. (2019) Likelihood Methods for Basic Stratified Sampling, with Application to Von Bertalanffy Growth Model Estimation. Open Journal of Statistics, 9, 623-642. doi: 10.4236/ojs.2019.96040.

1. Introduction

In stratified random sampling (SRS), the population or a random sample of the population is partitioned into relatively homogeneous subgroups, or strata, and then random samples are taken independently in each stratum for full observation. Such sampling design may also be regarded as a kind of two-phase sampling, with the population or the large sample before partitioning being the first phase sample, and the smaller and more extensive subsamples after partitioning being the second phase samples.

Practical implementations of SRS frequently fall into two categories as classified by [1]: 1) basic stratified sampling (BSS) where the maximum second phase subsample size (BSS1) or subsampling fraction (BSS2) in each stratum is prefixed, and 2) variable probability sampling (VPS) in which sequential units are independently generated from a model and then classified into strata where they are selected for full observation with pre-specified probabilities. [2] classified BSS2 as VPS, and hence all the inference methods for VPS are also suitable for BSS2.

Assume that there are a total of N sampling units on which the stratified sampling is conducted. Let y i and x i , i = 1 , , N , denote respectively the vectors of responses and covariates of the ith individual generated from the joint distribution f ( y , x | θ ) = g ( y | x ; θ ) h ( x ) , with θ being a vector of all the parameters describing the conditional distribution of y given x . In SRS ( y , x ) are fully observed only for a subset of size n of the N units, which are called complete data in this paper, and only a subset z of ( y , x ) is observed for the other N n units, which are called incomplete data.

In SRS the unobserved elements of y and/or x are missing data, and missingness can be fully accounted for by variable z which is observed for all the N units; that is, the unsampled variables are missing at random (MAR) in the terminology of [3]. In addition, for BSS and VPS, given the observed data, the missing probability for all the missing data is a constant involving no parameters θ . As a result, the likelihood, which is called full information likelihood, is given by (see e.g. [1])

L F ( θ ) = [ i = 1 n f ( y i , x i | θ ) ] [ i = n + 1 N u ( z i | θ ) ] , (1)

where u ( z | θ ) is the density function of z , i = 1 , , n enumerates the second phase complete data, and i = n + 1 , , N enumerates the first phase incomplete data.

If the response y is not involved in the stratification, namely, vector z contains no elements of vector y , u ( z | θ ) = u ( z ) is independent of parameters θ , and the full likelihood L F ( θ ) reduces to i = 1 n g ( y i | x i ; θ ) , which is trivial since neither the sampling scheme nor the covariate distribution h ( x ) needs to be taken into account. In this paper we consider only the SRS where the response y is involved in stratification, which is often referred to as response-selective stratified sampling (RSSS).

In fisheries surveys, length-stratified age sampling (LSAS) is one of the most popular strategies for sampling the age distribution of a fish population. In the first phase of LSAS a large amount of caught fish of a certain species is measured for length, and classified into length strata (e.g. two centimeters, five centimeters). In the second phase a pre-specified small number of fish are randomly selected from each stratum for age measurement. LSAS is BSS, and since growth models generally describe how length increases as a function of age (i.e. length is the response and age is the covariate), it is response-selective. LSAS has been conducted world-wide for several decades. For example, the Canadian Department of Fisheries Oceans (DFO) conducts annual surveys since the 1970’s and uses LSAS for age sampling for many species such as cod and American plaice. Millions of length-at-age data have been accumulated for each species, which are invaluable for fisheries stock assessment and ocean ecosystem studies. In this paper we focus on BSS, with some of the methods and conclusions also applicable to VPS.

[4] suggested to model the age distribution of fish in a survey using the Gamma distribution. [5] also assumed a Gamma age distribution in their hierarchical model of growth for many fish populations, and they showed that parameter estimates did not change much when a more flexible parametric age distribution was used. [6] showed that a flexible Normal mixture distribution for age distribution is more robust to misspecification of the age distribution. Following these studies, in this paper we focus on the case where a valid parametric covariate distribution model is available. For the examples in the simulation studies and real data analysis, we simply use a Gamma distribution for age so that our comparison among various inference approaches is less influenced by numerical issues related to integrating over a complicated covariate distribution.

This is the motivation of this paper. [1] and [7] gave the complete-data likelihood for VPS, which is based solely on the second phase complete data and can be used when the information is not retained for units not selected for full observation. In this study we would like to derive a likelihood function for BSS requiring only the second phase complete data and the total sample size N, which can be used when the first phase BSS data is not available. Some authors ([e.g. [8]) applied a pseudoconditional likelihood approach [1] to LSAS data. We improved this approach to accommodate the first phase incomplete data and the complexities in fisheries LSAS. We conducted simulation studies to compare the new and existing likelihood and pseudolikelihood approaches that have been used or are conveniently applicable to fisheries LSAS. Our purpose is to identify the approaches with the best performance.

The outline of this paper is as follows. In Section 2 we define notations and review the likelihood and pseudolikelihood approaches relevant to this study. In Section 3 we derive the complete-data density function, complete-data likelihood and full-data likelihood for BSS. Application of an empirical proportion approach, which is an improved version of the pseudoconditional likelihood approach, to BSS is explored in Section 4. Results from simulation studies based on a linear model with between-individual (BI) variation and a Von Bertalanffy growth model with BI variation are presented in Section 5 to compare the performance of all these new and existing estimators discussed in this paper. The most promising estimators are then further illustrated in Section 6 by fitting the VonB model with BI variation to growth data for American plaice (Hippoglossoides platessoides) collected by DFO. Some further discussions are provided in Section 7.

2. Notation, Likelihoods and Pseudolikelihoods

Suppose that N units ( y i , x i ) , i = 1,2, , N , are generated from the joint distribution f ( y , x | θ ) . As mentioned previously we always assume that an appropriate parametric covariate distribution is available, then θ here includes not only the parameters describing conditional distribution of response y given covariate x , but also the parameters defining the covariate distribution. The range of ( y , x ) is divided into H exhaustive and mutually exclusive strata S 1 , S 2 , , S H . Denote the probability for ( y , x ) to fall into the hth stratum as Q h ( θ ) , namely,

Q h ( θ ) = Pr { ( y , x ) S h } . (2)

Define the indicator variable

R i = ( 1, if ( y i , x i ) is fully observed , 0, if some information on ( y i , x i ) is missing . (3)

Because BSS2 can be classified as VPS [2], in the following we use BSS specially for BSS1. For BSS we assume that in each stratum S h there are N h units from which n h m h units are randomly selected for full observation of ( y , x ) , with N = h = 1 H N h and n = h = 1 H n h . For the remaining N h n h units the values of ( y , x ) are only partially observed for a subset z . Here m h is the maximum sample size for full observation and

n h = ( N h , if N h < m h , m h , if N h m h . (4)

Although the likelihood for BSS (4) is given by (1), several published studies use other likelihoods, and some of these are described as follows.

[9] studied maximizing the likelihood function i : R i = 1 f c ( y i | x i , R i = 1 ; θ ) for fitting regression models, and called this approach the conditional maximum likelihood. Under the assumption that a valid parametric covariate distribution is available, and the randomness in n h can be neglected for all the strata so that the n h in (4) are always equal to m h in all strata, in the Appendix we show that

f c ( y i , x i | R i = 1 ; θ ) f ( y i , x i | θ ) Q h i ( θ ) if ( y i , x i ) S h i , (5)

and the constant of proportionality does not depend on θ . The conditional likelihood then becomes

L c ( θ ) = i : R i = 1 f ( y i , x i | θ ) Q h i ( θ ) , (6)

which is adopted in [10] and [11].

Weighted pseudo-likelihood estimators have been studied extensively since the 1980’s for problems involving response-selective sampling. For this topic we refer to [12] - [18]. In the most basic and popular version of this approach, the log-likelihood function if all the N units were fully observed, i = 1 N ln f ( y i , x i | θ ) with ln denoting natural logarithm, is estimated by the Horvitz-Thompson (HT) method based on the n units that are actually observed in full,

l w ( θ ) = h = 1 H N h n h i = 1 n h ln f ( y i , x i | θ ) . (7)

Although this weighted log-pseudo-likelihood (7) may provide an unbiased parameter estimating equation, the HT approach is known to be inefficient, and can be seriously so in some situations such as when the sample unit values are not approximately proportional to the inclusion probabilities ([19]; [20], page 103-104).

An approach for addressing this inefficiency issue is to adjust the standard HT weights by using the whole set of incomplete data, namely, those with only a subset of ( y , x ) measured but available for all the N sample units (see e.g. [17] [18] [21]). We call this method the calibrated weighted likelihood approach. As an implementation of the calibrated weighted likelihood approach, in this study we modified the traditional Horvitz-Thompson weights by minimizing the chi-squared distance (see Equation (1.1) in [21]) between the original and modified weights subject to the constraint

i = 1 N R i w i y i = i = 1 N y i , (8)

where w i are the modified weights. Similarly one can also calibrate up to higher order moments or calibrate the empirical distributions by imposing the constraints

j = 1 N R j w j 1 y j y i = j = 1 N 1 y j y i , (9)

where i enumerates all the subjects selected for full observation, and 1 y j y i = 1 if y j y i and 0 otherwise. Nevertheless, these calibration strategies may not produce better estimates than (8) does, according to our simulation studies. Hence, in this paper we only report results with constraint (8). The calibrated weighted likelihood approach under all these constraints can be conveniently implemented with Equation (9) in [17].

In some applications (e.g. [8]) researchers use an approximate density based on variable probability sampling (VPS). In VPS, units are randomly selected for full observation from the N h partially observed units, with subsampling probabilities γ h that vary for each stratum h. The density approximation is based on the empirical subsampling probability γ ^ h = n h / N h (see Equation (2) in [2]),

f ( y , x | BSS ; θ ) f ( y , x | VPS ; θ ) = n h N h f ( y , x | θ ) h = 1 H n h N h Q h . (10)

Parameters are estimated based on the likelihood function defined from (10). Note that with the availability of a valid covariate distribution, a density function similar to (10) can also be constructed for the N n incomplete observations (i.e. those partially observed units). In LSAS there are always some empty strata with N h = 0 but non-negligible occupation probability Q h , which are missing in the denominator of (10). We will address these issues in Section 4 and call the improved likelihood the “empirical proportion (EP) likelihood”.

3. Complete-Data Likelihood for BSS

As mentioned previously, the methods for VPS are applicable to BSS2, and the complete-data likelihood for VPS is given in [1] and [7]. Therefore in this section we only consider BSS1 and refer to BSS1 as BSS for convenience.

We denote dbin ( x , N , p ) and pbin ( x , N , p ) respectively as the binomial probability mass function and cumulative probability, with number of successes x, total number of events N and success probability p. The density function for a unit selected for full observation in BSS is denoted as f B C ( ) with “BC” indicating “BSS complete data”.

Theorem 1. In BSS the density function of a unit ( y , x ) selected for full observation is given by

f B C ( y , x | R = 1 ; θ ) = f ( y , x | θ ) Q h × [ N h = 1 m h 1 N h dbin ( N h , N , Q h ) ] + m h [ 1 pbin ( m h 1, N , Q h ) ] h = 1 H { [ N h = 1 m h 1 N h dbin ( N h , N , Q h ) ] + m h [ 1 pbin ( m h 1, N , Q h ) ] } (11)

if ( y , x ) S h .

The proof of Theorem 1 is given in the Appendix. As suggested in [9], [10], [22] and [23], the BSS complete-data (BC) likelihood can be constructed as

L B C = i : R i = 1 f B C ( y i , x i | R i = 1 ; θ ) . (12)

With the same arguments for deriving (11), the density function for the partially observed units is

f B I ( z | R = 0 ; θ ) = f ( z | θ ) Q h × N h = m h + 1 N ( N h m h ) dbin ( N h , N , Q h ) h = 1 H { N h = m h + 1 N ( N h m h ) dbin ( N h , N , Q h ) } , (13)

where the subscript “BI” denotes “BSS incomplete data”. The summations in (13) may be calculated more efficiently using

N h = m h + 1 N ( N h m h ) dbin ( N h , N , Q h ) = N Q h N h = 1 m h N h dbin ( N h , N , Q h ) m h [ 1 pbin ( m h , N , Q h ) ] . (14)

Densities (11) and (13) incorporate respectively the information of complete data and incomplete data. We anticipate that they together can lead to better inference than using only complete data. The BSS full-data (BF) likelihood is

L B F ( θ ) = h = 1 H [ i = 1 n h f B C ( y i , x i | R i = 1 ; θ ) ] [ i = n h + 1 N h f B I ( z i | R i = 0 ; θ ) ] . (15)

Here and in the remainder of this paper, we enumerate the fully observed units in the hth stratum as 1, , n h , and the partially observed units in the same stratum as n h + 1, , N h .

In some cases only the number of incomplete measurements, ( N h n h ) , in each stratum are known, instead of the measured values of all z i ’s. In this situation we need to integrate out z in (13) and rewrite the likelihood function (15) as

L B F ( θ ) = h = 1 H [ i = 1 n h f B C ( y i , x i | R i = 1 ; θ ) ] × [ N h = m h + 1 N ( N h m h ) dbin ( N h , N , Q h ) h = 1 H { N h = m h + 1 N ( N h m h ) dbin ( N h , N , Q h ) } ] N h n h . (16)

In real data analysis it is important to examine residuals for the fitted model to assess the validity of assumptions. Equation (11) gives the density function for BSS complete data, and can be used to calculate residuals. For simplicity we assume response y to be univariate y. Define the density function of x conditional on R = 1 as

h B C ( x | R = 1 ; θ ) = f B C ( y , x | R = 1 ; θ ) d y .

E ( y | x , R = 1 ) = y f B C ( y , x | R = 1 ; θ ) d y h B C ( x | R = 1 ; θ ) ,

E ( y 2 | x , R = 1 ) = y 2 f B C ( y , x | R = 1 ; θ ) d y h B C ( x | R = 1 ; θ ) ,

Var ( y | x , R = 1 ) = E ( y 2 | x , R = 1 ) [ E ( y | x , R = 1 ) ] 2 .

The standardized residual for the ith observation ( y i , x i ) is

y i E ( y | x i , R = 1 ) Var ( y | x i , R = 1 ) . (17)

The measured data such as length and age are usually discrete, and the above integrations become summations, which are easier to evaluate.

4. Application of Empirical Proportion Approach to BSS

In this section we expand density (10) for application in BSS and especially in LSAS.

Empty strata ( N h = 0 ) always happen with LSAS. For the empty strata in (10), the empirical selection proportions n h / N h ( = 0 / 0 ) are not defined. We need to assign selection probabilities for full and incomplete observations to those unobserved strata. In VPS these selection probabilities may be determined by the maximum likelihood method [10]. For sampling model (4), when N h m h , all the individuals in the hth stratum are selected for full measurement; hence, logically the empirical selection probability is 1 when N h = 0 < m h . We assume that in unobserved strata the probability for full observation is 1, and the probability for incomplete observation is 0. Hence, the empirical proportion (EP) density of the complete data with ( y , x ) fully measured is given by

f EP ( y , x | R = 1 ; θ ) = n h N h f ( y , x | θ ) h = 1 H o b s n h N h Q h + h = H o b s + 1 H t o t a l Q h . (18)

Here h = 1, , H o b s enumerate the strata with data observed, and h = H o b s + 1, , H t o t a l enumerate the strata without data. H t o t a l is the total number of strata with nonnegligible occupation probabilities Q h (see Equation (2)).

Similarly, we can include information from the incomplete observations using their EP density,

f EP ( z | R = 0 ; θ ) = N h n h N h f ( z | θ ) h = 1 H o b s N h n h N h Q h . (19)

Here, without loss of generality, we assume that z falls in the hth stratum. For an unobserved stratum h, since we have defined its proportion for full observation n h / N h = 1 , its proportion for partial observation ( N h n h ) / N h = 0 . The EP likelihood function then has the form

L EP ( θ ) = h = 1 H [ i = 1 n h f EP ( y i , x i | R i = 1 ; θ ) ] [ i = n h + 1 N h f EP ( z i | R i = 0 ; θ ) ] . (20)

If only the number of incomplete observations in each stratum is reported without knowing the z values, z in (19) needs to be integrated out and the likelihood (20) becomes

L EP ( θ ) = h = 1 H [ i = 1 n h f EP ( y i , x i | R i = 1 ; θ ) ] [ Q h h = 1 H o b s N h n h N h Q h ] N h n h . (21)

5. Simulation Study

In this section we examine the performance of the inference approaches for BSS described in the previous sections. We use two simple examples: a linear model with between individual (BI) variation, and a nonlinear Von Bertalanffy (VonB) growth model with BI variation. The simulation setup is as follows.

5.1. Linear Model with BI Variation

The linear model with BI variation is

Y = a + B X + ε , (22)

where B N ( μ b , σ b 2 ) , X N ( μ x , σ x 2 ) and ε N ( 0, σ ε 2 ) . Capital letter B denotes the random effect of BI variation. We randomly generated N = 5000 ( x i , y i ) pairs, i = 1 , , N , from model (22). The parameters of the model were chosen as a = 0.5 , μ b = 0.2 , 0.5 and 1.0, σ b = 1.0 , μ x = 1.0 , σ x = 5.0 , and σ ε = 0.7 . Here we selected a small intercept a so that the issues with the relative performance in its estimation as defined by (25) can be clearly seen. Slope is an important parameter in linear model. Hence we selected small, moderate and large values for its mean μ b and a relatively large standard deviation (SD) σ b to test different approaches in identifying the slope under various situations. The mean μ x and SD σ x for covariate X are chosen so that the spread of the covariate allows reasonable estimates of the model parameters. We adopted a moderate error SD ( σ ε ) relative to the other parameters. We stratified the data by length (Y) bins of size 2 and randomly selected a maximum of 15 units per length stratum to keep their X values, and dropped the X values of the other units not selected. This sampling design is close to the LSAS of fishery surveys that we would like to address in this study.

5.2. VonB Growth Model with BI Variation

The VonB model is a commonly used growth model in fisheries science (e.g. [24]). The basic VonB model is given by

y ( a ) = l ( 1 e k ( a a 0 ) ) , (23)

where y ( a ) denotes length at age a , l is the maximum possible size (as a ), k is the growth rate parameter, and a 0 ( < 0 ) is the theoretical age at which the fish would have had zero length. Variation in growth is also important for population and community dynamics (e.g. [25]). Not accounting for individual variation in growth may lead to bias in estimating the population mean growth parameters and length at age, as noted by [26] and [27]. The VonB model with BI variation follows [11],

Y = μ ( A ) + ε , (24)

where Y is the measured length, μ ( A ) = l ( 1 e k ( A a 0 ) ) , A G a m m a ( α , β ) and ε N ( 0, [ CV × μ ( A ) ] 2 ) . The error ε here in fact includes both BI variation and Y observation error.

We randomly generated N = 5000 ages from a gamma distribution with Case 1: ( α , β ) = ( 3.643 , 1.225 ) , and Case 2: ( α , β ) = ( 11.227 , 0.641 ) . α and β are determined by matching the mean = α β and variance = α β 2 with those of the age data for American plaice that we have been investigating. Case 1 represents a younger population with mean age = 4.46 and variance = 5.47, while case 2 represents an older population with mean = 7.20 and variance = 4.61. Case 1 has a broad age distribution close to the origin, and case 2 has a narrower distribution of ages. Lengths were then generated from model (24) with l = 70 , k = 0.2 , a 0 = 0.07 and CV = 0.2 . We stratified the data by length classes of size 2 and randomly sampled a maximum of 15 units per length stratum to keep their ages and dropped all the other ages not selected.

5.3. Estimation Performance

Relative biases (RBias), relative standard errors (RSE), and relative square root mean squared errors (RRMSE) are defined as

RBias = 100 × Estimate Truevalue | Truevalue | , RSE = 100 × Standarderror | Truevalue | , and RRMSE = 100 × MSE | Truevalue | . (25)

We derived these values using 500 simulations for the full information likelihood (1), conditional likelihood (6), weighted likelihood (7), calibrated weighted likelihood, complete-data likelihood (12), full-data likelihood (15), and EP likelihood (20) (see Tables 1-4). We also include the “random approach” based on maximizing the likelihood

Table 1. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22). μ b = 0.2 .

Table 2. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22). μ b = 0.5 .

Table 3. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the linear model with BI variation (22). μ b = 1.0 .

L R = i = 1 n f ( y i , x i | θ ) (26)

as a reference point to see the difference between considering BSS and totally ignoring BSS.

For the linear model with BI variation (22), Tables 1-3 indicate that the full information, full-data and EP likelihood approaches have quite close performance, and in general they perform substantially better than all the other approaches in terms of RBias, RSE and RRMSE for all estimated parameters. The weighted likelihood (WL) and calibrated WL approaches have close performance, and there is no evidence that calibration improves the estimation; that is,

Table 4. Relative bias (RBias), relative standard error (RSE) and relative square root mean squared error (RRMSE) of the estimates from various approaches for the parameters in the VonB model with BI variation (24). Case 1: ( α , β ) = ( 3.643 , 1.225 ) . Case 2: ( α , β ) = ( 11.227 , 0.641 ) .

in some cases the calibrated WL has a little smaller RRMSEs than WL, and in the other cases the reverse happens, but the differences have no clear pattern, and are too small to draw reliable conclusions. Similarly, even though there is some difference in performance between the complete-data likelihood approach and the two WL approaches, it is not clear which method performs better. The two WL approaches have smaller RRMSEs for a and σ ε estimation, while the complete-data likelihood approach has smaller RRMSEs for other parameter estimation. The conditional likelihood approach based on (6) performs the worst among all the approaches in this study except the random approach. Especially for μ x , σ x , μ b and σ b estimation, its RRMSEs are more than twice of those from the complete-data likelihood approach. Nevertheless, the conditional likelihood approach performs substantially better than the random approach.

Simulation results presented in Table 4 provide a comparison of the various estimation approaches for a nonlinear VonB model with BI variation. The outcomes for this nonlinear case are similar to the linear case just described. The full information, full-data and EP likelihood approaches have only tiny differences in performance, and in general perform better than the other approaches. The WL and calibrated WL approaches have almost identical performance. The complete-data likelihood approach has close performance as the two WL approaches and it is not clear which method is better. In Case 1 the conditional likelihood approach performs better than the random approach, but worse than all the other approaches including the complete-data approach. In Case 2 its performance is much worse than all the approaches including the random approach. Actually, the conditional likelihood approach failed for this case because it did not converge in 107 of the 500 simulations. All the methods in this study cannot estimate a 0 well, with large RBias, large RSEs, and hence large RRMSEs. In practice we suggest to borrow information from other studies such as larvae studies to fix a 0 , or equivalently to fix length at age 0, for the VonB model.

6. Real Data Analysis

The simulation study indicates that the full information likelihood (1), full-data likelihood (15) and EP likelihood (20) approaches perform better than the other estimation methods. In this section we apply these three approaches to fit the VonB model (24) using a dataset collected by DFO in NAFO Division 3N during the spring of 2011. Here we consider only female American plaice because males and females follow different growth models.

The LSAS within each Division involved measuring the length of all fish caught in research trawl tows, classifying them into 2 cm length strata, and subsampling a few or no otoliths from each length stratum. The sampling goal in each Division was to obtain about 25 age measurements per 2 cm length stratum by sex if length 10 cm , and about 15 age measurements per stratum without sex distinguishment if length < 10 cm .

Parameter estimates (ESTs) and the corresponding standard errors (SEs) are provided in Table 5. The three estimation approaches give similar values for all the parameters and SEs, which agrees with their close performance in the simulation study. For comparison, we also included estimates from the random approach (26), which result in a substantially larger value for l and a smaller value for k. The standard errors of the estimates from the random approach are also larger, especially for l .

Applying (17), we obtained the standardized residuals of the second phase complete data for all approaches, whose box-and-whisker plots by age are shown in Figure 1. The standardized residuals from the full information likelihood, EP

Figure 1. Box-and-whisker plots of standardized residuals vs. age from fitting the VonB model with BI variation (24) to the American plaice data from DFO 2011 Spring survey in NAFO Division 3N by the four likelihood approaches: full information likelihood (Full information), empirical proportion likelihood (EP), full-data likelihood (Full-data) and random sample assumption based likelihood (Random). The black dots are the medians. The boxes indicate the lower and upper quartiles. The ends of the whiskers represent the lowest datum still within 1.5 IQR (interquartile range) of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile.

Table 5. Parameter estimates (EST) and standard errors (SE) for the VonB model with between-individual variation (24).

likelihood and full-data likelihood approaches do not indicate bias in fitted mean length at age from the data mean along the full range of age. The standardized residuals from the random approach (26) exhibit clear bias to negative values at ages larger than about 12, indicating over-estimation of l . In Figure 1 the interquartile range (IQR, the box) of the residuals is much larger at younger ages (≤4) compared to older ages (≥9). The standard deviation (SD) of the standardized residuals at each age is supposed to be 1. However, the calculated SDs (results not shown) transfers from being greater than 1 at younger ages (≤4) to being mainly smaller than about 0.6 at older ages (≥9). These suggest two problems with the model: 1) the BI variation model in (24) under-estimates the variation at shorter lengths and vice-versa at longer lengths for this data, and 2) due to reproduction, the juvenile female American plaice follows a different growth model from the adult female American plaice, which is neglected by the current model.

7. Discussion

We derived the density function (11) for BSS (basic stratified sampling) complete data, and constructed the complete-data likelihood (12), which allows statistical inference when the incomplete data are not well retained. The complete-data density can also be used for standardized residual calculation as discussed in Section 3. Residuals are important for validation of fitted models.

Both the complete-data likelihood approach and the random approach make use of only the complete data. The complete approach performs substantially better than the random approach in the simulation studies, indicating the importance of correctly incorporating the sampling scheme in the inference methods. The conditional likelihood (6) accounts for the sampling scheme approximately by ignoring the randomness in n h in all the strata. Therefore its performance lies between the random and the complete-data likelihood approaches in almost all the cases in the simulation study. However in some BSS sampling projects where the number of strata is small and the maximum subsample size m h for each stratum can usually be obtained, then the conditional likelihood (6) is appropriate.

Another method to incorporate the sampling scheme is to use the count information of the incomplete data in each stratum, as in the weighted likelihood (WL) and calibrated WL approaches. Even though in the simulation study the two methods of accounting for the sampling scheme, namely the complete-data likelihood and the (calibrated) WL approaches, have comparable performance, the complete-data likelihood requires an appropriate distribution model for covariates, which can limit its application. The WL and calibrated WL approaches are not subject to this restriction, and hence can be more practical.

A full utilization of the information in incomplete data is to incorporate the density function of the incomplete data in the likelihood. In this regard, we proposed two new likelihoods for BSS, namely, the full-data likelihood and the empirical proportion (EP) likelihood. If the covariate distribution can be properly modeled, the two new approaches perform as well as the standard full information likelihood approach, and they all perform substantially better than the other methods covered in this study. This result suggests the significance of the information in the incomplete data.

On the whole this study indicates that the complete data, the incomplete data, and the sampling scheme are all important for a consistent and efficient statistical inference from BSS data.

In this work we found that the EP likelihood approach, which was originally proposed for the variable probability sampling (VPS), works well (or the best together with the full-data and full information likelihood approaches) for BSS data. Its merits will further show up when covariates cannot be modeled effectively. This work is under the condition that a valid covariate distribution model is available, which may be a strong assumption in practice. We will explore the case when no appropriate covariate distribution model is available in another paper.


Research funding to Nan Zheng was provided by the Ocean Frontier Institute, through an award from the Canada First Research Excellence Fund. Funding was also provided by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grant to NC and NC’s Ocean Choice International Industry Research Chair program at the Marine Institute of Memorial University of Newfoundland.

Appendix: Proof of Theorem 1

Without loss of generality, we assume that ( y , x ) S h , then

f ( y , x | R = 1 ; θ ) = Pr ( ( y , x ) S h | R = 1 ; θ ) Pr ( y , x | ( y , x ) S h , R = 1 ; θ ) .

Since the selection for full observation is random given ( y , x ) S h ,

Pr ( y , x | ( y , x ) S h , R = 1 ; θ ) = Pr ( y , x | ( y , x ) S h ; θ ) = f ( y , x | θ ) Q h ,

and we have

f ( y , x | R = 1 ; θ ) = Pr ( ( y , x ) S h | R = 1 ; θ ) f ( y , x | θ ) Q h = n h = 0 m h Pr ( n h | R = 1 ; θ ) Pr ( ( y , x ) S h | n h , R = 1 ; θ ) f ( y , x | θ ) Q h , (27)

where n h is the sample size in the hth stratum as defined by (4).

Pr ( ( y , x ) S h | n h , R = 1 ; θ ) n h , that is, the probability for a selected unit to be in a stratum h is proportional to the number of vacancies in the stratum h. Also, Pr ( n h | R = 1 ; θ ) = Pr ( n h | θ ) , namely, the event {a unit is selected without any further information about its ( y , x ) } is independent of the event {there are n h units that are selected in the stratum h}.

Pr ( n h | θ ) = ( dbin ( N h , N , Q h ) , if N h < m h and hence n h = N h , 1 pbin ( m h 1, N , Q h ) , if N h m h and hence n h = m h .

Hence, when ( y , x ) S h ,

f ( y , x | R = 1 ; θ ) f ( y , x | θ ) Q h × { [ N h = 1 m h 1 N h dbin ( N h , N , Q h ) ] + m h [ 1 pbin ( m h 1, N , Q h ) ] } ,

which can be normalized into (11).

Note that in the case Pr ( n h = m h | θ ) = 1 for all the strata h = 1 , , H , Pr ( ( y , x ) S h | m h , R = 1 ; θ ) = m h / h = 1 H m h , which is a constant independent of θ . Then (27) leads to f ( y , x | R = 1 ; θ ) f ( y , x | θ ) / Q h , which proved (5).

Conflicts of Interest

The authors declare no conflicts of interest.


[1] Lawless, J.F., Kalbfleisch, J.D. and Wild, C.J. (1999) Semiparametric Methods for Response-Selective and Missing Data Problems in Regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61, 413-438.
[2] Jewell, N.P. (1985) Least Squares Regression with Data Arising from Stratified Samples of the Dependent Variable. Biometrika, 72, 11-21.
[3] Rubin, D.B. (1976) Inference and Missing Data. Biometrika, 63, 581-592.
[4] Cope, J.M. and Punt, A.E. (2007) Admitting Ageing Error When Fitting Growth Curves: An Example Using the Von Bertalanffy Growth Function with Random Effects. Canadian Journal of Fisheries and Aquatic Sciences, 64, 205-218.
[5] Cadigan, N.G. and Campana, S.E. (2017) Hierarchical Model-Based Estimation of Population Growth Curves for Redfish (Sebastes mentella and Sebastes fasciatus) off the Eastern Coast of Canada. ICES Journal of Marine Science, 74, 687-697.
[6] Dey, R., Cadigan, N. and Zheng, N. (2019) Estimation of the von Bertalanffy Growth Model When Ages Are Measured with Error. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68, 1131-1147.
[7] Carroll, R.J., Ruppert, D., Stefanski, L.A. and Crainiceanu, C.M. (2006) Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, Boca Raton.
[8] Candy, S.G., Constable, A.J., Lamb, T. and Williams, R. (2007) A von Bertalanffy Growth Model for Toothfish at Heard Island Fitted to Length-at-Age Data and Compared to Observed Growth from Mark-Recapture Studies. CCAMLR Science, 14, 43-66.
[9] Scott, A.J. and Wild, C.J. (2011) Fitting Regression Models with Response-Biased Samples. Canadian Journal of Statistics, 39, 519-536.
[10] Hausman, J.A. and Wise, D.A. (1982) Stratification on Endogenous Variables and Estimation: The Gray Income Maintenance Experiment. In: Manski, C. and McFadden, D., Eds., Structural Analysis of Discrete Data: With Econometric Applications, Chapter 10, MIT Press, Cambridge, 365-391.
[11] Piner, K.R., Lee, H.-H. and Maunder, M.N. (2016) Evaluation of Using Random-at-Length Observations and an Equilibrium Approximation of the Population Age Structure in Fitting the von Bertalanffy Growth Function. Fisheries Research, 180, 128-137.
[12] Hsieh, D.A., Manski, C.F. and McFadden, D. (1985) Estimation of Response Probabilities from Augmented Retrospective Observations. Journal of the American Statistical Association, 80, 651-662.
[13] Scott, A.J. and Wild, C.J. (1986) Fitting Logistic Models under Case-Control or Choice Based Sampling. Journal of the Royal Statistical Society. Series B (Methodological), 48, 170-182.
[14] Kalbeisch, J.D. and Lawless, J.F. (1988) Estimation of Reliability in Field-Performance Studies. Technometrics, 30, 365-378.
[15] Kalbeisch, J.D. and Lawless, J.F. (1988) Likelihood Analysis of Multi-State Models for Disease Incidence and Mortality. Statistics in Medicine, 7, 149-160.
[16] Whittemore, A.S. (1997) Multistage Sampling Designs and Estimating Equations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59, 589-602.
[17] Breslow, N.E., Lumley, T., Ballantyne, C.M., Chambless, L.E. and Kulich, M. (2009) Improved Horvitz-Thompson Estimation of Model Parameters from Two-Phase Stratified Samples: Applications in Epidemiology. Statistics in Biosciences, 1, 32-49.
[18] Saegusa, T. and Wellner, J.A. (2013) Weighted Likelihood Estimation under Two-Phase Sampling. Annals of Statistics, 41, 269.
[19] Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994) Estimation of Regression Coeficients When Some Regressors Are Not Always Observed. Journal of the American Statistical Association, 89, 846-866.
[20] Thompson, S. (2012) Sampling. John Wiley & Sons, Hoboken.
[21] Wu, C.B. (2003) Optimal Calibration Estimators in Survey Sampling. Biometrika, 90, 937-951.
[22] Breslow, N.E. and Cain, K.C. (1988) Logistic Regression for Two-Stage Case-Control Data. Biometrika, 75, 11-20.
[23] Pfeffermann, D. and Sverchkov, M. (1999) Parametric and Semi-Parametric Estimation of Regression Models Fitted to Survey Data. Sankhyā: The Indian Journal of Statistics, Series B, 61, 166-186.
[24] Quist, M.C., Pegg, M.A. and DeVries, D.R. (2012) Age and Growth. Fisheries Techniques. 3rd Edition, American Fisheries Society, Bethesda, 677-731.
[25] Shelton, A.O., Satterthwaite, W.H., Beakes, M.P., Munch, S.B., Sogard, S.M. and Mangel, M. (2013) Separating Intrinsic and Environmental Contributions to Growth and Their Population Consequences. The American Naturalist, 181, 799-814.
[26] Sainsbury, K.J. (1980) Effect of Individual Variability on the von Bertalanffy Growth Equation. Canadian Journal of Fisheries and Aquatic Sciences, 37, 241-247.
[27] Wang, Y.-G., Thomas, M.R. and Somers, I.F. (1995) A Maximum Likelihood Approach for Estimating Growth from Tag-Recapture Data. Canadian Journal of Fisheries and Aquatic Sciences, 52, 252-259.

comments powered by Disqus

Copyright © 2020 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.