In applications of Tobit regression models, we often encounter data sets that contain many variables, only a few of which contribute to the model. Estimating the “non-effective” variables therefore wastes samples in inference. In this paper, we use a sequential procedure to construct a fixed size confidence set for the “effective” parameters of the model, based on an adaptive shrinkage estimate, so that the “effective” coefficients can be identified efficiently with the minimum sample size under the Tobit regression model. A fixed design is considered in the numerical simulation.
The Tobit regression model is also called a sample selection model or a restricted dependent variable model; see [
The rest of this paper is organized as follows. In Section 2, we give the adaptive shrinkage estimate (ASE) based on the least absolute deviation (LAD) estimate for Tobit regression models, together with its asymptotic properties. In Section 3, we present the sequential sampling strategy based on the ASE, the stopping rule, and the confidence set under a random sample size. In Section 4, a numerical simulation on synthesized data sets illustrates the performance of the proposed sequential fixed size confidence estimation method.
Let $a^+ = \max\{a, c\}$, where $c$ is a known constant. The Tobit regression model is defined as

$$y_i^+ = \max\{x_i^T \beta_0 + \varepsilon_i,\ c\}, \quad i = 1, 2, \cdots, n \quad (1)$$
where $y_i$ is the dependent variable, $\beta_0$ is a $p$-dimensional vector of unknown regression coefficients, $x_i$ is a $p$-dimensional vector of covariates and $\varepsilon_i$ is a random error. Without loss of generality, suppose $c = 0$. Let $\varepsilon_i$, $i = 1, 2, \cdots, n$, be independent and identically distributed, following a normal distribution with mean 0 and variance $\sigma^2$. Then the likelihood function is
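$$L = \prod\nolimits_0 \left(1 - \Phi\!\left(x_i^T \beta / \sigma\right)\right) \prod\nolimits_1 \sigma^{-1} \phi\!\left((y_i - x_i^T \beta)/\sigma\right) \quad (2)$$

where $\Phi$ and $\phi$ are the standard normal distribution function and density function, and $\prod_0$ and $\prod_1$ denote the products over $\{i : y_i \le 0\}$ and $\{i : y_i > 0\}$, respectively.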
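The likelihood can be evaluated numerically on the log scale. The following is a minimal sketch, assuming the standard Tobit form in which censored observations contribute $1 - \Phi(x_i^T\beta/\sigma)$ and uncensored observations contribute the normal density $\sigma^{-1}\phi((y_i - x_i^T\beta)/\sigma)$. The function name `tobit_loglik` and the toy data are illustrative assumptions, not part of the original paper.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, pdf is phi


def tobit_loglik(beta, sigma, X, y):
    """Log-likelihood of a Tobit model censored at c = 0 (illustrative helper)."""
    total = 0.0
    for x, yi in zip(X, y):
        xb = sum(xj * bj for xj, bj in zip(x, beta))
        if yi <= 0.0:
            # Censored case: 1 - Phi(x'beta / sigma), computed as Phi(-x'beta / sigma)
            # for better numerical stability.
            total += math.log(STD_NORMAL.cdf(-xb / sigma))
        else:
            # Uncensored case: sigma^{-1} * phi((y - x'beta) / sigma).
            total += math.log(STD_NORMAL.pdf((yi - xb) / sigma) / sigma)
    return total


# Toy data (hypothetical values): one censored and one uncensored observation.
X_toy = [[1.0], [1.0]]
y_toy = [0.0, 1.0]
loglik_at_one = tobit_loglik([1.0], 1.0, X_toy, y_toy)
```

A maximum likelihood fit would maximize this function over $(\beta, \sigma)$; the sketch only evaluates it.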
Powell proposed a Least Absolute Deviation Estimate (LAD) of β 0 in [
$$Q_n(\beta) = \sum_{i=1}^{n} \left| y_i^+ - \max\{x_i^T \beta,\ 0\} \right| \quad (3)$$
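To make the objective in Equation (3) concrete, the sketch below evaluates $Q_n(\beta)$ on a small noise-free data set and locates its minimizer over a coarse grid. The function name `lad_objective`, the toy data, and the grid search are illustrative assumptions; the censored LAD objective is not convex in general, so a practical fit would need a more careful optimizer than this one-dimensional scan.

```python
def lad_objective(beta, X, y, c=0.0):
    """Q_n(beta) = sum_i |y_i - max(x_i' beta, c)| for censored responses y."""
    total = 0.0
    for x, yi in zip(X, y):
        xb = sum(xj * bj for xj, bj in zip(x, beta))
        total += abs(yi - max(xb, c))
    return total


# Noise-free toy data (hypothetical values): y_i = max(x_i' beta_true, 0).
X = [[1.0, 0.5], [1.0, -2.0], [1.0, 1.5], [1.0, -0.5]]
beta_true = [0.5, 1.0]
y = [max(sum(xj * bj for xj, bj in zip(x, beta_true)), 0.0) for x in X]

# Crude grid search over the slope, with the intercept held at its true value.
grid = [i / 20 for i in range(0, 41)]  # 0.00, 0.05, ..., 2.00
best_slope = min(grid, key=lambda b: lad_objective([0.5, b], X, y))
```

Because the toy data contain no noise, the objective is exactly zero at `beta_true` and positive at every other grid point.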
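Under the assumptions (A1) $\sup_i \|x_i\| < \infty$; (A2) the density function $f$ of the random error $\varepsilon_i$ satisfies $f(0) > 0$ and $\mathrm{med}(\varepsilon_i) = 0$; and the condition that there exists some $\delta > 0$ such that

$$\lim_{n \to \infty} \frac{\lambda}{\log n} \sum_{i=1}^{n} I\!\left(x_i^T \beta > \delta\right) x_i x_i^T = \infty, \quad (4)$$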
Chen and Wu proved that $\lim_{n \to \infty} \tilde{\beta}_n = \beta_0$ a.s. and

$$\left(2 f(0) M_n^{1/2}\right) \cdot \sqrt{n}\,(\tilde{\beta}_n - \beta_0) \xrightarrow{d} N(0, I_n) \quad (5)$$

in [
Let $\kappa = \kappa(n)$ be a non-random function of $n$ such that, for some $0 < \delta < 1/2$ and $\gamma > 0$, $n^{1/2}\kappa \to 0$ and $n^{1/2 + \gamma\delta}\kappa \to \infty$ as $n \to \infty$. Then, under assumptions (A1) and (A2), Equation (4) implies that $n^{1/2 - \eta}(\tilde{\beta}_n - \beta_0) = O(1)$ almost surely as $n \to \infty$ for some $\eta > 0$. Similar to Wang and Chang in [
So far, we have obtained good statistical properties of the proposed ASE under a non-random sample size, but our goal is to determine a sample size under which the ASE attains the required accuracy. We therefore introduce a sequential sampling scheme based on the ASE. The construction of the confidence set for $\beta_0$ depends on the asymptotic distribution of $\hat{\beta}_n$, and under sequential analysis the sample size is a random variable, so we need to study the asymptotic properties of the ASE under a random sample size. Fortunately, the property of uniform continuity in probability, see in [
Theorem 1. Suppose that (A1) and (A2) are satisfied, and let $N(t)$ be a positive integer-valued random variable such that $N(t)/t$ converges to 1 in probability as $t \to \infty$. Then

$$\sqrt{N(t)}\,\left(\hat{\beta}_{N(t)} - \beta_0\right) \to N\!\left(0,\ I_0 \Sigma^{-1} I_0\right)$$

in distribution as $t \to \infty$.
From Theorem 1, we can construct a confidence set for $\beta_0$ and a stopping rule for the sequential sampling procedure that determines the final sample size. Let $C_k = \{(y_i, x_i) : i = 1, 2, \cdots, k\}$ denote the first $k$ observations. Define the stopping rule $N_d$ as

$$N = N_d \equiv \inf\left\{k : \frac{d^2}{a_k^2} \ge \nu_k,\ k \ge n_0\right\} \quad (5)$$
In the sequential estimation procedure, one new observation is collected at a time until the stopping criterion is satisfied. When the stopping rule holds, a confidence set for $\beta_0$ is constructed from the $N$ samples as follows:
$$R_N = \left\{Z \in \mathbb{R}^p : \frac{S_N}{N} \le \frac{d^2}{\nu_N};\ I_{Nj}(\varepsilon) = 0 \Rightarrow z_j = 0,\ 1 \le j \le p\right\} \quad (6)$$

where $S_N = (Z_{N1} - \hat{\beta}_{N1})^T\, \tilde{\Sigma}_{11}\, (Z_{N1} - \hat{\beta}_{N1})$. Properties of the sequential procedure and the confidence set $R_N$ are summarized below.
Theorem 2. Assume that (A1) and (A2) are satisfied, and let $N$ be the stopping time defined in Equation (5). Then: 1) $\lim_{d \to 0} d^2 N / (a^2 \nu) = 1$ almost surely; 2) $\lim_{d \to 0} d^2 N / (a^2 \nu) = 1$; 3) $\lim_{d \to 0} d^2 E(N) / (a^2 \nu) = 1$; 4) $\lim_{d \to 0} \hat{p}_0(N) = p_0$ almost surely; 5) $\lim_{d \to 0} E(\hat{p}_0(N)) = p_0$, where $\nu$ is the maximum eigenvalue of the matrix $I_0 \Sigma^{-1} I_0$.
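The mechanics of sampling one observation at a time until a precision criterion is met can be illustrated with a much simpler problem than the one treated here. The sketch below is not the ASE procedure of this paper: it is the classical Chow and Robbins style rule for a fixed-width confidence interval of a scalar mean, swapped in because all of its ingredients are elementary. The function name `chow_robbins_mean`, the pilot size `n0 = 10`, the cap `max_n`, and the simulated data are assumptions made for illustration.

```python
import random
import statistics
from statistics import NormalDist


def chow_robbins_mean(draw, d, alpha=0.05, n0=10, max_n=100_000):
    """Sample until the interval (mean - d, mean + d) is justified.

    Stops at the first k >= n0 with k >= a^2 * s_k^2 / d^2, where s_k^2 is the
    sample variance and a is the standard normal (1 - alpha/2) quantile.
    """
    a_sq = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    data = [draw() for _ in range(n0)]  # pilot sample of size n0
    while len(data) < max_n:
        s_sq = statistics.variance(data)
        if len(data) >= a_sq * s_sq / d ** 2:
            break  # stopping criterion satisfied
        data.append(draw())  # criterion not met: collect one more observation
    return len(data), statistics.mean(data), statistics.variance(data)


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n_stop, mean_hat, var_hat = chow_robbins_mean(lambda: rng.gauss(2.0, 1.0), d=0.3)
```

In this scalar rule the stopping time behaves like $a^2 \sigma^2 / d^2$ for small $d$, which mirrors the statement $d^2 N / (a^2 \nu) \to 1$ in Theorem 2: halving the precision $d$ roughly quadruples the required sample size.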
We evaluate the performance of the proposed method via sequential fixed size confidence estimation using synthesized data sets. As mentioned previously, by the definition of the stopping rule, when sampling is stopped, the final confidence ellipsoid constructed will have the prescribed precision and coverage probability. Thus, we can compare the average stopping times of procedures based on LAD and ASE. Since the proposed method ignores the non-effective variables, we expect the average stopping time to be significantly smaller than that of the procedure based on LAD with no variable identification mechanism. If the p0 variables are known in advance, then the most efficient procedure is, of course, to use only these p0 variables. Therefore, we also construct a sequential procedure under such a situation, and the results of the cases with known p0 can serve as the baseline, in which the smallest sample size is achieved, asymptotically.
The synthesized data sets for the model with fixed designs are generated as follows. The regressors $x_i$ are generated independently, beforehand, from a multivariate normal distribution with mean 0 and identity covariance matrix, and the error term $e_i$ is drawn independently from the standard normal distribution for each $i \ge 1$. The response is generated by Equation (1) with true parameter $\beta_0 = (-1.2, 2.0, 0, 0, 0, 0, 0, 0, 0, 0)$, which has 8 non-effective variables. Confidence ellipsoid precisions $d \in \{0.3, 0.4, 0.5, 0.6\}$ are considered, with coverage probability 95% ($\alpha = 0.05$). We choose $\gamma = 1$, $\delta = 0.55$ and $\theta = 0.70$ in analyzing the simulated data. When applying the ASE method, the regularization parameter $\varepsilon$ needs to be determined by a model selection criterion such as AIC, BIC, or GCV. For convenience, we use only BIC to illustrate our method.
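The data-generating design described above can be sketched as follows. This is a minimal illustration assuming censoring at $c = 0$; the function name `generate_tobit_sample`, the sample size of 2000, and the seed are arbitrary choices that do not come from the paper.

```python
import random

BETA_0 = [-1.2, 2.0] + [0.0] * 8  # two effective and eight non-effective variables


def generate_tobit_sample(n, beta, rng, c=0.0):
    """Draw n pairs (x_i, y_i): x_i ~ N(0, I_p), e_i ~ N(0, 1), y_i = max(x_i' beta + e_i, c)."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        latent = sum(xj * bj for xj, bj in zip(x, beta)) + rng.gauss(0.0, 1.0)
        X.append(x)
        y.append(max(latent, c))  # responses below c are censored at c
    return X, y


rng = random.Random(2021)  # arbitrary seed, for reproducibility only
X, y = generate_tobit_sample(2000, BETA_0, rng)
censored_share = sum(1 for yi in y if yi == 0.0) / len(y)
```

Under this design the latent response $x_i^T \beta_0 + e_i$ is symmetric about zero, so roughly half of the observations are censored.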
$\beta_0 = (-1.2, 2.0, 0, 0, 0, 0, 0, 0, 0, 0)$

Design | d | N (LAD $p_0$) | κ* (LAD $p_0$) | CP (LAD $p_0$) | N (ASE) | κ (ASE) | CP (ASE) | N (LAD) | κ (LAD) | CP (LAD) |
---|---|---|---|---|---|---|---|---|---|---|
fixed | 0.6 | 85.44 (14.75) | 1.008 | 0.96 | 92.32 (17.40) | 1.044 | 0.94 | 327.8 (23.194) | 1.01 | 0.92 |
 | 0.5 | 126.84 (19.75) | 1.019 | 0.96 | 131.52 (19.13) | 1.034 | 0.98 | 433.21 (29.586) | 1.006 | 1 |
 | 0.4 | 179.13 (25.587) | 1.001 | 0.94 | 190.34 (25.311) | 1.021 | 0.93 | 674.26 (37.868) | 1.003 | 0.90 |
 | 0.3 | 363.72 (38.211) | 1.001 | 0.97 | 373.60 (37.087) | 1.017 | 0.95 | 1203.05 (44.707) | 1.002 | 0.94 |

$\kappa^* = d^2 N / (a^2 \nu)$; CP is the empirical coverage probability of the 95% confidence ellipsoid region $R_N$; empirical standard deviations are in parentheses.
$\beta_1 = -1.2$, $\beta_2 = 2.0$

Design | d | $N_{ic}^*$ (ASE) | $N_c^*$ (ASE) | $\beta_1$ (ASE) | $\beta_2$ (ASE) | $N_{ic}^*$ (LAD) | $N_c^*$ (LAD) | $\beta_1$ (LAD) | $\beta_2$ (LAD) |
---|---|---|---|---|---|---|---|---|---|
fixed | 0.6 | 0 | 7.912 | −1.223 (0.155) | 2.16 (0.010) | - | - | −1.240 (0.13) | 2.061 (0.31) |
 | 0.5 | 0 | 7.959 | −1.214 (0.129) | 2.042 (0.193) | - | - | −1.224 (0.009) | 2.031 (0.066) |
 | 0.4 | 0 | 7.982 | −1.208 (0.105) | 2.073 (0.112) | - | - | −1.211 (0.054) | 2.021 (0.077) |
 | 0.3 | 0 | 7.933 | −1.210 (0.076) | 2.035 (0.102) | - | - | −1.201 (0.032) | 2.004 (0.043) |

$N_c^*$ and $N_{ic}^*$ are the average numbers of zero components of $\beta$ correctly identified and of nonzero components incorrectly estimated as zero, respectively; standard deviations are in parentheses.
to 0, and the numbers of correctly identified zero variables ($N_c^*$) are all very close to the true number of effective variables (2 and 8). These results suggest that $\hat{p}_0$ is a good estimator of $p_0$ under the sequential sampling method based on the ASE. The LAD procedure does not identify the effective variables, so $N_c^*$ and $N_{ic}^*$ are not available for it. In addition, all parameter estimates of the effective variables are very close to the true values.
Based on the ASE of the parameters in the Tobit regression model, a sequential sampling procedure is constructed to determine the minimum sample size needed to identify the effective variables and, simultaneously, to estimate the parameters with the required accuracy. We prove that the proposed sequential procedure is asymptotically optimal in the sense of Chow and Robbins, see in [
This research was supported by Research projects of universities in Xinjiang Uygur Autonomous Region under Grant No. XJEDU2016I033 and Xinjiang Normal University postdoctoral research foundation under Grant No. XJNUBS1539.
The authors declare no conflicts of interest regarding the publication of this paper.
Lu, H.B., Dong, C.L. and Zhou, J.L. (2021) A Sequential Shrinkage Estimating Method for Tobit Regression Model. Open Journal of Modelling and Simulation, 9, 275-280. https://doi.org/10.4236/ojmsi.2021.93018