A One-Step Variable Selection Procedure for SCAD Penalized Quantile Regression Models

Jan G. De Gooijer

doi:10.4236/tel.2026.161004

Theoretical Economics Letters > Vol.16 No.1, February 2026

A One-Step Variable Selection Procedure for SCAD Penalized Quantile Regression Models

Jan G. De Gooijer
Amsterdam School of Economics, University of Amsterdam, Amsterdam, Netherlands.
DOI: 10.4236/tel.2026.161004 PDF HTML XML 77 Downloads 337 Views

Abstract

Variable selection using penalized estimation methods in quantile regression models is an important step in screening for relevant covariates. In this paper, we present a one-step estimation procedure for variable selection in sparse, linear additive quantile regression models, using the SCAD penalty. The main idea of the proposed procedure is that the usual L₁-norm objective function in quantile regression estimation is replaced by a smooth parametric approximation of this function, via iterative least squares computations. We conduct a simulation study and a real data analysis to check the finite sample performance of the one-step estimator. The results reveal that the one-step quantile SCAD method identifies relevant variables efficiently even when the process under study has heavy tails or if the process is contaminated with outliers.

Keywords

High-Dimensional Data, Iterative Least Squares, Quantile Regression, Variable Selection

Share and Cite:

Gooijer, J. (2026) A One-Step Variable Selection Procedure for SCAD Penalized Quantile Regression Models. Theoretical Economics Letters, 16, 57-71. doi: 10.4236/tel.2026.161004.

1. Introduction

Analyzing high-dimensional data has become an important task in diverse research fields such as biology, genetics, economics, and finance¹. It has gained increasing attention in the application of emerging statistical methods that are proficient in analyzing and interpreting large datasets. Selection of the best representative covariate of the response variable is a key part of this process, but the traditional variable (or component) selection methods like subset selection, stepwise selection, backward and forward selection are not stable and accurate when the size of covariates is near or equal to the size of observations. These procedures are also not applicable when the number of observations is less than the number of covariates.

Multiple studies on penalized regression methods have addressed the above drawbacks. For example, the LASSO (least absolute shrinkage and selection operator) penalty was proposed by Tibshirani (1996), the adaptive LASSO by Zou (2006), the elastic net by Zou and Hastie (2005), and the smoothly clipped absolute deviation (SCAD) penalty by Fan and Li (2001). These methods are good with respect to prediction accuracy and efficiency. They also have the oracle property, but are limited to the small size paradigm of high-dimensional data (Fan & Li, 2001)². Also, these methods work well with a fixed number of covariates. However, in the case of high-dimensional data, both classical and penalized techniques face problems like speed, stability and accuracy (Fan & Lv, 2008).

To overcome these issues, various variable selection methods in the context of quantile regression (QR) have been proposed. These include penalized quantile regression approaches using the LASSO penalty, and those using a non-convex penalty such as SCAD; see, e.g., Wang et al. (2012), Sherwood and Wang (2016), Khan et al. (2019) and the references therein. More recently, various penalized quantile regression approaches have been developed. For instance, Sherwood and Maidman (2022) proposed a method for simultaneous estimation and variable selection of an additive nonlinear quantile regression model in a high-dimensional setting. The R-rqPen package of Sherwood et al. (2025) readily solves the problem. Song et al. (2022) consider, as a generalized regression model, the single-index regression model with the SCAD and Laplace error penalty. Additionally, Chen and Lee (2023) addressed sparse linear quantile regression through the so-called $ℓ_{0}$ approach, which focuses on selecting a subset of covariates; see, e.g., Mai (2024) for an extensive literature review.

Using the SCAD penalty, the present paper introduces a computationally feasible and efficient approach, called one-step quantile estimation procedure, for selecting the nonzero QR parameters. The proposed method is similar in spirit to the one-step local linear approximation estimator of Zou and Li (2008). Given ordinary least squares estimates as initial values, these authors obtain a one-step estimator for the nonzero linear regression parameters by maximizing a penalized likelihood function. We use the same idea within the context of a penalized additive QR model³. However, we replace the usual $L_{1}$ -norm objective function by a smooth parametric approximation of this function, via iterative least squares computations. This approach offers an efficient way to obtain conditional quantile estimates in high-dimensions.

The rest of the paper is organized as follows. Section 2 briefly describes the standard QR minimization problem. Section 3 first introduces an initial QR estimator without variable selection. Next, a final QR estimator with variable selection is proposed. Finally, we discuss SCAD consistency in variable selection in this section. In Section 4, we assess the finite-sample properties of the estimator through a Monte Carlo simulation study. This section also contains a comparison with a similar one-step LASSO estimator. Section 5 contains a real data illustration. Section 6 provides some summary remarks.

2. Preliminaries

Let $Y_{i} \in ℝ$ be a response variable which depends on a $p_{n} \times 1$ vector of random covariates $X_{i} = {(X_{i, 1}, \dots, X_{i, p_{n}})}^{T} \in ℝ^{p_{n}}$ , $(p_{n} > 2)$ of the $i$ th observation, $i = 1, \dots, n$ . In the “traditional”, additive QR framework, interest is focussed on the $τ th$ $(0 < τ < 1)$ conditional quantile of $Y_{i}$ given $X_{i} = x_{i}$ , i.e.,

$Q_{Y_{i}} (τ | x_{i}) = x_{i}^{T} β_{0} (τ), i = 1, \dots, n,$ (1)

where $β_{0} (τ) = {(β_{1, 0} (τ), \dots, β_{p_{n}, 0} (τ))}^{T}$ is an unknown $p_{n} \times 1$ parameter vector, including an intercept, depending on $τ$ . Note that for including an intercept term, the corresponding covariate needs to take value 1 for each observation.

For a fixed value of $τ$ , the parameter vector $β_{0} (τ)$ can be estimated by minimizing the objective function

$ℒ (β) = \sum_{i = 1}^{n} ρ_{τ} (r_{i} (β)),$ (2)

where $r_{i} (β) = Y_{i} - X_{i}^{T} β$ , $β = {(β_{1}, \dots, β_{p})}^{T}$ , $ρ_{τ} (u) = (τ - I (u \leq 0)) u$ is the quantile check function, and $I (\cdot)$ is the usual indicator function. To simplify the notation from now on, we suppress the argument $β$ of $r_{i} (β)$ . We also omit the subscript “ $n$ ” from $p_{n}$ whenever no confusion is caused.

Linear programming algorithms can be employed to solve Equation (2). To describe the basic solution (feasible or not) of the QR minimization problem, we first introduce the following additional notations. In particular, let $X = (X_{1}, \dots, X_{p})$ , and let $h = (i_{1}, \dots, i_{p})$ denote a $p$ -dimensional index, which identifies proper $p$ observations in the sample of size $n$ . Further, let $X (h)$ denote the $p \times p$ submatrix of $X$ with rows ${X_{i}, i \in h}$ . Likewise, let $Y (h)$ be a $p$ -dimensional vector with coordinates ${Y_{i}, i \in h}$ . Then the basic solution (Abdelmalek, 1974; Koenker & Bassett, 1978) of the QR minimization problem may be expressed as

$\hat{β} (τ) = X {(h)}^{- 1} Y (h) .$ (3)

Clearly, only the $p$ data points identified by $h$ supply information about the QR estimator of $β_{0} (τ)$ , although the whole observed sample is typically used to find the “optimal” $h$ . Observe that the design matrix $X$ is partitioned into a critical active dataset, whose cardinality is $p$ , and a non-active complement, yielding $Y (h) - X (h) \hat{β} (h) = 0_{p}$ with $0_{p}$ a $p$ -dimensional zero vector.

3. Description of the QR Estimators

It is well known that conditional QR estimates are severely affected by the so-called curse-of-dimensionality when $p$ is large. To tackle this issue, several authors have proposed penalized QR methods which identify relevant and irrelevant components consistently. In this section, first we discuss an initial estimator for $β_{0} (τ)$ based on a parametric smooth approximation of the objective function (2). Then we will explain how to use it in variable selection when constructing a one-step sparse conditional quantile estimator.

3.1. An Initial QR Estimator without Variable Selection

Muggeo et al. (2011) proposed an exact path-following algorithm leading to the QR solution (3). For a fixed value of $τ$ , they approximate $ρ_{τ} (r_{i})$ by an asymmetric “bent arc” expressed as a piecewise polynomial denoted by $S_{i, τ} (β; c)$ , where $c$ is a small but known positive constant. Specifically, let $A (c) = {i : - c τ < r_{i} < c (1 - τ)}$ , $i = 1, \dots, n$ , be a partition of the sample units. Then the approximation $S_{i, τ} (β; c)$ is defined as follows

$S_{i, τ} (β; c) = \frac{ω_{i, 2} (β; c) r_{i}^{2}}{2 c} + ω_{i, 1} (β; c) r_{i} + I (i \in A (c)) \frac{c τ (1 - τ)}{2},$ (4)

where $ω_{i, u} (β; c)$ , $u = 1, 2$ , are weights given by

$\begin{array}{l} ω_{i, 1} (β; c) = (τ - 1) ν_{i, 1} (β; c) + τ ν_{i, 4} (β; c), \\ ω_{i, 2} (β; c) = \frac{{(1 - τ)}^{2} ν_{i, 2} (β; c) + τ^{2} ν_{i, 3} (β; c)}{τ (1 - τ)}, \end{array}$ (5)

and where $ν_{i, g}$ , $g = 1, 2, 3, 4$ , are the following indicator functions

$v_{i, 1} (β; c) = I (r_{i} \leq - c τ), v_{i, 2} (β; c) = I (- c τ < r_{i} \leq 0),$

$v_{i, 3} (β; c) = I (0 < r_{i} < c (1 - τ)), v_{i, 4} (β; c) = I (r_{i} \geq c (1 - τ)) .$

Note that we suppressed the dependence of $ω_{i, u}$ and $ν_{i, g}$ on $τ$ . This simplification in notation also holds in the rest of the paper.

The smooth objective function takes the form

$\begin{matrix} S_{τ} (β; c) = \sum_{i = 1}^{n} S_{i, τ} (β; c) \\ = \frac{1}{2 c} \sum_{i \in A (c)} ω_{i, 2} (β; c) r_{i}^{2} + \sum_{i \in (1 - A (c))} ω_{i, 1} (β; c) r_{i} + | A (c) | \frac{c τ (1 - τ)}{2}, \end{matrix}$ (6)

where $\sum_{i} I (i \in A (c)) = | A (c) |$ , i.e., the cardinality of the set $A (c)$ . Now for minimizing Equation (6), the idea is to define a piecewise linear solution using a decreasing sequence of values, say $0 < c_{J} < \dots < c_{1}$ , such that $A (c_{j + 1}) \subseteq A_{j}$ and $| A (c_{J}) | = p$ . Additionally, let $c_{j} > 0$ be a fixed value and $β (c_{j})$ the corresponding minimizer of (6), with $c_{j}$ chosen such that $| A (c_{j}) | > p$ . Given this setup, the set of constant weight indicator functions is given by $D (c_{j}) = {(β; c) \in ℝ^{p} \times ℝ^{+} | \forall i, g : ν_{i, g} (β; c) = ν_{i, g} (c_{j})}$ . Then the $p$ -dimensional gradient vector of $S_{τ} (β; c)$ can be written as follows

$\frac{\partial S_{τ} (β; c)}{\partial β} = \frac{1}{c} X^{T} W_{2} (c_{j}) (Y - X β) + X^{T} W_{1} (c_{j}) 1_{n} \forall (β, c) \in D (c_{j}),$ (7)

where $W_{u} (c_{j}) = diag (ω_{1, u} (c_{j}), \dots, ω_{n, u} (c_{j}))$ , $j = 1, \dots, J$ , $u = 1, 2$ . Here $Y = {(Y_{1}, \dots, Y_{n})}^{T}$ and $1_{n}$ is an $n$ -dimensional vector of ones. Solving $\partial S_{τ} (β; c) / \partial β = 0_{p}$ gives the $p$ -dimensional weighted least squares estimator

$\begin{matrix} {\hat{β}}_{τ} (c) = {(X^{T} W_{2} (c_{j}) X)}^{- 1} X^{T} W_{2} (c_{j}) Y + c {(X^{T} W_{2} (c_{j}) X)}^{- 1} X^{T} W_{1} (c_{j}) 1_{n} \\ = {\hat{β}}_{Y} (c_{j}) + c ν (c_{j}), \end{matrix}$ (8)

where

$\begin{array}{l} {\hat{β}}_{Y} (c_{j}) = {(X^{T} W_{2} (c_{j}) X)}^{- 1} X^{T} W_{2} (c_{j}) Y, \\ ν (c_{j}) = {(X^{T} W_{2} (c_{j}) X)}^{- 1} X^{T} W_{1} (c_{j}) 1_{n} . \end{array}$ (9)

For every new value of $c_{j}$ , information from the previous solution is used. Then, it is not hard to show that

$c_{j + 1} = max_{i} {\frac{{\hat{r}}_{i, Y} (c_{j})}{X_{i}^{T} ν (c_{j}) - τ}; \frac{{\hat{r}}_{i, Y} (c_{j})}{X_{i}^{T} ν (c_{j})}; \frac{{\hat{r}}_{i, Y} (c_{j})}{X_{i}^{T} ν (c_{j}) - τ + 1}}_{(0, c_{j})},$ (10)

where ${\hat{r}}_{i, Y} (c_{j}) = Y_{i} - X_{i}^{T} {\hat{β}}_{Y} (c_{j})$ is an estimate of the random error $r_{i, Y} (τ) = Y_{i} - X_{i}^{T} β_{0} (τ)$ . Thus if $| A (c_{j + 1}) | = p$ , then ${\hat{β}}_{τ} (c) = X^{- 1} (h) Y (h)$ , with $h \equiv | A (c_{J}) |$ , is an iteratively weighted least squares (IWLS) estimator of $β_{0} (τ)$ .

Suppose that we have $x = {(x_{1}, x_{2}, \dots)}^{T}$ with sign vector $sign (x) = {(sign (x_{1}), sign (x_{2}), \dots)}^{T}$ . The proposed path-following algorithm is based on the following theorem.

Theorem 1. (Muggeo et al., 2011) Let $c_{J} > 0$ be a fixed value such that $| A (c_{J}) | = p$ . Then under the assumption that for any $c \in [0, c_{J}]$ , $sign (r_{ℓ} ({\hat{β}}_{τ} (c))) = sign (r_{ℓ} ({\hat{β}}_{τ} (c_{J})))$ , $\forall ℓ \in 1 - | A (c_{J}) |$ , the solution curve evaluated at $0_{p}$ is equal to the solution of (2), namely ${\hat{β}}_{τ} (0) = \hat{β} (τ)$ .

3.2. A Final QR Estimator with Variable Selection

It is well known that conditional QR estimates are severely affected by the so-called curse-of-dimensionality when $p$ is large. To tackle this issue, several authors have proposed dimension reduction methods. The general framework for these methods can be characterized by a set of nonzero coefficients $A = {j : β_{j, 0} (τ) \neq 0, j = 1, \dots, p}$ , and $q = | A |$ is the cardinality of $A$ . The set $A$ is unknown and can be estimated by various (non)parametric approaches such as kernel-based smoothing or B-spline basis functions. Suppose that the true parameter vector $β_{0} (τ)$ is sparse. Hence, we can write $β_{0} (τ) = {(β_{1, 0}^{T} (τ), β_{2, 0}^{T} (τ))}^{T}$ , where $β_{1, 0} (τ)$ is a $q \times 1$ vector and $β_{2, 0} (τ) = 0_{p - q}$ . Moreover, it is unknown which coefficients are nonzero and which are zero. To this end, one can use a penalty function $p_{λ} (\cdot)$ with tuning parameter $λ$ which depends on $n$ , and where for ease of presentation, we set $λ \equiv λ_{n}$ .

In this paper, we consider the SCAD penalty which is defined on $ℝ^{+}$ by its first derivatives as

${p^{'}}_{λ} (x) = λ {I (x \leq λ) + \frac{{(a λ - x)}_{+}}{a - 1} I (x > λ)},$ (11)

where $a > 2$ and $λ > 0$ are tuning parameters, and where ${(x)}_{+} = max (x, 0)$ . Note that (11) is symmetric, non-convex on $[0, \infty)$ , and singular at the origin. Its derivative vanishes outside $[- a λ, a λ]$ . Following the recommendation of Fan and Li (2001), we use $a = 3.7$ in the numerical applications. The parameter $λ$ controls the complexity of the selected model, and $λ \to 0$ as $n \to \infty$ . The function (11) overcomes some of the undesirable properties of the LASSO penalty, where $p_{λ} (x) = λ | x |$ .

For a fixed $c$ , let ${\hat{β}}_{j, τ} (c)$ , $j = 1, \dots, p$ , denote the univariate conditional (marginal) quantile estimator of $β_{j, 0} (τ)$ obtained through the IWLS method of Section 3.1. Then, for given values of $λ$ and $τ$ , the proposed final one-step conditional QR estimator ${\hat{β}}_{λ, τ} (c)$ of $β_{0} (τ)$ can be obtained by minimizing the following penalized objective function

$Q_{λ, τ} (β; c) = \frac{1}{2 n} \sum_{i = 1}^{n} ρ_{τ} (Y_{i} - X_{i}^{T} β) + \sum_{j = 1}^{p} {p^{'}}_{λ} (| {\hat{β}}_{j, τ} (c) |) | β_{j} | .$ (12)

Thus, we use ${\hat{β}}_{j, τ} (c)$ as an initial estimate to determine the penalty weights for individual coefficients in QR. This adaptive approach is a natural extension of the local linear approximation as proposed by Zou and Li (2008) and briefly discussed in Section 1. The approach allows the procedure to apply smaller penalties to coefficients with strong evidence of being non-zero, thereby improving selection accuracy⁴.

Let $| {\hat{A}}_{λ, τ} |$ be the number of nonzero estimates. Then, following Sherwood and Wang (2016), we choose the value of $λ$ that minimizes the following high-dimensional Bayesian-type information criterion (BIC):

${QBIC}_{τ} (λ) = \log (\sum_{i = 1}^{n} ρ_{τ} (Y_{i} - X_{i}^{T} {\hat{β}}_{λ, τ} (c))) + | {\hat{A}}_{λ, τ} | \frac{\log (p) \log (\log (n))}{2 n} .$ (13)

In the case ${\hat{β}}_{j, τ} (c)$ in (12) is replaced by an initial estimator not depending on $c$ , simulation results by Lee et al. (2014) show that if the choice of $λ$ is based on (13), all relevant components are selected consistently as $n \to \infty$ . Further, similar to Theorem 1, it follows that ${\hat{β}}_{λ, τ} (c)$ is equal in sign to the true $β_{0} (τ)$ (denoted by ${\hat{β}}_{λ, τ} (c) =_{s} β_{0} (τ)$ ) if and only if $sign ({\hat{β}}_{λ, τ} (c)) = sign (β_{0} (τ))$ .

3.3. Asymptotic Properties

Due to the non-smoothness and non-convexity of the SCAD penalty function (11), the classical Karush-Kuhn-Tucker optimality condition is not applicable to study the asymptotic properties of the final estimator ${\hat{β}}_{λ, τ} (c)$ . One common used technique to get around it in QR is to decompose a non-convex objective function as the difference of two convex (check loss) functions; see, e.g., Wu and Liu (2009) and Sherwood and Wang (2016). Using this technique to decompose the SCAD function, and assuming that certain regularity conditions hold, the following theorem can be proven.

Theorem 2. (Consistency) Assume that ${p^{'}}_{λ} (\cdot) = λ p^{'} (\cdot)$ . Suppose $p^{'} (\cdot)$ is a continuous function on $(0, \infty)$ and there is some $k > 0$ such that $p^{'} (θ) = O (θ^{- k})$ as $θ \to 0^{+}$ , and $q^{1 / 2} n^{- k} \to 0$ . In addition, let ${p^{'}}_{λ} (\cdot)$ be the SCAD function and $λ n^{k} \to \infty$ , then we have $ℙ ({\hat{β}}_{λ, τ} (c) =_{s} β_{0} (τ)) \to 1$ .

The result follows from similar arguments as in the proof of Theorem 4.2(i) in Noh and Lee (2014) and, hence, has been omitted from the text. Zhao and Yu (2006) study the consistency of LASSO variable selection in a high-dimensional linear regression setting.

Remark (Oracle property) Under certain regularity conditions on the design of the model, Amin et al. (2015), Theorem 3.2, proved sparsity and asymptotic normality of the SCAD-penalized quantile estimator ${\hat{β}}_{λ, τ} (c)$ .

3.4. Selecting c

Following Muggeo et al. (2011), we now present an algorithm for selecting $c$ . Ideally, $c$ should be small enough to get a good approximate solution to the objective function (12). However, for small values of $c$ estimation will be more difficult as the objective function $S_{τ} (β; c)$ approaches $ℒ (β)$ having a step-function gradient. The proposed pseudo code is given by Algorithm 1. High values of $P$ lead to high $c^{*}$ and then to better convergence performance but to more biased estimates; small values of $P$ are likely to cause failures in the estimating algorithm. Throughout the next two sections, we set $c_{start} = 0.5$ and $P = 0.7$ . These choices are recommended by Muggeo et al. (2011). In particular, these authors provide guidance on selecting $c$ and $p$ when $n$ is large ( $n \geq 10000$ ). Among other things, their results indicate that it is possible to reduce $P$ (or equivalently $c$ ) without affecting the convergence performance of the IWLS.

Algorithm 1. An adaptive algorithm for selecting c.

4. Numerical Studies

4.1. Simulation

In this example, we simulate 100 datasets from the $p$ -dimensional linear regression model

$Y_{i} = X_{i}^{T} β + ε_{i}, i = 1, \dots, n,$ (14)

where $(n, p) = (100, 200)$ and $(100, 400)$ , and where $ε_{i}$ is an independent and identically (i.i.d.) distributed random variable. The nonzero parameters are given by $β_{1} = 3$ , $β_{2} = 1.5$ , $β_{5} = 2$ , and the rest of the $β_{j}$ ’s are zero.

We generate the covariates $X_{i}$ from the $N (0_{p}, Σ)$ multivariate normal distribution, where $Σ = {(σ_{u, v})}_{p \times p}$ with $σ_{u, v} = r^{| u - v |}$ and $r = 0.5$ . For the distribution of the error term, we consider the following three scenarios: 1) $ε_{i} \overset{i .i .d .}{~} N (0, 1)$ ; 2) $ε_{i} \overset{i .i .d .}{~} t_{3}$ , i.e., the Student $t$ distribution with 3 degrees of freedom; and 3) a contaminated $N (0, 1)$ distribution having 10% outliers from a Cauchy distribution, denoted by C₁₀- $N$ . We show results for quantile levels $τ = 0.25, 0.50$ , and 0.75. We focus on the one-step quantile regression method SCAD penalty (denoted by QS), and also consider the one-step quantile regression method with LASSO penalty (denoted by QL) as a “benchmark”⁵. As a second benchmark for comparison, we compute the Oracle estimator, which is the least squares estimator of the true model.

To assess the performance of both quantile regression methods, we adopt the following criteria: 1) NN: average number of nonzero covariates; 2) True Positives (TP): average number of covariates that are correctly included in the model; 3) False Positives (FP): average number of covariates that are incorrectly included in the model; 4) AE: average of the absolute value of the estimation error which is defined as $\sum_{j = 1}^{p_{n}} | {\hat{β}}_{j, τ} (c^{*}) - β_{j, 0} (τ) |$ .

Table 1 summarizes the simulation results. The variable selection performance of QS is quite descent for all values of $τ$ and $p_{n}$ . Indeed, except for one case (QL with $τ = 0.75$ , $p_{n} = 400$ , and $ε_{i} \overset{i .i .d .}{~} N (0, 1)$ ), all computed TP values are equal to the true active model size of 3, with standard errors equal to zero. Hence, to keep the paper clear and readable, all computed TP values have been omitted from Table 1. The QL method tends to have substantially higher AE values than QS at different quantile levels for almost all $p_{n}$ values and error distributions. In the case $ε_{i} \overset{i .i .d .}{~} t_{3}$ , the QS method is near the oracle method in terms of AE, but the performance of QL is relatively poor. Also, the advantage of the QS method over the QL method can be seen by its lower AE values when the distribution of $ε_{i}$ is heavy-tailed. In general, the proposed QS method tends to be efficient for the three error distributions under study.

Table 1. Variable selection performance of one-step quantile regression with LASSO (QL) and SCAD (QS) type penalties for $n = 100$ . Standard errors are in parentheses. NN = Average number of nonzero covariates; FP = False positives; AE = Average of the absolute values of the estimation error.

Method	$p_{n}$	$ε_{i}$	$τ = 0.25$			$τ = 0.5$			$τ = 0.75$
Method	$p_{n}$	$ε_{i}$	NN	FP	AE	NN	FP	AE	NN	FP	AE
QL	200	$N (0, 1)$	4.05	1.05	1.16	3.75	0.75	1.16	4.25	1.25	1.15
QL			(1.07)	(1.07)	(0.41)	(0.87)	(0.87)	(0.34)	(1.17)	(1.17)	(0.40)
QS			3.27	0.27	0.48	3.29	0.29	0.44	3.25	0.58	0.47
QS			(0.57)	(0.57)	(0.26)	(0.57)	(0.57)	(0.24)	(0.52)	(0.52)	(0.25)
Oracle			3.00	0.37	0.00	3.00	0.31	0.00	3.00	0.34	0.00
Oracle			(0.00)	(0.17)	(0.00)	(0.00)	(0.14)	(0.00)	(0.00)	(0.17)	(0.00)
QL	400		4.11	1.10	1.30	3.69	0.68	1.15	3.89	0.90	1.36
QL			(1.14)	(1.13)	(0.41)	(0.83)	(0.83)	(0.34)	(1.09)	(1.09)	(0.41)
QS			3.26	0.26	0.50	6.14	3.10	0.80	3.27	0.27	0.49
QS			(0.54)	(0.54)	(0.25)	(16.60)	(16.37)	(1.98)	(0.53)	(0.53)	(0.25)
Oracle			3.00	0.35	0.00	3.00	0.31	0.00	3.00	0.36	0.00
Oracle			(0.00)	(0.18)	(0.00)	(0.00)	(0.16)	(0.00)	(0.00)	(0.20)	(0.00)
QL	200	$t_{3}$	3.87	0.87	1.37	3.55	0.55	1.20	3.92	0.92	1.32
QL			(0.98)	(0.98)	(0.55)	(0.69)	(0.69)	(0.38)	(0.93)	(0.93)	(0.42)
QS			3.17	0.17	0.52	3.17	0.17	0.48	3.16	0.16	0.53
QS			(0.40)	(0.40)	(0.52)	(0.38)	(0.38)	(0.31)	(0.37)	(0.37)	(0.31)
Oracle			3.00	0.40	0.00	3.00	0.35	0.00	3.00	0.36	0.00
Oracle			(0.00)	(0.22)	(0.00)	(0.00)	(0.20)	(0.00)	(0.00)	(0.23)	(0.00)
QL	400		4.19	1.17	1.63	3.60	0.59	1.39	4.11	1.11	1.57
QL			(1.31)	(1.28)	(0.58)	(0.83)	(0.83)	(0.43)	(1.18)	(1.17)	(0.56)
QS			3.20	0.20	0.51	6.13	3.11	0.87	3.20	0.20	0.50
QS			(0.45)	(0.45)	(0.34)	(16.60)	(16.48)	(2.50)	(0.47)	(0.47)	(0.29)
Oracle			3.00	0.41	0.00	3.00	0.35	0.00	3.00	0.43	0.00
Oracle			(0.00)	(0.18)	(0.00)	(0.00)	(0.17)	(0.00)	(0.00)	(0.21)	(0.00)
QL	200	C₁₀- $N$	3.96	0.96	1.21	3.55	0.55	1.10	3.94	0.94	1.21
QL			(0.86)	(0.86)	(0.37)	(0.73)	(0.73)	(0.34)	(1.08)	(0.94)	(0.39)
QS			3.44	0.60	0.97	3.53	0.68	0.95	3.48	0.64	1.03
QS			(0.80)	(0.99)	(1.71)	(1.26)	(1.49)	(1.84)	(1.14)	(1.36)	(1.92)
Oracle			3.00	0.32	0.00	3.00	0.34	0.00	3.00	0.37	0.00
Oracle			(0.00)	(0.18)	(0.00)	(0.00)	(0.17)	(0.00)	(0.00)	(0.16)	(0.00)
QL	400		4.07	1.07	1.43	3.56	0.56	1.21	3.88	0.88	1.36
QL			(1.13)	(1.13)	(0.45)	(0.81)	(0.81)	(0.36)	(1.06)	(1.06)	(0.45)
QS			5.14	2.29	3.33	4.30	1.42	1.00	3.39	0.53	0.91
QS			(13.63)	(13.77)	(17.24)	(9.71)	(9.53)	(2.04)	(0.99)	(1.30)	(0.91)
Oracle			3.00	0.36	0.00	3.00	0.31	0.00	3.00	0.35	0.00
Oracle			(0.00)	(0.21)	(0.00)	(0.00)	(0.18)	(0.00)	(0.00)	(0.24)	(0.00)

4.2. Comparing Execution Time

We next compare the execution time of the QL and QS methods with initial quantile parameter estimates obtained by two methods: 1) the IWLS method of Section 3.1, and by 2) the well-known quantile regression method rq in the R-quant package and denoted here by the short-hand notation “bet”. The simulation setting is the same as in Section 4.1 with Scenario 1. However, in this case $n = 80$ , $n = 1000$ , and $p_{n} = 100$ . Using the R-package microbenchmark, Figure 1(a) and Figure 1(b) show plots of the microbenchmark timings for three different values of $τ$ , with “neval” denoting the number of replicates.

(a)

(b)

Figure 1. Autoplots of microbenchmark timings: plot (a) is for ( $n = 80$ , $p_{n} = 100$ ), and plot (b) is for ( $n = 1000$ , $p_{n} = 100$ ).

Figure 1(a) shows that there is hardly any difference between quantile estimates obtained with the QS-IWLS method and the QL-bet method for most values of $τ$ . However, we see from Figure 1(b) that the difference between both methods is more noteworthy for the case $(n, p_{n}) = (1000, 100)$ . Indeed, QL is about 2 times faster than QS, for all values of $τ$ . This difference in computing time was also noted for simulation results obtained with the Student $t_{3}$ error distribution and with the contaminated $N (0, 1)$ error distribution, as specified earlier in Section 4.1. Thus, in terms of computing time, there is a price to be paid for using the proposed one-step QS-IWLS estimation method when $n$ is large.

5. An Illustrative Application

In this section, we illustrate the one-step SCAD penalized QR estimator using a real dataset corresponding to beta-carotene levels (ng/mL) in 315 patients. The dataset is available online at http://lib.stat.cmu.edu/datasets/Plasma_Retinol. Also it is included in the R-Lock5Data package under the name NutritionStudy. The dataset contains observations on the following 14 variables: Age (years), Cholesterol (mg per day), Fiber (Grams consumed per day), Sex (1 = Female, 0 = Male), Smoking Status (1 = Never, 2 = Former, 3 = Current Smoker), Quetelet (Weight/Height²), Vitamin Use (1 = Yes, fairly often, 2 = Yes, not often, 3 = No), Calories (Number consumed per day), Fat (Grams consumed per day), Alcohol (Number of drinks consumed per week), Betadiet (Dietary beta-carotene consumed (mcg) per day), Retdiet (Dietary retinol consumed (mcg per day)), Betaplasma (Plasma beta-carotene (ng/mL)), and Retplasma (Plasma retinol (ng/mL)). Study subjects were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found to be non-cancerous. The medical interest of the work was to investigate the relation between personal characteristics and diet factors on the one hand, and plasma concentrations of retinol, beta-carotene, or other carotenoids may be as associated with greater risk of developing certain types of cancer.

A boxplot of the response variable (Retplasma) indicated several outliers. In addition, the Shapiro-Wilk test showed that this variable is far from normally distributed with a p-value of 0.00. There was one extremely high (203) leverage point in the series “Alcohol” that was deleted prior to the analysis. Thus the actual set of observations consists of $n = 314$ values. All series are standardized, except for the integer series Age, Sex, Smoking Status, and Vitamin Use. Figure 2 shows the estimated additive fits of seven covariates. It appears that the relationship between the response variable and each of these covariates is nonlinear. The remaining six covariates seem to be linearly related to Retplasma. So, it is natural to analyze the complete dataset via penalized QR.

5.1. Low Dimension

To assess the performance of the one-step variable selection method, we randomly selected 200 observations from the complete dataset. We used these observations

Figure 2. Plots of the estimated additive functions for seven predictors, with associated standard errors (dotted lines), using the R-gam function.

for model fitting and variable selection. The remaining 114 observations are used for checking the prediction performance of the fitted model. This procedure is repeated 100 times through the bootstrapping technique. Given this setup, and to guard against false positives, we declare a covariate to be relevant when it is selected 60% of the time.

Panel I of Table 2 contains the variable selection results for QL and QS, respectively. For $τ = 0.25$ , QL identifies different covariates than QS. However, for $τ = 0.5$ both methods select the variables Age and Alcohol as relevant covariates. On the other hand, when $τ = 0.75$ we see that the variables Age and Smoke1 are selected frequently. In terms of median absolute prediction error (MAPE) computed across all covariates, we have MAPE (QS) = 0.27 and MAPE (QL) = 0.37 for $τ = 0.25$ . When $τ = 0.5$ , MAPE (QL) = MAPE (QS) = 0.36, and when $τ = 0.75$ the MAPEs are equal to 0.36 for both methods.

Table 2. Frequency (≥60%) of covariates selected by QL and QS at three different quantiles based on 100 bootstrap replicates. Selected frequencies are in parentheses.

Method	$τ = 0.25$	$τ = 0.5$	$τ = 0.75$
	I. Low dimension (Original dataset), $(n, p_{n}) = (315, 15)$
QL	Age (61), Alcohol (95)	Age (99), Alcohol (94)	Age (99), Alcohol (76)
QL	Betaplasma (97)		Sex (85), Smoke1 (71)
QS	Sex (100)	Age (90), Alcohol (62))	Age (78), Smoke1 (70)
QS		Sex (96)
	II. High dimension (Augmented dataset), $(n, p_{n}) = (315, 215)$
QL	Alcohol (84), Betaplasma (70)	Age (98), Alcohol (89)
QS		Age (85)

5.2. High Dimension

To investigate the QL and QS methods in high dimensions, we add 200 irrelevant covariates to the dataset. These new variables are randomly drawn from a U (0, 1) distribution. Hence, the augmented dataset consists of 215 covariates (200 irrelevant and 15 relevant ones). Using the same procedure as above, Panel II of Table 2 summarizes the variable selection results. For $τ = 0.25$ and $τ = 0.75$ , no covariates are selected by QS while QL selects covariates at these quantiles. Clearly, QS tends to produce more sparse models than QL. Further, in terms of MAPE, the QS method performs better than QL at $τ = 0.25$ . Specifically, MAPE (QS) = 0.28 and MAPE (QL) = 0.37. For $τ = 0.5$ MAPE (QL) = MAPE (QS) = 0.36, while for $τ = 0.75$ MAPE (QL) = 0.35 and MAPE (QS) = 0.34.

Figure 3 presents a boxplot of the prediction errors of QL and QS at different quantiles using the augmented plasma dataset, and based on 100 bootstrapped replications. The QS method at $τ = 0.25$ provides a good prediction, followed by the same method at $τ = 0.5$ and $τ = 0.75$ . Overall, the one-step QS method is found to be a good variable selection procedure in detecting the significant variables, even when some irrelevant variables are included in the dataset under study. The prediction errors of the proposed method are also smaller than those using the one-step QL method.

Figure 3. Boxplot of the prediction errors for the augmented plasma dataset.

6. Summary

In this paper, we propose a one-step estimation procedure for variable selection in sparse, high-dimensional, linear QR models. In particular, we used the SCAD penalty with an iteratively weighted least squares estimator to determine the penalty weights. In a simulation study, we have shown that the resulting one-step QS method performs better than a one-step QL method for three different scenarios representing the error distribution. Additionally, we illustrated the merit of our method via a real data example.

A limitation of the proposed procedure is that there is a computational speed trade-off, as briefly discussed in Section 4.2. However, this result is outweighed by an increase in estimation accuracy when the sample size and/or the number of covariates increases. Another issue worth mentioning is the practical implication of the IWLS method. For instance, in specific fields like genomics or financial modeling, analyzing a large number of observations, our method’s robustness to heavy-tailed distributions and outliers would be particularly advantageous.

Several possible extensions to this paper can be pursued. For instance, one may wish to estimate nonlinear regressions via the QS method. Perhaps, in that case, efficiency may be further enhanced by a more prudent choice of the initial estimator than using the IWLS method. It may also be of interest to assess the robustness of the IWLS method by changing the level of temporal correlation among the covariates. Selecting the optimal SCAD tuning parameter $λ$ by a different data-driven method (e.g., AIC) and then investigating its impact on the QS estimates is another fruitful future research direction.

NOTES

¹Throughout the paper the number of variables is denoted by p and the number of measurements (observations) by n. With high-dimensional data, we refer to the case that p is assumed to be large, possibly larger than n.

²The oracle property of a method means that it can correctly select the nonzero parameters with probability converging to one, and that the estimators of the nonzero parameters are normal with the same mean and covariance that they would have if there is a priori knowledge about the active covariates.

³The concept of additivity, as a simplifying structure, is basic to many proposed models in science. Within economics, Deaton and Muelbauer (1980) provide many microeconomic examples in which the structure is convenient for analysis and important for interpretability. Indeed, additivity is widely used in parametric, semi- and nonparametric analysis of economic data.

⁴Note that in (12) can be written as , the above minimization problem can be restated as an unpenalized weighted QR problem; see, e.g., Sherwood and Wang (2016).

⁵The OL method is also a one-step adaptive procedure using the same IWLS initial estimates as the proposed QS method to ensure the comparison is on equal footing.

Conflicts of Interest

The author reports no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

References

[1]	Abdelmalek, N. N. (1974). On the Discrete Linear L1 Approximation and L1 Solutions of Overdetermined Linear Equations. Journal of Approximation Theory, 11, 38-53. [Google Scholar] [CrossRef]
[2]	Amin, M., Song, L., Thorlie, M. A., & Wang, X. (2015). SCAD-Penalized Quantile Regression for High-Dimensional Data Analysis and Variable Selection. Statistica Neerlandica, 69, 212-235. [Google Scholar] [CrossRef]
[3]	Chen, L., & Lee, S. (2023). Sparse Quantile Regression. Journal of Econometrics, 235, 2195-2217. [Google Scholar] [CrossRef]
[4]	Deaton, A., & Muellbauer, J. (1980). Economics and Consumer Behavior. Cambridge University Press. [Google Scholar] [CrossRef]
[5]	Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360. [Google Scholar] [CrossRef]
[6]	Fan, J., & Lv, J. (2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70, 849-911. [Google Scholar] [CrossRef] [PubMed]
[7]	Khan, D. M., Yaqoob, A., Iqbal, N., Wahid, A., Khalil, U., Khan, M. et al. (2019). Variable Selection via SCAD-Penalized Quantile Regression for High-Dimensional Count Data. IEEE Access, 7, 153205-153216. [Google Scholar] [CrossRef]
[8]	Koenker, R., & Bassett, G. (1978). Regression Quantiles. Econometrica, 46, 33-50. [Google Scholar] [CrossRef]
[9]	Lee, E. R., Noh, H., & Park, B. U. (2014). Model Selection via Bayesian Information Criterion for Quantile Regression Models. Journal of the American Statistical Association, 109, 216-229. [Google Scholar] [CrossRef]
[10]	Mai, T. T. (2024). A Sparse Pac-Bayesian Approach for High-Dimensional Quantile Prediction. Statistics and Computing, 35, Article No. 93. [Google Scholar] [CrossRef]
[11]	Muggeo, V. M. R., Sciandra, M., & Augugliaro, L. (2011). Quantile Regression via Iterative Least Squares Computations. Journal of Statistical Computation and Simulation, 82, 1557-1569. [Google Scholar] [CrossRef]
[12]	Noh, H., & Lee, E. R. (2014). Component Selection in Additive Quantile Regression Models. Journal of the Korean Statistical Society, 43, 439-452. [Google Scholar] [CrossRef]
[13]	Sherwood, B., & Maidman, A. (2022). Additive Nonlinear Quantile Regression in Ultra-High Dimension. Journal of Machine Learning Research, 23, 1-47.
[14]	Sherwood, B., & Wang, L. (2016). Partially Linear Additive Quantile Regression in Ultra-High Dimension. The Annals of Statistics, 44, 288-317. [Google Scholar] [CrossRef]
[15]	Sherwood, B., Li, S., & Maidman, A. (2025). rqPen: Penalized Quantile Regression. R Package Version 4.1.3. https://github.com/bssherwood/rqpen
[16]	Song, Y., Li, Z., & Fang, M. (2022). Robust Variable Selection Based on Penalized Composite Quantile Regression for High-Dimensional Single-Index Models. Mathematics, 10, Article 2000. [Google Scholar] [CrossRef]
[17]	Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58, 267-288. [Google Scholar] [CrossRef]
[18]	Wang, L., Wu, Y., & Li, R. (2012). Quantile Regression for Analyzing Heterogeneity in Ultra-High Dimension. Journal of the American Statistical Association, 107, 214-222. [Google Scholar] [CrossRef] [PubMed]
[19]	Wu, Y., & Liu, Y. (2009). Variable Selection in Quantile Regression. Statistica Sinica, 19, 801-817. https://www.jstor.org/stable/24308857
[20]	Zhao, P., & Yu, B. (2006). On Model Selection Consistency of Lasso. The Journal of Machine Learning Research, 7, 2541-2563.
[21]	Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429. [Google Scholar] [CrossRef]
[22]	Zou, H., & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67, 301-320. [Google Scholar] [CrossRef]
[23]	Zou, H., & Li, R. (2008). One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models. The Annals of Statistics, 36, 1509-1533 [Google Scholar] [CrossRef] [PubMed]

	customer@scirp.org
	+86 18163351462 (WhatsApp)
	1655362766
	SCIRP WeChat

Journals Menu

Home

About SCIRP

Service

Policies