
A sparse vector regression model is developed. The model is established through a Bayesian formulation and trained on a set of data. The number of parameters that must be determined in the algorithm is reduced by a special setting of the prior hyperparameters, and the algorithm is therefore simpler than Bayesian vector regression models of a similar type. Examples of application to function approximation and to an inverse scattering problem are presented.

There has been considerable interest in Bayesian vector regression and its application to various classification and regression problems [

The rest of this paper is organized as follows. The theory of the vector-regression formulation is presented in Section 2, with application examples provided in Section 3. The work is summarized in Section 4.

Assume we have available a set of training data $D = \{x_n, t_n\}_{n=1}^{N}$, where $x_n = [x_n^{(1)}\; x_n^{(2)}\; \cdots\; x_n^{(L)}]^\intercal$ and $t_n = [t_n^{(1)}\; t_n^{(2)}\; \cdots\; t_n^{(M)}]^\intercal$. Our objective is to develop a function $y(x;w)$ that depends on the parameters $w$. After $y(x;w)$ is so designed, it may be used to map an arbitrary $x$ to an approximation of the target parameters $t$.

The specific vector-regression function $y(x;w) = [y^{(1)}(x;w)\; y^{(2)}(x;w)\; \cdots\; y^{(M)}(x;w)]^\intercal$ employed here is defined as

$$y(x;w) = \sum_{i=1}^{N} w_i\, t_i\, K(x, x_i) + w_0 \qquad (1)$$

where $w_0 = [w_0^{(1)}\; w_0^{(2)}\; \cdots\; w_0^{(M)}]^\intercal$, and $K(x, x_i)$ is a kernel function designed such that $K(x, x_i)$ is large if $x_i \approx x$ and small otherwise. Hence in (1) only those $x_i \approx x$ contribute significantly to $y(x;w)$.

Let

$$w = [w_1\; w_2\; \cdots\; w_N\; w_0^{(1)}\; w_0^{(2)}\; \cdots\; w_0^{(M)}]^\intercal,$$

$$\psi_i(x) = [\phi_i^{(1)}\; \phi_i^{(2)}\; \cdots\; \phi_i^{(M)}]^\intercal, \qquad i = 1, 2, \cdots, N$$

with

$$\phi_i^{(k)} = t_i^{(k)}\, K(x, x_i), \qquad i = 1, 2, \cdots, N;\; k = 1, 2, \cdots, M \qquad (2)$$

and the $M \times (N+M)$ matrix

$$\Psi(x) = [\psi_1(x)\; \psi_2(x)\; \cdots\; \psi_N(x)\; I_M], \qquad (3)$$

where $I_M$ is the $M \times M$ identity matrix, then (1) can be expressed in the matrix form

$$y(x;w) = \Psi(x)\, w \qquad (4)$$
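For concreteness, the construction of $\Psi(x)$ in (3) and the matrix form (4) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's code: it borrows the radial-basis-function kernel adopted later in Section 3, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x, xi, r):
    # Radial-basis-function kernel used in Section 3: exp(-||x - x_i||^2 / r^2)
    return np.exp(-np.sum((x - xi) ** 2) / r ** 2)

def design_matrix(x, X_train, T_train, r):
    # The M x (N+M) matrix Psi(x) of Eq. (3): the i-th of the first N
    # columns is psi_i(x) = t_i * K(x, x_i); the last M columns are I_M.
    N, M = T_train.shape
    cols = np.column_stack([T_train[i] * rbf_kernel(x, X_train[i], r)
                            for i in range(N)])
    return np.hstack([cols, np.eye(M)])

def model_output(x, w, X_train, T_train, r):
    # Eq. (4): y(x; w) = Psi(x) w, with w the (N+M)-vector defined above
    return design_matrix(x, X_train, T_train, r) @ w
```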

Assume that the target arises from the model with additive noise

$$t = y(x;w) + \varepsilon = \Psi(x)\, w + \varepsilon, \qquad (5)$$

where the model error $\varepsilon = [\varepsilon^{(1)}\; \varepsilon^{(2)}\; \cdots\; \varepsilon^{(M)}]^\intercal$ and the $\varepsilon^{(k)}$, $k = 1, 2, \cdots, M$, are independent samples from a zero-mean Gaussian distribution with variance $\alpha_0^{-1}$

$$p(\varepsilon^{(k)}) = \mathcal{N}(\varepsilon^{(k)} \,|\, 0,\, \alpha_0^{-1}), \qquad k = 1, 2, \cdots, M \qquad (6)$$

We therefore have

$$p(t \,|\, x, w, \alpha_0) = \left(\frac{2\pi}{\alpha_0}\right)^{-M/2} \exp\left(-\frac{\alpha_0}{2}\, \|t - \Psi(x)\, w\|_2^2\right) = \mathcal{N}(t \,|\, \Psi(x)\, w,\, \alpha_0^{-1} I_M) \qquad (7)$$

We wish to constrain the weights $w$ such that a simple model is favored; this is accomplished by invoking a prior distribution on $w$ that favors most of the weights being zero. In this context, only the most relevant members of the training set $D = \{x_n, t_n\}_{n=1}^{N}$, those with nonzero weights $w_n$, are ultimately used in the final regression model. This simplicity allows improved regression performance for $(x, t) \notin D$ [

We employ a zero-mean Gaussian prior distribution for $w$

$$p(w \,|\, \alpha_0, \alpha) = \mathcal{N}(w \,|\, 0_{N+M},\, \alpha_0^{-1} \alpha^{-1} I_{N+M}), \qquad (8)$$

where $0_{N+M}$ is an $(N+M)$-dimensional zero vector, $I_{N+M}$ is an $(N+M) \times (N+M)$ identity matrix, and suitable priors over the hyperparameters $\alpha_0$ and $\alpha$ are Gamma distributions [

$$p(\alpha_0 \,|\, a, b) = \text{Gamma}(\alpha_0 \,|\, a, b) \qquad (9)$$

$$p(\alpha \,|\, c, d) = \text{Gamma}(\alpha \,|\, c, d) \qquad (10)$$

where $\text{Gamma}(\alpha_0 \,|\, a, b) = \Gamma(a)^{-1}\, b^a\, \alpha_0^{a-1}\, e^{-b \alpha_0}$ with $\Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\, dt$.

The hierarchical prior over $w$ favors a sparse model, and the prior over $\alpha_0$ will be used to favor small model error on the training data $D$.

For the training data $D = \{x_n, t_n\}_{n=1}^{N}$ we introduce the $LN$-dimensional vector

$$X = [x_1^\intercal\; x_2^\intercal\; \cdots\; x_N^\intercal]^\intercal$$

and the $MN$-dimensional vector

$$T = [t_1^\intercal\; t_2^\intercal\; \cdots\; t_N^\intercal]^\intercal$$

and let the $(MN) \times (M+N)$ matrix

$$\Phi = [\Phi_1^\intercal\; \Phi_2^\intercal\; \cdots\; \Phi_N^\intercal]^\intercal \quad \text{with} \quad \Phi_i = \Psi(x_i), \quad i = 1, 2, \cdots, N,$$

then by (7), we have

$$p(T \,|\, w, \alpha_0, X) = \left(\frac{2\pi}{\alpha_0}\right)^{-MN/2} \exp\left(-\frac{\alpha_0}{2}\, \|T - \Phi w\|_2^2\right) = \mathcal{N}(T \,|\, \Phi w,\, \alpha_0^{-1} I_{MN}) \qquad (11)$$

Noting that $p(T \,|\, \alpha_0, \alpha, X) = \int p(T \,|\, w, \alpha_0, X)\, p(w \,|\, \alpha_0, \alpha)\, dw$ is a convolution of Gaussians, the posterior distribution over the weights $w$ can be derived as

$$p(w \,|\, \alpha_0, \alpha, X, T) = \frac{p(T \,|\, w, \alpha_0, X)\, p(w \,|\, \alpha_0, \alpha)}{p(T \,|\, \alpha_0, \alpha, X)} = \mathcal{N}(w \,|\, \mu,\, \alpha_0^{-1} \Sigma) \qquad (12)$$

where

$$\Sigma = (\Phi^\intercal \Phi + \alpha I_{M+N})^{-1} = \left(\sum_{i=1}^{N} \Phi_i^\intercal \Phi_i + \alpha I_{M+N}\right)^{-1} \qquad (13)$$

$$\mu = \Sigma\, \Phi^\intercal T = \Sigma \sum_{i=1}^{N} \Phi_i^\intercal t_i \qquad (14)$$

We determine $\alpha$ in (13) by maximizing $p(\alpha \,|\, T, X) \propto p(T \,|\, \alpha, X)\, p(\alpha)$ with respect to $\alpha$. Equivalently, we may maximize the logarithm of this quantity. In addition, we can choose to maximize with respect to $\ln \alpha$, since we can assume hyperpriors over a logarithmic scale.

Since

$$\ln p(T \,|\, \alpha, X) = \ln \int p(T \,|\, w, \alpha_0, X)\, p(w \,|\, \alpha_0, \alpha)\, p(\alpha_0 \,|\, a, b)\, dw\, d\alpha_0 = -\frac{1}{2}\left[\ln|B| + (MN + 2a) \ln\left(T^\intercal B^{-1} T + 2b\right)\right] + \mathrm{const}$$

where $B = I_{MN} + \alpha^{-1} \Phi \Phi^\intercal$ and $p(\ln \alpha) = \alpha\, p(\alpha)$, we obtain the objective function

$$L(\alpha) = -\frac{1}{2}\left[\ln|B| + (MN + 2a) \ln\left(T^\intercal B^{-1} T + 2b\right)\right] + c \ln \alpha - d\alpha \qquad (15)$$

By the determinant identity [

$$|B| = |I_{MN} + \alpha^{-1} \Phi \Phi^\intercal| = \alpha^{-(M+N)}\, |\alpha I_{M+N} + \Phi^\intercal \Phi| = \alpha^{-(M+N)}\, |\Sigma^{-1}|,$$

and so

$$\ln|B| = -(M+N) \ln \alpha + \ln|\Sigma^{-1}| \qquad (16)$$

Using the Woodbury formula, we obtain

$$B^{-1} = (I_{MN} + \alpha^{-1} \Phi \Phi^\intercal)^{-1} = I_{MN} - \Phi\, (\alpha I_{M+N} + \Phi^\intercal \Phi)^{-1}\, \Phi^\intercal = I_{MN} - \Phi \Sigma \Phi^\intercal,$$

thus

$$T^\intercal B^{-1} T = T^\intercal (T - \Phi \Sigma \Phi^\intercal T) = T^\intercal (T - \Phi \mu) \qquad (17)$$

$$T^\intercal B^{-1} T = \|T\|^2 - T^\intercal \Phi \Sigma \Phi^\intercal T \qquad (18)$$

Then by (16) and Jacobi’s formula, we have

$$\frac{d \ln|B|}{d \ln \alpha} = -(M+N) + \frac{1}{|\Sigma^{-1}|}\, \frac{d |\Sigma^{-1}|}{d \ln \alpha} = -(M+N) + \mathrm{tr}\left(\Sigma\, \frac{d \Sigma^{-1}}{d \ln \alpha}\right) = -(M+N) + \alpha \sum_{j=1}^{M+N} \Sigma_{jj} \qquad (19)$$

where $\Sigma_{jj}$ is the $j$-th diagonal element of the matrix $\Sigma$.

By (18)

$$\frac{d\left(T^\intercal B^{-1} T\right)}{d \ln \alpha} = -\frac{d\left(T^\intercal \Phi \Sigma \Phi^\intercal T\right)}{d \ln \alpha} = -T^\intercal \Phi\, \frac{d \Sigma}{d \ln \alpha}\, \Phi^\intercal T = T^\intercal \Phi\, \Sigma\, \frac{d \Sigma^{-1}}{d \ln \alpha}\, \Sigma\, \Phi^\intercal T = \alpha\, \|\mu\|^2 \qquad (20)$$

Using (17), (19) and (20), we have

$$\frac{d L(\alpha)}{d \ln \alpha} = \frac{1}{2}\left(M+N - \alpha \sum_{j=1}^{M+N} \Sigma_{jj}\right) - \frac{MN + 2a}{2\left(T^\intercal B^{-1} T + 2b\right)}\, \frac{d\left(T^\intercal B^{-1} T\right)}{d \ln \alpha} + c - d\alpha = \frac{1}{2}\left(M+N - \alpha \sum_{j=1}^{M+N} \Sigma_{jj}\right) - \frac{(MN + 2a)\, \|\mu\|^2\, \alpha}{2\left[T^\intercal (T - \Phi \mu) + 2b\right]} + c - d\alpha \qquad (21)$$

Setting (21) to zero, multiplying through by 2, and solving for $\alpha$ yields

$$\alpha = \frac{M + N + 2c}{\displaystyle\sum_{j=1}^{M+N} \Sigma_{jj} + 2d + (MN + 2a)\, \|\mu\|^2 / \left[T^\intercal (T - \Phi \mu) + 2b\right]} \qquad (22)$$

The algorithm consists of iterating (13), (14) and (22) to update $\Sigma$, $\mu$ and $\alpha$.
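Continuing the NumPy sketch above, the training iteration can be written as follows; the initialization $\alpha = 1$, the iteration cap, and the stopping tolerance are illustrative assumptions not specified in the paper.

```python
def train(X_train, T_train, r, a=0.05, b=0.05, c=0.05, d=0.05,
          n_iter=100, tol=1e-6):
    # Iterate Eqs. (13), (14) and (22) until alpha converges.
    N, M = T_train.shape
    # Stack Phi_i = Psi(x_i) into the (MN) x (N+M) matrix Phi
    Phi = np.vstack([design_matrix(X_train[i], X_train, T_train, r)
                     for i in range(N)])
    T = T_train.reshape(-1)             # T = [t_1^T ... t_N^T]^T, length MN
    PtP, PtT = Phi.T @ Phi, Phi.T @ T   # precompute Phi^T Phi and Phi^T T
    alpha = 1.0                         # illustrative starting value
    for _ in range(n_iter):
        Sigma = np.linalg.inv(PtP + alpha * np.eye(N + M))        # Eq. (13)
        mu = Sigma @ PtT                                          # Eq. (14)
        resid = T @ (T - Phi @ mu)                                # T^T (T - Phi mu)
        alpha_new = (M + N + 2 * c) / (np.trace(Sigma) + 2 * d
                    + (M * N + 2 * a) * (mu @ mu) / (resid + 2 * b))  # Eq. (22)
        if abs(alpha_new - alpha) <= tol * alpha:
            alpha = alpha_new
            break
        alpha = alpha_new
    return mu, Sigma, alpha
```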

Assume $\alpha^{\mathrm{MP}}$ and $\alpha_0^{\mathrm{MP}}$ are the maximizing values obtained by maximizing $p(\alpha \,|\, T, X)$ (Section 2.3) and $p(\alpha_0 \,|\, T, X)$, respectively. Assume

$$p(\alpha_0, \alpha \,|\, X, T) \approx \delta(\alpha_0 - \alpha_0^{\mathrm{MP}})\, \delta(\alpha - \alpha^{\mathrm{MP}})$$

then

$$p(t \,|\, x, X, T) = \int p(t \,|\, x, w, \alpha_0, \alpha)\, p(w, \alpha_0, \alpha \,|\, X, T)\, dw\, d\alpha_0\, d\alpha = \int p(t \,|\, x, w, \alpha_0)\, p(w \,|\, \alpha_0, \alpha, X, T)\, p(\alpha_0, \alpha \,|\, X, T)\, dw\, d\alpha_0\, d\alpha$$

$$\approx \int p(t \,|\, x, w, \alpha_0)\, p(w \,|\, \alpha_0, \alpha, X, T)\, \delta(\alpha_0 - \alpha_0^{\mathrm{MP}})\, \delta(\alpha - \alpha^{\mathrm{MP}})\, dw\, d\alpha_0\, d\alpha = \int p(t \,|\, x, w, \alpha_0^{\mathrm{MP}})\, p(w \,|\, \alpha_0^{\mathrm{MP}}, \alpha^{\mathrm{MP}}, X, T)\, dw = \mathcal{N}(t \,|\, y(x; \mu),\, (\alpha_0^{\mathrm{MP}})^{-1} \Omega) \qquad (23)$$

with

$$y(x; \mu) = \Psi(x)\, \mu \qquad (24)$$

$$\Omega = I_M + \Psi(x)\, \Sigma\, \Psi(x)^\intercal \qquad (25)$$
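In code, the predictive mean (24) and the matrix $\Omega$ of (25) follow directly from the sketches above (again with hypothetical helper names; estimating $\alpha_0^{\mathrm{MP}}$ itself is not shown here):

```python
def predict(x, X_train, T_train, mu, Sigma, r):
    # Predictive mean y(x; mu) of Eq. (24) and the matrix Omega of Eq. (25);
    # the full predictive covariance in Eq. (23) is (alpha0_MP)^-1 * Omega.
    Psi = design_matrix(x, X_train, T_train, r)
    mean = Psi @ mu                                      # Eq. (24)
    Omega = np.eye(Psi.shape[0]) + Psi @ Sigma @ Psi.T   # Eq. (25)
    return mean, Omega
```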

In the examples we employ a radial-basis-function kernel $K(x, x_i) = \exp(-\|x - x_i\|^2 / r^2)$ and adjust the parameters $a$, $b$, $c$ and $d$ by training and testing on the given training data; we finally take $a = b = c = d = 0.05$ for all examples in this section. In all figures the horizontal axis is the index of the samples and the vertical axis is the output.

The model can be used to establish the relation between the independent and dependent variables of a function.

Example 1. A 2-dimensional vector function of two variables:

$$t_1 = \mathrm{sinc}\left(\frac{x_1 + x_2}{4}\right)$$

$$t_2 = -0.5\, \mathrm{sinc}\left(\frac{x_1 + x_2}{4}\right) \sin\left(\frac{x_1 x_2}{20}\right) - 0.4$$

on the domain $\{(x_1, x_2) \,|\, -10 \le x_1 \le 10,\; 0 \le x_2 \le 20\}$, where $\mathrm{sinc}(x) = \sin(x)/x$.

Example 2. A 3-dimensional vector function of 200 variables, $(x_1, x_2, \cdots, x_{200}) \to (t_1, t_2, t_3)$:

$$t_1 = \sum_{k=1}^{200} \sin\left((x_k)^{5/7}\right) + \frac{x_{50}}{100}$$

$$t_2 = \frac{x_{200}}{800}\, t_1 + \frac{x_{50}}{200} + \cos\left(\frac{x_{100}}{5}\right) - 10$$

$$t_3 = \arctan\left(\frac{t_1 + t_2}{6}\right) + \frac{t_2 - t_1}{2} - 10$$

We choose samples at the points $x^n = (x_1^n, x_2^n, \cdots, x_{200}^n)$ with $x_k^n = k + (n-1)\pi/4$. The 100 samples at the points $x^n$ with $n = 1, 3, 5, \cdots, 199$ are used as training data, and the 100 samples at the points $x^n$ with $n = 2, 4, 6, \cdots, 200$ are used as testing data.
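A possible reconstruction of this data split, feeding the earlier `train` and `predict` sketches, is shown below; the kernel width $r$ is not given in the text, so the value used here is an arbitrary placeholder.

```python
def example2_targets(x):
    # Targets (t_1, t_2, t_3) of Example 2 for a 200-dimensional sample x
    t1 = np.sum(np.sin(x ** (5.0 / 7.0))) + x[49] / 100.0
    t2 = x[199] / 800.0 * t1 + x[49] / 200.0 + np.cos(x[99] / 5.0) - 10.0
    t3 = np.arctan((t1 + t2) / 6.0) + (t2 - t1) / 2.0 - 10.0
    return np.array([t1, t2, t3])

k = np.arange(1, 201)                        # component index k = 1..200
X = np.array([k + (n - 1) * np.pi / 4.0 for n in range(1, 201)])
T = np.array([example2_targets(x) for x in X])
X_train, T_train = X[0::2], T[0::2]          # odd n: training data
X_test, T_test = X[1::2], T[1::2]            # even n: testing data

mu, Sigma, alpha = train(X_train, T_train, r=50.0)   # r=50 is a placeholder
preds = np.array([predict(x, X_train, T_train, mu, Sigma, r=50.0)[0]
                  for x in X_test])
```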

The model can be used to characterize the connection between the measured vector of scattered-field data $x$ and the underlying target responsible for these fields, characterized by the parameter vector $t$. The scattering data $x$ may be measured at multiple positions. In the examples the measured data are simulated by a forward model.

We consider a homogeneous lossless dielectric target buried in a lossy dielectric half-space. The objective is to invert for the parameters of the target. In the examples, the parameter vector $t$ is composed of three real numbers: the depth of the target, the size of the target, and the dielectric constant of the target. For each target there are 100 simulated measured data. The training data $D = \{x_n, t_n\}_{n=1}^{N}$ is composed of $N = 180$ examples, and the testing data is composed of 125 examples that are not in $D$.

Example 1. We consider a cube target in this example.

Example 2. We consider a sphere target in this example.

We applied the model to two completely different types of problems, and it works well for both applications. These results show that this regression model can be applied to various types of regression problems.

A Bayesian vector-regression algorithm has been developed. The model employs a statistical prior that favors a sparse model, in which most of the weights are zero [

The author declares no conflicts of interest regarding the publication of this paper.
