
This study considers using five color readings (red, green, blue, saturation, and hue) to predict the concentration of a substance. The procedure is demonstrated on publicly available data obtained from the internet. A stepwise regression method implemented in MATLAB was used to build a model that predicts the concentration of sulfur dioxide. In the course of the study, we also examined other linear and non-linear models for predicting the sulfur dioxide concentration, but these did not perform as well as the stepwise model, which may be due to the strong collinearity among the original data variables. Statistical results for the model, including the number of observations, root mean squared error, adjusted R-squared, and the F-statistic versus the constant model, are reported to assess its reliability. We also carry out an error analysis and discussion.

Colorimetry [

Data for a particular material sulfur dioxide (SO_{2}) are given in

We find that the green, blue, and hue readings in

Concentration C (ppm) | Red (R) | Green (G) | Blue (B) | Saturation (S) | Hue (H) |
---|---|---|---|---|---|
water | 153 | 148 | 157 | 138 | 14 |
water | 153 | 147 | 157 | 138 | 16 |
water | 153 | 146 | 158 | 137 | 20 |
water | 153 | 146 | 158 | 137 | 20 |
water | 154 | 145 | 157 | 141 | 19 |
20 | 144 | 115 | 170 | 135 | 82 |
20 | 144 | 115 | 169 | 136 | 81 |
20 | 145 | 115 | 172 | 135 | 83 |
30 | 145 | 114 | 174 | 135 | 87 |
30 | 145 | 114 | 176 | 135 | 89 |
30 | 145 | 114 | 175 | 135 | 89 |
30 | 146 | 114 | 175 | 135 | 88 |
50 | 142 | 99 | 175 | 137 | 110 |
50 | 141 | 99 | 174 | 137 | 109 |
50 | 142 | 99 | 176 | 136 | 110 |
80 | 141 | 96 | 181 | 135 | 119 |
80 | 141 | 96 | 182 | 135 | 119 |
80 | 140 | 96 | 182 | 135 | 120 |
100 | 139 | 96 | 175 | 136 | 115 |
100 | 139 | 96 | 174 | 136 | 114 |
100 | 139 | 96 | 176 | 136 | 116 |
150 | 139 | 86 | 178 | 136 | 131 |
150 | 139 | 87 | 177 | 137 | 129 |
150 | 138 | 86 | 177 | 137 | 130 |
150 | 139 | 86 | 178 | 137 | 131 |

In this table, the correlation coefficients are calculated by the following formula:

$$ r(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \tag{1} $$

Here, $\mathrm{Cov}(X, Y)$ is the covariance of X and Y, and $\mathrm{Var}(X)$ and $\mathrm{Var}(Y)$ are the variances of X and Y, respectively.
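As a minimal sketch of this formula, the Python snippet below computes the correlation between concentration and the green reading from the data table. The helper name `pearson` is hypothetical, and coding the "water" rows as 0 ppm is an assumption not stated explicitly in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: Cov(x, y) / sqrt(Var(x) * Var(y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The common 1/n (or 1/(n-1)) factors cancel in the ratio.
    return cov / sqrt(var_x * var_y)


# Concentration (ppm) of the 25 observations; "water" coded as 0 (assumption).
C = [0] * 5 + [20] * 3 + [30] * 4 + [50] * 3 + [80] * 3 + [100] * 3 + [150] * 4
# Green readings in the same row order as the data table.
G = [148, 147, 146, 146, 145, 115, 115, 115, 114, 114, 114, 114,
     99, 99, 99, 96, 96, 96, 96, 96, 96, 86, 87, 86, 86]

r_cg = pearson(C, G)
print(round(r_cg, 2))
```

Under that assumption the result rounds to −0.87, which agrees with the C/G entry of the correlation table.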

| | C | H | B | S | R | G |
|---|---|---|---|---|---|---|
| C | 1.00 | | | | | |
| H | 0.83 | 1.00 | | | | |
| B | 0.70 | 0.96 | 1.00 | | | |
| S | −0.15 | −0.52 | −0.67 | 1.00 | | |
| R | −0.84 | −0.98 | −0.91 | 0.49 | 1.00 | |
| G | −0.87 | −1.00 | −0.93 | 0.45 | 0.99 | 1.00 |

In statistics, stepwise regression [

Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.

Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.

Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.
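To make the first of these approaches concrete, here is a rough Python sketch of forward selection on main effects only. It is not MATLAB's implementation: it uses a fixed F-to-enter threshold (`f_enter=4.0`, an illustrative assumption) instead of a p-value, has no removal step, and does not consider interaction terms. The function names `solve`, `sse`, and `forward_select` are hypothetical, and "water" is coded as 0 ppm by assumption.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares of a least-squares fit of y on 1 + columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))


def forward_select(candidates, y, f_enter=4.0):
    """Return [(name, partial F)] in the order in which terms enter."""
    chosen, steps = [], []
    current = sse([], y)                      # constant-only model
    while len(chosen) < len(candidates):
        best = None
        for name, col in candidates.items():
            if name in chosen:
                continue
            cols = [candidates[k] for k in chosen] + [col]
            new = sse(cols, y)
            dof = len(y) - len(cols) - 1      # error degrees of freedom
            if dof <= 0 or new <= 0.0:
                continue
            f_stat = (current - new) / (new / dof)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, new)
        if best is None or best[1] < f_enter:
            break                             # no remaining term is worth adding
        chosen.append(best[0])
        steps.append((best[0], best[1]))
        current = best[2]
    return steps


# Data in the row order of the data table; "water" coded as 0 ppm (assumption).
C = [0] * 5 + [20] * 3 + [30] * 4 + [50] * 3 + [80] * 3 + [100] * 3 + [150] * 4
G = [148, 147, 146, 146, 145, 115, 115, 115, 114, 114, 114, 114,
     99, 99, 99, 96, 96, 96, 96, 96, 96, 86, 87, 86, 86]
H = [14, 16, 20, 20, 19, 82, 81, 83, 87, 89, 89, 88,
     110, 109, 110, 119, 119, 120, 115, 114, 116, 131, 129, 130, 131]
B = [157, 157, 158, 158, 157, 170, 169, 172, 174, 176, 175, 175,
     175, 174, 176, 181, 182, 182, 175, 174, 176, 178, 177, 177, 178]

steps = forward_select({"G": G, "H": H, "B": B}, C)
print(steps[0])
```

With these inputs the first term to enter is G, with a partial F-statistic of about 69.81. Later steps need not match a procedure that also tests interaction terms and removals.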

We can use the algorithm flowchart [

In this paper, we use the bidirectional elimination stepwise regression to determine a final model in MATLAB [

that the term would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, for example when the p-value of an F-test of the change in the sum of squared errors is smaller than the default threshold of 0.05, the term is added to the model. Conversely, if a term is currently in the model, the null hypothesis is that the term has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the term is removed from the model.

Using MATLAB function “stepwiselm” [

1) Adding G, F-Stat = 69.8123, p-Value = 2.01854e−08

2) Adding H, F-Stat = 24.5393, p-Value = 5.89362e−05

3) Adding G*H, F-Stat = 25.4857, p-Value = 5.34767e−05

4) Adding B, F-Stat = 11.9566, p-Value = 0.00248622

We see that the stepwise algorithm adds the color variables G (green) and H (hue), the interaction term G*H, and B (blue) to the model, each with a p-value below the default threshold of 0.05. That is, in each case there is sufficient evidence to reject the null hypothesis, so the term is added to the model. The final model is:

$$ C = -1565.4 + 20.661\,G - 10.399\,B + 18.763\,H - 0.059739\,(G \times H) \tag{1} $$

From this model, we can see that the concentration of sulfur dioxide is mainly determined by the linear effects of green, blue, and hue and by the interaction between green and hue, which is consistent with the previous data analysis.
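As a quick illustration, the fitted model can be evaluated directly. The function name `predict_so2` is hypothetical, the coefficients are the rounded values printed in the model equation, and treating "water" as 0 ppm is an assumption.

```python
def predict_so2(g, b, h):
    """Predicted SO2 concentration (ppm) from green, blue, and hue readings."""
    return (-1565.4 + 20.661 * g - 10.399 * b
            + 18.763 * h - 0.059739 * g * h)


# First "water" row of the data table: G = 148, B = 157, H = 14.
c_hat = predict_so2(148, 157, 14)
residual = 0 - c_hat          # observed minus fitted, water coded as 0 ppm
print(round(c_hat, 2), round(residual, 2))
```

For this row the prediction is about −1.31 ppm, so the residual is about 1.31, which matches the first entry of the residual table given later in the paper.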

In MATLAB function “stepwiselm”, the F-Stat (F statistic) of F-test and other parameters are calculated by the following formula [

$$ \beta = (X^{T}X)^{-1}X^{T}Y, \qquad SSE = \sigma^{2} = \frac{1}{n}\,\lvert Y - X\beta \rvert^{2} \tag{2} $$

$$ F\text{-}Stat = \frac{MSR}{MSE} = \frac{SSR/(p-1)}{SSE/(n-p)} \tag{3} $$

$$ SSR = \frac{\beta^{T}X^{T}X\beta}{n}, \qquad SST = SSE + SSR \tag{4} $$

$$ R^{2} = \frac{SSR}{SST} \tag{5} $$

Here Y is the n × 1 vector of the response variable, X is the n × p matrix whose first column is all ones, n is the number of samples, and p is the number of predictor terms (including the constant term and the interaction term) at each step of the stepwise procedure.
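As a rough numerical check of these quantities, the sketch below recomputes the summary statistics from the 25 residuals listed in the paper's residual table. It assumes the conventional unnormalized sums of squares, with RMSE taken as sqrt(SSE / (n − p)) and SSR as SST − SSE; this is the convention under which the reported values are reproduced, and it differs from the 1/n normalization written in the formulas. "Water" is coded as 0 ppm, also an assumption.

```python
from math import sqrt

# Observed concentrations (ppm); "water" coded as 0 (assumption).
C = [0] * 5 + [20] * 3 + [30] * 4 + [50] * 3 + [80] * 3 + [100] * 3 + [150] * 4
# Residuals (observed minus fitted) as listed in the paper's residual table.
residuals = [1.31, 1.17, -8.89, -8.89, 10.28, 1.99, 3.48, 10.89, 9.58, 6.47,
             -3.92, 8.03, -23.59, -21.14, -13.19, -6.18, 4.22, -8.80, 3.54,
             6.17, 0.91, 4.64, 8.54, 7.87, 4.64]

n, p = len(C), 5                      # 25 observations, 5 fitted coefficients
mean_c = sum(C) / n
sst = sum((c - mean_c) ** 2 for c in C)     # total sum of squares
sse = sum(e ** 2 for e in residuals)        # residual sum of squares
ssr = sst - sse                             # regression sum of squares

rmse = sqrt(sse / (n - p))                  # root mean squared error
r2 = ssr / sst                              # coefficient of determination
f_stat = (ssr / (p - 1)) / (sse / (n - p))  # F-statistic vs. constant model
print(round(rmse, 1), round(r2, 3), round(f_stat))
```

Under these assumptions the script gives an RMSE of about 10.4, an R² of about 0.967, and an F-statistic of about 146 on 20 error degrees of freedom.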

Finally, the program displays the following model parameters:

Number of observations: 25, Error degrees of freedom: 20

Root Mean Squared Error (the σ in Formula (2)): 10.4

R^{2}: 0.967

F-statistic vs. constant model: 146, p-value = 1.69e−14

The above parameters indicate that we used all 25 data in

In ^{th}, 14^{th} observations are outliers. From ^{th} and 14^{th} cases have larger residuals (marked in red).

One way to test for errors in models created by stepwise regression is not to rely on the model’s F-statistic, significance, or multiple R, but instead to assess the model against a set of data that was not used to create the model [

Residuals, observations 1–13 | Residuals, observations 14–25 |
---|---|
1.31 | −21.14 |
1.17 | −13.19 |
−8.89 | −6.18 |
−8.89 | 4.22 |
10.28 | −8.80 |
1.99 | 3.54 |
3.48 | 6.17 |
10.89 | 0.91 |
9.58 | 4.64 |
6.47 | 8.54 |
−3.92 | 7.87 |
8.03 | 4.64 |
−23.59 | |

particularly valuable when data are collected in different settings (e.g., different times, social vs. solitary situations) or when models are assumed to be generalizable. However, in this article, there are only 7 different concentrations of data in

In addition to using the stepwise regression method to establish a regression model of color variables and the concentration in

Using mathematical models to establish the relationship between color variables and substance concentration is of practical value. Combined with modern photography techniques, a mathematical model allows the concentration to be predicted relatively quickly and accurately. In practice, we first need to analyze the characteristics of the data itself and then choose the most reliable of several candidate models for fitting and prediction, paying attention to the balance between goodness of fit and predictive performance.

Pan, X.Y. and Cui, Y. (2018) Working with Color Readings: Application of Regression Models for Determining the Concentration of Substance. Open Access Library Journal, 5: e4377. https://doi.org/10.4236/oalib.1104377