Although near infrared (NIR) spectroscopy has been evaluated for numerous applications, the number of actual on-line or even on-site industrial applications seems to be very limited. This paper describes attempts to produce on-line predictions of the chemical oxygen demand (COD) in wastewater from a pulp and paper mill using NIR spectroscopy. The task was perceived as very challenging, but with a root mean square error of prediction of 149 mg/l, roughly corresponding to 1/10 of the studied concentration interval, the attempt was deemed successful. This result was obtained by using partial least squares regression, interpolated reference values for calibration, and calibration data distributed evenly over the concentration range. This work may also represent the first industrial application of on-line COD measurements in wastewater using NIR spectroscopy.

Wastewater flows are characterized by constantly changing flow rates and composition [

One potential method of gaining on-line information of the organic load is to use near infrared (NIR) spectroscopy combined with quantitative models based on multivariate methods. Although NIR spectroscopy can be seen as an extremely powerful tool for industrial quality control and process monitoring [

However, in most treatment plants, COD is probably the most important measure, which is also reflected in the number of publications that describe the use of NIR spectroscopy for COD measurements. These include off-line measurements of only COD [

The spectra were collected with a Red Eye^{®} Online sensor for suspensions and fluids (PulpEye AB, P.O. Box 70, 891 22 Örnsköldsvik, Sweden). The sampling system (also constructed by PulpEye AB), or measurement head, consisted of a filter unit followed by a flow-through cell coupled to optical fibres and equipped with an automated backflush system that used tap water and was activated between measurements. The sampling system was mounted on a bypass loop of the main pipe. A new reference spectrum of the tap water was collected for every third spectrum. Each spectrum consisted of 50 averaged scans, and the path length of the flow-through cell was 1 mm. The spectra comprised 256 wavelengths recorded between 1018 and 2032 nm at an average data resolution of 4 nm. The spectra were collected on-line at 10-minute intervals for a period of 29 days.

During weekdays, laboratory COD reference measurements were generally performed on-site twice a day. In total, 4099 spectra were collected, and COD reference measurements were performed for 36 of these. The calibration models were calculated using the PLS Toolbox v. 7.0.1 (Eigenvector Research, Inc., 3905 West Eaglerock Drive, Wenatchee, WA 98801, USA) together with MATLAB R2011b (The MathWorks AB, Kista, Sweden), in which all matrix calculations were also performed. The calibration methods used were partial least squares (PLS) regression and principal component regression (PCR). The performance of the models was assessed by, among other things, the root mean square error of calibration (RMSEC), the root mean square error of cross validation (RMSECV), and the root mean square error of prediction (RMSEP). A description of the regression methods and the definition of the performance parameters can be found in [
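RMSEC, RMSECV and RMSEP share one formula and differ only in which predictions enter it: fitted calibration samples, cross-validated samples, or an external validation set. The following is a minimal Python sketch of that formula and of a coefficient of determination. The function names are hypothetical, the COD values are made up, and the definitions used here (plain mean of squared residuals, R^{2} as 1 − SS_res/SS_tot) are assumptions, since the text does not state which conventions the authors' software applied.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(reference, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot

# Made-up COD values in mg/l, for illustration only.
cod_reference = [900.0, 1200.0, 1500.0, 1800.0]
cod_predicted = [1000.0, 1100.0, 1600.0, 1700.0]
example_rmse = rmse(cod_reference, cod_predicted)
example_r2 = r_squared(cod_reference, cod_predicted)
```

Passing calibration fits gives an RMSEC, cross-validated predictions an RMSECV, and predictions for held-out spectra an RMSEP.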

Initially, the measurement was performed with a transflectance probe mounted in the bypass loop. However, the probe clogged within hours and was therefore replaced with the flow-through cell with an automated backflush system. This reduced the problems with clogging and fouling significantly, but at the same time reduced the potential information about suspended solids to a minimum. In order to carry out the investigation with minimal intrusion on the daily activities in the facility, it was also decided that any quantitative calibration would have to rely on reference data obtained from the measurements routinely carried out by the plant operators. As will be discussed, this simple experimental design posed some limitations when trying to create an accurate calibration model. It was therefore clear from the beginning that it would be a challenging task to establish a reliable calibration model for the intended application.

With a PCR model built on the 36 spectra corresponding to the reference measurements and using mean centering as spectral pre-processing, 89% of the spectral variance and only 0.04% of the COD variance were explained by the first PC. The corresponding numbers for the second PC were 11% and 3.7%. Thus, 2 PCs explained essentially all of the spectral variance and almost nothing of the COD variance. The situation was very much the same in the PLS space, with spectral/COD variance explained by the first PLS component at 46/2.7% and 54/1.5% by the second. In other words, the relationship between spectral and COD variance was very weak in this data. One reason for this was that the spectra from the end of the time series displayed extreme absorbance values, apparently due to fouling of the windows of the flow-through cell.

New PCR and PLS models were therefore built on the first 22 spectra corresponding to reference measurements. In this case, a second-order derivative (Savitzky-Golay) based on a 9-point third-order polynomial was applied to minimize baseline effects. This spectral pre-treatment was followed by autoscaling instead of mean centering in an effort to enhance minor variance in the spectra potentially relatable to the COD concentration. This pre-processing was also used in all later models. The two models are summarized in the table below.

According to

| | PCR | PLS |
|---|---|---|
| RMSEC [mg/l] | 176 | 122 |
| RMSECV [mg/l] | 276 | 280 |
| R^{2} (calibration) | 0.50 | 0.76 |
| R^{2} (cross validation) | 0.13 | 0.19 |

Cumulative variance [%] by component #

| Component # | PCR: Spectral | PCR: COD | PLS: Spectral | PLS: COD |
|---|---|---|---|---|
| 1 | 96.24 | 15.80 | 96.18 | 16.55 |
| 2 | 99.05 | 26.38 | 98.97 | 30.63 |
| 3 | 99.70 | 33.68 | 99.67 | 38.50 |
| 4 | 99.86 | 37.12 | 99.83 | 46.10 |
| 5 | 99.96 | 43.78 | 99.94 | 50.23 |
| 6 | 99.97 | 44.68 | 99.97 | 66.17 |
| 7 | 99.98 | 49.60 | 99.98 | 75.83 |
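The pre-processing used for these models (a second-derivative Savitzky-Golay filter with a 9-point window and third-order polynomial, followed by autoscaling) can be sketched in plain Python as follows. This is an illustration only, not the PLS Toolbox implementation the authors used. The function names, the `spacing` parameter and the choice to drop the four points at each end of a spectrum are assumptions. The filter weights follow from the least-squares fit for a symmetric 9-point window.

```python
import math

def savgol_second_derivative(spectrum, spacing=1.0):
    """Second derivative via a 9-point Savitzky-Golay filter (cubic fit).

    For a symmetric 9-point window, a least-squares quadratic or cubic fit
    gives the weights (3*k**2 - 20) / 462 for offsets k = -4..4.
    The four points at each end are left out (no edge handling).
    """
    weights = [(3 * k * k - 20) / 462.0 for k in range(-4, 5)]
    out = []
    for centre in range(4, len(spectrum) - 4):
        window = spectrum[centre - 4:centre + 5]
        out.append(sum(w * y for w, y in zip(weights, window)) / spacing ** 2)
    return out

def autoscale(rows):
    """Centre each column to zero mean and scale it to unit standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(columns, means)]
    # Constant columns (zero standard deviation) are mapped to 0.0.
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in rows]

# A parabola has a constant second derivative of 2, which the filter recovers.
parabola = [float(x * x) for x in range(12)]
second = savgol_second_derivative(parabola)

# Each row is one sample (spectrum), each column one variable (wavelength).
scaled = autoscale([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

Autoscaling gives every wavelength the same variance, so small spectral features weigh as much as large ones in the subsequent regression, which is the stated reason for preferring it to mean centering.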

[...] improvements compared with the PLS model described earlier were obtained, but the model was still only able to fit the regression data and showed no predictive capacity in cross validation.

Due to the need for additional reference data, "synthetic" reference values were assigned to all 2644 spectra by interpolating between the actual reference values. Linear interpolation was used because no information was available on the behavior of the COD concentration between the reference measurement points. A new PLS model based on all 2644 spectra was then regressed. For this model, an RMSEC of 75 mg/l, an RMSECV of 77 mg/l, and a coefficient of determination in cross validation of 0.86 were obtained using 10 PLS components. It should be noted that the performance of this model mainly reflects the interpolated reference values and should therefore be interpreted with some caution.
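The assignment of interpolated reference values can be illustrated with a short Python sketch. The function name, the time unit (minutes) and the COD values are hypothetical. Returning `None` for spectra outside the span of the reference measurements is an assumption, since the text does not say how such spectra were handled.

```python
from bisect import bisect_right

def interpolate_references(ref_times, ref_values, spectrum_times):
    """Linearly interpolate COD reference values at the spectrum time stamps.

    ref_times must be strictly increasing and hold at least two entries.
    Spectra recorded before the first or after the last reference
    measurement get None (no extrapolation).
    """
    if len(ref_times) < 2 or len(ref_times) != len(ref_values):
        raise ValueError("need at least two matching reference points")
    result = []
    for t in spectrum_times:
        if t < ref_times[0] or t > ref_times[-1]:
            result.append(None)
            continue
        # Index of the reference point to the right of t (clamped at the end).
        right = min(bisect_right(ref_times, t), len(ref_times) - 1)
        left = right - 1
        t0, t1 = ref_times[left], ref_times[right]
        y0, y1 = ref_values[left], ref_values[right]
        result.append(y0 + (y1 - y0) * (t - t0) / (t1 - t0))
    return result

# Made-up example: three reference measurements, spectra in between.
synthetic = interpolate_references(
    [0.0, 60.0, 180.0],            # reference times in minutes
    [1000.0, 1300.0, 1000.0],      # reference COD in mg/l
    [0.0, 10.0, 30.0, 60.0, 120.0, 180.0, 190.0],
)
```

Every spectrum between two laboratory measurements receives a value on the straight line connecting them, so any real variation between sampling times is invisible to the calibration.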

Since the use of interpolated reference values resulted in an abundance of data for regression and validation, the data set was simply split in half using the first 1322 spectra for model regression and the remaining spectra for model validation. This gave a model that in calibration and cross validation performed very similarly to the model described above. However, on the external validation data set the RMSEP was as high as 168 mg/l and the coefficient of determination only 0.36. Switching the regression set for the validation set and vice versa did not improve the validation performance. One reason for this could have been the gradual fouling of the windows of the flow through cell. However, what is evident from

This new split gave a model with an RMSEP of 144 mg/l and a coefficient of determination of 0.43. Since these performance parameters were based on the interpolated reference values, no further attempt was made to optimize them. Instead, the focus was set on finding a model that could predict the high and low COD concentrations well, rather than being accurate around the average concentration. This was attempted by reweighting the information in the model regression data set. The data were split into 20 concentration intervals (MATLAB: hist). In this split, 5 intervals contained 125 spectra or more, and 6 intervals contained 12 or fewer. The data were then reweighted by capping the number of spectra in each interval at 20. This was done by generating a random sequence of indices to remove within each concentration interval (MATLAB: randperm). In this way, the number of spectra in the calibration data set was reduced from 1070 to 334.
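The reweighting step can be sketched in Python as below. The authors worked in MATLAB with hist and randperm; this standard-library version is only an illustration. The function name, the fixed random seed and the equal-width binning between the minimum and maximum value are assumptions.

```python
import random

def cap_per_interval(values, n_bins=20, cap=20, seed=0):
    """Return sorted indices kept after capping each concentration interval.

    The range of `values` is split into `n_bins` equal-width intervals and
    at most `cap` randomly chosen samples are kept in each interval.
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against identical values
    bins = [[] for _ in range(n_bins)]
    for index, value in enumerate(values):
        # The maximum value falls on the upper edge; put it in the last bin.
        b = min(int((value - low) / width), n_bins - 1)
        bins[b].append(index)
    rng = random.Random(seed)
    kept = []
    for members in bins:
        rng.shuffle(members)          # random order within the interval
        kept.extend(members[:cap])    # keep at most `cap` samples
    return sorted(kept)

# Made-up example: 100 samples crowded at low COD and a single high one.
cod = [1000.0 + i for i in range(100)] + [3000.0]
kept_indices = cap_per_interval(cod)
```

Sparsely populated intervals are left untouched while crowded ones are thinned, which flattens the concentration distribution seen by the regression.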

After reweighting the regression data, the data set was further refined by a stepwise removal of spectra with high absolute cross validation residuals (a model was regressed, high-residual samples were removed, a new model was regressed, and so on). This further reduced the calibration data to 274 spectra. For this model, an RMSEP of 182 mg/l and a coefficient of determination of 0.35 were obtained. Based on these parameters, the reduction of the calibration data set apparently deteriorated the model performance. However, these values were obtained for the interpolated reference values instead of real reference values, and the objective was to obtain a model that predicted changes rather than average concentrations. To evaluate how well this objective was met, the standard deviation of the predictions of the validation data was computed before and after the reduction of the calibration data set. The standard deviation of the predictions with the model regressed on the original regression data set was determined to be 176 mg/l, and the corresponding value for the reduced data set was 224 mg/l. Thus, according to this somewhat unconventional performance parameter, the reduction or refinement of the calibration data set resulted in an improved model.
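The stepwise refinement loop can be sketched as follows. To keep the example self-contained, a univariate straight-line fit with leave-one-out cross validation stands in for the PLS model. The residual threshold, the stopping rule and all function names are assumptions, as the text does not state which criteria were used.

```python
def fit_line(x, y):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def loo_residuals(x, y):
    """Leave-one-out cross validation residuals for the straight-line model."""
    residuals = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        residuals.append(y[i] - (slope * x[i] + intercept))
    return residuals

def refine(x, y, threshold, max_rounds=10):
    """Repeatedly drop samples whose |CV residual| exceeds `threshold`."""
    x, y = list(x), list(y)
    for _ in range(max_rounds):
        residuals = loo_residuals(x, y)
        keep = [i for i, r in enumerate(residuals) if abs(r) <= threshold]
        # Stop when nothing is removed or too few samples would remain.
        if len(keep) == len(x) or len(keep) < 3:
            break
        x = [x[i] for i in keep]
        y = [y[i] for i in keep]
    return x, y

# Made-up data on the line y = 2x + 1 with one gross outlier at x = 5.
x_all = [float(i) for i in range(10)]
y_all = [2.0 * v + 1.0 for v in x_all]
y_all[5] = 100.0
x_kept, y_kept = refine(x_all, y_all, threshold=50.0)
```

Because each round refits the model after removing samples, points that only looked bad because of an outlier's pull are retained once the outlier is gone.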

The predictions by this last model of the regression and validation data are shown in

that this filter reduced the noise level substantially, but at the same time introduced a phase shift. Whether or not this phase shift is of importance can be debated, but on the validation data, computed on spectra corresponding to actual reference measurements, an RMSEP of 201 mg/l and a coefficient of determination of 0.35 were obtained for the raw model predictions; the corresponding values after filtering were 149 mg/l and 0.65, respectively. Based on

The starting point for this attempt to create a quantitative model for the COD concentration in wastewater from a pulp and paper mill was basically a data set of 36 spectra and their corresponding reference measurements. On this data, essentially no relation between the COD concentration and the spectral features could be established, at least when only mean centering was used as spectral pre-processing. By using more advanced pre-processing options and removing the highest wavelengths, a relation could be modelled within the calibration data, but the cross validation results were still not very promising. However, when the amount of calibration data was increased by means of interpolated reference values, the cross validation results also started to look promising. The use of interpolated reference values in calibration, in combination with reweighting and refining the calibration data set, resulted in a model with very reasonable validation results. By further adding a filter to the predictions, a very appealing time series behavior was obtained. Unfortunately, much more frequent sampling would have been necessary to fully validate whether this behavior depicts the true changes in the COD concentration. Since this was an industrial installation and not a study performed in a laboratory, obtaining additional measurements is very difficult. On the other hand, the validation was still made against 14 reference measurements, which is not an alarmingly low number.

This work was made possible through the Mare Purum project, funded by the European territorial cooperation programme Botnia-Atlantica. The authors also want to thank Hans Thorén at Svenska Cellulosa Aktiebolaget SCA, Elias Sundvall, Öjvind Sundvall, and Thomas Storsjö at PulpEye AB, as well as Anders Jonsson and Robin Norman at ProcessIT Innovations for this fruitful cooperation.