
Applications of the multivariate technique called correspondence analysis to environmental studies are relatively new and have been limited to spatial multivariate data sets. In this paper, a procedure for applying correspondence analysis to a large space-time data set for multiple environmental variables is shown. In particular, nitrogen dioxide and carbon monoxide hourly concentrations measured during January 1999 at several monitoring stations in a district of Northern Italy are analyzed. The procedure consists of transforming the continuous variables into categorical ones by means of appropriate indicator variables, generating special contingency tables and applying correspondence analysis. The use of this classical multivariate technique allows the identification of important relationships among pollution levels and monitoring stations and/or relationships among pollution levels and observation times.

Usually environmental monitoring networks collect a huge amount of data such as pollutant concentrations, atmospheric variates, weather conditions, and so on, which are of particular interest for public policies oriented to environmental and human health protection.

Such data sets may have the following features:

• they are multivariate, as several variables are simultaneously measured;

• they present a spatio-temporal structure, since the measurements are taken at several points of the study area and over a certain period of time.

Classical multivariate techniques are useful tools for analyzing multiple variables. Their main goal is to obtain a summary description of the data: Principal Component Analysis (PCA) finds a smaller number of variates representing all those collected, without loss of essential information; Correspondence Analysis (CA) studies the association between two or more categorical variables by representing the categories of the variables as points in a low-dimensional space; Canonical Correlation Analysis (CCA) describes the relationships between two groups of several variables. Classical multivariate techniques can also be applied to space-time data sets in order to summarize the spatial and temporal profiles which characterize the information, finding relationships among the data. In De Iaco et al. [

Hence, when multiple variables are measured at several locations of the study area over a period of time, that is, when a space-time multivariate data set is available, and the aim is to study the simultaneous behaviour of the variables in order to understand the relationships among the space-time observations, a multivariate technique is the most useful tool. CA is one of the multivariate techniques with a wide range of applications in fields such as social and political sciences, marketing research, economics, ecology and biology. This technique is usually applied as an exploratory method, with the aim of describing the structure of the data under study with minimal constraints on the form of that structure [

In this paper, it will be shown that CA can also be applied to a space-time multivariate data set, yielding important results which other techniques may not highlight. In particular, CA will be applied to an air pollution data set involving two contaminants measured at monitoring stations in northern Italy during January 1999. The analysis will identify relationships in space among pollution levels and monitoring stations and relationships in time among pollution levels and observation times.

After a presentation of CA (Section 2) and a review of its theory (Section 2.1), the description of computational aspects follows (Section 2.2). Then, the data set (Section 3) and the most important results from the applied CA and their interpretation are given (Section 4).

CA is an algebraic technique analogous to PCA, but, while PCA is used for tables of continuous measurements, CA is more appropriate for categorical variates. Hence, CA is suitable for analyzing qualitative information represented by a contingency table. Lebart et al. [

For a long time CA has been applied by the European statistical community to psychometric and economic studies. This technique has been very popular in France, mainly owing to the efforts of Jean-Paul Benzécri [

All CA applications for environmental studies are limited to spatial multivariate data sets, where observations for several variables are spatially located [

The theory of CA is discussed in several books, [

From an initial data matrix

Let

Two different matrices are used to re-scale

and

where:

and

CA consists in finding a vector u, in a p-dimensional space, which maximizes

subject to the constraint:

It is known that this is equivalent to finding the vector v, in an l-dimensional space, which maximizes

subject to the constraint:

The eigenvectors

where

This duality formula permits displaying the row and column projections in the same graph (called a biplot), and this feature of CA has been considered an advantage over other multivariate techniques.

Sequentially, the method searches for new solutions orthogonal to the previous ones; in particular, orthogonality is considered with respect to the inner product defined by the weighting matrices (2) and (3). There will be

The factors

and

define the plane where rows and columns of the data matrix are projected.
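The computation outlined above can be sketched numerically. The following is a minimal illustration using the standard textbook formulation of CA as a singular value decomposition of standardized residuals, which is not necessarily the notation or the algorithm used in this paper; the function name `correspondence_analysis` and the toy contingency table are hypothetical, and the table is assumed to have no empty rows or columns.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple CA of a two-way contingency table via SVD (textbook formulation)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # relative frequencies
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals with respect to the independence model r c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                     # drop numerically null dimensions
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    eigenvalues = sv ** 2                 # variation carried by each factor
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # row projections
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # column projections
    return eigenvalues, row_coords, col_coords, r, c

# Hypothetical 3 x 3 contingency table, for illustration only
table = np.array([[20.0, 10.0, 5.0],
                  [10.0, 15.0, 10.0],
                  [5.0, 10.0, 25.0]])
eig, F, G, r, c = correspondence_analysis(table)

# The first two columns of F and G give the coordinates on the first factorial plane
plane_rows, plane_cols = F[:, :2], G[:, :2]
```

In this formulation the row coordinates can be recovered from the column coordinates (and vice versa) by averaging over row profiles and rescaling by the square root of the eigenvalue, which is what allows both sets of points to be drawn on the same graph.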

Results from CA consist of graphical representations of the projections of rows and columns of the data matrix onto factorial planes, in order to find and understand underlying relationships [

• the percentage of explained variation, which is a measure of fit when a particular factor is retained, so that the cumulative percentage of explained variation

represents a global measure of fit when K factors,

contribution of a particular factor. Note that the terminology is similar to that used in PCA, but in CA the term variation does not refer to variance in the statistical sense; the cumulative percentage is an increasing function of K and is used to choose the number of factors to be kept;

• the absolute contributions of the h-th row

and

• the relative contributions of a retained factor with the h-th row

and

Note that the ACs serve primarily as guides to the interpretation of the dimension defined by the retained factors; whereas the RCs indicate how well a point is described by the retained factors. Usually, a large AC implies a large RC, but not conversely [
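The diagnostics listed above can be illustrated with a short sketch. It assumes the usual textbook definitions, which may differ in notation from the formulas of this paper: the absolute contribution (AC) of a point to a factor is its mass times its squared coordinate divided by the eigenvalue, and the relative contribution (RC) is the squared coordinate divided by the squared distance of the point from the origin. The function name `ca_diagnostics` and the toy table are hypothetical.

```python
import numpy as np

def ca_diagnostics(table):
    """Explained variation, ACs and RCs (in %) for a contingency table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)          # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                            # drop null dimensions
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    lam = sv ** 2                                # eigenvalues
    F = (U * sv) / np.sqrt(r)[:, None]           # row coordinates
    G = (Vt.T * sv) / np.sqrt(c)[:, None]        # column coordinates
    return {
        "explained": 100 * lam / lam.sum(),
        # AC: share of each factor's variation due to each point
        "ac_rows": 100 * r[:, None] * F ** 2 / lam,
        "ac_cols": 100 * c[:, None] * G ** 2 / lam,
        # RC: share of each point's variation described by each factor
        "rc_rows": 100 * F ** 2 / (F ** 2).sum(axis=1, keepdims=True),
        "rc_cols": 100 * G ** 2 / (G ** 2).sum(axis=1, keepdims=True),
    }

# Hypothetical 4 x 3 contingency table, for illustration only
toy = np.array([[30, 10, 5],
                [12, 25, 8],
                [6, 14, 30],
                [10, 10, 10]])
diag = ca_diagnostics(toy)
cumulative_explained = np.cumsum(diag["explained"])
# Cumulative RC of each row on the first K = 2 retained factors
cumulative_rc_rows = diag["rc_rows"][:, :2].sum(axis=1)
```

With these definitions the ACs of all points sum to 100% within each factor, while the RCs of a point sum to 100% across all factors, so a cumulative RC close to 100% on the retained factors indicates a well-represented point.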

The application of CA to a space-time data set for multiple environmental variables is based on special contingency matrices generated as follows.

Let

Let

Through the indicator transform, the belonging of

From the four-dimensional matrix (variable, station, time, class of values) obtained after the indicator transformation (18), the following two-dimensional matrices are generated.

• Matrix

In A, the

• Matrix

In B, the

The indicator transform allows the user to categorize continuous variables, synthesizing a large multivariate space-time data set. The above two dimensional matrices relate different classes of values (in the case study pollution levels) to locations (matrix A) or to observation times (matrix B), jointly for the variables (pollutants) under study. Thus, CA applied to each matrix, A and B, will allow describing relationships

• in space, among pollution levels and monitored stations,

• in time, among pollution levels and observation times,

simultaneously for the variables under study.
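The construction of the two contingency matrices can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for the case study: the array `z`, its dimensions, the random values and the convention that a value equal to a threshold falls in the lower class are all assumptions made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_var, n_sta, n_time, n_class = 2, 4, 48, 6

# Synthetic concentrations (hypothetical), indexed by (variable, station, time)
z = rng.gamma(shape=2.0, scale=1.5, size=(n_var, n_sta, n_time))

# Five thresholds per variable, taken here as empirical quantiles
probs = [0.17, 0.33, 0.50, 0.67, 0.83]
thresholds = np.quantile(z.reshape(n_var, -1), probs, axis=1).T  # (n_var, 5)

# Class index 0..5 of every observation (right-closed classes: an assumption)
cls = np.stack([np.digitize(z[v], thresholds[v], right=True)
                for v in range(n_var)])

# Indicator transform: four-dimensional 0/1 array
# indexed by (variable, station, time, class of values)
ind = (cls[..., None] == np.arange(n_class)).astype(int)

# Matrix A: rows are variable/station pairs, columns are classes;
# each entry counts the observation times falling in that class
A = ind.sum(axis=2).reshape(n_var * n_sta, n_class)

# Matrix B: rows are variable/time pairs, columns are classes;
# each entry counts the stations falling in that class
B = ind.sum(axis=1).reshape(n_var * n_time, n_class)
```

Each row of A then sums to the number of observation times and each row of B to the number of stations, so both matrices can be treated as contingency tables and passed to CA.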

CA results will also identify clusters of survey stations and intervals of time which need closer controls when the contaminants frequently exceed fixed thresholds.

The data set consists of concentration values of two pollutants over a particular period of time and at stations of the monitoring network in the Milan district, Lombardy (one of the northern Italian regions that suffer from a serious air pollution problem). The air quality monitoring network covers a wide area with about 190 stations where the main atmospheric contaminants, such as sulphur dioxide (SO_{2}), ozone (O_{3}), nitric oxide (NO), nitrogen dioxide (NO_{2}), carbon monoxide (CO), and meteorological variates, such as humidity, wind velocity, temperature, solar radiation, are continuously measured.

In the Milan district, air pollution is mainly caused by traffic and industrial activities. Two pollutants, which are primarily generated by human activities and are considered among the most dangerous for the atmosphere and human health, have been analyzed in this paper: NO_{2} and CO. Nitrogen dioxide is a secondary pollutant generated by thermal and photochemical reactions among the primary pollutants; it is caused, mainly in winter, by civil and industrial heating systems and by traffic. Therefore its concentration values are very high in urban areas characterized by high population density. Carbon monoxide is a primary pollutant caused by motor vehicle emissions and its values are very high in areas with heavy traffic and poor ventilation. These characteristics were taken into account in choosing the period of the year to be analyzed: January 1999. Indeed, most of the highest values for both pollutants under study were observed during the first month of the year. The box plot of the hourly averages for each pollutant, measured during January 1999 (

The national laws, particularly the Premier’s Decree of 12 November 1992, in line with European regulations, lay down, for each pollutant, a specific threshold called level of attention. When the pollution concentrations exceed this level for a long time and at several monitoring stations, air quality is poor and the situation is considered dangerous for public health.

The analysis is limited to stations in the Milan district where data for both contaminants are available at all the desired time points. In

• stations C, which are located in areas with heavy traffic and poor ventilation; in these areas the CO plume is more evident;

• stations B, which are located in areas with high density population, therefore these areas are subject to both NO_{2} and CO pollution.

In order to split each spatial-temporal distribution into non-overlapping classes of values, the following thresholds:

a) 1.6 2.3 3 3.9 5.4 mg/m^{3}

b) 52 64 75 90 115 mg/m^{3}

corresponding to the 0.17, 0.33, 0.50, 0.67, 0.83 quantiles of the distributions of CO a) and NO_{2} b) hourly averages, are considered. Hence, six classes of CO and NO_{2} concentrations are defined as follows:

Then, through the indicator transform, two-dimensional matrices are generated as described in Section 2.2, so that:

• A is a

• B is a

CA is applied to these matrices.
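The assignment of hourly concentrations to the six classes can be sketched with the thresholds listed above. The sample concentration values are hypothetical, and the convention that a value equal to a threshold belongs to the lower class is an assumption, since the class definitions are not reproduced here.

```python
import numpy as np

# Thresholds quoted in the text (0.17, 0.33, 0.50, 0.67, 0.83 quantiles)
co_thresholds = np.array([1.6, 2.3, 3.0, 3.9, 5.4])
no2_thresholds = np.array([52.0, 64.0, 75.0, 90.0, 115.0])
labels = np.array(["c1", "c2", "c3", "c4", "c5", "c6"])

def classify(values, thresholds):
    """Return the class label (c1..c6) of each concentration value."""
    # right=True: a value equal to a threshold goes to the lower class (assumption)
    idx = np.digitize(values, thresholds, right=True)
    return labels[idx]

# Hypothetical hourly averages, for illustration only
co_sample = np.array([0.9, 1.6, 2.0, 4.5, 7.2])
no2_sample = np.array([40.0, 70.0, 120.0])

co_classes = classify(co_sample, co_thresholds)     # c1, c1, c2, c5, c6
no2_classes = classify(no2_sample, no2_thresholds)  # c1, c3, c6
```

Because the thresholds are quantiles of the hourly averages, the six classes hold roughly equal shares of the observations for each pollutant, which keeps the column totals of the contingency matrices balanced.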

A French package software, SPAD [

Although it is commercial software, it is a powerful tool for data mining: it can perform many statistical analyses, such as factorial analysis, classification and segmentation, as well as textual analysis. Moreover, SPAD has good graphical tools and is user-friendly [

CA is applied to both matrix A and matrix B, since the information from both analyses is useful for the aim of the paper, as will be shown.

The results from CA are displayed in a series of tables and graphs. In particular,

On the other hand,

Factor | Eigenvalue | Variation Explained (%) |
---|---|---|
Factor 1 | 0.0975 | 59.90 |
Factor 2 | 0.0408 | 25.05 |
Factor 3 | 0.0161 | 9.89 |
Factor 4 | 0.0048 | 2.94 |
Factor 5 | 0.0036 | 2.22 |

Factor | Eigenvalue | Variation Explained (%) |
---|---|---|
Factor 1 | 0.0951 | 85.26 |
Factor 2 | 0.0115 | 10.29 |
Factor 3 | 0.0030 | 2.69 |
Factor 4 | 0.0013 | 1.21 |
Factor 5 | 0.0006 | 0.56 |

Factors | Stations/Pollutants | Classes of Values |
---|---|---|
Factor 1 | 86/CO (11) 15/CO (8) 41/NO_{2} (8) 93/CO (5) 45/CO (4) 125/NO_{2} (4) | c6 (39) c1 (29) |
Factor 2 | 97/NO_{2} (27) 62/CO (9) 97/CO (7) 111/CO (6) | c1 (50) c3 (21) |

Factors | Stations/Pollutants | Classes of Values |
---|---|---|
Factor 1 | 41/NO_{2} (95) 101/NO_{2} (95) 86/CO (93) 103/NO_{2} (91) 125/NO_{2} (89) 42/NO_{2} (87) | c6 (84) c5 (76) |
Factor 2 | 85/NO_{2} (92) 107/CO (89) 97/CO (86) | c3 (65) c1 (41) |

The position of the points and the absolute and relative contributions suggest the following comments.

CA applied to the matrix A.

Factors | Hours/Pollutants | Classes of Values |
---|---|---|
Factor 1 | 6/NO_{2} (7) 5/NO_{2} (6) 6/CO (6) 5/CO (6) 4/NO_{2} (5) 19/CO (5) 20/CO (5) 4/CO (4) | c6 (42) c1 (32) |
Factor 2 | 12/CO (9) 13/CO (7) 13/NO_{2} (7) 12/NO_{2} (4) | c6 (34) c1 (25) |

Factors | Hours/Pollutants | Classes of Values |
---|---|---|
Factor 1 | 6/NO_{2} (97) 6/CO (97) 4/CO (96) 5/CO (95) 5/NO_{2} (92) 19/CO (89) 20/CO (88) 4/NO_{2} (86) | c2 (92) c6 (91) |
Factor 2 | 13/CO (83) 12/CO (82) | c3 (44) c4 (43) |

As previously described, matrix A relates six non-overlapping classes of values to CO and NO_{2} survey stations, so that, by analyzing this matrix, it is possible to find underlying relationships in space among different pollution levels and monitored locations. The first two factors are retained since together they explain about 85% of the total variation (
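The choice of the number of retained factors can be checked directly from the eigenvalues. The short sketch below uses the five eigenvalues of the first eigenvalue table, on the assumption that this is the table referring to matrix A (its first two percentages add up to about 85%) and that the five listed factors are all the non-trivial ones, as expected with six classes of values. Because the tabulated eigenvalues are rounded, the recomputed percentages agree with the tabulated ones only to within a few hundredths.

```python
import numpy as np

# Eigenvalues as tabulated (rounded to four decimals)
eigenvalues = np.array([0.0975, 0.0408, 0.0161, 0.0048, 0.0036])

# Percentage of variation explained by each factor
explained = 100 * eigenvalues / eigenvalues.sum()

# Cumulative percentage: global measure of fit when K factors are retained
cumulative = np.cumsum(explained)

for k, (e, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"Factor {k}: {e:5.2f}%  cumulative {cum:6.2f}%")
```

The cumulative percentage for K = 2 is just under 85%, while the third factor would add less than 10%, which is consistent with retaining the first factorial plane only.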

The last class of values (c6) and the first one (c1) have the highest absolute contributions to the first factor, 39% and 29% respectively ([…] 5.4 mg/m^{3} for CO and 115 mg/m^{3} for NO_{2}), while the second factor better explains the variation of low pollution levels, i.e. levels smaller than 1.6 mg/m^{3} for CO and 52 mg/m^{3} for NO_{2}.

The cumulative relative contributions to the first two factors (

The projection of the classes to the first factorial plane (

In

1) points 111, 11, 45, 81, referred to CO, and point 41, referred to NO_{2}, with positive first and second co-ordinate;

2) points 86, 15, 93, 113, 101, referred to CO, and points 102, 15, referred to NO_{2}, with negative first co-ordinate.

The position of the second cluster on the factorial plane, whose points are closer to point c6 than the other points are, highlights that most of the highest pollutant concentrations during January 1999 were recorded at those locations.

CA applied to the matrix B.

Matrix B summarizes the spatial aspect for each hour, since in this matrix each entry indicates how many monitoring stations, at a fixed hour, have recorded pollution levels belonging to a given class of values. Hence, by analyzing this matrix, underlying relationships among observation times (hours) and different pollution levels can be identified.

[…] NO_{2} low readings were measured from the 4th to the 6th hour. Instead, most of the high pollution concentrations were observed during the evening, particularly from the 19th to the 22nd hour, for CO, and from the 12th to the 14th hour for NO_{2}.

In this work, an application of CA to an air pollution space-time data set for CO and NO_{2} hourly concentrations, recorded at some monitoring stations in the Milan district, is given. The transformation of the original continuous variables into new categorical ones has been formally presented by means of the indicator approach. By counting the indicator data over both spatial locations and observation times, two contingency matrices are generated. Each of them carries information on both pollutants examined in this paper. CA is applied to these matrices, providing a summary description of spatial and temporal profiles simultaneously for the contaminants under study. The data analysis allows identifying relationships in space among CO and NO_{2} pollution levels and monitored stations and relationships in time among CO and NO_{2} pollution levels and observation times. The aim of each air quality control system is to obtain information about atmospheric conditions and to evaluate the need for major restrictions and closer controls. The application of CA carried out in this paper makes this possible, since its graphical results and diagnostics help in identifying the stations inside the area under study and the intervals of time during the day for which the contaminants of interest need closer controls because fixed pollution levels are jointly exceeded.

The author would like to thank Prof. Donato Posa of University of Salento, Apulian region (Italy), whose suggestions have been helpful and improved this paper.

Palma, M. (2015) Correspondence Analysis on a Space-Time Data Set for Multiple Environmental Variables. International Journal of Geosciences, 6, 1154-1165. doi: 10.4236/ijg.2015.610090