Analysis of the Homogeneity of Wind Roses' Groups Employing Andrews’ Curves

The homogeneity of groups of 16-dimensional wind direction roses (obtained by hierarchical clustering in a previous report) is discussed through the application of Andrews’ Curves. Principal Component Analysis (PCA) is employed to reduce dimensionality and to provide an ordering of the variables used to compute Andrews’ Curves. Our results suggest that Andrews’ Curves greatly facilitate the visualization of homogeneity and reveal information that allows the arrangement of the clusters to be improved. A combined analysis employing Andrews’ Curves and the Calinski and Harabasz approach (a method for determining the optimal number of groups) helps to assess the strength of the group structure of the data and to detect anomalies such as misclassified objects or atypical values. Furthermore, it shows that the 24 original seasonal hourly roses (representing the “day”) are better represented by 6 groups (rather than by 5, as proposed in the previous report). The new group arrangement is consistent with the dendrogram for another cut-off distance. As a result, the wind occurrences are now represented by a more detailed and smoother pattern: there is a decrease in northern wind between midday and twilight, while eastern winds become more important towards the evening. The proposed methodology could be considered for inclusion in an automated system.


Introduction
In [1] wind roses at La Plata City and surroundings were studied employing hierarchical cluster analysis. This method allowed synthesizing information covering 1998-2003 as well as assisting the discussion of physical phenomena related to wind occurrences. Hierarchical clustering allowed us to reduce the number of representative wind roses characterizing the "day" for each season and monitoring site from 24 to 5.
The goal of the present work is to evaluate the homogeneity of the groups obtained with cluster analysis by employing Andrews' Curves [2], which are a type of graphical display used to present and explore data [3]. To this end, the observed data for summer at site "J" (one of the monitoring sites referred to in [1]) were taken as an example. This season was selected because it was the most variable one; site "J" was chosen because it had the most complete records (above 97% completeness).
Each hourly averaged wind rose is represented by a 16-dimensional vector (the 16 directions of the compass), which can be well represented by Andrews' Curves. These curves, often employed in visual data mining [4], allow multidimensional data to be represented in two (or three) dimensional plots. Their importance lies in the simplicity of the method, which is very suitable in those cases in which the dimensionality reduction applied to the original data still yields more than three dimensions (in these cases the classic plots become complex).
Although in this paper Andrews' Curves are employed to carry out a qualitative analysis of the homogeneity of the groups, it is important to point out that their validity [5] [6] rests on their mathematical properties [7], which are related to other methods [8] and therefore allow working in a less subjective manner. According to [9] these curves also help to visualize structures in high-dimensional data. Andrews showed that the distance between two given curves is proportional to the Euclidean distance between the corresponding points, i.e., close points in the multidimensional space will be observed as close Andrews' Curves in the plane. This characteristic provides affinity with the clustering method that is the starting point of this work.
Principal Components Analysis (PCA) is used as a method to reduce dimensionality: its outputs are employed as inputs for the computation of Andrews' Curves. Calinski and Harabasz's index is employed to support the findings suggested by Andrews' Curves.
In summary, this paper gathers four well-known approaches (hierarchical clustering, PCA, Andrews' Curves and Calinski and Harabasz's index) that are independent but that together supply a powerful tool for gaining insight into data characteristics. The results discussed in the present paper were obtained through the interaction between a human user and the outcomes of different software packages (e.g., Excel, Statistica, Matlab, etc.), but the authors consider that the whole methodology could become part of an automated system.

Each p-dimensional observation $z = (z_1, z_2, \ldots, z_p)$ defines a periodic function given by:
\[ f(t) = \frac{z_1}{\sqrt{2}} + z_2 \sin(t) + z_3 \cos(t) + z_4 \sin(2t) + z_5 \cos(2t) + \cdots \qquad (1) \]
called Andrews' curve, where t is defined in the range [−180, 180] in sexagesimal degrees. The number of terms in the equation is given by the number of dimensions of the data. Thus f(t) is a linear combination of orthonormal functions. Two consequences of this representation are that the curve of the mean of the observations equals the mean of the corresponding Andrews' Curves, and that the squared Euclidean distance between observations is preserved (up to a constant factor) by the corresponding Andrews' Curves. These properties allow working with a large number of variables. For these reasons, the plots of f(t) in the mentioned range are very useful to detect group configurations of multivariate vectors. Given a data set where all the curves can be grouped (in two or more groups) showing different patterns, the curves help to find the group structure in the data set. If the curves overlap so much that groups cannot be distinguished, the data set may be considered to have no well-formed groups.
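The definition above can be illustrated with a minimal sketch. It assumes the usual form of Andrews' series (first term z_1/√2, followed by alternating sine and cosine terms of increasing frequency); the function names `andrews_curve` and `andrews_samples` and the two example vectors are hypothetical and are not taken from the study.

```python
import math


def andrews_curve(z, t_deg):
    """Evaluate Andrews' curve f(t) for the vector z at angle t (in degrees)."""
    t = math.radians(t_deg)
    value = z[0] / math.sqrt(2.0)            # constant first term
    for i, coef in enumerate(z[1:], start=1):
        harmonic = (i + 1) // 2              # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += coef * math.sin(harmonic * t)   # z_2, z_4, ...
        else:
            value += coef * math.cos(harmonic * t)   # z_3, z_5, ...
    return value


def andrews_samples(z, step_deg=5):
    """Sample f(t) on [-180, 180] degrees with the given step."""
    return [andrews_curve(z, t) for t in range(-180, 181, step_deg)]


# Hypothetical 5-component vectors (e.g., scores on five principal components).
a = [1.0, 0.5, -0.2, 0.1, 0.0]
b = [3.0, -0.5, 0.4, 0.3, 0.2]
mean_vector = [(x + y) / 2 for x, y in zip(a, b)]

# The curve of the mean vector coincides with the pointwise mean of the curves.
curve_of_mean = andrews_samples(mean_vector)
mean_of_curves = [(u + v) / 2 for u, v in zip(andrews_samples(a), andrews_samples(b))]
```

Because f(t) is linear in z, the last two lists agree up to rounding, which is the mean-preserving property mentioned in the text.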
As can be seen from Equation (1), f(t) depends on the order of the variables: the first coordinates in the equation emphasize low frequencies that tend to dominate the visual plot [10]. Nevertheless, this fact does not influence the application of the curves to detect group structure or atypical values [11], because any chosen order will allow relative differences among curves to be detected (the inherent information is the same). Gnanadesikan [12] points out that when it is not possible to assign different importance to the variables in Equation (1), one may analyze the results of some permutations of them and, in this way, get a deeper insight into the nature of the data under study.
In the present study the order of the variables followed the "natural order" given by that of the principal components. This "implicit" order given by the application of PCA provides a solution to the variable order assignment [13] and gives a criterion for future comparisons among different data sets. Besides providing an order for the terms of Equation (1), the PCA method gives a sound approach to reduce the dimensionality [14] from the original 16 dimensions to a lower number while retaining a high proportion of the total variance.
It has been pointed out [11] [12] [15] that two possible drawbacks of Andrews' Curves are their computing time and the cluttering effect in the plots. In the present case neither is relevant: PCA reduces the dimensionality from 16 to 5, which simplifies the computations, while the size of the data set (24 vectors represented by 24 Andrews' Curves) makes it easy to manage from the visual point of view.
The current use of Andrews' Curves is reflected by the different degrees of sophistication of the software involved [9] [13] [16]-[18], which ranges from simple graphing to interactive tools and animation models; most of them are devoted to visualization and, to a lesser extent, to exploration. Like other visual data mining approaches [19], Andrews' Curves draw on the human visual perception system as part of the data processing task.
Calinski and Harabasz's index [20] allows determining the optimal number of groups in a given set of multivariate data. In this work it is mainly employed to corroborate findings coming from the application of hierarchical clustering and Andrews' Curves. This index is defined as
\[ CH(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)} \]
where B(k) indicates the degree of dispersion that exists between the groups formed in the agglomeration process to get k groups (i.e., the between-groups sum of squares). B(k) is computed as the total sum of the squared distances between the centroid of a group and the centroid of the original data (general centroid). W(k) indicates the degree of dispersion that exists within a group (i.e., the within-group sum of squares). W(k) is computed as the total sum of the squared distances between each individual datum and the centroid of its group, for all the groups. A plot of W(k) versus k is traditionally employed to show the degree to which additional groups give more "compact" groups. Here k indicates a particular number of groups obtained from the original data set and n is the total number of single p-dimensional vectors. CH(k) is defined for k > 1. When there is a strong group structure, CH(k) gives a unique maximum. When this is not the case (e.g., local extremes indicate there is a moderate group structure), the authors of the index recommend adopting the first local maximum. If CH(k) increases as k increases, the approach predicts there is no hierarchical structure. This index was chosen due to its simplicity and high performance characteristics, as demonstrated by [21] and [22].
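A minimal sketch of the index follows. It assumes the usual reading of the definitions above, in which the between-groups term counts the squared distance from each group centroid to the general centroid once per member of the group; the function names and the toy data are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(groups):
    """CH(k) for a partition given as a list of groups of vectors."""
    all_points = [p for g in groups for p in g]
    n, k = len(all_points), len(groups)
    if k < 2 or n <= k:
        raise ValueError("CH(k) requires 1 < k < n")
    general = centroid(all_points)
    # B(k): group centroid vs. general centroid, weighted by group size (assumption).
    between = sum(len(g) * sq_dist(centroid(g), general) for g in groups)
    # W(k): each vector vs. the centroid of its own group.
    within = sum(sq_dist(p, centroid(g)) for g in groups for p in g)
    if within == 0:
        return float("inf")   # perfectly compact groups
    return (between / (k - 1)) / (within / (n - k))


# Toy partition: two well-separated pairs of 2-dimensional points.
toy_groups = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
toy_ch = calinski_harabasz(toy_groups)
```

In practice the function would be evaluated for each partition obtained by cutting the dendrogram at successive levels, and the resulting values compared across k.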

Generalities
Figure 1 shows the dendrogram of the hourly wind roses for summer at site "J". The "Y" axis refers to the hourly averaged 16-direction wind roses covering 1998-2003, e.g., "Hour 0" covers observations during 00:00-00:59 h (Local Time). The "X" axis corresponds to the squared Euclidean distance expressed in percent. The linkage criterion for the clustering process was the mean squared Euclidean distance between groups. The dashed vertical line (located around 50% on the X scale) indicates the five groups provided by the cluster analysis according to [1]. These five groups of representative wind roses are shown in Figure 2. The wind rose named Group 1 (the numbering is arbitrary) is obtained by averaging the wind roses corresponding to Hour 0, Hour 1, Hour 2 and Hour 3; the same procedure was applied to obtain the rest of the groups.
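The linkage criterion described above (mean squared Euclidean distance between groups) can be sketched as a naive agglomerative procedure. This is only an illustration under that assumption and not the software used in [1]; the function names and the one-dimensional toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(points, n_groups):
    """Merge clusters by average linkage until n_groups remain.

    Returns the clusters (lists of point indices) and the merge distances.
    """
    clusters = [[i] for i in range(len(points))]
    merge_distances = []
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean of all pairwise squared distances between the two clusters.
                total = sum(sq_dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        merge_distances.append(d)
    return clusters, merge_distances


# Toy data: five 1-dimensional "roses" forming two tight pairs and one isolated point.
toy_points = [[0.0], [1.0], [10.0], [11.0], [50.0]]
three_groups, _ = agglomerate(toy_points, 3)
two_groups, distances = agglomerate(toy_points, 2)
```

Stopping at a given number of groups plays the role of the cut-off distance in the dendrogram: a smaller cut-off distance corresponds to a larger `n_groups`.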
Following [23] it is pertinent to explore the number of significant dimensions of the data (16-dimensional wind roses) in order to simplify the computation of Andrews' Curves. To this end, and considering the advantages described in Section 2, PCA was applied to the hourly wind direction roses of Figure 1. The analysis was carried out with the Statistica software package, Version 8.0, employing the covariance matrix of the variables as reference. For illustrative purposes, Figure 3 shows the configuration of points obtained with the first two principal components (which explain 86.70% of the total variance). This plot suggests the existence of a group structure and the absence of atypical values, although it is not conclusive.
Figure 4 shows the scree plot corresponding to all principal components. It can easily be seen that after the first five or six eigenvalues the curve becomes flat. This implies that with a few components (4 or 5 of the new variables) it is possible to represent the original 16-dimensional objects (wind roses). As can be seen from Table 1, the first five principal components together explain more than 95% of the total variation. So, Andrews' Curves can be built using only these five components (instead of the original 16 variables).
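The selection of the number of components from the accumulated variance can be sketched as follows. The eigenvalues used here are hypothetical placeholders chosen for easy arithmetic, not the values of Table 1, and the function names are likewise assumptions.

```python
def cumulative_variance_percent(eigenvalues):
    """Accumulated share of the total variance, eigenvalues in decreasing order."""
    total = sum(eigenvalues)
    running, shares = 0.0, []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        shares.append(100.0 * running / total)
    return shares


def components_needed(eigenvalues, threshold_percent=95.0):
    """Smallest number of leading components reaching the variance threshold."""
    shares = cumulative_variance_percent(eigenvalues)
    for count, share in enumerate(shares, start=1):
        if share >= threshold_percent:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues of a covariance matrix (NOT the values of Table 1).
example_eigenvalues = [50, 30, 10, 6, 2, 1, 1]
example_shares = cumulative_variance_percent(example_eigenvalues)
example_count = components_needed(example_eigenvalues, 95.0)
```

With the eigenvalues of the actual covariance matrix, the same rule would correspond to reading Table 1 until the accumulated variance first exceeds the chosen threshold.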
Figure 5 shows Andrews' Curves obtained for each of the members of the groups defined by the dendrogram (Figure 1), together with the group averages.

Visual Inspection of the Andrews' Curves
An overall view of Figure 5(a1) to (e1) shows that there is good homogeneity in each of the groups, i.e., the individual curves tend to stay close to each other and have a similar shape. In almost all the groups the curves that belong to the extremes of the interval (e.g., Hour 0 and Hour 3 in Group 1, Hour 13 and Hour 19 in Group 4, etc.) are the most different ones (considering distance and/or shape). The occurrences of peaks and valleys (which give identity to each group) differ from group to group: Figure 5(a2) to (e2) (solid lines) shows this on average. In the same figure the dotted lines indicate the general average for each group, which corresponds to the first term in Andrews' series (influenced by the first principal component). For Group 1 this average is -2.8, for Group 2 it is 5.4, for Group 3 it is 6.6, for Group 4 it is 0.1 and for Group 5 it is -11. This means that, in some cases, it is possible to distinguish important differences among groups (e.g., between Group 3 and Group 4 or between Group 4 and Group 5) with the first principal component alone. Some of the groups show few oscillations (e.g., Group 1), reflecting more influence of sin(t) and cos(t) (which correspond to the second and third principal components respectively) than of sin(2t) and cos(2t) (which correspond to the fourth and fifth principal components respectively). The contrary occurs with Group 5, where the functions sin(2t) and cos(2t), associated with more oscillations, are easy to notice.
In summary, on the one hand it is pertinent to regroup the members of Group 4 into two "new" groups: Hour 13-Hour 17 and Hour 18-Hour 19. This is in accordance with the dendrogram (Figure 1) if the cut-off distance is moved towards 40%. So, the 24 original wind roses will be grouped into 6 groups (rather than into 5). On the other hand, the presence of a potential outlier is rejected.

Calinski and Harabasz Index
Calinski and Harabasz's index (see Section 2) was computed for nine possible ways of grouping the 24 original data following the dendrogram of Figure 1. Table 2 shows the CH(k) index computed up to nine groups. The local maximum is the reference for considering that the optimal number of groups is six. To better visualize this result, Figure 6 shows the values of W(k) for different numbers of groups. At the beginning the curve shows a steep descent, indicating the segregation between groups for small values of k (between 1 and 3). Then the slope becomes gentler (when k is between 3 and 5) and for k = 6 it flattens. The value k = 6 can be considered critical because higher values will not indicate the presence of real groups. So, it can be concluded that the Calinski and Harabasz approach reinforces the findings of the previous sections, determining that six averaged wind roses is the best number to represent the original 24.
As stated in Section 3.2, Andrews' Curves reveal information regarding the group structure of the data. In some cases wind roses that are neighbors but belong to different groups may look similar. So it can be concluded, from a general point of view, that the original data have a moderate group structure. This is in accordance with the values of CH(k), which contain a local maximum (Section 2). An illustration of this moderate structure can be appreciated in Figure 3, which considers only the first two principal components (for example, compare the distance between Hour 19 and Hour 18 to that between Hour 19 and Hour 20).
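The rule of adopting the first local maximum of CH(k) can be written as a short helper. The function name and the CH values below are hypothetical placeholders for illustration only; they are not the values of Table 2.

```python
def first_local_maximum(ch_by_k):
    """Return the smallest k whose CH value exceeds both of its neighbours.

    ch_by_k maps the number of groups k to CH(k). Returns None when no
    interior local maximum exists (e.g., CH(k) grows steadily with k).
    """
    ks = sorted(ch_by_k)
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if ch_by_k[k] > ch_by_k[prev_k] and ch_by_k[k] > ch_by_k[next_k]:
            return k
    return None


# Hypothetical CH values (NOT those of Table 2).
hypothetical_ch = {2: 10.0, 3: 12.0, 4: 11.0, 5: 13.0, 6: 9.0}
chosen_k = first_local_maximum(hypothetical_ch)

# A steadily increasing index has no local maximum.
no_structure = first_local_maximum({2: 1.0, 3: 2.0, 4: 3.0})
```

A return value of `None` corresponds to the situation in which CH(k) only increases with k, which the approach interprets as the absence of hierarchical structure.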

Meteorological Implications
Figure 7 shows the new groups, namely Group 4* and Group 5*, which modify Figure 2 as a result of the findings of the previous sections. Group 5 in Figure 2 is now called Group 6. An advantage of Figure 7 (compared to Figure 2) is that it shows a smoother change of the prevailing winds from midday to twilight. Group 5* reveals the decrease in northern wind and the importance of eastern winds towards the evening (this effect was not captured by Group 4 in Figure 2).

Conclusions
Andrews' Curves have been used to gain insight into the characteristics of wind rose groups obtained in a previous report with a hierarchical clustering method. PCA was employed to reduce dimensionality and, hence, to simplify computations. Five principal components (explaining more than 95% of the variance) were employed instead of the 16 original variables (wind directions); this method was also helpful in allocating the order of the variables in the equation of Andrews' Curves.
Andrews' Curves allowed multidimensional vectors (wind roses) to be visualized in two dimensions in a very simple and tangible manner. The homogeneity of the groups derived from the dendrogram was therefore visually inspected. These groups showed, in general, a high degree of homogeneity; detected anomalies such as the presence of potential outliers and new subgroups were further discussed. As a result, the presence of outliers was discarded and a new configuration of the groups was defined. This finding was supported by Calinski and Harabasz's index, which gave 6 as the optimal number of groups. The combined analysis (Andrews' Curves and the Calinski and Harabasz approach) showed the strength of the group structure, indicating that the original data had a moderate structure. The consequences of these results for the description of wind occurrences were outlined: there was a decrease in northern wind between midday and twilight, while eastern winds became more important towards the evening (this was not evident in the previous report).
The integration of the approaches employed (hierarchical clustering, principal component analysis, Andrews' Curves and Calinski and Harabasz's index) can be viewed as a guideline to be followed when homogeneous groups must be found in high-dimensional data. Furthermore, this guideline could become part of an automated processing system, which would be very helpful when large data sets (in our case more seasons and monitoring sites) need to be assessed and compared.

Figure 1. Dendrogram for the 16-direction hourly wind roses observed in summer 1998-2003 at site "J" at La Plata City.

Figure 2. Wind direction roses representing "the day" as a result of the dendrogram (Figure 1) for a cut distance around 50%.

Figure 3. The blue points show the wind roses of the dendrogram (Figure 1) expressed by the first two principal components. The enclosing dashed lines indicate the groups given by the dendrogram for a cut distance around 50%. The enclosing solid line comprising Hour 18 and Hour 19 indicates a possible subgroup (Section 3.3.2 and Section 3.3.3). Neither kind of line reflects the shape of the groups; they have been drawn just to illustrate the idea of group structure. X axis values divided by 2 constitute the first term of Equation (1) and the constant value for each of the curves of Figure 5(a1) to (e1).

Figure 4. Scree plot.It helps to determine the optimal number of eigenvalues to retain.
In Group 4 (Figure 5(d1)) the curves corresponding to Hour 18 and Hour 19 are somewhat different from the rest; moreover, they do not look very similar to the curves of the adjacent groups (i.e., Group 3 and Group 5). Comparing neighbors (i.e., Hour 17 with Hour 18 and Hour 19 with Hour 20), there are differences, but they do not seem very strong. The general average was computed considering that there may be two subgroups in Group 4. The values obtained were 2.3 for Hour 13-Hour 17 and -5.3 for Hour 18-Hour 19, which implies a relevant difference between the subgroups. In Group 5 (Figure 5(e1)) the curve for Hour 23 shows a different pattern from the rest. Comparing Hour 23 with Hour 0 (nearest neighbor in Group 1) and with Hour 22 (nearest neighbor within the group), it is not possible to conclude that Hour 23 is a misclassified member. Although it is somewhat atypical, Hour 23 cannot be considered an outlier.
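As a rough consistency check (not part of the original analysis), the size-weighted mean of the two subgroup averages should reproduce the general average of 0.1 reported for Group 4, assuming that Group 4 spans Hour 13 to Hour 19, i.e., five and two hourly roses respectively:

```latex
\[
\frac{5 \times 2.3 + 2 \times (-5.3)}{7} = \frac{11.5 - 10.6}{7} \approx 0.13
\]
```

This agrees with the reported value to the precision given.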

Figure 6. Extinction diagram for the within-group sum of squares. Note that W(k) is defined for k = 1.

Figure 7. New groups of averaged wind roses in accordance with Figure 1 for a cut distance around 40%. Group 4 in Figure 2 was converted into Group 4* and Group 5*. The asterisk (*) indicates a new group.

Table 1. Accumulated variance according to the eigenvalue order.