Using optimized distributional parameters as inputs in a sequential unsupervised and supervised modelling of sunspots data

Detecting naturally arising structures in data is central to knowledge extraction from data. In most applications, the main challenge lies in the choice of the appropriate model for exploring the data features. Quite often, the choice is poorly understood and any tentative choice may be too restrictive. Growing volumes of data, disparate data sources and modelling techniques entail the need for model optimisation via adaptability rather than comparability. We propose a novel two-stage algorithm for modelling continuous data, consisting of an unsupervised stage, whereby the algorithm searches through the data for optimal parameter values, and a supervised stage that adapts the parameters for predictive modelling. The method is implemented on the sunspots data, which have inherently Gaussian distributional properties and assumed bi-modality. Optimal values separating high from low cycles are obtained via multiple simulations. Early patterns for each recorded cycle reveal that the first 3 years provide a sufficient basis for predicting the peak. Multiple Support Vector Machine runs using repeatedly improved data parameters show that the approach yields greater accuracy and reliability than conventional approaches and provides a good basis for model selection. Model reliability is established via multiple simulations of this type.


Introduction
Many real-life problems are tackled via knowledge extraction from data, a process typically associated with detecting naturally arising structures in the data. A typical example is the sunspots dataset [11], an average oscillating sequence of the beginning and ending periods of solar cycles with an approximate periodicity of 11 years [7]. Recorded sunspots span from the first cycle (March 1755 to June 1766) to the first few months of the current (24th) cycle. Clustered in non-random positions above and below the equator, the spots are generated by interactions between the sun's surface plasma and its magnetic field [19 and 22]. Solar magnetic activity cycles have attracted the attention of scientists for many years. Solar flares, for instance, affect our planet in different ways, including ejecting plasma and energetic particles and potentially causing geomagnetic storms and damaging satellites [16]. The paper is motivated by the documented effects of sunspots on terrestrial conditions. Correlations between space and terrestrial weather have been indicated in solar studies dating back many years [13, 18 and 20]. Climatic variations in Lapland via complex variations in the atmosphere, lunar gravitation and solar activity have also been explained [11]. This paper subjects sunspots data to a sequential analysis involving unsupervised and supervised modelling. The two concepts represent the typical data mining problems: data clustering and classification. The primary goal of the former is to partition a given dataset with a known or unknown distribution into subgroups in such a way that data points in each group are as homogeneous as possible while those in different groups are as heterogeneous as possible. The method is typically applied in problems in which there is no clear mathematical formulation for describing the underlying structures. Various approaches to data clustering have been studied and are well-documented in the literature [21, 17 and 6].
However, determining the number of naturally arising structures in data remains a daunting challenge for the data science community. Many clustering tools in the literature are based on the conventional mechanics of minimising the distances between data points, a feature which inherently constitutes the same challenge the methods are designed to address, that is, determining the optimal number of clusters. The primary goal of the latter, classification, is to allocate new cases to known classes, and one of its main challenges is balancing model accuracy and reliability.
Let a dataset of independent, identically distributed random vectors X_1, ..., X_n in R^d represent features of an underlying density function. The main features of interest may include modes (local maxima), anti-modes (local minima) and bumps, regions where the second derivative is negative. In an exploratory setting, the number and locations of these features are not known a priori. Many real-life data take this form and, with large volumes of data generated from different sources and fed into different models, we are constantly faced with the challenge of determining optimal stationary points. The challenge is to address model complexity via adaptability rather than comparability. In other words, we seek to minimise inherent randomness in training and test data via novel adaptive methods of data analysis [10 and 1].
This paper proposes a novel approach to detecting naturally arising structures in data that searches for generalising parameter levels and adapts them to supervised modelling. Its main research problem is to develop an algorithm for predicting future cycles given historical solar activity data. We address this problem via the following objectives.
1) To determine naturally arising structures in the data. For simplicity, we shall be seeking to identify and separate high from low solar activity cycles. This objective constitutes the unsupervised stage of the algorithm.
2) To predict future cycles based on information in previous cycles. This is the supervised stage.
3) To search for an optimal solution based on repeated simulations at the unsupervised and supervised stages.
The paper is organised as follows. Section 1 provides the introduction, followed by the methods in Section 2. Data analyses and discussions are in Section 3, and concluding remarks and potential new directions in Section 4.

Methods
Choosing a parametric form of the density to explore features is generally poorly understood and any tentative choice may be too restrictive. Under such circumstances, non-parametric density estimation, e.g. the Kernel Density Estimation (KDE) technique [21], which allows for practical solutions to the classical problem of choosing the level of smoothing (bandwidth), can be used efficiently. For example, given the data points, the KDE approach to clustering defines clusters as regions of high density separated by regions of no or low density. Its main idea is to first compute a kernel density estimate, f_t(x) say, from the data, with a Gaussian kernel and an isotropic bandwidth t > 0 controlling the amount of smoothing. In its simplest form, KDE can be thought of as an alternative to the histogram, as it typically provides a smoother representation of the data and, unlike the histogram, its appearance does not depend on a choice of starting point. The scenario represents a problem amenable to the multivariate kernel function in Equation 1, where T is a symmetric positive-definite d-by-d bandwidth matrix, defined here as a diagonal matrix.
Without loss of generality, consider a phenomenon with a binary structure of, say, "highs" and "lows". Depending on the context, a number of models can be applied. For instance, if we assume a Gaussian kernel, we can define a parametric pattern of "lows" and "highs" in the form of a normal mixture model and use the parameter estimates Θ = (μ, Σ) to track the dynamics of the cycles. Further, if we assume that the probability of a "high" followed by another "high" structure is P_hh and that of a "low" followed by a "low" structure is P_ll, we can define a Hidden Markov Model as in Table 1. In this case, an HMM provides a formal foundation for linear sequence labelling of data. Balancing accuracy and reliability amounts to defining an appropriate way of labelling data using these probabilities and interpreting the results probabilistically.
We could also define associations, the corresponding scores and the underlying confidence.
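The dependence of detected structures on the level of smoothing can be illustrated with a minimal numpy sketch. The data, bandwidth values and grid below are invented stand-ins, not the paper's settings; the point is only that a small bandwidth reveals the two clusters while a large one smooths them into a single mode:

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate on `grid`:
    f_t(x) = (1/(n*t)) * sum_i phi((x - X_i)/t), bandwidth t > 0."""
    diffs = (grid[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (
        len(data) * bandwidth * np.sqrt(2 * np.pi))

def count_modes(density):
    """Count interior local maxima (modes) on the evaluation grid."""
    d = np.diff(density)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

# Synthetic bimodal sample standing in for "low" and "high" activity values.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(20, 5, 300), rng.normal(110, 15, 300)])
grid = np.linspace(data.min(), data.max(), 512)

# A small bandwidth reveals the two clusters; a large one smooths them away.
modes_small = count_modes(gaussian_kde(data, grid, bandwidth=5.0))
modes_large = count_modes(gaussian_kde(data, grid, bandwidth=60.0))
```

Treating the high-density regions around each surviving mode as clusters is exactly the KDE clustering idea described above.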

Data Description, Research Problem and Objectives
We adopt the sunspots data [11], an average oscillating sequence of the beginning and ending periods of solar cycles, forming the densities in Figure 1. The densities in Figure 1 exhibit a different number of modes, a feature typically determined by the adopted level of smoothing. By controlling the level of smoothing, via a kernel function of the form in Equation 1 or otherwise, we are able to identify different structures in the data. Figure 2 presents a 2-D plot of the sunspots means and standard deviations. The numbers in the plot represent the indices for each of the last 23 cycles and the current (24th) cycle. Using a rule of thumb, we can identify the high, moderate and low solar activity cycles. Following [10 and 1], we can treat each cycle as a separate density and then use their distributional behaviour to explore the underlying structures of the cycles.
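The per-cycle summaries behind a plot like Figure 2 can be sketched as follows. The gamma draws below are invented stand-ins for the monthly sunspot numbers, and the median-split labelling is only an illustrative rule of thumb, not the paper's optimised cut-off:

```python
import numpy as np

# Hypothetical monthly sunspot numbers grouped by cycle index (1..24);
# the gamma draws are invented stand-ins for the real series.
rng = np.random.default_rng(4)
cycles = {k: rng.gamma(shape=2.0, scale=20 + 5 * (k % 3), size=132)
          for k in range(1, 25)}  # ~11 years of monthly values per cycle

# Summarise each cycle by its (mean, standard deviation), as in Figure 2.
summary = {k: (v.mean(), v.std()) for k, v in cycles.items()}

# An illustrative rule of thumb: label a cycle "high" when its mean
# exceeds the median of all cycle means.
means = np.array([m for m, _ in summary.values()])
cutoff = np.median(means)
labels = {k: ("high" if m > cutoff else "low") for k, (m, _) in summary.items()}
```

Each cycle is thus reduced to a point in the (mean, standard deviation) plane, which is what allows its distributional behaviour to be compared across cycles.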

Modelling Strategy
Conventional approaches to modelling sunspots include data assimilation [8] and rotational solar dynamo-based predictive models for short-term predictions [2 and 14]. The densities in Figure 1 exhibit typically bimodal patterns, and so we shall assume that the cycles form a parametric pattern of "lows" and "highs" and define the finite mixture

f(S_i) = Σ_{k=1}^{K} π_k f*(S_i; θ_k),

where S_i denotes the sunspots numbers, K is the number of components, f*(·) is a normal density, π_k is the prior probability of class membership and {S_i ∈ k} are the class allocations. Statistically, the high-peaked (more than normal) and low-peaked (less than normal) cycles imply high and low solar activities respectively, while those skewed to the right imply few increases and frequent decreases in solar activity, and vice versa. Our strategy involves two main levels: unsupervised and supervised. At the former level, we examine the initial and subsequent patterns of the cycles in order to separate the "lows" from the "highs". The maximum likelihood estimates (MLEs) of the random finite mixture densities are estimated and passed on to a predictive model at the supervised level as outlined below.
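A minimal EM fit of a two-component normal mixture of this kind can be sketched as follows. The per-cycle peak values, the initialisation and the iteration count are illustrative assumptions, not the paper's estimates:

```python
import numpy as np

def em_two_gaussians(x, n_iter=200):
    """Fit a two-component 1-D normal mixture by EM; returns the MLEs
    (mixing weights, means, standard deviations) and responsibilities."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])      # spread-out initial means
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])              # prior class probabilities
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = (pi / (sigma * np.sqrt(2 * np.pi)) *
                np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the MLEs from the responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma, resp

# Invented per-cycle peak values: 12 "low" and 12 "high" cycles.
rng = np.random.default_rng(1)
peaks = np.concatenate([rng.normal(80, 10, 12), rng.normal(150, 15, 12)])
pi, mu, sigma, resp = em_two_gaussians(peaks)
labels = resp.argmax(axis=1)   # 0 = "low" cycle, 1 = "high" cycle
```

The fitted priors, means and standard deviations are the kind of MLEs that the unsupervised stage passes on to the supervised level, with the responsibilities supplying the class labels.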

Figure 2. Sunspot means and standard deviations.
The above algorithm adapts the convergence properties of the EM algorithm described in [5 and 9]. Its form suits any supervised modelling technique; in this paper it is implemented with Support Vector Machines (SVM).

Supervised Modelling of Labelled Data
We adopt Support Vector Machines (SVM), a kernel-based discriminant function whose mechanics rely on supervised learning of the underlying discriminating rules from the training data [5]. To put it in context, let the "high" and "low" cycles in our modified dataset form the two classes. The SVM kernel [4] is generally defined as an inner product in a transformed feature space, K(x_i, x_j) = φ(x_i)·φ(x_j), for a mapping φ of the inputs.
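A hedged sketch of such a kernel-based classifier, using scikit-learn's SVC with a Gaussian (RBF) kernel on hypothetical per-cycle (mean, standard deviation) features; the feature values and labels are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-cycle features: (mean sunspot number, standard deviation),
# labelled 0 = "low" cycle, 1 = "high" cycle. All values are invented.
rng = np.random.default_rng(2)
low = np.column_stack([rng.normal(50, 8, 40), rng.normal(30, 5, 40)])
high = np.column_stack([rng.normal(110, 12, 40), rng.normal(55, 6, 40)])
X = np.vstack([low, high])
y = np.array([0] * 40 + [1] * 40)

# Gaussian (RBF) kernel: K(u, v) = exp(-gamma * ||u - v||^2).
clf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
acc = clf.score(X, y)
```

The cost parameter C plays the role of the cost range swept in the analyses below, and the kernel bandwidth (gamma) is the SVM analogue of the smoothing level discussed for KDE.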

Analyses and Discussions
We now present the two-level analyses described above in order to establish whether sunspots follow identifiable patterns which can be used as inputs in a predictive model. Each cycle is defined by its early patterns. In particular, the maximum values reached by each cycle appear to provide an insight into the overall activity of the cycle before it starts to subside. The foregoing structural detection of patterns in the sunspots data amounts to unsupervised modelling. Adopting these patterns as a data labelling rule yields the two class priors for the low and high groups.
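The early-pattern labelling rule can be illustrated with a small simulation. The 36-month window reflects the 3-year finding reported in this paper, but the cycle shapes, noise levels and threshold below are invented for illustration:

```python
import numpy as np

# Invented cycles: 36 monthly values (the "early pattern", i.e. the first
# 3 years) plus an eventual peak; "high" cycles ramp up faster early on.
rng = np.random.default_rng(5)

def make_cycle(high):
    base = 60.0 if high else 25.0
    early = rng.normal(base, 5, 36)        # first 3 years of the cycle
    peak = base * 2 + rng.normal(0, 10)    # eventual cycle maximum
    return early, peak

cycles = [make_cycle(high=(i % 2 == 0)) for i in range(24)]
early_means = np.array([e.mean() for e, _ in cycles])
peaks = np.array([p for _, p in cycles])

# Cut-off separating low from high cycles, taken here as the overall mean
# of the early averages (a stand-in for the paper's optimised threshold).
cutoff = early_means.mean()
pred_high = early_means > cutoff
true_high = peaks > peaks.mean()
agreement = (pred_high == true_high).mean()
```

When the early averages separate cleanly about the cut-off, the first 3 years of a cycle suffice to anticipate whether its peak will be high or low, which is the labelling logic carried into the supervised stage.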

Unsupervised: Initial Patterns and Maximisation
lh  computed as above. As the average early patterns for cycle 24 fall below the cut-off point, it is reasonable to suggest that it will be a low activity cycle. Implementation of SVM modelling follows below.

Supervised Level: SVM Supervised Modelling
Results from SVM modelling based on the initial class patterns with prior probabilities π̂_l and π̂_h gave an averaged accuracy of 58% on a cost range of 0.005 to 5 and a training sample of 500. Posterior class probabilities conditioned on maximised averages of the early low and high group means reached an average accuracy of 98% on the same cost range and training sample size. The support vectors are shown in Figure 4, with the horizontal and vertical axes corresponding to the support vectors and indices respectively. Figure 5 shows the best discriminating SVM decision values at two different bandwidths. The bandwidths, and hence decision values, are chosen from multiple simulations as determined by the binary cut-off point demarcating low from high cycles. Notice how each of the modes also exhibits sub-modes. To avoid spurious modes (over-fitting) or masking effects (under-fitting), it is recommended to use significance tests for changes or, for clear patterns, graphical visualisation. Typically, SVM model weights for each of the support vectors are obtained as a cross product of the model coefficients and support vectors [15]. Weights from multiple SVM runs can be recorded and their graphical patterns used to guide model selection. Other SVM outputs include the individual probabilities and decision values; the support vectors in Figure 4, the decision values in Figure 5 and the corresponding probabilities in Figure 6 can be identified by indexing.
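The weight computation mentioned above, a cross product of model coefficients and support vectors [15], can be reproduced for a linear kernel, where scikit-learn exposes both the dual coefficients and the resulting primal weights; the two-class data below are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic classes standing in for low/high cycles.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Model weights as the cross product of the (dual) model coefficients
# and the support vectors; for a linear kernel this equals clf.coef_.
w = clf.dual_coef_ @ clf.support_vectors_
```

Recording such weight vectors across repeated runs, and inspecting their graphical patterns, is one way to operationalise the model-selection guidance described above.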

Concluding Remarks and Potential Future Directions
Predicting solar activity cycles remains one of the major challenges facing the scientific community, its intricacy comparable to predicting, say, the severity of next year's winter. In this weather analogue, if all that is available is a long vector of temperature readings over many years, the only sensible approach is to search for naturally arising structures in the data, in the hope that, if uncovered, they may provide potentially useful information. This paper adopted the foregoing philosophy and sought to develop a predictive framework for modelling sunspots data using inherent distributional properties of the data. The paper relied on a continuous flow of data for prediction, but rather than assessing model accuracy on the NOAA benchmark, an SVM model was trained and tested on a notionally infinite dataset of cycles. By examining multiple sets of observations from the onset of each cycle via graphical visualisation, early patterns of solar cycles and their binary nature were determined. Comparing multiple early patterns for each recorded cycle, extracted at different time periods, to the corresponding full cycles revealed that the first 3 years provide a sufficient basis for predicting the cycle's peak. The patterns were then adapted as inputs into an integrated unsupervised and supervised modelling algorithm. The novel method's mechanics are geared towards simultaneously tracing anomalies via an adaptive approach. Repeated SVM runs using repeatedly improved parameters showed that the approach yields greater accuracy and reliability than conventional approaches. Multiple simulations of this type can be generated based on the algorithm above to assist in selecting the most consistent model. The paper's main substance can be described as an enhancement of algorithmic methods for learning underlying rules from data.
Finally, it is worth noting that while the study was confined to the conventional periodicity of 11.11 years [22] with a binary pattern of cycles, the definition itself implies that periodicities can differ according to the adopted definitions. Further, while we assumed a binary scenario of the cycles in Figure 1, different bandwidths are likely to yield different patterns. To address this limitation, the paper's findings highlight potential investigation paths into such variations. Moreover, the current study, based on a single application and a single method, could not confirm the algorithm's robustness. Although we adopted SVM for implementation, the approach is amenable to any domain-partitioning method. Thus, for model enhancement purposes, it will be useful to provide a comparative study using other learning algorithms such as neural networks and decision trees.