Regularities in Sequences of Observations

The objective of this paper is to propose an adjustment to three established methods of calculating the probability that regularities in sample data represent a systematic influence in the population data. The proposed method is called data profiling. It consists of calculating vertical and horizontal correlation coefficients in a sample data set. The two correlation coefficients indicate the internal dynamics, or interdependency, among observation points, and thus add new information. When this information is incorporated into the established methods, one can conclude with confidence that the calculated probability is indeed a valid indication of systematic influence in the population data.


Introduction
Suppose that in a sequence of observations one observes a striking regularity; for example, suppose that the values arrange themselves in increasing or decreasing order of magnitude, or that a maximum or a minimum is indicated. Many questions arise. Is the observed regularity a general phenomenon, or is it true only of the particular data set sampled? Is the observed regularity due to the particular sequence sampled, or is it due to sampling from a random sequence? In other words, in recurrent sampling, is it reasonable to believe that approximately the same general results will occur? Is it the manner of sampling that creates artificial regularities? The occurrence of regularity in a data set that results from random sampling is highly improbable; thus regularity in a sample is a justification for regarding it as a true representative of the population. The assumption is that unless the probability of random occurrence is small, there is no objective proof that an actual regularity exists in the population data.
Many researchers have made significant contributions to exploring regularities in random sample data sets [1-6]. For example, assuming that the sequence of individual numerical values is available, they have applied various tests based on the characteristics of a random sequence; one such result is that the expected number of maxima in a sequence of unrelated numbers is one third of the number of data points. The deviation of any sequence of data, in any characteristic, from what is assumed for a random sequence implies a systematic influence, the extent of which depends on the magnitude of the deviation and the number of data points in the sample. In general, random sampling of data is not a sufficient criterion for proving systematic influence. It has been shown that unless there is a large number of data points, the question of the existence of systematic influence remains unresolved.
Up to now, attempts to determine the probability of obtaining a short sequence of terms having a strict appearance of regularity have proven rather misleading. Given these uncertainties, researchers have modified the analysis of regularities in random samples. In the newer approach, a sequence of averages of groups of individual observations is obtained in a systematic way. For example, random samples are drawn any number of times, and the averages of each sample sequence are calculated. These averages form a composite sequence that can be used in testing for systematic influence. The statistical significance of such a sequence of averages can be determined by comparing the variance of the individual observations in a random sample, computed directly, with that calculated from the variances of the averages [7-14]. This analysis-of-variance principle can be applied to the general case where the values of the dependent variable are related to each of a number of correlated independent variables. This is a problem of multiple curvilinear correlation, where a sequence of averages of the dependent variable is computed with respect to each independent variable at constant values of the other independent variables [15-17]. Although the use of a sequence of averages is a logical approach, the method is highly uncertain and in some cases inapplicable. The analysis-of-variance principle and multiple curvilinear correlation have a solid logic, but they provide only approximate indications of systematic influence. The main shortcoming of these models is that they do not detect the source of variability in a data set: their focus is on the variability within, and the correlation among, averages in a sample. To address this shortcoming, a modification to these models is proposed. The modification consists of detecting the correlation among individual observations both within and across groups in a data sample; in other words, data profiling. This aim is achieved by calculating vertical and horizontal correlation coefficients and incorporating them in the calculations. The precise definitions of these quantities, and the manner in which they are integrated into the three models, are demonstrated in the following sections.

The Approach Based on the Theory of Large Samples
Commonly, the values of sequences in a data sample are averages of measurements or numbers grouped in some systematic fashion. Many readily available methods calculate the probability of such systematic grouping of data. These methods are based on calculating the variability between the averages and within the groups, and they have been extended to special cases where the regularities of a sequence are periodic [7-14].
The method based on the theory of large samples was developed by Cox [9]. The sample standard deviation and the mean difference between successive observations are calculated, and from them a criterion of significance is constructed for testing the statistical significance of each sequence of averages, as well as the composite significance of all the sequences. Assuming the differences in value between successive terms are small, the standard error of the ratio of the mean difference to the standard deviation should be proportional to (1/√n). Cox assumes that if there are no systematic influences, this expression should on average equal zero by the theory of large samples, and any observed value should be less than its sampling error. Practice has shown that this method based on the theory of large samples is often inapplicable. One way to circumvent this problem is to introduce vertical and horizontal correlation coefficients. Correlation coefficients show the variation between observations within groups and across groups. A minor change of notation is introduced.
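Cox's exact equations did not survive in this text, but the flavor of a large-sample successive-difference criterion can be illustrated with a closely related statistic, the von Neumann ratio of the mean-square successive difference to the variance. This is a sketch of an analogous test, not Cox's own construction:

```python
def successive_difference_ratio(x):
    """Mean-square successive difference divided by the sample variance.

    For a purely random sequence this (von Neumann) ratio is close to 2;
    a systematic trend pushes it well below 2. Illustrative analogue of
    the large-sample criterion discussed above, not the paper's formula.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    msd = sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / (n - 1)
    return msd / var

# A steadily increasing sequence yields a ratio well below 2,
# signalling a systematic influence rather than randomness.
trend = [1.0, 2.1, 2.9, 4.2, 5.1, 5.9, 7.2, 8.0]
ratio = successive_difference_ratio(trend)
```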
The vertical and horizontal correlation coefficients, (ρv) and (ρh), are calculated for the sample data. If the maximum of the ratio (ρv/ρh) is equal to one, the indication is that each observation is related to the others, both within each column and across columns; in other words, there is evidence of systematic influence or systematic regularity. On the other hand, if the ratio is either less than one or greater than one, the evidence points to the contrary, which translates into the lack of any systematic influence. Thus, in general, if the ratio is equal to one, there is certainty that the population data exhibit systematic influence; in the reverse case, where the ratio is either less than or greater than one, there is no systematic influence in the population data.
Copyright © 2012 SciRes. OJS
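The defining equations for (ρv) and (ρh) did not survive in this text, so the following sketch is one plausible reading: average the lag-1 correlation down each column (vertical) and along each row (horizontal). The function names and the lag-1 choice are assumptions, not the authors' exact definitions:

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

def autocorr1(seq):
    """Lag-1 autocorrelation: how strongly each value relates to the next."""
    return pearson(seq[:-1], seq[1:])

def profile(rows):
    """Vertical and horizontal correlation coefficients of a data matrix.

    `rows` is a list of equal-length rows (each row one group or cycle).
    Vertical: average lag-1 correlation down each column.
    Horizontal: average lag-1 correlation along each row.
    Both are stand-ins for the paper's lost formulas.
    """
    cols = list(zip(*rows))
    rho_v = sum(autocorr1(c) for c in cols) / len(cols)
    rho_h = sum(autocorr1(r) for r in rows) / len(rows)
    return rho_v, rho_h
```

On a perfectly regular matrix both coefficients are one, so the ratio (ρv/ρh) equals one, matching the text's criterion for systematic influence.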
The approach based on the theory of large samples looks at the sample data at the macroscopic level, meaning sample averages and sample standard deviations. Data profiling explores the data set at the microscopic level, meaning the vertical and horizontal correlation coefficients. The data profiling method adds new information, which allows for efficient and accurate detection of systematic influence. To state this formally, let (Ω, A, P) be the space of almost surely random sets, where (Ω) is the set of all random sets and (A) is a subset endowed with a σ-algebra. The sample data exhibit systematic influence if and only if the probability of the observed regularity exists and is equal to one. Data profiling assigns to this space a metric (d) associated with convergence in probability. It is easy to see that (d) represents a distance and is invariant under transformation, no matter which subset of random sets is used. If the macroscopic and microscopic statistics are true representations of the data at their respective levels, one would expect the distance between them to converge to zero, and the convergence of (d) to zero entails the convergence of the probability. This follows from the Bienaymé-Tchebychev (Markov) inequality: for any (ε > 0), the probability of a deviation exceeding (ε) is bounded by a quantity that vanishes as (d) tends to zero, which assures, almost surely, the detection of systematic influence in a data set.
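The convergence argument can be made precise with the Ky Fan metric, which metrizes convergence in probability. The following is a plausible reconstruction of the garbled passage; the notation is assumed, not taken verbatim from the paper:

```latex
% Ky Fan metric (a plausible reading of the metric d on random variables):
d(X, Y) \;=\; \inf\{\varepsilon > 0 \,:\, P(|X - Y| > \varepsilon) \le \varepsilon\}
% Bienaym\'e--Tchebychev (Markov) inequality: for any \varepsilon > 0,
P(|X - Y| > \varepsilon) \;\le\; \frac{E\,|X - Y|}{\varepsilon}
% hence d(X_n, X) \to 0 \iff X_n \to X \text{ in probability.}
```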

The Approach Based on the Method of Analysis of Variance
This method finds the probability that any variation between averages is purely random [18]. An outline of the procedure follows. The mean variance between columns (Vs) is calculated, and then the residual variance (Vr). Let Z = (1/2) log(Vs/Vr); the probability of no systematic influence is then found from tables [18-20], given (Z) and the degrees of freedom (n1) and (n2). The method of analysis of variance looks at the variability between column means and the variability of individual observations from the corresponding mean within each column. Its shortcoming is that it does not look at the corresponding correlations between individual observations within each column and across groups. Data profiling allows for a better analysis and detection of internal, systematic variability. To account for data profiling, the formula for (Z) is modified by incorporating the log of the ratio of the vertical to the horizontal correlation coefficients.
The addition of the log of the ratio of the vertical to the horizontal correlation coefficients has one major effect: it either augments the value of (Z), in which case the probability of purely random variation is lowered and the evidence of systematic influence strengthened, or it lowers the value of (Z), in which case the probability of purely random variation is raised and the evidence of systematic influence weakened.
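The unmodified statistic is the standard analysis-of-variance construction, and can be sketched as follows for a list of equal-size groups (columns); the function name is illustrative:

```python
import math

def z_statistic(groups):
    """Fisher's Z = (1/2) ln(V_s / V_r) for a list of equal-size groups.

    V_s is the variance between column means (scaled by group size) and
    V_r the pooled within-column (residual) variance, as the section
    describes for the analysis-of-variance method.
    """
    m = len(groups)                      # number of columns (groups)
    n = len(groups[0])                   # observations per column
    grand = sum(sum(g) for g in groups) / (m * n)
    means = [sum(g) / n for g in groups]
    v_s = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    v_r = sum((x - mu) ** 2
              for g, mu in zip(groups, means) for x in g) / (m * (n - 1))
    return 0.5 * math.log(v_s / v_r)
```

Groups whose means differ far more than their internal scatter yield a large positive Z, i.e. a low probability that the variation is purely random.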

The Approach Based on Multiple Regression
Up to this point, we have been dealing with one independent variable only. McEwen [17] generalizes the method of analysis of variance to many independent variables, which may be mutually correlated. In other words, the group averages are given as a multiple regression on (K) independent variables. He thus modifies the mean variance between columns (Vs) and the residual variance (Vr) using the multiple regression method. The outline of the procedure is as follows. Let M be the total number of columns (groups) to be averaged with respect to all the independent variables, K the number of independent variables, and (Ky') the value of an observation corrected with respect to all except the kth independent variable; corrected group averages of the dependent variable are then formed with respect to each independent variable. The overall variance between columns is calculated, followed by the residual variance. The probability that the data are random is obtained from the resulting statistic and the degrees of freedom (n1) and (n2), and the probability of systematic influence follows as its complement. The shortcoming of the generalized method of the analysis of variance is that, although it looks more closely at the individual data sets, it does not consider the strength of the relationship between individual data points. Data profiling adjusts for this shortcoming. Vertical and horizontal correlation coefficients are calculated for each of the K independent variables, and new, residual correlation coefficients are introduced, representing the vertical and horizontal correlations of the (Ky') corrected observations. The overall variance between columns is then modified accordingly.
In order to modify the residual variance, residual vertical and horizontal correlation coefficients are calculated using (Ky'), the value of an observation corrected to constant values of all but the kth independent variable, as given in McEwen's generalized method of the analysis of variance [17]. The residual vertical and residual horizontal correlation coefficients are computed for each independent variable, and the residual variance is modified accordingly. The probability that the data exhibit systematic influence is then found from the modified statistic and the degrees of freedom (n1) and (n2), as already explained.
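The idea of "corrected" observations can be illustrated with a single covariate: residualize the dependent variable on a simple linear fit, leaving the variation not explained by that variable. This is a minimal stand-in for (Ky'), where McEwen corrects on all but the kth of several variables; the function name is hypothetical:

```python
def corrected_values(y, x):
    """Residuals of y after removing a simple least-squares linear fit on x.

    A one-covariate sketch of the 'corrected' observations (Ky'): with
    several independent variables one would residualize on all but the
    k-th; here a single covariate illustrates the idea.
    """
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    # Subtract the fitted value my + beta * (x - mx) from each observation.
    return [b - (my + beta * (a - mx)) for a, b in zip(x, y)]
```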

An Example: Sunspot Numbers
In this section the validity of the improvement, in the form of data profiling, is tested. For this purpose the data set used in [17] is revisited, and the probability of the existence of systematic influence in the data is calculated once with the analysis-of-variance method, as already demonstrated in [17], and once with the modified version. Consider the data corresponding to sunspot numbers arranged with respect to a trial cycle of length 11 years, from 1749 to 1826, with sunspot numbers exceeding 99 excluded. The data are shown in matrix form in Table 1. The number of columns is (m = 11), the number of observations in each column is (ns = 4), the total number of observations of the dependent variable is (N = 44), and the overall average is (ȳ = 25.59). The column averages decrease up to the 6th column and increase thereafter. To calculate the probability that the sample data are indicative of the population data, and thus that there are cyclic effects, the statistic (Z) is calculated using the mean variance between the columns (Vs) and the residual variance (Vr).
The values of (Z) corresponding to the 20, 5, 1, and 0.1 percent points are, respectively, 0.19, 0.38, 0.54, and 0.71. Since (Z = 0.64) is greater than 0.54, the probability of purely random effects is 0.01, which makes the probability of systematic influence 0.99. Though the results seem to point in favor of systematic influence, or the existence of cycles, the evidence is not conclusive. To find out whether the sample implies a cyclic appearance of sunspots, the data profiling method is applied: the vertical and horizontal correlation coefficients are calculated from Equations (3) and (4), along with the vertical averages and the maximum of the ratio of (ρv) to (ρh).

The maximum of the ratio (ρv/ρh) is equal to 0.2586587. The modified value of the statistic (Z) is then obtained by adding the two values, which gives (0.6408 + 0.2587) = 0.8995. Since the value 0.8995 is higher than 0.71, it indicates that the probability that the population data are random is less than 0.001, indicating with near certainty that the number of sunspots is cyclic; on this reading, the existence of systematic influence is indisputable. The statistic calculated using the approach based on multiple regression is a slight improvement over the statistic obtained using the method of analysis of variance (Z = 0.6408); using the data profiling method, that statistic is corrected to (Z = 1.0). As in the case of the analysis-of-variance method, it can be stated with confidence that there is indeed a systematic influence in the sample data.
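The adjustment reported for the sunspot example is a straightforward addition of the two quantities, reproduced here using the values stated in the text (Table 1 itself is not available, so the statistics are taken as reported rather than recomputed):

```python
z = 0.6408              # Z from the analysis-of-variance method (as reported)
correction = 0.2586587  # max of the ratio of vertical to horizontal coefficients
z_modified = round(z + correction, 4)
# z_modified exceeds the 0.1 percent point (0.71), so the probability
# that the data are random is below 0.001 on the paper's reading.
```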

Conclusion
The objective was to derive conclusions about the randomness of observations in a population, given that the sample data set exhibits strict regularities. Three methods are analyzed and their shortcomings indicated. An improvement to the three methods, in the form of data profiling, is suggested and formulated; in essence, it is the integration of vertical and horizontal correlation coefficients into the equations. Through a simple example, it is shown that data profiling is indeed a complement to the original formulations.




Table 1. Sunspot numbers arranged with respect to a trial cycle of 11 years, 1749-1826.