Open Journal of Statistics, 2012, 2, 408-414
http://dx.doi.org/10.4236/ojs.2012.24049 Published Online October 2012 (http://www.SciRP.org/journal/ojs)

Regularities in Sequences of Observations

Mahkame Megan Khoshyaran
Economics Traffic Clinic (ETC), Paris, France
Email: megan.khoshyaran@wanadoo.fr

Received July 18, 2012; revised August 20, 2012; accepted September 2, 2012

ABSTRACT

The objective of this paper is to propose an adjustment to the three methods of calculating the probability that regularities in sample data represent a systemic influence in the population data. The method proposed is called data profiling. It consists of calculating vertical and horizontal correlation coefficients in the sample data. The two correlation coefficients indicate the internal dynamic, or interdependency, among observation points, and thus add new information. This information is incorporated in the already established methods, and the consequence of this integration is that one can conclude with certainty that the probability calculated is indeed a valid indication of systemic influence in the population data.

Keywords: Systematic Influence; Theory of Large Samples; Analysis of the Variance Principle; Multiple Regression; Data Profiling; Vertical Correlation Coefficient; Horizontal Correlation Coefficient

1. Introduction

Suppose that in a sequence of observations one observes a striking regularity; for example, suppose that the values arrange themselves in increasing or decreasing order of magnitude, or that a maximum or a minimum is indicated. Many questions arise. Is the observed regularity a general phenomenon, or is it true only of the particular data set sampled? Is the observed regularity due to the particular sequence sampled, or is it due to sampling from a random sequence? In other words, in recurrent sampling, is it reasonable to believe that approximately the same general results will occur? Is it the manner of sampling that creates artificial regularities?
The occurrence of regularity in a data set that results from random sampling is highly improbable; thus regularity in sample data is a justification for regarding the regularity as truly representative of the population data. The assumption is that unless the probability of random occurrence is small, there is no objective proof that there exists an actual regularity in the population data.

Many researchers have made significant contributions to exploring regularities in random sample data sets, [1-6]. For example, assuming that the sequence of individual numerical values is available, they have applied various tests based on characteristics of a random sequence. They concluded, for instance, that the number of maxima in a sequence of unrelated numbers is one-third of the number of data points. The deviation of any sequence of data in any characteristic from what is assumed for a random sample of sequences implies that there is a systematic influence, the extent of which depends on the magnitude of the deviation and the number of data points in the sample. In general, random sampling of data is not a sufficient criterion for proving systematic influence. It is shown that unless there are a large number of data points, the proof of the existence of systematic influence remains unresolved.

Up to now, the attempts to determine the probability of getting a short sequence of terms having a strict appearance of regularity have proven to be rather misleading. Given the uncertainties, researchers have modified the analysis of regularities in random samples. In the new approach a sequence of averages of groups of individual observations is obtained in a systematic way. For example, random samples are drawn any number of times. The average of each data sample is calculated. These averages form a composite sequence that can be used in testing the systematic influence in samples.
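The one-third figure quoted above for the number of maxima in a sequence of unrelated numbers can be checked by simulation. The sketch below is only an illustration: it assumes independent uniform draws, counts interior points that exceed both neighbours, and uses hypothetical helper names (`count_local_maxima`, `maxima_fraction`) that do not come from the cited works.

```python
import random


def count_local_maxima(seq):
    """Count interior points strictly greater than both neighbours."""
    return sum(
        1
        for i in range(1, len(seq) - 1)
        if seq[i - 1] < seq[i] > seq[i + 1]
    )


def maxima_fraction(n_points, seed=0):
    """Fraction of interior points that are local maxima in a random sequence.

    Assumption: "unrelated numbers" is modelled as independent uniform draws.
    """
    rng = random.Random(seed)
    seq = [rng.random() for _ in range(n_points)]
    return count_local_maxima(seq) / (n_points - 2)


if __name__ == "__main__":
    # For a long random sequence the fraction should be close to 1/3.
    print(maxima_fraction(100_000))
```

A sequence whose fraction of maxima departs markedly from one-third is, in the sense of the tests described above, a candidate for systematic influence; how large a departure is meaningful depends on the number of data points.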
The statistical significance of such a sequence of averages can be determined by comparing the variance of the individual observations in a random sample computed directly with that calculated from the variances of the averages, [7-14]. This analysis of variance principle can be applied to a general case where the values of the dependent variable are related to each of a number of correlated independent variables. This is a problem of multiple curvilinear correlation, where a sequence of averages of the dependent variable is computed with respect to each independent variable and corrected to constant values of the other independent variables, [15-17].

Copyright © 2012 SciRes. OJS
A method of testing the statistical significance of each sequence of averages, as well as the composite significance of all of the sequences, is derived.

Although the use of a sequence of averages is a logical approach, this method is highly uncertain and in some cases inapplicable. The analysis of variance principle and multiple curvilinear correlation have a solid logic, but they provide only approximate indications of any systematic influence. The main shortcoming of these models is that they do not detect the source of variability in a data set. The focus of the three models is on the variability within, and the correlation among, averages in a sample. To address this shortcoming of the three approaches, a modification to these models is proposed. The modification consists of detecting the correlation among individual observations both within and across groups in a data sample, or in other words, data profiling. This aim is achieved by calculating vertical ($v$) and horizontal ($h$) correlation coefficients, and incorporating them in the calculations. The precise definitions of these variables are given, and the manner in which they are integrated into the three models is demonstrated, in the following sections.

2. The Approach Based on the Theory of Large Samples

Commonly, the values of sequences in a data sample are averages of measurements or numbers grouped in some systematic fashion. There are many readily available methods that calculate the probability of such systematic grouping of data. These methods are based on calculating the variability between the averages and within the groups. These methods are extended to special cases where the regularities of a sequence are periodic, [7-14]. The method based on the theory of large numbers developed by [9] consists of computing the standard deviation of the group means multiplied by the square root of the number of observations, ($\sigma_n\sqrt{n}$), and the standard deviation of the entire series in a data sample, ($\sigma_o$). Let $m$ = number of columns or groups, $n$ = number of entries per column, $y_s$ = a group mean, and $\bar{y}$ = the grand mean; then the group standard deviation is given by the following:

$$\sigma_n\sqrt{n} = \sqrt{\frac{\sum (y_s - \bar{y})^2}{m}}\,\sqrt{n} \qquad (1)$$

The sample standard deviation is calculated using the following equation:

$$\sigma_o = \sqrt{\frac{\sum (y - \bar{y})^2}{mn}} \qquad (2)$$

It is assumed that the difference in value between $\sigma_o$ and $\sigma_n\sqrt{n}$ should be less than the sampling error. If this principle holds, then Cox employs the criterion of significance. If the error of the standard deviation $\sigma_o$ is small, then the standard error of the ratio ($\sigma_n\sqrt{n}/\sigma_o$) should be proportional to ($\sigma_n\sqrt{n}$). Cox assumes that if there are systematic influences, then the expression $1 - \sigma_n\sqrt{n}/\sigma_o$ should on the average equal zero given the theory of large samples, with standard error $\sigma_n\sqrt{n}/(\sigma_o\sqrt{2m})$.

Practice has shown that this method, which is based on the theory of large samples, is often inapplicable. One way to circumvent this problem is to introduce vertical and horizontal correlation coefficients. Correlation coefficients show the variations between observations, and across groups. A minor change of notation is introduced. The $y_s$ that represented the group mean is modified to $\bar{y}_j$, $j = 1, \dots, m$, to reflect the mean by group of the observations, or vertical means. Horizontal means $\bar{y}_i$, $i = 1, \dots, n$, reflect observation means. Each observation is represented by $y_{ij}$, $i = 1, \dots, n$; $j = 1, \dots, m$. The vertical ($v$) and horizontal ($h$) correlation coefficients are calculated using the following formulas:

$$v_j = \frac{\sum_{i=0}^{n} (y_{ij} - \bar{y}_j)(\bar{y}_i - \bar{y})}{\sum_{i=0}^{n} (y_{ij} - \bar{y})^2}, \quad \text{for } j = 1, \dots, m \qquad (3)$$

and

$$h_i = \frac{\sum_{j=0}^{m} (y_{ji} - \bar{y}_i)(\bar{y}_j - \bar{y})}{\sum_{j=0}^{m} (y_{ji} - \bar{y})^2}, \quad \text{for } i = 1, \dots, n \qquad (4)$$

If the ratio $\max v / \max h$ is equal to one, then the indication is that each observation is related to the other, both within each column and across columns; in other words, there is evidence of systematic influence or systematic regularity.
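Equations (1) and (2) can be evaluated directly from a table of grouped observations. The following is a minimal sketch, assuming every group holds the same number of entries $n$ and that the sample standard deviation in Equation (2) is taken over all $mn$ observations about the grand mean; the function name `large_sample_ratio` is hypothetical.

```python
import math


def large_sample_ratio(groups):
    """Return sigma_n * sqrt(n) / sigma_o for m groups of n entries each.

    groups: list of m lists, each of length n (assumed equal group sizes).
    """
    m = len(groups)
    n = len(groups[0])
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / (m * n)
    group_means = [sum(g) / n for g in groups]

    # Equation (1): standard deviation of the group means, scaled by sqrt(n).
    sigma_n = math.sqrt(sum((ys - grand_mean) ** 2 for ys in group_means) / m)
    # Equation (2): standard deviation of the entire series.
    sigma_o = math.sqrt(sum((y - grand_mean) ** 2 for y in all_values) / (m * n))

    return sigma_n * math.sqrt(n) / sigma_o


if __name__ == "__main__":
    print(large_sample_ratio([[1, 2, 3], [4, 5, 6]]))
```

Because the between-group sum of squares can never exceed the total sum of squares, the ratio computed this way lies between 0 and $\sqrt{n}$; identical groups give 0.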
On the other hand, if the ratio $\max v / \max h$ is either less than one or greater than one, then the evidence points to the contrary, which translates into the lack of any systematic influence. Thus, in general, if the ratio $\sigma_n\sqrt{n}/\sigma_o$ is equal to $\max v / \max h$, and both are equal to one, then there is absolute certainty that the population data exhibits systematic influence. In the reverse case, where the ratio $\sigma_n\sqrt{n}/\sigma_o$ is equal to $\max v / \max h$ but is either less than or greater than one, there is no systematic influence in the population data.

The approach based on the theory of large samples looks at the sample data from the macroscopic level, meaning sample averages and sample standard deviations. Data profiling explores the data set from the microscopic level, meaning the vertical and the horizontal correlation coefficients. The data profiling method adds new information which allows for an efficient and accurate detection of systemic influence.

To state this formally, let $L^0(\Omega, A, P, \mathbb{R})$ be the space of almost surely random sets, where ($\Omega$) is the set of all random sets, and ($A$) is a subset with σ-algebra. For brevity, write

$$\Delta_n = \left| \frac{\max v}{\max h} - \frac{\sigma_n\sqrt{n}}{\sigma_o} \right|.$$

It can be stated that the sample data $y_{ij} \in L^0$, $i = 1, \dots, n$; $j = 1, \dots, m$, exhibits systemic influence if and only if, with $\max v / \max h \in L^0$ and $\sigma_n\sqrt{n}/\sigma_o \in L^0$, the probability $p\left(\frac{\max v}{\max h} = \frac{\sigma_n\sqrt{n}}{\sigma_o} = 1\right)$ exists and is equal to 1, or

$$P(\Delta_n > t) \to 0 \quad \text{for all } t > 0,$$

where ($t$) is some constant. Data profiling assigns to the space ($L^0$) a metric ($d$) which is associated with the probability of convergence. Let

$$d\left(\frac{\max v}{\max h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) = E\left[\frac{\Delta_n}{1 + \Delta_n}\right];$$

then it is easy to notice that ($d$) represents a distance in $L^0$, and is invariant under any transformation (no matter which subset of random sets is used). If $\sigma_n\sqrt{n}/\sigma_o$ and $\max v / \max h$ are true representations of data at the two levels (macroscopic and microscopic respectively), then one would expect

$$d\left(\frac{\max v}{\max h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) \to 0 \text{ as } n \to \infty \quad \text{if and only if} \quad P(\Delta_n > t) \to 0 \text{ for all } t > 0.$$

In fact the convergence of ($d$) to zero causes the convergence of the probability. This is due to the Bienaymé-Tchebychev or Markov inequality and the fact that $t \mapsto t/(1+t)$ is increasing, which together give $P(\Delta_n > t) \le \frac{1+t}{t}\, d$.

Inversely, if $P(\Delta_n > t) \to 0$ holds, then for any ($\varepsilon > 0$) choose $t > 0$ with $\frac{t}{1+t} \le \frac{\varepsilon}{2}$, and $n_0$ such that $P(\Delta_n > t) \le \frac{\varepsilon}{2}$ for $n \ge n_0$. Then for $n \ge n_0$,

$$d\left(\frac{\max v}{\max h}, \frac{\sigma_n\sqrt{n}}{\sigma_o}\right) = \int_{\Delta_n > t} \frac{\Delta_n}{1 + \Delta_n}\, dP + \int_{\Delta_n \le t} \frac{\Delta_n}{1 + \Delta_n}\, dP \le P(\Delta_n > t) + \frac{t}{1+t} \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.$$

The conclusion is that $\max v / \max h$ assures almost surely the detection of systemic influence in a data set.

3. The Approach Based on the Method of Analysis of Variance

This method finds the probability that any variation between averages is purely random, [18]. An outline of the procedure follows:

$n_s$ = number of entries in column ($s$)
$N = \sum n_s$ is the total number of entries
$a$ = a reasonable estimate of $\bar{y}$, with $h = \bar{y} - a$

The mean variance between columns is calculated:

$$V_s = \frac{\sum_{j=0}^{m} n_s (y_s - a)^2 - N h^2}{m - 1}, \qquad n_1 = m - 1 \qquad (5)$$

The residual variance is calculated using the formula:

$$V_r = \frac{\sum (y - a)^2 - \sum_{j=0}^{m} n_s (y_s - a)^2}{N - m} \qquad (6)$$

Let $Z = \frac{1}{2}\log_e \frac{V_s}{V_r}$; then the probability of no systematic influence is found from tables, [18-20], given ($Z$) and the degrees of freedom $n_1$ and $n_2$. The method of analysis of variance looks into the variability between column means and the variability of individual observations from the corresponding mean within each column. This method has a shortcoming in that it does not look at the corresponding correlations between individual observations in each column and across groups. Data profiling allows for a better analysis and detection of internal or systematic variability. To account for data profiling, the formula for ($Z$) should be modified in the following way:

$$Z = \frac{1}{2}\log_e \frac{V_s}{V_r} + \log_e \frac{\max v}{\max h}.$$

The addition of the log of the ratio of vertical and horizontal correlation coefficients has one major effect: it either augments the value of ($Z$), which lowers the probability of purely random variation and so raises the probability of systematic influence, or it lowers the value of ($Z$), which has the opposite effect.

4. The Approach Based on Multiple Regression

Up to this point, we have been dealing with one independent variable only. [17] generalizes the method of analysis of variance to many independent variables which may be mutually correlated. In other words, the group averages are given as a multiple regression on ($K$) independent variables. He thus modifies the mean variance between columns ($V_s$) and the residual variance ($V_r$) using the multiple regression method.
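The analysis of variance procedure of Section 3 (Equations (5) and (6)) can be sketched in a few lines. This is an illustration under stated assumptions, not McEwen's or Fisher's own computation: $Z$ is taken to be Fisher's z, half the natural logarithm of the variance ratio, which is what the tables of [18] are indexed by; the working origin `a` is arbitrary; and the function name `variance_ratio_z` is hypothetical.

```python
import math


def variance_ratio_z(groups, a=0.0):
    """Return (V_s, V_r, Z) for a list of groups (columns) of observations.

    a is the working origin ("a reasonable estimate of the grand mean");
    the result does not depend on its value.
    """
    m = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    h = grand_mean - a

    # Sum over columns of n_s * (y_s - a)^2, shared by Equations (5) and (6).
    between_about_a = sum(len(g) * (sum(g) / len(g) - a) ** 2 for g in groups)
    # Sum over all observations of (y - a)^2.
    total_about_a = sum((y - a) ** 2 for g in groups for y in g)

    v_s = (between_about_a - N * h ** 2) / (m - 1)   # Equation (5), n_1 = m - 1
    v_r = (total_about_a - between_about_a) / (N - m)  # Equation (6), n_2 = N - m
    z = 0.5 * math.log(v_s / v_r)  # assumption: Fisher's z
    return v_s, v_r, z


if __name__ == "__main__":
    print(variance_ratio_z([[1, 2, 3], [4, 5, 6], [7, 8, 9]], a=5.0))
```

Choosing `a` close to the grand mean only keeps the intermediate sums small, which mattered for hand computation; the two variances are unchanged by it.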
The outline of the procedure is as follows:

$M$ = total number of columns (groups) to be averaged with respect to all the independent variables
$K$ = number of independent variables
${}_k y'$ = value of an observation corrected with respect to all except the $k$th independent variable
$y_s^1, \dots, y_s^K$ = corrected group averages of the dependent variable with respect to independent variables $1, \dots, K$
$\bar{y}^1, \dots, \bar{y}^K$ = weighted averages of $y_s^1, \dots, y_s^K$

The overall variance between columns is calculated:

$$V_s = \frac{\sum_{j=0}^{m} n_s (y_s^1 - \bar{y}^1)^2 + \sum_{j=0}^{m} n_s (y_s^2 - \bar{y}^2)^2 + \dots + \sum_{j=0}^{m} n_s (y_s^K - \bar{y}^K)^2}{M - K}, \qquad n_1 = M - K \qquad (7)$$

The residual variance is calculated using the formula:

$$V_r = \frac{\sum ({}_k y' - y_s^k)^2}{N + K - M - 1} \qquad (8)$$

The probability that the data are random is obtained as before from $Z = \frac{1}{2}\log_e \frac{V_s}{V_r}$ and the degrees of freedom $n_1$ and $n_2$; the probability of systematic influence is thus one minus that probability. The shortcoming of the generalized method of the analysis of variance is that although it tries to look more closely at individual data sets, it does not look at the strength of the relationship between individual data points. Data profiling in this case allows for adjusting for this shortcoming. The vertical and horizontal correlation coefficients, $v$ and $h$, are modified to adjust to the ($K$) independent variables. Let $v^1, \dots, v^K$ be the vertical correlation coefficients calculated for the $K$ independent variables, and $h^1, \dots, h^K$ be the horizontal correlation coefficients calculated for the $K$ independent variables. New correlation coefficients are introduced: $r_v^1, \dots, r_v^K$ represent residual vertical correlation coefficients of (${}_k y'$) adjusted observations, and $r_h^1, \dots, r_h^K$ represent residual horizontal correlation coefficients of (${}_k y'$) adjusted observations. For each independent variable ($k$), the vertical and horizontal correlation coefficients, $v^k$ and $h^k$, are calculated as before:

$$v_j^k = \frac{\sum_{i=0}^{n} (y_{ij}^k - \bar{y}_j^k)(\bar{y}_i^k - \bar{y}^k)}{\sum_{i=0}^{n} (y_{ij}^k - \bar{y}^k)^2}, \quad \text{for } j = 1, \dots, m; \text{ and } k = 1, \dots, K \qquad (9)$$

and

$$h_i^k = \frac{\sum_{j=0}^{m} (y_{ji}^k - \bar{y}_i^k)(\bar{y}_j^k - \bar{y}^k)}{\sum_{j=0}^{m} (y_{ji}^k - \bar{y}^k)^2}, \quad \text{for } i = 1, \dots, n; \text{ and } k = 1, \dots, K \qquad (10)$$

The overall variance between columns is then modified as follows:

$$V_s = \frac{\frac{\max v^1}{\max h^1}\sum_{j=0}^{m} n_s (y_s^1 - \bar{y}^1)^2 + \frac{\max v^2}{\max h^2}\sum_{j=0}^{m} n_s (y_s^2 - \bar{y}^2)^2 + \dots + \frac{\max v^K}{\max h^K}\sum_{j=0}^{m} n_s (y_s^K - \bar{y}^K)^2}{M - K} \qquad (11)$$

In order to modify the residual variance, residual vertical and horizontal correlation coefficients are calculated using (${}_k y'$), the value of an observation which is corrected to constant values of all the rest except the $k$th independent variable, as given in McEwen's generalized method of the analysis of variance, [17]. The residual vertical correlation coefficients are calculated:

$$r_{v,j}^k = \frac{\sum_{i=0}^{n} ({}_k y'_{ij} - \bar{y}_j^k)(\bar{y}_i^k - \bar{y}^k)}{\sum_{i=0}^{n} ({}_k y'_{ij} - \bar{y}^k)^2}, \quad \text{for } j = 1, \dots, m; \text{ and } k = 1, \dots, K \qquad (12)$$

and the residual horizontal correlation coefficients are given by:

$$r_{h,i}^k = \frac{\sum_{j=0}^{m} ({}_k y'_{ji} - \bar{y}_i^k)(\bar{y}_j^k - \bar{y}^k)}{\sum_{j=0}^{m} ({}_k y'_{ji} - \bar{y}^k)^2}, \quad \text{for } i = 1, \dots, n; \text{ and } k = 1, \dots, K \qquad (13)$$

The residual variance is modified as:

$$V_r = \frac{\max r_v^k}{\max r_h^k} \cdot \frac{\sum ({}_k y' - y_s^k)^2}{N + K - M - 1} \qquad (14)$$

The probability that the data exhibit systematic influence is obtained using $Z = \frac{1}{2}\log_e \frac{V_s}{V_r}$ and the degrees of freedom ($n_1$) and ($n_2$), as already explained.

5. An Example: Sunspot Numbers

In this section the validity of the improvement in the form of data profiling is tested. For this purpose the data set used in [17] is revisited, and the probability of the existence of systematic influences in the data is calculated once with the analysis of variance method, as already demonstrated in [17], and once with the modified version. Consider the data corresponding to sunspot numbers arranged with respect to a trial cycle of length 11 years, i.e. from 1749 to 1826. The sunspot numbers exceeding 99 are excluded. The data are shown in matrix form in Table 1. The averages $y_s$ are: 52.5, 43.2, 26.0, 21.5, 13.5, 6.5, 7.5, 12.7, 24.5, 30.5, 43.2. The number of columns is ($m = 11$). The number of observations in each column is ($n_s = 4$).
The number of observations of the dependent variable is ($N = 44$). The overall average is $\bar{y} = 25.59$. The degrees of freedom $n_1$ and $n_2$ are respectively $11 - 1 = 10$ and $44 - 11 = 33$.

Table 1. Sunspot numbers arranged with respect to a trial cycle of 11 years, 1749-1826.

 s    1749-1759   1794-1804   1805-1815   1816-1826
 1       81          41          42          46
 2       83          21          28          41
 3       48          16          10          30
 4       48           6           8          24
 5       31           4           3          16
 6       12           7           0           7
 7       10          15           1           4
 8       10          34           5           2
 9       32          45          12           9
10       48          43          14          17
11       54          48          35          36

The averages $y_s$ decrease up to the 6th column, and increase from then on. To calculate the probability that the sample data are indicative of the population data, and thus that there are cyclic effects, the ($Z$) statistic is calculated. The statistic ($Z$) is calculated using the mean variance between the columns ($V_s$) and the residual variance ($V_r$): $V_s = 956.32$ and $V_r = 265.0$. The statistic is

$$Z = \frac{1}{2}\log_e \frac{V_s}{V_r} = \frac{1}{2}\log_e \frac{956.32}{265.0} = 0.6408.$$

The values of ($Z$) corresponding to the 20, 5, 1, and 0.1 percent points are respectively 0.19, 0.38, 0.54, and 0.71. Since ($Z = 0.64$) is greater than 0.54, the probability of random effects is less than 0.01, which puts the probability of systematic influence above 0.99. Though the results seem to point in favor of systematic influence or the existence of cycles, the evidence is not conclusive. To find out whether the sample obtained implies a cyclic appearance of sunspots, the data profiling method is tested. The vertical and horizontal correlation coefficients are calculated using Equations (3) and (4). The vertical averages $\bar{y}_j$, $j = 1, 2, 3, 4$, are calculated as $\bar{y}_j = (41.5, 25.4, 14.3, 21.0)$. The two statistics ($v$) and ($h$) are calculated:

$v = (13.82, 5.40, 4.06, 1.41, 1.12, 1.85, 1.13, 3.85, 6.00, 12.22, 15.99)$
$h = (3280.54, 54.19, 1214.77, 1322.41)$

The maxima of ($v$) and ($h$) are calculated as well: $\max v = 15.996053$ and $\max h = 54.188689$. The ratio is $\max v / \max h = 0.2951917$. The value $\log_e(\max v / \max h)$ is equal to 0.2586587. The modified value of the statistic ($Z$) is then obtained by adding the two values,

$$Z = \frac{1}{2}\log_e \frac{V_s}{V_r} + \log_e \frac{\max v}{\max h},$$

which gives $(0.6408 + 0.2587) = 0.8995$.
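The analysis of variance part of this example can be recomputed from Table 1. The sketch below makes two assumptions that should be read as such: $Z$ is taken to be Fisher's z (half the natural log of the variance ratio), and the reported adjustment 0.2586587 is reproduced as $\log_e(1 + \max v / \max h)$, since the plain natural log of 0.2951917 is negative. The names `fisher_z` and `profiled_z` are hypothetical, and the values of $\max v$ and $\max h$ are taken as reported rather than recomputed.

```python
import math

# Table 1: one row per position s in the 11-year trial cycle,
# one entry per cycle (1749-1759, 1794-1804, 1805-1815, 1816-1826).
SUNSPOTS = [
    [81, 41, 42, 46],
    [83, 21, 28, 41],
    [48, 16, 10, 30],
    [48, 6, 8, 24],
    [31, 4, 3, 16],
    [12, 7, 0, 7],
    [10, 15, 1, 4],
    [10, 34, 5, 2],
    [32, 45, 12, 9],
    [48, 43, 14, 17],
    [54, 48, 35, 36],
]


def fisher_z(groups):
    """Return (V_s, V_r, Z) with Z = 0.5 * ln(V_s / V_r) (assumed Fisher's z)."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / N
    # Mean variance between columns, n_1 = m - 1 degrees of freedom.
    v_s = sum(len(g) * (ys - grand) ** 2 for g, ys in zip(groups, means)) / (m - 1)
    # Residual variance within columns, n_2 = N - m degrees of freedom.
    v_r = sum((y - ys) ** 2 for g, ys in zip(groups, means) for y in g) / (N - m)
    return v_s, v_r, 0.5 * math.log(v_s / v_r)


def profiled_z(z, max_v, max_h):
    """Add the data-profiling term to Z.

    Assumption: log1p is used because ln(1 + max_v / max_h) matches the
    reported value 0.2586587; ln(max_v / max_h) itself would be negative.
    """
    return z + math.log1p(max_v / max_h)


if __name__ == "__main__":
    v_s, v_r, z = fisher_z(SUNSPOTS)
    print(v_s, v_r, z)
    print(profiled_z(0.6408, 15.996053, 54.188689))
```

Computed from unrounded column means, the variances and $Z$ come out close to, but not exactly at, the reported 956.32, 265.0 and 0.6408; the small differences are consistent with rounding in the tabulated averages.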
Since the value (0.8995) is higher than (0.71), it indicates that the probability that the population data are random is less than 0.001, i.e. below the 0.1 percent point, indicating with certainty that the number of sunspots is cyclic. The existence of systemic influence is indisputable. Applying the approach based on the method of large samples, the statistic $\sigma_n\sqrt{n}/\sigma_o = 8$ is obtained. There is a large discrepancy between this statistic and the adjustment proposed in Section 2, $1 - \max v / \max h = 0.71$. The statistic $\sigma_n\sqrt{n}/\sigma_o$ is thus inapplicable. The statistic $Z = \frac{1}{2}\log_e \frac{V_s}{V_r} = 0.6669$, calculated using the approach based on multiple regression, is a slight improvement over the statistic obtained using the method of analysis of variance ($Z = 0.6408$). Using the data profiling method, the statistic $Z$ is corrected to ($Z = 1.0$). As in the case of the analysis of variance method, it can be stated with absolute certainty that there is indeed a systemic influence in the sample data.

6. Conclusion

The objective is to derive conclusions about the randomness of observations in a population given that the sample data set exhibits strict regularities. Three methods are analyzed and their shortcomings are indicated. An improvement to the three methods is suggested and formulated. The improvement comes in the form of data profiling, which in essence is the integration of vertical and horizontal correlation coefficients in the equations. Through a simple example, it is shown that data profiling is indeed
a complement of the original formulation.

REFERENCES

[1] L. Besson, “On the Comparison of Meteorological Data with Results of Chance,” Journal of Monthly Weather Review, Vol. 48, 1920, pp. 89-94.
[2] H. W. Clough, “A Statistical Comparison of Meteorological Data with Data of Random Occurrence,” Journal of Monthly Weather Review, Vol. 49, No. 3, 1921, pp. 124-132. doi:10.1175/1520-0493(1921)49<124:ASCOMD>2.0.CO;2
[3] W. L. Crum, “A Measure of Dispersion for Ordered Series,” Journal of American Statistical Association Quarterly Publication, Vol. 17, 1921, pp. 969-975.
[4] E. W. Woolard, “On the Mean Variability in Random Series,” Journal of Monthly Weather Review, Vol. 53, No. 3, 1925, pp. 107-111. doi:10.1175/1520-0493(1925)53<107:OTMVIR>2.0.CO;2
[5] H. Working, “A Random Difference Series for Use in the Analysis of Time Series,” Journal of American Statistical Association Quarterly Publication, Vol. 24, 1934, pp. 11-24. doi:10.1080/01621459.1934.10502683
[6] W. O. Kermack and A. G. McKendrick, “A Measure of Dispersion for Ordered Series,” Journal of the Proceedings of the Royal Society Edinburgh, Vol. 57, 1937, pp. 228-240.
[7] D. Alter, “A Group or Correlation Periodogram with Application to the Rainfall of the British Isles,” Journal of Monthly Weather Review, Vol. 55, No. 210, 1927, pp. 263-266. doi:10.1175/1520-0493(1927)55<263:AGOCPW>2.0.CO;2
[8] C. Chree, “Periodicities Solar and Meteorological,” Journal of the Royal Meteorological Society, Vol. 85, 1924, pp. 87-97.
[9] J. B. Cox, “Periodic Fluctuations of Rainfall in Hawaii,” Proceedings of the American Society of Civil Engineers, Vol. 87, 1924, pp. 461-491.
[10] E. L. Dodd, “The Probability Law for the Intensity of a Trial Period with Data Subject to the Gaussian Law,” Bulletin of the American Mathematical Association Society, Vol. 33, 1927, pp. 681-684. doi:10.1090/S0002-9904-1927-04451-2
[11] S. Kuznets, “Random Events and Cyclical Oscillations,” Journal of the American Statistical Association, Vol. 24, 1929, pp. 258-275. doi:10.1080/01621459.1929.10503048
[12] R. W. Powell, “Successive Integration as a Method of Finding Long Period Cycles,” Annals of the Mathematical Statistics, Vol. 1, No. 2, 1930, pp. 123-136. doi:10.1214/aoms/1177733127
[13] K. Stumpff, “Grundlagen und Methoden der Periodenforschung,” Springer, Berlin, 1925.
[14] G. T. Walker, “On Periodicity - Criteria for Reality,” Memorandum of the Royal Meteorological Society, Vol. 3, No. 25, 1930, pp. 97-101.
[15] C. F. McEwen and E. L. Michel, “The Functional Relation of One Variable to Each of a Number of Correlated Variables Determined by a Method of Successive Approximations to Group Averages,” Proceedings of the American Academy of Arts and Sciences, Vol. 55, No. 8, 1919, pp. 89-133.
[16] C. F. McEwen, “The Minimum Temperature, a Function of the Dew Point and Humidity, at 5 p.m. of the Preceding Day; Method of Determining This Function by Successive Approximations to Group Averages,” Monthly Weather Review Supplement, No. 16, 1920, pp. 64-69.
[17] C. F. McEwen, “The Reality of Regularities Indicated in Sequences of Observations,” Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, San Francisco, 13-18 August 1945, pp. 229-238.
[18] R. A. Fisher, “Statistical Methods for Research Workers,” 4th Edition, Biological Monographs and Manuals, London, 1932.
[19] G. U. Yule and M. G. Kendall, “An Introduction to the Theory of Statistics,” 11th Edition, Charles Griffin and Company Ltd., London, 1937.
[20] R. A. Fisher and F. Yates, “Statistical Tables for Biological, Agricultural, and Medical Research,” Oliver and Boyd, London, 1938.
