Working with Aggregate Data: An Excel Macro for Pairwise Comparison Using Z Test for Two Proportions ()
1. Introduction
Aggregate data refers to numerical or non-numerical information that is: 1) collected from multiple sources and/or on multiple measures, variables, or individuals; and 2) compiled into data summaries or summary reports, typically for the purposes of public reporting or statistical analysis―i.e., examining trends, making comparisons, or revealing information and insights that would not be observable when data elements are viewed in isolation [1] . Because the unit of analysis in aggregated data is no longer at the individual entity level, researchers must exercise care in trying to conduct correlational or inferential statistics to avoid spurious results.
Aggregate data might still yield important information by moving to the next higher unit of analysis that provides a grouping unifier. Various versions of Chi Square, time series and proportional analyses may still be performed on aggregate datasets where a proper unifier exists.
Proportional aggregate analysis is the focus of this paper. A method and an Excel VBA macro is demonstrated that compares and contrasts a spreadsheet (above row to row) for unique and non-repeating pairwise row comparisons. This procedure incorporates the familiar Z-Test for Two Proportions to test paired data for statistical significance at α ≤ 0.05 [2] .
2. Fundamental Principles
The VBA macro uses an “up one row”, “down one row” iteration that populates the variables for pi and pj. The built-in Z-test for proportions has a two-tailed null hypothesis of no statistically significant difference between two proportions, H0: Pi = Pj. The alternate hypothesis is Ha: Pi ≠ Pj. There are three assumptions inherent in this procedure: 1) sampling independence; 2) sufficient size (≥5; the macro will reject if violated); and 3) randomness of selection. A pooled proportion is used to compute the standard error of the sampling distribution, using the individual proportions, pi and pj and the associated population for each, ni and nj. The test statistic is a Z-score which is the ratio of the absolute proportion difference divided by the standard error. Significance is determined as Z ≥ 1.96, the two-tailed critical value for a normal distribution.
3. An Illustration
For illustration purposes, a mock research question was created that asked if there were any statistically significant between-county differences in the proportion of registered voters for the Green Party within the state of Arizonain January 2017 [3] . After minor cleansing, the data were inserted into a blank Excel macro-enabled (pairwise.xlsm) spreadsheet which incorporates the pairwise macro described in this report. The order of insertion must be followed exactly (Group Name, Sample Size and Total) starting in cell “A1” which is required by the macro (Figure 1).
The goal for this mock research question was to determine if there were any statistically significant proportional differences of Green Party registered voters between compared counties. For example, is the proportion of Green Party registered voters in Apache County significantly different from the proportions of Green Party registered voters in other counties? How many matched pairings of county-county data would be significantly different? This information could be pursued to investigate trends and patterns.
Because of the requirement of the Z-test for proportional differences, the minimum number of registered voters per county was 5. Only one county, Greenlee,
Figure 1. The correct order of data insertion starting at Cell “A1”. Note: As of January 2017 per https://www.azsos.gov/elections/voter-registration-historical-election-data/voter-registration-counts
failed to meet the minimum sample size and all of its combinations were eliminated.
4. Results
Output begins in cell “F1” and continues for k(k − 1)/2 rows. For the fifteen rows illustrated, an output of 105 rows is generated (Table 1). The output grows exponentially and while the macro can accommodate very large datasets, there is a practical output limitation. For example, 50 rows of input would create 1225 matched pairs of unique data. The size of the input range is the researcher’s choice.
This exercise was primarily for illustration but it did use real data which produced real results. Of the 105 county-county combinations, 59 (56%) showed statistically significant differences. Questions need to be asked of the data so that the differences in Green Party registered voters could perhaps be explained. For those in the social or political sciences, these differences might be important to pursue.
5. The Macro Methodology
The VBA macro uses an “up one row”, “down one row” iteration that populates the upper row/lower row variables with their respective proportions, P1 and P2. With these values, the null hypothesis (P1 = P2) can be tested using the following standard proportion equations.
1) The pooled proportion:
Table 1. Results of pairwise comparison of between-county Z-test of proportions for green party registered voters.
Note: All comparisons using Greenlee county were rejected because of small sample size (less than or equal to 5).
where:
p = the pooled sample proportion,
pi = first proportion,
pj = second proportion,
ni = population size associated with the first proportion,
nj = population size associated with the second proportion.
2) The standard error of the weighted samples:
where:
sepi-pj = the standard error,
p = the weighted estimate of two populations,
ni = sample size associated with the first proportion,
nj = sample size associated with the second proportion.
3) The determination of the Z-score:
where:
Z = the Z-score,
pi = first proportion,
pj = second proportion,
sepi-pj = the standard error.
The null hypothesis is rejected if the Z-score exceeds 1.96, the two-tailed critical value that is associated with a p-value ≤ 0.05.
6. Conclusions
An Excel macro procedure has been demonstrated as a screening tool to reveal patterns within aggregate data. It creates unique within-column pairwise comparisons and tests the data for proportional statistical significance. This method could be applied where aggregated data is available that includes, as a minimum, the named group, a proportion or count of a desired variable and a total for each row. The exponential growth of the output as the number of rows (k) increases will be a practical limiting factor.
The Excel macro can be saved as an Excel macro file (*.xlsm) and various internet references can be accessed for instructions for using an Excel macro files as an add-in.
This macro is also available for download at http://www.viclandry.com/pairwise-comparison.html
The VBA Macro
Sub Pairwise()
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim lastrow As Long
Dim answer As Variant
Dim n1 As Variant
Dim n2 As Variant
Dim p As Variant
Dim p1 As Variant
Dim p2 As Variant
Dim z As Variant
Dim se As Variant
Dim r As Variant
MsgBox ("You must have HEADERS with category names in Column A; place data in Column B; place interval COUNTS in Column C")
lastrow = (Cells(Rows.Count, "A").End(xlUp).Row)-1
Range("f1").Value = "Compared Groups"
Range("f1").Offset(0, 1) = "Group 1"
Range("f1").Offset(0, 2).Value = "N1"
Range("f1").Offset(0, 3).Value = "P1"
Range("f1").Offset(0, 4).Value = "Group 2"
Range("f1").Offset(0, 5).Value = "N2"
Range("f1").Offset(0, 6).Value = "P2"
Range("f1").Offset(0, 7).Value = "Z-Score"
Range("f1").Offset(0, 8).Value = "Result"
For i = 1 To lastrow
For j = i + 1 To lastrow
k = k + 1
Range("f1").Offset(k, 0).Value = (Range("a1").Offset(i, 0).Value & " - " & Range("a1").Offset(j, 0).Value) 'first row header
p1 = Range("a1").Offset(i, 1).Value/Range("a1").Offset(i, 2).Value 'value for first proportion
p2 = Range("a1").Offset(j, 1).Value/Range("a1").Offset(j, 2).Value 'value for second proportion
r = (Abs(p1 - p2)) 'find absolute difference
n1 = Range("a1").Offset(i, 2).Value
n2 = Range("a1").Offset(j, 2).Value
p = ((p1 * n1) + (p2 * n2))/(n1 + n2)
se = Sqr((p * (1 - p)) * ((1/n1) + (1/n2)))
z = r/se
Range("f1").Offset(k, 1).Value = Range("a1").Offset(i, 1).Value 'first count
Range("f1").Offset(k, 2).Value = n1 'first total
Range("f1").Offset(k, 3).Value = Round(p1, 4) 'first proportion
Range("f1").Offset(k, 4).Value = Range("a1").Offset(j, 1).Value 'second count
Range("f1").Offset(k, 5).Value = n2 'second total
Range("f1").Offset(k, 6).Value = Round(p2, 4) 'second proportion
Range("f1").Offset(k, 7).Value = Round(z, 4) 'z score
If z > 1.96 Then
Range("f1").Offset(k, 8).Value = "Sig."
Else
Range("f1").Offset(k, 8).Value = "NS"
End If
If n1 * p1 < 6 Or n2 * p2 < 6 Then
Range("f1").Offset(k, 8).Value = "N<=5"
End If
Next j
Next i
Range("f1:m1").EntireColumn.AutoFit
End Sub