<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">OALibJ</journal-id>
      <journal-title-group>
        <journal-title>Open Access Library Journal</journal-title>
      </journal-title-group>
      <issn pub-type="epub">2333-9705</issn>
      <publisher>
        <publisher-name>Scientific Research Publishing</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.4236/oalib.1103927</article-id>
      <article-id pub-id-type="publisher-id">OALibJ-79715</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Articles</subject>
        </subj-group>
        <subj-group subj-group-type="Discipline-v2">
          <subject>Biomedical &amp; Life Sciences</subject>
          <subject>Business &amp; Economics</subject>
          <subject>Chemistry &amp; Materials Science</subject>
          <subject>Computer Science &amp; Communications</subject>
          <subject>Earth &amp; Environmental Sciences</subject>
          <subject>Engineering</subject>
          <subject>Medicine &amp; Healthcare</subject>
          <subject>Physics &amp; Mathematics</subject>
          <subject>Social Sciences &amp; Humanities</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Working with Aggregate Data: An Excel Macro for Pairwise Comparison Using Z Test for Two Proportions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" xlink:type="simple">
          <name name-style="western">
            <surname>Victor</surname>
            <given-names>L. Landry</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">
            <sub>1</sub>
          </xref>
          <xref ref-type="corresp" rid="cor1">
            <sup>*</sup>
          </xref>
        </contrib>
      </contrib-group>
      <aff id="aff1">
        <label>1</label>
        <addr-line>College of Doctoral Studies, Grand Canyon University, Phoenix, AZ, USA</addr-line>
      </aff>
      <author-notes>
        <corresp id="cor1">
          * E-mail: <email>victor.landry@my.gcu.edu</email>
        </corresp>
      </author-notes>
      <pub-date pub-type="epub">
        <day>11</day>
        <month>10</month>
        <year>2017</year>
      </pub-date>
      <volume>04</volume>
      <issue>10</issue>
      <fpage>1</fpage>
      <lpage>9</lpage>
      <history>
        <date date-type="received">
          <day>8</day>
          <month>September</month>
          <year>2017</year>
        </date>
        <date date-type="rev-recd">
          <day>16</day>
          <month>October</month>
          <year>2017</year>
        </date>
        <date date-type="accepted">
          <day>19</day>
          <month>October</month>
          <year>2017</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc.</copyright-statement>
        <copyright-year>2014</copyright-year>
        <license>
          <license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p>
        </license>
      </permissions>
      <abstract>
        <p>
          A Visual Basic for Applications (VBA) Excel macro was created to perform pairwise, two-sample Z-tests of within-column proportions for k data rows in an Excel spreadsheet. By iteration, the macro generates the Z-score for each of the k(k − 1)/2 unique, non-repeating within-column comparisons and tests the null hypothesis against a two-tailed Z-score critical value. This within-column process is useful for extracting potential meaning from large aggregate columnar data. The procedure was demonstrated using aggregate summary data in the public domain, acquired from the internet. The VBA macro is provided and is also available at the author’s website.
        </p>
      </abstract>
      <kwd-group>
        <kwd>Pairwise Comparison</kwd>
        <kwd>Two-Proportion Z-Test</kwd>
        <kwd>Aggregated Data</kwd>
        <kwd>Internet Data</kwd>
        <kwd>Excel VBA Macro</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="s1">
      <title>1. Introduction</title>
      <p>
        Aggregate data refers to numerical or non-numerical information that is: 1) collected from multiple sources and/or on multiple measures, variables, or individuals; and 2) compiled into data summaries or summary reports, typically for public reporting or statistical analysis, i.e., examining trends, making comparisons, or revealing information and insights that would not be observable when data elements are viewed in isolation [<xref ref-type="bibr" rid="scirp.79715-ref1">1</xref>]. Because the unit of analysis in aggregated data is no longer the individual entity, researchers must take care when applying correlational or inferential statistics to avoid spurious results.
      </p>
      <p>Aggregate data can still yield important information when the analysis moves to the next higher unit of analysis, one that provides a grouping unifier. Various forms of chi-square, time-series and proportional analyses may still be performed on aggregate datasets where a proper unifier exists.</p>
      <p>
        Proportional aggregate analysis is the focus of this paper. A method and an accompanying Excel VBA macro are demonstrated that compare each spreadsheet row with every row below it, producing unique, non-repeating pairwise row comparisons. The procedure uses the familiar Z-test for two proportions to test each pair for statistical significance at α = 0.05 [<xref ref-type="bibr" rid="scirp.79715-ref2">2</xref>].
      </p>
    </sec>
    <sec id="s2">
      <title>2. Fundamental Principles</title>
      <p>
        The VBA macro uses an “up one row”, “down one row” iteration that populates the variables for p<sub>i</sub> and p<sub>j</sub>. The built-in Z-test for proportions has a two-tailed null hypothesis of no difference between the two proportions, H<sub>0</sub>: P<sub>i</sub> = P<sub>j</sub>. The alternative hypothesis is H<sub>a</sub>: P<sub>i</sub> ≠ P<sub>j</sub>. Three assumptions are inherent in this procedure: 1) sampling independence; 2) sufficient size (a count greater than 5 in each group; the macro flags any comparison that violates this); and 3) randomness of selection. A pooled proportion is used to compute the standard error of the sampling distribution from the individual proportions, p<sub>i</sub> and p<sub>j</sub>, and the sample size associated with each, n<sub>i</sub> and n<sub>j</sub>. The test statistic is a Z-score, the ratio of the absolute difference between the two proportions to the standard error. A difference is declared significant when Z exceeds 1.96, the two-tailed critical value of the standard normal distribution at α = 0.05.
      </p>
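      <p>The calculation described above can be sketched outside Excel. The following is a minimal Python illustration, not the author's VBA macro; the function name <italic>two_proportion_z</italic> and the example counts are assumptions introduced here for illustration only.</p>
      <preformat><![CDATA[
```python
import math

def two_proportion_z(x_i, n_i, x_j, n_j):
    """Two-tailed two-proportion Z-test on counts x and totals n.

    Returns (z, significant), where significant means z > 1.96.
    """
    p_i = x_i / n_i                      # first proportion
    p_j = x_j / n_j                      # second proportion
    pooled = (p_i * n_i + p_j * n_j) / (n_i + n_j)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_i + 1 / n_j))
    z = abs(p_i - p_j) / se              # absolute difference over standard error
    return z, z > 1.96

# Hypothetical counts, not taken from the Arizona data.
z, significant = two_proportion_z(30, 1000, 60, 1500)
print(round(z, 2), significant)  # prints: 1.31 False
```
]]></preformat>
      <p>In this hypothetical case the proportions 0.03 and 0.04 give a Z-score below 1.96, so the null hypothesis of equal proportions would not be rejected.</p>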
    </sec>
    <sec id="s3">
      <title>3. An Illustration</title>
      <p>
        For illustration purposes, a mock research question was created that asked whether there were any statistically significant between-county differences in the proportion of voters registered with the Green Party within the state of Arizona in January 2017 [<xref ref-type="bibr" rid="scirp.79715-ref3">3</xref>]. After minor cleansing, the data were inserted into a blank macro-enabled Excel spreadsheet (pairwise.xlsm) that incorporates the pairwise macro described in this report. The macro requires the columns in exactly this order (Group Name, Sample Size and Total), starting in cell “A1” (<xref ref-type="fig" rid="fig1">Figure 1</xref>).
      </p>
      <p>The goal for this mock research question was to determine if there were any statistically significant proportional differences of Green Party registered voters between compared counties. For example, is the proportion of Green Party registered voters in Apache County significantly different from the proportions of Green Party registered voters in other counties? How many matched pairings of county-county data would be significantly different? This information could be pursued to investigate trends and patterns.</p>
      <p>Because the Z-test for two proportions requires a sufficient sample size, each county needed more than 5 Green Party registered voters. Only one county, Greenlee, failed to meet this minimum, and all of its combinations were eliminated.</p>
    </sec>
    <sec id="s4">
      <title>4. Results</title>
      <p>
        Output begins in cell “F1” and continues for k(k − 1)/2 rows. For the fifteen rows illustrated, an output of 105 rows is generated (<xref ref-type="table" rid="table1">Table 1</xref>). The output grows quadratically with the number of input rows, so although the macro can accommodate very large datasets, there is a practical limit on the output. For example, 50 rows of input would create 1225 unique matched pairs. The size of the input range is the researcher’s choice.
      </p>
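      <p>The k(k − 1)/2 row counts quoted above can be checked directly. The short Python sketch below is an illustration added here, not part of the macro; the helper name <italic>pair_count</italic> is an assumption. It compares the closed-form count with an explicit enumeration of unique pairs.</p>
      <preformat><![CDATA[
```python
from itertools import combinations

def pair_count(k):
    """Number of unique unordered pairs among k rows: k(k - 1)/2."""
    return k * (k - 1) // 2

for k in (15, 50):
    # Cross-check the closed form against an explicit enumeration of pairs.
    enumerated = sum(1 for _ in combinations(range(k), 2))
    print(k, pair_count(k), enumerated)
# prints:
# 15 105 105
# 50 1225 1225
```
]]></preformat>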
      <p>This exercise was primarily for illustration but it did use real data which produced real results. Of the 105 county-county combinations, 59 (56%) showed statistically significant differences. Questions need to be asked of the data so that the differences in Green Party registered voters could perhaps be explained. For those in the social or political sciences, these differences might be important to pursue.</p>
    </sec>
    <sec id="s5">
      <title>5. The Macro Methodology</title>
      <p>
        The VBA macro uses an “up one row”, “down one row” iteration that populates the upper row/lower row variables with their respective proportions, P<sub>1</sub> and P<sub>2</sub>. With these values, the null hypothesis (P<sub>1</sub> = P<sub>2</sub>) can be tested using the following standard proportion equations.
      </p>
      <p>1) The pooled proportion:</p>
      <p><disp-formula><tex-math>p = \frac{p_i n_i + p_j n_j}{n_i + n_j}</tex-math></disp-formula></p>
      <table-wrap-group id="1">
        <label>
          <xref ref-type="table" rid="table1">Table 1</xref>
        </label>
        <caption>
          <title>Results of pairwise comparison of between-county Z-test of proportions for Green Party registered voters</title>
        </caption>
        <table-wrap id="1_1">
        
        </table-wrap>
        <table-wrap id="1_2">
          
        </table-wrap>
        <table-wrap id="1_3">
          
        </table-wrap>
      </table-wrap-group>
      <p>Note: All comparisons involving Greenlee County were rejected because of small sample size (less than or equal to 5).</p>
      <p>where:</p>
      <p>p = the pooled sample proportion,</p>
      <p>
        p<sub>i</sub> = first proportion,
      </p>
      <p>
        p<sub>j</sub> = second proportion,
      </p>
      <p>
        n<sub>i</sub> = sample size associated with the first proportion,
      </p>
      <p>
        n<sub>j</sub> = sample size associated with the second proportion.
      </p>
      <p>2) The standard error of the weighted samples:</p>
      <p><disp-formula><tex-math>se_{p_i - p_j} = \sqrt{p(1 - p)\left[\frac{1}{n_i} + \frac{1}{n_j}\right]}</tex-math></disp-formula></p>
      <p>where:</p>
      <p>
        se<sub>p<sub>i</sub> − p<sub>j</sub></sub> = the standard error,
      </p>
      <p>p = the pooled sample proportion,</p>
      <p>
        n<sub>i</sub> = sample size associated with the first proportion,
      </p>
      <p>
        n<sub>j</sub> = sample size associated with the second proportion.
      </p>
      <p>3) The determination of the Z-score:</p>
      <p><disp-formula><tex-math>Z = \frac{|p_i - p_j|}{se_{p_i - p_j}}</tex-math></disp-formula></p>
      <p>where:</p>
      <p>Z = the Z-score,</p>
      <p>
        p<sub>i</sub> = first proportion,
      </p>
      <p>
        p<sub>j</sub> = second proportion,
      </p>
      <p>
        se<sub>p<sub>i</sub> − p<sub>j</sub></sub> = the standard error.
      </p>
      <p>The null hypothesis is rejected if the Z-score exceeds 1.96, the two-tailed critical value associated with α = 0.05 (a p-value below 0.05).</p>
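      <p>The three steps above, applied to every unique pair of rows, can be sketched as follows. This is a Python illustration under stated assumptions, not the author's VBA macro: the function name <italic>pairwise_z</italic>, the constants and the three example groups are hypothetical, and each input row is assumed to hold a group name, a count and a total. Like the macro, the sketch does not guard against a zero total or a zero standard error.</p>
      <preformat><![CDATA[
```python
import math
from itertools import combinations

CRITICAL_Z = 1.96  # two-tailed critical value used in the text
MIN_COUNT = 6      # counts below this (5 or fewer) are flagged

def pairwise_z(rows):
    """rows: list of (name, count, total). Returns one tuple per unique pair."""
    results = []
    for (name_i, x_i, n_i), (name_j, x_j, n_j) in combinations(rows, 2):
        p_i, p_j = x_i / n_i, x_j / n_j
        # Step 1: pooled proportion.
        pooled = (p_i * n_i + p_j * n_j) / (n_i + n_j)
        # Step 2: standard error of the difference.
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_i + 1 / n_j))
        # Step 3: Z-score.
        z = abs(p_i - p_j) / se
        if x_i < MIN_COUNT or x_j < MIN_COUNT:
            label = "N<=5"   # small-count flag overrides the test result
        elif z > CRITICAL_Z:
            label = "Sig."
        else:
            label = "NS"
        results.append((f"{name_i} - {name_j}", round(z, 4), label))
    return results

# Hypothetical groups and counts, for illustration only.
rows = [("A", 30, 1000), ("B", 60, 1500), ("C", 4, 900)]
for line in pairwise_z(rows):
    print(line)
```
]]></preformat>
      <p>With these hypothetical rows, the A - B pair is tested normally, while both pairs involving group C are flagged because its count is below the minimum.</p>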
    </sec>
    <sec id="s6">
      <title>6. Conclusions</title>
      <p>An Excel macro procedure has been demonstrated as a screening tool to reveal patterns within aggregate data. It creates unique within-column pairwise comparisons and tests the data for proportional statistical significance. This method could be applied wherever aggregated data are available that include, at a minimum, the named group, a proportion or count of a desired variable and a total for each row. The quadratic growth of the output as the number of rows (k) increases will be a practical limiting factor.</p>
      <p>The Excel macro can be saved as a macro-enabled Excel file (*.xlsm), and various internet references provide instructions for using an Excel macro file as an add-in.</p>
      <p>This macro is also available for download at http://www.viclandry.com/pairwise-comparison.html</p>
      <p>The VBA Macro</p>
      <p>Sub Pairwise()</p>
      <p>Dim i As Long</p>
      <p>Dim j As Long</p>
      <p>Dim k As Long 'output row counter; Long avoids Integer overflow beyond 32,767 pairs</p>
      <p>Dim lastrow As Long</p>
      <p>Dim answer As Variant</p>
      <p>Dim n1 As Variant</p>
      <p>Dim n2 As Variant</p>
      <p>Dim p As Variant</p>
      <p>Dim p1 As Variant</p>
      <p>Dim p2 As Variant</p>
      <p>Dim z As Variant</p>
      <p>Dim se As Variant</p>
      <p>Dim r As Variant</p>
      <p>MsgBox (&quot;You must have HEADERS with category names in Column A; place data in Column B; place interval COUNTS in Column C&quot;)</p>
      <p>lastrow = (Cells(Rows.Count, &quot;A&quot;).End(xlUp).Row)-1</p>
      <p>Range(&quot;f1&quot;).Value = &quot;Compared Groups&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 1) = &quot;Group 1&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 2).Value = &quot;N1&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 3).Value = &quot;P1&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 4).Value = &quot;Group 2&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 5).Value = &quot;N2&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 6).Value = &quot;P2&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 7).Value = &quot;Z-Score&quot;</p>
      <p>Range(&quot;f1&quot;).Offset(0, 8).Value = &quot;Result&quot;</p>
      <p>For i = 1 To lastrow</p>
      <p>For j = i + 1 To lastrow</p>
      <p>k = k + 1</p>
      <p>Range(&quot;f1&quot;).Offset(k, 0).Value = (Range(&quot;a1&quot;).Offset(i, 0).Value &amp; &quot; - &quot; &amp; Range(&quot;a1&quot;).Offset(j, 0).Value) 'first row header</p>
      <p>p1 = Range(&quot;a1&quot;).Offset(i, 1).Value/Range(&quot;a1&quot;).Offset(i, 2).Value 'value for first proportion</p>
      <p>p2 = Range(&quot;a1&quot;).Offset(j, 1).Value/Range(&quot;a1&quot;).Offset(j, 2).Value 'value for second proportion</p>
      <p>r = (Abs(p1 - p2)) 'find absolute difference</p>
      <p>n1 = Range(&quot;a1&quot;).Offset(i, 2).Value</p>
      <p>n2 = Range(&quot;a1&quot;).Offset(j, 2).Value</p>
      <p>p = ((p1 * n1) + (p2 * n2))/(n1 + n2)</p>
      <p>se = Sqr((p * (1 - p)) * ((1/n1) + (1/n2)))</p>
      <p>z = r/se</p>
      <p>Range(&quot;f1&quot;).Offset(k, 1).Value = Range(&quot;a1&quot;).Offset(i, 1).Value 'first count</p>
      <p>Range(&quot;f1&quot;).Offset(k, 2).Value = n1 'first total</p>
      <p>Range(&quot;f1&quot;).Offset(k, 3).Value = Round(p1, 4) 'first proportion</p>
      <p>Range(&quot;f1&quot;).Offset(k, 4).Value = Range(&quot;a1&quot;).Offset(j, 1).Value 'second count</p>
      <p>Range(&quot;f1&quot;).Offset(k, 5).Value = n2 'second total</p>
      <p>Range(&quot;f1&quot;).Offset(k, 6).Value = Round(p2, 4) 'second proportion</p>
      <p>Range(&quot;f1&quot;).Offset(k, 7).Value = Round(z, 4) 'z score</p>
      <p>If z &gt; 1.96 Then</p>
      <p>Range(&quot;f1&quot;).Offset(k, 8).Value = &quot;Sig.&quot;</p>
      <p>Else</p>
      <p>Range(&quot;f1&quot;).Offset(k, 8).Value = &quot;NS&quot;</p>
      <p>End If</p>
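      <p>If Range(&quot;a1&quot;).Offset(i, 1).Value &lt; 6 Or Range(&quot;a1&quot;).Offset(j, 1).Value &lt; 6 Then 'flag pairs where either count is 5 or fewer; counts are read directly to avoid rounding in n * p</p>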
      <p>Range(&quot;f1&quot;).Offset(k, 8).Value = &quot;N&lt;=5&quot;</p>
      <p>End If</p>
      <p>Next j</p>
      <p>Next i</p>
      <p>Range(&quot;f1:n1&quot;).EntireColumn.AutoFit 'columns F to N hold the nine output fields</p>
      <p>End Sub</p>
    </sec>
    <sec id="s7">
      <title>Cite this paper</title>
      <p>Landry, V.L. (2017) Working with Aggregate Data: An Excel Macro for Pairwise Comparison Using Z Test for Two Proportions. Open Access Library Journal, 4: e3927. https://doi.org/10.4236/oalib.1103927</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="scirp.79715-ref1">
        <label>1</label>
        <mixed-citation publication-type="other" xlink:type="simple">Concepts, L. (2015) Aggregate Data Definition. http://edglossary.org/aggregate-data/</mixed-citation>
      </ref>
      <ref id="scirp.79715-ref2">
        <label>2</label>
        <mixed-citation publication-type="other" xlink:type="simple">StatisticsLectures.com (2017) Z-Test for Proportions, Two Samples. http://www.statisticslectures.com/topics/ztestproportions/</mixed-citation>
      </ref>
      <ref id="scirp.79715-ref3">
        <label>3</label>
        <mixed-citation publication-type="other" xlink:type="simple">Arizona Secretary of State (2017) Voter Registration Counts. https://www.azsos.gov/elections/voter-registration-historical-election-data/voter-registration-counts</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>