Comparison of Data Screening Methods for Evaluating School-Level Fitness Patterns in Youth: Findings from the NFL PLAY 60 FITNESSGRAM Partnership Project

P. F. Saint-Maurice et al.

Background: There is great interest in tracking health-related fitness across the United States. The NFL PLAY 60 FITNESSGRAM Partnership Project (NFL P60FGPP) is a large participatory research network that conducts fitness surveillance in more than 1000 schools spread throughout the country. Because fitness data are collected by school staff, they can vary in quality and representativeness, and careful screening procedures are needed to ensure that the data reflect actual patterns in the schools. This study examined the impact of different data screening procedures on outcomes of aerobic fitness (AF) collected from the NFL P60FGPP. Methods: Data were compiled from 149,101 youth from 504 schools and were processed using the established age- and gender-specific AF FITNESSGRAM health-related standards. Data were subjected to three different screening procedures (based on grade size and boy-to-girl ratio per grade). Linear models were computed to obtain unadjusted and adjusted (for age, BMI-Z, and socio-economic status) estimates of the % of youth in the Healthy Fitness Zone (HFZ) in order to determine 1) whether there were differences in the % in the HFZ and 2) whether those differences could be explained by changes in the representativeness of the sample due to the different data screening procedures. Results: Depending on the screening procedure used, the final sample ranged from 96,999 (no screening) to 46,572 youth (most stringent criteria). The proportion of youth achieving appropriate levels of AF ranged from 56% to 61%, with unscreened data resulting in consistently lower percentages of youth achieving the standard (P < 0.05). Overall, these differences were not explained by possible changes in demographic characteristics resulting from the different screening criteria. Conclusions: The findings demonstrate the importance of establishing appropriate screening procedures that maximize sample size while also ensuring generalizability of the findings.


Introduction
A recent study of 27 countries has documented global declines in aerobic fitness performance (−0.36% per year) over the past 50 years [1]. This study highlighted issues with public health surveillance and substantiated the importance of youth physical activity and fitness promotion at an international level.
Schools have been increasingly emphasized as a promising target for coordinated programming [2] [3], and studies have evaluated numerous school-based interventions [4] [5] and policies focused on youth fitness [3] [6] [7]. Fitness testing has been a mainstay of Physical Education (PE) programs in the United States for over 50 years [8], but existing fitness surveys [9] [10] are out of date due to the secular changes in youth fitness [1]. Interestingly, many states and large districts now mandate, fund, and/or promote the systematic collection of health-related fitness data for youth fitness surveillance (e.g., California, Texas, Georgia, Delaware, New York City).
The FITNESSGRAM ® battery is one of the most commonly used fitness batteries across the globe [11] [12]. The FITNESSGRAM program helped to shift the focus of fitness testing from performance-related to health-related fitness and from norm-referenced to criterion-referenced evaluation standards [11]. In addition to printed personalized assessment reports for individuals, the new web-based platform makes it possible to compile and track youth fitness data at the class, school, district, and state level. A series of recent studies published in the American Journal of Preventive Medicine refined the criterion-referenced standards for aerobic fitness and body composition using nationally representative data from the National Health and Nutrition Examination Survey. The revised standards for aerobic fitness [13] and body composition [14] have documented utility for detecting risks of metabolic syndrome in youth and have been shown to be related to the fulfillment of physical activity guidelines [15]. This allows the school-based FITNESSGRAM assessment to provide valuable information about levels of health-related fitness in youth.
One example of a large-scale application of FITNESSGRAM was the Texas Youth Fitness Study [16]. Results from the ongoing FITNESSGRAM adoption in Texas have been published in a series of studies [13] [17]- [20]. The supplement included detailed reports of fitness results [17] as well as a controlled study that demonstrated that trained teachers could provide valid and reliable data on youth fitness [18]. These results demonstrate the potential for trained teachers to adopt FITNESSGRAM testing for public health surveillance. In addition, the Presidential Youth Fitness Program (www.presidentialyouthfitnessprogram.org) has now established the FITNESSGRAM program as the exclusive national youth fitness battery.
The widespread adoption of FITNESSGRAM (both in the US and internationally) opens up exciting opportunities to systematically study age and gender patterns of youth fitness with the same test. However, there are many complex issues that must be considered when using field-based tests for tracking fitness levels at a large scale (e.g., district, state, national). Data collected from schools can vary in quality and representativeness. Therefore, careful data screening procedures are needed to ensure that the data reflect actual patterns in the schools. Some of these concerns have been studied in other health surveillance research areas [21]- [24].
The purpose of this study was to systematically evaluate the effect of different data screening procedures on school-level estimates of fitness outcomes collected from local schools spread throughout the US. The impact of different data screening procedures was examined in a sample of over 500 schools from 22 different states.

Design and Sample
Institutional Review Board approval was granted by the Cooper Institute and Iowa State University. Data for the present study were obtained through a participatory research project called the NFL PLAY 60 FITNESSGRAM Partnership Project (NFL P60FGPP). The NFL P60FGPP, launched in 2009, provides training and support to a network of over 1100 schools/sites (35 site licenses for each of the 32 NFL franchises) [25]. Schools are provided training materials that include the FITNESSGRAM protocol and philosophy and are encouraged to use various NFL PLAY 60 resources to promote physical activity and healthy eating in students. FITNESSGRAM records from the participating schools are compiled through web-based servers, and these data are tracked over time to evaluate fitness patterns and the impact of school programming under real-world conditions.
The data for the present study were collected in the Spring of 2012 and extracted from the project database in the Fall of 2012. The dataset included 149,101 student records from 504 schools (from 32 different NFL franchise cities) in 22 states and one region (New England) in the US. There were 7546 cases with missing demographic information (age, gender, or grade) and 21,940 cases that were either missing aerobic fitness scores or had unfeasible ones, so these cases were excluded (individual-level screening). An additional 22,616 cases were removed because they were outside the targeted age range (10 - 18) for aerobic fitness evaluation. This resulted in a final sample of 96,999 records that had complete and clean data on the outcome of interest.
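The individual-level sample flow described above can be checked with simple arithmetic. The sketch below assumes the three exclusion groups do not overlap (each record is counted in only one group); the variable names are illustrative only.

```python
# Record counts reported in the text.
total_records = 149_101
exclusions = {
    "missing demographic information": 7_546,
    "missing or unfeasible aerobic fitness score": 21_940,
    "outside the 10-18 age range": 22_616,
}

# Assumption: the exclusion groups are mutually exclusive,
# so they can simply be subtracted one after another.
remaining = total_records
for reason, n_removed in exclusions.items():
    remaining -= n_removed
    print(f"after removing {n_removed:>6,} ({reason}): {remaining:,}")

final_sample = remaining
```

Under that assumption the remaining sample is 96,999 records, which matches the unscreened sample size reported in the abstract.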

The FITNESSGRAM ® Assessment
The established FITNESSGRAM battery includes a variety of practical, field-based assessments for each of the key dimensions of health-related fitness (aerobic capacity, body composition, and musculoskeletal fitness and flexibility), but the focus in the present study was on aerobic fitness measures. The FITNESSGRAM battery provides schools with three different assessments of aerobic fitness: a progressive 20 m shuttle run (PACER), the one-mile run test (MRT), and a 15-meter PACER test (a modified version of the PACER test). Scores from these assessments are equated to PACER laps [26]- [28], and predicted maximal VO2 is then evaluated using the established health-related standards. Youth who achieve the standard are placed into the Healthy Fitness Zone ® (HFZ), while youth falling below this value are placed in the Needs Improvement Zone (NIZ). Schools that utilize the FITNESSGRAM web-based software can generate personalized fitness reports as well as aggregate-level reports that show the percentage of youth achieving the HFZ. However, the focus in the present study was on decisions that would influence interpretations of compiled school-level data for public health surveillance.

Data Processing
Data analyses were conducted in the Fall of 2013. Fitness data from the NFL PLAY 60 campaign are hierarchically structured with individuals nested in grades within a particular school and further nested by franchise.
De-identified fitness data files were first exported from the web servers and cleaned using standard procedures to ensure the quality of the individual records. Participants were excluded if they were missing demographic information such as age, gender, or grade, or did not have aerobic capacity scores. Participants were also excluded if they had abnormal values for aerobic capacity (e.g., values = 0, or out-of-range scores). The cleaned data were then aggregated by gender and grade and then screened using three different approaches that varied in rigor:
o Screening A (conservative): This screening protocol was the most conservative. Cases were removed if there was an imbalance between the number of boys and girls (a gender ratio greater than 1.2, or a 12:10 ratio) and if there were fewer than 60 students per grade.
o Screening B (intermediate): This screening protocol was defined to be less conservative than protocol A. The boy:girl ratio criterion was the same, but cases were removed if there were fewer than 30 students per grade.
o Screening C (liberal): This was the most liberal protocol. Cases were only removed if there was an excessive boy:girl imbalance (a gender ratio greater than 2.0, or a 20:10 ratio) and if there was an excessively small sample of students per grade (grades with fewer than 15 students).
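The three grade-level screening rules can be sketched as a small filter. This is an illustration under stated assumptions rather than the procedure actually used in the study: it assumes a grade is retained only when it satisfies both criteria, that the gender ratio is evaluated symmetrically (an excess of girls counts the same as an excess of boys), and that a grade with no boys or no girls is always treated as imbalanced. The names `Protocol`, `gender_ratio`, and `keep_grade` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    max_ratio: float    # largest gender ratio that is still retained
    min_students: int   # smallest grade size that is still retained


# Thresholds as described for screening protocols A, B, and C.
PROTOCOLS = {
    "A": Protocol(max_ratio=1.2, min_students=60),
    "B": Protocol(max_ratio=1.2, min_students=30),
    "C": Protocol(max_ratio=2.0, min_students=15),
}


def gender_ratio(boys: int, girls: int) -> float:
    """Ratio of the larger group to the smaller one (assumed symmetric)."""
    if boys == 0 or girls == 0:
        return float("inf")  # assumption: single-gender grades are imbalanced
    return max(boys / girls, girls / boys)


def keep_grade(boys: int, girls: int, protocol: Protocol) -> bool:
    """True if the grade passes both criteria (assumed: failing either removes it)."""
    large_enough = (boys + girls) >= protocol.min_students
    balanced = gender_ratio(boys, girls) <= protocol.max_ratio
    return large_enough and balanced


# Hypothetical grades: (boys, girls) with valid aerobic fitness scores.
for boys, girls in [(40, 35), (20, 18), (12, 7), (10, 0)]:
    kept = [name for name, p in PROTOCOLS.items() if keep_grade(boys, girls, p)]
    print(f"{boys} boys, {girls} girls -> retained under: {kept or 'none'}")
```

In this sketch a grade of 20 boys and 18 girls is dropped by protocol A for size alone but retained by B and C, which shows how the stricter protocols shrink the sample.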
The four different (screened and unfiltered) data sets were then processed to compare the impact on school-level fitness outcomes. We used the percentage of students per grade meeting the Healthy Fitness Zone as our outcome variable since this is a popular indicator used in youth fitness research. This was calculated separately for boys and girls using the formula below:

$$\%\ \text{Meeting the HFZ} = \frac{\text{Number of students meeting the HFZ for AC at each grade per school}}{\text{Total number of students at each grade per school}} \times 100$$
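As a minimal sketch of this outcome calculation, the function below computes the percentage of students meeting the HFZ for each school, grade, and gender. The record layout (a tuple of school, grade, gender, and an HFZ flag) and the example values are hypothetical and are not taken from the project database.

```python
from collections import defaultdict


def percent_in_hfz(records):
    """Percent of students meeting the HFZ per (school, grade, gender).

    `records` is an iterable of (school, grade, gender, in_hfz) tuples,
    where `in_hfz` is True when the student met the aerobic capacity standard.
    """
    met = defaultdict(int)
    total = defaultdict(int)
    for school, grade, gender, in_hfz in records:
        key = (school, grade, gender)
        total[key] += 1
        met[key] += int(in_hfz)
    return {key: 100.0 * met[key] / total[key] for key in total}


# Hypothetical records for one grade in one school.
records = [
    ("S1", 6, "F", True),
    ("S1", 6, "F", False),
    ("S1", 6, "F", True),
    ("S1", 6, "F", True),
    ("S1", 6, "M", True),
    ("S1", 6, "M", False),
]
result = percent_in_hfz(records)
print(result)
```

Here three of four girls and one of two boys meet the standard, giving 75% and 50% respectively. With so few students a single record shifts the percentage substantially, which is the motivation for a minimum grade size.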

Data Analysis
As part of our analysis, we used aggregated grade-level data to illustrate visually (with histograms) how the raw data (no filters) were distributed for the two screening variables: the boy/girl ratio per grade and the total number of participants per grade. The impact of each screening decision on the initial sample size and outcome scores was determined visually using histograms, along with skewness and kurtosis values as indicators of the shape of the distribution. We were particularly interested in the effect of data screening on state-level outcomes. Therefore, the effect of screening protocol was determined on aggregated state-level data using a within-subjects design with "Percent Meeting the Healthy Fitness Zone" (% Meeting the HFZ) as the outcome variable. We computed two linear models. The first included only one predictor (screening protocol) and examined whether there were statistically significant differences between screening protocols. The second included average state-level age, BMI-Z scores, and socio-economic status (SES) indicators, and their respective interactions with screening protocol, to determine whether differences between screening protocols (model 1) could be explained by changes in the demographic characteristics of each screening protocol sample. Differences in the output obtained from the two models would indicate that data screening protocols can exclude important segments of the population being studied, which might lead to important fluctuations in state-level estimates of health-related fitness. Socio-economic status was calculated as the percentage of participants at each school eligible for free or reduced lunch, and the three predictor variables were centered at the sample median score. The solution for fixed effects from each model was followed by pre-determined contrasts between the raw outcome scores (no filters) and each of the screening protocols. Differences in least square means were tested using 95% confidence intervals (P < 0.05).
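To illustrate how skewness and kurtosis summarize the shape of a distribution of grade-level percentages, the sketch below uses the simple moment-based definitions (with kurtosis reported as excess kurtosis, so a normal distribution scores 0). The text does not state which estimator was used, so this choice is an assumption, and the two small samples are invented for illustration.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(values, 4) / m2 ** 2 - 3.0


# Hypothetical grade-level "% meeting the HFZ" values.
spread_out = [0, 0, 50, 100, 100]   # many extreme scores, as in unfiltered data
clustered = [40, 50, 50, 50, 60]    # scores concentrated near the center

for name, sample in [("spread out", spread_out), ("clustered", clustered)]:
    print(name, round(skewness(sample), 3), round(excess_kurtosis(sample), 3))
```

Both samples are symmetric, so their skewness is zero, but the sample with many values at 0% and 100% has the lower kurtosis. This mirrors the interpretation that lower kurtosis indicates scores that are more spread out.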

Results
The number of excluded grades varied across the three screening protocols depending on the stringency of the criteria (see Table 1). When no filters were applied, the final sample resulted in 1279 grades. The boy per girl ratio for each school grade ranged from 0 (i.e., indicating some school grades had only boys or only girls with valid aerobic fitness scores) to 17 (i.e., indicating a ratio of 17 boys per girl per school grade with valid aerobic fitness scores) (Figure 1(a)). The total number of students per school grade ranged from 1 to 867, with approximately 30% of the school grades having fewer than 15 students per grade (Figure 1(b)).
The impact of different screening protocols on grade-level aerobic fitness estimates was first examined based on standard indicators of sample distribution. Figure 2 indicates that the more stringent the screening protocol was, the more normally distributed the indicators of fitness were. When no filters were applied, there was a higher prevalence of extreme scores. For example, the "no filters" histogram indicated that approximately 8% of the total number of grades had 0% to 3% of the students meeting the HFZ, while 6% of the total grades had 99% of their students meeting the HFZ for aerobic fitness. The prevalence of scores at the two ends of the spectrum dropped to approximately 2% when filters were applied. Values for kurtosis were lower for the unscreened sample and similar among the three screening procedures (i.e., indicating that scores were more spread out when no filters were applied). There were no differences in skewness between the protocols except that screening protocol C had slightly lower skewness values (Figure 2).
Grade-level indicators of fitness were aggregated by franchise (n = 31) to examine the effect of data reduction decisions when reporting nationwide fitness results. Figure 3 illustrates how the average estimates of students meeting the HFZ resulting from each screening protocol (A, B, and C) change when all the raw data are included for analysis. The visual patterns suggested that including all the raw data when processing fitness data would lead to lower estimates of aerobic fitness in most of the states. The black bars in Figure 3 are more visible as the protocol becomes more stringent. If no filters were used, the proportion of students meeting the HFZ in most of the states would be less than or equal to 60%. This value fluctuates when some screening criteria are used, and several states actually reach the 80% mark when the most stringent protocol is used (Figure 3).

Discussion
There are a number of large-scale fitness surveillance initiatives taking place in states and nations across the globe; however, little consideration has been given to the techniques used to process and control the quality of the data. As shown in this study, the distribution of fitness scores when data are not filtered can result in overall lower estimates of youth meeting the HFZ. We characterized the distribution of field-based fitness assessment scores in a large national cohort of schools involved in the NFL P60FGPP, but the conclusions and implications are relevant to district, state, or national evaluations.
We focused the analyses on aerobic fitness because it is widely considered to be the most important component of health-related fitness and is almost universally used in youth fitness batteries. The FITNESSGRAM battery recommends the use of the PACER test, and this assessment is based on the original 20 m shuttle run test that is widely used in European test batteries (Eurofit) and other national batteries [29]. An advantage of the PACER test is that it essentially replicates the timing and structure of standard lab-based maximal fitness tests [30]. It was recommended by the Institute of Medicine as the primary field-based assessment of aerobic fitness for youth and has good psychometric and motivational properties for school-based assessments [12] [31]. The PACER test, like other assessments in the FITNESSGRAM battery, was developed primarily for fitness education and youth fitness promotion. Youth typically receive personalized reports about their level of fitness and feedback on how to maintain or improve it. Educating students about their level of health-related fitness is the primary function of youth fitness testing, but the FITNESSGRAM advisory board has also endorsed "institutional testing" as an appropriate use of fitness data [32]. Districts often track data to examine the impact of curricular changes or the long-term impact of programming on youth fitness. States may be more interested in evaluating the impact of school environments or policies on youth fitness outcomes. For example, an evaluation of the Texas Youth Fitness Study examined whether various school-level factors (i.e., school physical education policy, school resources, physical education duration and frequency, teachers' training and testing experience) could explain the variability in fitness results observed across the state [20]. In either example, the most important consideration is standardization of the processes used for screening and processing the data. Based on the present analyses, the simple inclusion of all available data would likely lead to spurious findings when used to explain age and gender patterns of fitness.
There are no definitive guidelines to determine what the "correct" screening criteria would be. The use of more stringent criteria would restrict the available sample and possibly increase the internal validity of the findings. However, the restrictions could jeopardize the external validity of the findings since they would lead to a less representative sample. The complexities of these issues are compounded when trying to understand differences across schools since there is considerable variability in the nature and size of schools. A sample of 15 children per grade may be a small number of students in a large school, but it could be the full grade contingent in a small rural school. Therefore, it may be important to also consider the percent of the available students tested rather than simply the number of students. This discussion has important implications for future surveillance reports on physical fitness. It is important to facilitate and promote the integration of fitness assessment in schools across the country in order to improve the quality of the data for surveillance. These trends can provide important information on the state of the art of surveillance of a specific health indicator [33].
Additionally, it is important to consider the boy-to-girl ratio when using aggregated data as representative of a grade. It is well known that both absolute and relative aerobic capacity are higher in boys, particularly during late adolescence [34]. The FITNESSGRAM criterion-referenced standards account for these differences; however, gender differences can also be expected in the proportion of youth meeting the recommended levels of aerobic fitness [17]. These findings support our decision to include a gender ratio screening criterion. Any imbalance at a specific grade can lead to flawed comparisons when using aggregated data. At this point we were not able to determine which gender ratio would be the most appropriate, but our study shows that the inclusion of this requirement has some implications for aerobic fitness outcomes.
A strength of the present study is the large sample of schools, which made it possible to simulate issues that would arise when aggregating data at both the state and national level. Limitations of the analyses are the use of only one fitness measure and the focus on only 3 key screening methods. The screening criteria used in this study are most closely related to the definition of coverage (e.g., the extent to which the students assessed are representative of a particular school or state). Coverage can be seen as an indicator of the quality of the data [22] and can be improved when several of the study design decisions (e.g., random selection, stratified sample selection) are within the control of the researchers [35]- [37]. It is possible that more robust analyses could optimize criteria for specific decisions and include other dimensions of data quality. If consensus can be reached on appropriate screening procedures for "naturalistic" study designs (i.e., non-random sample selection), it could enable more effective comparisons across districts, states, and nations. The broad adoption of FITNESSGRAM across the United States and many other countries offers considerable promise for advancing understanding of youth fitness outcomes. However, the results of the present study demonstrate the importance of selecting an appropriate screening protocol to ensure appropriate interpretations.

Conclusion
The present study used a large national sample to demonstrate the impact of the quality of the data on state-level estimates of health-related fitness. The NFL PLAY 60 FITNESSGRAM Partnership has been successful in implementing a sustainable strategy for the surveillance of fitness across the country. This initiative will allow states to have a better understanding of youth fitness levels and their implications for public health. However, this approach relies on local school efforts to assess children and record their fitness data using appropriate procedures, which might lead to a lack of standardization in data collection and recording procedures. It can also lead to poorly represented sites/schools/states if only a small set of students is assessed without any additional information on the selection criteria used to determine this sample (selection bias). This will most likely affect the quality of the data. It is challenging to define or quantify the quality of large-scale data, but our study provides evidence that the quality of the data, namely coverage, can vary between schools and that unscreened fitness data can result in a higher prevalence of unsound estimates of aerobic fitness. We demonstrated that unscreened data led to lower estimated levels of aerobic fitness in youth when compared to the same data after some screening procedures were applied. Importantly, our results also demonstrated that more stringent screening procedures can reduce the available sample to a great degree; however, this did not affect the representativeness of the sample being considered. The NFL PLAY 60 FITNESSGRAM initiative provides the most up-to-date information about youth fitness levels across the country. This initiative provides a unique opportunity to explore some of the challenges associated with large-scale tracking of health-related indicators. We suggest that screening procedures be considered when quality control is not sustainable or cannot be assured. This is a very common case in surveillance research.

Figure 1. (a) Distribution of boy to girl ratio per grade per school in the raw (unfiltered) data (n = 1279 grades). The minimum (MIN) value for this distribution was equal to 0 (indicating some grades only had either boys or girls) while the maximum (MAX) was 17 (indicating some grades had a 17 boy to girl ratio); (b) Distribution of total number of participants per grade per school in the raw (unfiltered) data (n = 1279). The minimum (MIN) value for this distribution was 1 (indicating some grades only had 1 participant) while some grades had 867 participants (MAX).

Figure 2. Distribution of the percent of participants meeting the Healthy Fitness Zone for the raw data (left: no filters) and by screening protocol. Values for kurtosis (KURT) and skewness (SKEW) are provided in the top right side of each distribution.

Figure 3. Pairwise comparisons of the average US state (franchise) percent of students meeting the Healthy Fitness Zone. The raw data average % in HFZ was set as the reference (white bars) while each screening protocol was defined using black bars.

Figure 4. Adjusted and unadjusted average number of participants meeting the Healthy Fitness Zone for the raw data (unfiltered) and each screening protocol. Statistically significant differences are indicated with * and represent pairwise comparisons (using adjusted or unadjusted least square means) between the raw data and each screening protocol.

Table 1. Flow of sample size by screening protocol.
IL: individual level screening (phase I); GL: grade level screening (phase II); NA: not applicable; screening A: boy per girl ratio = 1.2 and total number of participants per grade = 60; screening B: boy per girl ratio = 1.2 and total number of participants per grade = 30; screening C: boy per girl ratio = 2.0 and total number of participants per grade = 15.