International Studies of Physical Education Using SOFIT: A Review

Objective evaluations are essential to improving physical education (PE) policy and practice, and the System for Observing Fitness Instruction Time (SOFIT) is a valid and reliable tool designed to reach this end. This review assesses peer-reviewed studies that used SOFIT to describe preK-12 PE in international schools. Methods were informed by Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), and articles were located by searching nine library databases and Google Scholar. A total of 739 records were located, 567 were screened, and 29 full-text articles were scrutinized. Data extraction was conducted to evaluate the characteristics of the 29 studies and to synthesize commonly reported SOFIT variables. The studies, conducted on 5 continents, included direct observations of 2703 lessons in 348 schools taught by more than 600 teachers in 10 different countries. There was substantial variability in study characteristics, in how results were reported, and in study outcomes. All studies assessed physical activity (PA) and 90% (n = 26) assessed both PA and lesson context. More than two-thirds of the studies (69%; n = 20) assessed PA, lesson context, and teacher behavior. A common goal of the reviewed studies was to describe PE using SOFIT; however, researcher modifications to the established protocol and variability in how results were reported limited data syntheses and generalizations. As SOFIT is widely endorsed for assessing PE policies and practices, researchers could improve the generalizability of their study findings by adhering to the standard SOFIT protocol and by reporting results in a consistent manner.


Introduction

The Importance of PreK-12 Physical Education (PE)
The World Health Organization (WHO) recommends that children and adolescents engage in at least 60 minutes of moderate to vigorous physical activity (MVPA) daily that includes muscle and bone strengthening activities at least three times per week (WHO, 2011). Unfortunately, more than 80% of adolescents do not meet the guidelines (WHO, 2011), and increasing physical activity (PA) among school-age children is a global priority (WHO, 2018a).
The consequences of physical inactivity are severe, as sedentary living is associated with numerous health conditions. Physical inactivity is associated with increased risk for overweight and obesity, and the consequences become apparent at a young age. The WHO, for example, has indicated that the prevalence of obesity worldwide has tripled since the onset of the obesity crisis in the 1970s and that millions of children worldwide are already overweight or obese by age five (WHO, 2018b).
There is global consensus that physical education (PE) is an essential program within preK to grade 12 (preK-12) schools, largely because of its potential to increase PA and play an important role in obesity prevention (UNESCO, 2015).
Schools reach nearly all children, and most countries have established recommendations for PE that recognize the importance of engaging students in health-enhancing MVPA during PE in order to develop student physical fitness and motor skills and to promote lifelong engagement in PA (Hardman, 2014).
Although key stakeholders recognize that quality PE programs are a worthwhile public health investment, numerous barriers impact both the quantity and quality of PE, including limited schedules, inadequately trained teachers, lack of curricular resources, and insufficient equipment and facilities (McKenzie & Lounsbery, 2009). Assessing how PE is conducted is an important step in overcoming these barriers.
Global efforts to evaluate children's PA and the quality of PE and other school-based PA opportunities are currently underway (Hardman, 2014; Tremblay et al., 2016). The Active Healthy Kids Global Alliance, for example, recently published Report Cards on PA for international schools from 38 countries located on 6 continents (Tremblay et al., 2016). As well, in 2013 the United Nations Educational, Scientific and Cultural Organization (UNESCO) published the results of a worldwide survey of PE administered in 232 countries (Hardman, 2014). These efforts demonstrate a commitment to monitoring PE and improving its quality worldwide; experts acknowledge, however, that current data are limited, partly because objective assessment tools have not been widely adopted (Hardman, 2014; Tremblay et al., 2016).

The System for Observing Fitness Instruction Time (SOFIT)
The System for Observing Fitness Instruction Time (SOFIT) is a valid and reliable instrument for objectively assessing PE programs (McKenzie, 2012; McKenzie, Sallis, & Nader, 1991a; McKenzie & Smith, 2017). SOFIT provides objective and contextually rich data on the conduct of PE lessons and has been widely used.
DOI: 10.4236/ape.2019.91005
Observers are trained to use SOFIT via a standardized observation protocol that includes video segments for both instruction and assessment. Momentary time sampling methods (i.e., 10 seconds observe; 10 seconds record) are employed to simultaneously code student PA levels (i.e., lying down, sitting, standing, walking/moderate, vigorous), lesson context (i.e., how lesson time is being spent: management, knowledge, fitness, skill development, game play, free time), and teacher behavior (i.e., time spent promoting fitness, demonstrating fitness, instructing generally, managing, observing, or doing other tasks) or teacher interactions (i.e., instances of promoting "in-class" or "out-of-class" PA).
Observers also record lesson start and end times, lesson location, target student gender, teacher gender, grade level, and the number of boys and girls engaged in the lesson.
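As an illustration of how interval codes translate into a summary score, the following minimal sketch computes MVPA% (walking/moderate plus vigorous) for one lesson. The numeric codes 1 to 5, the function name, and the sample lesson are assumptions made for this example; they are not taken from the reviewed studies.

```python
from collections import Counter

# Assumed numeric codes for the five activity levels, in the order listed in the text.
ACTIVITY_CODES = {
    1: "lying down",
    2: "sitting",
    3: "standing",
    4: "walking/moderate",
    5: "vigorous",
}
MVPA_CODES = {4, 5}  # walking/moderate plus vigorous


def mvpa_percent(interval_codes):
    """Return the percentage of observed intervals coded as MVPA."""
    if not interval_codes:
        raise ValueError("no intervals recorded")
    counts = Counter(interval_codes)
    mvpa_intervals = sum(counts[code] for code in MVPA_CODES)
    return 100.0 * mvpa_intervals / len(interval_codes)


# Hypothetical lesson segment: 12 intervals (one code per 20-second observe/record cycle).
lesson = [2, 3, 3, 4, 4, 5, 3, 4, 2, 5, 4, 3]
print(f"MVPA% = {mvpa_percent(lesson):.1f}")
```

Lesson context and teacher behavior could be summarized the same way, as the share of intervals assigned to each category.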
The validity of the contextual and behavioral categories is also well-established, with studies consistently reporting significant relationships between student PA levels, how lesson time is allocated, and how teachers spend their time and interact with students (McKenzie et al., 1991a; McKenzie, Sallis, & Nader, 1991b; McKenzie et al., 1995; McKenzie, Marshall, Sallis, & Conway, 2000; Smith, Monnat, & Lounsbery, 2015). A recent review of SOFIT studies conducted in the US found consistently high inter-observer agreement (i.e., reliabilities > 85%) (McKenzie & Smith, 2017).

Purpose
The current investigation reviews studies that used SOFIT to assess PE in preK-12 schools located outside of the US. Specifically, our objectives are to describe the characteristics of international SOFIT studies and to quantitatively synthesize results for the SOFIT main variables (i.e., student PA levels, lesson context, teacher behavior) and two other commonly reported variables: class size and lesson length.

Significance
SOFIT has been widely used to assess PE internationally, and this investigation complements a review of SOFIT studies published in the US between 1991 and 2016 (McKenzie & Smith, 2017). This review increases awareness about research findings from studies that have utilized SOFIT to describe PA, lesson contexts, and teacher promotion of PA in international settings. The findings have important implications for public health stakeholders, teacher preparation programs, and researchers. Foremost, the findings increase awareness about the potential of PE to increase PA internationally. This is important because of the need to obtain objective evidence about opportunities for children and adolescents to accrue health-related PA. The SOFIT data specifically shed light on how teachers allocate lesson time and interact with students during PE. These factors have important implications for designing professional development for current and future teachers. Finally, this review identifies the strengths and limitations of existing international SOFIT studies and should lead to improving the data collection methods and the reporting of results in future studies. As well, because SOFIT has been recommended for surveillance (McKenzie & Smith, 2017; IOM, 2013), our data summaries for student activity, lesson context, teacher behavior, class size, and lesson length contribute to efforts to monitor PE globally (WHO, 2018a; UNESCO, 2015; Hardman, 2014).

Review Guidelines
Based on the recommendations of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; see Figure 1), we completed a series of steps (Liberati et al., 2009). First, we determined inclusion and exclusion criteria for potential studies and then conducted a comprehensive search. We removed duplicates from the resulting lists and then screened the remaining abstracts and records. We obtained full texts of selected papers to confirm their eligibility for inclusion and then extracted relevant data from studies meeting inclusion criteria.

Inclusion Criteria
To be included in the review, studies had to: 1) use the standard SOFIT protocol; 2) describe PE lessons taught in typical preK-12 schools located outside of the US; and 3) be published in English in a peer-reviewed journal between 1991 and 2017. Table 1 describes the 29 studies meeting these general criteria. Of these, 12 met three additional criteria in order to be included in a quantitative synthesis (Table 2). These were: a) include data from at least 30 typical PE lessons that were not influenced by an experiment or intervention; b) report mean scores and standard deviations for the main SOFIT variables; and c) provide evidence that the observational data were collected reliably throughout the study.

Search Terms and Information Sources
We searched nine databases for full-text, peer-reviewed research articles using the terms "physical education" OR "PE" AND "System for Observing Fitness Instruction Time" OR "SOFIT" AND "lesson context." The databases were: 1) Academic Search Ultimate; 2) CINAHL Plus with Full Text (EBSCO); 3) Education Research Complete (EBSCO); 4) PsycINFO; 5) SPORTDiscus with full text (EBSCO); 6) Physical Education Index (ProQuest); 7) PubMed; 8) Science Direct (Elsevier); and 9) Web of Science. As well, we searched the reference lists of selected papers and used Google Scholar to locate additional relevant papers.

Data Extraction
All authors played a role in the process. The first author was responsible for initial data extraction with help from the third author and two student assistants.
The first and third authors reviewed full texts independently, and in the rare case of a disagreement, the second author arbitrated final decisions. The study characteristics extracted from the 29 papers that met initial inclusion criteria included: 1) author; 2) publication year

Quantitative Data Syntheses
We limited quantitative data syntheses to the main SOFIT variables (i.e., student PA levels, lesson context, teacher behavior) and two other commonly reported variables (class size and lesson length). Mean scores, standard deviation values, and sample size were extracted using an Excel tool. The range of mean scores was determined by sorting data from low to high values for each variable. Lower and upper values for the 95% confidence interval were estimated for MVPA%, a measure of PA intensity during lessons, using the formula $\mu = M \pm t\,(s_M)$, where $M$ is the sample mean, $t$ is the critical t value, and $s_M$ is the standard error of the mean (Stangroom, 2018). Excel for Mac version 15.30 was used to compute the median, first and third quartiles, and interquartile range. Figure 2 provides a forest plot to illustrate the average MVPA% for the studies as well as the lower and upper values for the 95% confidence interval. MVPA% from 0 to 100% is noted on the abscissa and the studies are listed in ascending order on the ordinate.
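The calculations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study values are hypothetical, the critical t value is supplied from a t table rather than computed, and the "inclusive" quartile method is assumed because it mirrors Excel's QUARTILE.INC; the review does not state which Excel quartile function was used.

```python
import math
import statistics


def confidence_interval(mean, sd, n, t_crit):
    """Return (lower, upper) for mu = M +/- t * s_M, with s_M = sd / sqrt(n)."""
    s_m = sd / math.sqrt(n)  # standard error of the mean
    margin = t_crit * s_m
    return mean - margin, mean + margin


# Hypothetical study: mean MVPA% = 40.0, SD = 12.0, n = 30 lessons.
# 2.045 is the two-tailed 95% critical t value for 29 degrees of freedom.
lower, upper = confidence_interval(40.0, 12.0, 30, 2.045)
print(f"95% CI: {lower:.2f} to {upper:.2f}")

# Hypothetical study means (MVPA%), summarized by median and quartiles.
study_means = [21.0, 33.7, 38.2, 41.5, 46.0, 52.3, 58.0]
q1, median, q3 = statistics.quantiles(study_means, n=4, method="inclusive")
iqr = q3 - q1
print(f"Median = {median:.2f}; Q1 = {q1:.2f}; Q3 = {q3:.2f}; IQR = {iqr:.2f}")
```

Passing the critical value as a parameter keeps the sketch within the standard library; a statistics package could supply it directly for any sample size.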

Search Results
Figure 1 illustrates the number of records located, screened, and included in our report. We located a total of 739 records from 9 databases (n = 292) and other sources (n = 447). After removing duplicates (n = 172), we screened 567 records for eligibility and excluded 399 more for the reasons identified in Figure 1. We then evaluated 168 full-text articles and excluded 139 of them for reasons summarized in Figure 1, resulting in 29 studies that met inclusion criteria (Table 1). (Sutherland et al., 2016; Mean intervals = 0.3%; SD = 0.8%).
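The record flow reported above can be checked with simple arithmetic. The sketch below uses only the counts given in the text; the variable names are chosen for this example.

```python
# Record counts as reported in the text.
from_databases = 292
from_other_sources = 447
duplicates_removed = 172
excluded_at_screening = 399
excluded_at_full_text = 139

located = from_databases + from_other_sources      # records located
screened = located - duplicates_removed            # records screened
full_text = screened - excluded_at_screening       # full-text articles evaluated
included = full_text - excluded_at_full_text       # studies meeting inclusion criteria

print(located, screened, full_text, included)
```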

Study Characteristics
Table 1 provides a detailed summary of the characteristics of the 29 studies meeting inclusion criteria. They included the direct observations of 2703 PE lessons that were taught by at least 603 teachers in more than 348 schools.

Observer Reliability
Twenty-three studies (79%) described how data collectors were certified prior to starting data collection and 20 (69%) described the periodic assessment of observers (i.e., reliability) in the field during the data collection period. Studies consistently reported that reliability scores met or exceeded the criterion standard (≥85% agreement; McKenzie, 2012), with inter-observer agreements ranging between 80% and 90% for each main SOFIT variable (Mode = 85%), with 84% to 100% for PA, 86% to 100% for lesson context, and 80% to 96% for teacher behavior.
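For readers unfamiliar with how such agreement scores arise, the following minimal sketch computes interval-by-interval percent agreement between two observers and compares it with the 85% criterion. The interval-by-interval formula, the function name, and the sample codes are assumptions made for illustration; the reviewed studies may have calculated agreement differently.

```python
def percent_agreement(observer_a, observer_b):
    """Percentage of intervals on which two observers recorded the same code."""
    if len(observer_a) != len(observer_b) or not observer_a:
        raise ValueError("observers must code the same non-empty set of intervals")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100.0 * agreements / len(observer_a)


# Hypothetical activity codes from two observers watching the same 10 intervals.
obs_a = [2, 3, 3, 4, 4, 5, 3, 4, 2, 5]
obs_b = [2, 3, 4, 4, 4, 5, 3, 4, 2, 5]

score = percent_agreement(obs_a, obs_b)
meets_criterion = score >= 85.0  # criterion standard cited in the text
print(f"Agreement = {score:.1f}%; meets criterion: {meets_criterion}")
```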

Study Analyses & Other Variables Reported
Seventeen studies (59%) examined student gender, including 13 that compared boys and girls within the same lessons and four that investigated differences by class gender composition (i.e., boys-only, girls-only, and co-educational classes).
Eight studies (28%) examined differences based on the preparation of teachers, mainly PE specialists vs classroom teachers. Ten studies (34%) described lessons taught only by PE specialists. Six studies (21%) investigated the location of lessons, with most comparing lessons taught indoors vs outdoors. Cardon et al. (2004), however, compared swimming and non-swimming lessons and Sutherland et al. (2016) compared lessons taught in rural and urban schools. Twenty-three studies (79%) reported actual (i.e., observed) lesson length and 13 (45%) provided scheduled lesson length. PE dosage (i.e., lesson frequency × lesson length) was reported anecdotally, but not objectively assessed. Thirteen studies (45%) reported the number of boys and girls present in class and 13 described student activity levels during the different lesson contexts. Only four studies (14%) reported estimated student energy expenditure rates (i.e., an overall measure of PA intensity).

Syntheses of Results Reported in the Studies
Table 2 presents the range of mean scores, medians, and interquartile ranges for the SOFIT main variables that were identified in the 12 studies meeting the inclusion criteria for quantitative data syntheses (i.e., included at least 30 typical PE lessons not influenced by an intervention; reported mean scores and standard deviations; and provided evidence of observer reliability throughout the study). Figure 2 provides a forest plot of mean MVPA% including the lower and upper values for the 95% confidence interval for 11 of the 12 studies included in the synthesis of MVPA%. Table 2 and Figure 2 indicate that there was substantial variability in the results both within and among the studies. What follows is a description of syntheses for PA, lesson context, teacher behavior, teacher interactions, observed class size, and lesson length.
Figure 2 also shows variability in MVPA% was particularly high in the secondary school studies, with MVPA% ranging between 20.9% and 58.2% (see Table 2).
Analyses for student gender were reported in 6 of the 12 studies that met the criterion for PA% syntheses (data not shown).Boys were typically observed be-

Lesson Context
Nine studies (31%) met the inclusion criteria for quantitative data syntheses for lesson context. These included 2 preschool, 2 elementary, and 5 secondary school studies for a total of 1,050 lessons (n = 125 preschool; n = 426 elementary; n = 500 secondary) in 150 schools taught by more than 304 teachers (Table 2).
Noteworthy findings related to skill practice and game play were found for school level and country of origin. Skill practice was the most prevalent lesson context in the two preschool studies (Chow, McKenzie, & Louie, 2015; Van Cauwenberghe, Labarque, Gubbels, DeBourdeaudhuij, & Cardon, 2011), where it averaged 41.7% and 43.8% of lesson time. In comparison, game play was the most prevalent context in four of the five secondary school studies. On average in these four studies, game play ranged between 12.1% and 46.6% of lesson time and skill practice occupied between 5.2% and 16.5% of lesson time (data not shown).
The exception was the Hong Kong secondary school study (Chow et al., 2009), which reported students spent 36.5% of lesson time in skill practice and 12.1% of it in game play. Relative to country of origin, skill practice was the most prevalent context in all three Hong Kong studies (Chow, McKenzie, & Louie, 2008; Chow et al., 2009; Chow et al., 2015), regardless of school level (preschool, elementary, secondary), and game play was the most prevalent context in all three Australian secondary school studies (Dudley, Okely, Cotton, Pearson, & Caputi, 2012a; Dudley, Okely, Pearson, Cotton, & Caputi, 2012b; Sutherland, Campbell, Lubans et al., 2016).
Only six studies assessed MVPA% during different lesson contexts. Generally, lesson time allocated for fitness activities, skill practice, and game play was positively associated with MVPA%, and time for management and knowledge was negatively associated with it (Chow et al., 2008; Chow et al., 2009; Chow et al., 2015; van Beurden et al., 2003; Van Cauwenberghe et al., 2011; Verstraete et al., 2007). The Verstraete et al. (2007) study found that involving teachers in a professional development intervention led to them being more efficient in allocating lesson time and this subsequently increased student MVPA%.

Teacher Behavior
Seven studies (24%) met the inclusion criteria for a quantitative synthesis for teacher behavior. These included two preschool, one elementary, and four secondary school studies for a total of 841 lessons (n = 125 preschool, n = 368 elementary, and n = 348 secondary) from 122 schools taught by 280 teachers (Table 2). General instruction was the most prevalent teacher behavior, and it occurred between 49.1% (SD = 12.8) and 69.2% (SD = 15.4) of the time in six of the seven studies (data not shown). In contrast, the same studies found teachers spent between 18.1% (SD = 11.4%) and 24.2% (SD = 20.7%) of lesson time in management and less than 13% of lesson time in fitness promotion (data not shown). The one exception was the Hong Kong preschool study (Chow et al., 2015), where teachers were observed managing nearly half the time (Mean = 46.5%; SD = 21.5%) and spending little lesson time in general instruction (Mean = 6.7%; SD = 8.4%; data not shown).

Teacher Interactions
Five studies (17%) described teacher interactions, but only three secondary studies met the inclusion criteria for a quantitative synthesis. These included a total of 232 lessons from 22 schools taught by more than 48 teachers (Table 2).

Observed Class Size
Thirteen studies (45%) reported observed class size, but only four (14%) met the criteria for inclusion in a quantitative synthesis. These included one preschool study (n = 125 lessons) and three secondary school studies (n = 318 lessons) for a total of 408 observed lessons in 44 schools taught by 130 teachers (Median = 22.6 students; IQR = 20.6-26.1; Table 2). The smallest classes observed were reported by Curtner-Smith et al. (1995) in secondary schools in England (Mean = 18.5 students; SD = 6.0) and the largest were reported by Chow et al. (2009) in secondary schools in Hong Kong (Mean = 32.8 students; SD = 9.0).

Lesson Length
Lesson length was described in 23 studies (79%), but only six (21%) met the inclusion criteria for a quantitative synthesis. These included two preschool, one elementary, and three secondary school studies for a total of 344 lessons (n = 125 preschool; n = 39 elementary; n = 180 secondary) in 75 schools taught by more than 100 teachers (Table 2). Mean study lesson length ranged from 19.8 minutes (SD = 4.2) in four preschools in Hong Kong (Chow et al., 2015) to 43.8 minutes (SD = 11.8) in five secondary schools in England (Median = 39.9; IQR = 36.9-43.3; Curtner-Smith et al., 1995). Two Australian studies were not included in the quantitative syntheses because they did not report means and standard deviations; nonetheless, lesson length in these cases ranged widely, between 19 and 110 minutes (data not shown; Dudley et al., 2012a; Dudley et al., 2012b).
Thirteen studies (45%) reported the number of PE minutes scheduled weekly, but only Chow et al. (2015) indicated students (preschool) had PE daily (between 25 and 30 minutes a day). The other 12 studies reported that students were typically scheduled to have PE lessons 1 to 2 days per week (Mode = 2 days per week) and that lessons were between 20 and 120 minutes long (data not shown).
Actual observed lesson length was typically shorter than the scheduled lesson length because of student transitions to the instructional areas. Studies in Hong Kong elementary and secondary schools reported actual observed lessons were from 22% to 27% shorter than their scheduled lengths (Chow et al., 2008; Chow et al., 2009). In the Cardon et al. (2004) study, mean scheduled time was much longer for swimming lessons than regular lessons (83.0 min; SD = 22.0 min vs. 50.8 min; SD = 7.1 min; data not shown); however, scheduled lesson length was not significantly associated with the proportion of time that students were engaged in MVPA.

Discussion
Our purpose was to review SOFIT PE studies conducted in preK-12 schools outside the US. We located 739 records and systematically assessed 29 studies that were conducted in 10 different countries on 5 continents. Data for these studies were obtained via trained observers that used the same SOFIT instrument reliably to directly assess 2703 lessons that were taught by more than 603 teachers in schools, but two involved preschools. Entering data line-by-line is especially recommended for intervention studies because it will enable a more fine-tuned analysis of how changes in MVPA came about.

Study Characteristics
Synthesizing the results of studies was challenging because papers often did not report specific information, such as sample sizes (i.e., number of schools, teachers, and/or classes), field reliability tests, and standard deviations. Precision in sample size (i.e., number of schools, teachers, and classes) was lacking in numerous studies. Specifically, within the 29 studies where data were synthesized, one paper did not identify the number of schools, six did not identify the number of teachers, and five did not indicate the number of different classes observed. Additionally, it was not always clear if "lessons" and "classes" were distinct or if the terms were synonymous. Accurate and complete reporting of sample sizes is essential for understanding the scope of studies and should be reported consistently (e.g., how many schools were included, how many teachers, and how many distinct classes).
In some cases, the trustworthiness of the data was limited because observer reliabilities were not reported. Reliabilities were reported for 25 of the 29 studies, and the results consistently exceeded the established SOFIT protocol standard (i.e., >85% agreement). Not all studies reported detailed scores for certification and field tests, and subsequently 12 (41%) were excluded from quantitative syntheses because they did not provide sufficient evidence of data reliability throughout the study. Of these, four did not report any reliabilities, seven described reliabilities only during observer training, and one reported a low kappa value (i.e., kappa = 0.091). A strength of SOFIT is that following a standardized protocol makes comparisons among studies possible, but this is only appropriate when the data are trustworthy (i.e., reliable).
Syntheses of lesson length and class size were limited, mainly because standard deviation scores and/or reliabilities were not reported. Nearly 80% of studies (n = 23) described actual lesson length and 45% (n = 13) reported class size; however, only six and four studies, respectively, were synthesized, primarily because standard deviation scores were not reported and/or it was not clear if observer reliabilities were maintained. The total minutes per lesson (e.g., 20-120 minutes) and per week varied widely. In many cases investigators reported that there were regional recommendations for PE time, but they also identified that school administrators were responsible for making site-based scheduling decisions for PE.
Greater consistency in class size and lesson length at the school site level could ensure students have more equitable opportunities to become physically educated regardless of where they live.

Comparisons with the US Review
The findings of this investigation are similar to those reported in our review of SOFIT studies conducted in the US (McKenzie & Smith, 2017). For example, the challenges with synthesizing data were similar because important information was left out or reported inconsistently (i.e., observer reliabilities, sample sizes, and standard deviations). Nonetheless, there was similar variability in lesson characteristics in both the US studies and the current ones (e.g., how time was spent in lesson contexts).
A major difference between the US and international studies is the sample size, especially the number of lessons and schools observed. The 29 US studies included observations of 12,256 lessons, nearly five times the number of the 29 international studies. This difference is likely because SOFIT was used in randomized controlled trials (e.g., SPARK, MSPAN, CATCH, TAAG) that were conducted in the US and sponsored by the National Institutes of Health (NIH).

Conclusion
The current description is limited to the assessment of the peer-reviewed reports of 29 different investigations that included direct observations of 2703 lessons using SOFIT in schools in 10 countries. Our syntheses of the main SOFIT variables were restricted to only the 12 studies that included at least 30 typical PE lessons that were not influenced by experiment or intervention, identified mean scores and standard deviations for main SOFIT variables, and provided evidence of observer reliability throughout the study. As the original study locations (e.g., county, city, school district, and schools) and the lessons themselves were not selected at random, our results may not accurately reflect the conduct of PE globally.
Nonetheless, the review has important implications for increasing awareness about the characteristics of preK-12 PE in international schools and for the conduct of future PE studies. Assessing PE is essential for improving its quality, and SOFIT has potential as a ground-truthing tool that helps inform programmatic and instructional improvement efforts. In order to realize this potential, however, there is need for additional observations of PE in preK-12 international schools and for greater consistency in study design and how results are reported.
To inform policy and best practices that could improve PE globally, it is important for future investigations using direct observation to establish observer reliability prior to the start of data collection and continue to assess it throughout the study. As well, the utility and generalizability of the results of these studies can be improved by reporting sample sizes, means, and standard deviation scores in a consistent manner. Improved generalizability could result from investigators adhering to the standard SOFIT protocol and using the observer training videos that are available at no cost on YouTube. For larger studies, investigators should consider using the iSOFIT iOS application. This app is free and it has the potential to streamline data entry and reporting processes (e.g., it generates data graphs immediately and can export data files via email).
SOFIT provides objective data on student physical activity levels and how teachers allocate lesson time and behave during lessons. The resulting information can be used to assess how well these factors align with programmatic and instructional goals. PE goals may differ by country, state/province, school district, school, grade level, and even teacher. SOFIT was developed with the belief that PE should be conducted in a pleasant environment that provides students with ample amounts of MVPA in order for them to accrue health benefits while simultaneously becoming physically fit and motorically skilled. The instrument examines the potential of lessons relative to these goals; it does not assess opportunities for students to reach other relevant PE goals such as cognitive, social, and emotional outcomes.

Figure 2.
Figure 2 shows substantial variability in MVPA% (Walking/Moderate plus Vigorous) within and among the studies. Figure 2 also shows that the mean MVPA% for 5 of the studies met or exceeded the public health objective of ≥50%.

Table 1.
Characteristics of Selected International SOFIT Studies (n = 29). (Note: Not all studies reported all characteristics).

Table 2.
Range of study means, medians, and interquartile ranges for main SOFIT variables.
Lesson length and class size have important implications for PE dosage and program quality, and it is important that this information be included in studies. Future reports should also include standard deviation scores and results of reliability assessments. Class size and lesson length varied widely and these variables have important implications for program outcomes. Chow et al. (2008) counted an average of 33.6 students in Hong Kong lessons (range = 15 to 45), nearly twice as many as in the two studies by Curtner-Smith and colleagues in England in 1995 and 1996 and a third larger than the two secondary school studies in Australia reported by Dudley and colleagues in 2012. As well, PE was typically offered only two days per week, and daily PE was identified only for the children in Hong Kong preschools.

Only five of 29 studies met the public health goal of 50% MVPA. Further, there were differences by student gender, with boys accruing more MVPA than girls. This was found during both coeducational lessons and during boys-only and girls-only lessons. Teachers should strive to achieve the public health goal of 50% MVPA and provide more equitable PA opportunities for boys and girls.