Adoptability of Peer Assessment in ESL Classroom

In ESL classes, the teacher in charge usually evaluates all the students' performances, and peer assessment may be a good way to confirm or modify the teacher's assessment. This study uses FACETS analysis to examine whether peer assessment can be adopted in class. The setting is a regular small English class in Japan, so the participants are 18 ESL university students and one teacher. First, one misfitting rater is eliminated, and all the other raters, including the teacher, are retained as assessors. The rater measurement report shows that, after this elimination, no raters are misfits. The FACETS map shows that most of the raters, including the teacher, are lenient. In addition, only a few unexpected responses are detected. Overall, this study concludes that peer assessment can reasonably be used as an additional assessment in class.


Introduction
Assessment is an important activity in any educational setting; however, it is quite a burden for teachers. Evaluating students' oral performances is especially troublesome, since teachers can often see those performances only once unless they record them. In such situations, peer assessment can serve as an additional assessment method. Peer assessment involves students in making judgments of their peers' work. Although numerous scholars have attempted to show the educational effectiveness of peer assessment (Brown, 2004; Li, 2017; Liu & Li, 2014; Pope, 2001), some research results suggest that peer assessment may not be useful for formal assessment in class (Anderson, 1998; Cheng & Warren, 1999). In fact, there is little agreement as to the adoptability of peer assessment as an additional assessment in class. Therefore, this study investigates whether peer assessment can reasonably be adopted along with teacher assessment in the EFL classroom.

Literature Review
Studies on peer assessment have been conducted in both L1 (English as a first language) and L2 (English as a second/foreign language) settings. In L1 settings, whether peer assessment is meaningful is often controversial, since students tend to doubt its worth and often feel uncertain and uncomfortable when assessing peers (Anderson, 1998). Domingo, Martinez, Gomariz, & Gamiz (2014) also mention that having to assess more than thirty peers keeps students from assessing seriously. Although the effectiveness of peer assessment has been viewed with skepticism, many research results have shown that it brings various benefits in educational settings, such as promoting student learning (Liu & Li, 2014; Pope, 2001) and students' motivation, autonomy, and responsibility (Brown, 2004; Pope, 2001). Li (2017) studied 77 students participating in a peer assessment activity and found that peer assessment could be meaningful in classrooms if it was anonymous and/or students were trained.
Regarding L2 settings, Cheng & Warren (1999) claimed that correlations between teacher raters and peer raters varied depending on the tasks and the situations. On the other hand, some studies have shown that peer assessment is interrelated with instructor assessment (Jafarpur, 1991; Patri, 2002; Saito & Fujita, 2004). Saito (2008) showed that training could improve the quality of peer assessment. Matsuno (2009) also found that peer assessments were internally consistent and showed few bias interactions. Moreover, Jones & Alcock (2014) found high validity and inter-rater reliability when asking students to compare pairs of scripts against one another and concluded that the students performed well as peer assessors.
As the literature review shows, there is little agreement as to the adoptability of peer assessment as formal assessment. Therefore, the present study aims to give further evidence on whether peer assessment can be adopted in EFL classes.

Procedure
Eighteen university students each gave a presentation of about three minutes in English. They major in engineering at a national university in Japan. They take the English class as a required course, in which they learn how to give an effective English presentation. This is a regular small English as a foreign language class in Japan, taught by only one teacher. Hence, in this class, the raters are one teacher (T) and 18 students (R1-R18). The presenters are also the 18 students (P1-P18). Peer assessment often engages students in both roles, as assessor and assessee, which is the case in this study, too. During and after each presentation, the teacher and the students evaluated the presentation on five domains. Domains refer to aspects or characteristics of presentation quality that are analyzed and separately scored. In the present study, the five domains are posture, eye contact, gestures & voice inflection, visuals, and content. Each domain was scored holistically on a 3-point scale: the rater assigned a score of 1, 2, or 3, representing a presentation that ranged from "inadequate" to "OK" to "Good". The raters also wrote some comments for each presenter.
Before the presentations, the teacher thoroughly explained what each domain is and how to rate the presentations using the domains, over about three successive classes (each class is 90 minutes). The textbook Speaking of Speech (Harrington & LeBeau, 2009), published by Macmillan Language House, was used in class. In the three successive classes, the students learned the physical message (Unit 1 to Unit 3), which covers posture, eye contact, gestures, and voice inflection.
In addition, the teacher explained visuals and content. The presenters were asked to make effective visuals using either Microsoft PowerPoint or hand-written posters. Regarding content, they were asked to choose one type of speech from among an informative speech, a layout speech, and a demonstration speech. They watched a model presentation and were told what would earn good or poor scores.

Analysis
Multifaceted Rasch analysis is conducted using the FACETS computer program, version 3.80.0 (Linacre, 2017). In the analysis, presenters, raters, and domains are specified as facets. The output of the FACETS analysis includes a FACETS map, which provides visual information about differences that might exist among the elements of a facet, such as differences in severity among raters. Presenter ability logit measures are estimated concurrently with the rater severity and domain difficulty logit estimates. These are placed on the same linear measurement scale, so they are easily compared. The FACETS analysis also reports an ability measure and fit statistics for each presenter, a severity measure and fit statistics for each rater, and a difficulty estimate and fit statistics for each domain. It also shows unexpected responses, which may cause presenters, raters, or domains to misfit. In this study, the teacher's assessment is included along with the peer assessment in order to examine whether using both would be beneficial.
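To illustrate how presenter ability, rater severity, and domain difficulty combine on one logit scale, the following is a minimal sketch of a rating-scale many-facet Rasch model for a 3-point scale. It is not the FACETS implementation; the function names, the threshold values, and all numbers are hypothetical and chosen only for illustration.

```python
import math

def category_probabilities(ability, severity, difficulty, thresholds):
    """Probability of each rating category under a rating-scale
    many-facet Rasch model (all arguments in logits).

    thresholds[k] is the assumed step difficulty of moving from
    category k to category k + 1; two thresholds give three categories.
    """
    eta = ability - severity - difficulty
    # Cumulative log-numerators; the lowest category is the reference.
    log_num = [0.0]
    for tau in thresholds:
        log_num.append(log_num[-1] + eta - tau)
    top = max(log_num)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs, scores=(1, 2, 3)):
    """Model-expected rating for the category probabilities."""
    return sum(s * p for s, p in zip(scores, probs))

# Illustrative values only (not estimates from this study):
# the same presenter and domain, rated by a lenient and a severe rater.
steps = [-1.0, 1.0]
lenient = category_probabilities(1.0, -0.5, 0.0, steps)
severe = category_probabilities(1.0, 1.5, 0.0, steps)
print(round(expected_score(lenient), 2), round(expected_score(severe), 2))
```

Because severity is subtracted from ability, a rater located higher on the logit scale lowers the expected rating for the same presenter and domain.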

Initial Analysis
Based on Linacre (2012) and Engelhard & Wind (2016), infit and outfit mean-square values from 0.5 to 1.5 are considered "productive for measurement" (Linacre, 2012: p. 15). Infit and outfit mean-square statistics are summaries of residuals that describe departures from model expectations at the individual facet level. The measures themselves are reported on the logit scale; unlike raw test scores, in which the distances between points may not be equal, the logit scale is a true interval scale. As an initial analysis, misfitting presenters, raters, and domains are examined, and one presenter and one rater are detected as misfits. Fit values above 1.50 indicate that a presenter or rater is idiosyncratic compared with the other presenters or raters. Values below 0.50 indicate that the responses of a presenter or rater simply have too little variation. The infit mean square value of R1 is 1.10. On the other hand, the outfit mean square value of R1 is 1.73, which means there are some unexpected responses. The outfit statistic is useful because it is particularly sensitive to outliers, or extreme unexpected observations (Engelhard & Wind, 2016). The following is R1's residual plot using the logit scale. As the plot shows, R1 has an extreme score. Because of this, R1 is detected as a misfit [Figure 1].
A closer look at R1's ratings shows that R1 rated P14's visuals and content extremely severely, although he rated the other presenters leniently, giving them the highest score of 3 on most domains.
Regarding the misfitting presenter, P14 is a misfit. He was a very funny person in class. He was actively engaged in his presentation; however, he forgot to bring his USB drive, and his visuals were poor. Since six raters gave low scores on his visuals, he is detected as a misfit. This may seem strange, but the peer raters generally assessed their peers leniently while assessing P14's visuals severely, which caused misfit to the Rasch model.
From a pedagogical point of view, all student presentations must be evaluated, regardless of their degree of fit to the Rasch model, because they are graded class presentations. On the other hand, when it has been determined that some student raters did not assess the presentations seriously or did not meet the expectations of the Rasch model, their ratings can justifiably be eliminated in order to improve the precision of the ability estimates. Therefore, the one misfitting presenter is retained and the one misfitting rater is eliminated in the further analysis.
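The infit and outfit mean-square statistics used in this section can be sketched as follows, using their standard definitions: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. The function names and the example numbers are hypothetical; only the 0.5 to 1.5 range and R1's reported values (1.10 and 1.73) come from the text.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one element.

    observed: ratings involving the element
    expected: model-expected ratings for the same observations
    variance: model variance of each observation
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_resid) / sum(variance)
    # Outfit: plain mean of squared standardized residuals, so a single
    # surprising rating on a low-variance observation dominates it.
    outfit = sum(r / v for r, v in zip(sq_resid, variance)) / len(sq_resid)
    return infit, outfit

def is_productive(mean_square, low=0.5, high=1.5):
    """True if the value lies in the 0.5 to 1.5 range used in this study."""
    return low <= mean_square <= high

# Made-up ratings for a mostly lenient rater with one very severe score.
observed = [3, 3, 3, 2, 3, 1]
expected = [2.6, 2.7, 2.5, 2.2, 2.6, 2.9]
variance = [0.30, 0.25, 0.35, 0.40, 0.30, 0.10]
infit, outfit = fit_mean_squares(observed, expected, variance)
print(round(infit, 2), round(outfit, 2))

# R1's reported values: infit is acceptable, outfit is not.
print(is_productive(1.10), is_productive(1.73))
```

In the made-up data, the single extreme rating inflates outfit far more than infit, which is the same pattern reported for R1.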

Summary Statistics
The following are the summary statistics of the multifaceted Rasch measurement [Table 1].
In this table, all of the presenters, raters, and domains seem to be acceptable because their infit and outfit mean square statistics fall within the acceptable range of 0.5 to 1.5. The reliability of separation indicates how well the elements within a facet can be differentiated from one another. In addition, a chi-square statistic tests whether the elements within a facet are exchangeable. As can be seen in the table, the overall differences between elements within the presenter, rater, and domain facets are significant, based on the chi-square statistics (p < 0.05). The reliability of separation for presenters is quite high, which suggests that there are reliable differences in the judged locations of the presenters' abilities on the logit scale. A high reliability of separation was also observed for raters (0.94), which suggests that there are significant differences among the individual raters in terms of severity. This is not ideal for raters; however, it is often the case in real classroom settings. In addition, the domains are significantly different, which suggests that they differ in difficulty.
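As a rough illustration of the reliability of separation and the chi-square statistic discussed here, the sketch below uses textbook definitions: reliability as the share of observed variance in the measures that is not measurement error, and a precision-weighted homogeneity chi-square. It assumes a sample variance and equal treatment of all elements, so it may differ in detail from what FACETS reports; the function name and the measures and standard errors are made up.

```python
def separation_statistics(measures, standard_errors):
    """Return (reliability, chi_square, degrees_of_freedom) for one facet.

    measures: logit measures of the facet's elements
    standard_errors: standard error of each measure
    """
    n = len(measures)
    mean = sum(measures) / n
    # Observed variance of the measures (sample variance, an assumption).
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    # Average error variance across elements.
    error_var = sum(se ** 2 for se in standard_errors) / n
    true_var = max(observed_var - error_var, 0.0)
    reliability = true_var / observed_var if observed_var > 0 else 0.0

    # Homogeneity chi-square: are all elements plausibly at one location?
    weights = [1.0 / se ** 2 for se in standard_errors]
    weighted_mean = sum(w * m for w, m in zip(weights, measures)) / sum(weights)
    chi_square = sum(w * (m - weighted_mean) ** 2
                     for w, m in zip(weights, measures))
    return reliability, chi_square, n - 1

# Made-up severity measures (logits) for five raters.
rel, chi2, df = separation_statistics([-1.2, -0.4, 0.1, 0.9, 1.6], [0.3] * 5)
print(round(rel, 2), round(chi2, 1), df)
```

A reliability near 1 together with a chi-square well above its degrees of freedom indicates that the elements differ by more than measurement error alone, which is the pattern the text describes for raters.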

The Rater's Measurement Report
The following is the detailed rater's measurement report [Table 2].
From the left, the columns show rater IDs, rater severity, error, infit mean square values, and outfit mean square values. As mentioned earlier, mean square values of 0.5 to 1.5 are treated as acceptable. After eliminating the one rater (R1), no rater is identified as a misfit. This indicates that the raters are self-consistent across presenters and domains, which is a good sign for using peer assessment as an additional assessment in class.

The FACETS Map
The following figure [Figure 2] is the FACETS map. The first column is the logit scale that represents presentation achievement. As mentioned earlier, the logit scale is a true interval scale. The next three columns display the logit-scale locations for the three facets: presenters, raters, and domains. In order to interpret the logit-scale locations of the three facets, raters and domains are centered at zero (mean set to zero), and only the average location of the presenter facet is allowed to vary. The second column displays the presenter locations (n = 18). Presenters who are located higher on the logit scale receive higher ratings, and presenters who are located lower on the logit scale receive lower ratings. As can be seen in the map, many of the presenters are located in the upper part of the map, which suggests that the presenters often obtain good ratings on their presentations. The third column shows the raters' severity locations; raters who are located higher on the logit scale are more severe, that is, they assign lower ratings more often. Raters who are located lower on the logit scale are less severe, that is, they assign higher ratings more often. As can be seen, many raters are quite lenient since they are below 0.00 logits. Finally, the locations of the domains on the logit scale reflect their difficulty. The domains that are located higher on the logit scale are associated with more severe ratings, and the domains that are located lower on the logit scale are associated with less severe ratings. In this map, gestures & voice inflection is the most severely rated domain because it obtains severe scores, and content and visuals are the least severely rated because they obtain lenient scores. The last column shows each point on the 3-point rating scale used in this analysis.
In this study, how to rate visuals and content should have been explained more thoroughly. Since the students did not have enough skill to evaluate those domains, they gave some unexpected responses. After giving students some time to practice rating and after eliminating misfitting raters, using multifaceted Rasch analysis may be a good choice. Teachers may compare their assessments with the peer assessments. They can also check the students' comments, which may help them understand why the students assigned their scores. Those procedures could improve the quality of teachers' assessment. As much as they can, teachers should try to improve the quality of their assessment.