Inter-Rater Reliability: Comparison of Checklist and Global Scoring for OSCEs

Bunmi S. Malau-Aduli, Sue Mulcahy, Emma Warnecke, Petr Otahal, Peta-Ann Teague, Richard Turner, Cees Van der Vleuten
School of Medicine, Faculty of Health Science, University of Tasmania, Hobart, Australia
Centre for the Advancement of Learning and Teaching, University of Tasmania, Hobart, Australia
Menzies Research Institute, Hobart, Australia
School of Medicine and Dentistry, Faculty of Medicine, Health and Molecular Sciences, James Cook University, Townsville, Australia
School of Health Professions Education, Faculty of Health, Medicine and Life Sciences, Maastricht University, Maastricht, Netherlands
Email: bunmi.malauaduli@utas.edu.au, sue.mulcahy@utas.edu.au, emma.warnecke@utas.edu.au, petr.otahal@utas.edu.au, peta.teague@jcu.edu.au, richard.turner@utas.edu.au, c.vandervleuten@maastrichtuniversity.nl


Introduction
The Objective Structured Clinical Examination (OSCE) is recognised by medical educators as an opportunity to evaluate essential clinical skills and competencies necessary for progression in the medical course (Harden & Gleeson, 1979; Hodges, 2003; Newble, 2004). It is widely used to overcome many of the inherent validity problems of oral clinical examinations because it offers objective testing in which all examinees are exposed to the same test conditions (Harden et al., 1975; Kirby & Curry, 1982; Downing & Yudkowsky, 2009).
In the OSCE format, a student rotates through a series of time-limited clinical "stations". At each station the student is faced with a simulated scenario, usually involving a simulated patient (SP). The student has to perform the required clinical task under the direct observation of a clinical assessor (examiner), who scores student performance against a checklist and/or a global rating scale. There is a body of research on the use of checklists, which describe precisely the occurrence of particular behaviours, and of global rating scales, which describe the quality of a performance and allow for more interpretation by the examiner (Regehr et al., 1999; Hodges et al., 1999; Hodges et al., 2002). Checklists are designed and incorporated into OSCEs to increase the objectivity and reliability of marking by different examiners. However, some researchers have criticised the validity of checklists due to their tendency to become objectified and trivial in the evaluation of clinical competence (Van der Vleuten et al., 1991; Cohen et al., 1997; Cunnington et al., 1997; Cushing, 2002). These authors have demonstrated the reliability and validity of global rating scales, thereby providing evidence that subjectivity may not be inherently unreliable. Global ratings have also been reported to better evaluate the performance of advanced students as well as negate some of the nuances associated with checklists (Van der Vleuten et al., 1991; Regehr et al., 1998; Hodges et al., 1999). Some studies have compared the psychometric properties of checklists and global rating scales on OSCEs and concluded that global rating scales scored by experts showed higher inter-station reliability, better construct validity and better concurrent validity than did checklists (Hodges et al., 1997; Regehr et al., 1998).
Intensive examiner training improves inter-rater reliability as it ensures that all raters interpret item descriptions similarly and apply similar standards to students' performance (Williams et al., 2003; Spencer & Silverman, 2004). Although earlier studies indicated that examiner training varied in effectiveness as a function of medical experience (Newble et al., 1980; Van der Vleuten et al., 1989), more recent studies have demonstrated the high impact of examiner training on the consistency of scoring (Humphrey-Murto et al., 2005; Chesser et al., 2009). However, establishing excellent examiner training sessions remains a major problem for medical schools facing increasing numbers of students, difficulty finding sufficient numbers of experienced examiners for multi-site exams and the challenge of getting time-poor clinicians away from their other activities to attend examiner-training sessions. Innovative and feasible approaches to tackling these tasks are necessary. The primary purpose of this study was to compare the inter-rater reliabilities of checklist and global rating scores of examiners who were exposed to an online training program (to standardise scoring techniques) across two medical schools. The study also examined examiners' perceptions of the feasibility and usability of the e-scoring program.

Study Context
In November 2010, two Australian medical schools (A and B) participated in a collaborative inter-school study of clinical competence in which three OSCE stations were developed and embedded in the end-of-year clinical examinations (of the 3rd and 4th years, respectively). School A runs a five-year undergraduate medical programme, while School B runs a six-year undergraduate programme. Both schools have similar horizontally and vertically integrated outcomes-based curricula. The selected year groups were chosen because of their comparable levels of intended learning outcomes.

The Shared OSCE Stations
The three OSCEs (chest pain, diabetic foot and gallstones) comprised eight-minute stations and were administered to a total of 119 third-year medical students at School A and 94 fourth-year medical students at School B. The three OSCE stations covered a range of core clinical competencies with which examiners at both schools were familiar. Between five and nine task-specific checklist items were developed for each case. The behaviourally anchored 4- to 7-point rating scales assessed degree of coherence, empathy, and verbal and non-verbal expression.

Examination Procedure
The examination at School A was conducted over a two-day period for two different cohorts of students, while at School B it was a one-day event with the three shared OSCEs embedded in a 12-station OSCE examination. Two concurrent sessions of each station were conducted at School A and four were conducted at School B, each with one SP and one examiner. Clearance was obtained from the relevant ethics committee for this study.

Examiners
Three examiners were independently selected from each school to serve as external examiners, one on each of the shared stations, double marking with the internal examiners at the other school. Each external examiner independently double marked a total of 20 student observations. Each examiner rated student performance by first scoring the task-specific checklist and then completing a global rating. The two components were then summed to generate an overall performance score.

Examiner Training
To aid examiner training and standardise marking across the two examination sites, an OSCE e-scoring tool was developed and set up on a secure intranet site, in the on-line Blackboard Learning System Vista environment. The three shared OSCE scenarios were videotaped and used for the on-line examiner training; PGY1 residents (interns) were recruited to role play as medical students and SPs were recruited from the SP pool. Informed consent and confidentiality agreements were obtained from all the video participants.
A total of 24 examiners were involved in the on-line OSCE training program. All internal examiners (those on the shared OSCE stations only) and external examiners were invited via email and given login access and instructions on how to use the program; the video clips were made accessible to the examiners one week prior to the examination. The examiners were able to view the recordings in their own time and assess the interns' performances.
Each examiner was asked to watch two unlabelled scenarios (poor and good performance) of the OSCE case which they had been assigned to examine. After watching each scenario, they were required to assess the performance using the marking sheet that was provided in another window. The station information and criteria for marking were also made available. After completing and submitting their marking/scoring sheet, the examiners were able to view and compare the scores they had given for the checklist task and the global rating with others already submitted. This enabled examiners at both sites to achieve consensus regarding what constituted unsatisfactory, borderline or satisfactory performance. The SPs on the shared OSCE stations were allowed to view the video clips, and they discussed expected performance face-to-face with the internal examiners.

Quantitative Data
Descriptive statistics of the on-line training scores and comparative analyses of checklist scores and global ratings in both schools were calculated using SAS. The difference between internal and external examiners' scores was tested using a two-sample t-test. Generalisability analysis was used to test for inter-rater reliability across sites. Multilevel mixed-effects linear regression in Stata was used to calculate the variance components and to evaluate the magnitude of the different sources of variation affecting the measurement. Different pairs of raters assessed examinees at each of the three stations, and the examination at School A was conducted over two days with a different cohort on each day. Because of this disconnected design, variance components for each station within each site were estimated separately and the estimates were pooled across sites to eliminate confounding of the proficiency of examinee groups and the stringency of examiner groups across sites. For both checklist scores and global ratings, a single-facet, random-effects, persons/examinees (P) by raters/examiners (R) design [P × R], in which the person-by-rater interaction is confounded with residual error (PR,e), was used to assess inter-rater reliability. A D-study was used to obtain reliability estimates.
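As an illustration of the P × R decomposition described above, the following Python sketch estimates the three variance components (persons, raters, and interaction plus residual) from a fully crossed score matrix and derives a relative G coefficient. It uses the classical ANOVA expected-mean-squares estimator, which is a substitute for, not a reproduction of, the multilevel mixed-effects regression reported here. The function names and the small score matrix are hypothetical and are not study data.

```python
from statistics import mean


def pxr_variance_components(scores):
    """Estimate variance components for a crossed persons-by-raters design.

    scores[p][r] is the score given to examinee p by rater r, with exactly
    one observation per cell (hypothetical layout, for illustration only).
    """
    n_p = len(scores)
    n_r = len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    rater_means = [mean(scores[p][r] for p in range(n_p)) for r in range(n_r)]

    # Sums of squares for persons, raters, and the residual (PR,e).
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Solve the expected-mean-squares equations; negative estimates are
    # truncated at zero, a common convention.
    return {
        "P": max((ms_p - ms_res) / n_r, 0.0),
        "R": max((ms_r - ms_res) / n_p, 0.0),
        "PR,e": ms_res,
    }


def g_coefficient(vc, n_raters):
    """Relative G coefficient when scores are averaged over n_raters raters."""
    return vc["P"] / (vc["P"] + vc["PR,e"] / n_raters)


# Hypothetical scores for five examinees, each double marked by two raters.
demo_scores = [[7, 8], [5, 5], [9, 8], [4, 5], [6, 7]]
vc = pxr_variance_components(demo_scores)
print(vc)
print(round(g_coefficient(vc, n_raters=2), 3))
```

In a design like the one described here, such a function would be applied separately to each station within each site, with the resulting estimates pooled afterwards.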

Qualitative Data
To capture their perceptions of the on-line training/e-scoring program, examiners were prompted to provide anonymous responses to four open-ended on-line survey questions, which were administered immediately after they had completed their scoring of the OSCE scenarios. The examiners were asked to 1) comment on aspects they liked most about the e-scoring program; 2) comment on aspects they did not like; 3) suggest improvements to the program; and 4) provide their views on the effect of the program on future assessments. The survey data were collated and emerging themes independently coded and confirmed by two researchers. Illustrative quotes are reported verbatim in Appendix 1.

Results
Table 1 presents the mean checklist scores and global ratings ± the standard deviation (SD) given by co-examiners during the actual examination. There were no statistically significant differences in the mean scores given by the internal and external examiners at either school.
The estimated variance components from the generalisability analyses for checklist scores and global ratings are presented in Table 2. Pooled score variance attributed to student ability was higher for global ratings than for checklist scores (90.2% vs 68.3%). Rater effect accounted for 1.4% and 0% of total variance in checklist scores and global ratings, respectively. Score variance due to interaction and residual error was larger for checklist scores than for global ratings (30.3% vs 9.7%). G coefficients for checklist scores and global ratings are also presented in Table 2. G coefficients varied between cases, with the lowest values being obtained on the diabetic foot station at both schools. In addition, reliability estimates for the global ratings were higher than for the checklists.
Note (Table 2): a G coefficients for this study with 2 raters; * Variance component estimates for persons (P), raters (R) and residual (PR,e), the latter reflecting variance due to person-by-rater interaction (PR) and unidentified sources of error.
Survey results showed that examiners valued the process because it gave them an opportunity to see a "dry run" of the station and allowed them to set the "expected standard" for the station prior to the actual exam (Appendix 1). They also indicated that this sort of tool should be used more widely in OSCEs. However, they pointed out that scoring borderline performance, rather than good or poor performance, would make the e-scoring process more useful.
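To show how variance components of this kind translate into reliability estimates, the sketch below runs a simple D-study on the pooled percentages quoted in the text for the variance components. It assumes the standard single-facet formulas: the relative G coefficient divides person variance by person variance plus residual variance over the number of raters, and the absolute (phi) coefficient also places rater variance in the error term. The function name is hypothetical, and because these inputs are pooled figures, the outputs need not match the station-level coefficients in Table 2.

```python
def d_study(var_p, var_r, var_res, n_raters):
    """Return (relative G, absolute phi) for scores averaged over n_raters.

    Inputs may be raw variances or percentages of total variance; the
    common scale cancels in the ratios.
    """
    relative_error = var_res / n_raters
    absolute_error = (var_r + var_res) / n_raters
    g = var_p / (var_p + relative_error)
    phi = var_p / (var_p + absolute_error)
    return g, phi


# Pooled percentages of total variance quoted in the text: (P, R, PR,e).
pooled = {
    "checklist": (68.3, 1.4, 30.3),
    "global rating": (90.2, 0.0, 9.7),
}

for name, (var_p, var_r, var_res) in pooled.items():
    for n_raters in (1, 2):
        g, phi = d_study(var_p, var_r, var_res, n_raters)
        print(f"{name}, {n_raters} rater(s): G = {g:.3f}, phi = {phi:.3f}")
```

Under these assumptions, the relative coefficients come out at roughly 0.69 (checklist) and 0.90 (global rating) for one rater, and roughly 0.82 and 0.95 for two raters.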

Discussion
The low variance attributable to raters in our study indicates high inter-rater reliability, meaning that examiners differed little in overall stringency when scoring the same students. The results also indicate that there are no significant differences in average scores across raters; hence the assessment reflects the competence of each examinee rather than the examiner who scored them. Our results show higher inter-rater agreement for global ratings than for checklist scores. A growing body of literature has reported that global ratings have higher reliability than checklist scores and are better able to discriminate between examinees (Hodges et al., 1999; Govaerts et al., 2002; Hodges et al., 2003; Wilkinson et al., 2003). The higher examinee variance and lower residual variance estimates observed for the global ratings in this study, compared with the checklist scores, echo these findings. McManus et al. (2006) reported that thorough selection, monitoring and training did not eliminate the examiner stringency/leniency effect. Our study indicates otherwise, given the low variance observed due to examiner differences. This might be a result of the online training, which allowed examiners to agree on the "expected standard" for each station prior to the actual examination. The use of two examiners to reduce examiner bias has been proposed (Norcini, 2002; Wilkinson et al., 2003), but our findings indicate that, with on-line examiner training, reliabilities of 0.7 and above for high-stakes examinations can be achieved even with one examiner per station, suggesting that there is little or no benefit in using examiners to double mark. Interestingly, external examiners tended to give lower scores than internal examiners, although the differences were not statistically significant; this may indicate the effect of examiner familiarity with candidates as a potential source of bias (Stroud et al., 2011).
Researchers have suggested that variability in performance across cases is not simply related to content variation, but to other factors, such as pattern recognition based on irrelevant contextual features of the case (Govaerts et al., 2002). The observed varying magnitudes of estimated variance components across stations (cases) may indicate that the relative ordering of cases and the specificity of case content have a large effect on the variance. There is therefore a need to explore the magnitude of variance attributable to case, content and/or context specificity.
The survey results showed that the e-scoring program offered training for both quality assurance and appraisal purposes. The examiners valued the process as it allowed them to reach consensus about their scoring techniques and resulted in similar trends of scoring in both schools. Furthermore, given the busy schedule of clinicians and the challenges of getting away from their other activities to attend examiner-training sessions, the e-scoring package allowed examiners to use it in their own time.
Most of them found it easy to navigate through the program, but a few expressed difficulties in understanding the technology as well as the statistics generated for comparison of scores.
The examiners also suggested that scoring of borderline performances would be more useful, indicating that clearly good or poor performances were easier for them to identify and agree on, particularly good performance. This is a valid point, given that borderline students are the ones medical educators are most concerned about. It is important for examiners to be able to make accurate pass/fail decisions so that only competent students are allowed to progress academically. On the whole, the examiners concurred on the efficacy and possibility of wider use of the e-scoring program.
The major limitation of this study is the small number of stations used. In addition, completing the global rating scales after the checklists could have affected examiner scoring of student performance. Due to the design of the study, inter-case reliability and the comparison between trained and non-trained examiners could not be determined. Further studies should explore these areas.

Conclusion
The results of this study suggest that global rating scales are a more appropriate summative measure than checklists in assessing examinees on performance-based tests, providing further support for the reliability of subjective examiner judgments. This study also indicates that measurement error due to examiner variance may be largely eliminated with the use of an on-line examiner training program. The tool holds great promise for high-stakes performance-based assessments conducted across multiple sites and will afford time-poor, geographically separated clinicians the opportunity to better engage in the assessment process.

Table 1. Descriptive statistics for checklist scores and global ratings at both sites (mean scores ± standard deviation).

Table 2. Variance component estimates and G coefficients for checklist scores and global ratings.