Investigating the Reliability and Validity of Self and Peer Assessment to Measure Medical Students’ Professional Competencies

The use of peer assessment through a multisource feedback process has gained recognition as a reliable and valid method to assess the characteristics of professionals and trainees. A total of 168 first-year medical students completed a 15-item questionnaire to self-assess their professional work habits and interpersonal abilities. Each student was expected to identify 8 first-year classmates to complete a corresponding 15-item peer assessment. Although the self and peer assessment questionnaires had strong reliability (Cronbach’s α = 0.85 and 0.91, respectively), exploratory factor analyses yielded three- and two-factor solutions, respectively. The third factor was associated with items related to students’ personal attributes. Mean scores on the self assessment were significantly lower than on the peer assessment for all 15 items (Cohen’s d = 0.27 to 1.39, p < 0.001). A decision study analysis found that 7 peer assessors were needed to achieve a generalizability coefficient of 0.70. The findings suggest some inconsistencies with regard to the construct validity and stability of measures between self and peer assessments. Nevertheless, self-awareness of their strengths and limitations is recommended as part of students’ development in a profession that emphasizes self-regulation.


Introduction
Professionalism is a cornerstone of medical practice, reflecting the qualities we demand of our practitioners and the expectations we have for the medical students we accept into the profession (Papadakis et al., 2005). Although various approaches can be used to assess professionalism, direct observation compiled through a 360-degree evaluation or multi-source feedback (MSF) format has been recognized as one of the most effective (Arnold, 2002; Bandiera et al., 2006; ACGME, 2013). Professionalism is a multi-faceted construct that ranges from the personal (e.g., the ability to self-reflect and self-regulate) to the attitudinal (e.g., altruism, honesty, integrity) and the behavioral (e.g., dutifulness, collegiality); MSF can nonetheless serve as a formative or summative feedback process for evaluating specific components of it. For example, within medical practice, MSF has been used successfully to evaluate physicians' professional attitudes and behaviors from a range of stakeholder perspectives, including other physicians, coworkers, and patients themselves (Violato & Lockyer, 2006; Allerup et al., 2007; Brinkman et al., 2007; Lockyer & Clyman, 2008).
The introduction of MSF at the medical school level to assess professional attitudes and behaviors reflects a progressive move towards acknowledging the importance the profession places on this role or core set of competencies (Epstein et al., 2004; ACGME, 2013). There are constraints, however, to introducing an MSF process in medical schools: for students, assessment is restricted to peers, as they are the only others who will have observed and know each other well enough. In addition, peer assessment is limited to what students can realistically observe about each other in the professional attributes displayed during shared educational activities. Nevertheless, the peer assessment protocol has shown promise as a reliable and valid method to assess the professional competence of medical students (Epstein et al., 2004; Dannefer et al., 2005). Used in part to assess domains of competence such as interpersonal, humanistic, and teamwork skills, the 15-item peer assessment form was shown to distinguish between two domains of students' professional competence: consistency in their work-related habits and interpersonal habits.
Subsequent research on the use of the peer assessment protocol with undergraduate medical students has expanded to explore the effects of rater selection (Lurie et al., 2006a), changes in self-perceived abilities among men and women (Lurie et al., 2007a), the relationship between peer assessments and other measures (e.g., Dean's letter rankings and ratings by internship directors) (Lurie et al., 2007b), and temporal and group-related trends (Lurie et al., 2006b). In each case, the peer assessment form is structured around two dimensions, referred to as work habits and interpersonal abilities. In a qualitative study of the impact of the peer assessment protocol at the University of Rochester School of Medicine and Dentistry, medical students reported that they found the process transformative and a useful source of feedback that contributed to their own professional development (Nofziger et al., 2010).
In this study, we describe the implementation and results of a formative self and peer assessment protocol as a measure of medical students' professional competence in their first and second years at a medical school in Canada. The responses were analyzed to determine the reliability (i.e., internal consistency, test-retest, and generalizability coefficients) and construct validity (i.e., factor analysis) of the self and peer questionnaires and to explore relationships between self and peer assessment ratings.

Method
Participants
As a component of the student feedback process in the undergraduate medical education program, a total of 168 first-year medical students at the University of Calgary completed an online 15-item self assessment form in the middle of their first year and again in the middle of their second year. In their first year, all participants were asked to identify 8 classmates to complete a peer assessment version of the questionnaire using the same 15 items. This study was approved by the Conjoint Health Research Ethics Board of the University of Calgary, and signed consent was obtained from all participants.

Questionnaire
The 15-item Likert form was developed through initial research by Dannefer et al. (2005) on a peer assessment protocol to measure medical students' professional competencies as a function of observable behaviors (Appendix). Their findings supported a two-factor (two-subscale) measure with high internal reliability (Cronbach's alpha greater than 0.80 for each subscale) that assesses students' professional work habits (WH) and interpersonal abilities (IA). The 15th item is not considered to belong to either factor or subscale and is treated as an overall assessment of the individual's potential for professional competency, reflecting whether or not the assessor is concerned about this person's future patients. Each item is scored on a 5-point scale anchored by descriptors at each end, with the option to circle UA for "unable to assess". For example, on item 13 a score of 1 or 2 corresponds to the descriptor "behavior is frequently inappropriate", and a score of 4 or 5 to "behavior is always appropriate".
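The scoring rules above leave open how "UA" responses enter a student's aggregated peer score. A minimal sketch, assuming that UA is simply treated as missing and that each item is averaged over the remaining peer ratings; the source does not state how UA was handled, and the function name `item_means` and the shortened three-item toy forms are hypothetical:

```python
from statistics import mean


def item_means(peer_forms):
    """Average each item across peer forms, skipping "UA" responses.

    peer_forms: list of completed forms; each form is a list of item
    responses, where a response is an int from 1 to 5 or the string "UA".
    Returns one mean per item, or None if every rater chose UA for it.
    Treating UA as missing is an assumption made for illustration.
    """
    n_items = len(peer_forms[0])
    means = []
    for i in range(n_items):
        # Keep only numeric ratings for item i; "UA" is treated as missing.
        ratings = [form[i] for form in peer_forms if form[i] != "UA"]
        means.append(mean(ratings) if ratings else None)
    return means
```

For example, `item_means([[5, 4, "UA"], [4, 4, 3], [5, "UA", 3]])` averages 5, 4, and 5 for the first item, but only the two numeric ratings for the second and third items. An alternative design would drop raters with any UA response, which discards more information.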

Statistical Analysis
The construct validity of the self and peer assessment questionnaires was investigated with exploratory factor analysis using principal components extraction and varimax rotation. Internal reliability was estimated with Cronbach's alpha, and generalizability coefficients were calculated in a decision study analysis to determine the optimal number of peer assessors required to obtain a generalizability coefficient greater than 0.70 (Brennan, 2001).
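For reference, Cronbach's alpha is k/(k - 1) × (1 - Σ item variances / variance of total scores), and in a single-facet design with raters nested within persons (r:p), a decision study projects the G coefficient for n raters as σ²(p) / (σ²(p) + σ²(r:p)/n). The sketch below implements these standard formulas; it is not the authors' software (the source cites Brennan, 2001, for the method), and the function names are hypothetical. As illustrative inputs it uses the variance proportions reported in the Results (25% for persons, 75% for peer assessors nested within students).

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)


def g_coefficient(var_person, var_rater_in_person, n_raters):
    """Projected G coefficient for raters nested within persons (r:p).

    The rater-within-person variance is averaged over n_raters, so adding
    raters shrinks the error term while the person variance is unchanged.
    """
    return var_person / (var_person + var_rater_in_person / n_raters)


def min_raters(var_person, var_rater_in_person, target=0.70, max_n=50):
    """Smallest number of raters whose projected G reaches the target."""
    for n in range(1, max_n + 1):
        # A small tolerance guards against floating-point round-off at the boundary.
        if g_coefficient(var_person, var_rater_in_person, n) >= target - 1e-9:
            return n
    return None
```

With `g_coefficient(0.25, 0.75, 8)` the projected G is about 0.73, and `min_raters(0.25, 0.75)` returns 7, in line with the reported values. Small discrepancies at the extremes (about 0.57 rather than 0.58 for 4 raters) are likely due to the 25% figure being rounded.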
Mean differences between self and peer assessment ratings on each of the 15 items were compared using independent samples t-tests with an effect size (Cohen's d) analysis. The magnitude of each effect size was interpreted following Cohen's (1988) conventions of d = 0.20 as "small", d = 0.50 as "medium", and d = 0.80 as "large". In addition, mean differences by sex (men vs. women) in the self and peer assessments were investigated using independent samples t-tests and effect size analyses.
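The source does not state which formula for Cohen's d was used. A minimal sketch, assuming the common pooled-standard-deviation form for two groups of equal size, d = (M2 - M1) / sqrt((SD1² + SD2²) / 2). Applied to the year-one and year-two self assessment totals reported in the Results, it reproduces the reported d = 0.62, which suggests (but does not confirm) that this is the variant used. The function name is hypothetical.

```python
import math


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d with the pooled SD of two equally sized groups.

    Positive when group 2 has the higher mean. Assumes equal group sizes,
    so the pooled SD is the root mean square of the two SDs.
    """
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m2 - m1) / pooled_sd
```

For example, `cohens_d(62.5, 5.47, 65.6, 4.54)` is about 0.62, and the PA subscale values (M = 20.6, SD = 2.30 vs M = 21.8, SD = 1.90) give about 0.57. For paired data, variants that divide by the SD of the difference scores give different values, so the choice of formula should be reported alongside d.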
Because discrepancies between self and peer assessment have been found on other measures of professional competence using multisource feedback, medical students' self assessments were classified into quartile categories (i.e., at or below the 25th percentile, 26th to 50th percentile, 51st to 75th percentile, and at or above the 76th percentile) based on their mean total scores across all 15 items and compared with their corresponding peer assessments.
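The quartile comparison can be sketched as follows, assuming percentile ranks computed as 100 × rank / n over the self assessment totals, with ties broken by input order. Both choices, the function names, and the eight-student example are hypothetical, since the source does not specify them.

```python
from statistics import mean


def quartile_groups(self_totals):
    """Assign each student to a quartile (1 to 4) by self assessment total.

    Percentile rank is 100 * rank / n, with rank 1 for the lowest total.
    Ties are broken by input order, a simplifying assumption.
    """
    n = len(self_totals)
    order = sorted(range(n), key=lambda i: self_totals[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):
        pct = 100 * rank / n
        if pct <= 25:
            groups[i] = 1
        elif pct <= 50:
            groups[i] = 2
        elif pct <= 75:
            groups[i] = 3
        else:
            groups[i] = 4
    return groups


def compare_by_quartile(self_totals, peer_totals):
    """Mean (self, peer) totals within each self assessment quartile."""
    groups = quartile_groups(self_totals)
    summary = {}
    for q in range(1, 5):
        idx = [i for i, g in enumerate(groups) if g == q]
        if idx:
            summary[q] = (mean(self_totals[i] for i in idx),
                          mean(peer_totals[i] for i in idx))
    return summary
```

With hypothetical self totals `[50, 60, 55, 70, 65, 58, 62, 68]` and peer totals `[66, 67, 68, 69, 66, 67, 68, 69]`, the two lowest self-raters form quartile 1, with a mean self total of 52.5 against a mean peer total of 67. Grouping on self scores only, as the source describes, means peer means can stay flat across quartiles even when self scores vary widely.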

Results
The Cronbach's alpha coefficients for the self and peer assessment questionnaires were α = 0.85 and 0.91, respectively. An average of 7.5 peer assessment questionnaires was completed for each medical student's self assessment. On all 15 items shown in Table 1, medical students' mean ratings on the self assessment questionnaire were significantly lower than their corresponding peer assessment ratings (p < 0.001). Overall, the medical students consistently scored themselves lower than their peers did, with a mean effect size difference of d = 0.69 across all 15 items (range d = 0.27 to 1.39). A large effect size difference (d = 1.17, p < 0.001) was found in a comparison of total scores between the self and peer assessment groups.
As shown in Figure 1, when students' self assessment scores were compared with their corresponding peer assessments by quartile groupings, medical students in the lowest three quartiles scored themselves significantly lower than their peer assessors did (p < 0.001). In particular, those in the lowest self assessment quartile underestimated their performance competency in comparison with their peer assessors by 12.3% of the total score. There were, however, no significant differences between self and peer assessment mean total scores in the top (fourth) quartile. Regardless of the discrepancies found in the self assessment percentile rankings, peer assessments did not differ significantly and were consistent across all quartiles (67.7% to 69.7%).
As shown in Table 2, the exploratory factor analyses resulted in three- and two-factor solutions for the self and peer assessment instruments, respectively. In both cases, two of the factors were identified as WH and IA, based on a previous factor analysis study (Dannefer et al., 2005). The third self assessment factor was labeled personal attributes (PA) because its items describe individuals' attributes in asking for and implementing feedback (item loading = 0.751), admitting mistakes or being truthful (0.634), and collaborating through the sharing of information or resources (0.626). Although some items are associated with only the WH (items 1, 2, 3, and 14) or IA (items 4, 5, 12, and 13) subscales on both questionnaires, the remaining items tend to cross-load between factors (items 6, 7, 8, 9, and 10). The internal reliability coefficients for the peer assessment questionnaire were α = 0.87 for the WH and α = 0.86 for the IA subscales. For the three-factor solution derived for the self assessment questionnaire, the internal reliability coefficients were lower at α = 0.60 (WH), α = 0.80 (IA), and α = 0.77 (PA), reflecting in part the reduced number of items associated with each factor or subscale. The total percentages of the variance accounted for by the self and peer assessment solutions were 53.6% and 57.3%, respectively; only item loadings greater than 0.32 (i.e., accounting for more than 10% of the variance) are reported in Table 2. The generalizability (G) coefficient for the peer assessment with 8 raters on the 15-item checklist was Eρ² = 0.73. A decision study analysis for this single-facet nested design (i.e., peer assessors nested within individual medical students) yielded mean G coefficients ranging from 0.58 with 4 peer assessors to 0.80 with 12 peer assessors (Figure 2). Persons (medical students) accounted for 25% of the variance, and the remaining 75% was attributable to the effect of peer assessors nested within students.
A subsequent administration of the 15-item questionnaire was completed approximately 12 months later, in the middle of the students' second year. The test-retest correlation was r = 0.44, and students' total scores increased significantly from year one (M = 62.5, SD = 5.47) to year two (M = 65.6, SD = 4.54; p < 0.001; d = 0.62). An exploratory factor analysis of the year-two self assessment questionnaire also confirmed a three-factor solution (accounting for 52% of the variance). In paired-samples t-tests between the two administrations, students rated themselves significantly higher on each of the three subscales: WH (from M = 19.9, SD = 1.94 to M = 20.8, SD = 1.75; p < 0.001; d = 0.48), IA (from M = 17.9, SD = 1.87 to M = 18.4, SD = 1.63; p < 0.01; d = 0.28), and PA (from M = 20.6, SD = 2.30 to M = 21.8, SD = 1.90; p < 0.001; d = 0.57). Students also rated themselves significantly higher in year two on item 15, regarding the effectiveness of their future healthcare practice with patients (from M = 4.25, SD = 0.60 to M = 4.53, SD = 0.50; p < 0.001; d = 0.51).

Discussion
The major findings of the present study are that: 1) self assessments of professional competence were significantly lower than peer assessments on all questionnaire items; 2) in contrast to the two-factor solution for the peer assessment questionnaire, the three-factor solution for the self assessment included an additional subscale associated with personal attributes; 3) women medical students were rated significantly higher, either by themselves or by their peers, on more items than were men; and 4) on a one-year test-retest of the self assessment questionnaire, students' self-reported ratings increased significantly from year one to year two on total and subscale scores.
The discrepancies found between self- and peer-reported assessments across each of the items rated raise the question of why individual students tend to perceive their professional competencies as lower than the mean ratings provided by a group of peers with whom they interact on a regular, if not daily, basis.
When investigated by quartile groups based on self assessment totals, a majority of medical students appear to significantly underestimate themselves in comparison with their peers. This miscalibration between self and peer assessment ratings is typical of physicians' self and peer assessments as well (Violato & Lockyer, 2006). It would seem that the miscalibration of professional competencies evident in experienced physicians begins early in one's medical career, as the present results suggest, and likely reflects a general discrepancy in human self assessment.
Although methods for assessing competencies such as work habits, interpersonal abilities, and personal attributes are less well developed or tested, the peer assessment protocol and the 15-item self and peer assessment forms provide a potentially reliable and valid method to introduce sources of feedback that can help medical students reflect on and enhance their own professional development. The findings of this study, however, demonstrate the difficulty of developing MSF tools or questionnaires that consistently measure similar constructs across different types of raters (e.g., self, peer, co-workers, patients) as a function of observable behaviors (Lockyer & Clyman, 2008). Therefore, one of the main limitations in using the results is some uncertainty about the specific factors being measured, as the self assessment questionnaire appears to be more multidimensional, with an additional third factor (i.e., personal attributes). Another limitation is that, overall, students are left with the impression from their peer groups that they are rated much higher (or at the same level) on each of the 15 items than their own ratings on the corresponding self assessment questionnaire (Colthart et al., 2008).
There is an expectation that medical schools provide students with feedback related to their clinical and professional performance. The current assessment formats for medical students focus primarily on testing clinical knowledge and skills, without any means of formative feedback for professional development. The use of a self and peer assessment MSF process in the initial years of medical school provides an opportunity for students to become engaged in understanding how they are performing on other aspects of their non-cognitive skills development (e.g., ability to collaborate with others, communication effectiveness, managing their time and resources). Nevertheless, the relevant constructs that apply to their roles and responsibilities as future physicians are still not well defined. With the move towards competency-based frameworks in residency programs, MSF assessments will need to better reflect measures associated with the key competencies identified for practicing physicians (ACGME, 2013; Frank, 2005).

Figure 1. Mean percentile on four quartile groups comparing self to peer assessment total scores.

Table 1. Means (standard deviations) and effect size differences between items on the self and peer assessment instruments.
a Independent samples t-tests between self and peer assessment items, p < 0.001.

Table 2. Three- and two-factor solutions for the self and peer assessment instruments, respectively.