Monitoring recovery by physical therapists using the FIM scale during rehabilitation programs: An inter-rater and intra-rater reproducibility study

Our aim was to evaluate the reproducibility of the Functional Independence Measure (FIM) scale when assessed by physical therapists in the routine setting of a Rehabilitation Hospital. We included a consecutive series of patients with spinal cord or cerebral lesions. Each of the 50 selected patients was evaluated by two of the 5 experienced physical therapists participating in the study. The degree of inter-rater and intra-rater agreement was measured by a weighted κ statistic, κ for perfect agreement, and κ for agreement with tolerance. The weighted κ index for inter-rater agreement on the FIM score was in the almost perfect range (κ = 0.87; 95% CI = 0.79-0.95), but a 20-point tolerance was necessary to reach a κ value of 0.81 (95% CI = 0.66-0.95). Agreement was substantial or almost perfect for most subscales, but the κ index with 1-point tolerance reached the almost perfect rating for comprehension only. For intra-rater agreement, the weighted κ index was in the almost perfect range for the FIM score and for all subscales; the κ index reached the almost perfect range with a 4-point tolerance for the FIM score and with a 1-point tolerance for all subscales except interpersonal relations. The FIM is useful for monitoring patient improvement during rehabilitation treatment, particularly when assessed by the same physical therapist.


INTRODUCTION
Rehabilitation services require instruments for measuring disability that are suitable for following patients over time and for customizing rehabilitation protocols. Moreover, disability rating scales that can be assessed by physical therapists would be of greatest interest, since they favour closer monitoring of the clinical course and lower costs [1,2]. The FIM is considered a useful support for clinical practice in each rehabilitation area, although this scale seems less useful for spinal cord injury [1-3]. It is an easy-to-use, standardised and robust general measure of functional disability. Previous studies showed high intra-rater and inter-rater reliability of the FIM, indicating the internal consistency of the scale, although it was not sensitive enough to assess changes in patients with tetraplegia [1,4-11]. However, these studies failed to clarify the minimum change that reflects a real change in the patient's status rather than random variability. To be helpful in monitoring recovery during treatment, a scale should be suitable for physical therapists and should prove reliable enough to show even minimal changes in the patient's disability.
Reproducibility studies estimate the probability that the same score is attributed when the patient is retested, and hence the likelihood that an improved score reflects a true clinical improvement. Reproducibility also reflects the reliability of the scale and the training of the raters. Scales showing a high reproducibility index when administered by physical therapists may be adopted for monitoring the clinical course of patients during rehabilitation programs. Therefore, we examined the reproducibility of the FIM scale when assessed by physical therapists in a sample of patients with mild to severe disability in the routine setting of a Spinal Unit and a Department of Physical Medicine of a Rehabilitation Hospital.

METHODS
In the present study we included a consecutive case series of 50 patients admitted to the Spinal Unit and the Department of Physical Medicine of the Rehabilitation Hospital "Casa di Cura San Raffaele" of Sulmona, L'Aquila, Italy, between March and August 2009 because of neurological deficits caused by spinal cord injury or by neurodegenerative, vascular or inflammatory diseases. All patients provided informed consent in person, according to national and international regulations. Comatose patients were excluded. The FIM is an 18-subscale ordinal scale that rates the level of assistance required to perform various activities of daily living using a seven-level scoring system, with total scores ranging between 126 (normal status) and 18 (totally dependent) [12]. Five experienced physical therapists, who routinely used the FIM scale, were arranged in 25 combinations in a balanced design in which each physical therapist was in turn once the first and once the second rater. Each pair of physical therapists evaluated 2 patients, assigned a score to each of the 18 subscales and computed the total score. Each patient was randomly assigned to one of the 25 pairs of raters and was evaluated twice, within an interval of 24 hours (±5 hours). The raters were instructed to evaluate each patient independently and not to communicate the scores to each other or to the patient, in order to preserve the independence of the assessments. In ten instances the first and the second rater were the same physical therapist. These patients were used to evaluate intra-rater reliability, while the remaining 40 patients were used for inter-rater reliability.
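The patient counts in this design can be checked with a short sketch. It assumes that the 25 combinations are all ordered (first rater, second rater) pairs of the five therapists, including the five pairs in which the same therapist rates twice; this reading is an assumption, chosen because it reproduces the 10 intra-rater and 40 inter-rater patients reported above. The therapist labels are hypothetical.

```python
from itertools import product

# Hypothetical labels for the five physical therapists.
therapists = ["PT1", "PT2", "PT3", "PT4", "PT5"]
PATIENTS_PER_PAIR = 2  # each pair of raters evaluated 2 patients

# Assumed design: every ordered (first rater, second rater) pair,
# including pairs where the same therapist rates on both occasions.
pairs = list(product(therapists, repeat=2))
intra_pairs = [p for p in pairs if p[0] == p[1]]
inter_pairs = [p for p in pairs if p[0] != p[1]]

n_total = len(pairs) * PATIENTS_PER_PAIR
n_intra = len(intra_pairs) * PATIENTS_PER_PAIR
n_inter = len(inter_pairs) * PATIENTS_PER_PAIR

print(len(pairs), n_total, n_intra, n_inter)
```

Under this assumption each therapist appears five times as first rater and five times as second rater, which is one way to obtain a balanced arrangement.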
We performed a graphical descriptive analysis of the FIM scores in order to evaluate the distribution of discrepancies over the scale range. The degree of inter-rater and intra-rater agreement was measured with the weighted κ statistic, which accounts for the severity of disagreement. Moreover, κ for perfect agreement and κ for agreement with tolerance were also computed in order to evaluate the minimum variation reflecting a true change in patient status [13]. The indexes were computed for each pair of raters, and an overall κ index was then calculated according to the method proposed by Fleiss et al. [14]. The values of the κ statistic were interpreted according to the criteria of Landis and Koch [15]: agreement was termed poor for a κ index < 0.00, slight for a κ index between 0.00 and 0.20, fair between 0.21 and 0.40, moderate between 0.41 and 0.60, substantial between 0.61 and 0.80, and almost perfect between 0.81 and 1.00.
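As an illustration of the three agreement indexes, the following minimal sketch computes κ for a single pair of raters on one seven-level subscale. It is not the authors' code: the linear weighting scheme is an assumption (the weights used in the study are not stated here), the scores are made-up illustrative values, and the pooling across pairs of raters by the method of Fleiss et al. is not reproduced. Agreement with tolerance is treated as a weighted κ whose weight is 1 when two scores differ by at most the tolerance and 0 otherwise.

```python
from collections import Counter

def weighted_kappa(r1, r2, categories, weight):
    """Kappa for two raters with agreement weights weight(i, j) in [0, 1]."""
    n = len(r1)
    if n == 0 or n != len(r2):
        raise ValueError("ratings must be non-empty and of equal length")
    c1, c2 = Counter(r1), Counter(r2)
    # Observed weighted agreement.
    po = sum(weight(a, b) for a, b in zip(r1, r2)) / n
    # Weighted agreement expected by chance from the marginal counts.
    pe = sum(weight(i, j) * c1[i] * c2[j]
             for i in categories for j in categories) / n ** 2
    if pe == 1.0:
        return float("nan")  # kappa is undefined in this case
    return (po - pe) / (1 - pe)

def tolerance_weight(t):
    """Full agreement when scores differ by at most t points (t=0: perfect)."""
    return lambda i, j: 1.0 if abs(i - j) <= t else 0.0

def linear_weight(lo, hi):
    """Assumed linear weights: partial credit shrinking with the distance."""
    return lambda i, j: 1.0 - abs(i - j) / (hi - lo)

def landis_koch(kappa):
    """Label a kappa value with the Landis and Koch categories."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

levels = range(1, 8)  # seven-level FIM subscale scores
# Made-up scores from two raters for eight patients (illustration only).
rater_a = [1, 2, 3, 4, 5, 6, 7, 7]
rater_b = [1, 3, 3, 5, 5, 6, 6, 7]

k_exact = weighted_kappa(rater_a, rater_b, levels, tolerance_weight(0))
k_tol1 = weighted_kappa(rater_a, rater_b, levels, tolerance_weight(1))
k_linear = weighted_kappa(rater_a, rater_b, levels, linear_weight(1, 7))

for name, k in [("perfect", k_exact), ("1-point tolerance", k_tol1),
                ("linear weighted", k_linear)]:
    print(f"{name}: kappa = {k:.3f} ({landis_koch(k)})")
```

In this toy data set the two raters never differ by more than one point, so κ for perfect agreement is only moderate while κ with 1-point tolerance equals 1. The same pattern, a high weighted κ together with a low κ for perfect agreement, is what motivates reporting the tolerance needed to reach a given agreement level.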

RESULTS
Our study population included 50 patients (28 men and 22 women) with spinal cord (50%) or cerebral (50%) lesions referred to the Rehabilitation Center of "Casa di Cura San Raffaele" of Sulmona, Italy: 48 were first choices, while 2 were replacement choices due to death or unexpected discharge before completion of the protocol. The mean age was 59.5 ± 22.58 years. Etiology and distribution of neurological deficits for the included patients are reported in Table 1. Graphical analysis showed a uniform distribution of the sample over the range of the FIM scale (Figures 1(a) and (b)) and of each subscale, with the exceptions of personal care, feeding oneself, sphincter control, communication, and relational/cognitive capacity, in which more values occurred in the higher range of the scale, and of locomotion, in which most values occurred in the lower range. The maximum disagreement on the FIM scale produced a 40-point difference between raters. Outlier values of disagreement were observed for a few patients in several subscales, with differences of more than 3 points in 12 of 18 subscales (tidying oneself, washing oneself, dressing from the waist up, dressing from the waist down, perineal hygiene, bladder control, bowel control, water closet, walking/wheelchair, stairs, interpersonal relations, and problem solving). Kappa indexes and 95% CI for inter-rater agreement on the FIM overall score and on the subscales are reported in Table 2. Based on the weighted κ index, the agreement on the overall FIM score was almost perfect (κ = 0.87; 95% CI = 0.79-0.95). The agreement was substantial or almost perfect in all subscales except walking/wheelchair, in which agreement was moderate.
The κ index of perfect agreement for the FIM overall score was slight (κ = 0.18; 95% CI = 0.006-0.30). A 20-point tolerance was necessary to reach a κ value rated as almost perfect (κ = 0.81; 95% CI = 0.66-0.95). The κ index of perfect agreement was fair for all subscales of the FIM with the exception of bowel control, stairs, and problem solving, for which it was moderate. The agreement with 1-point tolerance reached the almost perfect rating for comprehension only (κ = 0.82; 95% CI = 0.64-1.00) and was substantial for the majority of the remaining subscales, except tidying oneself, washing oneself, dressing from the waist up, dressing from the waist down, walking/wheelchair, and interpersonal relations, in which it was moderate. Kappa indexes and 95% CI of intra-rater agreement for the FIM overall score and for the subscales (Table 3) were always higher than the corresponding values of inter-rater agreement. The weighted κ index was in the almost perfect range for the overall FIM score and for all subscales. The κ index of perfect agreement for the FIM overall score was substantial (κ = 0.77; 95% CI = 0.48-1.00) and became almost perfect with a 4-point tolerance. The analysis of intra-rater perfect agreement showed κ values rated as almost perfect or substantial for all subscales except dressing from the waist down, in which it was moderate. The agreement with 1-point tolerance reached the almost perfect level for all subscales.

DISCUSSION
The results of this study, which included a sample of patients with mild to severe disability, indicate that inter-rater and intra-rater reproducibility of the FIM are high when evaluated by the weighted κ index, which accounts for the severity of disagreement. However, the analysis of perfect agreement and of agreement with tolerance indicated that a 20-point tolerance was necessary to reach almost perfect inter-rater agreement, while intra-rater agreement was almost perfect with a 4-point tolerance. The κ index with 1-point tolerance showed substantial or almost perfect inter-rater agreement in most subscales and almost perfect intra-rater agreement in all subscales. Reproducibility of scales may be influenced by the distribution of values across the scale range, being higher when values fall in a tighter range. In the present study, the overall FIM and subscale scores were spread across the whole range in most of the scales. Therefore, the sample is fairly representative of cases seen in common clinical practice and allowed an unbiased estimation of reproducibility.
The time interval of 24 hours between the first and the second observation may have favoured intra-rater agreement, which was almost perfect in almost all subscales. However, had we adopted a wider time interval between assessments, changes in the functional status of patients might have produced spurious disagreement. We think that a 24-hour interval was an acceptable compromise. Moreover, the good agreement might also have depended on the inclusion of patients in the post-acute phase in a well-defined setting, with examiners trained in the identification of relevant clinical features, actively collaborating and exchanging views on the patients' course. On the other hand, routine administration of the FIM might have reduced the agreement in the long run, due to the tendency to over-interpret some subscales.
Previous studies indicate that the FIM provides good inter- and intra-rater reliability across a wide variety of raters with different professional backgrounds and levels of training, but few studies addressed the reproducibility of the FIM in patients with spinal cord injury, since it was considered not specifically designed for those subjects [1,4,5,16,17]. We confirmed the high reproducibility of the scale in a sample including 50% of subjects with spinal cord lesions. An important lesson from our study and from literature reports is that the level of reproducibility of the FIM is high even when the scale is assessed by physical therapists, although one must consider as sources of variability the level of professional skill, experience with the FIM and acquaintance with the patient [1,2,5]. According to our results, the FIM scale may be adequately assessed in routine clinical practice by treating physical therapists who have been adequately trained to assess the scale, with uniform reproducibility across all subscales.
A marked difference between the levels of inter- and intra-rater agreement is evident from our study. A possible explanation of this result is the misinterpretation of scale coding by some raters, which might indicate the need for further training or a loss of adherence to coding rules with time. Therefore, periodic retraining of raters should be planned in order to keep reproducibility high. Several subscales did not reach a κ index in the almost perfect or substantial range when inter-rater agreement was assessed with 1-point tolerance, while all subscales reached the almost perfect rating of intra-rater agreement with the same analysis. Hence, if the evaluations are carried out by the same rater, changes as small as 1 point in each subscale should be considered clinically significant. Otherwise, when evaluations are performed by different raters, variations of less than 2 points in each subscale might be a consequence of variability in the assessments.
Our results thus suggest that the repeated administration of the FIM by the treating physical therapist may record even small variations during the rehabilitation program. This practice may produce a reinforcing effect in terms of engagement of the patient with therapy. Moreover, precise monitoring of functional status allows a continuous adaptation of the rehabilitation protocol, favouring the achievement of the best improvement in the patient's physical performance and the lowest rate of complications. However, monitoring of patients with the FIM scale should be performed by the same physical therapist to minimize random variability of assessments.

Figure 1. Distribution of FIM overall scores of intra-rater agreement analysis.

Table 1. Main characteristics of the study population.