Effect of Fine-grained Lexical Rating on L2 Learners' Lexical Learning Gain

In assessing L2 lexical learning, especially initial learning, researchers always face the problem of whether partial word learning should be counted. Existing studies have either counted partial word learning (i.e. counted both partial and complete word learning) or have only counted complete word learning. However, it is not clear whether counting partial word learning makes a difference in capturing task-based and intra-learner lexical learning gain. Few studies have investigated this potential difference and even fewer if both productive and receptive lexical learning are considered. The present study employed differently fine-grained word rating methods to assess three Chinese EFL learner groups' performances on four vocabulary posttests after receiving three treatment tasks: a written output task, an oral output task, and a reading task. Data analyses revealed that the use of differently fine-grained scoring methods did not necessarily affect learners' cross-task lexical learning effects significantly, but it did make a significant difference in measuring individual learners' lexical learning gain. The findings are discussed with reference to whether and how a less or more fine-grained scoring method should be adopted in rating lexical learning.


Introduction
Second language (L2) vocabulary assessment mainly serves two purposes: one is measuring learners' vocabulary size; the other is gauging learners' vocabulary achievement over a period of time (Bruton, 2009; Read, 2000). For either purpose, learners may manifest partial word learning, as lexical learning is incremental (Barcroft & Rott, 2010; Schmitt, 2010). In particular, in receptive vocabulary tests, learners may provide close but inaccurate word meanings for target word forms; likewise, in productive vocabulary tests, learners may generate inaccurate orthographic (or phonological) word forms. Partial learning is especially common in assessing learners' initial word learning gain in experimental studies (Barcroft & Rott, 2010). An examination of existing studies on lexical learning revealed that some took learners' partial knowledge gain into account (e.g. Barcroft, 2002, 2007; Webb & Chang, 2012), whereas others did not (e.g. Hulstijn & Laufer, 2001; Min, 2008). Bruton (2007) argued that in experimental studies the assessment of incidental receptive vocabulary learning from L2 reading should take partial learning into account, since the reading context may not provide learners with adequate lexical learning affordances. Similarly, because productive lexical learning is also incremental, measures of it should count partial word knowledge as well. The critical point here is that counting partial word knowledge gain might affect both the cross-task lexical learning effects and the intra-learner lexical knowledge gain observed when the effectiveness of lexical learning tasks is examined. However, few existing studies have specifically addressed this issue. Through an experimental study, this paper examines whether differently fine-grained lexical scoring methods make a difference in determining task-engendered lexical learning as well as individual learners' lexical learning gain.

L2 Lexical Knowledge Assessment
Lexical learning assessment pertains to both receptive and productive measures, and these have been defined in different ways. Read (2000) classified receptive and productive vocabulary measures as either recognition (understanding the meaning of an isolated target word) and recall (eliciting the form of a target word based on some stimulus) or comprehension (understanding the meaning of a target word in listening or reading) and use (a target word occurring in speech or writing). As the measures for comprehension and use may be contaminated by contextual factors, the tendency in vocabulary assessment is to test recognition and recall, for example, through L2-to-L1 translations and L1-to-L2 translations respectively (Read, 2000), which actually measure the form-meaning link of an L2 word (Schmitt, 2010). Laufer and Goldstein (2004), based on form-meaning relationships, formulated four degrees of form-meaning knowledge: active recall (supplying L2 word forms), passive recall (supplying L1 equivalents), active recognition (selecting L2 word forms), and passive recognition (selecting L1 equivalents), which Schmitt (2010) renamed form recall, meaning recall, form recognition, and meaning recognition for ease of understanding. Since form recognition and meaning recognition do not occur in real communication, lexical learning is basically about meaning recall and form recall, the initial steps leading to receptive and productive mastery of words (Schmitt, 2010).
The present study focuses on both receptive and productive lexical learning, which are defined as meaning recall and form recall respectively, following the distinctions made in Schmitt (2010). Specifically, receptive lexical learning is operationalized as learners being able to retrieve the word meaning when seeing the orthographic form of a target word, and productive lexical learning is operationalized as learners being able to produce the orthographic form of a target word when provided with its meaning. The methods for assessing receptive and productive lexical learning in previous studies are reviewed below in terms of both testing format and scoring method.

Assessing Receptive Lexical Learning
Receptive lexical learning defined as meaning recall can be assessed by providing L1 translations, L2 paraphrases, or equivalent pictures for target words, depending on the participants' backgrounds. One example is Newton (1995), which required participants with different native language backgrounds to show their understanding of target words in any of the above-mentioned ways. Hulstijn and Laufer (2001) allowed their participants, intact classes of EFL learners, to provide either L1 equivalents or English explanations for target words. Keating (2008), a replication of Hulstijn and Laufer (2001), utilized Spanish-English translation to assess learners' receptive learning of Spanish words.
Such receptive tests focusing on form-meaning links do not incorporate partial word learning in terms of test format, but partial word knowledge can be incorporated in scoring. For example, instead of using a correct/incorrect scoring criterion, Hulstijn and Laufer (2001) and Keating (2008) both considered partially correct meanings by scoring each target word on a three-point scale: 0, 0.5, and 1.
One assessment method emphasizing partial word knowledge in test format is the Vocabulary Knowledge Scale (VKS) (Bruton, 2009; Stewart, Batty, & Bovee, 2012; Wesche & Paribakht, 1996). The VKS is a generic instrument for measuring the depth or quality of vocabulary knowledge gain. As Wesche and Paribakht (1996: p. 33) claimed, "Its purpose is not to estimate general vocabulary knowledge, but rather to track the early development of specific words in an instructional or experimental situation". The VKS consists of a word knowledge elicitation scale and a scoring scale. The elicitation scale is composed of five categories (Bruton, 2009; Stewart et al., 2012; Wesche & Paribakht, 1996), representing five degrees of word knowledge, going from (I) being completely unfamiliar with a word, (II) having seen a word but not knowing its meaning, (III) being able to guess its meaning (supply synonyms or translations), (IV) knowing its meaning exactly (supply synonyms or translations), to (V) being able to use it (write a sentence). The scoring scale allows for five possible scores: 1, 2, 3, 4 and 5. Scores of 1 and 2 are awarded to knowledge degrees I and II respectively; 3 is awarded to knowledge degrees III and IV, i.e. when a correct synonym or translation of a word is supplied regardless of test-takers' certainty; and 4 and 5 are awarded to knowledge degree V, depending on whether test-takers use the word in a sentence with semantic appropriateness only or with both semantic appropriateness and grammatical accuracy.
The VKS has been used in a number of studies in its original or modified form. Both Hashemi & Gowdasiaei (2005) and Kim (2008) used the original VKS. The former examined the effects of lexical-set and semantically-unrelated vocabulary instruction on Iranian EFL learners' lexical learning, while the latter, a partial replication of Hulstijn and Laufer (2001), assessed learners' lexical gain from performing reading comprehension, reading plus blank fill-in, and composition writing. Since the VKS has been criticized for being multidimensional, for excluding multiple meanings and approximate word knowledge, and for not really measuring productive lexical learning (Bruton, 2009; Stewart et al., 2012), researchers (e.g. Atay & Kurt, 2006; Joe, 1998; Min, 2008; Rott, Williams, & Cameron, 2002; Webb & Chang, 2012) have modified the VKS self-report categories for their own purposes. Atay & Kurt (2006), instead of using the original five-category VKS report scale, adopted a two-category scale, providing test-takers with an unknown/known option and requiring them to supply the word meaning and make a sentence with the word if they chose the known option. Joe (1998) used the VKS in interviews rather than as a written procedure, to allow for more probing of what the learners knew about each word. Joe (1998) also modified the elicitation scale by introducing an extra category, worded as "I haven't seen this word before, but I think…", so as to allow learners to make inferences about a word through recognizing the prefix, stem, or suffix. Min (2008) condensed the original five VKS categories into four and subsumed them under the basic unknown/known dichotomy, with the unknown side consisting of Categories I (unknown words) and II (partial word knowledge), and the known side containing Categories III (supply word meaning) and IV (write a sentence). Rott et al. (2002) incorporated only four VKS categories in order to measure their learners' productive vocabulary learning, though the VKS is intended for measuring receptive lexical learning, as stated earlier. Webb & Chang (2012) employed three report categories: 0, 1, and 2, respectively for having never seen the word, having seen the word but not knowing its meaning, and supplying the word meaning.
Studies may use the VKS test format but not count partial word knowledge at the scoring stage (e.g. de la Fuente, 2002; Min, 2008). De la Fuente (2002) employed an oral receptive version and an oral productive version of the VKS, each containing four self-report categories. However, the author used a correct/incorrect binary scoring procedure, hence losing much of the information that comes from using the VKS. Min (2008) employed four self-report categories but assigned zero scores to Categories I and II, and marked Categories III and IV independently, each following a correct/incorrect criterion.
To summarize, the above account of supplying L1 equivalents and the VKS suggests that assessing partial word learning lies not so much in the test format as in the scoring procedure. The format of supplying L1 equivalents does not entail partial word knowledge, but since learners may not be able to provide exact equivalents, partial word gain can be considered in scoring (Hulstijn & Laufer, 2001; Keating, 2008). The elicitation scale of the VKS makes it possible to garner learners' different degrees of word knowledge, which, however, may be ignored in scoring (de la Fuente, 2002; Min, 2008). The point is whether counting partial word knowledge really matters. Specifically, does including or excluding partial word knowledge really make a difference in assessing cross-task lexical learning effects and intra-learner word knowledge gain? This issue constitutes one focus of this study.

Assessing Productive Lexical Learning
Productive word learning defined as word form recall can be measured in two test formats. One is isolated L1-to-L2 word translation (Barcroft & Rott, 2010; Hulstijn & Laufer, 2001) or picture labeling (Barcroft, 2004; de la Fuente, 2002; Ellis & He, 1999; Smith, 2004). The L1 word in L1-to-L2 translation and the picture in picture labeling both serve as meaning prompts for test-takers to recall and produce the target word form, so L1-to-L2 translation and picture labeling can be categorized as the same test format. L1-to-L2 translation is applicable when test-takers share their L1. However, its problem lies in the possibility of being unable to elicit the right target word. For instance, Barcroft and Rott (2010) conducted immediate posttests in order to elicit target German and Spanish word forms from their participants. The problem with picture labeling is that it is confined to testing some concrete nouns, as seen in previous studies (e.g. Barcroft, 2004; de la Fuente, 2002; Ellis & He, 1999; Smith, 2004).
The other commonly used test format is the productive vocabulary levels test designed by Laufer and Nation (1999). Each test item in this format contains a sentence, which incorporates the target word but leaves its position as a blank. Test-takers are required to fill in the sentential blank so that the target word can be produced. In order to push test-takers to produce the right target word, the initial letter(s) are provided as prompts, as in "The differences were so sl_____ that they went unnoticed" (Schmitt, 2010: p. 203). However, Schmitt (2010) doubts the validity of this test format in that the supplied initial letter(s) and the sentential context might affect the difficulty of the test item. He also argues that such a test format is not communicative enough and cannot measure learners' comprehensive productive word knowledge. Nevertheless, this sentence blank-filling format does involve word form recall, whereas measuring productive vocabulary knowledge via writing or speaking would involve word use instead of word form recall. Furthermore, the blank-filling test tries to overcome the constraints of L1-to-L2 translation and picture labeling discussed above. However, the use of prompt letters and the sentence context do require validation (Schmitt, 2010). Although this test was designed to measure productive vocabulary size, the present author considers it equally suitable for measuring learners' productive word gain from task performances, as used in this study.
In spite of their different formats, L1-to-L2 translation (or picture labeling) and letter-prompted sentence blank-filling both involve word form recall, and may elicit inaccurate word forms, which test formats cannot predict. Yet, inaccurate word forms may be counted in test scoring in order to gain a more accurate picture of learners' lexical gain. However, existing studies focusing on productive lexical learning show discrepancies regarding how partial word knowledge has been counted. Some have followed a correct/incorrect (i.e. 1/0) binary scoring method, which ignores partial word knowledge (de la Fuente, 2002, 2003; Min, 2008). Other studies have adopted a three-point scoring system (i.e. 1, 0.5 and 0), with partial word learning being awarded 0.5 points (e.g. Barcroft, 2009; Barcroft & Sommers, 2005). In addition, Barcroft (2002) has developed a 5-point scoring scale, called the lexical production scoring protocol (LPSP), to measure learners' productive lexical learning as accurately as possible. The LPSP awards 0.00, 0.25, 0.50, 0.75, or 1.00 to a target word depending on the percentage of "letters correct" (target letters placed in the correct positions of a word) and the percentage of "letters present" (misplaced target letters in a word) in the learner-produced target word. The LPSP has been applied in Barcroft's other studies (Barcroft, 2003, 2007) as well as by other researchers (Keating, 2008; Niu, 2014; Niu & Helms-Park, 2014; Smith, 2004).
Since different scoring approaches count partial lexical knowledge to different degrees, some studies have adopted multiple approaches to capture a fine-grained picture of learners' productive word gain (Barcroft, 2002, 2004). Barcroft (2002), examining the effects of semantic learning, structural learning, and no elaboration on picture-cued word form recall by L1 English learners of Spanish, applied letter-based scoring, word-based scoring, and syllable-based scoring to measure participants' lexical gain in order to discern any moderating effect of scoring. Both letter-based scoring and syllable-based scoring counted partial word learning, while word-based scoring did not. However, the study did not observe any significant moderating effect of the different scoring approaches on the different learning conditions. Barcroft (2004) employed whole word scoring and syllable scoring in measuring the effect of sentence writing and word-meaning repetition on English-speaking Spanish learners' word recall, and likewise revealed no significant moderating effect. In fact, neither study set out specifically to examine the different effects of different scoring methods; instead, they validated their research results with different scoring methods. While the two studies revealed that differently fine-grained productive scoring methods did not affect cross-task lexical learning differences, neither examined whether different scoring methods would influence intra-learner lexical learning gain. Therefore, as with receptive lexical learning, more studies are needed to examine whether differently fine-grained productive word scoring methods make a difference in gauging both cross-task lexical learning effects and individual learners' lexical learning gain.

Research Design and Research Questions
Predicated on the above-reviewed literature, the study examined whether counting partial word knowledge would affect cross-task lexical learning differences and intra-learner lexical learning gain. Both receptive and productive lexical learning were included and defined as word meaning recall and orthographic word form recall respectively. The data of the study came from Chinese EFL learners' vocabulary tests taken after performing three input-based tasks: collaborative written output, collaborative oral output, and reading comprehension, named Written Output, Oral Output, and Reading respectively in the study. The three tasks were chosen in order to better observe cross-task lexical learning differences, because Written Output and Oral Output have been found to be similar to each other while both tend to be significantly better than Reading in bringing about lexical learning (Niu & Helms-Park, 2014).
Counting partial word knowledge was realized by using differently fine-grained scoring methods. Specifically, learners' receptive lexical learning was measured by an adapted VKS, which was scored with two methods: a correct/incorrect binary scoring method called correct meaning scoring and an adapted 5-point scoring method called graded scoring. Graded scoring took learners' partial receptive word knowledge into account while correct meaning scoring did not. It was expected that graded scoring would provide a more accurate picture of learners' receptive lexical learning and hence would reveal cross-task lexical learning effects and individual learners' lexical learning gain more exactly than correct meaning scoring. In order to test this prediction empirically, two research questions (RQs) were raised in relation to receptive lexical learning:
1) To what extent do differently fine-grained receptive lexical scoring methods lead to different cross-task lexical learning effects?
2) To what extent do differently fine-grained receptive lexical scoring methods lead to different intra-learner lexical learning gain?
Learners' productive lexical learning was measured by a letter-cued sentence blank-filling test following Laufer and Nation (1999), which was scored in three ways: letter-based scoring, syllable-based scoring, and word-based scoring, in light of the methods used in Barcroft (2002, 2004). The three methods counted correct letters, correct syllables, and correct words respectively. Letter-based scoring is more fine-grained than syllable-based scoring, and syllable-based scoring is more fine-grained than word-based scoring. It was predicted that letter-based scoring would give a more accurate picture of learners' productive lexical learning and hence would reveal cross-task lexical learning effects and individual learners' lexical learning gain more exactly than syllable-based scoring, and that syllable-based scoring would in turn be more exact than word-based scoring. While Barcroft (2002, 2004) observed no significant difference among the three scoring methods in revealing cross-task lexical learning effects, no previous studies have investigated whether differently fine-grained scoring methods are related to intra-learner lexical learning gain. Hence, with reference to productive lexical learning, the following two RQs were addressed:
3) To what extent do differently fine-grained productive lexical scoring methods lead to different cross-task lexical learning effects?
4) To what extent do differently fine-grained productive lexical scoring methods lead to different intra-learner lexical learning gain?

Participants
A total of 240 Chinese year-one English majors from a key university in Guangdong Province, China participated in the study. They were aged 17-21, and had studied English for 7-11 years. They entered university through passing China's national entrance examinations. In their year-one university study, the participants mainly attended courses relating to the four language skills. They came from 10 intact classes and were divided into three groups: 1) the Written Output group, 2) the Oral Output group, and 3) the Reading group, with 98, 96, and 46 students respectively. The three groups were not significantly different in their overall English proficiency as measured by their term-final English core course scores, F(2, 237) = 0.206, p = 0.814.

Reading Input and Target Words
The input passage The Land of Disney was selected and adapted from BBC English. The passage was expository and hence favorable for learners to remember and recall its content. In order to enhance learners' comprehension, culturally loaded details and non-essential embellishing sentences were removed. Although the adapted passage was not difficult for the participants, it contained some relatively difficult words so that the lexical learning purpose could be achieved. The length of the passage (485 words) ensured that the reading task and the post-reading output tasks could be completed within an 80-minute class period. Based on the passage length, the time on task, and the recommended practice in previous studies (e.g. Hulstijn & Laufer, 2001; Laufer, 2003), 10 target words and 6 distracters were selected on the basis of pilot tests on peers of the participants. The target words were all content words, including two nouns (epoch and acuity), three adjectives (perilous, idyllic and apprehensive) and five verbs (encapsulate, instigate, espouse, depict and heed). The known word coverage of the passage was approximately 97.94% of the word tokens, which should have ensured the participants' instant comprehension of the passage (Hirsh & Nation, 1992). The target words and distracters were glossed with Chinese meanings in the margin of the passage, since word meaning guessing is often unreliable (Laufer, 1997), and learners must first know word meanings if they are to put words to use and store them in memory (e.g. Rott et al., 2002). The marginal Chinese glossing is also consistent with the participants' habit of memorizing English words by associating them with Chinese equivalents.
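The reported coverage figure can be checked with simple arithmetic. The sketch below is one reading that reproduces 97.94%, assuming (this is not stated in the text) that the 10 target words each occur once in the 485-word passage and are the only tokens unknown to the participants. The function name known_word_coverage is hypothetical.

```python
def known_word_coverage(total_tokens: int, unknown_tokens: int) -> float:
    """Return the proportion of word tokens in a passage that are known."""
    return (total_tokens - unknown_tokens) / total_tokens


# Passage length is from the text; 10 unknown tokens is an assumption
# (each target word occurring once, no other unknown words).
coverage = known_word_coverage(total_tokens=485, unknown_tokens=10)
print(f"{coverage:.2%}")
```

Under a different assumption about how often the target words and distracters occur, the figure would change accordingly.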

Pretest
A word pretest was administered in order to ensure participants' zero baseline knowledge of the target words. The pretest contained the 10 target words and 10 distracters, all key words from the reading input passage. Participants were required to provide Chinese meanings for them. A pretest is believed to be able to gauge learners' word knowledge more accurately than a post-task recall. The pretest effect, if there was any, would apply to all participants, and hence would not affect the research results.

Treatment Tasks
The study employed three treatment tasks, and their detailed requirements are outlined in Table 1. The three tasks have been designed to be as comparable as possible in terms of task cycle, time on task, and requirements. Written Output and Oral Output were accompanied by a separate work sheet, on one side of which were the cued words arranged in the order in which they appeared in the passage, and on the other side of which were the cued words ordered alphabetically and followed by their phonetic transcriptions, parts of speech, and Chinese meanings. Cued words were provided in order to remind participants of the propositions of the passage and to optimize their chances of using the target words in reconstruction. The instructions for all three tasks were phrased in both English and Chinese in order to optimize participants' understanding.

Table 1. Requirements of the three treatment tasks.
Reading: Judge whether 16 statements are "true", "false", or "of no evidence" without referring to the passage (25 mins); evaluate the correctness of their judgment against the original passage (10 mins).

Posttests
In light of the productive vocabulary levels test (Laufer & Nation, 1999), a letter-cued sentence blank-filling test was adopted to measure participants' productive word learning. It required test-takers to complete a set of blank-embedded sentences, as illustrated below. Both cued letters and Chinese meanings were provided in order to prompt learners to generate the target words. The choice of cued letters was determined based on a pilot study. The sentences were chosen and adapted from the British National Corpus in order to optimize their authenticity.
The reality is not as i_____c (美好的，愉快的) as he has expected.
By drawing on the VKS (Wesche & Paribakht, 1996), an adapted version was utilized to measure participants' receptive lexical gain. It adopted the first four VKS self-report categories, and Categories III and IV required participants to provide Chinese meanings for the tested words instead of providing either translations or synonyms because the input passage was glossed with Chinese. Category V of the original VKS was excluded because it tested word use rather than word recall.
Four posttests, both productive and receptive, were administered in the study. In order to reduce test effects, the four posttests were varied in terms of the number of items and the order in which the items were arranged. Posttest 1 contained 14 items, including 10 target items and 4 distracters, and the 14 items were arranged in the order in which they appeared in the input passage. On posttest 2, only the 10 target items remained, and were ordered randomly. Posttests 3 and 4 were the same as posttests 1 and 2 respectively. Pilot study analysis indicated that both the productive posttests (Alpha = 0.7442; Guttman split-half = 0.6292) and the receptive posttests (Alpha = 0.7780; Guttman split-half = 0.8009) were reliable.
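For readers unfamiliar with the Alpha coefficient reported above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The text does not say which software or which item scores were used, so the function name cronbach_alpha and the small data set are purely hypothetical illustrations, not the study's pilot data.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a score matrix.

    scores: one row per test-taker, one column per test item.
    Uses sample variances throughout.
    """
    k = len(scores[0])  # number of items
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical item scores (0 = incorrect, 1 = correct) for five test-takers.
demo_scores = [
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
print(round(cronbach_alpha(demo_scores), 3))
```

When every item ranks test-takers identically, the coefficient reaches 1; it falls as the items agree less with one another.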

Data Collection Procedure
The data were collected according to the following procedure:
1) In Week 1, all three groups completed the vocabulary pre-test within 10 minutes. Then the output group members were paired according to the closeness of their term-final English core course scores and their personal preference. Afterwards, they received a task practice session. The tasks adopted for the practice session were the same as the treatment tasks in terms of both task requirements and task instructions, except that a shorter, different passage was used and less time was allowed.
2) In Week 2, all three groups performed their respective treatment task strictly in line with the task instructions. The task performances of the two output groups were audio-taped, with participants' permission, by the task administrator, who was their listening class teacher, in case their performances proved useful for interpreting the findings of the study. Recording is a common practice for students in listening classes and should not have affected the participants' performances. Although 45 minutes were allocated for task completion, Written Output, Oral Output, and Reading took on average 45, 40 and 30 minutes respectively. This time difference, being regarded as a task-inherent factor, was not considered in data analysis.
3) Immediately upon task completion, all three groups took vocabulary posttest 1, with the productive test preceding the receptive one in order to avoid a test effect.
4) In Weeks 3, 5 and 6, vocabulary posttests 2, 3 and 4 were administered.

Data Analysis
Based on the results of the vocabulary pre-test, the target word apprehensive was excluded from data analysis because some participants happened to have been taught this word shortly before the experiment. Dropping the word, rather than the participants who knew it, maintained participants' zero baseline knowledge of the target words while keeping an adequate sample size. After the pre-test screening and the exclusion of those absent from any data collection session, 175 participants remained: 69 for Written Output, 72 for Oral Output, and 34 for Reading. One-way ANOVA results of the three groups' term-final English core course scores indicated that they were not significantly different in English proficiency, F(2, 172) = 1.245, p = 0.29. Data analysis of the study mainly involved scoring the productive and receptive posttests and analyzing the scores statistically in response to the research questions.

Productive Vocabulary Scoring
The three productive vocabulary scoring methods are detailed below.
1) Letter-based scoring
Letter-based scoring calculates the percentage of correct letters in a target word that learners produced and hence takes partially correct word forms into account. Correct word forms are understood as 100% of their letters being correctly produced. The scoring rules were formulated by adapting Barcroft's (2002) LPSP. Applying the LPSP to the present study showed that it was not sensitive enough to the variance of incorrect forms. Thus, instead of using a 5-point scale (i.e. 0, 0.25, 0.5, 0.75 and 1), the letter-based scoring system weighed correct letters only, and directly transformed the percentage of correct letters into the score. Three levels were used for letter-based scoring: (a) 0 points were awarded when nothing was written or none of the supplied letters were correct; (b) 1 point was awarded when all supplied letters were correct; (c) a score between 0 and 1 was awarded when a portion of the supplied letters was correct. This score was computed by dividing the number of correct letters by the number of target letters, or by the number of supplied letters if more letters were written than the target letters. All partially correct words were collected and scored repeatedly. The intrarater scoring reliability based on the author's last two markings was 97.97%. A trained second rater marked all partially correct words and the interrater reliability was 90.36%. Based on letter-based scoring, the score for each target word ranged from 0 to 1. For each participant, the maximum score was 9 and the minimum was 0.
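The letter-based rule can be sketched in a few lines of code. The text does not spell out how a "correct letter" was identified, so the sketch below makes a simplifying assumption: a letter counts as correct only if it matches the target letter in the same position. The function name letter_score is hypothetical, and both arguments are taken to be the letters the learner was asked to supply (i.e. excluding any cued letters).

```python
def letter_score(target: str, response: str) -> float:
    """Score a response between 0 and 1 by the share of correct letters.

    Assumption: a letter is "correct" when it matches the target letter
    in the same position. The denominator is the number of target letters,
    or the number of supplied letters if the learner wrote more.
    """
    target = target.strip().lower()
    response = response.strip().lower()
    if not response:
        return 0.0  # nothing written
    correct = sum(1 for t, r in zip(target, response) if t == r)
    denominator = max(len(target), len(response))
    return correct / denominator


print(letter_score("perilous", "perilous"))  # fully correct form
print(letter_score("perilous", "perilus"))   # one letter omitted
print(letter_score("heed", "heeds"))         # one extra letter supplied
```

Strict positional matching penalizes an early omission heavily, because every later letter shifts out of place; a human rater may well judge such forms more leniently, so the sketch should be read as an illustration of the three scoring levels rather than as the exact marking procedure.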
2) Syllable-based scoring
Syllable-based scoring only considers correctly produced syllables. It is less fine-grained than letter-based scoring because scoring partially correct syllables as zero might miss some correctly produced letters. Since phonetic syllable divisions and orthographic syllable divisions are not always the same (Wells, 1990), the orthographic forms of the target words in this study were syllabified according to phonetic syllabification principles. The reasons are that the productive posttests required participants to produce orthographic forms of the target words, that phonological representations were also involved in participants' task performance, and that the orthographic forms of English words are usually connected with their pronunciations (Graddol, 2007; Horobin, 2007). The 9 target words contain 24 syllables in total, 18 of which were incorporated into scoring because the productive posttests were letter-cued and some syllables had been supplied as cued letters. Each correct syllable was awarded 1 point. The total score for all marked syllables ranged from 0 to 18 and was converted into a score ranging from 0 to 9 for statistical analysis, in order to be comparable with the results from the other two scoring methods.
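The conversion from the 0-18 syllable total to the 0-9 range can be illustrated as follows. The text does not state the conversion formula; the sketch assumes a simple linear rescaling, and the function name rescale_syllable_total is hypothetical.

```python
def rescale_syllable_total(raw_total: int, raw_max: int = 18, target_max: int = 9) -> float:
    """Linearly rescale a syllable total (0..raw_max) to 0..target_max.

    Assumption: the conversion is proportional, so with the study's
    figures each correct syllable is worth 9 / 18 = 0.5 points.
    """
    if not 0 <= raw_total <= raw_max:
        raise ValueError("raw_total must lie between 0 and raw_max")
    return raw_total * target_max / raw_max


print(rescale_syllable_total(18))  # all scored syllables correct
print(rescale_syllable_total(7))   # 7 of 18 scored syllables correct
```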
3) Word-based scoring
Word-based scoring only counts correctly produced words. It is the least fine-grained method among the three because excluding incorrect word forms means excluding any correct letters or syllables they contain. Under word-based scoring, a correctly produced word is awarded 1 point; otherwise, 0 points are assigned. Unanswered items were assigned 0 points because the common practice in productive posttests (sentence blank filling in this study) is that learners leave a blank empty if they do not know the answer. Word-based scoring was conducted by screening the scores resulting from letter-based scoring. In other words, letter-based scores of 1 became word-based scores of 1, and letter-based scores of 0 and partial scores became word-based scores of 0.
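The screening step that derives word-based scores from letter-based scores can be sketched as below. The function name and the sample scores are hypothetical and serve only to show how the two totals diverge for the same learner.

```python
def word_scores_from_letter_scores(letter_scores):
    """Map letter-based scores to word-based scores.

    A letter-based score of exactly 1 becomes 1; a score of 0 or any
    partial score becomes 0.
    """
    return [1 if score == 1 else 0 for score in letter_scores]


# Hypothetical letter-based scores for one learner (not study data).
letter_scores = [1.0, 0.5, 0.0, 1.0, 0.75]
word_scores = word_scores_from_letter_scores(letter_scores)

print(sum(letter_scores))  # total under letter-based scoring
print(sum(word_scores))    # total under word-based scoring
```

In this made-up case the letter-based total exceeds the word-based total because the two partially correct forms earn credit only under letter-based scoring.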

Receptive Vocabulary Scoring
The two receptive vocabulary scoring methods are stated as follows.
1) Graded scoring
In light of the VKS scoring scale (Wesche & Paribakht, 1996), graded scoring measured participants' receptive word gain on a 5-point scale: 0, 0.25, 0.5, 0.75, and 1. The VKS employed in this study contained four self-report categories. If Category I or Category II was chosen, 0 points or 0.25 points were awarded respectively. For Category III and Category IV, however, three levels of word knowledge were identified according to the accuracy of the Chinese meaning provided, and each level was assigned a different score. Supplying a wrong meaning was treated as equivalent to choosing Category II, and 0.25 points were awarded. If a close meaning was supplied, 0.5 points were awarded. Close meanings are meanings that partially overlap with the original meaning of the target word. For example, the exact Chinese meaning of acuity is minrui (敏锐), but jimin (机敏) would be regarded as a close meaning. To enhance reliability, the close meanings were collected and shown to another Chinese EFL teacher, and agreement was reached between her and the author. If a correct meaning was supplied under Category III, the score was 0.75; if under Category IV, the score was 1, in order to reflect participants' degree of certainty in retrieving the word meaning. Based on graded scoring, the score for each participant ranged from 0 to 9.
2) Correct meaning scoring
Correct meaning scoring measures receptive vocabulary acquisition solely on the basis of the correctness of the Chinese meanings that participants provided for the target words. This approach is categorical in that either 1 point or 0 points were awarded. A correct Chinese meaning was assigned 1 point; if no meaning, a wrong meaning, or a close meaning was supplied, 0 points were awarded. In practice, correct meaning scoring was conducted by counting the items rated at 0.75 points or 1 point in the graded scoring.
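The two receptive scoring rules can be summarized in a short sketch. The category numbers 1 to 4 stand for VKS Categories I to IV; the labels "wrong", "close", and "correct" are a hypothetical encoding of the rater's judgment, and the treatment of a Category III or IV response with no judgment recorded is an assumption (an error is raised), since the study does not describe that case.

```python
def graded_score(category: int, meaning: str = None) -> float:
    """Graded score (0, 0.25, 0.5, 0.75 or 1) for one target word."""
    if category == 1:
        return 0.0
    if category == 2:
        return 0.25
    if category in (3, 4):
        if meaning == "wrong":
            return 0.25  # treated like choosing Category II
        if meaning == "close":
            return 0.5
        if meaning == "correct":
            return 0.75 if category == 3 else 1.0
        # ASSUMPTION: a judgment is required for Categories III and IV.
        raise ValueError("meaning must be 'wrong', 'close' or 'correct'")
    raise ValueError("category must be 1, 2, 3 or 4")


def correct_meaning_score(category: int, meaning: str = None) -> int:
    """1 if the item is rated 0.75 or 1 under graded scoring, else 0."""
    return 1 if graded_score(category, meaning) >= 0.75 else 0


# Hypothetical responses to four words (not study data): all Category II.
responses = [(2, None)] * 4
print(sum(graded_score(c, m) for c, m in responses))
print(sum(correct_meaning_score(c, m) for c, m in responses))
```

The example shows how a learner who recalls no meaning correctly can still accumulate a full point under graded scoring while scoring zero under correct meaning scoring.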

Statistical Analysis
SPSS version 16 was used to analyze the data statistically. To answer RQs 1 and 3, one-way ANOVA was applied to compare the three task groups with reference to letter-based scoring, syllable-based scoring, and word-based scoring respectively for productive lexical learning, and with reference to graded scoring and correct meaning scoring respectively for receptive lexical learning. To answer RQs 2 and 4, all participants' scores resulting from the three productive scoring methods were correlated and compared for significant differences on each productive posttest; likewise, their scores resulting from the two receptive scoring methods were correlated and compared for significant differences on each receptive posttest.
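For readers without access to SPSS, the test statistics involved can be sketched in plain Python. This is not the procedure used in the study, which relied on SPSS; it only computes the one-way ANOVA F statistic, the paired-samples t statistic, and Pearson's r, without p-values. It also assumes that the unspecified correlation coefficient was Pearson's r. All function names and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def one_way_anova_f(groups):
    """F statistic comparing the means of several independent groups."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def paired_t(a, b):
    """t statistic for paired scores from the same learners."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def pearson_r(a, b):
    """Pearson correlation between two lists of scores."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy scores only (not data from the study).
task_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # three task groups
fine_grained = [3, 4, 5, 6]    # same learners, finer scoring method
coarse_grained = [1, 3, 3, 5]  # same learners, coarser scoring method

print(one_way_anova_f(task_groups))
print(paired_t(fine_grained, coarse_grained))
print(pearson_r(fine_grained, coarse_grained))
```

The between-group comparison corresponds to RQs 1 and 3, while the paired comparison and the correlation correspond to RQs 2 and 4, where the same learners are scored by different methods.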

Receptive Multiple Scoring and Cross-Task Lexical Learning Effects
As shown in Table 2, graded scoring produced higher scores than correct meaning scoring did on all 4 posttests and across all three tasks. This confirms that graded scoring is more fine-grained than correct meaning scoring, as assumed in the study. However, the results of one-way ANOVA, as displayed in Table 3, revealed scoring-related differences in cross-task statistical significance only on posttests 1 and 2. That is, Written Output and Reading led to significantly different receptive lexical acquisition on posttest 1 according to graded scoring but not according to correct meaning scoring, whereas the two tasks brought about significantly different receptive lexical retention on posttest 2 according to correct meaning scoring but not according to graded scoring. This finding indicates that differently fine-grained receptive scoring methods, that is, graded scoring and correct meaning scoring in this study, may affect whether cross-task lexical learning differences reach significance. Yet this effect is variable in that a more fine-grained scoring method is not necessarily more likely to produce cross-task differences than a less fine-grained one, especially when the differences between two tasks are small, like that between Written Output and Reading in this study. However, when two tasks, like Oral Output and Reading in this study, are sufficiently distant, differently fine-grained scoring methods tend not to significantly affect the degree of their differences in affecting lexical learning. Specifically, as shown in Table 3, Oral Output and Reading led to significantly different receptive lexical learning on all 4 posttests under both scoring methods, hence showing no scoring method effect.

Receptive Multiple Scoring and Intra-Learner Lexical Learning Gain
As presented in Table 4, graded scoring produced higher scores than correct meaning scoring did for all participants (N = 175) on all 4 receptive posttests. This again substantiates that graded scoring captures fine-grained lexical learning better than correct meaning scoring does.
Furthermore, the paired t-test results in Table 5 reveal that the score differences between graded scoring and correct meaning scoring for all participants reached statistical significance on all 4 posttests. This suggests that the two differently fine-grained receptive scoring methods did produce statistically different intra-learner lexical learning gain. That is, the participants' lexical learning as derived from graded scoring appeared significantly better than that generated from correct meaning scoring. Yet the score differences resulting from the two scoring methods did not change the general trend of participants' lexical learning gain, as shown by the strong correlations between the scores generated from the two scoring methods (Table 6). That is, participants who obtained higher scores under graded scoring tended to obtain higher scores under correct meaning scoring as well, and vice versa.

Productive Multiple Scoring and Cross-Task Lexical Learning Effects
As displayed in Table 7, the scores obtained from letter-based scoring were higher than those derived from syllable-based scoring, and the latter were higher than those generated from word-based scoring for all three tasks and on all four productive posttests. This indicates that letter-based scoring is the most fine-grained scoring method, followed by syllable-based scoring, while word-based scoring is the least fine-grained among the three, as assumed in this study.
However, the results of one-way ANOVA, as shown in Table 8, reveal that the three scoring methods did not bring about significantly different cross-task lexical learning on posttests 1 or 3. That is, the lexical learning differences among Written Output, Oral Output, and Reading were consistent in their statistical significance across all three scoring methods. Yet on posttests 2 and 4, different cross-task lexical learning effects were observed across the three scoring methods. In particular, on posttest 2 the score differences between Written Output and Reading derived from letter-based scoring and syllable-based scoring were statistically significant, while the difference generated from word-based scoring did not reach statistical significance. This is consistent with letter-based scoring and syllable-based scoring being more fine-grained than word-based scoring, a pattern which, nevertheless, was not borne out on posttest 4. Rather, on posttest 4, the score difference between Written Output and Reading reached statistical significance according to syllable-based scoring, but not according to letter-based scoring or word-based scoring. This inconsistency again indicates that differently fine-grained scoring methods may give rise to different cross-task lexical learning effects, but that such an effect is variable: a more fine-grained scoring method is not necessarily more likely to yield statistically different cross-task lexical learning effects than a less fine-grained one, and vice versa.

Productive Multiple Scoring and Intra-Learner Lexical Learning Gain
As shown in Table 9, the participants as a whole (N = 175) obtained the highest average score on all 4 productive posttests under letter-based scoring, the lowest average score under word-based scoring, and an intermediate average score under syllable-based scoring. This also substantiates the assumption of the study that letter-based scoring is the most fine-grained method, word-based scoring the least fine-grained, and syllable-based scoring in the middle.
The results of the paired-samples t-tests, as displayed in Table 10, reveal that the score differences generated from the three scoring methods for all participants were statistically significant on all 4 productive posttests. This suggests that differently fine-grained scoring methods led to significantly different measures of learners' lexical learning gain. That is, the participants' productive lexical learning as measured by letter-based scoring appeared significantly better than that measured by syllable-based scoring, and their lexical learning as measured by syllable-based scoring appeared significantly better than that measured by word-based scoring. However, the score differences resulting from the different scoring methods did not change the general trend of the participants' lexical learning gain, as shown by the correlations in Table 11. That is, participants who obtained higher scores under letter-based scoring tended to obtain higher scores under syllable-based scoring and word-based scoring as well, and vice versa.

Discussion
The study found that for both receptive and productive lexical learning, differently fine-grained scoring methods did not necessarily lead to significantly different cross-task effects (in response to RQs 1 and 3), whereas they did bring about significantly different intra-learner word gain (in response to RQs 2 and 4). The finding about the association between differently fine-grained scoring methods and cross-task productive lexical learning effects substantiates that of Barcroft (2002, 2004). As no previous studies had investigated whether and how differently fine-grained scoring methods would affect cross-task receptive lexical learning effects or intra-learner lexical learning gain, the present study addresses these gaps. The findings are discussed below in terms of the association between multiple rating methods and cross-task effects, and between multiple rating methods and intra-learner lexical learning gain, respectively.

Multiple Ratings and Cross-Task Lexical Learning Effects
The finding that differently fine-grained scoring methods did not consistently lead to significantly different cross-task receptive or productive lexical learning effects indicates that the score gap resulting from different scoring methods might not be large enough to make significant differences. This can be attributed to the fact that each scoring method was applied consistently to rate the lexical learning of all three task groups. Regarding productive lexical learning, because of their increasing granularity, word-based scoring, syllable-based scoring, and letter-based scoring produced increasingly higher scores for all three groups. In other words, the Oral Output group obtained higher scores according to letter-based scoring than they did according to syllable-based scoring or word-based scoring, which was also true of the Written Output group and the Reading group. In this case, it is natural that the cross-task lexical learning effects derived from the three scoring methods might not be significantly different, especially when the lexical learning effects of different tasks were close. Similarly, in measuring receptive lexical learning, graded scoring produced higher scores for all three groups than correct meaning scoring did, since the former is more fine-grained than the latter. Thus, it is also natural that the cross-task receptive lexical learning effects generated from the two scoring methods might be so close as to show no significant differences.
The finding about the association between multiple ratings and cross-task effects implies that in task-based experimental studies, it may not matter much whether a more or less fine-grained scoring method is used in rating lexical learning, as long as the scoring method is used consistently. With regard to productive lexical learning, one piece of evidence comes from Barcroft (2002, 2004), who applied differently fine-grained lexical learning scoring methods but did not find significant cross-task moderating effects. The finding of the present study further confirms that of Barcroft (2002, 2004). The literature provides various productive lexical rating options such as binary scoring (e.g. de la Fuente, 2003), 3-point scoring (e.g. Barcroft, 2009; Barcroft & Sommers, 2005), or 5-point scoring (e.g. Barcroft, 2003; Barcroft, 2007; Keating, 2008; Smith, 2004). The first does not consider partial word learning while the latter two do. The finding of the study indicates that researchers can choose any of the above methods in examining task-based lexical learning differences, since these discrepant scoring methods do not usually affect cross-task effects significantly. Likewise, with respect to receptive lexical learning, it is not essential whether partial word meaning recall is considered (e.g. Hulstijn & Laufer, 2001; Keating, 2008) or not (e.g. de la Fuente, 2002; Min, 2008). The VKS aims to count learners' partial word learning (Wesche & Paribakht, 1996). The finding of the present study indicates that it is worth examining whether the VKS is necessary for measuring task-related receptive lexical learning.
In practice, whether a more or less fine-grained scoring method should be used in rating lexical learning depends on how word learning is defined, because differently fine-grained scoring methods target different components of what learners recall. In the present study, letter-based scoring, syllable-based scoring, and word-based scoring take the correct letter, the correct syllable, and the correct word respectively as the scoring unit. A score of 5 based on letter-based scoring or syllable-based scoring does not necessarily mean that the learner has recalled 5 target words, while a score of 5 based on word-based scoring means that the learner has recalled 5 words correctly. Thus, letter-based scoring and syllable-based scoring might be misleading if productive word learning is defined as correctly producing a word. In like manner, a score of 1 generated from graded scoring does not necessarily mean that the learner recalled the meaning of one word correctly, as a score of 1 from correct meaning scoring does. Instead, it might be that the learner selected Category II of the VKS for four different words (4 × 0.25). Hence, graded scoring is misleading, too, if receptive lexical learning is defined as recalling the meaning of a word correctly. Therefore, researchers need to consider the conceptualizations behind differently fine-grained scoring methods when deciding whether to rate partial word learning.

Multiple Ratings and Intra-Learner Lexical Learning Gain
The study found that for both receptive and productive lexical learning, differently fine-grained scoring methods brought about significantly different intra-learner word learning gain. This finding implies that whether partial word learning is counted matters considerably in judging the effectiveness of a task through measuring learners' lexical learning, an implication that can be extended to measuring learners' vocabulary size through vocabulary tests. Specifically, it is very likely that counting or not counting partial word learning (for instance, employing letter-based or syllable-based scoring rather than word-based scoring in rating productive lexical learning, and adopting graded scoring instead of correct meaning scoring in rating receptive lexical learning) will reveal significantly different degrees of task-related lexical gain (and significantly discrepant vocabulary sizes for individual learners). Therefore, researchers should be very cautious in choosing rating methods for measuring individual learners' lexical gain (or vocabulary size). In particular, if they aim to capture learners' lexical gain over a period of time (or vocabulary size) as accurately as possible, scoring systems that incorporate partial learning, such as the VKS scoring scheme (Wesche & Paribakht, 1996) and the LPSP (Barcroft, 2002), are indispensable.
However, the problem of defining lexical learning also applies here, since the use of differently fine-grained scoring methods may mean different lexical units being measured. Considering the connotations behind the differently fine-grained scoring methods as discussed above, researchers should be clear about the essential differences that using different rating methods may make in measuring individual learners' lexical learning gain (or vocabulary size). On the one hand, choosing differently fine-grained scoring methods means that different degrees of lexical gain (or different vocabulary sizes) may be obtained for individual learners; on the other hand, differently fine-grained scoring methods may mean different word components being weighed, as argued above. Thus, researchers should carefully ensure that their definition of lexical learning and what they measure are consistent.

Conclusion
The study revealed that the use of differently fine-grained scoring methods did not necessarily affect cross-task lexical learning effects significantly, but it did make a difference in measuring individual learners' lexical learning. The study not only addresses a research gap concerning the association between differently fine-grained scoring methods and learners' lexical gain, but its findings also provide both empirical evidence and implications for whether and how a more or less fine-grained scoring method should be adopted in rating lexical learning. Like many studies, the present study has its limitations. Specifically, it used only a small word sample and did not control the length of the words in terms of the number of letters or syllables that they contained, which might have affected the results. Hence, future research may examine a larger word sample and use words of the same length to validate the findings of the study.
Fragment of the task procedure table: ... in pairs with 16 cued words orally (15 mins), and then report the reconstruction by one member (10 mins); compare oral reconstruction with the original passage in pairs (10 mins).

Table 3. Cross-task lexical learning effects of 2 scoring methods on 4 receptive posttests.
Note: * Significant at the p ≤ 0.05 level; ** Significant at the p ≤ 0.01 level.

Table 4. Descriptive data of all learners resulting from 2 scoring methods on 4 receptive posttests.

Table 5. Mean differences of scores derived from 2 scoring methods on 4 receptive posttests.

Table 8. Cross-task lexical learning effects of 3 scoring methods on 4 productive posttests.
Note: * Significant at the p ≤ 0.05 level; ** Significant at the p ≤ 0.01 level.

Table 9. Descriptive data of all learners resulting from 3 scoring methods on 4 productive posttests.

Table 10. Mean differences of scores derived from 3 scoring methods on 4 productive posttests.

Table 11. Correlations of scores derived from 3 scoring methods on 4 productive posttests.