The Phonetics of Multiple Vowel Lengthening in Japanese

Many languages exploit a short vs. long lexical contrast in vowels. In most, if not all, of these languages, the contrast is binary. In Japanese, however, speakers can lengthen vowels to express emphasis, and multiple degrees of lengthening can be used to express different degrees of emphasis. This paper offers the first experimental documentation of this emphatic vowel lengthening phenomenon. The current results demonstrate that, among the seven speakers recorded, at least two speakers show six levels of distinction in duration, and all but one speaker show a steady linear correlation between duration and level of emphasis. We conclude that Japanese speakers have articulatory control that allows them to make very fine-grained durational distinctions, which go beyond mere binary short vs. long distinctions.


Introduction
Many languages distinguish short vowels from long vowels to make lexical contrasts, but these duration-based length contrasts are usually binary; e.g. [hato] "dove" vs. [haato] "heart" and [obasaN] "aunt" vs. [obaasaN] "grandmother" in Japanese. While there is the rare typological exception such as Estonian, in which this contrast can be ternary (Prince, 1980), the distribution of superlong vowels is constrained by various prosodic and morphological factors (see Ladefoged & Maddieson, 1996; Lehiste, 1970; Prince, 1980 for discussion). Ladefoged and Maddieson (1996: p. 320) state that Mixe (Hoogshagen, 1959) is the only language that they know of that has a purely lexical duration-based three-way contrast (cf. Jany, 2006, 2007), although they also mention Yavapai (Thomas & Shaterian, 1990) as another possible candidate. At any rate, three-way vowel length contrasts are rare at best cross-linguistically, and in the languages where they do exist, the ternary contrast is prosodically and/or morphologically restricted. As far as we know, there are no convincing cases of languages that make use of a purely lexical four-way (or greater) duration-based length contrast in vowels.¹
In Japanese, however, speakers can use vowel lengthening to express emphasis. This process is commonly found in colloquial Japanese; a quick Google search (http://www.google.co.jp) with examples like [suɡoo-i] (すごーい) "great" and [çidoo-i] (ひどーい) "awful" with lengthened stem-final vowels yields many hits. In addition, this pattern can manifest as multiple levels of emphasis (and therefore lengthening), extending beyond the familiar short/long binary distinction.²
This study offers the first experimental documentation of the vowel lengthening pattern.³ One theoretical contribution of this paper is to investigate exactly how many levels of durational distinction Japanese speakers can make in expressing different degrees of emphasis, especially given that lexical vowel length contrasts are usually limited to a binary distinction in many languages, including Japanese.
Durational properties of Japanese short vowels and long vowels have been studied rather extensively in the previous literature both in terms of their production and perception (Behne et al., 1999; Braver & Kawahara, 2012; Han, 1962; Hirata, 2004; Hirata & Lambacher, 2004; Hirata & Tsukuda, 2009; Hoequist, 1982; Kinoshita et al., 2002; Moreton & Amano, 1999; Port et al., 1987). These studies have shown that duration is the major acoustic and perceptual correlate of short vs. long contrasts in Japanese, although there may be slight differences in formant characteristics as well, in such a way that long vowels are more dispersed in F1 and F2 dimensions than short vowels (Hirata & Tsukuda, 2009).

² Japanese speakers can also lengthen consonants to express emphasis (Aizawa, 1985; Kawahara, 2001, to appear; Nasu, 1999). For a phonetic study testing different degrees of lengthening of Japanese consonants, see Kawahara (2012b). For a previous phonetic study investigating various acoustic properties of "paralinguistic focus", which may be similar to what the current project examines, see Maekawa (1998).
We also note, as we will discuss in the General Discussion section, that English has a similar process, as in Thank you sooooo much and she's so cuuuuuuute. See a post on Language Log by Mark Liberman (http://languagelog.ldc.upenn.edu/nll/?p=2006) for related observations. It is beyond the scope of the current study to conduct a cross-linguistic comparison, but a cross-linguistic study of this sort of lengthening phenomenon is certainly hoped for.
Although the phonetics of Japanese short and long vowels has been well studied in the past, to the best of our knowledge, there has not been experimental documentation of the emphatic lengthening pattern, which makes use of multiple levels of durational distinctions. One relevant study is Kakehi and Hirose (1997), which tested the production of (heteromorphemic) sequences of the same vowels across morphemes in Japanese (e.g. Matsue e ejiten-wo okutta "(I) sent a picture dictionary to Matsue"), and showed that Japanese speakers do make a distinction among 2 consecutive [e]s, 3 consecutive [e]s, 4 consecutive [e]s, and 6 consecutive [e]s in their production. Drawing on this study, our study below investigates vowel lengthening patterns with multiple levels, and shows that Japanese speakers can make similarly fine-grained durational distinctions even within single morphemes, and that this fine distinction can hold across a wider range of vowels in Japanese.

Method

Stimuli
This study used emphasis of stem-final vowels in adjectives, which is commonly observed in Japanese casual speech. The stimuli were grouped according to their final vowels, [a, o, u], which commonly appear stem-finally in Japanese adjectives.⁴ For each vowel, two adjectives were chosen. The adjectives used in this experiment are listed in Table 1, where [-i] is an adjectival ending (present/non-past tense). All the stimuli were disyllabic and had a lexical pitch accent on the second syllable (i.e. the second syllable had an HL falling pitch contour). A subject noun was added to each adjective to make a complete sentence: e.g. [çiza-ɡa ita-i] "(I have) a knee pain".⁵
In Japanese orthography, vowel length can be expressed using "ー" following the target vowel.⁶ In this experiment, in addition to the non-lengthened rendition, five different degrees of emphasis were included as stimuli, as illustrated in Table 2.
There were a total of 36 stimuli (3 vowels * 2 adjectives * 6 emphasis levels). A random number was assigned to each stimulus item so that transcribers could later track which item had been produced.
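The design arithmetic above can be made concrete with a short Python sketch. It is only an illustration: the actual adjectives are those in Table 1, so the adjective slots here are placeholders, and the use of unique three-digit tracking numbers is an assumption (the text says only that a random number was assigned to each item).

```python
from itertools import product
import random

VOWELS = ["a", "o", "u"]        # stem-final vowels used in the study
ADJECTIVES_PER_VOWEL = 2        # placeholder slots; real items are in Table 1
EMPHASIS_LEVELS = range(6)      # 0 = non-lengthened, 1-5 = degrees of emphasis

def build_stimuli(seed=0):
    """Enumerate the full crossing of vowel x adjective x emphasis level."""
    items = [
        {"vowel": v, "adjective": a, "level": lvl}
        for v, a, lvl in product(
            VOWELS, range(1, ADJECTIVES_PER_VOWEL + 1), EMPHASIS_LEVELS
        )
    ]
    # Assumption: tracking numbers are unique three-digit integers.
    rng = random.Random(seed)
    ids = rng.sample(range(100, 1000), len(items))
    for item, tracking_id in zip(items, ids):
        item["id"] = tracking_id
    return items

stimuli = build_stimuli()
print(len(stimuli))  # 36 = 3 vowels * 2 adjectives * 6 emphasis levels
```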

Participants
The participants were seven native speakers of Japanese (anonymously coded as Speakers TF, TN, TX, TW, TT, SX, TV). They were all undergraduate students at International Christian University (Tokyo, Japan). They were paid 500 Japanese yen for their time. They all signed a consent form before participating in the experiment.

Table 1. The list of stimuli.

Table 2. An illustration of one stimulus set in Japanese orthography. (Columns: Japanese orthography, Transcription, Condition.)

Procedure
The recording sessions took place in a sound-attenuated room at International Christian University. The stimuli and all instructions were presented in Japanese orthography using Superlab ver. 4.0 (Cedrus Corporation, 2010). In the instructions, speakers were told that the experiment was about multiple levels of emphasis in Japanese, and that they were going to read sentences with vowels of differing length. They were instructed to read the whole frame sentence, not just the target words, for each stimulus.
Each block contained one token of every stimulus item. The speakers were allowed to take a short break after each block. The order of the stimuli within each block was randomized by Superlab. The speakers went through ten blocks, which resulted in a total of 360 tokens (36 stimuli * 10 repetitions). Each speaker was assigned 30 minutes for the experiment.
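The blocked presentation scheme can be sketched as follows. In the experiment the randomization was handled by Superlab; this Python version only illustrates the logic, and the function name and seed are hypothetical.

```python
import random

def make_blocks(n_items=36, n_blocks=10, seed=1):
    """Return one independently shuffled order of all items per block."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        order = list(range(n_items))  # every stimulus appears once per block
        rng.shuffle(order)            # fresh random order within each block
        blocks.append(order)
    return blocks

blocks = make_blocks()
total_tokens = sum(len(block) for block in blocks)
print(total_tokens)  # 360 = 36 stimuli * 10 repetitions
```

Because each block is a full permutation of the item set, every stimulus is produced exactly ten times per speaker, and repetitions of the same item are spread across the session rather than clustered.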
Before the main session, as practice, each speaker read all the stimuli once to familiarize themselves with the stimuli and the task. After the practice phase, the experimenter (the first author) clarified any questions that they had. Speakers were recorded directly via a portable recorder (TASCAM DR-40) with a 44.1 kHz sampling rate and a 16 bit quantization level. The first author sat with each speaker throughout the experiment to monitor the progress of the recording.

Acoustic Analysis
The duration of each stem-final vowel plus the adjectival suffix [i] was measured. We did not attempt to put a boundary between the stem-final vowels and the suffixal [i], because the transitions from the stem vowels into the suffixal [i] were blurry (a vowel-to-vowel transition is generally blurry and hard to unambiguously locate in an acoustic analysis: Turk et al., 2006). However, since only the stem-final vowels were emphasized, and not the suffixal vowel (see Table 2), the duration of [i] should be more or less constant across all conditions. Vowel onset and offset were determined by inspecting both waveforms and spectrograms, and the boundaries were placed where F2 and F3 (dis-)appear. Sample spectrograms are shown in Figure 1.
After the segmental boundaries were placed, the durations of the target intervals were automatically extracted. Acoustic measurements were done using Praat (Boersma & Weenink, 1999-2013).

Statistics
Since there are many comparisons (6 levels of emphasis * 3 types of vowels * 7 speakers), no pair-wise comparisons at each emphasis level were conducted, in order to avoid inflating the Type I error rate (i.e. to avoid finding some significant effects by chance). However, error bars, which represent 95% confidence intervals, are provided in the result figures. They were calculated over 20 repetitions of each vowel (2 adjectives * 10 repetitions), except when speakers mispronounced some relevant token. A post-hoc inspection of the data showed that a linear regression analysis would be useful, so its results are reported in the results section. All statistical analyses were performed using R (R Development Core Team, 1993-2013). R was also used to generate result figures.
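For readers who wish to reproduce error bars of this kind, the following Python sketch computes a 95% confidence interval over 20 tokens. The text does not state how the intervals were computed in R, so the t-based formula here is an assumption, and the duration values are randomly generated placeholders rather than measurements from the study.

```python
import math
import random
import statistics

# Two-sided 95% critical value of Student's t for 19 degrees of freedom.
# It applies only when n = 20; a cell with excluded tokens needs the
# critical value for its own n - 1 degrees of freedom.
T_CRIT_DF19 = 2.093

def ci95(durations_ms, t_crit=T_CRIT_DF19):
    """Return (lower, upper) bounds of a t-based 95% confidence interval."""
    n = len(durations_ms)
    mean = statistics.mean(durations_ms)
    sem = statistics.stdev(durations_ms) / math.sqrt(n)  # standard error
    half_width = t_crit * sem
    return mean - half_width, mean + half_width

# Hypothetical data: 20 tokens (2 adjectives * 10 repetitions) for one cell.
rng = random.Random(0)
tokens = [rng.gauss(600, 40) for _ in range(20)]
lower, upper = ci95(tokens)
print(f"mean = {statistics.mean(tokens):.1f} ms, 95% CI = [{lower:.1f}, {upper:.1f}]")
```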

Results
Since different speakers showed different patterns, we report the results of individual speakers separately, and present a summary in the next section after reporting the results of individual speakers. We start first by discussing those speakers who showed the clearest distinctions among the different emphasis levels. First, as shown in Figure 2, Speaker TF seems to make a perfect six-way distinction; i.e., the vowel durations for each level of emphasis are different from those of every other level of emphasis for this speaker, and error bars do not overlap.
There are large jumps in duration from the non-emphatic level to the first level of emphasis; with each additional degree of emphasis, there is a shorter, but steady, increase in duration.
To assess the correlation between emphasis level and duration, a linear regression analysis was run with vowel duration as the dependent variable, and emphasis level as the independent variable. Since the increase from non-emphatic vowels to the first level of emphasis is non-linear, the non-emphatic vowels were excluded from this regression analysis. The coefficient estimate of the regression analysis is 120 ms (t(247) = 30.8, p < .001). This coefficient estimate represents an average durational increase per emphasis level for this speaker. In other words, it estimates that for each level of emphasis, vowel duration should increase by 120 ms. The correlation between duration and emphasis level is very high (r = .89), showing that the linear relationship between durational increase and emphasis level is very strong.
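The regression just described can be sketched in Python as an ordinary least-squares fit of duration on emphasis level, with the non-emphatic level excluded. The analysis in the paper was run in R; this is only an illustration, and the token durations below are simulated placeholders (the 120 ms step is used merely as a plausible slope, not as a reanalysis of the data).

```python
import random

def fit_line(x, y):
    """Least-squares slope, intercept, and Pearson r for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # ms of lengthening per emphasis level
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5           # strength of the linear relationship
    return slope, intercept, r

# Hypothetical tokens: level 0 is non-emphatic, levels 1-5 are emphatic.
rng = random.Random(42)
tokens = []
for level in range(6):
    base = 150 if level == 0 else 300 + 120 * level
    tokens += [(level, base + rng.gauss(0, 60)) for _ in range(20)]

# Exclude the non-emphatic level, as in the analysis described above.
emphatic = [(lvl, dur) for lvl, dur in tokens if lvl >= 1]
levels = [lvl for lvl, _ in emphatic]
durations = [dur for _, dur in emphatic]

slope, intercept, r = fit_line(levels, durations)
print(f"slope = {slope:.1f} ms per level, r = {r:.2f}")
```

The slope corresponds to the coefficient estimate reported for each speaker, and r to the reported correlation.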
As shown in Figure 3, like Speaker TF, Speaker TX shows a six-level distinction among emphatic vowels. The average duration for each condition differs, and error bars barely overlap. In the regression analysis, the coefficient estimate is 105 ms (t(245) = 20.2, p < .001), and the correlation estimate r is .79. As with Speaker TF, there are large durational jumps from non-emphatic to emphatic vowels. The emphatic vowels show steady, linear increases in duration, except for exceptionally large differences between emphasis level 4 and emphasis level 5. These large, non-linear jumps may be responsible for the lower r-value compared to that of Speaker TF. Presumably, for this speaker, the most emphatic vowel has a special status, so it receives extra lengthening.
Speaker TN, as shown in Figure 4, showed the next clearest increase in duration as the emphasis levels go up. Although the speaker generally does not show a clear difference between level 1 and level 2 emphasis for any of the three vowels, the speaker nevertheless seems to make a difference between the other emphasis levels. This speaker also makes an exceptionally large increase from level 4 to level 5 for [a]. In the regression analysis, the coefficient estimate is 78 ms (t(230) = 20.7, p < .001), and the correlation coefficient r is .81.
Speaker TW did not show differences between level 1 and level 2 (or level 3 for [u]), as illustrated in Figure 5. It is as though this speaker was treating these levels of emphasis as one category of emphasis. However, the speaker did make a distinction between other levels of emphasis. The correlation between emphasis level and duration is therefore still high (r = .76). In the regression analysis, the coefficient estimate is 51 ms (t(240) = 17.9, p < .001). The smaller estimate is also reflected in this speaker's duration range; in Figure 5, the duration range is about 600 ms, whereas for the previous speakers, the duration ranges are between approximately 800 ms and 1000 ms (Figures 2-4).
As shown in Figure 6, Speaker TT shows the next highest correlation between emphasis level and duration (r = .66). This speaker shows large variability in several conditions (as represented in the size of the error bars for these conditions); e.g. emphasis level 5 for [a], and at all emphasis levels for [u]. This speaker also does not show a difference between level 1 and level 2 for [u]. These behaviors may be responsible for the lower r-value of this speaker compared to those discussed above. In the regression analysis, the coefficient estimate is 75 ms (t(242) = 13.8, p < .001).
As shown in Figure 7, Speaker SX does not show differences between several of the conditions: between level 2 and level 3 for [a], between level 4 and level 5 for [o], and between level 3 and level 4 for [u]. The lack of differences in these conditions resulted in an r-value that is lower than those of the previous speakers (r = .61); however, this linear correlation is still high. In the regression analysis, the coefficient estimate is 27 ms (t(248) = 12.1, p < .001).
Finally, as shown in Figure 8, Speaker TV shows a more or less binary distinction (i.e. non-emphatic vs. emphatic), although we do observe a slight increase in duration as emphasis levels go higher (r = .41). Indeed, the regression analysis reveals that the coefficient estimate is as low as 12 ms, although it did reach statistical significance (t(245) = 7.2, p < .001).

Summary
Table 3 provides a summary of each speaker's data. It provides a regression function, an r value as a measure of the strength of the linear correlation between emphasis levels and duration, and maximum duration (token-wise) as a measure of their duration range, that is, the range each speaker is willing to use for the emphatic vowels.
In spite of some inter-speaker variability, all speakers showed a positive, steady correlation between level of emphasis and vowel duration. Speakers TF and TX showed a perfect six-way durational distinction, without much overlap in error bars. While other speakers did not show all these distinctions quite as clearly, they showed a (mostly) steady linear increase in duration as emphasis levels increased. Furthermore, all speakers except Speaker TV made at least a five-way distinction: they either had all levels distinguished, or did not show a difference between two (but not more than two) adjacent levels (with the potential exception of [u] for Speaker TW). On the other hand, Speaker TV appeared to make an (almost) binary distinction between emphasized and non-emphasized vowels. Overall, there were no evident significant reversals, where higher emphasis levels would have shown shorter durations (perhaps except for Speaker TV's [a], level 4 and level 5).
In Table 3, we can observe that there is some association between the strength of correlation (r) and the maximum duration a speaker used; for example, Speaker TF, who showed the highest correlation, used a large duration range, whereas Speaker TV, who showed the lowest correlation, used the smallest duration range. The correlation is not perfect, however, since Speaker TT showed the second-largest duration range, yet this speaker has the third-lowest r-value.

Summary
The current study, to the best of our knowledge, has provided the first experimental description of the emphatic vowel lengthening pattern in Japanese. Although there is some inter-speaker variability, several speakers were able to make durational distinctions as fine-grained as six-way. Other speakers showed a positive correlation between durations and emphasis levels, all to a statistically significant degree. These patterns are in line with the conclusions drawn from a companion study on emphatic consonant lengthening in Japanese (Kawahara, 2012b), which used a method similar to that of the current experiment to measure the duration of Japanese consonants with multiple degrees of emphasis (the current speakers and those who participated in Kawahara (2012b) do not overlap).
Taken together, beyond providing an experimental description of the Japanese vowel lengthening pattern, the current results show that Japanese speakers have articulatory control that potentially enables them to make six-way durational distinctions.

Further Questions
One question that arises, given that speakers can make such fine-grained durational distinctions, is why natural languages generally deploy only a two-way distinction for lexical contrasts (as discussed in the introduction). One possible answer to this question is that a three-way durational contrast may be difficult to unambiguously perceive in real communicative situations; in other words, perceptual distinctiveness restricts the range of possible contrasts that the grammar can deploy (see e.g. Boersma, 1998; Diehl et al., 2004; Flemming, 1995, 2004; Liljencrants & Lindblom, 1972; Lindblom, 1986; Padgett, 2002; Schwartz et al., 1997a, 1997b; see especially Engstrand & Krull, 1994; Podesva, 2000; and Kawahara, 2012a for the grammatical imperatives on perceptual dispersion in durational contrasts). For a discussion of an alternative, more formally-based explanation, see the companion paper (Kawahara, 2012b).
The current study also raises many questions which should be addressed in future studies. For example, would Japanese listeners be able to track these different degrees of emphasis? The current experiment used only up to 5 levels of emphasis, but given how well some speakers performed, what would the real limit for Japanese speakers be? Although the current paper focused on vowels only, it is possible to lengthen consonants (Kawahara, 2012b), and it is also possible to lengthen both vowels and consonants: e.g. [suɡɡoo-i] (すっごーい). How vowel lengthening and consonant lengthening interact is an interesting question. Also, it is possible to lengthen stem-initial vowels, as in [suuɡo-i] (すーごい), instead of stem-final vowels, as in [suɡoo-i] (すごーい). Whether the position of emphasis affects durational manifestations of vowels is a question worth pursuing. Additionally, the differences, if any, between lengthened vowels and (heteromorphemic) vowel sequences (Kakehi & Hirose, 1997) merit investigation. Toshio Matsuura (p.c.) offers an example paradigm in Table 4 to address this last question (although this paradigm does not control for accent).
Comparing a paradigm like the one in Table 4, which contains heteromorphemic strings of up to six [o]s, with our current results, which show six-way contrasts within a single adjective, may reveal interesting effects of morphological boundaries on phonetics (see e.g. Bird, 2004; Cho, 2001; Frazier, 2006; Pluymaekers et al., 2010).
Moving beyond Japanese, would we expect speakers of other languages to be able to produce similar durational differences (and would they make as many levels of distinction)? Would other languages draw the boundaries between each durational level at the same place? Would there be a difference between languages that exploit duration-based contrasts (as in Japanese) and those that do not? In English, for example, we observe examples like: Thank you sooooooo much, I loooooooove you and She's so cuuuuuuute. Given these stimuli, would English speakers make distinctions similar to those of the Japanese speakers tested in this experiment? Further, as an anonymous reviewer points out, semantic focus can be realized in acoustic dimensions other than duration; e.g. stronger intensity and pitch range expansion (see Ishihara, 2003; Liu & Xu, 2005; Taheri Ardali & Xu, 2012; Xu, 2005 among many others). It remains to be investigated how Japanese speakers (and speakers of other languages, for that matter) make use of these acoustic dimensions to express the sort of emphasis investigated in this paper.
Finally, Hirata and Tsukuda (2009) show that long vowels are more dispersed in their F1 and F2 dimensions than short vowels in Japanese. Thus, the effects of emphatic vowel lengthening on formant displacement should be explored in future studies. All of these are interesting questions, which are, however, beyond the scope of the current study.

A Final Remark
We would like to close with a remark about the distinction between non-emphatic vowels and emphatic vowels. Recall that all the speakers produced the emphatic vowels as longer than the non-emphatic vowels, despite the fact that not all speakers realized differences among all different levels of emphasis. Moreover, as observed in all the figures, all speakers showed a very large increase in duration from non-emphatic vowels to emphatic vowels, and this increase is larger than the observed differences between the various levels of emphatic vowels. We therefore suggest that Japanese speakers overall make a binary distinction between emphatic and non-emphatic durations, and within the emphatic durations, speakers differ in how to acoustically realize the degrees of emphasis. This conclusion may imply that, semantically speaking, the difference between non-emphatic and emphatic is more important than different degrees of emphasis. Further, Japanese speakers attempt to reflect this difference in semantic importance in their production of emphatic and non-emphatic vowels. Again, we find the same patterning in the companion study on consonant lengthening (Kawahara, 2012b), which reinforces this conclusion.

Figure 2. The average durations of each emphasis level with 95% confidence intervals: Speaker TF.

Figure 3. The average durations of each emphasis level with 95% confidence intervals: Speaker TX.

Figure 4. The average durations of each emphasis level with 95% confidence intervals: Speaker TN.

Figure 5. The average durations of each emphasis level with 95% confidence intervals: Speaker TW.

Figure 6. The average durations of each emphasis level with 95% confidence intervals: Speaker TT.

Figure 7. The average durations of each emphasis level with 95% confidence intervals: Speaker SX.

Figure 8. The average durations of each emphasis level with 95% confidence intervals: Speaker TV.

Table 3. Summary of each speaker's behavior.

Table 4. An illustration of one stimulus set in Japanese.