Is This a Paraphrase? What Kind? Paraphrase Boundaries and Typology

A precise and commonly accepted definition of paraphrasing does not exist. This is one of the reasons why computational linguistics has not achieved real success when dealing with this phenomenon in its systems and applications. With the aim of helping to overcome this difficulty, this article provides new insights on paraphrase characterization. We first overview what has been said on paraphrasing in linguistics and the new light shed on the phenomenon by computational linguistics. In light of the shortcomings observed, the paraphrase phenomenon is studied from two different perspectives. On the one hand, insights on paraphrase boundaries are set out by analyzing paraphrase borderline cases and the interaction of paraphrasing with related linguistic phenomena. On the other hand, a new paraphrase typology is presented. It goes beyond a simple list of types and is embedded in a linguistically based hierarchical structure. This typology has been empirically validated through corpus annotation and its application in the plagiarism-detection domain.


Introduction
Although the computational linguistics¹ community has been working on paraphrasing over the last decades, it continues to be a challenging and unresolved issue. One of the main reasons is found in the multifaceted and boundless nature of the phenomenon, which makes its automatic treatment complicated.
Computational linguists have looked for precise and computationally treatable knowledge on paraphrasing in the linguistics field without reaching a definitive solution. This has led researchers to rely on vague definitions of paraphrasing, such as "expressing one thing in other words" (Shinyama, Sekine, & Sudo, 2002), "alternative ways to convey the same information" (Barzilay, 2003), or "sentences or phrases that convey approximately the same meaning using different surface words" (Bhagat, 2009), and to develop techniques based on workable paraphrase notions that are partial and ad hoc.
In this scenario, our aim is to go a step further in the linguistic characterization of paraphrasing in order to provide Natural Language Processing (NLP) with more solid grounds for the development of methods and systems dealing with paraphrasing. We adhere to Wintner (2009), who calls for the return of linguistics to computational linguistics: "what makes our systems special is the fact that they manipulate natural languages, and the only scientific field that can inform our work is linguistics." Specifically, we overview what has been said about paraphrasing in linguistics, how computational linguistics has used this knowledge as a basis for its systems, and what new insights into paraphrasing have been derived from them. In light of the shortcomings observed, our proposal on paraphrase characterization is set out. It aims to help in answering two questions that reflect two different approaches to the phenomenon: "is this a paraphrase?", which raises the issue of where paraphrase boundaries should be drawn, and "what kind?", which aims to describe the linguistic manifestations of paraphrasing, made concrete in a typology.
Our work is not tied to any concrete theoretical framework. Moreover, it has been empirically validated through annotation with our typology of more than 5700 paraphrase pairs from three paraphrase corpora, which are different in nature and cover two languages: the PAN-PC-10 corpus (Potthast et al., 2010), the Microsoft Research Paraphrase corpus, MSRP (Dolan & Brockett, 2005), and the Wikipedia-based Relational Paraphrase Acquisition corpus, WRPA (Vila, Rodríguez, & Martí, 2013). The annotated subsets of these corpora are called, respectively, P4P, MSRP-A, and WRPA-A. P4P and MSRP-A are in English and WRPA-A is in Spanish (Vila et al., Submitted)².
In Section 2, the state of the art on paraphrasing in linguistics and computational linguistics is set out. Section 3 presents our proposals on paraphrase boundaries and typology. Finally, conclusions and future work are presented in Section 4³.

What Has Been Said about Paraphrasing?
Paraphrasing has been conceived and apprehended from different angles in linguistics and computational linguistics. The variety of visions of paraphrasing is even larger if we consider fields like discourse analysis or psycholinguistics, which have also addressed the phenomenon. This variety is enlarged again if we adopt a diachronic view, including disciplines such as rhetoric or biblical exegesis. The broad and multifaceted nature of paraphrasing is thus reflected in the varied literature on the topic.
In what follows, we focus on how paraphrasing has been understood in linguistics (Section 2.1) and computational linguistics (Section 2.2)⁴.

In Linguistics
In the field of linguistics, paraphrasing is at the core of two theories that set out language models focusing on production: Meaning-Text Theory (MTT) and Systemic-Functional Grammar (SFG). Their proposals are substantially different in essence, but their approaches to paraphrasing are similar: both see language production as a system of choices or alternatives, which can give rise to paraphrases.
MTT gives rise to Meaning-Text Models (MTMs). Such models incorporate a grammar organized in seven levels of representation (with semantics and phonetics at the two extremes) comprising six components, which contain the rules that allow for going from one level of representation to the next. The second constituent in MTMs is the Explanatory Combinatorial Dictionary (ECD), which governs the whole process. Lexical Functions (LF), which identify recurrent patterns of semantic-syntactic correspondence, are a fundamental part of the ECD.
Within this framework, two paraphrase mechanisms can be identified. First, paraphrases can be produced in the transition between levels of representation: representations in one level can be projected in two or more representations in the next one. Second, paraphrases can be established through equivalence rules between representations at the same level. Paraphrasing at the deep syntax level was first described by Žolkovskij & Mel'čuk (1965), who built a paraphrasing system comprising lexical and syntactic paraphrasing rules⁵; paraphrasing at the semantic level was more recently described (Milićević, 2007a; Milićević, 2007b). The axiomatic foundations and formal complexity of MTT prevent its straightforward exploitation outside the MTT framework and lead to a costly computational implementation. Nevertheless, ECD and LF in particular are useful in themselves as they encode most of the paraphrase potential in the model.
Although in a less explicit way, paraphrasing is also at the base of SFG: "the systemic theory is a theory of meaning as a choice, by which a language, or any other semiotic system, is interpreted as networks of interlocking options" (Halliday, 1994). In this framework, paraphrases are the result of making alternative choices. Obviously, not all alternants are meaning preserving and, therefore, not all of them give rise to paraphrases.
Other linguistic proposals include elements that can be used in paraphrasing. Transformations, which are at the core of the proposal of Harris (1957) and the Generative Grammar of Chomsky (1965), have been used as a way to represent and enumerate formal relations between sentences. Some of these transformations are paraphrastic as they preserve the meaning of sentences. Transformations take place between surface structures in Harris's approach; in Chomsky's, in contrast, they take place from deep to surface syntax structures. In the latter case, different surface representations derived from the same deep structure can be understood as paraphrases. Following the ideas of Hiż (1964), Smaby (1971) describes a paraphrase transformational grammar that maps equivalent structures. The main interest of this work is the effort to formalize paraphrasing; nevertheless, it only deals with those paraphrases that can be formally apprehended.
With the emergence of generative semantics (Lakoff, 1971), there was a movement to a semantically based framework. Since, in this case, the deep structure is purely semantic, generative semantics appears to be a suitable means for describing paraphrasing⁶. Diathesis alternations, which stand for those alternate structures that are admitted by the same predicate, can also be viewed as paraphrases. Levin (1993) provides diathesis alternations for English; some of them, such as active/passive or causative/inchoative alternations, are of general application, while others are specific to the English language.
There also exist works that analyse and discuss the linguistic nature of paraphrasing. Martin (1976) defines linguistic paraphrasing as logical equivalence. He also describes two mechanisms of linguistic paraphrasing: first, "semic content" identity and "actantial pattern" correspondence, which roughly corresponds to structural reorganizations, and, second, "actantial pattern" identity and "semic content" correspondence, which mainly corresponds to synonymy substitutions. Fuchs (1994), in turn, describes paraphrasing in discourse and in language from a diachronic perspective. Moreover, she argues for the enunciative dimension of paraphrasing: it cannot be reduced to closed equivalence; instead, it consists of a dynamic and approximate relationship. Milićević (2007a), in line with proposals within the MTT framework, analyses paraphrasing as a multifaceted and variable phenomenon focusing on the different paraphrase dimensions. Some concrete aspects discussed by these authors are taken up in subsequent sections of this article.
Some of the works mentioned above include lists of paraphrase types. Mel'čuk (1992) enumerates 54 lexical and 29 syntactic paraphrasing rules within the MTT. Milićević (2007a) defines a set of MTT semantic-paraphrase rules and also classifies paraphrases from five different perspectives, such as accuracy of the paraphrase link (exact and approximate) or paraphrase relationship depth (semantic, lexico-syntactic, syntactic, and morphological paraphrases). Lists of transformations (Harris, 1957) or diathesis alternations (Levin, 1993) can also be seen as typologies of potential paraphrases. The latter sets out around 60 diatheses organized in 8 main classes. Martin (1976), in turn, sets out varied paraphrase mechanisms, focusing on paraphrasing by connotative variation, double-negation or double-inversion paraphrasing, and paraphrasing by synonymy substitution.

In Computational Linguistics
We analyse the paraphrase characterization in computational linguistics from two different perspectives. In Section 2.2.1, we analyse the notions of paraphrase that underlie NLP paraphrase techniques. In Section 2.2.2, we overview paraphrase typologies built in this field.

Paraphrase Notions Underlying NLP Methods
While linguistic analysis approaches paraphrasing with the aim of exploring, explaining, and formalizing it, NLP researchers focus on developing methods and techniques to deal with the phenomenon in their systems and applications⁷. Each method applied embodies a way of understanding paraphrasing, and the paraphrases addressed with such a technique are of a particular nature. Sometimes these methods have their roots in linguistics; on other occasions, they were born within NLP.
A number of authors have applied MTT proposals. Boyer & Lapalme (1985) developed a paraphrase generation system based on the ECD and the lexical transformations of the model. Lareau (2002), in turn, presents an automatic text synthesis prototype system, Sentence Garden, which aimed to produce not only one sentence, but all possible sentences that express a given meaning (although the prototype only implemented the semantics-deep syntax interface).
The idea of transformation between surface structures has also been used in NLP. McKeown (1983), for example, sets out a paraphrase component for a question-answering system, where a transformational grammar is used to generate paraphrases. Romano et al. (2006) use transformation rules in their paraphrase-based approach to relation extraction.
The distributional hypothesis of Harris (1954), which states that words occurring in the same contexts tend to have similar meanings, has been widely applied, directly or indirectly, more or less strictly, and under different forms: "sentences which appear in similar contexts are paraphrases" (Barzilay & McKeown, 2001), "if two paths [in a dependency tree] tend to occur in similar contexts, the meanings of the paths tend to be similar" (Lin & Pantel, 2001)⁸, "named entities are preserved across paraphrases" (Shinyama, Sekine, & Sudo, 2002), "the meaning of the text around the source and target entities [in a concrete relation] will be similar throughout their different occurrences" (Vila, Rodríguez, & Martí, 2013), etc.
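The distributional intuition can be illustrated with a minimal sketch. It does not reproduce any of the systems cited above: the toy corpus, the window size, and the function names `context_vector` and `cosine` are assumptions made purely for illustration. Each word is represented by the counts of the words occurring around it, and two words are compared through the cosine of their context vectors.

```python
from collections import Counter
from math import sqrt


def context_vector(target, sentences, window=2):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Hypothetical three-sentence corpus, invented for this example.
corpus = [
    "the pilot landed the plane",
    "the commander landed the plane",
    "the cake tasted sweet",
]
pilot = context_vector("pilot", corpus)
commander = context_vector("commander", corpus)
cake = context_vector("cake", corpus)
print(cosine(pilot, commander), cosine(pilot, cake))
```

In this toy setting, pilot and commander share all their contexts and therefore obtain a higher similarity than pilot and cake. Real systems rely on much larger corpora and on weighted or learned representations.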
Other authors establish the paraphrase link through a third vertex. In the question-answering system of Rinaldi et al. (2003), paraphrases are those linguistic units mapping to the same logical representation. Bannard & Callison-Burch (2005), in turn, start out from the assumption of similar meaning when multiple phrases map onto a single foreign language phrase. The third vertex is a logical meaning representation in the first case and a phrase in another language in the second.
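The third-vertex idea can be sketched in a few lines. This is a simplified illustration and not the method of any of the authors cited: it ignores alignment probabilities altogether, and the phrase table, the German pivot phrases, and the function name `pivot_paraphrases` are hypothetical. Two source phrases become paraphrase candidates when they map onto the same pivot.

```python
from collections import defaultdict


def pivot_paraphrases(phrase_table):
    """Pair up source phrases that share at least one pivot phrase.

    `phrase_table` maps each source phrase to the set of pivot phrases
    (for example, foreign-language phrases) it is aligned with.
    """
    by_pivot = defaultdict(set)
    for source, pivots in phrase_table.items():
        for pivot in pivots:
            by_pivot[pivot].add(source)

    candidates = defaultdict(set)
    for sources in by_pivot.values():
        for source in sources:
            # Every other phrase reaching the same pivot is a candidate.
            candidates[source] |= sources - {source}
    return dict(candidates)


# Toy phrase table, invented for this example.
table = {
    "under control": {"unter kontrolle"},
    "in check": {"unter kontrolle"},
    "at home": {"zu hause"},
}
print(pivot_paraphrases(table))
```

The same grouping would apply if the pivot were a logical form instead of a foreign phrase; only the keys of the table would change.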
Similarity measures have also been used to address paraphrasing in NLP. In this framework, paraphrases are those text snippets with a high level of overlapping or a low distance. Similarity can be calculated at word level using, for example, string edit distance or n-gram overlapping (Dolan & Brockett, 2005); at syntax level, applying tree edit distance (Kouylekov & Magnini, 2005); and, at semantic level, taking advantage of semantic roles in PropBank or FrameNet frames, using a semantic space such as WordNet or Wikipedia, or using distributed representations of co-occurrences, usually vector-based (Baroni & Lenci, 2010)⁹. The latter approach is currently a very active research area. Semantic similarity has also been addressed in the Semantic Textual Similarity task at SemEval 2012, where paraphrases are ranked according to their similarity level¹⁰.
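The two word-level measures mentioned above can be sketched as follows. This is a minimal illustration under simplifying assumptions (lowercasing, whitespace tokenization, Jaccard overlap as the overlap measure); the function names are ours and do not correspond to the implementation of any cited work.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n=1):
    """Jaccard overlap between the n-gram sets of two sentences."""
    grams_a = ngrams(a.lower().split(), n)
    grams_b = ngrams(b.lower().split(), n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def edit_distance(a, b):
    """Word-level Levenshtein distance between two sentences."""
    x, y = a.lower().split(), b.lower().split()
    previous = list(range(len(y) + 1))
    for i, token_x in enumerate(x, start=1):
        current = [i]
        for j, token_y in enumerate(y, start=1):
            cost = 0 if token_x == token_y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


s1 = "The pilot was having breakfast"
s2 = "The commander was having breakfast"
print(edit_distance(s1, s2))   # 1: a single word substitution
print(ngram_overlap(s1, s2))   # 4 shared unigrams out of 6 distinct ones
```

Both measures only see surface form: a pair that shares meaning but few words would score poorly here.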
To conclude, each NLP technique applied addresses a concrete paraphrase facet, which is generally partial and ad hoc. In this regard, a major distinction can be made. In methods relying on the formal mapping of the paraphrase members (transformations and formal similarity measures), paraphrases addressed must be similar in form. This is not the case for those methods where no formal mapping is necessarily assumed (MTT, distributional hypothesis, semantic similarity measures, and third vertex).

Paraphrase Typologies
Many NLP researchers have found in typology building a way to apprehend paraphrasing. Early works on paraphrase typologies are Culicover (1968) and Honeck (1971). They set out similar typologies in the sense that both divide their paraphrase types into formalizable and non-formalizable ones, leaving the latter group outside the scope of their work. This has been a general tendency in NLP, and paraphrases where no formal mapping can be established have hardly been addressed. Specifically, Culicover (1968) presents a paraphrase typology of five types (transformational, attenuated, lexical, derivational, and real-world) and carries out a formalization attempt through the definition of some structural and semantic conditions to be fulfilled by each of the paraphrase types. He makes a division between computationally "accessible" and "inaccessible paraphrase relationships" and focuses on the accessible ones, leaving those inaccessible (most real-world paraphrases) under-explained. Honeck (1971), in the psychology field, offers a taxonomy of three types of paraphrases, including transformational, lexical, and formalexic (combination of the two); however, he isolates two extra types of paraphrases that are outside the scope of his study: parasyntactic (unavailable for formal treatment) and syndetic (combination between the other types), where no formal correspondences can be established.
Sometimes paraphrasing is classified in a very generic way setting out only a few types, such as in Shimohata (2004: pp. 15-18) or Barreiro (2008: pp. 29-33). These types generally stand for the type of linguistic units or the level of language where paraphrases take place. There also exist typologies that focus on concrete paraphrase cases, such as paraphrases involving support-verb constructions (Barreiro, 2008: pp. 73-81), and typologies that come from paraphrase-related fields, such as text reuse (Clough, 2003: p. 100) or editing (Faigley & Witte, 1981).
There also exist exhaustive paraphrase typologies focusing on concrete paraphrase facets, such as syntactic (Dras, 1999) or lexical mechanisms (Bhagat, 2009), or covering paraphrasing in a more comprehensive way (Fujita, 2005). More specifically, Dras (1999) sets out 54 types expressed in terms of syntactic transformations and groups them into five classes standing for paraphrase effects: change of perspective, change of emphasis, change of relation, deletion, and clause movement. Bhagat (2009), in turn, classifies paraphrases according to the lexical changes involved (e.g. actor/action substitution or noun/adjective conversion) and links each of these types to the structural modifications accompanying them (substitution, addition/deletion, and/or permutation). Finally, Fujita (2005) presents a general classification of lexical and structural paraphrases¹¹ setting out 24 paraphrase types grouped into six classes including paraphrases of single content words, function-expressional paraphrases, paraphrases of compound expressions, clause-structural paraphrases, multi-clausal paraphrases, and paraphrases of idiosyncratic expressions.
Approaches to paraphrase characterization from NLP are generally partial and ad hoc, but have opened new windows onto our understanding of the paraphrase phenomenon. In Section 2.2, we have shown how computational linguistics can "shed[s] new light on phenomena that traditional approaches fail to account for [and] bring refreshing insights and new points of view to all branches of linguistics" (Wintner, 2009).

Paraphrase Characterization
As shown in Section 2, a precise and commonly accepted definition of paraphrasing does not exist. From the perspective of linguistics and computational linguistics, the definition of "approximate sameness of meaning" is generally assumed, but it is vague (to what extent can it be "approximate"?) and actually shifts the problem to another location (what is "meaning"?).
In this article, we adopt a different approach to paraphrase characterization. Instead of focusing on the definition of paraphrasing itself, we address the questions of where to draw the boundaries between paraphrases and non-paraphrases (Section 3.1) and what phenomena fall under paraphrasing (Section 3.2). We are aware that the fuzziness of paraphrasing is also present in both boundary drawing and typology building, and that these are simply other approaches to the same problem; nevertheless, they allow us to be more precise without abandoning a general perspective on paraphrasing.

Paraphrase Boundaries
Meaning preservation has been discussed at length in the literature. In lexical semantics, Cruse (1986) defines absolute synonymy as an unexpected and merely transitory relationship. Sameness of meaning is also rejected in the paraphrase literature; Fuchs (1988) rejects the idea of paraphrasing as pure and simple identity: "the synonymy-identity myth has only given rise to sterile arguments." Therefore, paraphrasing must be situated in the field of approximation, opening the path to different degrees of semantic similarity or degrees of paraphrasability. Paraphrasing takes place in a continuum that goes from absolute identity to the absence of semantic similarity. In this scenario, a question arises: where to draw the boundaries between paraphrases and non-paraphrases.
We consider that fixed and precise paraphrase boundaries do not exist; instead, they depend on the task and objectives: sometimes a wide understanding of paraphrasing will be required; on other occasions, a more restrictive view will be necessary. Fuchs (1994) points out that a linguistic unit is a paraphrase of another one if the latter can be considered within the bounds of acceptable deformability or "distortion threshold" with respect to the former. This threshold is variable as "it depends on different parameters constituting the discursive activity: tolerance to deformation is greater or lesser depending on the subjects and situations." In this section, we set out three cases of borderline paraphrases that are derived from our analysis of the state of the art of paraphrasing and related areas, and our experience in paraphrase-type annotation: loss of content, pragmatic knowledge, and changes in some grammatical features. These borderline paraphrases are placed in the continuum between paraphrases and non-paraphrases, in which authors can position their own paraphrase border according to their objectives. Moreover, for each of these cases, we mention the approach we adopted, which is reflected in our typology (Section 3.2). The section is closed with a comparison between paraphrasing and two related phenomena, namely coreference and textual entailment, which often lead to confusion in NLP.
Content Loss. Many paraphrase boundary cases are due to some kind of content loss. Content loss may be due to deletion [my favorite in (1)] or generalization [from pilot to commander in (2)].
(1) a. Yesterday I went to the beach
    b. Yesterday I went to my favorite beach
(2) a. The pilot was having breakfast
    b. The commander was having breakfast
Depending on the quantity and relevance of the missing content, different degrees of paraphrasability are possible. In this sense, the level of paraphrasability of the sentences in (3) is lower than that of those in (1).
(3) a. Yesterday I went to the beach
    b. Yesterday I went to the beach which used to be my favorite when I was a child
Moreover, the missing content can sometimes be recovered by means of implicit lexical knowledge in the context. The Generative Lexicon (Pustejovsky, 1995), though not addressing paraphrasing directly, offers useful insights in this regard. Setting out from the idea that the meaning of words reflects the deeper conceptual structures in the cognitive system, the qualia structure specifies four aspects of word meanings: formal (distinction within a larger domain), constitutive (relation between an object and its constituent parts), telic (purpose and function), and agentive (factors involved in its origin). In (4), the information contained in the telic quale of book allows for the recoverability of the deleted content (reading). In contrast, in (1), we have no way to recover the missing content. Therefore, the level of paraphrasability is higher in (4). Moreover, the pair in (5) shows a higher degree of paraphrasability than the pair in (2), as the context of taking off in the former clarifies that this commander is, actually, a pilot. In (2), we only rely on the hypernym relationship between pilot and commander.
(4) a. John began reading a book
    b. John began a book
(5) a. The pilot was ready to take off
    b. The commander was ready to take off
Depending on the task and objectives, the above examples may or may not need to be considered paraphrases. Many paraphrase types in our typology involve different degrees of semantic loss¹². The ADDITION/DELETION type (types in our typology appear in Tables 1 and 2) is a clear example of this. Although the missing content cannot always be recovered in our types, this is sometimes possible: in "light/generic element addition/deletion" within the SYNTHETIC/ANALYTIC SUBSTITUTION type (Table 3), the content of the deleted element is embedded in the one [...]
Table 1. Paraphrase typology (1). Classes appear in the first column, subclasses in the second, and types in the third. Most of the examples come from the P4P corpus and also appear in Barrón-Cedeño et al. (2013). Spelling, punctuation, format, and paraphrase extremes are extracted from the MSRP-A corpus.
Pragmatic Knowledge. Martin (1976) contrasts "linguistic" to "pragmatic paraphrases", the latter standing for pairs that, in a given situation, refer to the same intention (6) or refer to the same facts and events (7)¹³. Milićević (2007a), in turn, contrasts "language" to "cognitive paraphrases", the latter comprising paraphrases exploiting pragmatic data, such as (6), (8), and (9), and paraphrases exploiting encyclopedic knowledge, such as (7)¹⁴. Fujita (2005) talks about "pragmatic paraphrases" (6) and "referential paraphrases" (9). Dorr et al. (2004) mention "viewpoint variation paraphrases" (10), also cited by Hirst (2003). Finally, Fuchs (1994) considers cases like the one in (7) to be outside the boundaries of paraphrasing. The way to present and conceptualize all these examples varies according to the author, but all of them put forward the idea that paraphrasing may rely on something beyond pure semantic similarity. We distinguish between two main types of knowledge that can give rise to pragmatic paraphrases, namely encyclopedic knowledge [(7) and (10)] and situational knowledge (the remaining examples). These two types of knowledge are usually called common-sense knowledge in NLP. As Milićević (2007a) points out, we can also draw a continuum here: "between those clear and unambiguous cases, there is a gray area populated by paraphrases that can be called quasilinguistic." If we stick to the paraphrase definition of sameness of meaning, these examples are outside paraphrase limits. However, under certain circumstances, it may be necessary to consider these cases as a special type of paraphrase linked to the situational context. Because our typology relies on semantic content, those cases fall outside our proposal.
Grammatical Features. With the generic concept of "grammatical features", we refer to changes in person, number, and time. They generally lead to deep changes in meaning, though, on occasions, they may give rise to paraphrases.
The example in (11) is clearly nearer to paraphrasing than (12), as, in (11), the first person plural includes the first person singular. In (13), the change in number is not relevant: street does not refer to a concrete one, but to the general sense of "outdoors"; in (14), the change in number gains relevance as we move from the idea of "liking a concrete cake" to "liking cakes in general". In (15), both tenses overlap to a high degree, which is not the case in (16), where they stand for different moments in time. Only examples (11), (13), and (15) are considered to be paraphrases in our approach. They are included in the INFLECTIONAL CHANGE type in our typology. Contrary to content loss and pragmatic knowledge, which are language independent, this group includes phenomena that are closely related to how languages encode morpho-semantic content. In English, this is reflected in the inflection.
Table 3. Prototypes for SYNTHETIC/ANALYTIC SUBSTITUTIONS. These examples also appear in the annotation guidelines (see footnote 2) and, like all the examples there, are extracted/adapted from state-of-the-art paraphrase typologies (see the annex of the guidelines) and the annotated corpora, or are our own.
Paraphrase, Coreference, and Textual Entailment. Paraphrasing overlaps with both coreference and textual entailment, leading to recurrent confusion. In what follows, the main differences and similarities between these two phenomena and paraphrasing are presented.
Paraphrasing and coreference overlap considerably, but they differ in essence: paraphrasing is concerned with meaning, whereas coreference is about discourse referents (Recasens & Vila, 2010). In example (17), a paraphrase relationship exists between shop assistant and sales person; but the former acts as a nominal predicate, which is not referential and cannot be part of coreference relationships. In contrast, in (18), we can establish a coreference relationship between the noun phrases in italics, but they do not hold the same meaning and, therefore, are not paraphrases. Finally, in (19), paraphrase and coreference overlap in the coast/the seashore.
(17) She is a shop assistant in that store, but the sales person that assisted me was not her.
(18) - Are you a family member of the patient in room 235?
     - Yes, my cousin is in that room.
(19) Yesterday I was walking along the coast. The seashore is what I really love in this area.
Paraphrases can also be seen as bidirectional entailment relations: "text A is a paraphrase of text B if and only if A entails B and B entails A" (Rus et al., 2009). Limiting paraphrasing to bidirectional entailment reduces it to very few cases; therefore, some unidirectional-entailment cases are generally considered to be paraphrases. Dorr et al. (2004), for example, present "inference" as a paraphrase type. Kotlerman et al. (2010), in turn, introduce the concept of "directional similarity". Once again, we situate paraphrasing in a continuum with strict bidirectional entailment at one extreme and strict unidirectional entailment at the other. Where to put the boundaries between paraphrases and non-paraphrases depends again on the task and objectives.
The relationship between textual entailment and paraphrasing is intimately linked to the question of content loss mentioned above, as all paraphrases exhibiting content loss are cases of unidirectional entailment. In our typology, this is illustrated by ADDITION/DELETION. Moreover, our typology includes types categorized as "paraphrase extremes" including IDENTICAL and NON-PARAPHRASE, which are clear paraphrase limits, and ENTAILMENT, that is, those cases of non-paraphrase that are closer to the paraphrase domain. In the annotation task, it is worthwhile isolating these cases of entailment for researchers interested in broadening the scope of their work (Vila et al., Submitted).
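The link between content loss and the direction of entailment can be made concrete with a toy model. The sketch below rests on a strong simplifying assumption that is ours and not part of the proposal: each text is represented by a hand-written set of propositions, and a text entails another if it asserts all of the other's propositions. The proposition strings and function names are hypothetical.

```python
def entails(premise, hypothesis):
    """Toy entailment: the premise asserts every proposition of the hypothesis."""
    return hypothesis <= premise


def entailment_relation(a, b):
    """Place a pair of proposition sets on the entailment continuum."""
    forward, backward = entails(a, b), entails(b, a)
    if forward and backward:
        return "bidirectional entailment"
    if forward or backward:
        return "unidirectional entailment"
    return "no entailment"


# Hand-written propositions for "Yesterday I went to the beach" and
# "Yesterday I went to my favorite beach".
beach = {"go(I, beach)", "time(yesterday)"}
favorite_beach = beach | {"favorite(beach, I)"}

print(entailment_relation(beach, favorite_beach))  # unidirectional entailment
print(entailment_relation(beach, set(beach)))      # bidirectional entailment
```

Under this model, deleting content always yields unidirectional entailment; whether such a pair counts as a paraphrase remains a task-dependent decision.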

Paraphrase Typology
In this section, we focus on the characterization of paraphrasing through the description of its possible linguistic manifestations or types. Our typology is not a proposal started from scratch, but has been built on the basis of state-of-the-art typologies, which have provided ours with insights on structure and types. Actually, our typology aims to cover all the phenomena described in these typologies¹⁵.
A set of characteristics makes our typology a step forward with respect to the state of the art. First, it consists of a comprehensive typology of paraphrasing that focuses on general paraphrase phenomena, leaving fine-grained linguistic mechanisms in the background. Second, it goes beyond a simple list of types: it has a hierarchical structure, which is linguistically based and uniform throughout, and it is accompanied by a linguistic reflection describing and justifying its nature. Finally, as previously mentioned, it has been empirically validated on paraphrase corpora.
The typology is displayed in Tables 1 and 2. It consists of a three-level typology of 24 paraphrase types (third column) grouped in 5 classes (first column), two of them having two sub-classes each (second column)¹⁶. In what follows, an overview of our typology is set out. Specifically, we describe its scope, the type of units it classifies, its structure, and its types.
Scope of the typology. It is a general typology of paraphrasing in the sense that it comprehends the paraphrase phenomenon as a whole and covers all its possible manifestations, from elementary modifications like the INFLECTIONAL CHANGE type in Table 1 to deep reorganizations like SEMANTICS-BASED CHANGES in Table 2. Also, it covers paraphrases from the word to the discourse level. It should be noted that, since our typology relies on semantic content, pragmatic paraphrases fall outside our proposal (Section 3.1).
Unit of classification. The units classified according to our typology are what we call atomic paraphrase phenomena (paraphrase phenomena onwards), that is, autonomous paraphrase reorganizations consisting of a set of dependent linguistic mechanisms. The DERIVATIONAL CHANGE in Table 1, for example, comprises a change from a verb to an adjective form, as well as an involved structural modification. Among the dependent linguistic mechanisms, one of them is the trigger. In the previous example, it is the change of category or derivational change. As can be seen, paraphrase-type names stand for the linguistic mechanism triggering the paraphrase phenomenon.
Paraphrase phenomena can occur in isolation or in combination, the latter giving rise to complex paraphrase pairs. In the pair containing a DERIVATIONAL CHANGE mentioned above, other paraphrase phenomena can be observed, such as a SAME-POLARITY SUBSTITUTION (or synonymy substitution) between "things" and "accounts".
Typology structure: classes, sub-classes, and types. Types are grouped in classes according to the nature of the trigger linguistic mechanism: (i) the morpholexicon-based change class comprises those types in which the paraphrase phenomenon is triggered at the morpholexicon level; (ii) the structure-based change class comprises those types that are the result of a different structural organization; and (iii) the semantics-based change class contains those types arising at the semantic level. DERIVATIONAL CHANGES are an example of (i): the trigger is the change of category, which implies structural reorganizations. Regarding (ii), a DIATHESIS ALTERNATION like the one in Table 1 involves a change of voice of the verb, among other changes, but the trigger is syntactic. Finally, paraphrases in the semantics class (iii) are based on a different distribution of semantic content across the lexical units, involving multiple and varied formal changes (Table 2).
There are two more classes in our typology: miscellaneous changes and paraphrase extremes (Table 2). The former comprises types not directly related to one single language level. The latter comprises those phenomena that are at the limits or outside the limits of paraphrasing (Section 3.1). Finally, the sub-classes follow the classical organization in formal linguistic levels from morphology to discourse and simply establish an intermediate grouping between some classes and their types.
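The three-level organization described above (classes, optional sub-classes, types) can be sketched as a small data structure. This is only an illustrative sketch, not part of the annotation tooling: the names `ParaphraseType`, `PARTIAL_TYPOLOGY`, `build_hierarchy`, and `classes_with_subclasses` are hypothetical, and the list deliberately contains just one or two types per class rather than the full inventory of 24.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParaphraseType:
    """One type of the typology, located by class and optional sub-class."""
    name: str
    paraphrase_class: str
    subclass: Optional[str] = None  # only two classes have sub-classes


# Partial, illustrative selection of types; NOT the complete typology.
PARTIAL_TYPOLOGY = [
    ParaphraseType("CONVERSE SUBSTITUTION", "morpholexicon-based changes", "lexicon-based"),
    ParaphraseType("DIATHESIS ALTERNATION", "structure-based changes", "syntax-based"),
    ParaphraseType("PUNCTUATION CHANGE", "structure-based changes", "discourse-based"),
    ParaphraseType("SEMANTICS-BASED CHANGES", "semantics-based changes"),
    ParaphraseType("CHANGE OF ORDER", "miscellaneous changes"),
    ParaphraseType("ENTAILMENT", "paraphrase extremes"),
]


def build_hierarchy(types):
    """Group types as {class: {sub-class or None: [type names]}}."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for t in types:
        hierarchy[t.paraphrase_class][t.subclass].append(t.name)
    return {cls: dict(subs) for cls, subs in hierarchy.items()}


def classes_with_subclasses(hierarchy):
    """Return the classes that use the intermediate sub-class level."""
    return sorted(
        cls for cls, subs in hierarchy.items()
        if any(sub is not None for sub in subs)
    )


if __name__ == "__main__":
    hierarchy = build_hierarchy(PARTIAL_TYPOLOGY)
    for cls, subs in hierarchy.items():
        print(cls, subs)
    print(classes_with_subclasses(hierarchy))
```

Storing the sub-class as an optional field mirrors the fact that sub-classes only establish an intermediate grouping for some classes; classes without sub-classes simply map `None` to their types.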
Two main kinds of paraphrase structural reorganizations can be inferred from the previous explanation: those that are triggered by a lexical substitution (morpholexicon-based changes), and those that are not (structure-based changes). The idea of a lexical trigger has its basis in the lexical projection rules put forward by Chomsky (1986) and their further reformulations.
This organization in classes and the idea of trigger determined the methodology applied to annotate the scope in Vila et al. (Submitted).
The types 17. Types in our typology correspond to general and contrastive categories: they stand for coarse-grained categories of paraphrase phenomena that are substantially different from each other, e.g., SAME-POLARITY SUBSTITUTION vs. PUNCTUATION CHANGE. Even types that are closer in nature clearly contrast. For example, the linguistic mechanisms involved in OPPOSITE POLARITY and CONVERSE SUBSTITUTIONS are similar (both can involve a change in the order of the arguments); however, the linguistic mechanism triggering the paraphrase phenomenon (the opposite-polarity or converse lexical unit) makes them different.
An important consideration regarding the nomenclature used for the types should be pointed out. Some paraphrase-type names refer to paraphrase relationships by default, e.g., all DERIVATIONAL CHANGES give rise to paraphrase relationships, as changes of category do not affect the core meaning of the sentence. Other paraphrase-type names refer to linguistic mechanisms that do not necessarily give rise to paraphrases, e.g., INFLECTIONAL CHANGES may change the core meaning of the sentences. Therefore, cases like the INFLECTIONAL CHANGE type have to be understood as meaning-preserving changes in inflection, and not as changes in inflection as a whole (Section 3.1).
Each type is realized by a set of more fine-grained prototypes, that is, the patterns that characterize the linguistic mechanisms underlying the paraphrase. Defining a complete list of prototypes for each type is not the objective of this work. Nevertheless, while not aiming to be exhaustive, we illustrate prototypes taking SYNTHETIC/ANALYTIC SUBSTITUTIONS as an example 18. In this case, we identified the five prototypes shown in Table 3: (i) compounding/decomposition, (ii) alternations affecting genitives and possessives, (iii) synthetic/analytic-superlative alternation, (iv) light/generic element addition/deletion, and (v) specifier addition/deletion. Martin (1976) analyses in detail what he calls "double-negation" and "double inversion paraphrasing", which correspond roughly to our OPPOSITE POLARITY and CONVERSE SUBSTITUTIONS. The equivalence rules he defines for French can be seen as a list of prototypes for these types. The typology involving support-verb constructions in Barreiro (2008: pp. 73-81) and, on a smaller scale, the noun-compound and genitive paraphrases in Peñas & Ovchinnikova (2012: pp. 399-400) can also be seen as potential lists of prototypes for the SYNTHETIC/ANALYTIC SUBSTITUTION type.
Types and prototypes differ in that types are stable and prototypes are an open class. Types represent general paraphrase phenomena covering paraphrasing as a whole. Their comprehensiveness has been tested through corpus annotation in two languages (English and Spanish). Prototypes, in contrast, are concrete linguistic mechanisms or patterns of realization for which a complete list is not necessarily provided in this work. They are more language dependent than types.
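The contrast between a stable set of types and an open class of prototypes can be made concrete with a minimal sketch. It assumes a closed enumeration for types and an extensible registry for prototypes; the names `ParaphraseTypeName`, `PROTOTYPES`, and `register_prototype` are hypothetical, and only three of the types and the five prototypes listed above for SYNTHETIC/ANALYTIC SUBSTITUTIONS are included.

```python
from enum import Enum


class ParaphraseTypeName(Enum):
    """Closed inventory: types are stable (only three are shown here)."""
    SYNTHETIC_ANALYTIC_SUBSTITUTION = "synthetic/analytic substitution"
    OPPOSITE_POLARITY_SUBSTITUTION = "opposite-polarity substitution"
    CONVERSE_SUBSTITUTION = "converse substitution"


# Open registry: prototypes can be added as new patterns are described.
PROTOTYPES = {
    ParaphraseTypeName.SYNTHETIC_ANALYTIC_SUBSTITUTION: [
        "compounding/decomposition",
        "alternations affecting genitives and possessives",
        "synthetic/analytic-superlative alternation",
        "light/generic element addition/deletion",
        "specifier addition/deletion",
    ],
}


def register_prototype(type_name, prototype):
    """Attach a prototype to an existing type; new types cannot be created."""
    if not isinstance(type_name, ParaphraseTypeName):
        raise TypeError("prototypes must attach to an existing type")
    known = PROTOTYPES.setdefault(type_name, [])
    if prototype not in known:  # avoid duplicate entries
        known.append(prototype)
    return known


if __name__ == "__main__":
    # The prototype label below is a placeholder, not one from the article.
    register_prototype(ParaphraseTypeName.CONVERSE_SUBSTITUTION,
                       "placeholder prototype")
    for type_name, prototypes in PROTOTYPES.items():
        print(type_name.value, prototypes)
```

Keeping prototypes in a per-type registry also leaves room for language-specific lists, in line with the observation that prototypes are more language dependent than types.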

Conclusions and Future Work
This article has offered an overview of what has been said about paraphrasing in linguistics, how computational linguistics has used this knowledge as a base for its systems, and the new insights on paraphrase characterization derived from computational linguistics methods. This analysis has shown that, given the vague and multifaceted nature of paraphrasing, a precise and commonly accepted definition of the phenomenon does not exist. This has complicated paraphrase tasks in NLP on many occasions: "the difficulty when working with paraphrases lies on its own definition" (Herrera, Peñas, & Verdejo, 2007).
The aim of this article is to move forward in paraphrase characterization in order to provide NLP with more rigorous paraphrase knowledge. We addressed this problem from two directions. First, based on the idea that paraphrase boundaries are not fixed and depend on the task and objectives, we have presented three areas where boundary paraphrases are placed. Second, paraphrase characterization has been addressed through the construction of a new paraphrase typology. Types in our typology are comprehensive, general, and stable. The prototypes they contain, in contrast, constitute an open and flexible group where new linguistic mechanisms can be described. This typology has been empirically validated through the annotation of more than 5700 paraphrase pairs from three corpora that differ in nature and cover two languages (Vila et al., Submitted). Moreover, our typology proposal has already been tested in the automatic plagiarism detection field with promising results (Barrón-Cedeño et al., 2013).
Finally, this article opens a number of lines for future research, such as (i) further analyzing paraphrase boundaries with the aim of defining unseen borderline areas, (ii) studying in depth the idea of prototype and prototype definition, and (iii) determining whether the most coarse-grained types in our typology (SYNTAX & DISCOURSE STRUCTURE and SEMANTICS-BASED CHANGES) accept a more fine-grained classification.
      was with difficulty that the course of streets could be followed
      (b) You couldn't even follow the path of the street
    Modal-verb changes
      (a) I [...] was still lost in conjectures who they might be
      (b) I was pondering who they could be
    Derivational changes
      (a) I have heard many accounts of him [...] all differing from each other
      (b) I have heard many different things about him
  Lexicon-based
    Spelling changes
      (a) The foodservice pie business doesn't fit the company's long-term growth strategy
      (b) The foodservice pie business does not fit our long-
      Leicester [...] failed in both enterprises
      (b) He did not succeed in either case
    Converse substitutions
      (a) The Geological Society of London in 1855 awarded to him the Wollaston medal
      (b) Resulted in him receiving the Wollaston medal from the Geological Society in London in 1855

Structure-based changes
  Syntax-based
    Diathesis alternations
      (a) The guide drew our attention to a gloomy little dungeon
      (b) Ou[r] attention was drawn by our guide to a little dungeon
    Negation switching
      (a) In order to move us, it needs no reference to any recognized original
      (b) One does not need to recognize a tangible object to be moved by its artistic representation
    Ellipsis
      (a) In the scenes with Iago he equaled Salvini, yet did not in any one point surpass him
      (b) He equaled Salvini, in the scenes with Iago, but he did not in any point surpass him or imitate him
    Coordination changes
      (a) It is estimated that he spent nearly $10,000 on these works. In addition he published a large number of separate papers
      (b) Altogether these works cost him almost $10,000 and he wrote a lot of small papers as well
      The Russian law, which limits the percentage of Jewish pupils in any school, barred his admission
      (b) The Russian law had limits for Jewish students so they barred his admission
  Discourse-based
    Punctuation changes
      (a) Swartz repaid it in full, with interest, according to his lawyer, Charles Stillman
      (b) Swartz fully repaid it with interest, according to his lawyer, Charles Stillman
    Direct/indirect-style alternations
      (a) "She is mine," said the Great Spirit
      (b) The Great Spirit said that she is her [s]
    Sentence-modality changes
      (a) The real question is, will it pay? Will it please Theophilus P. Polk or vex Harriman Q. Kunz?
      (b) He do it just for earning money or to please Theophilus P. Polk or vex Hariman Q. Kunz
    Syntax/discourse-structure changes
      (a) How he would stare!
      (b) He would surely stare!

Semantics-based changes
      (a) The scenery was altogether more tropical
      (b) which added to the tropical appearance

Miscellaneous changes
    Change of format
      (a) Fell 1.5%
      (b) Fell 1.5 percent
    Change of order
      (a) First we came to the tall palm trees
      (b) We got to some rather biggish palm trees first
    Addition/deletion
      (a) One day she took a hot flat-iron, removed my clothes, and held it on my naked back until I howled with pain
      (b) As a proof of bed treatment, she took a hot flat-iron and put it on my back after removing my clothes

Paraphrase extremes
    Identical
      (a) But he added group performance would improve in the second half of the year and beyond
      (b) De Sole said in the results statement that group performance would improve in the second half of the year and beyond
    Entailment
      (a) [...] It was acquiring the "intellectual property and technology assets" of GeCAD
      (b) [...] It intends to acquire the intellectual property and technology assets of Romanian antivirus firm GeCAD Software Srl
    Non-paraphrase
      (a) The report was found Oct. 23, tucked inside an old three-ring binder not related to the investigation
      (b) The report was found last week tucked inside a training manual that belonged to Hicks

that remains, as the latter is a hyponym of the former. As shown in Vila et al. (Submitted), ADDITION/DELETION is one of the most frequent types in the annotated corpora, demonstrating its accessibility when paraphrasing.

Pragmatic Knowledge. Examples like the ones in (6) to (10) are treated by several authors, both in linguistics and computational linguistics, as special types of paraphrases that go beyond pure semantic similarity to fall within the field of pragmatics.

(6) a. Close the door please
    b. There is air flow
(7) a. Penelope was waiting for Ulysses return
    b. The Ithaca queen was waiting for Ulysses return
(8) a. Here, life is good
    b. In Paris, life is good
(9) a. They got married last year
    b. They got married in 2004
(10) a. The US-led invasion of Iraq
     b. The US-led liberation of Iraq

(11) a. We love flowers
     b. I love flowers
(12) a. She is my collaborator
     b. He is my collaborator
(13) a. I got lost in the street
     b. I got lost in the streets
(14) a. I like the cake
     b. I like cakes
(15) a. The plane takes off at 6:30
     b. The plane is taking off at 6:30
(16) a. She lives in Barcelona
     b. She had lived in Barcelona