A Corpus-Based Study on Phrasal Complexity in Computer Science Abstracts of Novice and Advanced Writers

The correlation between phrasal complexity and L2 writers’ writing proficiency has been confirmed by numerous studies, and empirical work on measuring L2 writing has attracted much attention in English for academic purposes (EAP). Drawing on the hypothesized developmental indices of noun phrase modification, this study uses a corpus-driven approach to examine how phrasal complexity is linguistically realized in abstracts written by advanced international academic writers and by Chinese graduate L2 writers of computer science. The results show that while both groups reflect the developmental stages of academic writing, many complexity measures of noun phrase modifiers significantly distinguish L2 learners from expert writers, especially among post-modifiers. The findings thus support previous studies on developmental stages in writing and, through the analysis of pre-modifiers and post-modifiers, offer a fuller understanding of EAP writing by graduate L2 writers and advanced writers.


Introduction
Conventional wisdom holds that an abstract conveys the crucially important content of an already existing text (Kilborn, 1998). Abstracts serve as an indispensable screening and indexing tool (Huckin, 2001), and free access to the complete text of research article abstracts blurs or breaks the boundaries of interactive exchange between academic communities (Koltay, 2010). Composing an abstract involves not only the writing strategies and linguistic devices that push writers to produce a remarkably compact text in relatively few words, but also the brief, concise, and objective preview that gives readers convenient access to information. For instance, Ruan (2018) has pointed out that constructing such highly condensed texts requires authors to wield a sophisticated repertoire of linguistic and non-linguistic knowledge.
Additionally, evidence has shown a notable historical surge in the use of "nouns and phrasal noun-modifying structures" in academic writing (Biber & Gray, 2016). In particular, research article abstracts, as a unique genre in academic prose, contain a large number of lengthy noun phrases and non-clausal phrases (e.g., Biber & Gray, 2010; Biber et al., 2011; Ansarifar et al., 2018; Ruan, 2018; Song & Wang, 2019), which produce a highly compact writing style. Given these circumstances, it is natural to ask which features are actually used; an exploration of noun phrases in research article abstracts will thus contribute to understanding the developmental stages of complexity features in academic writing.
Although it is commonly thought that an attractive abstract is crucial to a wide range of academic writers, several issues remain open to question. First, studies have shown that second language (L2) academic writers often lack competence in structural clarity and expressive accuracy, especially in abstracts in the hard sciences (Fu et al., 2021), which affects journal visibility and citation frequency (Wang, 2020) as well as authors' academic identity construction. Second, owing to a lack of detailed guidance and targeted feedback, these novice writers, and more specifically non-native bilingual or multilingual writers, are often incapable of constructing their abstracts with "maximum efficiency, clarity, and economy" (Swales & Feak, 2009). Consequently, instruction in abstract writing should treat it as an essential and integral teaching objective to identify novice L2 writers' developmental stages effectively and to provide targeted feedback and guidance according to the differences in phrasal complexity.
Moreover, although many scholars indicate that abstracts are sensitive enough to distinguish research writing across various disciplines (e.g., Hyland & Tse, 2004; Pho, 2008; Omidian et al., 2018), what remains largely unknown is which rhetorical strategies are selected, which fundamental distinctions emerge, and how developmental stages are followed, especially within a single discipline. That is to say, investigation at the intermediate level of academic sub-registers is still needed to recognize the distinctive patterns of the linguistic resources employed (Biber & Gray, 2016: p. 250). However, few studies so far have probed the grammatical devices that realize structural compression in scientific research writing. If these questions remain ambiguous and cannot be answered with confidence, then targeted instruction in constructing abstracts might become no more than an attractive proposal.
This study aims to inform language-focused EAP (English for Academic Purposes) instruction and teaching reform in advanced academic writing, in the service of effective academic communication, by comparing novice Chinese writers with more advanced international writers. Rather than simply presenting speculative interpretations of overall writing development, this study uses the hypothesis of Biber et al. (2011) to investigate empirically specific stages of writing development in computer science, which is traditionally regarded as a hard science. The specific group of L2 writers studied here is Chinese graduates who have published their dissertations in CNKI (China National Knowledge Infrastructure) but lack extensive publication experience in international academic journals. We compare the phrasal complexity of research articles published in CNKI by novice writers with that of articles published in IEEE INFOCOM (IEEE International Conference on Computer Communications), a leading conference in the same field, by advanced writers with many publications in preeminent journals. It is assumed that abstracts written by top-tier researchers serve as models of influential condensed texts, and that the features of advanced writers' writing therefore have practical implications for novice and less experienced L2 academic writers.
In light of these dynamics, this study aims to answer the following questions:
1) Do novice Chinese writers and advanced international writers follow the developmental stages for noun phrase modification features hypothesized by Biber et al. (2011)?
2) What are the similarities and differences in the linguistic features of noun phrases between novice L2 writers and advanced international writers in computer science?
3) What are the reasons behind the differences, and what are the implications for targeted language-focused instruction for Chinese graduate computer science writers?
This paper first presents an overview of syntactic complexity in writing quality and development, followed by more recent studies focusing on phrasal complexity indices. It then summarizes the developmental stages for phrasal complexity features, drawing on the developmental stages for complexity features (Biber & Gray, 2010). Finally, the paper uses phrasal complexity indices to predict the writing quality and developmental stages of Chinese L2 graduates, with the aim of understanding the fundamental distinctions between novice and advanced writers.

Measuring Linguistic Complexity in Second Language Writing
As a sub-dimension of L2 English writing (Wolfe-Quintero et al., 1998; Ortega, 2003), linguistic complexity studies (also called syntactic complexity studies) have long sought to identify which language features found within a text can effectively assess both the writing development and the language proficiency of L1 and L2 writers (Pallotti, 2015; Lu, 2017; Jiang, 2020). Early work on complexity measurement relied on a great number of "global measures" (Ortega, 2003), including the amount of embedding, the sophistication of structure, the range of structural types, and the length of the production unit.
More recently, however, the conventional or "large-grained" length-based and clause-based metrics (Kyle & Crossley, 2018) have been greeted with some skepticism (Biber et al., 2011; Kyle & Crossley, 2018; Biber et al., 2020; Crossley, 2020). One of the two major concerns is that excessive attention to the syntactic level is deemed unnecessary to a greater or lesser extent (Kyle & Crossley, 2018; Biber et al., 2020). On the one hand, findings on historical change in spoken and written discourse demonstrate that complex noun phrases, rather than clausal subordination, set written academic prose (especially academic research writing) apart from other discourse (Biber & Gray, 2010; Biber et al., 2011). On the other hand, a large number of studies take syntactic units (e.g., mean length of T-unit) as their primary complexity indices, and no in-depth attempt has been made to operationalize linguistic complexity at the phrase level. The underlying problem is that the length of a T-unit may increase either through the extensive use of phrasal dependents (such as attributive adjectives) or through the extensive use of dependent clauses (such as finite complement clauses), and large-grained indices fail to capture which structural devices are deployed within a T-unit (Kyle & Crossley, 2018). The other concern is that large-grained measures do not afford accurate and adequate explanations for the "additional layer of meaning" (Paquot, 2019) manifested in specific lexical patterns (Kyle & Crossley, 2018; Biber et al., 2020), as they are weak at distinguishing phrasal from clausal complexity (Biber et al., 2020).
Consequently, Biber et al. (2011) and Biber and Gray (2016) have proposed that L2 writing assessment should pay more attention to complexity indices at the phrasal level in academic writing, and have put forward a hypothesis of five developmental stages of writing, running from a preference for clauses to a dense use of noun phrases.
Therefore, the author's contention here is that large-grained indices of syntactic complexity are somewhat controversial with respect to their interpretive power in accounting for phrasal complexity features, and that phrasal complexity in a specific disciplinary field (computer science) requires more fine-grained measures than it has hitherto received.

Theoretical Framework
It is widely recognized that syntactic complexity is associated with greater writing development. Traditional indices of complexity focus on clause-level structure, ranging from T-unit counts to sentence length (e.g., Ortega, 2003; Lu, 2011). Supported by a comparative study of spoken and written discourse, Biber, Gray and Poonpon (2011) challenged this stereotype of complexity in English academic prose by redefining it as more elaboration (clausal complexity), acquired in early developmental stages, and more compression (phrasal complexity), acquired in later stages. Specifically, Biber et al. (2011) divide writing development into five stages. The stages progress from "finite dependent clauses functioning as constituents in other clauses", through intermediate stages of "nonfinite dependent clauses and phrases functioning as constituents in other clauses", to the last stage, which requires "condensed use of phrasal (non-clausal) dependent structures that function as constituents in noun phrases" (Biber et al., 2011).
More recent studies push the measurement of writing development further from clausal complexity toward phrasal complexity, on the presumption that high-quality academic writing is closely associated with a higher degree of phrasal complexity (Biber et al., 2011; Lu, 2011; Taguchi et al., 2013; Parkinson & Musgrave, 2014) and, at the same time, with a comparative absence of clause-level or sentence-level devices. Biber (1988: p. 104) demonstrated that a high frequency of nouns, the fundamental carriers of referential meaning, is linked to higher information density. Given the drift towards a structurally compressed discourse style in academic written registers (Biber & Gray, 2016), abstracts present an extremely dense use of phrasal structures within a limited word count, with information conveyed through phrasal devices instead of clausal elaboration (Biber & Gray, 2010; Ansarifar et al., 2018; Yin et al., 2021).
Furthermore, multiple studies have confirmed Biber's claims that phrase-level structures are more convincing measures of complexity in English academic writing (e.g., Lu, 2011; Kyle & Crossley, 2018; Ruan, 2018). Corresponding studies have explored L2 writing development by comparing writing in different genres, such as argumentative writing, course essays, and critiques (e.g., Lu, 2011; Taguchi et al., 2013; Atak & Saricaoglu, 2021), or by comparing L2 writers at different levels, such as secondary, undergraduate, and graduate (e.g., Casal & Lee, 2019; Bi & Jiang, 2020; Yin et al., 2021). The results of this research confirm that phrase-level indices perform better than clause-level indices in capturing L2 writing development, and thus provide a strong starting point for considering phrase-level complexity indices in general.
Therefore, the current study builds on the developmental stages for complexity features hypothesized by Biber et al. (2011), which jointly use embedded pre- and post-modifiers to measure L2 writing quality and development. Table 1 presents the developmental stages for noun phrase features, which have been adapted to the object of this study to bring them more in line with the present research purpose. For instance, in the original framework, simple prepositional phrases with concrete and abstract meaning were placed in the third and fourth developmental stages respectively. However, given the disciplinary characteristics of computer science, there is no significant difference between advanced and novice writers in this respect, as both frequently use simple prepositional phrases with abstract meaning (e.g., electronic communication signals, 5G networks, Internet of Things). Additionally, there are no clear criteria for classifying a sense as abstract or concrete in these novel fields. Therefore, simple prepositional phrases with concrete and abstract meaning were not treated as separate categories in this study.
As shown in Table 1, no indices related to noun phrases appear in stage 1; the indices start directly at stage 2 with simple phrasal embedding in the noun phrase and prepositional phrases with relative clauses. Further developments include modifying nouns with -ed and -ing clauses and with prepositional phrases, and the final stage reflected in Table 1 includes phrasal structures and complement clauses as noun modifiers.

Corpus
To carry out a contrastive study of abstracts written by advanced academic writers and novice graduate writers of computer science, 400 relevant abstracts were collected from two sources covering the period between 2019 and 2020 (Table 2). One source is Chinese graduate writers' abstracts from CNKI (https://www.cnki.net/), and the other is advanced academic writers' abstracts from IEEE INFOCOM (https://infocom2020.ieee-infocom.org/). During data collection, each text was stored as a plain TXT file and checked manually to ensure that its primary topic was closely relevant to computer science. Additionally, the original data of the research were recorded in a separate Excel file containing information on authors, journals, word counts, web addresses, and publishing and collection dates, so that the original files and information can be retrieved.

Analytical Procedures
The two corpora were tagged using the CLAWS part-of-speech tagger (Garside & Smith, 1997), and the coding was corrected manually for noun phrase structures that could not be tagged automatically. For instance, each instance of an of-phrase was checked manually to determine whether it was a preposition with a nonfinite complement clause in stage 5 (e.g., a prototype of using low cost …) or a noun post-modification in stage 3 (e.g., localization of illegal transmitters). Additionally, AntConc (Anthony, 2019) was used to identify the frequency of noun phrase structures in the two corpora by retrieving the tags corresponding to the indices presented in Table 1, with the aim of exploring the similarities and differences in the use of noun phrases. Moreover, the independent-samples t-test was employed to test for significant differences between advanced international writers' and novice Chinese writers' abstracts.
Finally, this study sheds light on the reasons behind the current stages and offers some suggestions for EAP writing.
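The tag-retrieval step can be approximated with a short script. The sketch below is not the AntConc workflow used in this study; it assumes text in a horizontal `word_TAG` format with CLAWS C7-style tags (`JJ` for general adjectives, `NN1`/`NN2` for common nouns, `IO` for *of*, `II` for other prepositions). The sample string, the pattern names, and the function `count_patterns` are hypothetical and serve only as illustration.

```python
import re

# Hypothetical sample in horizontal word_TAG format (tags assumed to be CLAWS C7).
TAGGED = (
    "localization_NN1 of_IO illegal_JJ transmitters_NN2 "
    "in_II wireless_JJ networks_NN2"
)

# Illustrative patterns only; these are not the study's actual queries.
PATTERNS = {
    # attributive adjective directly followed by a common noun
    "attributive_adjective": r"\S+_JJ \S+_NN[12]\b",
    # common noun post-modified by an of-phrase
    "np_of_phrase": r"\S+_NN[12] of_IO\b",
    # common noun followed by a preposition other than "of"
    "np_other_preposition": r"\S+_NN[12] \S+_II\b",
}


def count_patterns(tagged_text, patterns):
    """Return the number of non-overlapping matches of each pattern."""
    return {name: len(re.findall(rx, tagged_text)) for name, rx in patterns.items()}


counts = count_patterns(TAGGED, PATTERNS)
print(counts)
```

Automatic counts of this kind would still need the manual checking described above, since a tag sequence alone cannot separate, for example, a stage 3 of-phrase from a stage 5 preposition with a nonfinite complement clause.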

Predicting and Assessing L2 Writing Development
In order to investigate whether abstracts written by novice L2 writers and advanced writers follow the developmental writing stages hypothesized by Biber et al. (2011), in-depth analyses of the five developmental stages were carried out. A large number of studies have confirmed the importance of complex nouns as a complexity measure that distinguishes learners from expert writers (e.g., Ruan, 2018; Larsson & Kaatari, 2020; Bychkovska, 2021); thus, the 14 indices closely related to complex nominals presented in Table 1 are used to predict the writing quality and development of novice and advanced writers. Figure 1 below displays the proportion of noun phrase structures deployed in the four developmental stages. As can be seen, noun phrase features from stage 3 are the most frequent linguistic resources for constructing abstracts for both advanced and novice writers, at 45.64% and 45.07% respectively. Additionally, the proportions in the two corpora are also remarkably similar at the other stages (e.g., 27.85% and 29.92% at stage 4, 21.63% and 23.68% at stage 2), demonstrating that Chinese novice writers share the preferences of advanced writers in the writing strategies and language resources used to compose abstracts. These similarities may stem from writers' academic identity construction and their exploration of academic writing, in which novice L2 writers favor writing strategies that visibly conform to expected norms and authorities.
However, although noun phrase structures in both corpora are least frequent at stage 5, a more marked distinction appears there between advanced and novice writers, at 4.88% and 1.33% respectively. Moreover, compared with advanced international writers, novice Chinese writers show a relatively higher proportion of stage 2 features, at 23.68%. These findings are broadly in line with previous studies, namely that higher-quality and more advanced writing involves more syntactic complexity (Kyle & Crossley, 2018; Ansarifar et al., 2018; Crossley, 2020; Atak & Saricaoglu, 2021). Furthermore, the proportions of the various stages in Table 3 may indicate that IEEE writers, as advanced English learners or native English speakers, draw on greater linguistic resources characteristic of more sophisticated developmental stages, while Chinese graduates, as relatively high-level Chinese learners of English, still rely considerably on features from the early stages of L2 development and are weaker at combining various linguistic skills in an advanced way.
To conclude, without sufficient empirical proof for the claim that top-ranked research articles are salient enough to guide the academic writing of novice L2 writers, such an assumption is hard to sustain; therefore, a further investigation is carried out into the complex nominals of novice and advanced academic writing.

Novice vs. Advanced Academic Writing in Computer Science
Normalized frequency analysis is conducted to explore the various types of modifiers included in complex nominals. It examines whether there are observable differences in the frequencies of pre- and post-modification between the two corpora, followed by independent-samples t-tests of their statistical significance in IBM SPSS Statistics (v.25).

Figure 2 graphically illustrates the distribution of phrasal features. Considering that the total number of words in the advanced international writers' corpus (IWA) is more than twice that in the Chinese novice writers' corpus (CWA), raw frequencies are standardized per 100 words to ensure a controlled comparison between the two corpora. As shown in Figure 2, the most common type of noun post-modifier is the prepositional phrase, while attributive adjectives and nouns are the most common forms of noun pre-modifier. This provides supporting evidence for the widespread use of attributive adjectives and prepositional phrases in written academic registers (Biber et al., 2020), while being inconsistent with the claim that science research writing contains fewer attributive adjectives (Biber & Gray, 2016). Furthermore, the finding demonstrates that computer science, as a hard science discipline, is somewhat similar in its phrasal structures to applied linguistics (Ansarifar et al., 2018), traditionally regarded as a humanistic discipline.

Table 4 presents the raw frequencies, normalized frequencies, and percentages of each measure in the two corpora. It demonstrates that all the measures concerning noun post-modification differ between the two groups of writers, from the use of prepositions at an earlier stage to multiple prepositional phrases with levels of embedding at a more advanced stage. For instance, advanced writers show a preference for prepositional phrases other than of, while Chinese writers tend to use more NP-of phrases.
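The per-100-word standardization can be illustrated with a short sketch. The counts and corpus sizes below are hypothetical and are not taken from Table 4; the function name `per_hundred_words` is likewise an assumption introduced for illustration.

```python
def per_hundred_words(raw_count, corpus_size):
    """Normalize a raw frequency to occurrences per 100 words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 100


# Hypothetical raw counts and corpus sizes, chosen only to show the arithmetic.
iwa = per_hundred_words(raw_count=500, corpus_size=40_000)
cwa = per_hundred_words(raw_count=300, corpus_size=15_000)
print(f"IWA: {iwa:.2f} per 100 words, CWA: {cwa:.2f} per 100 words")
```

In this invented case the larger corpus has the higher raw count but the lower normalized rate, which is why raw frequencies from corpora of unequal size should not be compared directly.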
This finding is partially consistent with Ansarifar et al. (2018), who found that MA writers are prone to underuse prepositional noun modifiers compared with expert writers. Additionally, a further investigation of the NP-of phrases shows that Chinese writers treat NP-of phrases as a kind of possessive construction governed by the object of study of the thesis (e.g., of paper, of work, of system, of an algorithm). The advanced writers, on the other hand, more frequently use prepositional phrases to express cause-effect relationships, thus increasing the complexity of the internal logical structure of their abstracts by compressing additional information into post-modification.

Table 5 below compares the normalized occurrences of each type of modifier in the two sub-corpora. Mean occurrences of all phrasal complexity measures were compared using the independent-samples t-test for statistical significance (p < 0.05). Overall, all significant differences between the two corpora appear in noun post-modification. Of the seventeen measures, more considerable variation was observed in stages 3 and 5 than in other stages. In stage 2, no significant difference in attributive adjectives was found between Chinese graduate writers and advanced international writers; one possible reason is that attributive adjectives are linguistic devices acquired at an earlier stage of L2 acquisition. In stage 3, that-relative clauses and prepositional phrases other than of showed statistically significant differences, with p = 0.001 and p < 0.001, respectively. Furthermore, although scholars have hypothesized that participial adjectives are acquired later than other pre-modifiers (Parkinson & Musgrave, 2014), no significant differences between the advanced and novice writer corpora were found in this study.
A review of stage 5 also shows that more advanced writers tend to use more noun post-modification through more frequent embedding of to-clauses in their academic writing, with statistically significant differences found between the IWA and CWA groups.
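For readers who wish to reproduce this kind of comparison outside SPSS, the sketch below computes the t statistic of an independent-samples t-test in its pooled-variance (Student) form. The two samples are invented per-abstract normalized frequencies, not data from this study, and since the text does not state whether equal variances were assumed, the pooled form is an assumption. The function name `pooled_t_statistic` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Student's independent-samples t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical per-abstract normalized frequencies of one modifier type.
advanced = [2.0, 3.0, 4.0, 5.0, 6.0]
novice = [1.0, 2.0, 3.0, 4.0, 5.0]
t_value, df = pooled_t_statistic(advanced, novice)
print(f"t = {t_value:.3f}, df = {df}")
```

A p-value can then be obtained from the t distribution with the returned degrees of freedom, for example with `scipy.stats.ttest_ind`, which also offers Welch's unequal-variance variant through `equal_var=False`.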

Discussion
This study conducted a corpus-driven comparative study of abstracts written by advanced international writers and Chinese novice L2 writers of computer science and found that both groups of scientific research writers follow the developmental stages for noun phrase modification features hypothesized by Biber et al. (2011). These results support findings from previous studies that increasing complexity development and higher-quality academic writing show greater reliance on noun phrases, or phrasal complexity (e.g., Biber et al., 2011; Kyle & Crossley, 2018; Ansarifar et al., 2018; Biber et al., 2020; Crossley, 2020; Larsson & Kaatari, 2020; Atak & Saricaoglu, 2021; Bychkovska, 2021).
To conclude, this study has several implications for L2 phrasal complexity. First, for Chinese graduate writers, the disproportionate focus on noun pre-modification and the relative insufficiency of noun post-modification deserve more attention. On the one hand, these issues are closely related to academic writing norms, writer identity construction, and status in international academic discourse, all of which influence students' academic writing practice. On the other hand, students at the graduate level are expected to have acquired these academic writing skills for their academic studies, yet many of them fail to deploy linguistic strategies tactically to construct compact writing.
The underlying notion is that academic writing by Chinese writers should move beyond blindly or mechanically embedding multiple modifiers and should instead raise awareness of the shift to an integrated combination of noun phrase features from more advanced stages. Specifically, as advanced writers demonstrate movement toward more complex phrasal structures, Chinese graduate writers should push their notion of academic writing away from the overuse of simple noun phrase features at early stages (e.g., attributive adjectives, nouns, and NP-of phrases) and toward producing complex non-clausal phrases and lengthy noun post-modification at later stages (e.g., prepositional phrases other than of, and multiple prepositional phrases with levels of embedding). Furthermore, the exploration of abstracts in computer science also sheds new light on disciplinary variation in the compact discourse style of academic prose, affording explicit guidance for L2 writers on which phrasal features are available to them and which features they should focus on in academic writing in computer science. For instance, given the precision and accuracy required of research articles in computer science, it should be noted that science research writing contains fewer attributive adjectives (Biber & Gray, 2016). Besides, although computer science (a hard science discipline) demonstrates a phrasal structure somewhat similar to that of applied linguistics (a humanistic discipline) (Ansarifar et al., 2018), academic writing in computer science is more strongly characterized by a structurally compressed discourse style (a high proportion of complex noun phrases).
Second, regarding language-focused EAP classrooms and teaching reform in advanced academic writing, the findings show that higher-level writers frequently use modifiers from higher developmental stages to produce a condensed discourse style, and that linguistic elements of phrasal complexity can be used to predict and assess the writing development of L2 writers. It seems reasonable to assume that corresponding instructional guidance could positively affect L2 writers' writing development. Therefore, teachers may provide extensive practice in decoding complex noun phrases in actual academic writing teaching. Furthermore, it is worth setting the teaching objectives for academic prose more precisely than large-grained instruction does, which misplaces the focus on the syntactic or clausal level, and giving more explicit guidance and feedback on phrasal complexity so that students can recognize, grasp, and use complex noun phrases. Further studies should be carried out to examine the impact of shorter-term or longer-term interventions on students' later writing. Additionally, this study provides a basis for constructing complex noun phrases or compressed discourse styles in other disciplines. It also offers a strong starting point for follow-up studies on further corpus expansion, instructional textbooks, and formative teaching in general.

Conclusion
In conclusion, this line of research may advance our understanding of writing quality and development, and of incorporating explicit noun phrase instruction into L2 research article writing. One limitation, however, is that the usage patterns found here and the instructional guidance derived from them have not been validated in real classrooms; a productive area for future work would therefore be classroom-based studies that assess specific instructional methods for phrasal complexity.