Towards Kikamba Computational Grammar

The under-resourced Kikamba language has few language technology tools because the more efficient and popular data-driven approaches for developing them suffer from data sparseness caused by a lack of digitized corpora. To address this challenge, we have developed a computational grammar for Kikamba within the multilingual Grammatical Framework (GF) toolkit. GF uses the interlingua rule-based translation approach. We developed the grammar using a morphology-driven strategy: we first wrote regular expressions for morphological inflection and then developed the syntax rules. The grammar was evaluated on one hundred sentences in both English and Kikamba. The results were encouraging: a 4-gram BLEU score of 83.05% and a position-independent error rate (PER) of 10.96%. This work contributes language technology resources for Kikamba: multilingual machine translation, a morphological analyzer, a computational grammar that provides a platform for developing multilingual applications, and the ability to generate bilingual corpora pairing Kikamba with every language currently defined in GF, which makes it easier to experiment with data-driven approaches.


Introduction
The commonly used data driven approaches for developing natural language processing (NLP) tools are currently unusable with under-resourced languages due to data sparsity and this problem might not be resolved in the near future.
There is a high demand for these NLP tools due to the exponential growth of the Internet, which has made a wealth of information available to people [...] million speakers. Its grammar is agglutinative, tonal and inflectional, and has a noun class system or class gender (a noun prefix and concord for the noun modifiers) [2] [3] [4]. In addition, its orthography consists of seven vowels and fifteen consonants [5]. In terms of descriptive grammar for Kikamba, some work has already been done, though most of it is unpublished: derivational verb morphology [6] [7], noun modification [3], Kikamba morphosyntax [2] and tonal perspectives [8] [9]. Some gaps remain in these works; for example, the subject marker and negation in verb morphology are described only for the class gender that deals with humans. The concord for possessive pronouns, morphophonological changes in adjectives and verbs, and the morphology of compound nouns and adjectives are yet to be described. With respect to language resource tools, to the best of our knowledge only two exist for this language: a part-of-speech tagger and a named entity recognizer [10] [11].
GF has also been used to model language resources for Bantu languages. Kiswahili has a partial morphology analyzer [12], while the Tswana language from South Africa has a mini resource grammar [13]. No wide-coverage grammar for a Bantu language has therefore been made in GF so far. Thus, the development of the Kikamba computational grammar is a significant milestone towards the creation of a standard Basic Language Resource Kit (BLARK) [14], since it will result in a morphological analyzer and multilingual translation using the capabilities of Grammatical Framework. Secondly, it will be a catalyst for the provision of information and communication technology (ICT) in the Kikamba language, thus bridging the digital divide. It will also provide a platform for the generation of parallel corpora and treebanks, which are crucial for building NLP tools using data-driven approaches. Finally, it is an electronic preservation effort for the Kikamba language, so that the Kamba people are not disenfranchised in the global information space.

Morphology
Kikamba forms words from morphemes through prefixing and suffixing (agglutination), under the direct influence of the noun class system, noun concord and morphophonological transformation. Only a few borrowed or irregular words deviate from the noun class prefixing.

B. Kituku et al. Journal of Data Analysis and Information Processing
Regarding the noun class system, there is debate over whether it should be referred to as gender or noun class. Some consider a pair of singular and plural noun classes to be a gender [15] [16]. This view is reinforced by Demuth [17], who proposes the noun class as a subset of gender. However, Ibrahim [18] argues that either term, gender or noun class, can hold ground, since Bantu genders are not inspired by natural sex gender semantics as is the case with Indo-European languages. For the purpose of this paper, we adopt pairs of noun classes (singular and plural), each forming a class gender. Table 1 lists all noun classes for the Kikamba language [2] [3] [4]. The morpheme before the underscore represents the singular noun class while the one after it represents the plural noun class; together they form the class gender encoded in the third column for use in the GF grammar modeling. We discuss the inflection of the open categories first and the closed categories thereafter.

Noun
The structure of noun morphology consists of an obligatory prefix and root plus an optional suffix. The prefix determines the noun class number; Example 1 illustrates its usage, where the notation "c" means class and the number is the noun class number from Table 1 (for example, c1 means noun class one). The root is the radical of the lexical word. The suffix "ni" forms a locative noun, which is a case (a grammatical feature). In effect, it combines a preposition and a noun: "at the shop" becomes "dukani" and "on the table" becomes "mesani". The words for "shop" and "table" in Kikamba are "duka" and "mesa"; the preposition is realized by adding the suffix "ni".
The morpheme "ka" marks the future tense, also referred to as the indefinite future tense. The tense morpheme sits between the subject marker and the root, as exemplified in Example 3. Kikamba also has a remote future tense, constructed by prefixing "ni" to the future tense form; for the case of Example 3 this gives "niakakoma" (gloss: "he will sleep").
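The affixation described above can be sketched in a few lines of Python. This is only an illustration, not the paper's GF implementation: the function names are hypothetical, and the segmentation of "niakakoma" into "ni" + subject marker "a" + "ka" + root "koma" is an assumption inferred from the description, since Example 3 itself is not reproduced here.

```python
def locative(noun: str) -> str:
    """Form a locative noun by attaching the suffix "ni"."""
    return noun + "ni"


def future(subject_marker: str, root: str) -> str:
    """Future tense: the morpheme "ka" sits between subject marker and root."""
    return subject_marker + "ka" + root


def remote_future(subject_marker: str, root: str) -> str:
    """Remote future: the prefix "ni" is attached to the future tense form."""
    return "ni" + future(subject_marker, root)


print(locative("duka"))            # dukani
print(locative("mesa"))            # mesani
# Assumed segmentation: subject marker "a", root "koma".
print(remote_future("a", "koma"))  # niakakoma
```

Plain string concatenation is enough here because none of these forms involves a morphophonological change; forms that do would need additional rules.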

Syntax
The main topology of the Kikamba sentence is subject-verb-object (SVO) [2] [7], whereby the subject is a noun phrase followed by a verb phrase. The verb phrase is a combination of the verb and an object complement, which can be a verb phrase, a noun phrase, etc. The presence of the object is influenced by the verb valence (univalent, divalent or trivalent). For example, for a univalent verb the topology becomes SV, because a one-place verb does not require an object argument. Syntactic agreement is expressed through concord among the lexical items, mainly influenced by the class gender of the noun [2] [3].
Noun phrases are made of a noun and its modifiers, which include an adjective (Adj), a determiner (Det), both possessive (poss) and demonstrative (dem), and finally numbers (Num). Rugemarila [20] has worked extensively on the structure of noun phrases in Bantu languages and has concluded that the structure is as illustrated below, which concurs with the one presented by Mbuvi [3] for the Kikamba language.
[dem] [Noun] [Det <poss> <dem>] [Num] [Adj]
The structure of a verb phrase is the same as that of a verb, and it carries all the parameters that are integral to verbs.

Translation Approaches
The three main approaches to machine translation are data-driven, rule-based and hybrid strategies [21]. The data-driven approach, which includes neural network and statistical models, relies on a parallel aligned corpus to make machine translation possible. It is divided into statistical and example-based translation. The rule-based approach uses syntax rules, lexical rules and a lexicon to form a computational grammar based on Chomsky's theories [21]. Word-based, transfer and interlingua are the three subcategories of rule-based approaches. A grammar formalism determines the architecture of the grammar. The hybrid approach combines the above approaches, as either a rule-based guided hybrid or a data-driven hybrid translation. As mentioned in section one, Kikamba is an under-resourced language with very few digital corpora available, which is why we used the interlingua rule-based translation approach. The Grammatical Framework was chosen for three reasons. First, its multilingual capability enables the creation of the technology in the different languages already defined in GF. Second, the separation of the tecto-grammatical level (abstract syntax) from the pheno-grammatical level (concrete structure) [22] enables faster development, since one concentrates only on the concrete syntax of the language being developed. Finally, it provides a platform where application grammars can develop controlled natural languages on top of the resource grammars without the application programmer knowing the mechanics of the resource grammars.
Grammatical Framework (GF henceforth) is a toolkit used for rapid development of multilingual grammar resources and applications based on the functional programming paradigm, the logic framework of abstract syntax plus concrete syntax. GF is also a grammar formalism grounded on categorical formalism [23] [24]. GF has one abstract syntax which defines categories of trees and the functions to implement them and many concrete syntaxes, one for each specific language grammar which provides the linearization of the categories and function of trees embodied in abstract grammar [22]. These parallel grammars of concrete syntaxes equivalent to parallel multiple context-free grammars reside

Implementing the Kikamba Grammar in GF
Dictionaries, linguistics postgraduate theses and informants (who speak the language and/or are linguists) formed the data source for the lexicon and descriptive grammar. Linguists helped to clean and authenticate the data and, through elicitation, generated from corpora the morphology and syntax of the categories that were missing from the descriptive grammar. The elicitation was performed either through linguists' analysis of the corpus or by translation from English to the specific Bantu language, as proposed by Chelliah [28]. Snowball sampling [29], a non-probabilistic sampling technique, was used to gather the sparse corpora and to identify the few linguists available for the language. The evolutionary prototype model [30] was applied, since every function or module developed in GF had to be demonstrated by testing and refining it until it produced the correct output. The interlingua rule-based approach was used to develop the computational grammar with a morphology-driven strategy, which is a bottom-up method. It involves first defining the lexicon, then the categories and their smart paradigms based on regular expressions, and finally the syntax rules [25]. Therefore, we first discuss the morphology of the parts of speech and thereafter the syntax rules.

Adjective
Adjectives were implemented using the parameter AForm, which had positive (AAdj) and comparative (AComp) forms plus adverbs (Advv) formed from adjectives, and used the variable features class gender and number. The comparative adjective form was implemented by adding the infix "ang" to the positive adjective form just before the final vowel of the adjective. Table 4
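The comparative rule stated above (insert "ang" just before the final vowel of the positive form) can be sketched as follows. This is a hedged illustration rather than the GF smart paradigm itself: the function name and the vowel inventory (the seven vowels are assumed to be a, e, i, ĩ, o, u, ũ) are assumptions, and the output is simply what the stated rule yields mechanically for "athuku" ("bad"); it has not been checked against a speaker.

```python
# Assumed inventory of the seven Kikamba vowels.
VOWELS = set("aeiouĩũ")


def comparative(positive: str) -> str:
    """Insert the infix "ang" immediately before the final vowel."""
    if not positive or positive[-1] not in VOWELS:
        raise ValueError(f"expected a form ending in a vowel: {positive!r}")
    return positive[:-1] + "ang" + positive[-1]


# Mechanical application of the rule; not an attested form.
print(comparative("athuku"))  # athukangu
```

Raising an error for forms that do not end in a vowel keeps the sketch from silently producing output for borrowed or irregular words that the rule does not cover.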

Personal Pronouns and Possessives
The personal pronoun is a string but requires concord agreement of class gend-

Common Noun (CN)
In Indo-European languages, the CN is combined with an adjective to form an NP or another CN, and a determiner can later be added as a pre-modifier or post-modifier. In Kikamba, however, the determiner is placed between the noun and the adjective. The CN was therefore designed with two strings, as exemplified below: string one, "s", holds the common noun while string two, "s2", holds the adjective, which makes it easy to insert a determiner between the two. The class gender was retained from the noun since it is used in agreement (concord). Below is the rule for forming a CN from an adjective and a noun. All noun modifiers come after the noun, with the exception of some quantifiers. A CN has pre- and post-modifiers such as an adjective, relative clause, adverb, sentence and noun phrase, and ten syntax rules were constructed based on them. Below is an example of combining an adjective and a common noun.
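The two-string design can be illustrated with a small Python sketch. It is not the paper's GF rule: the class and function names are hypothetical, the gender label "G1" is a placeholder, and the split of the paper's example phrase "andu aume miongo ili athuku vyu" ("the twenty very bad men") into noun, determiner and adjective parts is an assumption based on the word order described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CN:
    s: str       # string one: the common noun
    s2: str      # string two: the adjective(s) modifying it
    gender: str  # class gender, kept from the noun for concord


def use_n(noun: str, gender: str) -> CN:
    """Lift a noun to a CN with an empty adjective slot."""
    return CN(s=noun, s2="", gender=gender)


def adj_cn(adjective: str, cn: CN) -> CN:
    """Store the adjective in s2, leaving s and the gender untouched."""
    return replace(cn, s2=" ".join(p for p in (cn.s2, adjective) if p))


def det_cn(determiner: str, cn: CN) -> str:
    """Place the determiner between the noun (s) and the adjective (s2)."""
    return " ".join(p for p in (cn.s, determiner, cn.s2) if p)


cn = adj_cn("athuku vyu", use_n("andu aume", "G1"))  # "G1" is a placeholder
print(det_cn("miongo ili", cn))  # andu aume miongo ili athuku vyu
```

Keeping the adjective in a separate field until the determiner is known is what lets the determiner land in the middle without re-parsing the string.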

Determiner Phrase (Det)
Det phrases can be either possessive or demonstrative, and were implemented using quantifiers, numbers and possessive pronouns. Three rules were implemented for Det phrases; the rule below forms a Det from a quantifier and a number.
    DetQuant quant num = {s = \\Cgender => quant.s ! num.n ! Cgender ++ num.s ! Cgender; n = num.n; isPre = True};

Adjective Phrase
The adjective phrase was modeled via the positive adjective, the comparative adjective, post-modifiers of an adjective such as adverbs, and attachment of the adjective to a sentence. In total, eleven rules were used to implement adjective phrases and the comparative adjective phrase. The next rule exemplifies the implementation. The agreement consists of number and class gender, and the Boolean value allows us to place the adjective phrase after the noun.
    ComparA a np = {s = \\g,n => a.s ! AAdj g n ++ "kuvita" ++ np.s ! npNom; isPre = False};

Noun Phrase (NP)
NP was implemented from the common noun, proper names, determiners, pronouns and also recursion of NP with adverbs, pre-determiners and determiners.
The NP implementation used two parameters: case and agreement (concord). For case, we introduced an extra case, NPoss, to cater for NPs formed from personal and possessive pronouns. Eight rules were implemented to form NPs. Below is an example of how to form an NP by combining a determiner and a common noun in the Kikamba language. The Boolean value associated with the determiner allows pre- and post-determiners of the CN to be placed in the right position.

Verb Phrase (VP)
In the VP, the prefix morphemes (focus, negation, subject marker and tense) were concatenated to verbs, as mentioned in section 3.3, to make a complete verb.
Since a whole verb can act as a sentence, the parameters of sentences (polarity, tense and anteriority), in addition to agreement, were used in the design, as exemplified by the verb phrase oper (operation). Five record strings were used: s for the normal verb, progV for progressive verbs, compl for the object of the verb, imp for imperative verbs and inf for infinitive verbs. The subcategorization of verbs (one-place, two-place and three-place verbs) was handled through compl, and in total 20 rules were implemented based on the regular verb phrase function regVP.

Other Syntax Categories
A clause was formed by combining a noun phrase and a verb phrase, and it implemented the SVO topology, where the O was the second string of the verb phrase, which held the complement of the verb. In the next section, we illustrate one of the rules for forming clauses. A sentence is formed from a clause and has the same parameters; the difference in GF is that polarity and tense are undetermined in clauses [25]. Finally, sentences and interrogatives form the utterance (utt), which was the starting category for this computational grammar and was modeled based on definition 2. Seven clause rules, eight utterance rules and seven sentence rules were implemented.

Results
The Kikamba grammar was subjected to test suites for the purposes of testing and evaluation. Testing aimed to improve grammar quality (reduce overgeneration and ensure coverage) during development, while the objective of evaluation was to check the coverage and quality of the grammar after development. The linguistic phenomena covered by this grammar are shown in Figure A1 and are the ones that were tested and evaluated. There are three ways to create test suites for testing computational grammars [31] [32].
• Grammar writer or expert writes the test suite data or uses already existing test suites.
• Using natural existing corpus or treebanks.
• Use of the comments created for each grammar rule that shows what the rule parses in the grammar.
We based our evaluation and testing on the aspect of the grammar already developed as per Table 5. Thus, we used method one for evaluating and method three for testing.
To create the test suite for testing, the comment(s) for each function/rule in the abstract syntax were used. Each comment is an example of what the rule can parse in English, and the grammar writer added extra English phrases for each rule. The test suite for each rule was translated into Kikamba phrases or lexical items (the gold standard). Each rule was implemented so that its linearization output matched the gold standard; otherwise the function was refined and the regression tests re-run until a match was obtained. Whenever a module changed, the tests were re-run to ensure that no new noise was introduced. This is the standard testing procedure for GF grammars [25] and is illustrated in Figure 2.
Figure 3 represents the sentence "these bad men will cut many trees" in Kikamba. The verb "cut" is a two-place verb and hence takes an object; the sentence is in the future tense with positive polarity and simultaneous anteriority. The sentence S is created from the clause Cl, which consists of NP and VP. The VP is in turn made of VPslash and NP. Therefore, the sentence is indirectly made of NP VPslash NP, which correspond to the S, V and O of the SVO structure respectively. Table 6 shows the morphology of individual categories. The morphology is discussed in Table 7. All the tenses, polarities and anteriorities implemented in this grammar are exemplified in Table A1 in the appendix using the verb "sleep".
[...] "on" that is fused with the noun "table" to become "mesani", and the preposition "of", which is translated as "kya" based on class gender G4 of the pen. The gloss of the utterance used is "the pen of John was on the table". Figure 5(b) shows the word alignment between English and Kikamba for the same utterance.
In Kikamba, tone is used to mark a question; hence, the constituents of the declarative sentence are not rearranged. Figure 6(a) demonstrates the coverage of the wh-question "which trees did the wind push?" while the [...]
The Kikamba grammar is part of, and the initial stage of, creating a shared grammar for Kenyan Bantu languages through bootstrapping strategies, mainly grammar sharing and grammar porting. In order to maintain standard regression testing for any new Bantu language added via bootstrapping, we parsed the hundred English sentences to create a treebank test suite. Table 5 represents the categories covered in the treebanks. Below is an example of a tree which linearizes into "andu aume miongo ili athuku vyu nimananyw'ie nzovi" in Kikamba, with the English gloss "the twenty very bad men drank beer". The tree starts at the phrase level (PhrUtt) with no conjunction and no vocative, taking a sentence utterance:
(Utts [...] n2))))))) (AdjCN (AdAP very_AdA (PositA bad_A)) (UseN man_N))) (ComplSlash (SlashV2a drink_V2) (MassNP (UseN beer_N)))))) NoVoc
The treebanks created had 2854 functions in total; the largest tree had 62 functions while the shortest had 11. The largest tree was made of two sentences which had complex verb phrases and noun phrases.
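To make the per-tree function counts concrete, the sketch below counts the function symbols in a bracketed tree string. It is a minimal illustration under stated assumptions: the helper name is hypothetical, every identifier in the tree (including lexical constants such as man_N) is counted as one function, and the paper does not say how its own counts were produced.

```python
import re

# An identifier is a run of letters, digits, underscores or apostrophes
# that starts with a letter or underscore.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")


def count_functions(tree: str) -> int:
    """Count the function symbols (identifiers) in a bracketed tree string."""
    return len(IDENTIFIER.findall(tree))


# Intact fragments of the example tree given in the text.
noun_part = "(AdjCN (AdAP very_AdA (PositA bad_A)) (UseN man_N))"
verb_part = "(ComplSlash (SlashV2a drink_V2) (MassNP (UseN beer_N)))"
print(count_functions(noun_part))  # 7
print(count_functions(verb_part))  # 6
```

Summing this count over every tree in a treebank file would give a total comparable to the figure reported above, provided the same counting convention is used.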

Discussion
The statistical machine translation (SMT) work on Dholuo-English and Swahili-Dholuo [37] gave low BLEU scores of 0.29 and 0.15, which the author attributed to a lack of bilingual corpora. Given that the corpus was divided into ten portions, with nine used for training and one for testing, a high BLEU score would have been expected. This indicates that a rule-based system can deliver higher performance for under-resourced languages. The SAWA corpus English to Swahili statistical machine translation [38] resulted in a BLEU score of 35, which is still low. Weku [39] reports a BLEU score of 32.6 on English-Swahili SMT based on Bayesian inference. We could not find an evaluation of a rule-based system using the above metrics to compare with, and especially not one for a Bantu language. Therefore, this work is an indication of how a rule-based system can help to produce highly accurate systems for these under-resourced languages.
The error analysis was done sentence by sentence, and Figure 8 summarizes the issues which contributed to the noise. In Kikamba, pronouns were dropped (pro-drop) since they were represented in the subject marker of the verb, but in some cases they were not dropped. Secondly, some prepositions were fused into the noun but also appeared as separate strings. Verbs contributed the largest percentage of the errors, due to morphophonological issues resulting from nasal deletion and insertion, which are present in the Kikamba language [40]. When a sentence had two adjectives, their order was changed in the translation; this was heavily penalized by WER and BLEU, hence the use of PER, which allows word reordering, and the error fell from a WER of 12.82% to a PER of 10.96%.
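The difference between WER and PER on reordered adjectives can be seen in a small Python sketch. It is a hedged illustration, not the evaluation script used in the paper: WER is taken as word-level edit distance divided by reference length, PER uses one common bag-of-words formulation, (max(|ref|, |hyp|) - matches) / |ref|, and the English sentence pair is invented for the demonstration.

```python
from collections import Counter


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


# Invented pair: the two adjectives are swapped in the hypothesis.
reference = "the big red house"
hypothesis = "the red big house"
print(wer(reference, hypothesis))  # 0.5
print(per(reference, hypothesis))  # 0.0
```

Swapping two adjacent words costs two edits under WER but nothing under PER, which is why PER is the lower of the two figures when adjective order is the main difference.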

Conclusions
In this paper, we have formalized the grammar of the Kikamba language through the high-precision rule-based approach in the interlingua GF environment. The evaluation results are encouraging: a 4-gram BLEU score of 83.05%, a WER of 12.82% and a PER of 10.96%. Our contributions are as follows. Firstly, we have provided NLP tools, namely a morphological analyzer and a machine translator, for the under-resourced Kikamba language by extending the GF library, which is a step towards BLARK. Secondly, the wide coverage of the Kikamba computational grammar provides a platform for building multilingual technological applications and for generating the scarce bilingual corpora, pairing Kikamba with the other languages present in GF, for experiments with data-driven methods. Finally, we have created a treebank that can be used to evaluate Bantu languages. Future work includes working on the morphophonological rules of verbs, extending the lexicon so as to handle text, and including questions as part of the grammar.
Table A1. Examples of tense, negation and anteriority.