The under-resourced Kikamba language has few language technology tools because the more efficient and popular data-driven approaches for developing them suffer from data sparseness caused by the lack of digitized corpora. To address this challenge, we have developed a computational grammar for Kikamba within the multilingual Grammatical Framework (GF) toolkit, which uses the interlingua rule-based translation approach. We followed a morphology-driven strategy: we first developed regular expressions for morphological inflection and then the syntax rules. The grammar was evaluated on one hundred sentences in both English and Kikamba. The results were an encouraging 4-gram BLEU score of 83.05% and a position-independent error rate (PER) of 10.96%. Our contribution to the language technology resources for Kikamba includes multilingual machine translation, a morphological analyzer, a computational grammar that provides a platform for developing multilingual applications, and the ability to generate bilingual corpora pairing Kikamba with all languages currently defined in GF, which makes it easier to experiment with data-driven approaches.
The commonly used data-driven approaches for developing natural language processing (NLP) tools are currently unusable for under-resourced languages because of data sparsity, and this problem might not be resolved in the near future. Demand for these NLP tools is high because of the exponential growth of the Internet, which has made a wealth of information available to people, coupled with the high penetration rate of connected mobile devices. There is, therefore, an urgent need to devise strategies that can accelerate the development of language technology tools and applications for under-resourced languages so that their speakers can keep using their languages in a digital environment. This paper describes the development of a computational grammar for the Kikamba language, an under-resourced language, using the multilingual Grammatical Framework toolkit.
Guthrie [
Kikamba forms words from morphemes through prefixing and suffixing (agglutination), under the direct influence of the noun class system, noun concord and morphophonological transformation. Only a few borrowed or irregular words deviate from the noun class prefixing. There is debate over whether the noun class system should be referred to as gender or noun class. Some consider a pair of singular and plural noun classes as a gender [
The structure of noun morphology consists of an obligatory prefix and root plus an optional suffix. The prefix determines the noun class number, as shown in Example 1, where the notation “c” means class and the number is the noun class number based on
Classes (c) | Class number | GF coding |
---|---|---|
mu_a | 1, 2 | G1 |
mu_mi | 3, 4 | G2 |
i_ma | 5, 6 | G3 |
ki_i | 7, 8 | G4 |
ka_tu | 12, 13 | G5 |
va_ku | 14, 15 | G6 |
n_n | 9, 10 | G7 |
u_ma | 11, 6 | G8 |
u_n | 11, 10 | G9 |
ku_ma | 15, 6 | G10 |
The adjective describes and modifies a noun and its inflection consists of a prefix (concord) which agrees with the class gender of the noun being modified. In addition, to form the adjective, concatenation of the prefix with the adjective root is done [
Kikamba is no exception to the complexity of verb morphology in Bantu languages. Verb inflection involves several morphemes (several prefixes, the root, extensional suffixes and a final vowel which represents mood) plus grammatical features such as person, number, class gender, tense and polarity.
Architecture | Morpheme | Kikamba |
---|---|---|
Prefixes | Focus | “ni” |
Prefixes | Negation | as per class |
Prefixes | Subject marker | as per class and person |
Prefixes | Tense/Aspect | as per tense |
Prefixes | Object marker | as per class and person |
Prefixes | Infinitive | “ku” |
Root | Root | |
Extension suffix | Applicative | “i” |
Extension suffix | Causative | “ithy” |
Extension suffix | Passive | “w” |
Extension suffix | Reversive | “u” |
Extension suffix | Reciprocal | “an” |
Final vowel | Final vowel | “a/e” |
Tense
Reichenbach [
The morpheme “ka” marks the future tense, also referred to as the indefinite future tense. The tense morpheme sits between the subject marker and the root, as exemplified in Example 3. Kikamba also has a remote future tense, constructed by prefixing “ni” to the future tense form; for instance, the form in Example 3 becomes “niakakoma” (gloss: “he will sleep”).
Past tense is marked by the final-vowel morpheme “ie”, which is subject to phonological rules, and uses the infix “na” to mark aspect [
Present tense, in some cases referred to as present indefinite tense or habitual tense depending on usage, is marked by aspect vowel “a” [
Finally, the perfect tense in the positive polarity is not marked by any morpheme, whereas in the negative it is marked by the morpheme “na” as illustrated in Example 6.
The demonstrative is a noun modifier that shows how far the object(s) is/are from the speaker. Unlike Indo-European languages, which have demonstrative strings for near and distant, Kikamba has an extra string for the aforementioned [
Personal pronouns in Kikamba stand for absent nouns; in GF they are modeled as noun phrases and therefore have a string and enforce agreement (person, class gender and number). The possessive pronoun is a noun modifier depicting ownership, and its architecture consists of a prefix dependent on class gender and number [
For the preposition, elicitation showed that the strings for some prepositions, for example “of”, vary with class gender and number, while most prepositions do not inflect. In addition, some prepositions are fused into the noun, as demonstrated in Example 8, resulting in the locative noun. Cardinal and ordinal numerals can be expressed in words or digits. When expressed in words, the cardinal numerals one to five behave like adjectives and take concord agreement, while the rest are independent of the class gender [
The main topology for the Kikamba language sentence is subject-verb-object (SVO) [
Noun phrases are made of a noun and its modifiers which include an adjective (Adj), determiner (Det), both possessive (poss) and demonstrative (dem) and finally numbers (Num). Rugemarila [
[dem] [Noun] [Det ] [Num] [Adj]
The structure of a verb phrase is the same as a verb and carries all parameters that are integral to verbs.
The three main approaches to machine translation are: data driven, rule based and hybrid strategies [
Grammatical Framework (GF henceforth) is a toolkit used for rapid development of multilingual grammar resources and applications based on the functional programming paradigm, the logic framework of abstract syntax plus concrete syntax. GF is also a grammar formalism grounded on categorical formalism [
Grammar features are defined using parameters which are objects of some type and use the keyword param. Below is an illustration of parameter number
param
Number = Singular | Plural;
GF makes a distinction between inherent and variable features of grammar. To gather all features of a specific category together, a record is used. For example, the noun category in Kikamba has the inherent feature class gender and the variable features number and case; its linearization type, a record gathering all these features, would therefore be defined as below
N = {s: Number => Case => Str; g: Cgender};
Finally, GF uses the operator “+” for concatenation and keyword oper to define operation or function for regular expression of all categories in the morphology modules.
Dictionaries, postgraduate linguistics theses and informants (who speak the language and/or are linguists) formed the data source for the lexicon and descriptive grammar. Linguists cleaned and authenticated the data and, through elicitation, generated from corpora the morphology and syntax of the categories that were missing in the descriptive grammar. The elicitation was performed either through language analysis of the corpus using linguist judgment or by translation from English to the specific Bantu language as proposed by Chelliah [
To model the noun inflection class gender, number and case, grammar features were used. Ten class genders were identified as shown in
param
Number = Sg | Pl;
Case = Nom | Loc;
Cgender = G1|G2 | G3 | G4 | G5 | G6 | G7| G8 | G9 | G10;
lincat
N = {s: Number => Case => Str; g: Cgender};
Kikamba language has a simple noun (single string) and compound noun (two strings) with inflection happening by changing the prefix (see
mkN = overload {
mkN: Str -> Cgender -> N = \n, g -> lin N (regN n g);
mkN: (man, men: N)-> Cgender -> N = compoundN;
mkN: (man, men: Str) -> Cgender -> N = \s,p,g -> lin N (iregN s p g);};
The function PrefixPlNom provided the inflection prefix while each smart paradigm retained class gender for future concord agreement with the noun modifiers at the syntax stage.
Noun inflection
regN: Str -> Cgender -> Noun = \w, g ->
  let wpl = case g of {
    G1 => case w of {
      "mwa" + _ => Predef.drop 2 w;
      "mwi" + _ => "e" + Predef.drop 3 w;
      _ => PrefixPlNom G1 + Predef.drop 2 w };
    G2 => case w of {
      "mw" + _ => "my" + Predef.drop 2 w;
      _ => PrefixPlNom G2 + Predef.drop 2 w };
    ………………………………………………..
    _ => PrefixPlNom g + Predef.drop 2 w};
  in iregN w wpl g;
compoundN: N -> N -> Cgender -> N = \mundu,muume,g -> {
  s = \\n,c => mundu.s ! n ! c ++ muume.s ! n ! c;
  g = g;
  lock_N = <> };
iregN: Str -> Str -> Cgender -> Noun = \man,men,g -> {
  s = table{
    Sg => table{Nom => man; Loc => man + "ni" | men + "ni" };
    Pl => table{Nom => men; Loc => ""}};
  g = g; };
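The regular-noun paradigm can also be traced outside GF. The following is a minimal Python sketch, not the grammar's implementation: it mirrors only the G1 and G2 branches visible in regN and the locative suffix from iregN. The function names are hypothetical, and the plural prefixes "a" and "mi" are assumed from the noun class table (mu_a, mu_mi) rather than taken from PrefixPlNom itself.

```python
# Hypothetical mirror of the visible G1/G2 branches of regN.
# Plural prefixes are an assumption read off the noun class table.
PLURAL_PREFIX = {"G1": "a", "G2": "mi"}


def plural_nominative(word: str, gender: str) -> str:
    """Derive the plural of a regular noun from its singular form."""
    if gender == "G1":
        if word.startswith("mwa"):
            return word[2:]             # Predef.drop 2 w
        if word.startswith("mwi"):
            return "e" + word[3:]       # "e" + Predef.drop 3 w
    elif gender == "G2":
        if word.startswith("mw"):
            return "my" + word[2:]      # "my" + Predef.drop 2 w
    elif gender not in PLURAL_PREFIX:
        raise ValueError(f"{gender} is not covered by this sketch")
    # Default branch: swap the two-letter singular prefix for the plural one.
    return PLURAL_PREFIX[gender] + word[2:]


def noun_table(singular: str, gender: str) -> dict:
    """Build a record like iregN: an inflection table plus inherent gender."""
    plural = plural_nominative(singular, gender)
    return {
        "s": {
            ("Sg", "Nom"): singular,
            ("Sg", "Loc"): singular + "ni",  # first variant in iregN only
            ("Pl", "Nom"): plural,
            ("Pl", "Loc"): "",               # left empty, as in iregN
        },
        "g": gender,
    }


print(noun_table("mundu", "G1")["s"][("Pl", "Nom")])
print(noun_table("muti", "G2")["s"][("Pl", "Nom")])
```

Under these assumptions the two calls yield the plural forms "andu" and "miti", which agree with the word forms analysed later in the evaluation tables.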
Adjectives were implemented using the parameter AForm, which had positive (AAdj) and comparative (AComp) forms plus adverbs (Advv) formed from adjectives, utilizing the variable features class gender and number. The comparative form was implemented by adding the infix “ang” to the positive form just before the final vowel of the adjective.
AForm = AAdj Cgender Number | AComp Cgender Number | Advv;
The GRL provided a grid of (4*2*2): four tenses (present, past, future and conditional), two polarities (positive and negative) and two anteriorities (anterior and simultaneous), which were used to implement verbs. This grid expanded to (10*2*4*2*2) because some morphemes in Kikamba verbs, such as the subject marker and object marker, depend on the ten class genders and on number. To improve time and space complexity, we implemented the verb suffixes in
Various verb forms needed for implementation of the verb and verb phrase were identified as present progressive, infinitives, past tense form, present definite form and neutral form and the parameter VForm was used to assemble them as shown below. The parameter VForm Extension provided the derivational morphology based on the extension suffixes presented in
Adjective inflection
regA: Str -> {s: AForm => Str} = \seo -> {s = table {
  AAdj G1 Sg => case Predef.take 1 seo of {
    "a" | "e" | "i" | "o" => "mw" + seo;
    "u" => "m" + seo;
    _ => ConsonantAdjprefix G1 Sg + seo };
  …………………………………………………………..
  AComp g Sg => let af: Str = case Predef.take 1 seo of {
    "i" => "mw" + seo;
    "a" => "my" + seo;
    "u" => "m" + seo;
    _ => ConsonantAdjprefix g Sg + seo };
  in init af + "ang" + last af } };
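To make the concord and comparative rules concrete, here is a small Python sketch of the G1 singular branch of regA and of the “ang” infixation. It is an illustration under assumptions: the function names are hypothetical, and the consonant-initial prefix "mu" stands in for ConsonantAdjprefix G1 Sg, whose actual value is not shown in the listing.

```python
# Assumed stand-in for ConsonantAdjprefix G1 Sg (not given in the listing).
G1_SG_CONSONANT_PREFIX = "mu"


def positive_g1_sg(root: str) -> str:
    """Positive adjective form agreeing with a G1 singular noun."""
    first = root[:1]                      # Predef.take 1 seo
    if first in ("a", "e", "i", "o"):
        return "mw" + root
    if first == "u":
        return "m" + root
    return G1_SG_CONSONANT_PREFIX + root


def comparative(positive: str) -> str:
    """Insert the infix "ang" just before the final vowel."""
    return positive[:-1] + "ang" + positive[-1]   # init af + "ang" + last af


form = positive_g1_sg("nene")
print(form, comparative(form))
```

The sketch applies the infix to the positive form, as the prose describes; the GF listing recomputes the prefix inside the AComp branch with slightly different vowel cases, which this simplification does not reproduce.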
Smart paradigms regV and iregV were the functions for regular and irregular verbs.
param
VExte = EPassive | EApplicative | EReciprocal | ECausative | EDistributive;
VForm = VPreProg | VInf | VPast | VPreDef | VGen | VExtension VExte;
oper
regV: Str -> Verb = \vika -> let root = init vika
in {s = table{
VPreProg => case Predef.dp 1 root of {
"b" | "v" | "m" => root + "ete";
_ => root + "ite"};
VInf => "ku" + vika;
VPast => root + "ie";
VPreDef => root + "aa";
VExtension type => init vika + extension type + last vika;
VGen => vika}};
iregV: Str -> Verb = \vika -> {s = \\_ => vika};
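The regular verb paradigm is easy to check by hand. The Python sketch below mirrors the non-extension branches of regV; the function name and the dictionary keys are hypothetical labels chosen for this illustration, and the extension suffixes are left out because the helper that supplies them is not shown.

```python
def reg_verb_forms(vika: str) -> dict:
    """Mirror the non-extension branches of regV for a regular verb."""
    root = vika[:-1]                              # init vika
    # Predef.dp 1 root: choose the suffix from the root's last character.
    progressive_suffix = "ete" if root[-1:] in ("b", "v", "m") else "ite"
    return {
        "present_progressive": root + progressive_suffix,
        "infinitive": "ku" + vika,
        "past": root + "ie",
        "present_definite": root + "aa",
        "neutral": vika,
    }


for label, form in reg_verb_forms("koma").items():
    print(label, form)
```

For “koma” the sketch produces komete, kukoma, komie, komaa and koma, stems that can be recognised inside the inflected forms listed in the appendix table (for example “niwakomete” and “kukoma”).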
Both cardinal and ordinal numerals were implemented in words, from 0 up to 999,999. Two parameters were used to model numerals. First, DForm has four forms: unit represents the numerals 0 - 9, teen and ten together represent the range 10 - 99, and hund the range 100 - 999. Second, CardOrd represents ordinal (NOrd) and cardinal (NCard) numerals. The smart paradigm regNum (regular number) was used to implement the numerals. Ordinal numerals were formed from cardinal numerals by adding the class gender morpheme supplied by the function Ordprefix.
param
DForm = unit | teen | ten | hund;
CardOrd = NCard | NOrd;
oper
regNum: Str -> {s: DForm => CardOrd => Cgender => Str} =
\six -> {s = table {
unit => table {NCard => \\g => six;
NOrd => \\g => Ordprefix g ++ six};
teen => table {NCard => \\g => "ikumi na" ++ six;
NOrd => \\g => Ordprefix g ++ "ikumi na" ++ six};
ten => table {NCard => \\g => "miongo" ++ six;
NOrd => \\g => Ordprefix g ++ "miongo" ++ six};
hund => table {NCard => \\g => "maana" ++ six;
NOrd => \\g => Ordprefix g ++ "maana" ++ six} } };
The personal pronoun is a string but requires concord agreement in class gender, number and person, since GF treats it as a noun phrase, while the possessive inflects by class gender and number. The PronForm parameter represents these two scenarios, as depicted below. The function mkPron (make pronoun) generates both lexemes; it takes two strings, class gender, number and person as arguments, which are supplied by the linearization (lin) of the pronoun, as in the example he_Pron below. Finally, the function ProunSgprefix and ProunSgprefix provided the class gender-specific prefix for concatenation with possessive form stem as shown in
param
Agr = Ag Cgender Number Person;
PronForm = Pers | Poss Number Cgender;
lin
he_Pron = mkPron “we” “ake” G1 Sg P3;
The demonstrative, quantifier and preposition were configured as strings dependent on the class gender and number parameters. Adverbs do not inflect and hence are independent strings. The linearization type of the preposition was configured with a Boolean field to distinguish prepositions that are fused with nouns from those that are not. Below are the linearization category and the smart paradigm mkPrep.
lin
above_Prep = mkPrep “iulu” False;
oper
Prepp = {s: Number => Cgender => Str; isFused: Bool};
mkPrep = overload {
mkPrep: Str ->Bool-> Prep = \str,bool -> lin Prep {s = \\n,g => str; isFused = bool};
mkPrep: (Number => Cgender => Str) ->Bool-> Prep = \t,bool -> lin Prep {s = t; isFused = bool}; };
In Indo-European languages, the CN is combined with an adjective to form an NP or another CN, and a determiner can later be added as a pre-modifier or post-modifier. In Kikamba, however, the determiner is placed between the noun and the adjective. The CN was therefore designed with two strings, as exemplified below: string one, “s”, holds the CN while string two, “s2”, holds the adjective, which makes it easy to add a determiner between them. The class gender is retained from the noun since it is used in agreement (concord). Below is the linearization type used for forming CN from an adjective and a noun. All noun modifiers come after the noun, with the exception of some quantifiers. Kikamba does not have articles.
lincat
CN = CNoun;
oper
CNoun: Type = {s: Number => Case => Str; g: Cgender; s2: Number => Str};
CN has pre and postmodifiers such as an adjective, relative clause, adverbs, sentence and noun phrase and based on them, ten syntax rules were constructed. Below is an example of combining an adjective and a common noun.
AdjCN ap cn = {s = cn.s; g = cn.g; s2 = \\n => cn.s2! n ++ ap.s ! cn.g ! n};
Det phrases can be either possessive or demonstrative and were implemented using quantifiers, numbers and possessive pronouns. Three rules were implemented for Det phrases; below is one of them, which forms a Det from a quantifier and a number.
DetQuant quant num = {s = \\Cgender =>quant.s ! num.n!Cgender ++ num.s !Cgender;
n = num.n; isPre = True};
The adjective phrase was modeled via positive adjective, comparative adjective, post modifier of an adjective such as adverbs and also attaching it to a sentence. In total, eleven rules were used to implement adjective phrases and the comparative adjective phrase. The next rule exemplifies the implementation. The agreement consists of number and class gender and the Boolean value allows us to place the adjective phrase after the noun.
ComparA a np = {s = \\g,n => a.s !AAdj g n ++ “kuvita” ++ np.s ! npNom; isPre = False};
NP was implemented from the common noun, proper names, determiners, pronouns and also recursion of NP with adverbs, pre-determiners and determiners. NP implementation used two parameters: case and agreement (concord). On the case, we introduce extra case NPoss to cater for NP formed from personal and possessive pronouns. Eight rules were implemented to form NP. Below is an example of how to form NP by combining a determiner and a common noun in the Kikamba language. The Boolean function associated with the determiner allows pre and post determiners of CN to be placed in the right position.
DetCN det cn = {s =\\c=> case det.isPre of {
False => det.s!cn.g ++ cn.s ! det.n !npcase2case c ++ cn.s2!det.n;
True => cn.s ! det.n !npcase2case c ++ det.s!cn.g ++ cn.s2!det.n};
a =Ag cn.g det.n P3;};
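The effect of the isPre flag in DetCN can be seen in a short Python sketch. It is a simplification under stated assumptions: the class and function names are hypothetical, and number, case and gender selection are taken as already resolved, so each field is a plain string.

```python
from dataclasses import dataclass


@dataclass
class CommonNoun:
    s: str    # noun string (number and case already selected)
    s2: str   # adjective string, held apart so a determiner can precede it


@dataclass
class Determiner:
    s: str         # determiner string (gender already selected)
    is_pre: bool   # same flag as isPre in the DetCN rule


def det_cn(det: Determiner, cn: CommonNoun) -> str:
    """Order determiner, noun and adjective as the two DetCN branches do."""
    if det.is_pre:
        parts = [cn.s, det.s, cn.s2]   # True branch: noun, determiner, adjective
    else:
        parts = [det.s, cn.s, cn.s2]   # False branch: determiner first
    return " ".join(part for part in parts if part)


print(det_cn(Determiner("aa", True), CommonNoun("andu aume", "athuku")))
```

The example words are taken from the word analysis table in the evaluation section; keeping the adjective in a separate field is what lets the determiner be slotted in without re-tokenising the noun phrase.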
In VP the prefixes (focus, negation, subject marker, tense) morphemes were concatenated to verbs as mentioned in section 3.3 to make a complete verb. Since a whole verb can act as a sentence, then the parameters of sentences: polarity, tense and anterior in addition to agreements were used in the design as exemplified by the operation oper verb phrase. Five record strings were used: s for normal verb, progV for progressive verbs, compl for object of the verb, imp for imperative verbs and inf for infinitive verbs. The subcategorization of verbs was taken care of through compl (one place, two place and three place verb) and in total 20 rules were implemented based on the regular verb phrase function regVP.
oper
VerbPhrase: Type = {
s: Agr => Polarity => Tense => Anteriority => Str;
compl: Agr => Str;
progV: Str;
imp: Polarity => ImpForm => Str;
inf: Str};
A clause was formed by combining a noun phrase and a verb phrase, implementing the SVO topology, where the O was the second string of the verb phrase, which implemented the complement of the verb. Below, we illustrate one of the rules for forming clauses. A clause forms a sentence with the same parameters. However, the difference in GF is that the polarity and tense in clauses are undetermined [
PredVP np vp = let agr = verbAgr np.a in{s=\\pol,tense,anter => let
verb: Str = vp.s!Ag agr.g agr.n agr.p !pol!tense!anter;
obj: Str = vp.compl !Ag agr.g agr.n agr.p; in
np.s !npNom ++ verb ++ obj};
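As a rough illustration of how PredVP assembles an SVO clause, the Python sketch below concatenates a subject marker, a tense morpheme and a verb stem into a finite verb and then joins subject, verb and object. The function names are hypothetical, agreement lookup is replaced by passing the morphemes in directly, and the example sentence is pieced together from the word analysis table in the evaluation section rather than copied from grammar output.

```python
def finite_verb(subject_marker: str, tense_marker: str, stem: str) -> str:
    """Concatenate the verb prefixes onto the stem (no phonological rules)."""
    return subject_marker + tense_marker + stem


def pred_vp(subject: str, verb: str, complement: str = "") -> str:
    """Join subject, verb and optional complement in SVO order."""
    return " ".join(part for part in (subject, verb, complement) if part)


verb = finite_verb("ma", "ka", "tema")   # G1 plural subject marker + future
print(pred_vp("andu aume aa athuku", verb, "miti miingi"))
```

In the grammar itself the subject marker is selected from the agreement record of the noun phrase, which is why PredVP passes the gender, number and person of the subject into vp.s.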
The Kikamba grammar was subjected to test suites for testing and evaluation. Testing aimed to improve grammar quality (reduce overgeneration and ensure coverage) during development, while evaluation aimed to check the coverage and quality of the grammar after development. The linguistic phenomena covered for this grammar are shown in
· Grammar writer or expert writes the test suite data or uses already existing test suites.
· Using natural existing corpus or treebanks.
· Use of the comments created for each grammar rule that shows what the rule parses in the grammar.
We based our evaluation and testing on the aspect of the grammar already developed as per
To create the test suite for testing, the comment(s) for each function/rule in the abstract syntax were used. Each comment is an example of what the rule can parse in English; the grammar writer also generated extra English phrases for each rule. The test suite for each rule was translated into Kikamba phrases or lexicon (the gold standard). Each rule was implemented so that its linearization output matched the gold standard; otherwise the function was refined and the regression test re-run until a match was obtained. Whenever a module changed, the tests were re-run to ensure that no new noise was introduced. This is the standard testing procedure for GF grammars [
Coverage | |
---|---|
Sentence | Declarative, Questions |
Tense | Present, Future, Past and Conditional |
Verb | One-Place, Two-Place, Verb Phrase |
Determiners | Quantifiers, Numbers and Possessive Pronoun |
Noun | One-Place, Two-Place, Three-Place, Complex Noun |
Adjective | Positive, Comparative and Complex |
Noun Phrase | Personal Pronoun and NP Phrase |
Adverb | Modifying Verbs, Numbers and Adjective |
Others | Prepositional and Conjugation |
In our evaluation, a test suite of 100 sentences was developed from three sources: a linguist who was provided with the 500 lexicon entries of different categories in GF so as to generate sentences, the GF online treebanks and Khegai [
We shall demonstrate how coverage of morphology and syntax using the dominant topology was accomplished at four levels. The Graphviz software is used to provide the Kikamba parse tree and word alignment after parsing the English equivalent.
· Normal sentence with simple SVO topology
· A sentence with a complex Noun Phrase
· Prepositional usage
· Normal questions and Wh-questions
Word | Category | Explanation |
---|---|---|
Andu aume | Compound Noun | a class gender G1 number Pl prefix ndu-root a class gender G1 number Pl prefix uume-root |
Aa | Quantifiers | class gender G1 dependent string |
Athuku | Adjectives | a G1 concord prefix thuku Adj root |
Ma | VP | Subject marker for class gender G1 and person 3 |
Ka | VP | Future tense morpheme in simultaneous |
Tema | V2 | Two place verb (with argument) |
Miti | N | mi class gender G2 number Pl prefix ti-root |
Miingi | Determiner | mi G2 concord prefix ingi Det root |
Word | Category | Explanation |
---|---|---|
Ana Inya | N2 | Prefix a class gender G1 number Pl root ndu String to the noun |
menyu | Possessive Det | class gender G1 dependent string |
Anene | Adjective | a G1 concord prefix and the adjective root is nene |
Onthe | Determiner | class gender G1 dependent string |
Ma | VP | Subject marker for class gender G1 and person 3 |
Ti | VP | past tense morpheme in simultaneous |
nee | Infix | |
koma | V | One place verb (no object argument) |
“on” that is fused with the noun “table” to become “mesani” and preposition “of” which is translated “kya” based on class gender G4 of the pen. The gloss of the utterance used is “the pen of John was on the table”.
In Kikamba language, the tone is used to mark a question; hence, there are no rearrangements of the declarative sentence constituents.
The Kikamba grammar is the initial stage of creating a shared grammar for Kenyan Bantu languages through bootstrapping strategies, mainly grammar sharing and grammar porting. To maintain standard regression testing for any new Bantu language added via bootstrapping, we parsed the hundred English sentences to create a treebank test suite.
PhrUtt NoPConj (UttS (UseCl (TTAnt TPast ASimul) PPos (PredVP (DetCN (DetQuant DefArt (NumCard (NumNumeral (num (pot2as3 (pot1as2 (pot1 n2))))))) (AdjCN (AdAP very_AdA (PositA bad_A)) (UseN man_N))) (ComplSlash (SlashV2a drink_V2) (MassNP (UseN beer_N)))))) NoVoc
The treebank created had 2854 functions in total. The largest tree had 62 functions, while the smallest had 11. The largest tree was made of two sentences which had complex verb phrases and noun phrases.
The statistical machine translation (SMT) Dholuo-English and Swahili-Dholuo [
with and especially not a system for a Bantu language. Therefore, this work is a clear indication of how using the rule based system will help to produce highly accurate systems for these under resourced languages.
The error analysis was done sentence by sentence and
When a sentence had two adjectives, their order was changed in the translation, which was heavily penalized by WER and BLEU. We therefore also used PER, which allows word reordering; the error fell from 12.82% (WER) to 10.96% (PER).
In this paper, we have formalized the grammar of the Kikamba language using the high-precision rule-based approach in the interlingua GF environment. The evaluation results are encouraging: a 4-gram BLEU of 83.05%, a WER of 12.82% and a PER of 10.96%. Our contributions are as follows. Firstly, we have provided NLP tools, namely a morphological analyzer and a machine translator, for the under-resourced Kikamba language by extending the GF library, which is a step towards BLARK. Secondly, the wide coverage of the Kikamba computational grammar provides a platform for building multilingual technological applications and for generating scarce bilingual corpora, pairing Kikamba with the other languages present in GF, for experiments with data-driven methods. Finally, we have created a treebank that can be used to evaluate Bantu languages.
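The difference between the two error rates can be reproduced on a toy example. The Python sketch below uses a word-level Levenshtein distance for WER and a common bag-of-words formulation of PER; the scoring script actually used for the reported figures is not described, so these definitions and the English placeholder sentences are assumptions for illustration only.

```python
from collections import Counter


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate: compares bags of words only."""
    ref, hyp = reference.split(), hypothesis.split()
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 1 - (matches - max(0, len(hyp) - len(ref))) / len(ref)


reference = "the bad big man slept"
hypothesis = "the big bad man slept"   # same words, adjectives swapped
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Swapping two adjacent adjectives costs two word edits under WER but nothing under PER, which is the effect described above for sentences with two adjectives.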
Future work would be working on the morphophonological rules of verbs, extending the lexicon so as to handle text and finally including questions as part of the grammar.
We would like to acknowledge the contributions of the following people in terms of Kikamba translation, Kikamba grammar structure and GF expertise: Prof. Kyalo Wamitila, Prof. Angelina Kioko, Dr. Hans Leiß, Dr. Otiso Wambua, Obed Mutiso, Joe Kyalo, Christopher Kithuka, Immaculate Wanza and Rama Munara.
The authors do not have any conflict of interest.
Kituku, B., Nganga, W. and Muchemi, L. (2019) Towards Kikamba Computational Grammar. Journal of Data Analysis and Information Processing, 7, 250-275. https://doi.org/10.4236/jdaip.2019.74015
Form | Kikamba | English |
---|---|---|
TPresASimulPPos | nimakomaa | they sleep |
TPresASimulPNeg | we ndakomaa | he doesn’t sleep |
TPastASimulPPos | nimanakomie | they slept |
TPastASimulPNeg | inyui mutineekoma | you didn’t sleep |
TFutASimulPPos | ithyi tukakoma | we will sleep |
TFutASimulPNeg | we ndukakoma | you won’t sleep |
TCondASimulPPos | makeethiwa makomie | they would sleep |
TCondASimulPNeg | maikeethiwa makoma | they wouldn’t sleep |
TPresAAnterPPos | ithyi nitwakoma | we have slept |
TPresAAnterPNeg | ithyi tuinakoma | we haven’t slept |
TPastAAnterPPos | we niwakomete | he had slept |
TPastAAnterPNeg | we ndwakomete | you hadn’t slept |
TFutAAnterPPos | nyie ngeethiwa ninakoma | they will have slept |
TFutAAnterPNeg | makeethiwa matanakoma | they won’t have slept |
TCondAAnterPPos | we niwesaa kukoma | she would have slept |
TCondAAnterPNeg | we ndesaa kukoma | she wouldn’t have slept |