TITLE:
Development of Machine Learning Models for Kiswahili Text Classification
AUTHORS:
Godfrey Wandwi, Peter Mtesigwa
KEYWORDS:
Kiswahili Text Classification, Morphological Segmentation, Convolutional Neural Network, Long Short-Term Memory, Low-Resource Languages
JOURNAL NAME:
Open Journal of Applied Sciences, Vol.15 No.11, November 19, 2025
ABSTRACT: Text classification plays a critical role in numerous natural language processing applications, yet limited work has addressed the unique linguistic structure of African languages such as Kiswahili. Most existing models treat words as atomic units, relying on standard embedding techniques and overlooking the morphological complexity inherent in agglutinative languages. In this paper, we develop machine learning models specifically tailored to Kiswahili text classification by integrating sub-word-level features derived from morphological segmentation. Our approach combines convolutional neural networks, which extract local patterns from morpheme sequences, with bidirectional long short-term memory (Bi-LSTM) networks, which capture contextual dependencies across entire sentences. The models were trained and evaluated on multiple Kiswahili corpora covering news, social media, and e-commerce reviews to assess robustness across domains. Experimental results demonstrate that incorporating morphological awareness significantly improves classification accuracy compared to baseline models using whole-word embeddings. An ablation study revealed that removing the morphological features reduced the F1-score, while excluding the Bi-LSTM decreased sequence modeling capability, highlighting the contribution of each component. The proposed architecture also performs robustly across these text genres, indicating its adaptability. These findings support the development of language-specific modeling strategies for low-resource languages and advance the field of African language processing.
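To make the contrast between whole-word tokens and morpheme-level tokens concrete, the following is a minimal sketch of rule-based segmentation for simple Kiswahili verb forms. It is an illustration only and not the authors' segmenter, which the abstract does not describe. The affix inventories (SUBJECT_PREFIXES, TENSE_MARKERS), the function names segment and morpheme_tokens, and the slot order (subject prefix, tense marker, stem, final vowel) are assumptions chosen for this example; they cover only a small part of the language and will mis-segment many real words.

```python
# Toy rule-based segmenter for simple Kiswahili verb forms.
# ASSUMPTION: tiny, hand-picked affix lists; not the method used in the paper.
SUBJECT_PREFIXES = ("ni", "tu", "wa", "u", "a", "m")  # longer prefixes listed first
TENSE_MARKERS = ("na", "li", "ta", "me")              # present, past, future, perfect
FINAL_VOWEL = "a"


def segment(word):
    """Split a word into [subject, tense, stem, final vowel] when it fits
    that pattern; otherwise return the lowercased word as a single token."""
    word = word.lower()
    rest = word
    morphemes = []
    for slot in (SUBJECT_PREFIXES, TENSE_MARKERS):
        match = next((m for m in slot if rest.startswith(m)), None)
        if match is None:
            return [word]  # pattern does not apply; keep the whole word
        morphemes.append(match)
        rest = rest[len(match):]
    if len(rest) < 2:
        return [word]  # nothing plausible left to serve as a stem
    if rest.endswith(FINAL_VOWEL):
        morphemes += [rest[:-1], FINAL_VOWEL]
    else:
        morphemes.append(rest)
    return morphemes


def morpheme_tokens(sentence):
    """Flatten per-word segmentations into one morpheme sequence."""
    return [m for w in sentence.split() for m in segment(w)]


if __name__ == "__main__":
    print(segment("walisoma"))   # ['wa', 'li', 'som', 'a']
    print(morpheme_tokens("Watoto walisoma habari"))
```

In a pipeline like the one the abstract outlines, a sequence such as this would be mapped to embeddings and passed to the convolutional and recurrent layers in place of whole-word tokens. A word-level model sees "walisoma" and "ninasoma" as unrelated tokens, whereas the morpheme sequences share the stem "som".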