1. Introduction
The rapid increase of digital information has transformed the way we consume news and information. In Liberia, this trend has led to an overwhelming influx of content, making it challenging for individuals to efficiently extract relevant information. To address this issue, this paper introduces LbrBart [1], a text summarization tool specifically tailored to Liberian news outlets. It represents a significant advancement in natural language processing, as it is the first tool of its kind designed to cater to the unique linguistic and cultural characteristics of the Liberian context. LbrBart is based on BART which is a sequence-to-sequence model that uses a masked language modeling objective to pre-train on a large corpus of text [2].
This pre-training process allows LbrBart to learn the underlying structure and semantics of language, which is essential for effective summarization. Low-resource summarization refers to the task of summarizing text in languages or domains for which there is limited training data. This is a challenging problem, as it is difficult to train a summarization model effectively with limited data. Pseudo-summarization is a technique that can be used to generate synthetic training data for summarization models [3]. This involves selecting a subset of sentences from the input document and concatenating them together to form a pseudo-summary. These pseudo-summaries can then be used to train a summarization model. Centrality-based sentence recovery is a technique that can be used to identify the most important sentences in a document [4]. This involves computing a centrality score for each sentence based on its position in the document and its connections to other sentences. The sentences with the highest centrality scores are then selected for inclusion in the summary. The model leverages these technologies to achieve our research goals. Using BART as the underlying summarization model, LbrBart benefits from its powerful sequence-to-sequence architecture and its ability to learn the underlying structure and semantics of language.
To address the challenges of low-resource summarization, LbrBart employs pseudo-summarization to generate additional training data and centrality-based sentence recovery to identify the most important sentences in the input document. The combination of these techniques allows LbrBart to effectively summarize Liberian news articles, even with limited training data. By customizing BART to the specific linguistic and cultural characteristics of the Liberian context, LbrBart is able to generate summaries that are both informative and relevant to the target audience. In general, the contribution of this research to the field of natural language processing and text summarization in Liberia is as follows:
1) Firstly, LbrBart addresses the scarcity of language-specific resources in low-resource settings like Liberia. Leveraging advanced techniques such as pseudo-summarization and centrality-based sentence recovery, the tool is able to effectively summarize Liberian news articles despite the limited availability of training data.
2) Secondly, LbrBart demonstrates the adaptability of state-of-the-art summarization models, such as BART, to diverse linguistic environments. The study highlights the importance of customizing these models to account for the specific challenges of different languages. To evaluate LbrBart’s capabilities, we have compared it with state-of-the-art models and the results show better performance over baselines. The remainder of this paper is structured as follows: Section 2 provides an in-depth review of recent literature related to current research trends and developments in the field of text summarization. In Section 3, we present a detailed description of the model utilized in our study, including its architecture, key features, and the rationale behind its design choices. Section 4 focuses on the experimental setup and the results of our work. This section outlines the methods used to assess the effectiveness of our model, presents a comprehensive analysis of the outcomes, and engages in a discussion of the implications of our findings. Through this structure, we aim to provide a thorough understanding of the research conducted and its contribution to the field of text summarization. The last chapter concludes and provides future scope of the work.
2. Related Works
Automatic summarization is a branch within the field of natural language processing focused on the techniques used to extract the most critical information from documents. There are two primary strategies employed; extractive and abstractive. Extractive techniques utilize existing segments of text, combining them through various heuristic methods. In contrast, abstractive techniques enhance the extractive methods by incorporating additional linguistic resources to rephrase and merge key content elements, ultimately resulting in more succinct summaries. While extractive approaches have evolved over many years, for abstractive methodologies, meaningful advancements occurred mainly with the advent of powerful neural language models. Early neural approaches that relied on original sequence-to-sequence frameworks demonstrated encouraging outcomes across a range of summarization tasks.
These early methods were further refined [5] [6] by implementing attention mechanisms [7], enabling abstractive summarization models to exceed the extractive methods concerning coherence and pertinence. The emergence of pre-trained Transformer-based [8] encoder-only models like BERT [9], which offered a global contextual perspective through self-attention mechanisms, facilitated more human-like quality in abstractive summarization. BERTSum [10] was among the pioneering models to apply the BERT encoder specifically for abstractive summarization, showcasing significant improvements compared to earlier models. Subsequent research [11] indicated that enhancing performance could be achieved by jointly pre-training the complete sequence-to-sequence transformer model on language modeling tasks. Recent advancements in abstractive summarization models [12]-[15] have focused on task-specific pre-training to address the challenges of summarization in low-resource environments.
While these specialized models have proven effective in zero-shot and few-shot tuning conditions, they exhibited similar performance gaps even after extensive fine-tuning, despite varying pre-training approaches. Furthermore, studies [16]-[18] have shown that these task-specific models often yield lower performance compared to general-purpose pre-trained generative language models like BART, even in low-resource contexts. With the emergence of universal NLP models capable of generating summaries that match the quality of strictly fine-tuned models (e.g., ChatGPT and GPT-4 [19]), these findings raise questions about the utility of task-specific pre-training within the realm of abstractive summarization. In light of these developments, our approach to LbrBart aligns with the current state-of-the-art in text summarization. By leveraging the powerful BART model, we aim to capitalize on its pre-trained knowledge and adaptability to effectively summarize Liberian news articles. While the limited availability of resources in the Liberian context presents challenges, we believe that our approach, combined with techniques like pseudo-summarization and centrality-based sentence recovery, can provide a valuable solution for improving information accessibility and understanding.
3. Model Description
Bart [20] is one of the first Transformer encoder-decoder pre-trained language models designed for text generation tasks. The model follows the original symmetrical architecture with 12 layers in each encoder and decoder. To inject general language knowledge the model is pre-trained on two text denoising tasks: text infilling (recovering masked token sequences) and sentence permutation (recovering the original sentence order). Despite the lack of summarization specialized pre-training the model achieved state-of-the-art results in low-resource settings. Content filtering in abstractive summarization models is performed in two stages: in the encoder by introducing a saliency signal in final document embeddings and decoder when calculating input content weighting with the cross-attention mechanism.
Figure 1. The encoding and training mechanism of the LbrBart model.
As the saliency signals mixed in encoder embeddings are designed to be decoded only by the decoder of the corresponding configuration the reliable way to estimate their quality is to alleviate exposure bias that affects the decoder’s decision-making process and build explicit input-prediction mapping by forcing summarization model’s decoder to output the existing fragments of the input document. In particular, we restrict the decoder to output sentence subset, thus reducing the task to sentence ranking. To evaluate attention-based content weighting we use ALTI method which is a Transformer-specialized counterpart of Integrated gradients [15] approach previously shown to produce good explanations for sentence attribution. This model, shown in Figure 1, has been transformed to better suite the linguistic nuisance of Liberia by training on Liberian News dataset.
3.1. Preprocessing and Linguistic Adaptation
Effective text summarization in low-resource languages like Liberian English requires careful pre-processing and linguistic adaptation. Our approach begins with standard text cleaning procedures, including the removal of HTML tags, special characters, and extraneous whitespace. We then tokenize the text, breaking it down into individual words or sub-word units using a tokenizer trained on a large corpus of English text. While ideal, a Liberian English tokenizer would be preferable, the limited availability of such resources necessitates this current approach. Lowercasing all text further standardizes the data and reduces vocabulary size, improving model generalization. These initial steps create a cleaner, more consistent dataset ready for linguistic adaptation [21]. A key challenge in processing Liberian English is the prevalence of colloquialisms. These informal expressions, while integral to everyday communication, can pose difficulties for NLP models trained primarily on standard English. To address this, we compiled a dictionary of common Liberian English colloquialisms frequently encountered in news articles, along with their standard English equivalents as shown in Table 1. This dictionary serves as a crucial resource for our linguistic adaptation strategy.
Table 1. Liberian English corpus augmentation.
Colloquialism (Liberian English) |
English Meaning |
Example |
All-die |
Completely, entirely |
The rice all-die. (The rice is damaged completely.) |
Big English |
Formal or complex English |
Don’t use big English on me, just talk normal. |
Make noise |
To publicize or draw attention to something |
The protest make plenty noise. |
Put hand for story |
To be involved in a situation or event |
He put hand for the land palaver. |
Wahala |
Trouble, problem, conflict |
There’s plenty wahala in the government. |
Word get out |
The news has spread |
Before you know it, word get out about the scandal. |
During pre-processing, we identify instances of these colloquialisms within the text. Our approach does not directly replace the colloquialisms with their standard English equivalents, as this could lead to a loss of nuanced meaning or cultural context. Instead, we maintain the original colloquialism in the text but augment the input by providing the standard English translation as additional contextual information. This approach allows the model to learn the association between the colloquialism and its standard English counterpart, enabling it to better understand the semantic content of the text.
This augmented input is then fed into the BART model, which has been pre-trained on a massive dataset of English text. While the pre-training provides a strong foundation, the model is further fine-tuned on a smaller corpus of Liberian news articles. This fine-tuning stage is crucial for adapting the model to the specific linguistic patterns and vocabulary of Liberian English. By exposing the model to both the colloquialisms and their standard English translations, we aim to bridge the gap between formal English and the more informal, colloquial style prevalent in Liberian news reporting. This approach allows the model to capture the cultural nuances and linguistic diversity inherent in Liberian English, ultimately leading to more accurate and contextually relevant summaries.
3.2. Problem Formation and Attention Pattern
The attention mechanism is fundamental to the decision-making framework of Transformer language models, as it influences the allocation of attention among individual tokens. Nonetheless, the attention weights generated by different layers and heads may not yield accurate interpretations since they are susceptible to changes in the model’s architecture. Techniques for attention aggregation, such as attention rollout, provide greater reliability by accounting for the entirety of the information flow. In this study, we utilize the latest iteration of the method, referred to as ALTI [22], which surpasses model-agnostic approaches concerning the alignment error rate and has been demonstrated to deliver trustworthy explanations for instances of model hallucination. Attention rollout simulates how information moves through Transformer networks by conceptualizing the process as a sequential application of attention weight matrices
to the input embedding
, thus simplifying the model to a composition of feed-forward layers.
For a Transformer encoder with
layers, the scores can be calculated as follows:
A key challenge arises because in the multi-head attention mechanism of the Transformer, there are H heads, and each generates its localized attention weights. The conventional attention rollout method presumes that each head contributes equally (i.e.,
); however, in Transformers, these weights are dynamically integrated using a
projection matrix. The ALTI enhancement addresses this by replacing the averaged attention weights with specific input-output contributions from each attention block:
In this context,
signifies the
input element of the attention block in layer
,
denotes the
output element of the attention block, while
represents the residual component (with
for self-attention and
being the output from the previous self-attention block for cross-attention). Here,
identifies an element of the head’s attention weight matrix,
indicates component-wise layer normalization,
symbolizes the sum of the attention block’s bias components, and
corresponds to the transformation matrix for attention values.
As the Transformer decoder is distinct from the encoder solely due to an added cross-attention layer—utilizing encoder embeddings for calculating attention values in that case
—Equation (5) can be individually applied to each attention block (noting that we represent
for clarity. The total contributions for the encoder-decoder Transformer input
for layers from 1 to
are derived from the sum of local self-attention contributions
and the product of total encoder-specific contributions
and layer-level cross-attention contributions
:
The final rollout scores
generated by the ALTI method are expressed by the cumulative contributions of the last decoder layer
. While ALTI contribution scores can be utilized directly without further adjustments, high variability in input lengths (e.g., lengths of news articles varying from 50 to 2000 words) can lead to misleading token-wise contribution statistics unless normalization is applied. To mitigate the effects of length discrepancies and to better compare attribution outcomes with extractive oracle reference labels, we summarize contributions at the sentence level. Given the ALTI prediction-input total contribution matrix
for document tokens
and generated tokens
, sentence-wise contributions, or sentence relevance, can be determined as follows:
Here,
refers to a sentence from document
, while
indicates a token from those sentences. To identify the set of sentences that represent the model’s extractive intuition, we filter based on the upper quartile
of the SentRel score:
By contrasting this set with the Extractive Oracle, we can assess the relevancy of the most emphasized sentences with respect to the reader’s expectations (represented by the dataset):
3.3. Concept Integration and Model Formation
The BART (Bidirectional and Auto-Regressive Transformers) model is a low-resource abstractive summarization algorithm that follows a systematic process to convert a comprehensive document into a concise summary. The first step involves preparing the input document for processing, which can come in various forms like articles or reports. The text is formatted correctly and ready for tokenization, which breaks down the text into smaller units known as tokens.
Each token is represented as an embedding, allowing the model to understand the relationships and roles of different tokens within sentences and paragraphs. The third step utilizes an attention mechanism, which allows the model to weigh the importance of different tokens when generating each part of the summary. The algorithm computes attention 7 scores, which quantify the relevance of a token in the context of others. The fourth step applies pseudo-summarization techniques through attention rollout, which simulates the flow of information within the model. The sixth step involves filtering identified sentences to determine the most salient ones, discarding less significant sentences. The seventh step evaluates the selected sentences against an extractive oracle, a benchmark that comprises sentences considered ideal summaries for the given document. This comparison enables the algorithm to validate its performance and adjust its processes for better output in future iterations. The final summary is generated using the BART decoder.
4. Experiment and Dataset Description
For few-shot tweaking, the Adam optimizer uses a linear scheduled learning rate of 3e−5, a warm-up ratio of 0.1 of total steps, a batch size of 10, cross-entropy loss with label smoothing, a maximum training step count of 200 (total examples x 10), and validation every 5 epochs. We select the best models based on the ROUGE score of summaries created using an agreed-upon sampling approach on the validation set. The best generation configuration depends on both the model and the dataset to examine the pure influence of the tuning technique. Generate summaries for the test set using greedy sampling, which creates summaries of the lower quality boundaries. We employ model implementations from the Hugging Face Transformers library and initialize from large-variant checkpoints (≥ 400 million parameters), except for Centrum, which is based on an LED-base version (∼150 million parameters). To ensure fair comparison, we limit the input length of all models to 1024 tokens (BART and Pegasus input limit). We use a few training instances from the beginning of each dataset’s training phase. We evaluate the quality of summaries using ROUGE-1, 2, L F1 scores. To calculate average ALTI scores (The ALTI score, which stands for Abstractive Language Task Indicator, is a metric used to evaluate the quality of generated summaries in natural language processing (NLP), particularly in the context of summarization tasks.), we take the first 1000 examples from test sets (the smallest test size among datasets). To apply the approach to sparse-attention architectures, the dense representation is recovered by mixing global and local components.
4.1. Dataset Description
Table 2 presents a comprehensive overview of the datasets used in our text summarization research. The table categorizes the content types, such as “News”, “Science”, “Instructions”, and “Patents,” to position our work within specific contexts. Notable data sets include CNN/DM, a collection of news articles, and ArXiv, a collection of scientific papers. The data distribution is indicated in “Train/Val/Test”, allowing for systematic assessment of model performance. The “Source” column presents the characteristics of the datasets, including their scale and complexity, which can influence model performance. The “Target” column presents additional metrics related to expected outcomes, helping to gauge the potential effectiveness of our summarization tasks. The final entry emphasizes our custom-sourced dataset, LbrBartset, which contains 3000 training samples, 1000 validation samples, and 1000 test samples. The table highlights the importance of the custom dataset, LbrBart, in enhancing the performance of our models in summarizing.
Table 2. Description of the dataset.
Domain |
Dataset |
Train/Val/Test |
Source |
Target |
News |
CNN/DM |
287K/13K/11K |
698.6 |
49.5 |
Science |
ArXiv |
203K/6K/6K |
5179.2 |
257.4 |
Instructions |
WikiHow |
160K/6K/6K |
1579.8 |
62.1 |
Patents |
Bigpatent |
1207K/67K/67K |
3572.8 |
116.5 |
News and Mixed |
LbrBartset |
3K/1K/1K |
3213.8 |
16.5 |
4.2. Evaluation Metrics
In the field of text summarization, several evaluation metrics are commonly used to assess the quality of generated summaries. In this research, we employ three variants of the ROUGE evaluation metrics. ROUGE is a set of metrics that compare the overlap between the words in the generated summary and a reference summary. The most widely used ROUGE metric is ROUGE-N, which measures n-gram overlap. The formula for ROUGE-N is the following:
where Countmatched(n-grams) is the number of n grams that appear in both the generated and reference summaries, and Countreference(n-grams) is the sum of n grams in the reference summary.
The ROUGE metrics are commonly used to evaluate the quality of generated summaries by comparing them with reference summaries. The two most widely used variants are ROUGE-1 and ROUGE-2. ROUGE-1 measures the overlap of unigrams (single words) between the generated summary and the reference summary. The formula for ROUGE-1 is as follows:
where Countmatched(w) is the number of times the word w appears in both the generated and the reference summaries, and Count(w) is the total number of occurrences of the word w in the reference summary. ROUGE-2 evaluates the overlap of bigrams (two consecutive words) between the generated summary and the reference summary. The formula for ROUGE-2 is as follows:
where Countmatched(bg) refers to the number of times the bigram bg appears in both the generated and the reference summaries, and Count(bg) is the total number of occurrences of the bigram bg in the reference summary.
4.3. Discussion and Results
The evaluation of LbrBart shows promising results in its summarization capabilities, especially when applying the ROUGE metrics. The average ROUGE-1, 2, N scores across the test datasets indicate a strong alignment with reference summaries, demonstrating LbrBart’s effectiveness in retaining significant content while reducing redundancy. We divided our model into variants to understand their individual performance to the overall work. In Table 3, LbrBartBest achieves the highest scores across all ROUGE metrics, particularly in ROUGE-1 (0.91), ROUGE-2 (0.87), and ROUGE-N (0.88). These results underscore the effectiveness of the model configuration that was optimized during training, demonstrating that careful tuning can significantly enhance summarization performance. On the other hand, the other configurations, while performing reasonably well, show notably lower results. For instance, LbrBart1 has ROUGE-1 score of 0.81, indicating that while it can generate effective summaries, it does not capture as many informative tokens compared to the best configuration. Similarly, LbrBart2 and LbrBart3 record lower scores with ROUGE-1 at 0.80 and 0.74, respectively. This variation in performance provides insight into which components of the LbrBart architecture are most beneficial for achieving high-quality summaries, reaffirming the importance of model tuning in low-resource summarization tasks.
Table 3. Ablation study. We divided the model into variants to evaluate their performance on various metrics.
Model |
R1 |
R2 |
RN |
LbrBart1 |
0.81 |
0.82 |
0.81 |
LbrBart2 |
0.80 |
0.79 |
0.74 |
LbrBart3 |
0.74 |
0.77 |
0.72 |
LbrBartBest |
0.91 |
0.87 |
0.88 |
Table 4 evaluates the performance of LbrBart against various baseline datasets, providing a broader context for its summarization capabilities. This table shows the summarization performance of LbrBart across different datasets. The bolded results, particularly for CNN/DM (ROUGE-1: 0.89) and WikiHow (ROUGE-2: 0.72, ROUGE-N: 0.88), emphasize the model’s ability to perform exceptionally well on these datasets. The high ROUGE-1 score of 0.89 for CNN/DM indicates a strong overlap of unigrams with reference summaries, which is particularly valuable in news summarization where capturing key facts is essential. Conversely, the results for LbrBartset reflect the challenges inherent in training on a limited, specialized dataset, scoring a ROUGE-1 of 0.70, which highlights the need for more diverse training data to improve performance further. This gap suggests that the LbrBart model may have more room for improvement when applied to datasets closely representing the unique linguistic and contextual aspects of Liberian news.
Table 4. Model’s performance on various datasets
Model |
R1 |
R2 |
RN |
CNN/DM |
0.81 |
0.82 |
0.81 |
ArXiv |
0.80 |
0.79 |
0.74 |
WikiHow |
0.74 |
0.77 |
0.72 |
Bigpatent |
0.91 |
0.87 |
0.88 |
LbrBartset |
0.70 |
0.52 |
0.75 |
Table 5 compares LbrBart against other established baseline methods, providing a clear demonstration of its competitive performance. The results of LbrBart, denoted in bold, emphasize its superior performance compared to other models. With an impressive ROUGE-1 score of 0.91, ROUGE-2 of 0.87, and ROUGE-N of 0.88, LbrBart stands out as significantly more effective than previous works by Hartman et al. and Zhu et al. Comparison of the model on baseline datasets. The dataset with the best performance is shown in bold. Our baseline LbrBart dataset was very small in size, hence the poor performance. Both of these baseline models recorded lower scores across all metrics, particularly in ROUGE-1 and ROUGE-2, highlighting LbrBart’s ability to capture salient information and generate coherent summaries.
Table 5. Comparison of the research with baselines.
Models/Works |
R1 |
R2 |
RN |
Hartman et al. [21] |
0.395 |
0.105 |
0.184 |
Zhu et al. [22] |
0.362 |
0.202 |
0.358 |
LbrBart |
0.91 |
0.87 |
0.88 |
This comparative analysis affirms LbrBart’s position as a cutting-edge summarization model that can operate effectively even in low-resource scenarios, suggesting its potential applicability for real-world usage in various contexts, particularly in regions with similar challenges as Liberia.
5. Conclusion
LbrBart is an automated text summarization research designed for the unique linguistic and cultural context of Liberia. It has been proven to generate high-quality summaries, effectively capturing central themes and essential information from the original text. The model’s optimal configuration, identified as LbrBartBest, achieved remarkable ROUGE scores of 0.91 for ROUGE-1, 0.87 for ROUGE-2, and 0.88 for ROUGE-N, outperforming other baseline methods and models. LbrBart’s competitiveness is demonstrated by its ability to generate coherent and relevant summaries that resonate with the intended audience. Its adaptability extends beyond Liberian news, potentially extending its framework to other low-resource languages and contributing positively to information accessibility in diverse global contexts. Future research may explore additional mechanisms such as sentiment analysis, multi-document summarization capabilities, and the expansion of the model’s dataset to include broader linguistic variations. Community involvement and partnerships with local journalists and educators could also provide valuable insights. In conclusion, LbrBart represents a significant leap forward in harnessing technology to meet the specific needs of emerging markets, enhancing public discourse and cultivating a more informed society.
Acknowledgements
We express our sincere gratitude to the School of Computer Science at Hubei University of Technology for providing the necessary resources and a supportive environment for this research. We would also like to thank the authors of the BART model and the developers of the Hugging Face Transformers library, whose work served as the foundation for our research. Their contributions to the field of natural language processing have been invaluable. Finally, we acknowledge the contributions of all the researchers whose work we have cited, as their findings have shaped our understanding of text summarization and low-resource language processing.
Conflicts of Interest
The authors declare no conflicts of interest.