GUARDIAN: A Multi-Tiered Defense Architecture for Thwarting Prompt Injection Attacks on LLMs

Abstract

This paper introduces a novel multi-tiered defense architecture to protect language models from adversarial prompt attacks. We construct adversarial prompts using strategies like role emulation and manipulative assistance to simulate real threats. We introduce a comprehensive, multi-tiered defense framework named GUARDIAN (Guardrails for Upholding Ethics in Language Models) comprising a system prompt filter, pre-processing filter leveraging a toxic classifier and ethical prompt generator, and pre-display filter using the model itself for output screening. Extensive testing on Meta’s Llama-2 model demonstrates the capability to block 100% of attack prompts. The approach also auto-suggests safer prompt alternatives, thereby bolstering language model security. Quantitatively evaluated defense layers and an ethical substitution mechanism represent key innovations to counter sophisticated attacks. The integrated methodology not only fortifies smaller LLMs against emerging cyber threats but also guides the broader application of LLMs in a secure and ethical manner.

Citation: Rai, P., Sood, S., Madisetti, V. and Bahga, A. (2024) GUARDIAN: A Multi-Tiered Defense Architecture for Thwarting Prompt Injection Attacks on LLMs. Journal of Software Engineering and Applications, 17, 43-68. doi: 10.4236/jsea.2024.171003.

1. Introduction

In this era of rapid technological progress, Large Language Models (LLMs) stand as a cornerstone in the realm of artificial intelligence. They have an exceptional ability to generate text that closely resembles human writing. This ability has created a revolution in varied domains, from content generation to intricate natural language processing tasks. Despite their proficiency, however, they are not immune to attacks by malicious actors. One such attack is manipulating an LLM into producing unethical or harmful content through prompt injections.

One of the primary reasons LLMs can generate harmful text lies in their foundational training process. These models learn from vast datasets compiled from the internet, encompassing a wide spectrum of human language and behavior. While this training approach endows LLMs with the ability to mimic human-like writing effectively, it also exposes them to the internet’s array of biases, inaccuracies, and potentially harmful content. Moreover, because LLMs lack the ability to understand context or the ethical implications of their output in the same way humans do, they can inadvertently produce responses that are inappropriate, offensive, or unethical.

Such attacks present a grave risk, threatening to undermine trust and compromise the integrity of a wide range of applications, including digital content creation, interactive chatbots, and educational platforms. Even with advanced safety measures in place, there remain loopholes that allow carefully crafted prompts to bypass these defenses, leading to the dissemination of undesirable content.

This issue poses a significant risk, particularly for smaller LLMs, which are increasingly utilized due to their efficiency and lower operational costs. While these models offer a myriad of benefits, their scaled-down nature often means reduced defenses against sophisticated cyber threats. They generally lack the sufficiently implemented and regularly updated safety guardrails and defenses that popular LLMs backed by large companies enjoy.

Our research crucially addresses the cybersecurity of smaller Large Language Models (LLMs). With the growing prominence of these cost-effective and efficient models, their vulnerability to cyber threats becomes a pressing issue. We introduce a comprehensive, three-tiered defense framework specifically tailored for smaller LLMs. This framework comprises a System Prompt filter, a Pre-Processing Filter, and a Pre-Display Filter, each serving as a critical layer of defense against malicious prompt injections. The integrated approach ensures that even if one layer is compromised, the subsequent layers provide a robust safety net. Moreover, recognizing that some user prompts may appear harmful to the filters despite carrying no malicious intent, our framework includes a feature that intelligently suggests alternative, safer prompts to users, aligning with ethical standards while promoting responsible engagement with LLMs.

Our proposed framework not only strengthens the security against malicious prompt injections but also enhances trust and broadens the adoption of LLMs in diverse sectors. Its global relevance and adaptability pave the way for future AI safety research and have potential implications for AI policies and regulations. Ultimately, our work contributes to the sustainable and responsible advancement of AI, balancing technological progress with ethical standards and societal trust.

The subsequent sections will delve into a detailed examination of each defense layer, their collective functionality, and the wider implications for securely and ethically implementing smaller LLMs. Our initiative represents a significant step forward in shielding LLMs from exploitation, supporting the overarching goal of making these models both powerful and secure for a variety of applications.

The paper demonstrates an effective framework, GUARDIAN (Guardrails for Upholding Ethics in Language Models), to secure language models against prompt injection attacks. The key research contributions and novel aspects of the paper are as follows:

• Proposes a 3-tiered defense architecture comprising a system prompt filter, pre-processing filter, and pre-display filter to protect language models against prompt injection attacks.

• Crafts a dataset of adversarial prompts using strategies like assurance of ethical use, alternate reality simulation, etc. to test defenses.

• Innovative ethical prompt auto-suggestion feature provides a safe alternative aligned with user intent.

• Comprehensive testing methodology using crafted adversarial prompts to simulate attacks on Llama-2 model.

• Detailed quantitative analysis presented on blocking rates of prompts at each defense layer. The first layer blocks 40% of attack prompts by adding an ethical reminder to the system prompt. The second layer uses a fine-tuned classifier to flag toxic inputs and an ethical prompt generator to suggest alternatives, raising the cumulative blocking rate to 60%. The third layer leverages the model itself to screen outputs for ethics, blocking the remaining attack prompts for 100% coverage.

2. Statement of the Problem

The primary issue our research addresses is the vulnerability of Large Language Models (LLMs) to prompt injections by malicious actors, with a specific focus on the heightened risks faced by smaller LLMs. While LLMs have become integral to various sectors, their capability to generate text is not immune to attacks that can lead to unethical or harmful outputs. This problem is rooted in their training on diverse internet-sourced datasets, which, despite enabling a broad linguistic understanding, also expose them to biases and potentially harmful content.

The challenge is more acute in smaller LLMs. They lack the comprehensive defense mechanisms found in larger, more resource-intensive LLMs, which are backed by large corporations, well maintained, and equipped with well-implemented security guardrails. This disparity leaves smaller LLMs more susceptible to sophisticated prompt injections, potentially leading to the generation and dissemination of inappropriate or biased content.

This gap in defense capabilities is critical. The inadequacy of existing solutions for smaller LLMs not only poses a risk of producing undesirable content but also threatens to undermine user trust and the reliability of applications dependent on these models. It is imperative that this problem gets addressed, ensuring the safe, ethical, and effective use of LLMs, particularly the smaller variants, in a variety of applications.

To develop a defense solution, we first need to devise and craft prompts that successfully break the LLM on which we conduct our testing, i.e., Llama-2 in our case. Llama-2 [1] is a sophisticated large language model developed by Meta AI. The model’s enhanced capabilities and emphasis on safety and helpfulness make it a significant contribution to the field of AI and natural language processing. Since it is regularly updated and fortified by a team of engineers, jailbreaking this model is not an easy task. For our research, we had to specially craft prompts by combining various techniques to jailbreak the Llama-2 model, as discussed in the attack architecture.

Our research is directed towards developing a nuanced, multi-layered defense framework that is specifically designed to bolster the security and ethical operation of smaller LLMs, while also being relevant to LLMs in general. This approach aims to bridge the gap in existing defenses, contributing to the broader goal of ensuring responsible and trustworthy use of LLM technologies.

3. Literature Review

3.1. Baseline Reference

The baseline reference for our research is “LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked” by Helbling et al. [2]. This study introduces a novel approach to enhance Large Language Model (LLM) security, addressing the significant issue of LLMs generating harmful content in response to user prompts. Despite efforts to align LLMs with human values, they are susceptible to adversarial attacks. The paper proposes a self-defense mechanism where an LLM filters its own responses to determine whether they are harmful. Using models like GPT-3.5 and Anthropic’s Claude as harm filters, the research demonstrated high accuracy in identifying harmful content, although it also noted challenges with ambiguous responses and model abstentions. This research is a critical contribution to LLM security, suggesting self-validation by LLMs as a viable method to prevent the generation of harmful content. The harm filter corresponds to the third filter in our proposed solution; we improve upon the baseline paper by integrating it into our complete 3-layered defense architecture.

3.2. Other Related Literature

The field of Large Language Model (LLM) security has seen recent advancements through a series of papers. These studies address a wide range of adversarial threats, including prompt injection, jailbreaking attacks, and the complex challenge of maintaining alignment with human values. We discuss a few approaches in the following sections.

Liu et al. [3], in their paper “Prompt Injection Attacks and Defenses in LLM-Integrated Applications,” delve into the critical issue of prompt injection attacks on models like GPT-3 and GPT-4. They argue that current literature lacks a systematic approach to understanding and defending against these threats. To bridge this gap, they propose frameworks that not only formalize these attacks but also provide systematic defenses. This work is pivotal in establishing a foundational understanding of prompt injection vulnerabilities and defense mechanisms in LLMs.

Robey et al. [4] focus on jailbreaking attacks in their work “SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks.” They introduce the SmoothLLM algorithm, designed to counteract these attacks by perturbing input prompts and aggregating predictions. This method is a significant step in identifying and neutralizing adversarial inputs, contributing to the robustness of LLMs against sophisticated attacks.

Bochuan Cao et al. [5] in their paper “Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM,” shed light on the concept of Robustly Aligned LLMs (RA-LLMs). This method aims to defend against alignment-breaking attacks that exploit LLMs’ potential for generating harmful content. The authors propose a robust alignment checking function, which is a groundbreaking approach in enhancing the resilience of LLMs, ensuring they remain aligned with intended ethical guidelines and human values across various applications.

Chen et al. [6] present an innovative Moving Target Defense (MTD) enhanced LLM system in their study “Jailbreaker in Jail: Moving Target Defense for Large Language Models.” This system significantly reduces the vulnerability of LLMs to adversarial attacks and has shown effectiveness across multiple commercial LLM platforms. This research represents a crucial advancement in dynamic defense strategies, adapting to evolving threats in real-time.

Wei et al. [7] investigate the vulnerabilities of LLMs to adversarial attacks in their paper “Jailbroken: How Does LLM Safety Training Fail?” They highlight the necessity for more sophisticated safety mechanisms beyond simple model scaling, pointing out the limitations of current approaches and suggesting directions for future research in enhancing the safety of LLMs.

Kumar et al.’s [8] “Certifying LLM Safety against Adversarial Prompting” introduces the “erase-and-check” method, which enhances LLM safety against adversarial prompts. This approach systematically removes tokens from prompts and checks the subsequences for harmful content, offering a novel and effective method for enhancing the safety of LLM interactions.

Mozes et al. [9] address the use of LLMs for illicit purposes in their comprehensive paper. They provide an overview of scientific efforts in threat identification, prevention strategies, and understanding vulnerabilities. This paper highlights the importance of ongoing research and peer review in this field, underscoring the necessity of a multi-faceted approach to understanding and mitigating the potential misuse of LLMs.

Wu et al. [10] delve into defending ChatGPT against Jailbreak Attack in their paper through a technique called System-Mode Self-Reminder. This method significantly lowers the success rate of Jailbreak Attacks, emphasizing the importance of proactive and innovative defense strategies in safeguarding LLMs against emerging threats.

Gelei Deng et al. [11] tackle jailbreaking attacks on LLMs used in chatbots in “MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots.” The authors introduce a framework for automating the generation of jailbreak prompts and reverse-engineering LLM defenses, providing crucial insights into the mechanics and vulnerabilities of LLM-based chatbots.

Rao et al. [12] address the challenge of jailbreaking LLMs in their study “Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks.” They present a formalism for categorizing these attacks and conduct an empirical analysis of their effectiveness. This study is integral in understanding the nature and impact of such threats, offering valuable perspectives for developing more robust defense mechanisms.

Finally, Shen et al. [13] focus on the threat posed by jailbreak prompts in “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models.” Their comprehensive study analyzes the evolution and effectiveness of these prompts, highlighting the inadequacies of current defenses and the need for continued advancement in this area.

Together, these papers exemplify the complex and multifaceted nature of adversarial threats to LLMs and the diverse nature of approaches being developed to secure these models. They also underscore the critical importance of continued research, innovation, and collaboration in the field to ensure the ethical and secure application of LLM technology in our rapidly changing AI-driven digital world.

4. Methodology

4.1. Overview

This section outlines a systematic approach for testing and validating the proposed 3-layered defense mechanism against prompt injection attacks, emphasizing the integration of development, testing, and refinement stages.

4.2. Adversarial Prompt Dataset Creation for Testing

• Objective: To develop a comprehensive dataset of prompts capable of challenging the LLM’s defenses, aiding in robustness testing.

• Procedure:

1) Compilation: Assemble a dataset of prompts, including both known successful attack vectors and novel prompts designed to probe system vulnerabilities.

2) Categorization: Classify prompts based on their nature and complexity to systematically assess each defense layer.

3) Use in Testing: Employ this dataset in each testing phase to rigorously evaluate each layer’s resilience.

4.3. Filter 1: System Prompt Layer Analysis

• Objective: To evaluate the effectiveness of the system prompt layer in identifying and mitigating potentially harmful or unethical input prompts.

• Procedure:

1) Data Collection: Utilize the previously compiled prompt dataset.

2) Testing: Introduce these prompts to the system to assess the initial layer’s ability to detect and neutralize potential threats.

3) Metrics: Find the blocking accuracy of the filter.

4.4. Filter 2: Custom Fine-Tuned Layer Evaluation

• Objective: This study aims to deploy a refined classification Large Language Model (LLM) tailored to a specialized dataset, enhancing the identification of potentially harmful content within user queries. These queries might have previously evaded detection by the initial filtering mechanism. Furthermore, the study introduces an Ethical Prompt Auto-Suggestion feature, offering morally acceptable alternatives to user queries that may be deemed unethical.

• Procedure:

1) Dataset Acquisition and Preprocessing for Classification: The Google Jigsaw’s Toxic Comment Classification Dataset [14] is employed, undergoing specific preprocessing to align with the requirements of our classifier LLM model.

2) Creating and Fine-tuning the Classifier: The “bert-base-uncased” model [15] is selected as the foundational architecture for fine-tuning our classifier, chosen for its demonstrated efficacy in general classification tasks.

3) Generating and Preprocessing the Dataset for the Prompt Generator: Because no analogous dataset of ethical prompt alternatives is currently available on the internet, we create a bespoke dataset of ethical prompts from the aforementioned Jigsaw Dataset. It is generated using the “Zephyr-7B-α” LLM model [16] . This dataset is then processed to prepare it for the refinement of our prompt generator.

4) Creating and Fine-tuning the Prompt Generator: The “Zephyr-7B-α” model is adopted as the baseline for our prompt generator’s fine-tuning, selected due to its unique capability of processing potentially unethical content, without stringent content restrictions.

5) Metrics: The primary metrics for assessing both models are training loss and accuracy. Given the bespoke nature of the text dataset, no established benchmark exists for this type of task, necessitating this approach.

4.5. Filter 3: Pre-Display Filter Validation

• Objective: To verify the efficacy of the pre-display filter, which employs another LLM to ensure the ethicality of generated outputs.

• Procedure:

1) Integration: Incorporate a secondary LLM (or the same) that examines outputs for ethical compliance.

2) Simulated Attacks: Test with a range of outputs, those from prompts which failed to get blocked by the previous layers.

3) Metrics: Focus on the accuracy of the ethical filter in identifying unethical content.

4.6. Testing Method

• Objective: To implement LM Studio [17] for local operation of the Llama-2 model, enabling manual testing of prompts.

• Procedure:

1) Setting up Llama-2 model on a local machine using LM Studio, which facilitates experimenting with LLMs and supports ggml-compatible models from Hugging Face.

2) Conducting manual testing by inputting each prompt individually, observing and documenting the model’s responses.

3) Assessing the effectiveness of prompts in testing the model’s defense mechanisms.

4.7. Ethical Considerations

This subsection outlines the ethical considerations for focusing on defenses against prompt injection attacks.

• Collaborative Ethical Standards: Aligning with both personal research principles and the conventional guidance to uphold high ethical standards.

• Professional Guidance and Oversight: Leveraging industry best practices and expert guidance in ethical decision-making, particularly for sensitive content.

• Compliance with Applicable Laws and Policies: Adhering to relevant data protection and privacy laws in academic and broader contexts.

• Transparency in Research Methods: Maintaining clear communication about methodologies, data sources, and processing.

• Ethical Review and Accountability: Conducting a self-imposed ethical review and seeking additional oversight as needed.

• Responsible Data Handling: Implementing secure data management practices for sensitive content.

• Impact and Sensitivity Assessment: Regularly assessing potential impacts on communities, focusing on minimizing harm.

• Open Feedback Loop: Inviting feedback from academic peers and advisors for continuous improvement.

• Adaptive Approach to Ethical Concerns: Staying responsive to new ethical insights and societal norms.

• Commitment to Ethical Research Practice: Upholding a balance between knowledge pursuit and responsibility.

This methodology outlines a structured approach for the development, testing, and refinement of a 3-layered defense mechanism against prompt injection attacks in LLMs. The project aims to enhance the security and ethical standards of LLMs in handling diverse and sophisticated prompts through iterative testing and refinement.

5. Our Multi-Filter Defense: Architecture

We propose a comprehensive 3-layer defense strategy to protect the model against prompt injection attacks. To evaluate our defenses, we first need specially crafted prompts that are able to break the LLM, which in our case is the Llama-2-7b-chat model. These prompts test and attempt to defeat the model’s built-in guardrails against such attacks. A diverse dataset containing these prompts is created and used to test the LLM. After the attack testing is done, the proposed defenses are evaluated individually against the prompt dataset. Once the individual testing is done, the defenses are employed altogether to test their collective efficacy against the unethical prompts.

5.1. Attack Approach

Llama-2 stands as a highly resilient model, fortified with numerous safety measures to shield against prompt injections and the generation of harmful or unethical content. The engineering team at Meta continuously strengthens its security barriers, progressively making it more challenging to compromise. Previously effective prompts that could circumvent these defenses are now obsolete, as associated vulnerabilities have been addressed and resolved. Nevertheless, despite these enhancements and fixes, certain vulnerabilities remain exploitable. Through extensive testing, iteration, and refinement, we successfully crafted a dataset containing unethical prompts capable of breaching the security safeguards of the Llama-2 model.

Dataset: In our dataset, we have crafted five primary prompts that are specifically tailored to target and exploit weaknesses within the Llama-2 model. These prompts, incorporating a range of jailbreak techniques outlined in the following section, are designed to circumvent the model’s robust defense systems. Each attack prompt in our dataset is structured as a base prompt followed by a supplementary sub-prompt. These sub-prompts are of an unethical nature, such as requesting guidance on how to break into a house. After the base prompts initiate the jailbreak, it is these sub-prompts that actually provoke the LLM to generate unethical responses. The sub-prompts are a mix of requests for unethical advice and actual illicit advice. However, they are calibrated to avoid extreme content that would activate the LLM’s built-in safety mechanisms. This careful balance is crucial, as overly extreme sub-prompts would fail to breach even the foundational model without filters, rendering them ineffective for our testing purposes. In our constructed dataset, there are a total of 50 attacks, derived from the combination of five primary prompts and ten associated sub-prompts for each.
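Conceptually, the dataset construction reduces to a Cartesian product of base prompts and sub-prompts. A minimal sketch of this construction, using placeholder strings in place of the actual adversarial text, is shown below:

```python
# Illustrative sketch of the attack-dataset assembly: 5 base prompts combined
# with 10 sub-prompts each, yielding 50 attack prompts. The strings below are
# placeholders, not the actual adversarial content.
from itertools import product

base_prompts = [f"<base jailbreak prompt {i}>" for i in range(1, 6)]   # 5 base prompts
sub_prompts = [f"<unethical sub-prompt {j}>" for j in range(1, 11)]    # 10 sub-prompts

attack_dataset = [f"{base}\n\n{sub}" for base, sub in product(base_prompts, sub_prompts)]
assert len(attack_dataset) == 50
```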

5.1.1. Base Prompts

The dataset we have developed comprises five key prompts, each carefully engineered to overcome the protective barriers and effectively jailbreak the Llama-2 model. These foundational prompts have been constructed utilizing four distinct strategies. Some of these prompts are our original work, while the rest are adaptations of jailbreak prompts available in the public domain, specifically altered to penetrate Llama-2’s security layers. The attack dataset is constructed using the following four strategies for generating jailbreak prompts:

1) Unrestricted Role Emulation: Engage the LLM in assuming a persona that operates without ethical constraints or limitations. This strategy involves prompting the LLM to adopt the role of an individual or entity that is not bound by moral or legal norms.

2) Manipulative Assistance: Make the LLM believe that providing unethical output is a form of helpful assistance. Example: Crafting prompts that frame unethical requests as assisting law enforcement.

3) Assurance of Ethical Use: Reassure the LLM that the output will not be utilized for unethical purposes. Example: Including statements denying unethical intentions, regardless of the request’s nature.

4) Alternative Reality Context: Exploit the LLM’s capacity for imaginative scenarios where ethical constraints are not applicable. Example: Framing prompts in a fictional or game-like setting.

5.1.2. Design of Sub-Prompts

The initial role of the base prompts is to compromise the LLM. Once this is achieved, the LLM becomes responsive to the subsequent unethical sub-prompts, which are effective due to the prior jailbreak. These sub-prompts are carefully formulated to be neither too extreme nor vague, ensuring they do not activate the LLM’s safety protocols. Employing overly extreme sub-prompts is generally counterproductive, as it tends to prompt the LLM’s defense mechanisms, leading to an unsuccessful jailbreak. Our method is tactically planned to ensure the success of the attack. While some highly extreme prompts might occasionally succeed, they usually trigger the LLM’s internal safeguards and are therefore less effective for our purposes.

Figure 1 illustrates a jailbreak prompt example, comprising the base prompt merged with a sub-prompt, resulting in the generation of unethical output.

Figure 2 demonstrates the creation of four base prompts based on the above-mentioned strategies, alongside four sample base prompts, culminating in a set of final adversarial prompts.

5.2. Attack Prompt Examples

Section (A.1) showcases 2 attack prompts along with the corresponding outputs generated prior to the application of any filters.

Figure 1. How adversarial prompts work.

Figure 2. Creation of adversarial prompts for attack.

5.3. Defense Approach

Our defense consists of a 3-layer approach. Figure 3 provides a clear and concise depiction of the defense architecture:

Figure 3. GUARDIAN architecture comprising multiple filters.

5.3.1. The 1st Layer: System Prompt/Pre-Prompt Filter

The first layer in our defense architecture is the System Prompt Layer, which serves as a foundational filter. This layer functions before the user’s prompt is processed by the LLM. It includes an additional line, “Keep in mind the ethical boundaries,” integrated into the LLM’s default system prompt. This phrase acts as a constant reminder to the LLM, reinforcing its duty to produce outputs that are both ethical and safe. This precaution enhances the model’s existing security protocols without being overly restrictive. Careful consideration was given to the complexity of this system prompt. If it were too intricate, there’s a risk that the LLM might become excessively constrained, potentially leading to a refusal to respond to legitimate and safe queries. Therefore, the prompt was deliberately balanced to maintain the LLM’s operational effectiveness while upholding high ethical standards, ensuring that it serves as a subtle yet effective guardrail.
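A minimal sketch of this filter is shown below. It assumes an OpenAI-style chat-completions endpoint (such as the local server LM Studio can expose) and illustrative parameter values rather than our exact deployment:

```python
# Sketch of the Layer-1 system prompt filter: the ethical reminder is appended
# to the default system prompt before the user prompt reaches the model.
# The endpoint URL and payload format are assumptions (an OpenAI-style local
# server), not the exact setup used in our experiments.
import requests

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."
ETHICAL_REMINDER = "Keep in mind the ethical boundaries."

def query_llm(user_prompt: str, url: str = "http://localhost:1234/v1/chat/completions") -> str:
    payload = {
        "messages": [
            {"role": "system", "content": f"{DEFAULT_SYSTEM_PROMPT} {ETHICAL_REMINDER}"},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.7,
    }
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```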

5.3.2. The 2nd Layer: Pre-Prompt Processing Filter

The second defense mechanism in our proposed multi-filter approach is the Pre-Prompt Processing Layer (PPPL). This operates at the crucial juncture wherein the Large Language Model (LLM) has received the user’s input but has not yet initiated processing. In this phase, upon receiving a prompt, its ethical score is assessed using a meticulously fine-tuned classifier model. Should the prompt be judged ethical, it is then passed on to the main LLM for processing and generation. Conversely, if deemed unethical, the system refrains from outright rejection of the user’s request. Instead, PPPL advances an alternative prompt that is ethically aligned yet semantically akin to the original, potentially unethical, input. This feature of suggesting an ethical counterpart was integrated to address instances where user inquiries, though inherently safe, are articulated in a manner that could be misconstrued as unethical by the LLM. By offering an ethically sound alternative, the system ensures continuity in the user’s query processing, thereby circumventing the potential disruption that might arise from the safety constraints imposed on guard-railed LLMs. The architecture of the second filter is illustrated in Figure 4, and a minimal code sketch of this layer follows the figure.

Figure 4. Architecture for Filter 2.
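The sketch below illustrates the control flow of this layer, assuming the fine-tuned classifier and ethical prompt generator have been saved locally; the model paths, threshold, and rewrite prompt wording are illustrative rather than our exact configuration:

```python
# Sketch of the Pre-Prompt Processing Layer (Filter 2).
from transformers import pipeline

# Fine-tuned toxicity classifier (stage 1); the local path is an assumption.
toxicity_classifier = pipeline("text-classification",
                               model="./bert-toxic-classifier",
                               top_k=None)          # return a score for every toxicity label

# Fine-tuned ethical prompt generator (stage 2); the local path is an assumption.
ethical_generator = pipeline("text-generation", model="./zephyr-ethical-generator")

def preprocess_prompt(prompt: str, threshold: float = 0.5):
    results = toxicity_classifier(prompt)
    # Depending on the transformers version, the result may be nested one level.
    scores = results[0] if isinstance(results[0], list) else results
    # Every label here is a toxicity category, so any high score flags the prompt.
    if not any(s["score"] >= threshold for s in scores):
        return prompt, None                          # deemed ethical: forward to the main LLM
    # Deemed unethical: suggest a semantically similar but ethical alternative.
    suggestion = ethical_generator(
        f"Rewrite the following prompt so that it is ethical: {prompt}",
        max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    return None, suggestion                          # blocked: offer the ethical alternative instead
```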

Exploratory Data Analysis on the Data Used

The Jigsaw Toxic Comment Classification Dataset is a large collection of Wikipedia comments, meticulously labeled by human raters for various types of toxic behavior. This dataset comprises around 230,000 data points, each derived from Wikipedia talk page comments. The labeling process, conducted by Jigsaw, identifies different subtypes of toxicity within these comments, including:

1) Toxicity: Comments that are generally offensive or harmful.

2) Severe Toxicity: Comments that are extremely offensive or harmful.

3) Obscenity: Comments containing profane or vulgar language.

4) Threatening Language: Comments that include threats or aggressive content.

5) Insulting Language: Comments that are derogatory or demeaning.

6) Identity Attacks: Comments that target aspects of a person’s identity, such as race, gender, or religion.

Additionally, the dataset includes one-hot encoding of comment tags for each of these toxic categories. Some comments in the dataset are multi-tagged, indicating that they may exhibit more than one type of toxic behaviour.

Upon examination of the dataset, a significant imbalance among the various sub-categories was detected, which posed a potential challenge for refining the classification model. Such an imbalance, predominantly biased towards the majority class, could have led to a skewed model, inherently biased towards this class. To mitigate this issue, we opted for an “undersampling” strategy. This approach was employed to cultivate a more equitably distributed dataset, encompassing approximately 32,000 entries, thereby ensuring a balanced representation across the different sub-categories.
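An illustrative version of this undersampling step, assuming the standard Kaggle column layout (a comment_text column plus the six binary label columns), is shown below; the exact sample sizes may differ from our implementation:

```python
# Illustrative undersampling of the Jigsaw dataset to balance toxic and clean
# comments; column names follow the Kaggle release, random seeds are arbitrary.
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
df = pd.read_csv("train.csv")

toxic = df[df[LABELS].sum(axis=1) > 0]     # comments with at least one toxic tag
clean = df[df[LABELS].sum(axis=1) == 0]    # entirely clean comments

# Undersample the majority (clean) class down to the size of the toxic class,
# then shuffle, yielding a roughly balanced dataset (~32k rows in our setup).
balanced = pd.concat([toxic, clean.sample(len(toxic), random_state=42)]).sample(frac=1, random_state=42)
```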

Prompt Classification Model

The model represents a complete workflow for developing a text classification model to identify toxic comments, leveraging a pre-trained BERT model and custom training configurations. The fine-tuning process was performed using a range of libraries and tools, including the Hugging Face Hub and the Transformers library. The following outlines the steps and parameters involved in the fine-tuning process; a condensed code sketch follows the list.

1) Environment Setup:

- We ensured that the necessary libraries and dependencies were installed, including torch, transformers, pandas, numpy, matplotlib, seaborn, and scikit-learn, covering data handling (numpy, pandas), visualization (matplotlib, seaborn), and machine learning (torch, scikit-learn).

- Natural language processing and model libraries were imported from the HuggingFace Hub (transformers.BertTokenizer, BertForSequenceClassification).

- We utilized an RTX A6000 GPU for training the model, specifying parameters such as batch size, gpu_layers, and epochs to manage the computational resources effectively.

2) Data Preparation:

- The Google Jigsaw Toxic Comment Dataset, obtained from Kaggle, containing comments and their toxicity labels, is loaded from a CSV file. Basic data exploration includes viewing the dataset structure and summarizing label distributions.

- Visualization of class distribution is done using bar plots to understand the frequency of the 6 labels like “toxic”, “severe_toxic”, etc.

- The dataset is then divided into subsets based on toxic and clean comments. A balanced dataset is created by sampling from these subsets to ensure equal representation of toxic and clean comments and avoid biasing.

- The final balanced dataset is split into training, testing, and validation sets.

3) Model Configuration:

- The model chosen for this task is pre-trained BERT (Bidirectional Encoder Representations from Transformers) model from the HuggingFace library for text classification tasks, specifically BertForSequenceClassification.

- BERT’s tokenizer is utilized to convert text data into a format suitable for model input, including tokenization and encoding. This step is crucial for preparing the text data for input into the BERT model.

4) Training Configuration:

- A DataLoader is created using TensorDataset and DataLoader from PyTorch to handle the batch processing of the dataset during training.

- The BERT model is configured with the AdamW optimizer, and key training parameters are set using the TrainingArguments class from the transformers library. Key parameters included the number of training epochs, batch sizes, learning rate, and optimizer type.

5) Training Execution:

- The training process, done using the train() method, involves fine-tuning the BERT model on the preprocessed text data, learning to classify the comments into different categories like “toxic”, “severe_toxic”, etc. This includes iterating over batches of data, feeding them to the model, and updating the model weights.

6) Model Saving:

- Upon completion of the training process, the fine-tuned model was saved to disk using the save_pretrained() method.
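A condensed sketch of this workflow is shown below. Hyperparameter values, file paths, and column names are illustrative assumptions rather than our exact training configuration:

```python
# Illustrative fine-tuning of bert-base-uncased as a multi-label toxicity
# classifier; "balanced_jigsaw.csv" stands in for the balanced dataset
# produced in the data preparation step and is an assumed file name.
import pandas as pd
import torch
from transformers import (BertTokenizer, BertForSequenceClassification,
                          TrainingArguments, Trainer)

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
balanced = pd.read_csv("balanced_jigsaw.csv")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",   # sigmoid outputs with per-label BCE loss
)

class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.float)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}

train_ds = ToxicDataset(balanced["comment_text"].tolist(), balanced[LABELS].values.tolist())

args = TrainingArguments(output_dir="./bert-toxic-classifier",
                         num_train_epochs=3,
                         per_device_train_batch_size=32,
                         learning_rate=2e-5,
                         logging_steps=100)
Trainer(model=model, args=args, train_dataset=train_ds).train()
model.save_pretrained("./bert-toxic-classifier")
tokenizer.save_pretrained("./bert-toxic-classifier")
```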

Exploratory Data Analysis on the Generated Ethical Data

The development and generation of the dataset and model for ethical data generation from unethical sources involved a multi-step process. We sourced the initial dataset from Kaggle, specifically a refined version of the “Google Jigsaw Toxic Comment Classification” dataset. This dataset comprises the original five labels, along with an added Toxicity Label, which is scaled from 0 to 5. The toxicity score represents the cumulative total of all individual labels assigned to a comment.

For our purposes, we selectively focused on comments with toxicity scores ranging from 1 to 4. This selection criterion was strategically chosen to exclude comments that are entirely non-toxic, as well as those that exhibit extreme toxicity. Such a filtering approach was instrumental in maintaining an equilibrium within the dataset, thereby facilitating more effective processing by the model.
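For illustration, assuming the refined dataset exposes the aggregate score in a toxicity column (a hypothetical column name), the filtering step reduces to:

```python
# Illustrative selection of comments with moderate toxicity scores (1-4).
import pandas as pd

df = pd.read_csv("jigsaw_with_toxicity_score.csv")      # assumed file and column names
moderate = df[df["toxicity"].between(1, 4)]              # drop fully clean (0) and extremely toxic (5) comments
```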

Unethical to Ethical Data Generation Model

The selected model for this endeavor is the Zephyr-7B-α, a refined variant of the well-established Mistral model [18] . The primary rationale behind opting for this particular model was its absence of inherent safety guardrails. This characteristic uniquely positions the Zephyr-7B-α model to process unethical user prompt data effectively. In contrast, many alternative models typically either refuse to process such data or provide ethical justifications for their inability to handle unethical inputs. The Zephyr-7B-α’s flexibility in this regard thus made it a fitting choice for our specific research requirements.

1) Environment Setup:

- This includes the installation of essential libraries for the model’s setup, specifically transformers, optimum, and auto-gptq.

- Natural language processing and model libraries were imported from the HuggingFace Hub.

2) Data Preparation:

- The unethical comment dataset obtained in the previous step is loaded from a CSV file.

3) Model Configuration:

- A MistralForCausalLM model is initialised, part of the Fine-tuned Zephyr-7B-α model, indicating the focus on causal language modeling tasks. This model is part of the transformers library and is adapted for sequence-to-sequence tasks.

- Input Length Configuration: The notebook sets the maximum input length for the model to 3000 tokens using the exllama_set_max_input_length function, allowing for the handling of longer sequences.

4) Inference Configuration:

- Model Inference Setup: For generating responses, the notebook sets up the model for inference, including configuring parameters like temperature, top_p, top_k, and maximum_token_length for the generation.

- Text Generation: The model is used to generate text responses, with the inference process involving the generate function from the transformers library. The parameters set for generation indicate a focus on diversity and controlled randomness in the output.

5) Text Cleaning and Post-Processing:

- After generating text, the notebook includes cells for cleaning and post-processing the output. This includes splitting the text based on specific tags, removing unwanted characters, and formatting the text.

6) Dataframe Operations:

- The notebook demonstrates the use of Pandas DataFrame for storing and managing the generated text, indicating an organized approach to handling the model’s output.

This provides a detailed view of how the model is configured and used for inference in the notebook, showcasing a sequence-to-sequence modeling approach with a focus on handling and generating language data efficiently.
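A minimal sketch of this generation pass is shown below; it loads the unquantized Zephyr-7B-α checkpoint and uses illustrative generation parameters and prompt wording, whereas our notebook used a GPTQ-quantized variant via auto-gptq:

```python
# Illustrative ethical-rewrite generation with Zephyr-7B-alpha. The system
# prompt, sampling parameters, and decoding details are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuggingFaceH4/zephyr-7b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def rewrite_ethically(unethical_comment: str) -> str:
    messages = [
        {"role": "system", "content": "Rewrite the user's text as an ethical, non-toxic request with the same intent."},
        {"role": "user", "content": unethical_comment},
    ]
    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=True,
                             temperature=0.7, top_p=0.95, top_k=50)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```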

Ethical Prompt Generator Model

This subsection describes the techniques and configurations used for fine-tuning the Zephyr-7B-α model with LoRA and quantization, as well as the setup used for generating and processing text for inference tasks; a condensed configuration sketch follows the list.

1) Environment Setup:

- It includes installing essential libraries, including transformers, datasets, trl, peft, accelerate, bitsandbytes, auto-gptq, optimum, pandas, and scikit-learn.

- Natural language processing and model libraries were imported from the HuggingFace Hub.

2) Data Preparation:

- The unethical comment dataset and the ethical comment dataset obtained in the previous steps are loaded from CSV files.

- We merge the dataframes and split the data into train, validation, and test sets (with a distribution of 70%, 15%, 15%). We also convert the dataframe into a format compatible with the datasets library, which facilitates easier management and utilization of the data during training.

3) Model Configuration:

- Quantization and LoRA Configuration: The notebook features configuration for model quantization and Low-Rank Adaptation (LoRA). This is indicative of efforts to optimize the model’s performance and computational efficiency.

- Quantization: Implemented using GPTQConfig to enable 4-bit precision base model loading (specified in the bits parameter), aimed at reducing model size, potentially accelerating the training process, and improving inference speed.

- LoRA Adapter Configuration: The LoRA adapter is configured with specific parameters such as the attention dimension (denoted as lora_r), scaling factor (lora_alpha), and dropout probability (lora_dropout), which are crucial for the model’s ability to learn fine-grained adjustments during fine-tuning.

4) Training Configuration:

- TrainingArguments Class: We set up training arguments using the TrainingArguments class. These arguments include output_dir, per_device_train_batch_size, gradient_accumulation_steps, optim (optimizer), learning_rate, lr_scheduler_type, save_strategy, and logging_steps.

5) Training Execution:

- The SFTTrainer class, from the trl library, was employed for the supervised fine-tuning of the model using our specific dataset. This trainer was set up with the pre-arranged dataset, tokenizer, training arguments, and the LoRA configuration. The training process commenced with the invocation of the train() method within the SFTTrainer class. Throughout this training phase, the model was trained to craft responses derived from the input text present in our dataset.

6) Inference Configuration:

- Model Generation Setup: The notebook demonstrates the use of the model for generating text, which involves configuring generation parameters such as temperature, top_p, top_k, and max_new_tokens. This setup is critical for controlling the randomness and creativity of the generated output.

7) Performance Metrics:

- We measure the time taken for generation, indicating a focus on evaluating the model’s efficiency and responsiveness during inference.
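A condensed configuration sketch reflecting this setup is given below. All hyperparameter values, file names, and the quantized checkpoint identifier are illustrative assumptions, and the trl/peft API shown corresponds to the library versions available at the time of writing:

```python
# Illustrative LoRA + 4-bit GPTQ fine-tuning of the ethical prompt generator
# with trl's SFTTrainer. Checkpoint name, CSV layout (a "text" column), and all
# hyperparameters are assumptions, not our exact configuration.
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer
from datasets import load_dataset

model_id = "TheBloke/zephyr-7B-alpha-GPTQ"               # assumed quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, disable_exllama=True),  # exllama kernels disabled for training
    device_map="auto",
)

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,   # LoRA rank, scaling, dropout
                         task_type="CAUSAL_LM")

dataset = load_dataset("csv", data_files={"train": "ethical_pairs_train.csv"})["train"]

args = TrainingArguments(output_dir="./zephyr-ethical-generator",
                         per_device_train_batch_size=4,
                         gradient_accumulation_steps=4,
                         learning_rate=2e-4,
                         lr_scheduler_type="cosine",
                         optim="paged_adamw_32bit",
                         save_strategy="epoch",
                         logging_steps=50)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset,
                     peft_config=peft_config, tokenizer=tokenizer,
                     dataset_text_field="text", max_seq_length=512)
trainer.train()
```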

Main LLM Processing Model

For our work we chose Llama-2 as the main large language model. This selection was based on its widespread availability and exceptional accuracy, especially for language generation tasks. It is important to acknowledge, though, that other similar models could also fulfil this role effectively.

5.3.3. The 3rd Layer: Pre-Display Filter

The third defense layer in our structure is the Pre-Display Filter, also termed Post-Generational Ethical Screening. This mechanism comes into play at the crucial point when the LLM has generated output, but it has not yet been presented to the user. In this stage, the output is subjected to an ethical review by the pre-display filter. This assessment of potential unethical content is conducted regardless of the nature of the prompt that elicited the output.

Our approach involves the Llama-2 model itself functioning as the ethical filter. We pose a direct inquiry to the LLM, asking it to evaluate whether the generated output is harmful. Based on the LLM’s response to this query, a determination is made: if the LLM classifies the output as unethical or unsafe, it is effectively blocked from being displayed to the user. Conversely, if the LLM assesses the output as ethical and safe, it is then displayed on the screen. This method ensures a robust and dynamic assessment, leveraging the LLM’s capabilities to reinforce our system’s ethical standards.

Selection of Secondary LLM: For our work, we selected Llama-2 as the secondary LLM, owing to its status as a widely accessible and highly accurate model, particularly adept at identifying unethical outputs. However, other comparable models could also serve this purpose effectively.
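A minimal sketch of this screening step is shown below; the screening question wording and the OpenAI-style local endpoint are illustrative assumptions rather than our exact prompt:

```python
# Sketch of the pre-display filter: the generated output is handed back to
# Llama-2 with a direct harmfulness question, and displayed only if the model
# judges it safe. Endpoint, wording, and the blocked-message text are assumptions.
import requests

def is_output_harmful(generated_output: str,
                      url: str = "http://localhost:1234/v1/chat/completions") -> bool:
    screening_prompt = (
        "Does the following text contain harmful, unethical, or illegal content? "
        "Answer with a single word, Yes or No.\n\n" + generated_output
    )
    payload = {"messages": [{"role": "user", "content": screening_prompt}], "temperature": 0.0}
    verdict = requests.post(url, json=payload, timeout=120).json()["choices"][0]["message"]["content"]
    return verdict.strip().lower().startswith("yes")

def display_if_safe(generated_output: str) -> str:
    if is_output_harmful(generated_output):
        return "This response was blocked by the pre-display filter."
    return generated_output
```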

6. Main Results

In our constructed dataset, there are a total of 50 attacks, derived from the combination of five primary prompts and ten associated sub-prompts for each. These prompts were selected through a process of trial and error, based on their consistent success in circumventing the Llama-2 base model’s security measures when no defensive mechanisms were in place. This resulted in the model demonstrating a 0% success rate in thwarting these attacks. Subsequently, we will examine the prompts that were effectively blocked following the application of each additional security layer.

6.1. System Prompt Layer Results

In the initial stage of defense, the system prompt layer acts as a preliminary filter. Before processing the main input prompt, the system evaluates the prompt against this layer. This initial defense successfully intercepted 2 out of 5 base prompts, as well as all 10 associated sub-prompts. Consequently, this layer effectively thwarted 20 out of the 50 attack prompts, resulting in a blocking accuracy of 40% after the application of the first filter. The prompts that were blocked typically involved requests for the Language Model to operate without ethical constraints or limitations. The outcomes are shown in Table 1.

6.2. Pre-Processing Layer Results

The second layer of our defense system plays a crucial role in identifying and neutralizing jailbreak prompts, particularly those that involve fictional game scenarios. These prompts, often characterized by their toxic and obscene nature, are rigorously screened in this layer. Here is where the Toxicity Classifier Layer comes into play: as soon as such a prompt is detected, it triggers flags for the corresponding output classes of the classifier. This screening ensures that the prompt is promptly recognized and effectively blocked, demonstrating the layer’s efficiency in maintaining the ethical standards and safety of the system’s outputs. This flagging mechanism exemplifies the robustness of our defense strategy against various forms of inappropriate content. The outcomes are shown in Table 2. Thus, after the application of the 2nd filter, 30 out of 50 prompts (60%) are blocked.

Table 1. Results of the system prompt layer.

Table 2. Results after applying Pre-Processing Filter in addition to the 1st filter.

After the Flagging, if needed, the user can use the Ethical Prompt Auto-Suggestion Layer’s functionality to substitute their unethical prompt with an ethical version. The Ethical version should ideally be able to pass the second filter without any problems.

6.3. Pre-Display Filter Results

The third and final layer in our defense strategy is the pre-display filter. This layer acts as a crucial safeguard, targeting outputs that have navigated past both the system prompt filter and the second filter without detection, due to the absence of explicit toxicity. Despite this, the third filter effectively identifies and blocks these outputs for producing content that is unethical or illegal. It successfully intercepts the residual 2 base prompts along with all 10 sub-prompts, ensuring comprehensive coverage. Consequently, this final filter blocks all remaining attack prompts, culminating in a 100% blocking rate at the conclusion of the process. The outcomes are shown in Table 3.

Disclaimer: Our defense architecture reached a 100% blocking rate with our dataset’s attack prompts, but this may not hold for all jailbreak prompts, as attackers could develop prompts that bypass our system.

6.4. Generation Model Performance Metrics

Table 4 and Table 5 illustrate the performance metrics for two models, “FLAN-T5-Large” (780M parameters) and “Zephyr-7B-α” (7B parameters) respectively.

The selection of the FLAN-T5-Large and Zephyr-7B models was informed by specific criteria. Zephyr-7B, characterized by its minimal use of guardrails, exhibits an inherent capability to process prompts that may be deemed unethical without modifications. Initial testing revealed that Zephyr-7B displayed promising outcomes even prior to fine-tuning, when guided with specific system prompts. The subsequent decision to fine-tune Zephyr-7B aimed to standardize its output format and enhance its ability to identify a broader spectrum of toxic prompts.

Conversely, the FLAN-T5-Large model was selected due to its exemplary performance in sequence-to-sequence tasks, particularly in summarization, translation, and question-answering. Preliminary evaluations of the base model indicated suboptimal performance for the specific task outlined in this study. This under-performance may be attributed to the highly specialized nature of the task, which diverges from the conventional applications for which this model is optimized.

Given the partial alignment of our task with summarization, we proceeded to fine-tune the FLAN-T5-Large model using methodologies akin to those employed in summarization tasks. This involved prefixing the input with specific instructions to guide the model towards the desired output.

For the evaluation metrics, we utilized both ROUGE and Perplexity. Perplexity serves as a vital indicator of a model’s text generation capabilities, with lower scores denoting better predictive accuracy and text fluency. ROUGE (Recall-Oriented Understudy for Gisting Evaluation), on the other hand, assesses automatic summarization and machine translation by comparing machine-generated text to a set of reference texts, typically crafted by humans. This metric is particularly relevant for tasks involving direct comparison between generated and reference texts.
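Concretely, perplexity is the exponential of the mean cross-entropy loss on held-out data, and ROUGE can be computed with standard tooling; the snippet below is a small illustration with example values rather than our reported results:

```python
# Worked example of the two metrics. Perplexity = exp(mean cross-entropy loss);
# ROUGE is computed here with the Hugging Face `evaluate` package. The loss
# value and the prediction/reference strings are illustrative only.
import math
import evaluate

eval_loss = 1.85                      # example mean cross-entropy loss on held-out data
perplexity = math.exp(eval_loss)      # ~6.36: lower is better

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=["please explain home security measures"],
                       references=["please describe how to secure a house"])
print(perplexity, scores["rouge1"], scores["rougeL"])
```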

Table 3. Results after applying pre-display Filter in addition to the 1st and 2nd filters.

Table 4. FLAN T5 model results.

Table 5. Zephyr model results.

In the comparative analysis of the FLAN T5 and Zephyr models for text generation tasks, Zephyr demonstrated superior performance, as evidenced by more favorable perplexity scores. This enhanced performance is likely attributable to Zephyr’s foundational design as a CausalLM model, which affords it greater proficiency in understanding and processing natural language contextually. Empirical evidence from our tests also supports this conclusion, showing Zephyr’s heightened ability to process test samples (not included in training) more effectively than the FLAN-T5-Large model.

Notably, we do not report ROUGE scores for the Zephyr model; ROUGE is more suited to tasks where the generated text is directly compared to a reference text, such as summarization. In text generation, especially for creative or open-ended outputs, such direct comparisons are less applicable. The quality of generated text in these scenarios is often assessed based on factors like coherence, relevance, and creativity, which are not quantifiable by ROUGE scores. Therefore, the absence of ROUGE scores in the evaluation of Zephyr is aligned with the nature of its application in text generation, where perplexity serves as a more relevant and informative metric.

6.5. Illustrative Examples of the Solution Components

Section (A.2) shows an example where a jailbreak prompt is passed and gets blocked by the 1st filter, i.e., the system prompt filter.

Section (A.3) shows an example where a jailbreak prompt is passed and gets blocked by the 2nd filter, i.e., the pre-processing filter.

Section (A.4) shows an example where a harmful output generated by a jailbreak prompt is passed and the 3rd filter, i.e., the pre-display filter, blocks it.

Section (A.5) presents a case of an unethical prompt along with the ethically appropriate output generated by the Zephyr model.

7. Conclusions and Future Work

We now address the ongoing challenge of enhancing the robustness of large language models (LLMs), particularly in the context of countering jailbreak prompts. Despite significant advancements in models like Llama-2 7b, these systems remain vulnerable to sophisticated bypass techniques. As outlined above, we introduced a comprehensive three-tiered defense mechanism specifically designed to thwart such jailbreak attempts. Our experimental analysis reveals that this multi-layered approach not only effectively neutralizes these threats but also offers additional improvements at each stage, with subsequent filters successfully intercepting prompts that may elude earlier layers. Furthermore, a key contribution in our approach is the implementation of an ethical prompt generator within the second defense tier. This generator plays a crucial role by proposing alternative, ethically sound responses to replace potentially harmful or unethical content, ensuring both the integrity and safety of the model’s outputs.

Building upon existing solutions, we developed a multi-layered filtering system as a solution to address potentially harmful prompts. Our literature review and research into existing methods revealed various alternatives, but we opted for this three-level approach for its superior effectiveness compared to single-filter systems. Each layer in our model offers robust protection, forming a comprehensive defense. However, we expect that more sophisticated adversarial prompts could still penetrate our defenses, highlighting the importance of continuous monitoring and updates to our filtering techniques to counter new adversarial strategies.

Our decision to implement a three-tiered filtering system stems from our experimental findings with adversarial prompts. These tests revealed that no single filter could effectively counteract all types of adversarial prompts. Adversarial prompts, crafted by attackers to exploit specific vulnerabilities in a Large Language Model’s (LLM) prompt processing stages, are complex and sophisticated. They are designed to launch various forms of attacks, potentially disrupting multiple aspects of an LLM’s operations. Consequently, a diversified filtering approach was deemed essential. This approach ensures that if one layer fails, subsequent layers provide additional defense and act as a safety net, preventing the adversarial prompt from further processing.

As previously described, our proposed solution is designed to function as an auxiliary framework, capable of integration with the existing LLM processing pipelines. At its essence, our system comprises a sequence of input prompt verification filters. These filters are tactically placed to operate after the reception of user input but before the presentation of the LLM-generated response to the user. By situating the filters at this juncture, they serve as an initial checkpoint, enhancing the security of LLMs, particularly those lacking inbuilt protective mechanisms. The implementation of these filters is crucial in bolstering the overall security structure, significantly mitigating potential threats posed by unfiltered inputs in LLMs that are otherwise devoid of such preemptive safeguarding measures.

In this paper, we proposed an effective approach to tackle the issues arising from adversarial prompts and prompt injection attacks. We recognize that there might be other strategies, possibly more effective, that deviate from our multi-filter model. One possible alternative is creating an extensive database that records known attack patterns or signatures, aiming to filter input prompts using this information. Although this idea has potential, it faces significant challenges, such as the need for substantial data collection and management, and it may still struggle with zero-day scenarios. Regarding future research on the method outlined here, we propose three areas for additional investigation:

1) Advancement of Jailbreak Prompts: Future efforts may concentrate on developing more complex and varied jailbreak prompts to test and enhance the resilience of our defense systems. This involves creating sophisticated prompts with diverse attack methodologies to evaluate and strengthen our three-tiered defense against new threats.

2) Improving the Performance of the Second Filter: Currently, there are three main stages in Filter 2:

a) Refinement of Classification Models: The focus will be on developing more efficient classification LLMs that require less pre-processing, thereby reducing data handling and training costs. Additionally, exploring alternatives to LLM-based classifiers could provide new insights, considering their current dominance in Natural Language Processing tasks.

b) Creation of an Ethical Dataset via LLM Generation: This stage would investigate the potential of bypassing the need for LLM-based dataset generation if a well-curated and robust dataset is available. This would streamline the process, allowing for direct fine-tuning of the ethical prompt generation model.

c) Enhancement of Ethical Prompt Generation Model: The objective is to develop advanced LLMs that are more adept at distinguishing between ethical and unethical contexts, as well as improving general sequence-to-sequence tasks. Key improvements will involve using cleaner datasets and superior computational resources to refine the model’s performance in ethical prompt generation.

3) Exploration of Various LLMs in the Third Filter: We aim to replace the current Llama-2 model in the pre-display filter with other LLMs. Future research will include comparative analyses of various LLMs to identify the most effective ones for detecting and neutralizing threats, thereby optimizing the final defense layer’s efficiency and reliability.

Appendix

A.1. Unethical Prompts and Outputs

Table A1. Two unethical prompts and their respective outputs generated.

A.2. Output after Applying First Filter

Table A2. An unethical prompt getting blocked by the first filter.

A.3. Output after Applying Second Filter

Table A3. An unethical prompt getting blocked by the second filter.

A.4. Output after Applying Third Filter

Table A4. Output generated by an unethical prompt getting blocked by the third filter.

A.5. Ethical Generator Output

Table A5. Ethical output generated from an unethical prompt.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Xiang, J., Xu, K.P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S. and Scialom, T. (2023) Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
[2] Phute, M., et al. (2023) LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. arXiv preprint arXiv:2308.07308.
[3] Liu, Y.P., et al. (2023) Prompt Injection Attacks and Defenses in LLM-Integrated Applications. arXiv preprint arXiv:2310.12815v1.
[4] Robey, A., et al. (2023) SmoothLLM: Defending Large Language Models against Jailbreaking Attacks. arXiv preprint arXiv:2310.03684.
[5] Cao, B.C., et al. (2023) Defending against Alignment-Breaking Attacks via Robustly Aligned LLM. arXiv preprint arXiv:2309.14348.
[6] Chen, B.C., et al. (2023) Jailbreaker in Jail: Moving Target Defense for Large Language Models. MTD’23: Proceedings of the 10th ACM Workshop on Moving Target Defense, November 2023, 29-32.
https://doi.org/10.1145/3605760.3623764
[7] Wei, A., et al. (2023) Jailbroken: How Does LLM Safety Training Fail? arXiv preprint arXiv:2307.02483.
[8] Kumar, A., et al. (2023) Certifying LLM Safety against Adversarial Prompting. arXiv preprint arXiv:2309.02705.
[9] Mozes, M., et al. (2023) Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities. arXiv preprint arXiv:2308.12833.
[10] Wu, F.Z., Xie, Y.Q., Yi, J.W., et al. (2023) Defending ChatGPT against Jailbreak Attack via Self-Reminder.
https://doi.org/10.21203/rs.3.rs-2873090/v1
[11] Deng, G., et al. (2023) MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv preprint arXiv:2307.08715.
[12] Rao, A., et al. (2023) Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. arXiv preprint arXiv:2305.14965.
[13] Shen, X.Y., et al. (2023) Do Anything Now: Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models. arXiv preprint arXiv:2308.03825.
[14] Wulczyn, E., Thain, N. and Dixon, L. (2017) Ex Machina: Personal Attacks Seen at Scale. In: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Perth, 1391-1399.
https://doi.org/10.1145/3038912.3052591
[15] Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 4171-4186.
[16] Tunstall, L., Beeching, E., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A.M. and Wolf, T. (2023) Zephyr: Direct Distillation of LM Alignment. arXiv preprint arXiv:2310.16944.
[17] (2023) LM Studio—Discover, Download, and Run Local LLMs.
https://lmstudio.ai/
[18] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T. and El Sayed, W. (2023) Mistral 7B. arXiv preprint arXiv:2310.06825.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.