1. Introduction
The cybersecurity sector is no exception to the technological revolution. It is confronted with the rise of Artificial Intelligence (AI), which can analyse metadata in record time. This is enabling us to make a major leap forward in a number of areas, particularly cybersecurity [1]. Phishing is a social engineering technique designed to obtain confidential information by posing as a trusted entity [2]. Phishing attacks [3] have increased exponentially, affecting individuals and organisations alike, according to reports and studies on cyber security published by the Anti-Phishing Working Group (APWG). This APWG report for the fourth quarter of 2023, detailed in Figure 1, observed almost five million phishing attacks, or 4,987,809. These attacks constitute the worst phishing year on record and are still considered to be a record phishing year [4]. Piter Cassidy (published on 14 April 2024).
Figure 1. APWG report on trends in phishing activity in Q4 2023.
Traditional phishing detection methods, such as blacklists, are struggling to keep pace with the rapid evolution of phishing techniques. Other approaches have already been used to filter phishing sites, sometimes applicable at different stages of the attack flow. These include network protection, authentication, user education and classifiers. Artificial intelligence would be an indispensable aid in tackling new cyber-attacks in all their forms, using machine-learning algorithms [5]. Phishing is a common attack technique used by cybercriminals to trick users into divulging confidential information such as login credentials, passwords and financial information, in order to compromise their activity on the platform [6].
This study focuses on the growing adoption of artificial intelligence in the field of cybersecurity. It is conducted particularly in the context of phishing detection. It explores the various Machine Learning techniques used for phishing detection, such as supervised learning algorithms [7]. Artificial intelligence and Machine Learning (ML) techniques have shown significant potential for improving the detection and prevention of phishing attacks [8]. This capability enables proactive detection of attacks, even when phishing techniques are constantly evolving to bypass traditional defenses.
Several studies have focused on analyzing the contribution of artificial intelligence in phishing detection. Yang et al. [9] proposed a multidimensional phishing detection approach based on deep learning in multidimensional features by combining URL statistical features. On the other hand, authors of [10]-[12], in their studies, went with a single classifier while analysis in order to better exploit the potentiality offered by supervised machine learning with all its algorithms and models as is the case for our study. Authors of [5] [6] [9] [12]-[19], AI is playing an increasingly important role in cybersecurity, they discussed a way to identify the URL of a given website is a phishing site or not by going through two popular machine learning algorithms such as SVM and random forest. Authors of [20], [21] compared the results of several machine learning methods for predicting phishing websites and achieved very good performance by assembling the Random Forest and XGBoost classifiers, a process we considered time-consuming.
Furthermore, these aforementioned studies praise the applicability of AI in cybersecurity without particularly exploiting supervised machine learning algorithms combined with deep learning, which is our hobbyhorse. We are focusing on regression problems where the prediction involves a continuous value, and the models are trained on labelled data in order to accurately predict the results for new instances. The objective of this study is to analyse 3 classifiers used in machine learning and 1 in deep learning for the detection and prediction of phishing websites in AI such as Naive Bayes, Decision Tree, Gradient Boosting and Multi-Layer Perceptron. Compare these models after study and analysis in order to identify their limitations and strengths. Draw and find a more effective model to propose and guide future perspectives in this area of research.
After this introduction, our paper proceeds with a theoretical framework of phishing techniques, followed by phishing detection hardware and methods, description of the dataset, experimental results, limitations and future prospects, contribution and finally conclusion.
2. Theoretical Framework of Phishing Technics
2.1. Theoretical Framework
In this section, we will define and describe phishing and URL, phishing anatomy and structure, examine some known phishing techniques and even those used by criminals to deceive people. To date, numerous techniques have been introduced to eliminate phishing attacks and keep users safe online. Spoofed e-mails and false URLs remain difficult to detect and unavoidable [22].
2.2. What Is Phishing?
Phishing is a type of cybercrime in which a digital attacker impersonates a trusted entity to obtain sensitive information. Phishing can target individuals for banking credentials or credit card data, or could also target organizations through their employees. Deloitte estimates that a whopping 91% of cyber-attacks begin with a phishing email. Phishing is a major problem for organizations trying to stay safe, as employees are often targeted as a means of infiltrating the business. With so much at stake, it really pays to understand the biggest phishing attacks1. Phishing is based on social engineering. Its whole premise is to build up a degree of trust with the victim so that they feel comfortable providing information. This could involve the hacker posing as a colleague, manager or technology support. Once the victim is convinced that they are speaking with a known entity, the stage is set for the attack.
2.3. Anatomy of URL Phishing
A URL (Uniform Resource Locator) is a type of uniform resource identifier (URI) used to access information from remote computers, such as a web server or cloud storage space. It contains various elements, including the network communication protocol, a sub-domain, a domain name and its extension. URL phishing is a fraudulent activity that involves diverting people to a fake website where they download malware or expose their confidential information. Before understanding how attackers proceed when creating a phishing domain, we’ll introduce the structure of the URL. Figure 2 shows that a URL is made up of several distinct parts2.
Figure 2. The URL structure [23].
2.4. Phishing Awareness Training and Software
To combat the ever-changing nature of phishing attacks, organizations and individuals can benefit from phishing awareness training programs and software solutions. These tools inform users about the latest phishing techniques, provide simulated phishing scenarios to test their awareness, and offer real-time protection against malicious e-mails. Investing in such resources can significantly improve a person’s ability to identify and avoid falling victim to phishing scams.
2.5. Phishing Techniques
We take a look at some well-known phishing methods used by criminals to trick people: Fraudulent e-mails, Fraudulent websites, Fraudulent text and SMS messages, Phishing attacks on social networks, Fraudulent phone calls, Targeted phishing (spear phishing), Fake tech support fraud.
3. Phishing Detection Materials, Method and Techniques
This section highlights the materials and methods used in this study to achieve the assigned objectives, but also goes on to describe and present the data set used for the experiments and outlines the algorithms used in our study. Machine learning as a field of Artificial Intelligence relies on mathematical and statistical approaches to provide simplified and efficient methods for data analysis [24]. It can automate repetitive operations, analyze large data sets, facilitate more informed and accurate decision-making, personalize customer experiences and optimize operational processes to increase profitability [25]. The aim of machine learning is to find flexible patterns in data and for specific tasks such as predictions, phishing detection. ML algorithms can be broadly classified into three types as summarized in Figure 3, namely supervised for labeled observations, unsupervised for unlabeled observations and reinforcement learning for models that learn from errors to improve accuracy [26].
![]()
Figure 3. Flow chart illustrating the classification of different machine learning algorithms into supervised, unsupervised and reinforcement models [26].
3.1. Rationale for Model Selection
This study focuses on four algorithms and models, combining both machine and deep learning. The choice of these models is based on their respective capabilities: 1) Naïve Bayes is based on Bayes’ theorem, assumes conditional independence between features, simple and fast to train, effective for classifying textual data, such as spam e-mail detection, and can work with a small training dataset. 2) Decision trees widely used for classification and regression, easy to interpret, handle numerical and categorical data, visually understandable and help explain model decisions and adaptability to mixed data (numerical and categorical). 3) Powerful Gradient Boosting for regression and classification, combines several weak decision trees to form a robust model. With high accuracy by aggregating weak models, but also adaptable to complex data and non-linear interactions. 4) Multi-Layer Perceptron (MLP) Deep neural network, learns complex representations from data, capable of capturing non-linear patterns, models complex relationships between features. It is often used for classification, regression and other tasks.
This choice of models covers a wide spectrum of approaches, from the simplest and most interpretable (Naïve Bayes, decision trees) to the most complex and powerful (Gradient Boosting, MLP). This has enabled us to study the trade-offs between performance, interpretability and adaptation to new phishing threats.
3.2. Description and Representation on the Data Set
Most of the machine learning models examined here fall into the category of supervised machine learning. To update our dataset with new phishing sites, we also implemented code that extracts features of new phishing sites provided by the Kaggle website collected from the well-known public repository Kaggle.com3. In addition, each sample is tagged with a class identifier indicating whether it is categorized as a phishing website (1) or not (−1). An overview of the data set reveals a total of 11,054 samples, each composed of 32 features. Of these, 31 are independent and 1 dependent. All features are of integer type, avoiding the need for the Label Encoder transformation. No outliers are detected in the dataset [27]. There are no missing values in the dataset. The algorithm for the methodology used in this study comprises the following steps: Data acquisition and preparation: Loading of the dataset containing instances of suspected phishing websites. Performing exploratory data analysis (EDA) to better understand the structure and content of the dataset. Data visualization: Use of visualization techniques to illustrate patterns and trends within the dataset. Data pre-processing and splitting: Pre-processing of data to ensure compatibility with ML algorithms, including feature scaling and encoding of categorical variables. Partitioning of the dataset into training and test sets to facilitate model evaluation. Model training: Implementation of various supervised machine learning algorithms, and neural networks, to train predictive models. Model comparison: Evaluation of the performance of trained models using appropriate measures, such as accuracy, precision, recall and F1 score. Comparison of the effectiveness of different ML algorithms in accurately classifying phishing websites. Each website is marked as either legitimate or phishing.
3.3. The Algorithms Used
Supervised machine learning, a common technique, is used to predict outcomes from a given set of features, using example feature-label pairs. In regression problems such as ours, where prediction involves a continuous value, models are trained on labeled data to accurately predict outcomes for new instances. The regression models selected are:
3.3.1. Decision Trees
The decision tree is a non-parametric supervised learning method used for classification and regression tasks. It divides the feature space into segments to make predictions. They are among the most versatile supervised learning models, and have the important property of being easy to interpret [28]. A decision tree is, as shown in Figure 4, a binary tree-like data structure that is used to make a decision. Trees are a very intuitive way of displaying and analyzing data, and are commonly used even outside the field of machine learning. A decision tree helps individuals make better decisions through a tree graph or modeling of alternatives and their possible implications, such as likely outcomes, resource costs and utility [29]. With the ability to predict both categorical and real values, as well as the ability to integrate numerical and categorical data without any normalization or creation of indicator variables, it’s not hard to see why they are a popular choice for machine learning.
![]()
Figure 4. Presentation of the decision tree classifier4.
3.3.2. Gradient Boosting Classifier
The Gradient Boosting Classifier is a powerful ensemble learning algorithm that combines several weak learners to create a strong predictive model. It sequentially trains a series of decision trees, each correcting the errors of its predecessors, ultimately producing a highly accurate model. It belongs to the family of machine learning methods based on ensemble learning. It is a boosting algorithm, i.e. it builds a predictive model as a weighted combination of simpler models, usually decision trees [19]. Here are the main steps of the algorithm: Initialization: The model starts with a very simple decision tree, which makes poor predictions. Boosting iterations: At each iteration, a new decision tree is added to the model. This tree is trained to predict the residual errors (difference between true values and predictions) of the previous model. Model update: The model is updated by combining the new small decision tree with the previous trees, assigning greater weight to those trees that reduce the error the most [30]. Stop: The process stops when a maximum number of iterations is reached, or when a stop criterion is met (for example, when performance improvement is deemed sufficient).
3.3.3. Naive Bayes
The Naive Bayes algorithm is a supervised learning algorithm based on Bayes’ theorem and used to solve classification problems. It is mainly used for text classification involving a high-dimensional training data set. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, making it possible to build fast machine learning models capable of making rapid predictions [31]. This is a probabilistic classifier, which means it predicts based on the probability of an object. Some popular examples of the Naïve Bayes algorithm are spam filtering, sentimental analysis and article classification [32].
3.3.4. Multi-Layer Perceptron Classification
The MLP is a type of artificial neural network composed of several layers of nodes, capable of learning complex patterns in data. It is a type of artificial neural network used, as shown in Figure 5, for classification and regression5. It consists of several layers of interconnected neurons: 1) Input layer: receives the problem’s input variables (features), the number of neurons in this layer corresponds to the number of input variables. 2) Hidden layer(s): these intermediate layers apply non-linear transformations to the input data. The number of hidden layers and neurons per layer is chosen by the user during model design, and each neuron in a hidden layer calculates a weighted linear combination of its inputs, then applies a non-linear activation function (such as the sigmoid or ReLU function). 3) Output layer: produces the model’s final predictions; the number of neurons in this layer depends on the problem, and the activation function used in this layer also depends on the problem. The main advantages of the multilayer perceptron are6:
Its ability to approximate complex functions thanks to its hidden layers.
Flexibility to handle different types of problem (classification, regression).
Ts robustness to noise and incomplete data.
However, MLP requires careful tuning of its hyperparameters (number of layers, neurons, learning rate, etc.) to achieve good performance.
Figure 5. Multilayer perceptron classifier7.
3.4. User Behavior Analysis
Some systems combine analysis of e-mail content with tracking of user behavior (browsing history, past interactions, etc.) to improve phishing detection. The characteristics of our data set are as follows:
1) Index: Unique identifier for each line (sample) of data;
2) UsingIP: Indicates whether the IP address is used in the URL (1 for yes, −1 for no;
3) LongURL: Indicates whether the URL is long (1 for yes, −1 for no); a fundamental measure often used to distinguish legitimate from malicious URLs. Long, convoluted URLs can indicate attempts to obscure their true intent;
4) ShortURL: Indicates whether the URL is short (1 for yes, −1 for no);
5) Symbol@: Indicates whether the “@” symbol is present in the URL (1 for yes, −1 for no);
6) Redirecting//: Indicates whether the URL contains “//” redirects (1 for yes, −1 for no);
7) PrefixSuffix-: Indicates whether the URL contains prefixes or suffixes with hyphens “-” (1 for yes, −1 for no);
8) SubDomains: Indicates the number of subdomains in the URL (1 for many, −1 for few);
9) HTTPS: Indicates whether HTTPS protocol is used (1 for yes, −1 for no);
10) DomainRegLen: Length of domain record (1 for long, −1 for short);
11) Favicon: Indicates whether the favicon is loaded from the same domain (1 for yes, −1 for no);
12) NonStdPort: Indicates whether a non-standard port is used (1 for yes, −1 for no);
13) HTTPSDomainURL: Indicates whether the domain URL uses HTTPS (1 for yes, −1 for no);
14) RequestURL: Indicates whether the majority of URLs on the page point to the same domain (1 for yes, −1 for no);
15) AnchorURL: Indicates whether the URLs of anchor tags point to the same domain (1 for yes, −1 for no);
16) LinksInScriptTags: Indicates presence of links in script tags (1 for yes, −1 for no);
17) ServerFormHandler: Indicates whether the page form is managed by the server (1 for yes, −1 for no);
18) InfoEmail: Indicates whether an email address is present in the URL (1 for yes, −1 for no);
19) AbnormalURL: Indicates whether the URL is abnormal in relation to the domain (1 for yes, −1 for no);
20) WebsiteForwarding: Indicates whether the site frequently redirects to other pages (1 for yes, −1 for no);
21) StatusBarCust: Indicates whether the status bar is customized (1 for yes, −1 for no);
22) DisableRightClick: Indicates whether right-click is disabled on the page (1 for yes, −1 for no);
23) UsingPopupWindow: Indicates whether popup windows are used (1 for yes, −1 for no);
24) IframeRedirection: Indicates whether redirection iframes are used (1 for yes, −1 for no);
25) AgeofDomain: Indicates domain age (1 for old, −1 for young);
26) DNSRecording: Indicates whether DNS records are available (1 for yes, −1 for no);
27) WebsiteTraffic: Indicates website traffic level (1 for high, −1 for low);
28) PageRank: Indicates page PageRank (1 for high, −1 for low);
29) GoogleIndex: Indicates whether the page is indexed by Google (1 for yes, −1 for no);
30) LinksPointingToPage: Indicates the number of links pointing to the page (1 for many, −1 for few);
31) StatsReport: Indicates the presence of statistical reports (1 for yes, −1 for no);
32) class: The target variable, indicating the class to which each sample belongs (1 for one class, −1 for the other class, typically representing labels such as “phishing” or “legitimate”).
4. Results and Discussion
In this section, we present the results obtained from our analyses and experiments with data analysis and processing, dataset manipulation using different Machine Learning algorithms, discussion of the results, limitations and future prospects.
4.1. Presentation of Results
4.1.1. Description of the Dataset
Table 1. Dataset description.
Features |
Mean |
std |
UsingIP |
0.313914 |
0.949495 |
LongURL |
−0.633345 |
0.765973 |
ShortURL |
0.738737 |
0.674024 |
Symbol@ |
0.700561 |
0.713625 |
Redirecting// |
0.741632 |
0.670837 |
PrefixSuffix- |
−0.734938 |
0.678165 |
SubDomains |
0.064049 |
0.817492 |
HTTPS |
0.251040 |
0.911856 |
DomainRegLen |
−0.336711 |
0.941651 |
Favicon |
0.628551 |
0.777804 |
NonStdPort |
0.728243 |
0.685350 |
HTTPSDomainURL |
0.675231 |
0.737640 |
RequestURL |
0.186720 |
0.982458 |
AnchorURL |
−0.076443 |
0.715116 |
LinksInScriptTags |
−0.118238 |
0.763933 |
ServerFormHandler |
−0.595712 |
0.759168 |
InfoEmail |
0.635788 |
0.771899 |
AbnormalURL |
0.705446 |
0.708796 |
WebsiteForwarding |
0.115705 |
0.319885 |
StatusBarCust |
0.762077 |
0.647516 |
DisableRightClick |
0.913877 |
0.406009 |
UsingPopupWindow |
0.613353 |
0.789845 |
IframeRedirection |
0.816899 |
0.576807 |
AgeofDomain |
0.061335 |
0.998162 |
DNSRecording |
0.377239 |
0.926158 |
WebsiteTraffic |
0.287407 |
0.827680 |
PageRank |
−0.483626 |
0.875314 |
GoogleIndex |
0.721549 |
0.692395 |
LinksPointingToPage |
0.343948 |
0.569936 |
StatsReport |
0.719739 |
0.694276 |
class |
0.113986 |
0.993527 |
4.1.2. The Feature Correlation Matrix
Figure 6. Feature correlation matrix.
Figure 6 is a correlation matrix between different features of a dataset of phishing URLs from our study. Correlation measures the strength and direction of the linear relationship between two variables. Values range from −1 to 1, where: 1 indicates a perfect positive correlation (when one variable increases, the other also increases). −1 indicates a perfect negative correlation (when one variable increases, the other decreases). 0 indicates no linear correlation. These correlations were used in Figure 6 to identify and detect the most important features when detecting phishing URLs, and helped us to improve the classification models used to detect these threats.
4.1.3. Characteristic Histogram
Figure 7. Feature histograms.
This Figure 7 shows histograms of various characteristics extracted from cybersecurity-related data. Each histogram shows the distribution of values for two distinct classes, represented by the blue and red dots. These histograms in Figure 7 allow us to analyze the discriminability of these different features for classification between the two classes. Features with more separated distributions (such as WebsiteTraffic and HTTPS) appear to be more informative in distinguishing the two classes. The graphs show that features related to sub-domains, HTTPS usage and web traffic are the most relevant for differentiating these two classes, while prefixes/suffixes and URL anchors seem less discriminating.
4.1.4. Diagram Representing the Number of Phishing Cases with the
Values of Two Label Classes
With these source codes we have:
Figure 8. Diagram representing the values of two labeled classes.
This diagram is represented by Figure 8 with two parts representing the values of two different classes, i.e. each sample is labeled with a class identifier indicating whether it is categorized as a phishing website (1) or not (−1). Composed of 11,054 website samples. The blue part represents class 1 and occupies 55.7% of the diagram, making up 6157 phishing sites. The orange part represents class −1 and occupies 44.3% of the diagram, i.e. 4897 non-phishing or non-phishing sites. This diagram therefore shows that class 1 has a greater weight than class −1 in the data represented. In Figure 8, the percentage difference between the two classes is around 11.4 points (55.7% for class 1 vs. 44.3% for class −1). This diagram makes it easy to compare the proportions between the two classes of data.
4.1.5. Data Splitting
The data set is divided into training and test sets on a 70 - 30 basis, or the data has been split into a training set (70% of the data) and a test set (30% of the data). This prepares the data for training and evaluation of a machine learning model.
4.2. Algorithms Used
4.2.1. The Naïve Bayes Classifier (NBC) Algorithm
On the training data, the Naive Bayes model performed relatively well, with an Accuracy of 88.8%, an F1-score of 89.3% and an Accuracy of 94.8%. However, Recall is a little lower at 84.4%. On test data, performance remains fairly close to training data, with Accuracy at 87.6%, F1-score at 88.3% and Precision at 94.3%. Recall fell slightly to 82.9%. The difference between performance on training and test data is relatively small (around 1 to 2 percentage points), indicating that the Naive Bayes model does not appear to be overlearned. The MAE (Mean Absolute Error) is 0.224 on training and 0.248 on test, which is relatively high compared with the other metrics. This suggests that the model still has sizeable errors in its predictions. In summary, the Naive Bayes model performs well, with accuracy and F1-score around 87% - 88% on test data. However, the Recall is a little lower, indicating that the model still has difficulty in correctly identifying certain samples. The relatively high MAE also shows that the model is still making sizeable errors in its predictions.
![]()
Figure 9. Graphical representation of training and test accuracy for different var_smoothing values.
Figure 9 shows the evolution of the accuracy of the machine learning model as a function of the “var_smoothing” regularization parameter. With the two curves shown: the blue curve represents the accuracy on the training data and indicates how the model performs on the data used to train it. The orange curve represents accuracy on test data. This measure indicates the model’s overall performance on new data not seen during training. We can observe that as the “var_smoothing” parameter increases from 0.0 to 1.0, the accuracy on test data progressively decreases from around 0.57 to 0.55. This suggests that the model is suffering from over-training. This suggests that the model suffers from overlearning when “var_smoothing” is too low, and that this parameter needs to be carefully adjusted to find the right balance between learning on training data and generalization on new data.
4.2.2. Decision Tree Classifier
The performance results of the decision tree model show that: Accuracy on the training set: 1.000. This indicates a very good fit to the training data. Accuracy on the test set: 0.952. This indicates that the model generalized well and achieved an accuracy of 95.2% on the test data, which is very satisfactory. F1 score on training data: 1.000. A score of 1.0 shows that the model performs perfectly on training data. F1 score on test set: 0.957, this test score also indicates very good overall model performance. These results show that the decision tree model learned the training data very well, and was able to generalize its performance well to the test set. It’s a high-performance model.
Evolutionary graph of model accuracy as a function of maximum decision tree depth.
Figure 10. Graph showing the evolution of the accuracy of a machine learning model as a function of the maximum depth of the decision tree model.
Model accuracy on the training set increases monotonically with maximum depth, reaching a value close to 1 (i.e. 100% accuracy) for a depth greater than 25, as shown in Figure 10. Model accuracy on the test set follows a similar trend, but with a less steep curve. It reaches a plateau around 0.97 (i.e. 97% accuracy) for a depth greater than 20. The gap between training and test accuracy narrows as maximum depth increases, indicating that the model generalizes better to new data when model complexity (measured by depth) is higher. This suggests that there is a trade-off between the model's ability to fit training data well and its ability to generalize to new data.
4.2.3. Gradient Boosting Classifier
Gradient Boosting Classifier prove that the accuracy on the training data is 0.999 which means that the model has a very high, almost perfect (99.9%), accuracy on the training data. This indicates that the model fits the training data very well. The Precision on test data of 0.962 is also very high. This shows that the model generalizes well and performs well on data it didn’t see during training. The F1 Score on training data of 0.999 combines precision and recall and indicates excellent overall performance on training data. The F1 Score on test data: 0.966 confirms the good generalization of the model. So, these results indicate that the Gradient Boosting Classifier performs excellently on both training and test data. The model seems well adapted to the data and capable of making accurate predictions.
Evolutionary plot of the model as a function of the learning rate used Gradient Boosting.
Figure 11. Graph showing the evolution of the accuracy of a machine learning model as a function of the learning rate using Gradient Boosting.
This Figure 11 shows the evolution of the model as a function of the training rate using Gradient Boosting. It shows that model accuracy on the training set increases steadily with learning rate, reaching a value close to 0.98, or 98% accuracy at high learning rates. Model accuracy on the test set follows a similar trend, although the curve is a little less steep. It reaches a plateau around 0.965, i.e. 96.5% accuracy for a learning rate above 6. The gap between training and test accuracy narrows as the learning rate increases, indicating that the model generalizes better to new data with a higher learning rate. This suggests that there is a trade-off between the model’s ability to fit training data well and its ability to generalize to new data. Fine-tuning the learning rate is necessary to find the best balance between these two objectives. This Figure 11 shows the importance of the choice of learning rate in optimizing the performance of a machine learning model.
4.2.4. Multi-Layer Perceptron Classifier
The Multi-layer Perceptron model seems to perform exceptionally well in phishing detection. According to the brief summary of metrics we had: accuracy on training data is 98.8%, and on test data it is 96.4%. This means that the model makes few classification errors, with only a slight drop in performance on the test data. F1-score combines precision and recall. Our model achieves an F1-score of 98.9% on both training and test data, showing equivalent performance on both classes. Recall is high, at 98.9% on training data and 97.3% on test data. This indicates that the model correctly identifies the vast majority of samples in each class. MAE (Mean Absolute Error) our model has a MAE of 0.025 on training data and 0.071 on test data. A low MAE suggests that the model makes few absolute errors on predictions. In sum, our MLP model demonstrates excellent generalization capability, which is essential for phishing detection.
4.3. Model Comparison Table
Table 2. Model comparison.
|
ML Model |
Accuracy |
f1_score |
Recall |
Precision |
Mean Absolute
Error |
0 |
Naive Bayes Classifier |
0.876 |
0.883 |
0.844 |
0.948 |
0.224 |
1 |
Decision Tree |
0.950 |
0.955 |
1.000 |
1.000 |
0.000 |
2 |
Gradient Boosting
Classifier |
0.962 |
0.966 |
0.999 |
0.999 |
0.002 |
3 |
Multi-layer Perceptron |
0.903 |
0.918 |
0.970 |
0.875 |
0.186 |
4.4. Heat Map of Model Performance Measurements
Figure 12. Heat map of performance metrics for all models.
4.5. Stacked Bar Diagram
Figure 13. Stacked bar chart of performance measures for all models.
4.6. Checking the Importance of Features
Figure 14. Checking the importance of features by permutation on the complete model.
5. Discussion of Results
5.1. Discussion of Results in Relation to Model Comparison
Different machine learning models were used in this study, and the previous sections presented the results and effects of the machine learning model on the classification process for phishing and legitimate URLs Table 1. Comparative analyses of machine learning models are presented in this section. Table 2 and Figure 13 and Figure 14 presented the clear and significant effects of the machine learning models in this study. From the outset, the comparative analyses show that the machine learning models Naive Bayes Classifier and Multi-layer Perceptron do not perform well, Naive Bayes give very poor results and Multi-layer Perceptron satisfactory results. Gradient Boosting Classifier and Decision Tree showed very effective and significant results in classifying phishing URLs.
5.2. Comparison of Study Models
These results from Table 2 helped us to make a comparative assessment of the performance of the different machine learning models, facilitating the choice of the best model for the problem under consideration: the Naive Bayes Classifier model seems to be a good compromise in terms of simplicity and training speed, but performs slightly less well than the other models in terms of precision and recall. On the other side of Table 2, the Decision Tree performs perfectly, but as the present results show, this may indicate overtraining. The Gradient Boosting Classifier stands out for its excellent overall performance, with a precision of 96.1%, an F1-Score of 0.966 and a very low MAE of 0.002. It’s a solid choice for this dataset, as it manages to balance precision and recall without showing any obvious signs of overlearning.
5.3. Discussion of Results in Relation to Metric and Stacked-Bar
Performance
Figure 12 shows a heat map of performance metrics for different classification models. It is clear from this graph that Gradient Boosting Classifier, Multi-layer Perceptron and Decision Tree stand out for their high scores in terms of precision, recall, F1 score and mean absolute error. The Naïve Bayes Classifier, while having high precision, has lower recall, which affects its F1 score. In addition, its mean absolute error is higher than that of the other models. Figure 12 clearly shows that the Gradient Boosting Classifier, the Gradient Boosting Classifier, Multi-layer Perceptron and Decision Tree stand out for their high scores in terms of accuracy, F1-score, recall and precision, clearly outperforming the Naive Bayes Classifier on most metrics.
5.4. Discussion of Results versus Storing or Retaining the Best
Model
The Gradient Boosting Classifier combines several weak models to form a more powerful model. The main parameters in this case are: The Learning Rate parameter: with a value of 0.7 indicates that each new tree makes a moderate contribution to the final prediction. This lower learning rate makes the model more robust to noise, but can also slow convergence. The maximum tree depth parameter or max_depth: with a depth of 4 proves that each tree can have up to 4 decision levels. Shallow trees therefore limit the risk of overlearning, but may also reduce the model’s ability to capture complex interactions. With these parameters, the GradientBoostingClassifier iteratively builds a set of shallow decision trees (max_depth = 4), updating the tree weights at each iteration according to a moderate learning rate (learning_rate = 0.7). We found this configuration to be a good starting point for a classification problem.
5.5. Experimental Results for the Most Important Characteristics
In regression problems such as ours, where prediction involves a continuous value, models are trained on labeled data in order to accurately predict results for new instances. A prediction approach to illustrate the importance of different features has been adopted to improve the results and efficiency of machine learning models, illustrated in our study by Figure 14. The main predictive observations remain the most important features, which are “StatsReport”, followed by “LinksPointingToPage” and “GoogleIndex”. According to the results presented in Figure 14, these three features have a high relative importance, exceeding 0.5 on the scale. This approach highlights the most influential features on the prediction model, making it possible to identify the key factors to be taken into account to improve model performance. The most important characteristics are linked to aspects such as website reputation, traffic and search engine indexing. This approach provides valuable information on the relative importance of different features in the prediction model, helping us to better understand the determining factors and direct our efforts to improve our model.
5.6. Limits and Future Prospects
Although the results of our study are satisfactory, they are not without their limitations. Although the algorithms and models used are powerful, as shown in Table 2, these models may have difficulty adapting to the new phishing techniques that are constantly emerging, as shown in Figure 1. The Naive Bayes model is limited by its simplicity and sensitivity to missing data. Gradient Boosting, on the other hand, has considerable complexity and computation time, especially in the case of our research, where the dataset to be analyzed is voluminous. The Multi-Layer Perceptron, being deep, requires a large amount of training data to avoid overlearning, so if the data is limited, the model may not generalize correctly. These models process elements from a single Kaggle dataset. However, these limitations do not diminish the scientific value of this study, nor do they remain insurmountable; to this end, our results remain scientifically valid. Therefore, to address these limitations, we are planning a future study in the first instance, on AI techniques for phishing detection using supervised learning by extending algorithms such as random drill, neural networks but also making use of other Datasets such as APWG eCrime Dataset, Phishing Dataset. In the second place on unsupervised learning and deep learning using deep neural networks for text and image analysis.
5.7. Contributions
Figure 15. The NDG + M model proposed in this study.
This section presents the solution, method or model we have just implemented. Our study makes remarkable contributions to the field of artificial intelligence by combining machine learning and deep learning. It synthesizes 3 machine learning models and 1 deep learning model. This study proposes the detection of cyber-attacks based on phishing URLs in order to prevent crime and protect people’s privacy. On a global scale, the dataset includes more than 6157 phishing URL attributes or 55.7% represented by class 1 which allow phishing URLs to be classified according to these attributes. Machine learning models were applied, such as Naive Bayes (NB), Decision Tree (DT), Gradian Boosting and a deep learning model such as Multi-Layer Perceptron (MLP) giving the model (NDG + M), which accurately classified the phishing URL threats. Validation with a parameter based on the feature selection technique was used with the hybrid NDG + M model to improve prediction results. The proposed methodology was evaluated using evaluation parameters such as accuracy, precision, recall, specificity and F1 score. Figure 15 shows the proposed model to improve the efficiency and accuracy of phishing detection.
In summary, the proposed NDG + M model aims to improve the effectiveness and accuracy of phishing detection with the predictive approach. The study focuses on phishing prevention techniques, including URL filtering, user education and training [33], real-time detection by machine learning. It also provides a detailed explanation of how 4 machine learning models work. However, it does not explore the details of the dataset are omitted, which has an impact on generalizability.
6. Conclusions
With the development of the Internet, cybercrime is on the increase on a daily basis, particularly through the use of suspicious and malicious URLs. This has a significant impact on the quality of services provided by the Internet and industrial companies. The detection of phishing is a major challenge for online security. The aim of this study was to analyze the different AI approaches used for phishing detection, and to determine the most practical types of supervised machine learning algorithms. The results we obtained with the Gradient Boosting Classifier are impressive. Precision of 96.2%, F1-score of 96.6% and recall of 99.9% across all classes are solid performances. The fact that this model balances precision and recall well is crucial to minimizing false positives while effectively detecting true positives. However, there are still aspects to be explored in future work, and the creation of a new dataset based on phishing URLs is essential to improve the robustness of detection systems. The integration of unsupervised, hybrid machine learning algorithms and elements of user behavior analysis may also bring new perspectives.
Raising staff awareness of security and phishing techniques is also a crucial step. Vigilance, source verification and ongoing training are effective ways of protecting against these growing threats. In conclusion, this study has provided valuable insights into phishing detection and practical skills in template selection, feature analysis and performance optimization.
NOTES
1https://www.webopedia.com/definitions/phishing-meaning/ [consulted June 2024].
2https://designpowers.com/blog/url-best-practices [viewed in June 2024].
3https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls [consulted on May 26, 2024 in Bukavu].
4(https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
[consulted on June 12, 2024])
5https://fr.wikipedia.org/wiki/Perceptron_multicouche [consulté le 08 Juillet 2024]).
6Idem.
7https://rtavenar.github.io/deep_book/fr/content/fr/mlp.html [consulted in July 8, 2024].