Scientific African 16 (2022) e01165 

Contents lists available at ScienceDirect 

Scientific African 

journal homepage: www.elsevier.com/locate/sciaf 

Performance evaluation of machine learning tools for 

detection of phishing attacks on web pages 

T.O. Ojewumi a , G.O. Ogunleye 

b , ∗, B.O. Oguntunde 

a , O. Folorunsho 

b , 
S.G. Fashoto 

c , N. Ogbu 

a 

a Department of Computer Science, Redeemer’s University, Ede, Osun State, Nigeria 
b Department of Computer Science, Federal University, Oye-Ekiti, Ekiti State, Nigeria 
c Department of Computer Science, Faculty of Science and Engineering, University of Eswatini, Kwaluseni M201, Eswatini 

a r t i c l e i n f o 

Article history: 

Received 3 October 2020 

Revised 19 April 2021 

Accepted 19 March 2022 

Editor DR B Gyampoh 

Keywords: 

Phishing 

Attack 

KNN 

Random Forest 

SVM 

a b s t r a c t 

This paper analyses and implements a rule-based approach for phishing detection using 

the three machine learning models trained on a dataset consisting of fourteen (14) fea- 

tures. The machine learning algorithms are; k-Nearest Neighbor (KNN), Random Forest, and 

Support Vector Machine (SVM). Among the three algorithms used, it was discovered that 

Random Forest model proved to deliver the best performance. 

Rules were extracted from the Random Forest Model and embedded into a Google 

chrome browser extension called PhishNet. PhishNet is built during the course of this re- 

search using web technologies such as HTML, CSS, and Javascript. As a result, PhishNet 

facilitates highly efficient phishing detection for the web. 

© 2022 The Author(s). Published by Elsevier B.V. on behalf of African Institute of 

Mathematical Sciences / Next Einstein Initiative. 

This is an open access article under the CC BY license 

( http://creativecommons.org/licenses/by/4.0/ ) 

 
Introduction 

Over the years since its inception, the internet has continued to grow exponentially in use, as well as its various use-

cases. One of the popular use-cases of the internet is internet banking. 

Internet banking allows customers to engage in financial transactions with financial institutions or banks over the inter- 

net. Internet banking is widespread and on the rise all over the world. 

In Europe, about 51% of the adult population engage in internet banking [1] , and about 60% of the United States popula-

tion engaged in internet banking last year according to the Statista Global Consumer Survey [ 2 , 3 ]. The E-commerce industry

influenced retail sales to the tune of $2.84 trillion in 2018. Cybercriminals have since realized the wealth in the internet

banking industry and have invented several means of attack, one of which is phishing [ 3 ]. 

A phishing attack is based on deceiving a person to navigate to a fake original-look-alike website where his/her sensitive 

details can be stolen. Phishing attacks are on the rise because of their profitability. There was a 36% increase in phishing

attacks, and the number of phishing sites grew by 220% over the course of 2018 [4] . Some of these phishing attacks are

aimed at employees of certain companies. Big companies such as Google and Facebook were no exemption as they were 
∗ Corresponding author. 

E-mail address: gabriel.ogunleye@fuoye.edu.ng (G.O. Ogunleye) . 

https://doi.org/10.1016/j.sciaf.2022.e01165 

2468-2276/© 2022 The Author(s). Published by Elsevier B.V. on behalf of African Institute of Mathematical Sciences / Next Einstein Initiative. This is an 

open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ ) 

https://doi.org/10.1016/j.sciaf.2022.e01165
http://www.ScienceDirect.com
http://www.elsevier.com/locate/sciaf
http://crossmark.crossref.org/dialog/?doi=10.1016/j.sciaf.2022.e01165&domain=pdf
http://creativecommons.org/licenses/by/4.0/
mailto:gabriel.ogunleye@fuoye.edu.ng
https://doi.org/10.1016/j.sciaf.2022.e01165
http://creativecommons.org/licenses/by/4.0/


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

 
confirmed to have fallen victim and lost up to $100 million combined in 2017 [5] . Many phishing attacks are aimed at

customers/users of a particular service. 

A famous phishing attack was a Gmail phishing scam that targeted almost 1 billion Gmail users worldwide in 2017. 

Such an attack is usually aimed at account takeover. Phishing attacks aim to deceive users/clients of services into revealing

sensitive information by disguising trustworthy institutions over the internet. This disguise involves the creation of fake 

websites, which are look-alikes of original reputable sites. Users are then made to navigate to these fake websites, usually 

with emails. These emails are made to appear to have come from the reputable institution and target customers of that

institution. They usually contain a hyperlink with the correct URL as text, while pointing to the fake website. The hope of

the sender of the email is that the target will click such a link and fail to check or recognize the difference in the URL; and

also reveal sensitive information such as login or credit card details. These attacks can be in a personalized form known

as Spear phishing. Phishing attacks also affect businesses/brands. Phishers usually make use of a legitimate brand for their 

attack. 

Once the phishing campaign is discovered and revealed, these brands become casualties, as this kind of publicity can hurt 

the brand’s image. Customers of such a brand might even avoid the brand’s legitimate website for fear of landing on the

fake web page. Organizations might also lose confidence in that brand and drop it for a competitor. Phishing emails can be

received at any time. Sometimes, these types of email messages may be preceded by technology disasters or newsworthy 

events. This makes the email seem authentic and creates a sense of urgency in the recipient. It is essential to solve the

problem of phishing attacks as phishing attacks cause financial loss for victims and put their information at risk. It also

creates a loss for business parties and can be a death blow for businesses. 

The proposed solution to the issue of phishing will yield a safer environment for web users and reduce the risk of being

caught by a phishing attack by alerting users when they land on a phishing site. This paper compares and implements a

rule-based approach for phishing detection using the three machine learning models that are popular for phishing detection. 

The machine learning algorithms are; k-Nearest Neighbor (KNN), Random Forest, and Support Vector Machine (SVM). The 

models were trained on a dataset consisting of fourteen (14) features. Among the three algorithms used, it was discovered 

that Random Forest model proved to deliver the best performance. 

Rules were extracted from the Random Forest Model and embedded into a Google chrome browser extension called 

PhishNet. PhishNet is built during the course of this research using web technologies such as HTML, CSS, and JavaScript. As

a result, PhishNet facilitates highly efficient phishing detection for the web. 

In this paper, we try to answer the following research questions: 

1. Which of the three machine learning models that are being used for phishing detection would perform best? 

2. How good is the rule-based machine learning model in predicting phishing websites? 

Review of related literature 

Mohigimi and Varjani [6] proposed a standard based phishing detection strategy. They proposed two new capabilities, 

which included making use of approximate string-matching algorithms and used relevant features from related works. They 

utilized the Support Vector Machine algorithm for training their model and later extracted the rules using SVM_DT calcu- 

lation. Their proposed model had a True Positive Rate of 99.14% in the detection of web banking phishing site pages. The

extracted rules were then implemented as a chrome browser extension, which they named PhishDetector for convenience. 

Their solution is autonomous of third-party services and can identify zero-day phishing attacks. The mishap of their pro- 

posed model is that it is excessively subject to page content and can’t identify website page with non-HTML code [11] . Tan

et al. [12] proposed a phishing detection technique dependent on character examination between the real and target website 

page. Their methodology named PhishWHO was in three phases; 

• Keywords extraction: the extraction of identity keywords from the textual components of a website where a novel 

weighted URL tokens system based on the N-gram model was proposed. 

• Target domain selection: the use of a search engine to find target domains which are then selected based on identity-

relevant features. 

• Identity matching: a three-tier identity matching system was proposed to determine the legitimacy of the query web- 

page. 

PhishWHO accomplished a true positive rate of 99.68% and a true negative rate of 92.52%. PhishWHO doesn’t perform 

well against the visual cloning procedure utilized by some phishers. Varshney et al. [13] utilized a search engine based

anti phishing approach for phishing detection. They built up a lightweight phish detector as a Google Chrome extension 

augmented with Google as the search engine and called it Lightweight Phish Detector (LPD). They performed extensive 

testing and compared their solution to other web index-based antiphishing approaches in terms of performance. They used 

URLs from PhishTank and Alexa ranking for their examination and got a variable result of 92.4% to 100% for true negative

rate and 99.5% for true positive rate. LPD was seen as proficient as far as calculation and capacity cost as it was estimated

at 47 kb and utilized just 11mb worth of memory. It didn’t require any dedicated server resources and could run on the

client-side of the browser since it was lightweight. Unfortunately, LPD could not deal with languages other than English so 

accurately and yielded a couple of false positives [14] . 
2 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

 
Ali [4] proposed a phishing detection strategy utilizing different machine learning classification algorithms combined 

with wrapper feature selection techniques. The machine learning classifiers used incorporate Support Vector Machine (SVM), 

k-Nearest Neighbor (k-NN), Random Forest (RF), and Naïve Bayes (NB), among others. He had the ability to increase the 

accuracy of the model’s wrapper feature selection and achieved a 97.3% True Positive Rate and a True Negative Rate of 97%

with the Random Forest classifier on the detection of phishing sites. Wrapper selection feature requires extra computational 

overhead, thus wasting additional time. 

Abutair et al. [10] proposed a Case-Based Reasoning phishing detection system (CBR-PDS). It is a system that recognizes 

phishing attacks by depending on past cases and uses a small dataset contrasted with machine learning approaches. Their 

solution utilized a two-phase feature selection and weighting procedure to choose informative features and assign them 

weights out of twenty-one initial features. Their proposed model delivered a 96% precision on their test dataset. The pro- 

posed model is versatile to new phishing strategies; in any case, it requires constant updating with new phishing URLs. 

Yi et al. [15] proposed a phishing detection strategy that utilizes Deep Learning Frameworks. They grouped their extracted 

features into two types; original features and interactive features and then introduced a detection model dependent on Deep 

Belief Networks (DBN). Their solutiondelivered roughly 90% True Positive Rate and 99.4% True Negative rate. In any case, 

their model can end up being computationally costly. 

Sahingoz et al. [ 8 , 9 ] proposed a model for detecting phishing pages using machine learning algorithms. They utilized

Artificial Neural Networks (ANNs) and Deep Neural Networks (DNNs) machine learning algorithms and trained their models 

with a dataset containing around 60,0 0 0 site pages with 27 extracted features. Their solution achieved an accuracy of 92%

for the ANN approach and 96% for the DNN approach. In any case, the execution time of their framework can, at present,

be enhanced. 

Muppavarapu et al. [7] proposed a novel strategy for phishing location utilizing Resource Description Framework (RDF) 

models and Random Forest classification algorithm. Their model included making decisions dependent on a search engine 

and using a Random Forest model trained with 21 features as a contingency for decision making should the main methodol-

ogy fizzle. They executed their proposed model mostly with Java programming language and utilized Google as their search 

engine. The proposed strategy was effective as they accomplished a True Positive Rate of 98.8% and a True Negative Rate of

98.5%. 

Mishra and Gupta [5] proposed a phishing detection system that included utilizing URI and CSS matching techniques 

with the assumption that phishers, by and large, utilize the same CSS style as the original website page. They executed their

solutionas a desktop application called CUMP with Java programming language. Their proposed solution had a True Positive 

Rate of 93.27% and a True Negative Rate of 100%. The significant quality of their solution was its False Positive Rate, which

was low to zero and furthermore it’s capacity to deal with a broad scope of sites. Be that as it may, their solution has a

high memory cost and fails when a CSS document has an enormous number of CSS rules. 

Sahingoz et al. [9] proposed a real-time antiphishing framework that utilizes seven machine learning classification al- 

gorithms, and Natural Language Processing (NLP) based highlights. Their framework is autonomous of third-party services 

and can deal with new phishing sites. The classification algorithms utilized are Naïve Bayes, k-NN, Adaboost, Decision Tree, 

Random Forest, SMO, and K-star. Their best result was given by the Random Forest algorithm, which delivered an accuracy 

of 97.98%. Their proposed framework can perform well with various scope of sites as the models were prepared with a big

dataset containing data that met multiple criteria. The framework does not perform as adequately when the URL contains 

short sub-domains and domains without any path. This is because of the use of NLP based highlights. 

This paper aims to implement a solution that will make decisions based on the best performing machine learning classi- 

fication model for phishing detection using a lightweight Google chrome extension, which will offer convenient and highly 

accurate phish detection for internet users. 

Methodology 

This section focuses on and analyzes the Extracted Rule-based method for phishing detection and how it can be imple-

mented. In this paper, a total of 14 features were extracted from websites. Phishing websites were gotten from PhishTank 

(PhishTank.com is a community-based phish verification system where suspicious sites are submitted to be voted on by 

other users as a means of phish detection and verification), legitimate Internet-banking websites are gotten from various 

webpage directory services. The following outline methods were used to achieve this paper. 

• Use of Supervised machine learning algorithms to train models: 

Support Vector Machine (SVM), Random Forest, and k-Nearest Neighbor (k-NN) classification algorithms were used to 

train three separate models. These algorithms were accessed from Python’s Scikit-learn library. 

• Assessment, evaluation, and comparison of the performance of trained models in the classification of test examples. 

• Extraction of rules from the best model, selection, and evaluation of rules: 

A synthetic dataset based on the best model’s correct predictions were used to train a Decision Tree model from which
if-else decision rules can then be extracted. 

3 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

 
• Embedding the selected rules into a browser extension called PhishNet for the detection of phishing webpages: 

In this research paper, PhishNet was created as a Chrome browser extension in the form of content script using JavaScript,

HTML, and CSS. 

The steps taken to build the proposed system are as follows: 

• Data collection 

• Feature Extraction 

• Model building and training 

• Model assessment 

• Rule Extraction 

• Building of PhishNet 

Fig. 1 illustrates the system methodology using a Block Diagram. 

Data collection 

This is the first phase of the project which involved the collection of phishing and legitimate webpage URLs which are

required for feature extraction. A total of 10 0 0 phishing webpages were gotten from Phishtank ( www.phishtank.com ). Four

hundred legitimate webpages, which are internet banking and financial webpages were obtained from web directories like 

Jasmine directory, business, the Financial brand, Intechnic, and Similar web directories. 

This phase was carried out with the aid of web scraping tools that access the DOM of webpages and extract the needed

information. 

Feature extraction 

Collected data were represented in the form of features for training the machine learning models. A total of 14 features

were extracted from the gathered webpages, six (6) of which were gotten from the webpage URL while the remaining eight

from the DOM of the webpage. The features that were used are as follows: 

IP address 

Some phishing webpages mask their website name by representing the website in the form of IP address, e.g., " http://

125.94.3.135/site.html ". Another reason phishers use IP addresses is to avoid paying for a domain name and hosting. Feature 

1 can be represented as follows: 

f 1 = 

{
1 , IP ad d ress 

0 , not IP ad d ress 

SSL security 

SSL security is an important feature for identifying phishing webpages as phishers are more likely to use websites with- 

out SSL security in order to avoid extra costs. Use of SSL security reflects in the URL of websites as an extra ’s’ in the

HyperText Transfer Protocol (’https’). Feature 2 can be extracted using string manipulation and is as follows: 

f 2 = 

{
1 , SSL security 

0 , no SSL security 

Number of dots 

The number of dots in the webpage URL is also another essential feature that can be used for classifying webpages. This

is because some phishing webpages pretend to be legitimate by using popular legitimate domains as subdomains of their 

site. This, in turn, increases the number of dots in the URL webpage. Therefore, URLs containing more dots are more likely

to be phishing webpages. Feature 3 is a count feature and can be represented as follows: 

f 3 = Number of dots 
4 

http://www.phishtank.com
http://125.94.3.135/site.html


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 1. System methodology. 

5


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

 
Length of URL 

Phishing webpages tend to have longer URLs as phishers sometimes try to mask suspicious URLs with very long URLs 

and legitimate-looking subdomains. For extraction purposes the URL will be broken down into three parts; the host, file 

path and query with three corresponding features for their lengths as follows: 

f 4 = Length of host 

f 5 = Length of query 

f 6 = Length of the f ile path 

Resource domain difference feature set 

Phishing webpages usually have resources pointing to or having their source from the legitimate website being mas- 

queraded. Therefore, there is typically a difference between the domain of their URL and the domain of the URL of their

resources. 

The resources that were considered are links, styles, images and scripts which were represented by the HTML tags 〈 a 〉 ,
〈 link 〉 , 〈 img 〉 and 〈 script 〉 , respectively. These resources were gotten from the DOM of webpages, and their referenced URLs

will be gotten from the adequate attributes. 

Levenshtein distance string algorithm was employed to compute the difference between the domains of the webpage 

URL and resource URL for each URL in each resource category. The data would be normalized to allow for easy comparison

of data from various webpages. The computed value would range from 0 to 1; Most phishing webpages would have values

closer to 1. The features would be as follows as proposed by (Mohigimi & Varjani, 2016): 

f7 = 

∑ n 
i =0 

LD ( webpage _ domain , lin k i _ domain ) 
max ( | webpage _ domain | , | lin k i _ domain | ) 

n 

(1) 

f8 = 

∑ n 
i =0 

LD ( webpage _ domain , styl e i _ domain ) 
max ( | webpage _ domain | , | styl e i _ domain | ) 

n 

(2) 

f9 = 

∑ n 
i =0 

LD ( webpage _ domain , scrip t i _ domain ) 
max ( | webpage _ domain | , | scrip t i _ domain | ) 

n 

(3) 

f10 = 

∑ n 
i =0 

LD ( webpage _ domain , imag e i _ domain ) 
max ( | webpage _ domain | , | imag e i _ domain | ) 

n 

(4) 

Resource access protocol feature set 

This feature set will check if each URL resource in each resource category has SSL security. Legitimate webpages, espe- 

cially internet banking and financial webpages, tend to have to access to their resources over secure protocols. 

Links, styles, images, and scripts resources were considered and extracted from the DOM of the webpage. The features 

are represented with the aid of feature two as follows as proposed by (Mohigimi & Varjani, 2016): 

f11 = 

∑ n 
i =0 f2 ( lin k i _ URL ) 

n 

(5) 

f12 = 

∑ n 
i =0 f2 ( styl e i _ URL ) 

n 

(6) 

f13 = 

∑ n 
i =0 f2 ( scrip t i _ URL ) 

n 

(7) 

f14 = 

∑ n 
i =0 f2 ( imag e i _ URL ) 

n 

(8) 

Model building and training 

A total of two machine learning supervised classification algorithms were used to build and train classification models. 

The algorithms that were used are k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM). The models were trained 

with the dataset consisting of 14 features. 
6 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 2. Screenshot sample of collected URLs. 

Fig. 3. Screenshot sample of the extracted output. 

 
k-Nearest neighbors 

k-Nearest Neighbor is a simple classification algorithm that trains a model simply by storing the dataset. It makes new 

predictions for a unique data point by looking for the datapoints closest to that datapoint and assigning it to the average

class of those data points. 

In its simplest form, the k-Nearest Neighbors algorithm considers only one neighbor and simply assigns the class of that 

neighbor to the new observation. The k in the name ’k-Nearest Neighbors’ signifies an arbitrary number k of neighbors that

can be considered. 
7 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 4. Decision boundary for KNN model on two features. 

Fig. 5. KNN model confusion matrix. 

 
Support vector machine 

Support Vector Machine is a linear model classification machine learning algorithm, i.e., it forms decision boundaries in 

a spatial plane using straight line segments, planes, or hyperplanes. New observations are then classified based on which 

class division they fall into. For instance, in a binary classifier, predictions are made using the general formula: 

� 

y = w [ 0 ] ∗ x [ 0 ] + w [ 1 ] ∗ x [ 1 ] + ... + w [ p ] ∗ x [ p ] + b > z (9) 

where, 

w = regression coefficient 

x[i] = feature 

b = intercept 

ˆ y = regression prediction 

The greater > z at the end of the formula is the prediction aspect of the formula, which means values greater than z fall

into one class, and those less than or equal to z fall into the other class. 

Random Forest 

A Random Forest is a collection of Decision trees with each tree differing slightly from the other. A prediction is made by

averaging the result obtained from all individual Decision Trees. This helps to reduce the problem of overfitting, a problem 

peculiar to the Decision Tree Algorithm. 

The algorithm is called Random Forest because of the way randomness is injected into the tree-building process to ensure 

each tree differs. 
8 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Table 1 

KNN model classification report. 

Precision Recall F1-score Support 

Legitimate 0.88 0.71 0.79 21 

Phishing 0.94 0.98 0.96 100 

Micro avg 0.93 0.93 0.93 121 

Macro avg 0.91 0.85 0.88 121 

Weighted avg 0.93 0.93 0.93 121 

 
Model assessment 

In this phase, statistics such as True Positive, False Positive, True Negative, False Negative, Precision, F-score, Accuracy, and 

Recall would be computed for each model. These statistics were used as a standard for comparison, and the best performing

model will be picked for the next phase, which is rule extraction. 

Rule extraction 

The method of rule extraction involves training a Decision Tree model with a synthetic dataset derived from the correct 

predictions made by the selected model. The best set rules for decision making can then be extracted from nodes in the

Decision Tree. 

Browser extension implementation 

The selected ruleset will be embedded in a Google chrome web-browser extension called PhishNet. It would access the 

webpage DOM, analyze it, and extract the corresponding features required for classifying the webpage. The implementation 

of this tool are covered in more detail in the implementation. 

Visual Studio code, Google Chrome, and the Python Scikit-learn library were used as tools to drive the implementation 

of this solution. 

System implementation results 

Data collection 

This stage was executed with python scripts using a web scraping library called Beautiful soup. Phishing URLs were 

gotten from PhishTank ( http://www.phishtank.com ) while legitimate financial institution URLs were gotten from sites like 

Jasmine directory ( https://www.jasminedirectory.com ), Intechnic ( https://www.intechnic.com ), The Financial Brand ( https: 

//thefinancialbrand.com ) and Similar web ( https://www.similarweb.com ). The URLs from these sites were consolidated and 

stored in an output CSV file as shown in Fig. 2 . 

Feature extraction 

This stage was executed with a python script using a python web scraping library called Beautiful soup; each feature to

be extricated was represented by a method. The script traverses through the list of URLs and visits each site, where it then

accesses the Document Object Model (DOM) of each webpage in order to extract the required features and stores the output 

in CSV files. Fig. 3 is a snippet of output after feature extraction. 

Model building and training 

Three (3) Machine learning models were trained with the dataset with the aid of python’s scientific library ’Scikit-learn’. 

k-Nearest neighbor(KNN) 

The k-Nearest Neighbor model generated the best outcomes with seven neighbors and was trained that way as shown 

in Fig. 4 . Null-valued feature instances were assigned the mean value of that feature. The model had a high True positive

estimation of 98%; however, it was hindered by the True Negative estimation of 71.43% as shown in Tables 1 , 2 , Figs. 5 and

6 . 

Support vector machine (SVM) 

The SVM model was trained using a weight of 100. Null-valued feature instances were assigned with the mean estimation 

of that feature. The model reached a high True Positive estimation of 97% yet a low True Negative estimation of 66.67% as

shown in Figs. 7 - 9 , Tables 3 and 4 . 
9 

http://www.phishtank.com
https://www.jasminedirectory.com
https://www.intechnic.com
https://thefinancialbrand.com
https://www.similarweb.com


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Table 2 

KNN summary of results. 

STATISTIC VALUE 

Accuracy 93.39% 

Error 6.61% 

True Positive 98% 

True Negative 71.43% 

False Positive 28.57% 

False Negative 2% 

Precision 0.93 

Recall 0.93 

F-score 0.93 

Fig. 6. Detection accuracy and detection error graph. 

Fig. 7. Decision boundary for SVN model on two features. 

10 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 8. SVM model confusion matrix. 

Fig. 9. Detection accuracy and detection error graph. 

Table 3 

SVM model classification report. 

Precision Recall F1-score Support 

Legitimate 0.82 0.67 0.74 21 

Phishing 0.93 0.97 0.95 100 

Micro avg 0.92 0.92 0.92 121 

Macro avg 0.88 0.82 0.84 121 

Weighted avg 0.91 0.92 0.91 121 

Table 4 

SVM summary of results. 

STATISTIC VALUE 

Accuracy 91.74% 

Error 8.26% 

True Positive 97% 

True Negative 66.67% 

False Positive 33.33% 

False Negative 3% 

Precision 0.92 

Recall 0.92 

F-score 0.92 

 
Random Forest 

The Random Forest model was trained using five trees. Null-valued feature occurrences were assigned the value - 1. This 

model was the best performing model with a True Positive of 100%, a True Negative of 90.48%, and an accuracy of 98.35%

on the test information as shown in Figs. 10 - 12 , Tables 5 and 6 . 

Summary of all models 

The Random Forest model performed best with an accuracy of 98.35% and a True Positive of 100%, followed by the k-

Nearest neighbor model, which had an accuracy of 93.39%, True positive of 98% and a True Negative of 71.43% with the

Support vector machine model coming in last as shown in Table 7 . 
11


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 10. Decision boundary for Random Forest on two features. 

Fig. 11. Random Forest confusion matrix. 

Fig. 12. Detection accuracy and detection error graph. 

 
PhishNet implementation 

PhishNet was implemented using technologies such as HTML, CSS, and Javascript to create a Google chrome web exten- 

sion. The extension uses the rules extracted from the Random Forest model and prompts a user when the user lands on a

phishing site as shown in Figs. 13 and 14 . 
12 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Table 5 

Random Forest classification report. 

Precision Recall F1-score Support 

Legitimate 1.00 0.90 0.95 21 

Phishing 0.98 1.00 0.99 100 

Micro avg 0.98 0.98 0.98 121 

Macro avg 0.99 0.95 0.97 121 

Weighted avg 0.98 0.98 0.98 121 

Table 6 

Random Forest summary of 

results. 

STATISTIC VALUE 

Accuracy 98.35% 

Error 1.65% 

True Positive 100% 

True Negative 90.48% 

False Positive 9.52% 

False Negative 0% 

Precision 0.98 

Recall 0.98 

F-score 0.98 

Table 7 

Summary of results for all models. 

STATISTIC KNN SVM RF 

Accuracy 93.39% 91.74% 98.35% 

Error 6.61% 8.26% 1.65% 

True Positive 98% 97% 100% 

True Negative 71.43% 66.67% 90.48% 

False Positive 28.57% 33.33% 9.52% 

False Negative 2% 3% 0% 

Precision 0.93 0.92 0.98 

Recall 0.93 0.92 0.98 

F-score 0.93 0.92 0.98 

Fig. 13. PhishNet analysing a page. 

13 


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

Fig. 14. PhishNet when it detects a phishing site. 

 
Conclusion and future work 

This paper presents the performance of machine learning classification model for phishing detection using a lightweight 

Google chrome extension. PhishNet was developed to detect phishing sites on the web. PhishNet provides a convenient 

solution to reduce the risk of phishing which in return would effectively alleviating the fear users feel when submitting 

sensitive information on the web. 

The paper was able to evaluate the performance of three machine learning tools namely Random Forest, K-Nearest Neigh- 

bor and Support Vector Machine. The result given proved that Random Forest performed better than the other two machine 

learning tools. Consequently, our study has proffered a better solution to the issue of phishing attack on web pages. 

The major shortcoming of this approach is its over-dependence on webpage content. Further work is hereby recom- 

mended in this area for future study. 

Furthermore, the paper recommends more data to be gathered in the future to get more accurate results and the use of

other machine learning tools could be tested. 

Funding 

This is to certify that this research work did not receive any funding. 

Declaration of Competing Interest 

The authors wish to declare that they have no conflict of interest. 

References 

[1] Eurostat, 2018. Internet banking on the rise . [Online] Available at: https://ec.europa.eu/eurostat/web/products-eurostat-news/-/DDN-20180115-1 . [Ac- 
cessed 22 August, 2020] 

[2] Statista, 2018. Share of internet users who did online banking on a computer in the United States in 2018, by age . [Online] Available at: https://www.
statista.com/statistics/228067/people- in- households- with- online- banking- usa/ . [Accessed 20 June, 2020] 
14 

https://ec.europa.eu/eurostat/web/products-eurostat-news/-/DDN-20180115-1
https://www.statista.com/statistics/228067/people-in-households-with-online-banking-usa/


T.O. Ojewumi, G.O. Ogunleye, B.O. Oguntunde et al. Scientific African 16 (2022) e01165 

 
[3] Statista, 2018. Online industries most targeted by phishing attacks as of the 3rd quarter of 2018 . [Online] Available at: https://www.statista.com/
statistics/266161/websites- most- affected- by-phishing/ .[Accessed 10 July, 2020] 

[4] W. Ali, Phishing website detection based on supervised machine learning with wrapper features selection, Int. J. Adv. Comput. Sci. Appl. 8 (2017)
72–78 January . 

[5] A. Mishra, B.B. Gupta, Intelligent phishing detection system using similarity matching algorithms, Int. J. Inf. Commun. Technol. 12 (2018) 51–73 . 
[6] M. Mohigimi, A.Y. Varjani, New rule-based phishing detection method, Expert Syst. Appl. 53 (2016) 231–242 . 

[7] V. Muppavarapu, A. Rajendran, S. Vasudevan, Phishing detection using RDF and Random Forests, Int. Arab J. Inf. Technol. (2018) 817–824 September . 

[8] O.K. Sahingoz, S.I. Baykal, D. Bulut, Phishing detection from urls by using neural networks, Comput. Sci. Inf. Technol. 8 (17) (2018) 41–54 . 
[9] O.K. Sahingoz, E. Buber, Ö. Demir, B. Diri, Machine learning-based phishing detection from URLs, Expert Syst Appl 117 (2019) 345–357 1 March . 

[10] H. Abutair, A. Belghith, S. AlAhmadi, CBR-PDS: a case-based reasoning phishing detection system, J. Ambient Intell. Humaniz. Comput. 109 (2017)
281–288 . 

[11] Baraniuk, C., 2017. Google and Facebook duped in huge ’scam .’ [Online] Available at: https://www.bbc.com/news/technology-39744007 . [Accessed 10 
September, 2020] 

[12] C.L. Tan, K.L. Chiew, W. Koksheik, S.N. Sze, PhishWHO: phishing webpage detection via identity keywords extraction and target domain name finder,
Decis. Support Syst. 88 (2016) 18–27 . 

[13] G. Varshney, M. Misra, P.K. Atrey, A phish detector using lightweight search features, Comput. Secur. 62 (C) (2016) 213–228 . 

[14] Webroot, 2019. 2019 Webroot Threat Report . [Online] Available at: https://www.webroot.com/us/en/about/press-room/releases/2019-webroot-threat 
-report . [Accessed 20 March, 2020] 

[15] P. Yi, Web phishing detection using a deep learning framework, in: Wirel. Commun. Mob. Comput., 2018, Hindawi, 2018, pp. 1–9 . 
15 

https://www.statista.com/statistics/266161/websites-most-affected-by-phishing/
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0004
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0005
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0006
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0007
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0008
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0009
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0010
https://www.bbc.com/news/technology-39744007
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0012
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0013
https://www.webroot.com/us/en/about/press-room/releases/2019-webroot-threat-report
http://refhub.elsevier.com/S2468-2276(22)00074-6/sbref0015

	Performance evaluation of machine learning tools for detection of phishing attacks on web pages
	Introduction
	Review of related literature
	Methodology
	Data collection
	Feature extraction
	IP address
	SSL security
	Number of dots
	Length of URL
	Resource domain difference feature set
	Resource access protocol feature set

	Model building and training
	k-Nearest neighbors
	Support vector machine
	Random Forest

	Model assessment
	Rule extraction
	Browser extension implementation

	System implementation results
	Data collection
	Feature extraction
	Model building and training
	k-Nearest neighbor(KNN)
	Support vector machine (SVM)
	Random Forest
	Summary of all models

	PhishNet implementation

	Conclusion and future work
	Funding
	Declaration of Competing Interest
	References