An Efficient Self Attention-Based 1D-CNN-LSTM Network for IoT Attack Detection and Identification Using Network Traffic

by

Tinshu Sasi
Bachelor of Technology (Computer Science and Engineering), Manav Rachna International University, 2014

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Computer Science in the Graduate Academic Unit of Computer Science

Supervisor(s): Rongxing Lu, PhD, Faculty of Computer Science
Arash Habibi Lashkari, PhD, Faculty of Computer Science
Examining Board: Sajjad Dadkhah, PhD, Faculty of Computer Science
Saqib Hakak, PhD, Faculty of Computer Science
Hamed Asgari Moslehabadi, PhD, Department of Mechanical Engineering

This thesis is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK
June, 2024
© Tinshu Sasi, 2024

Abstract

In the last 10 years, the Internet of Things (IoT) has played a crucial role in the digital transformation of society. However, it is also facing increased security vulnerabilities because of the wide range of devices it encompasses. This research presents a novel mechanism called the Self Attention-Based 1D-CNN-LSTM Network for detecting IoT attacks. The proposed mechanism achieves an impressive accuracy of 99.96% and efficiently differentiates between malicious and benign samples. By employing Shapley Additive Explanations (SHAP), we were able to identify important predictive features from the preprocessed data, which were retrieved using CICFlowMeter. This has strengthened the dependability of the model. In addition, we enhanced the model by training it on a smaller collection of features, resulting in shorter training time while preserving accuracy. We have also generated a novel IoT tabular dataset consisting of nine widely accessible IoT datasets, as specified in Table 5.1, to evaluate the model's robustness and showcase its efficacy in IoT security.

Dedications

I dedicate my thesis to my spouse, Sweety Sinha, for her unwavering support and encouragement.

Acknowledgments

I express my heartfelt gratitude to my professors, Dr. Rongxing Lu and Dr. Arash Habibi Lashkari, for their unwavering assistance and direction. They have provided me with motivation and support throughout the whole process. I express my gratitude to the University of New Brunswick, specifically the Faculty of Computer Science, for providing me with the chance and assistance during my program. I am also thankful to all the professors who imparted their knowledge to me throughout the course of the program. I express my gratitude to the members of my examining committee for their valuable input and suggestions.

Table of Contents

Abstract . . . ii
Dedications . . . iii
Acknowledgments . . . iv
Table of Contents . . . v
List of Tables . . . viii
List of Figures . . . x
Abbreviations . . . xii
1 Introduction . . . 1
1.1 Introduction . . . 1
1.2 What is an IoT architecture? . . . 2
1.3 What are IoT attacks? . . . 5
1.4 What is the difference between IoT and IT attacks? . . . 6
1.5 Problem Statement . . . 7
1.6 Summary of Contributions . . . 8
1.7 Thesis Organization . . . 8
2 Background of IoT Attacks . . . 10
2.1 Technical Terms . . . 10
2.2 Common IoT Vulnerabilities . . . 11
2.3 Summary of IoT Attacks in the Training Dataset . . . 15
2.4 Concluding Remarks . . . 23
3 Literature Review . . . 24
3.1 Overview . . . 24
3.2 Literature Review . . . 24
3.3 Different techniques used to perform IoT attack detection . . . 30
3.4 Concluding Remarks . . . 34
4 Proposed Method . . . 36
4.1 Motivation . . . 36
4.2 CICFlowMeter . . . 37
4.3 Data Pre-Processing . . . 37
4.4 What are Convolutional Neural Networks? . . . 39
4.4.1 1D-CNN . . . 40
4.4.2 2D-CNN . . . 40
4.5 CNN Operations . . . 40
4.5.1 Convolution . . . 41
4.5.2 Activation Function . . . 42
4.5.3 Pooling . . . 43
4.5.4 Fully-Connected Layers . . . 44
4.5.5 Dropout Layer . . . 45
4.5.6 Loss Function . . . 46
4.5.7 Optimizer . . . 47
4.6 What are Residual blocks? . . . 48
4.7 What are Long Short Term Memory Networks? . . . 49
4.8 What are Attention Layers? . . . 51
4.8.1 Self-Attention . . . 52
4.9 What are SHAP values? . . . 55
4.9.1 SHAP Feature Importance Scores . . . 57
4.10 Proposed Mechanism . . . 58
4.10.1 Pre-processing - Stage I . . . 58
4.10.2 Proposed Architecture - Stage II . . . 61
4.11 Concluding Remarks . . . 63
5 Experiments & Results . . . 65
5.1 Overview . . . 65
5.2 Experimental Setup . . . 65
5.3 Datasets . . . 66
5.4 Features . . . 78
5.5 Finalizing the Proposed Model . . . 84
5.5.1 Performance Metrics . . . 85
5.6 Experimental Results . . . 87
5.7 Concluding Remarks . . . 100
6 Conclusion & Future Works . . . 102
6.1 Conclusions . . . 102
6.2 Future Works . . . 103
Bibliography . . . 105
Vita

List of Tables

1 List of Acronyms and Abbreviations . . . xii
1.1 IoT Attacks Vs IT Attacks . . . 6
2.1 Top 10 OWASP IoT Vulnerabilities . . . 11
3.1 Comparison of IoT Attack Surveys . . . 25
4.1 Activation Functions [10] . . . 42
4.2 Activation Functions Part 2 [10] . . . 43
4.3 Pooling Functions [36] . . . 44
4.4 Loss Functions [6] . . . 46
4.5 Optimizers [28] [20] . . . 47
4.6 Hyperparameters List . . . 62
5.1 Augmented Datasets . . . 66
5.2 List Of All Malicious Activities Present In All Datasets in Table 5.1 [60] [77] . . . 73
5.3 CICFlowMeter Extracted Features List . . . 78
5.4 Baseline Models Evaluation Results . . . 88
5.5 Our Model Accuracy List . . . 92
5.6 Confusion Metrics For Initial Model Across All Datasets . . . 92
5.7 Top 7 Best Features For Retraining . . . 95
5.8 Reduced Set's Features Accuracy List . . . 97
5.9 Confusion Metrics For 7 Feature Reduced Model Across All Datasets . . . 100

List of Figures

1.1 IoT Architecture Levels . . . 3
3.1 Overview of Available IoT Attack Detection and Identification Methods . . . 31
4.1 Data Pre-processing . . . 37
4.2 CNN Architecture . . . 39
4.3 Dropout Layer . . . 45
4.4 Residual Block Cell . . . 48
4.5 LSTM Cell . . . 50
4.6 LSTM Network . . . 50
4.7 Self Attention Mechanism . . . 52
4.8 Model Architecture . . . 59
4.9 Residual Block . . . 60
5.1 Comparison between Baseline Models and Our Model . . . 88
5.2 Our Model's Training History Results . . . 90
5.3 Our Model's Confusion Matrix Results . . . 91
5.4 SHAP Summary Plot for Our Model (Trained on CIC-BCCC-NRC-IoT-2023 dataset) . . . 93
5.5 Feature Importance Graph for CIC-BCCC-NRC-IoT-2023 dataset . . . 94
5.6 Reduced Set Feature's Training History Results . . . 99
5.7 Reduced Set Feature's Confusion Matrix Results . . . 99
5.8 Performance Comparison between Initial Model vs Reduced Feature Set Model . . . 101
Abbreviations

Table 1: List of Acronyms and Abbreviations
Acronym/Abbreviation | Description
IoT | Internet of Things
CAN | Controller Area Network
IIoT | Industrial Internet of Things
AGVs | Automated Guided Vehicles
DTPs | Data Transfer Protocols
MQTT | Message Queue Telemetry Transport
CoAP | Constrained Application Protocol
DDS | Data Distribution Service
AMQP | Advanced Message Queuing Protocol
IT | Information Technology
TCP | Transmission Control Protocol
CNN | Convolutional Neural Networks
LSTM | Long Short Term Memory
SHAP | Shapley Additive Explanations
OWASP | Open Worldwide Application Security Project
ACL | Access Control List
DDoS | Distributed Denial-of-Service
ICMP | Internet Control Message Protocol
MTU | Maximum Transmission Unit
UDP | User Datagram Protocol
DNS | Domain Name System
C&C | Command & Control
GRE | Generic Routing Encapsulation
ETH | Ethernet
MitM | Man-in-the-Middle
ARP | Address Resolution Protocol
MAC | Media Access Control Address
HTTPS | Hypertext Transfer Protocol Secure
BC | Blockchain
FC | Fog Computing
EC | Edge Computing
ML | Machine Learning
SIEM | Security Information and Event Management
DL | Deep Learning
RNN | Recurrent Neural Network
KNN | K-Nearest Neighbors
ReLU | Rectified Linear Unit
ELU | Exponential Linear Unit
BN | Batch Normalization
MSE | Mean Squared Error
MAE | Mean Absolute Error
SGD | Stochastic Gradient Descent
Adam | Adaptive Moment Estimation
RMSProp | Root Mean Squared Propagation
AdaGrad | Adaptive Gradient Algorithm
CIC | Canadian Institute for Cybersecurity

Chapter 1
Introduction

1.1 Introduction

The digital revolution has significantly transformed our lives, with the Internet of Things (IoT) playing a pivotal role. However, the rapid development of IoT in most corners of life leads to various emerging cybersecurity threats. Therefore, detecting and preventing potential attacks in IoT networks have recently attracted paramount interest from academia and industry. Among various attack detection approaches, machine learning-based methods, especially deep learning, have demonstrated outstanding potential thanks to their early detection capability [78].

Over the last several years, there has been a substantial increase in the number of attacks specifically aimed against Internet of Things (IoT) devices. These encompass various types of cyber attacks, such as infiltrating wireless webcams to gain unauthorized access to surveillance cameras and violate user privacy; targeting implantable cardiac devices to potentially deplete battery life, disrupt pacing, and cause electric shocks, thereby endangering patients' lives; compromising children's smartwatches to expose their location data and personal information, posing multiple safety risks; and manipulating the Controller Area Network (CAN) bus of vehicles to potentially alter speed and direction, thereby posing threats to public safety. An investigation into smart home hacking found that, over one week, fraudsters and unidentified groups launched over 12,000 attacks against various smart home devices, including TVs, thermostats, smart kettles, and security systems [79] [80].

1.2 What is an IoT architecture?

The IoT architecture refers to the organization and setup of IoT devices to fulfill users' particular requirements and requests. An IoT system is divided into three to seven layers, depending on its complexity, and each layer has a specific function.
The lack of established protocols in the architecture of the Internet of Things (IoT) presents further difficulties regarding interoperability, security, and several other issues. The Internet of Things (IoT) architecture can include up to seven levels [70]. Figure 1.1 provides a detailed illustration of the Internet of Things (IoT) and Industrial Internet of Things (IIoT).

Figure 1.1: IoT Architecture Levels (Perception, Network, Edge, Processing, Application, Business, and Security layers, spanning data collection through control and optimization)

• Perception Layer: The perception layer, also known as the device layer, includes a variety of sensors such as RFID scanners, security cameras, GPS modules, and so on. These devices may be used with industrial gear, such as conveyor systems, industrial robots, and automated guided vehicles (AGVs). These gadgets collect sensory data, monitor the production floor and surroundings, transport raw materials, and so forth [64].

• Transport/Network Layer: The Transport/Network layer is responsible for transferring data to the processing systems of the subsequent layer [64]. IoT gateways must first transform the incoming input from analogue to digital format. Subsequently, the gateway can transmit the data to a local or cloud data center via several data transfer protocols (DTPs). Some of the leading IoT protocols are Bluetooth, Wi-Fi, Zigbee, Z-Wave, 6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks), MQTT (Message Queue Telemetry Transport), CoAP (Constrained Application Protocol), DDS (Data Distribution Service), and AMQP (Advanced Message Queuing Protocol) [70] (a small MQTT publishing sketch follows this list).

• Edge Layer: The edge layer in an IoT configuration consists of the physical hardware, embedded operating system, and device firmware. With the growing number of interconnected devices, latency becomes a prominent issue in more extensive IoT networks. Edge computing, aided by the edge layer, resolves this problem by allowing data processing and analysis close to the data source. Latency, the period of time between the detection of an event and the execution of an action, is a critical concern for devices that are linked to a network. Reducing latency may be accomplished by placing processing resources close to the sensors or at the network edge, enabling prompt connection and data exchange between devices [56].

• Processing Layer: The Processing layer, sometimes called the Middleware layer, comprises servers and databases. Its primary functions include decision-making, executing optimization algorithms, and storing large amounts of data [64]. This layer consists of cloud computing platforms that can analyze and interpret data from the physical environment. The system processes unprocessed sensor data and transforms it into relevant insights using cloud services and comprehensive data modules. Furthermore, the processing layer allows the system to promptly respond to inputs and outputs. It can make assessments and carry out tasks based on the information it receives. The data received in the perception step is used in this layer to generate predictions and provide insights [69].

• Application Layer: The application layer of IoT infrastructure is responsible for data analysis to solve business issues or achieve specific goals. The application layer provides customized functionality to meet the unique requirements of end users. The applications and services in this layer are constructed on top of the processing layer.
Software tools facilitate converting data from the processing layer into meaningful information that humans can understand or that automated processes can use [4].

• Business Layer: The business layer acts as the central point where choices and solutions are developed from the data analysis conducted in the application layer. The application layer may consist of several instances inside this layer. At the business layer, identifiable patterns from the application layer are used to gain a deeper understanding of business insights, predict future trends, and make operational decisions that improve productivity, security, cost-effectiveness, customer satisfaction, and other important business factors. In addition, the business layer is accountable for overseeing commercial transactions and models related to interconnected devices. It includes the administration of business processes, data analysis, and rules implementation. It serves as the basis for controlling business logic and setting up procedures to achieve all the business goals of an IoT system [70] [69] [4].

• Security Layer: The security layer is present across all levels of the IoT architecture and is essential for the efficiency of an IoT solution. Considering that Internet of Things (IoT) devices often deal with confidential information, it is crucial to implement strong security measures [70] [4].
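To make the transport layer concrete, the following minimal sketch shows a device publishing a sensor reading over MQTT, one of the protocols listed above. It is purely illustrative: the broker hostname, topic, and payload fields are hypothetical placeholders, and it assumes the paho-mqtt package is installed.

```python
# Illustrative only: an IoT device publishing a reading over MQTT.
# The broker hostname, topic, and payload fields are hypothetical.
import json
import paho.mqtt.publish as publish

reading = {"device_id": "thermostat-01", "temperature_c": 21.5}
publish.single(
    "home/livingroom/temperature",      # topic
    payload=json.dumps(reading),
    hostname="broker.example.local",    # placeholder broker address
    port=1883,
    qos=1,                              # at-least-once delivery
)
```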
1.3 What are IoT attacks?

IoT attacks constitute cyberattacks leveraging IoT devices to access consumers' sensitive data. Typically, attackers deploy malware on these devices, causing damage or infiltrating additional organizations' data. Due to insufficiently designed security mechanisms, IoT devices emerge as prominent vulnerabilities within organizational infrastructures, posing substantial security risks. Basic IoT devices often lack robust built-in security measures to counter cyber threats. Given their limited functionalities and purposes, security considerations for such devices are frequently overlooked, rendering them susceptible to cyberattacks. Hackers and organizations can use common flaws and "zero-day exploits" to attack IoT devices in various ways [60].

1.4 What is the difference between IoT and IT attacks?

Table 1.1: IoT Attacks Vs IT Attacks
Category | IoT Attacks | IT Attacks
Attack Surface | Limited resources, higher vulnerability. | Robust security, lower vulnerability.
Diversity of Devices | Varied types, complex security. | Standardized, simplified security.
Impact | Harmful physical consequences. | Data theft, service disruption.
Legacy Devices | Older devices, no updates, higher risk. | Regular updates, lower risk.

Unlike conventional Information Technology (IT) attacks, Internet of Things (IoT) attacks present distinct issues that necessitate specific security solutions to fully minimize the associated dangers. The distinction between IT and IoT attacks is outlined in Table 1.1. Some of the differences are:

• Attack surface: IoT devices often possess limited processing capabilities and resources, resulting in potential deficiencies in security features compared to traditional IT systems. Consequently, IoT devices are more susceptible to attacks due to reduced defences [60].
• Diversity of devices: The wide array of IoT device types, varying in form factor, operating systems, and network connectivity, complicates establishing standardized security measures. This diversity renders certain devices more prone to vulnerabilities and targeted attacks [60].
• Physical impact: Many IoT devices are integral to critical infrastructure or life-sustaining systems, such as medical equipment, thus exposing them to cyberattacks with severe physical ramifications. In contrast, typical IT attacks aim to compromise data integrity or disrupt services [60].
• Legacy devices: IoT devices often have extended lifespans, resulting in a proliferation of older, unsupported devices. The inability of legacy devices to receive software updates or security patches renders them particularly vulnerable to exploitation or compromise [60].

1.5 Problem Statement

As discussed in the previous sections, the proliferation of IoT devices across various industries underscores the importance of safeguarding these devices against potential cyberattacks. Identifying and discerning malicious and benign instances by analyzing time-related features extracted from IoT TCP flow network traffic data presents a formidable challenge within the network security domain. Determining the optimal features for detecting malicious and benign instances among the extracted features and establishing their efficacy and reliability constitutes a critical inquiry. Given the escalating complexity and volume of IoT network data, it is imperative to devise robust detection methodologies to distinguish between malicious and benign activities accurately. Addressing this imperative is crucial for fortifying IoT systems and networks against potential security vulnerabilities, thereby upholding the integrity and reliability of interconnected devices and services.

1.6 Summary of Contributions

This research endeavour mainly focuses on safeguarding IoT devices by identifying IoT attacks and analyzing time-related features derived from IoT network traffic TCP flow data. This will be achieved using deep learning algorithms to distinguish between malicious and benign samples. To summarize, this thesis has made the following contributions:

• Our proposal uses a Self Attention-based 1D-CNN and LSTM network to identify IoT attacks by analyzing time-related features collected from TCP flow data.
• To enhance the model's credibility, the model will be evaluated on eight publicly accessible external datasets that have been enhanced and pre-processed using CICFlowMeter.
• Determining the optimal hyperparameter combination for the model with the highest performance metrics.
• Employing Shapley Additive Explanations (SHAP) to compute feature importance scores and utilize them to identify the most optimal features from the extracted feature list.
• Reducing the model size by retraining it on the most optimal features while maintaining comparable performance to the original model.
• Releasing the CIC-BCCC-NRC TabularIoTAttack-2024 dataset, which includes over 80 extracted features from eight IoT datasets.

1.7 Thesis Organization

The structure of the remainder of this thesis is outlined as follows:

1. Chapter 2: Background of IoT Attacks provides a fundamental understanding of the topic, including a discussion of key technical terms associated with the subject matter, mainly focusing on IoT attacks. It also summarizes the IoT attacks mentioned in the training datasets.
2. Chapter 3: Literature Review reviews prior research on IoT attacks, examining the detection methodologies using machine learning and deep learning techniques. This chapter also offers a comprehensive classification of IoT attack types.
3. Chapter 4: Proposed Method outlines the motivation behind the proposed approach, detailing the preprocessing steps and describing the model architecture, including a thorough explanation of each principal component within the architecture.
4. Chapter 5: Experiments & Results presents the outcomes of the implemented research. This chapter covers the dataset, experimental setup, features, metrics, and the results obtained from the proposed method.
5. Chapter 6: Conclusion & Future Works concludes the thesis by summarizing the contributions, discussing encountered challenges, and providing insights into potential future research directions.

Chapter 2
Background of IoT Attacks

2.1 Technical Terms

Before delving deeper into the intricacies of IoT attacks, acquiring a fundamental understanding of the subject is imperative. It is essential to thoroughly comprehend the various technical terminologies associated with this study area [60].

• Vulnerability: It denotes the intrinsic weaknesses of a system or its design, which permit unauthorized entities to execute commands, access data without appropriate authorization, and possibly initiate denial-of-service attacks. Such vulnerabilities can be detected across various domains of IoT systems. They may appear in the system's hardware or software, the protocols and procedures implemented within these systems, and even in the behaviours and actions of the users interacting with the system [60].
• Exposure: It refers to a problem or mistake within the system configuration that allows an unauthorized person to undertake actions to obtain information [60].
• Threats: It refers to an intentional or unintentional action that exploits vulnerabilities within a system [60].
• Attacks: It is defined as deliberate actions to damage a system or disrupt its normal operations by exploiting vulnerabilities using various strategies and tools. Attackers engage in these hostile acts to achieve specific goals, which may be motivated by personal gratification or financial gain [60].

2.2 Common IoT Vulnerabilities

Table 2.1: Top 10 OWASP IoT Vulnerabilities
Rank | Vulnerability | Description
1 | Weak, Guessable, or Hardcoded Passwords | Inadequate credentials susceptible to brute force attacks or publicly accessible, including backdoors in firmware or client software.
2 | Insecure Network Services | Presence of unnecessary or vulnerable network services, particularly online access, threatening information confidentiality, integrity, and availability.
3 | Insecure Ecosystem Interfaces | Presence of insecure online, backend API, cloud, or mobile interfaces outside the device ecosystem, potentially compromising device security.
4 | Lack of Secure Update Mechanism | Inability of devices to undergo secure updates, lacking firmware validation, secure delivery, anti-rollback procedures, or alerts on security changes.
5 | Use of Insecure or Outdated Components | Utilization of outdated or vulnerable software or hardware components, potentially exposing devices to unauthorized access.
6 | Insufficient Privacy Protection | Insecure storage or access of users' personal information within the ecosystem.
7 | Insecure Data Transfer and Storage | Absence of encryption or access control measures for sensitive data throughout the ecosystem.
8 | Lack of Device Management | Absence of adequate security support for production devices, leading to deficiencies in asset and update management.
9 | Insecure Default Settings | Distribution of devices with unsecured default configurations or limited user control over settings.
10 | Lack of Physical Hardening | Absence of physical security measures, enabling attackers to access critical information or assume local control.

Table 2.1 showcases the ten most significant vulnerabilities in IoT according to OWASP that make IoT devices susceptible to IoT attacks [7].

• Weak/Default Passwords: Absence of a strong password recovery system; weak or default passwords; non-implementation of stricter password rules; inability to change the username and password associated with the account.
• Insecure Network Services: Adversaries exploit vulnerabilities in IoT devices' communication protocols and services to gain unauthorized access and undermine the confidentiality of sensitive information transmitted between the device and a server.
• Insecure Ecosystem Interfaces: The vulnerability of the device or its connected components arising from insecure web, backend API, cloud, or mobile interfaces in the external ecosystem.
• Lack of Secure Update Mechanism: Insufficient ability to update devices securely. These deficiencies encompass the absence of device-based firmware validation, the unsecured transmission of data without encryption, the lack of anti-rollback methods, and the failure to provide notifications regarding security changes resulting from updates.
• Use of Insecure or Outdated Components: The utilization of obsolete or insecure software components or libraries that could expose the device to potential attacks. This entails the utilization of external software or hardware obtained from a compromised supply chain, as well as the unsecured modification of system platforms. The security of the IoT ecosystem may be compromised by vulnerabilities in software dependencies or outdated systems.
• Insufficient Privacy Protection: Users' personal information is saved on the device or in the ecosystem and used accidentally, improperly, or illegally. Information about one's health, energy use, and driving habits can fall into this category of privacy concerns. Privacy is at risk without adequate safeguards, and there can be legal consequences for failing to take the necessary precautions.
• Insecure Data Transfer and Storage: The absence of encryption or access control for sensitive data at any point in the ecosystem, including when it is stored, transferred, or processed. Data is critical in ensuring the reliability and integrity of IoT applications since it is used in automated controls and decision-making processes. Unauthorized access or usage will result in adverse consequences.
• Lack of Device Management: Lack of security support for production-ready devices, including asset management, update management, secure decommissioning, systems monitoring, and response capabilities. Unauthorized devices can access business networks, monitor activity, and intercept data if exposed to the IoT ecosystem.
• Insecure Default Settings: Systems or devices that come with insecure default settings or lack the ability to enhance system security by restricting users from modifying configurations. Once the settings have been acquired, attackers can exploit hardcoded default passwords, concealed backdoors, and weaknesses in the device firmware. The user encounters difficulty in simultaneously modifying various parameters.
• Lack of Physical Hardening: The absence of physical safeguards allows potential attackers to get sensitive information that could be utilized in subsequent large-scale attacks or local device takeover. Internet of Things (IoT) devices are deployed in distant and dispersed environments. An attacker can disrupt the services offered by IoT devices by gaining access to the physical layer and making alterations.

2.3 Summary of IoT Attacks in the Training Dataset

1. ACK Fragmentation: A Fragmented ACK attack is a modified version of the ACK & PSH-ACK Flood, where 1500-byte packets are utilized to monopolize the target network's bandwidth while maintaining a modest packet rate. Applying application-level filters on network equipment, such as routers, would require the equipment to reassemble the packets, consuming a significant amount of its resources. Without any filters, these attack packets can traverse various network security devices, such as routers, ACLs, and firewalls, without being noticed. The fragmented packets typically consist of irrelevant or useless data, as the attacker's objective is to fully utilize the target network's available capacity. Similar to other DDoS attacks, the objective of a DDoS Fragmented ACK attack is to obstruct the service for other users by impeding the target or causing it to crash with irrelevant data [57].

2. ICMP Flood: An ICMP flood is a form of denial-of-service (DoS) attack, which exploits the Internet Control Message Protocol (ICMP) by utilizing echo-requests and echo-replies, often known as pings, to assess the operational status and connectivity of a device. An ICMP flood attack, also known as a "ping flood attack," occurs when attackers overrun the bandwidth of a specific network router or IP address. They achieve this by inundating the router or IP address with carefully prepared ICMP packets, causing it to become overloaded and unable to transfer traffic to the next downstream hop. When the device attempts to react, it uses up all of its available resources (such as memory, processing power, and interface rate), which prevents it from fulfilling genuine requests or serving consumers [5].

3. ICMP Fragmentation: An IP/ICMP fragmentation DDoS attack is a prevalent type of volumetric denial-of-service (DoS) attack. Datagram fragmentation algorithms are employed to inundate the network during such an attack. IP fragmentation involves dividing IP datagrams into smaller packets, transmitting them over a network, and then reassembling them back into the original datagram during regular communication. This technique is essential to adhering to the size constraints that each network can handle. The maximum transmission unit (MTU) is an upper bound on the data size that can be transmitted. A packet that exceeds the maximum size must be divided into smaller fragments to ensure successful transmission. As a result, multiple packets are transmitted, one of which includes comprehensive information about the packet, such as the source/destination ports, length, and other relevant details. The remaining pieces lack further components and only contain an IP header and a data payload. These fragments lack information regarding protocol, size, or ports. The attacker might specifically utilize IP fragmentation to target communication systems and security components. ICMP-based fragmentation attacks commonly include submitting forged fragments that cannot be reassembled.
Consequently, the fragments are stored temporarily, occupying memory and potentially depleting all available memory resources. This DDoS attack involves the transmission of counterfeit UDP or ICMP packets. These packets are intentionally crafted to appear larger than the network's maximum transmission unit (MTU). However, only certain portions of the packets are transmitted. Since the packets are counterfeit and cannot be reconstructed, the server's resources are rapidly depleted, resulting in its unavailability to genuine traffic [54].

4. PSH ACK Flood: ACK or PUSH ACK packets are utilized bidirectionally to transmit data until the session is terminated after establishing a connection between the host and the client. A victim server, vulnerable to an ACK flood attack, receives faked ACK packets that do not correspond to any sessions in the server's connection list. The targeted server exhausts its system resources to ascertain the legitimacy of the falsified packets within a session, leading to a decline in performance and limited service availability [24].

5. DDoS RST FIN Flood: To terminate a TCP SYN session, the client and the host exchange RST or FIN packets. During an RST or FIN flood, the targeted server is bombarded with a high volume of faked RST or FIN packets not associated with any of the sessions stored in the server's database. The affected server must commit substantial system resources to correlate incoming packets with existing connections, leading to diminished server performance and partial unavailability [26].

6. SYN Flood: A SYN flood, also known as a half-open attack, is a denial-of-service (DDoS) attack that seeks to render a server inaccessible to genuine traffic by depleting its available resources. Through the repetitive transmission of initial connection request (SYN) packets, the attacker can inundate all accessible ports on the targeted server machine, resulting in the targeted device responding to genuine traffic sluggishly or not at all [14].

7. Synonymous IP Flood: In this attack, the victim server is inundated with a substantial influx of falsified TCP SYN packets whose headers carry identical source and destination addresses, both corresponding to the victim's address. The designated server initiates the utilization of system resources to process every packet [27].

8. DDoS TCP Flood: A TCP SYN Flood attack aims to exploit the TCP three-way handshake procedure, which is fundamental for establishing connections in TCP/IP networks. A TCP SYN Flood attack involves the deliberate sending of several SYN requests to a target server without ever sending the final ACK. As a result, the server remains idle, awaiting a response that it never receives, which leads to the utilization of resources for each of these partially established connections [34].

9. UDP Flood: A UDP Flood attack is a form of volumetric Denial of Service (DoS) attack that takes advantage of the User Datagram Protocol (UDP). UDP, unlike TCP, lacks session and connection features, making it an exceptional target for attackers. A UDP Flood attack involves an attacker inundating the victim system with an enormous volume of UDP packets directed at random ports. This influx of packets compels the host to:
• Scan for active programs on each port.
• Recognize that no active applications are listening on several ports.
• Reply with an ICMP Destination Unreachable packet using the Internet Control Message Protocol (ICMP).
The large quantity of UDP packets forces the targeted system to emit a plethora of ICMP packets. This can cause the system to become inaccessible to authorized customers. To further conceal their harmful actions, attackers may falsify the IP address of the UDP packets. This guarantees that the influx of return ICMP packets is prevented from reaching them, essentially concealing their whereabouts [35].

10. UDP Fragmentation: It is one of the variations of UDP flood. The distinguishing factor lies in utilizing packets of the greatest permissible size to saturate the channel with the least possible number of packets. Given that these packet pieces are counterfeit and unrelated to genuine data, the targeted server that receives them will allocate resources to reconstruct non-existent packets from the counterfeit fragments. Eventually, this will lead to the depletion of system resources and the subsequent server crash, or result in the overflow of channels. Like a UDP flood, this attack is challenging to screen and has a greater risk of channel overflow [25].

11. DNS Spoofing: DNS spoofing is a cyberattack in which a hacker deceives a computer or network into thinking it is interacting with a genuine website or server. Essentially, the computer engages with a counterfeit website or server established by the attacker. This deceitful practice directs people to fraudulent websites, exposing them to the risks of identity theft, financial fraud, malware, and other online concerns [63].

12. HTTP Flood: An HTTP flood is a form of Distributed Denial of Service (DDoS) attack where the attacker takes advantage of apparently valid HTTP GET or POST requests to attack a web server or application. HTTP flood attacks are volumetric attacks that typically involve a botnet, a collection of compromised machines that have been fraudulently taken over, frequently with the help of software such as Trojan Horses. HTTP floods, a type of Layer 7 attack, do not rely on faulty packets, spoofing, or reflection techniques. They can bring down a targeted site or server with less bandwidth than other attacks. Consequently, they require a thorough comprehension of the specific site or application being attacked, and each attack must be meticulously tailored to ensure its effectiveness. This dramatically enhances the difficulty of detecting and obstructing HTTP flood attacks [33].

13. Mirai: The Mirai botnet is malware that was created to take control of Internet of Things (IoT) devices and transform them into remotely operated "bots" that can carry out highly impactful volumetric distributed denial-of-service (DDoS) attacks. The Mirai botnet conducts scans to identify susceptible IoT devices that possess open ports or utilize default usernames and passwords. Upon identifying these susceptible devices, it employs vulnerabilities to acquire entry and contaminates them with its malicious code. Once infected, the device becomes part of the Mirai botnet, enabling the attacker to issue commands from a central "command & control" (C&C) server. Once established, this command and control (C&C) server can be utilized to initiate extensive distributed denial-of-service (DDoS) attacks on websites, networks, and other digital infrastructure by harnessing the collective power of all the bots within the Mirai botnet simultaneously [58].

• Mirai GRE-ETH Flood: GRE, short for Generic Routing Encapsulation, is a protocol for creating virtual point-to-point connections over an IP network.
It allows for the encapsulation of many network layer protocols. DDoS scrubbing providers utilize GRE as part of their mitigation architecture. A GRE-ETH Flood is a network attack that explicitly targets network devices, such as routers and switches, by overwhelming them with excessive GRE (Generic Routing Encapsulation) and Ethernet (ETH) frames. This attack involves the attacker producing many GRE and Ethernet frames and directing them towards the target device to overpower its processing capabilities. The inundation of packets can deplete system resources such as CPU, memory, and bandwidth, making the device unusable or incapable of processing valid data. The primary aim of a GRE-ETH flood attack can differ, but typical objectives involve disrupting network services, inducing denial-of-service (DoS) or distributed denial-of-service (DDoS) scenarios, and perhaps exploiting vulnerabilities in the target device's management of GRE and Ethernet traffic [81].

• Mirai GRE-IP Flood: A GRE-IP Flood is a network attack characterized by the deliberate inundation of a network with a substantial number of GRE (Generic Routing Encapsulation) and IP (Internet Protocol) packets. The attack involves the attacker creating a substantial quantity of GRE-encapsulated IP packets and directing them towards the target network to exhaust its resources. The inundation of packets depletes network bandwidth, router computational capacity, and other resources, resulting in network congestion, deceleration, or even total unavailability of services. The purpose of a GRE-IP Flood attack might vary, but typical objectives include interrupting network operations, inducing denial-of-service (DoS) circumstances, or exploiting weaknesses in network infrastructure [81].

• Mirai UDP Plain: A UDP Plain attack, often referred to as a UDP flood attack, is a network-based denial-of-service (DoS) attack that explicitly targets a server or network infrastructure by overwhelming it with a large number of User Datagram Protocol (UDP) packets. In a UDP Plain attack, the attacker inundates the target server or network with a substantial volume of UDP packets, causing it to become overwhelmed and unable to handle the incoming packets effectively. In contrast to TCP, UDP is a connectionless protocol. It does not necessitate a handshake to establish a connection, allowing for the rapid generation and transmission of many packets. The UDP Plain attack uses UDP's stateless feature, allowing the attacker to transmit packets to the target without establishing a connection beforehand. This facilitates the initiation of extensive attacks by utilizing botnets or other automated mechanisms. Depending on the strength and capacity of the targeted infrastructure, a UDP Plain attack can cause various problems, such as slowing down networks, degrading services, or completely denying access. Furthermore, due to the absence of inherent procedures in UDP for confirming the delivery of packets or maintaining their order, the attack might potentially lead to the loss of packets or their delivery in an incorrect sequence, causing additional communication disruption.

14. MITM ARP Spoofing: ARP spoofing, also known as ARP poisoning, is a type of Man-in-the-Middle (MitM) attack that allows attackers to intercept communication between network devices. The methodology of this attack comprises the subsequent stages: Initially, the attacker acquires entry into the network and does a comprehensive examination of the network to ascertain the IP addresses of a minimum of two devices, usually a workstation and a router. Afterwards, the attacker uses spoofing tools such as Arpspoof or Driftnet to send fake ARP answers. The falsified responses falsely claim that the attacker's MAC address matches the IP addresses of both the router and the workstation, deceiving them into sending their traffic to the attacker's device instead of talking with each other directly. As a result, the ARP cache entries of the targeted devices are modified, redirecting their communication through the attacker's system. This allows the attacker to have access to all conversations. After successfully carrying out an ARP spoofing attack, the attacker can secretly listen in on conversations, except for those that are encrypted using methods like HTTPS; take control of sessions by obtaining session IDs to gain unauthorized entry to logged-in accounts; tamper with communication by, for example, sending harmful files or redirecting users to malicious websites; and initiate Distributed Denial of Service (DDoS) attacks by supplying the MAC address of the target server instead of their own, resulting in overwhelming traffic if done across multiple IPs [32] (a minimal monitoring sketch follows this list).
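As a complement to the attack descriptions above, the short sketch below illustrates how the ARP spoofing pattern just described can be surfaced in practice: it watches ARP replies and flags an IP address whose claimed MAC address changes. This is a minimal, hedged illustration only, not the detection approach proposed in this thesis; it assumes the scapy package is available and that the script runs with the privileges needed to sniff traffic.

```python
# Illustrative only: flag ARP replies in which the MAC address claimed
# for a known IP address changes, a telltale sign of ARP spoofing.
# Not part of the thesis pipeline; requires scapy and sniffing privileges.
from scapy.all import sniff, ARP

ip_to_mac = {}  # last MAC address seen claiming each IP address

def check_arp(pkt):
    if pkt.haslayer(ARP) and pkt[ARP].op == 2:  # op 2 = "is-at" (ARP reply)
        ip, mac = pkt[ARP].psrc, pkt[ARP].hwsrc
        if ip in ip_to_mac and ip_to_mac[ip] != mac:
            print(f"Possible ARP spoofing: {ip} moved from {ip_to_mac[ip]} to {mac}")
        ip_to_mac[ip] = mac

sniff(filter="arp", prn=check_arp, store=False)
```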
2.4 Concluding Remarks

This chapter provided a comprehensive summary of the essential background required for the thesis. Initially, we examined the fundamentals of the subject matter, including a thorough analysis of crucial technical terminology linked to the topic, with a specific emphasis on attacks related to the Internet of Things (IoT). We then extensively examined the IoT attacks present in the training datasets. Having established the context and concepts provided in this chapter, in Chapter 3 we will examine the work conducted in IoT attack detection, specifically concentrating on various machine learning and deep learning techniques.

Chapter 3
Literature Review

3.1 Overview

The rapid proliferation of the Internet of Things across various businesses has increased security problems. The first part of this chapter provides an overview of the research on attacks on the Internet of Things. It then surveys, based on this research, the various methods utilized to carry out Internet of Things attack detection:
• Machine Learning
• Deep Learning

3.2 Literature Review

Although there has been a lot of research on attacks targeting the Internet of Things (IoT), comprehensive literature on classifying these attacks is currently lacking. Therefore, our prior study introduced a new classification system that analyzes existing surveys and taxonomies [60].

Many academic publications thoroughly analyze many aspects of risks and attacks in the IoT field. The literary works [72] [55] [85] [29] [41] extensively examine the classification of risks and intrusions linked to the Internet of Things (IoT). These papers primarily focus on two main categories: the architectural features of the Internet of Things (IoT) and the protocols and standards used in the IoT area. Although there is a wealth of literature on threats and attack taxonomies, it is essential to highlight that just a few studies specifically focus on viable solutions.
Table 3.1: Comparison of IoT Attack Surveys
Year | Title | # Attacks discussed | Taxonomy | Attack to Vulnerability Mapping | Detection methods
2020 | A survey on privacy and security of Internet of Things [55] | 12 | Yes | No | Yes
2017 | A survey of intrusion detection in Internet of Things [85] | 0 | Yes | No | Yes
2019 | Intrusion detection systems in the Internet of things: A comprehensive investigation [29] | 9 | Yes | No | Yes
2018 | Internet of things security: A top-down survey [41] | 0 | Yes | No | Yes
2021 | State-of-the-Art Review on IoT Threats and Attacks: Taxonomy, Challenges and Solutions [42] | 59 | Yes | No | Yes
2020 | Machine learning based solutions for the security of Internet of Things (IoT): A survey [73] | 31 | Yes | No | Yes
2018 | IoT security: Review, blockchain solutions, and open challenges [39] | 19 | Yes | Yes | Yes
2018 | A Comprehensive IoT Attacks Survey Based on a Building-blocked Reference Model [2] | 51 | Yes | No | No
2020 | A Comprehensive Survey on Attacks, Security Issues and Blockchain Solutions for IoT and IIoT [64] | 22 | Yes | No | No
2022 | A Survey on IoT Security: Attacks, Challenges and Countermeasures [12] | 17 | Yes | No | Yes
2021 | A Survey on Security Attacks and Solutions in the IoT Network [45] | 17 | Yes | No | No
2021 | A survey on Classification of Cyber-attacks on IoT and IIoT devices [66] | 32 | Yes | No | Yes
2023 | A Comprehensive Survey on IoT Attacks: Taxonomy, Detection Mechanisms and Challenges [60] | 149 | Yes | Yes | Yes

However, it is essential to mention that the works discussed above do not offer any solutions to the risks and threats that emerge from the extensive use of pervasive technologies like blockchain (BC), fog computing (FC), edge computing (EC), and machine learning (ML). The authors have conducted surveys on diverse pervasive technologies for analyzing risks and attacks in a fragmented manner, as documented in the following sources: [22], [72], [75], [50], [74], and [42].

The authors of the study [15] conducted thorough research on security vulnerabilities in the Internet of Things (IoT) and presented an overview of machine learning methods used to counter these attacks. The authors analyzed 78 publications published until 2017, focusing on the solutions, issues, and areas of research that have not yet been addressed in this field. The authors of [82] conducted a survey in 2018 to explore several attack strategies that target the Internet of Things (IoT). These models include spoofing attacks, denial-of-service attacks, jamming, and eavesdropping. The authors also suggested possible security techniques to reduce these risks, such as IoT authentication, access control, malware detection, and safe offloading. The security solutions proposed in this study prominently included utilizing machine learning techniques [73].

3.3 Different techniques used to perform IoT attack detection

Identifying and mitigating these attacks is crucial to protecting IoT ecosystems' security. Several methods have been developed to address the problems of recognizing and responding to attacks. The following methods apply to the identification of attacks in the IoT domain.
Also, Figure 3.1 provides a thorough overview of several techniques for detecting attacks in the Internet of Things (IoT) domain:

Figure 3.1: Overview of Available IoT Attack Detection and Identification Methods (anomaly detection, behavioural analysis, honeypots, signature-based detection, SIEM, machine learning, and deep learning, with their sub-techniques)

• Anomaly Detection: Anomaly identification, also known as outlier or event detection, is the analytical process of identifying unusual situations inside a particular system. The anomaly detection algorithms assess incoming traffic at multiple levels, ranging from the IoT network level to the data centre. Anomaly detection is crucial because it enables the identification and analysis of anomalies within IoT data. Although rare, these anomalies can offer useful insights and practical information in many sectors, including healthcare, industry, finance, transportation, and energy. Anomaly detection in the Internet of Things (IoT) is employed in the betting and gambling sector to discover insider trading cases by analyzing trade activity patterns [13].

• Behavioural Analysis: Dynamic code analysis refers to identifying and resolving issues with potentially harmful software within a physical or virtual setting. The program's source code is run with different test inputs to identify security vulnerabilities that may occur due to its code when interacting with other programs or systems. Dynamic analysis is a technique used to study the behaviours of attacks on IoT devices [47]. Utilizing behaviour analysis for detecting IoT attacks offers numerous benefits compared to static analysis. Dynamic analysis can identify known and zero-day threats by examining similar patterns of behaviour exhibited by several attackers. Dynamic analysis is performed by employing sandbox tools like Cuckoo Sandbox or CWSandbox, which allow for the monitoring of malware behaviours in real time [47].

• Signature-based Detection: Signature-based detection (SGD) requires security professionals to create predefined rules or signatures to recognize known attack patterns. This method is especially efficient in identifying well-known attacks with signatures stored in the database. On the other hand, the database cannot identify unknown attacks without signatures.

• Honeypots: A honeypot is a cybersecurity tool that creates a realistic and valuable network to lure potential attackers. It functions within a network environment that is both isolated and segregated. The system can be seen as a simulated entity created to imitate a genuine system to attract potential attackers to interact with it. This method allows for the surveillance of the subsequent interaction between the attackers and the compromised device [59].

• Security Information and Event Management: Security Information and Event Management (SIEM) is a security system that assists enterprises in detecting and addressing security threats and weaknesses, preventing potential disruptions to business operations. SIEM systems are essential tools for corporate security teams to detect anomalies in user behaviour.
Moreover, these systems utilize artificial intelligence (AI) to simplify and automate specific labour-intensive operations associated with detecting possible threats and subsequent incident response [1].

• Machine Learning: Machine learning (ML) approaches are crucial in identifying and reducing IoT attacks. These techniques use different algorithms to recognize unusual patterns in network traffic and device activity. The algorithms are trained using extensive datasets, including regular and malicious IoT traffic. This allows them to learn the unique characteristics that distinguish various attacks. Machine learning models, such as decision trees, support vector machines, and random forests, can accurately categorize network traffic as benign or malicious by analyzing patterns they have learned (a minimal baseline sketch follows this list). Furthermore, machine learning algorithms can adjust to changing attack techniques by consistently retraining on updated datasets, thus improving their ability to identify attacks over time. In addition, intrusion detection systems that utilize machine learning may function in real time, promptly notifying security administrators when they identify suspicious behaviours. This allows for quick reactions to possible attacks [60].

• Deep Learning: Deep learning (DL), a branch of machine learning (ML), provides sophisticated capabilities for identifying attacks in the Internet of Things (IoT) by automatically extracting complex features from raw data without requiring manual feature engineering. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are frequently used in deep learning-based intrusion detection systems to secure the Internet of Things (IoT). CNNs are highly effective at identifying spatial patterns in network traffic data, while RNNs are skilled at capturing temporal relationships in sequences of device actions. Using deep learning models, Internet of Things (IoT) security systems can attain enhanced precision in detecting and defending against advanced attack strategies. Deep learning algorithms can acquire hierarchical data representations, enabling them to identify intricate and nuanced attack patterns. Furthermore, deep learning models can process and analyze large-scale datasets efficiently. They can adjust and perform effectively in dynamic Internet of Things (IoT) contexts. This makes them highly suitable for identifying familiar and new IoT attacks [60].
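To make the machine learning route concrete, the minimal sketch below trains a random forest to separate benign from malicious flows. It is an illustrative baseline under stated assumptions, not the model proposed in this thesis: it presumes a CICFlowMeter-style CSV named flows.csv (a hypothetical filename) whose Label column marks each flow as Benign or as an attack type.

```python
# Illustrative baseline only, not the proposed model. Assumes a
# CICFlowMeter-style CSV "flows.csv" with a "Label" column.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("flows.csv")
X = df.drop(columns=["Label"]).select_dtypes("number")  # numeric flow features
y = (df["Label"] != "Benign").astype(int)               # 1 = malicious flow

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```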
The methods include anomaly detection, behavioural analysis, signature-based detection, honeypots, security information and event management (SIEM), machine learning (ML), and deep learning (DL). Machine learning techniques utilize algorithms trained on large datasets to identify abnormal patterns in network traffic and device behaviour, enabling early detection of, and reaction to, potential threats. DL algorithms extract complex features from raw data, allowing for the precise and efficient identification of elaborate attack patterns. By adopting these sophisticated approaches, IoT security systems can enhance their ability to withstand emerging attack strategies, thereby protecting IoT ecosystems and their related assets from possible threats.

Chapter 4

Proposed Method

4.1 Motivation

This chapter presents a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) model, enhanced with a Self-Attention mechanism, to identify Internet of Things (IoT) attacks in TCP network flow data. The suggested architecture enables efficient detection of IoT attacks without requiring feature engineering. We first examine the rationale behind the proposed strategy, emphasizing the fundamental significance of our approach. In the next section, we present a comprehensive outline of the different components of our suggested approach. Subsequently, we present an elaborate outline of the pre-processing stage and the suggested framework for detecting IoT attacks. The pre-processing step involves extracting features from raw benign and malicious pcap files by processing them with CICFlowMeter; the extracted features are then saved into CSV files. Furthermore, the process involves constructing a model that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) with a self-attention network to detect attacks.

4.2 CICFlowmeter

CICFlowMeter is a tool that generates and analyzes network traffic flows. It creates bidirectional flows, where the first packet determines the direction of data transmission. As a result, over 83 statistical network traffic features, such as duration, number of packets, number of bytes, and length of packets, can be calculated independently for both the forward (source to destination) and backward (destination to source) directions. Other capabilities include choosing features from the available feature list, incorporating new features, and managing the duration of the flow timeout. The application generates a CSV file whose first six columns are FlowID, SourceIP, DestinationIP, SourcePort, DestinationPort, and Protocol, followed by the more than 83 network traffic analysis features. It is essential to understand that TCP flows typically end when the connection is torn down with a FIN message, whereas UDP flows are ended by a flow timeout. The individual scheme can provide an arbitrary value for the flow timeout, such as 600 seconds for both TCP and UDP [43].

4.3 Data Pre-Processing

[Figure 4.1: Data Pre-processing. Benign/malicious PCAP files pass through CICFlowMeter, a TCP flow filter, and KNN imputation to produce benign/malicious feature files.]

During the pre-processing stage, we identify a dataset that includes raw benign and malicious pcap files from which to extract features. CICFlowMeter can be executed through the command line or within Java IDEs such as IntelliJ or Eclipse. Upon launching the application, we can choose between offline and online mode.
The offline mode enables us to import raw pcap files and analyze them within the application, extracting features from the pcap files and saving them as CSV files. We can derive 83 temporal statistical features from the pcap files. In addition, we applied a filter to the dataset so that it only includes rows where the assigned protocol number is 6, which corresponds to the TCP protocol [8]. Because our dataset contains missing values, we employed KNN imputation to fill them in.

K-Nearest Neighbours (KNN) is a machine learning technique for classification and regression tasks that can also be used to impute missing data. The KNN imputation approach involves identifying the K nearest neighbours of the observation with missing data and then imputing the missing values from the non-missing values of those neighbours [3].

KNN imputation is a highly favoured technique for replacing missing data in time series because it offers several significant benefits. Firstly, its non-parametric character enables it to accommodate a wide range of data distributions observed in time series without making assumptions about underlying data patterns. KNN imputation relies on the principle that data points with similar characteristics typically have similar values; it estimates missing values from the values of nearby points, thus maintaining the local structure of the data. Unlike approaches that presume linearity, KNN imputation is appropriate for time series data with nonlinear or complex connections between variables. Furthermore, its robustness to outliers guarantees stability in the presence of irregular data points, as it prioritizes nearby clusters rather than overall patterns. KNN imputation can adapt to changing patterns in time series data by considering the nearest neighbours inside a sliding window, which makes it well suited to capturing evolving data dynamics. Finally, its straightforward application and the small number of parameters to adjust, such as the number of neighbours (k) and the distance metric, enhance its attractiveness as a powerful and adaptable method for filling in missing values in time series datasets [71].
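As a concrete illustration of this pre-processing step, the sketch below filters a small, hypothetical flow table down to TCP rows (protocol number 6) and fills its missing values with scikit-learn's KNNImputer. The column names and values are invented for the example; the thesis pipeline applies the same idea to full CICFlowMeter CSV output.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical flow-feature table with missing values (NaN),
# standing in for a CICFlowMeter CSV export.
flows = pd.DataFrame({
    "Protocol": [6, 6, 17, 6],
    "FlowDuration": [120.0, np.nan, 80.0, 95.0],
    "TotFwdPkts": [10.0, 12.0, 7.0, np.nan],
})

# Keep only TCP flows (protocol number 6), as in the filtering step above.
tcp_flows = flows[flows["Protocol"] == 6].drop(columns=["Protocol"])

# Impute each missing value from the k nearest rows (k=2 for this tiny example).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(tcp_flows), columns=tcp_flows.columns)
print(imputed)
```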
4.4 What are Convolutional Neural Networks?

The Convolutional Neural Network (CNN) is a feedforward neural network adept at automatically extracting features from data using convolution structures, eliminating the need for the manual feature extraction of traditional methods. Inspired by biological visual perception, the CNN architecture mirrors the organization of the visual cortex: artificial neurons correspond to biological neurons, CNN kernels simulate different receptors detecting various features, and activation functions mimic the threshold for neural signal transmission. Loss functions and optimizers are designed to guide CNN learning. Compared to fully connected (FC) networks, CNNs offer several advantages. Firstly, they employ local connections, where each neuron connects to only a few neurons in the previous layer, reducing parameters and speeding up convergence. Secondly, weight sharing allows groups of connections to share weights, further reducing parameters. Finally, downsampling through pooling layers leverages the local correlation of images to reduce data volume while retaining crucial information, minimizing parameters and discarding trivial features. These distinctive characteristics establish the CNN as a prominent algorithm in the realm of deep learning [44].

[Figure 4.2: CNN Architecture. An input layer feeds a feature-extraction stage (convolution + ReLU layers with pooling, producing feature maps), followed by a classification stage (flatten layer, fully connected layer, and an output layer producing a probabilistic distribution).]

4.4.1 1D-CNN

A one-dimensional convolutional neural network (1D CNN) is a specific neural network that employs convolutional layers operating in one dimension. It analyzes data that follows a temporal or sequential pattern, where each data point consists of a single set of features. A 1D CNN applies one-dimensional filters to the data, extracting important patterns or features from specific parts of the input sequence.

4.4.2 2D-CNN

A 2D CNN is a prevalent form of convolutional neural network specifically created to analyze two-dimensional data, such as photographs. A 2D CNN utilizes convolutional layers to apply filters that operate in two dimensions on the input data. This allows the network to capture spatial hierarchies and patterns such as edges, textures, and forms. These characteristics become more abstract and less concrete as we move up the network layers.

4.5 CNN Operations

Typically, eight components are required to construct a CNN model:

• Convolution
• Activation Function
• Pooling
• Fully-Connected Layers
• Dropout Layer
• Batch Normalization Layer
• Loss Function
• Optimizer

4.5.1 Convolution

Convolution is a crucial step in the process of extracting features; the results of convolution are referred to as feature maps. When applying a convolution kernel of a specific size, we inevitably lose information at the boundary. Padding enlarges the input by adding zero values, thereby indirectly adjusting the output size. In addition, the stride is used to regulate the density of convolution: as the stride increases, the density decreases. Following the convolution process, the feature maps contain many features, which increases the risk of overfitting. To eliminate this redundancy, pooling (also known as downsampling) is used, which includes techniques such as max pooling and average pooling [44].

The basic 2D convolution operation is defined as follows:

$Y(i, j) = \sum_{m} \sum_{n} X(i+m,\, j+n) \cdot K(m, n)$    (4.1)

Where:
• $Y(i, j)$ is the output of the convolution at position $(i, j)$.
• $X(i+m, j+n)$ represents the pixel values of the input image.
• $K(m, n)$ is the kernel or filter applied to the image.
• The summations over $m$ and $n$ traverse all the rows and columns of the kernel $K$, respectively.

When incorporating stride and padding, the convolution formula is modified to accommodate these parameters, enabling control over the output size and the field of view of the convolution operation:

$Y(i, j) = \sum_{m} \sum_{n} X(s \cdot i + m - p,\; s \cdot j + n - p) \cdot K(m, n)$    (4.2)

Where:
• $s$ is the stride, which dictates the step size of the filter as it slides over the image.
• $p$ is the padding, which adds layers of zeros outside the original image to preserve spatial dimensions.
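To make Eq. (4.2) concrete, the NumPy sketch below implements it directly (as is conventional in deep learning frameworks, the "convolution" is computed as a cross-correlation, i.e., the kernel is not flipped). The toy input and kernel are arbitrary and serve only to show how stride and padding change the output size.

```python
import numpy as np

def conv2d(X, K, stride=1, padding=0):
    """Direct implementation of Eq. (4.2): cross-correlation with stride s and zero padding p."""
    Xp = np.pad(X, padding)                 # add p rows/columns of zeros on every side
    kh, kw = K.shape
    oh = (Xp.shape[0] - kh) // stride + 1   # output height
    ow = (Xp.shape[1] - kw) // stride + 1   # output width
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Y(i, j) = sum_m sum_n X(s*i + m, s*j + n) * K(m, n), on the padded input
            Y[i, j] = np.sum(Xp[i*stride:i*stride+kh, j*stride:j*stride+kw] * K)
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(X, K))                          # 3x3 feature map
print(conv2d(X, K, stride=2, padding=1).shape)  # stride/padding change the output size
```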
4.5.2 Activation Function

Table 4.1: Activation Functions [10]

| Activation Function | Formula |
|---------------------|---------|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ |
| ReLU (Rectified Linear Unit) | $\mathrm{ReLU}(x) = \max(0, x)$ |
| Leaky ReLU | $x$ if $x > 0$; $0.01x$ otherwise |
| ELU (Exponential Linear Unit) | $x$ if $x > 0$; $\alpha(e^x - 1)$ otherwise |

Activation functions in Convolutional Neural Networks (CNNs) play a critical role by introducing non-linearity into the network, allowing it to learn complex patterns in the data. Without activation functions, a CNN would essentially be a linear model, incapable of handling the intricacies and nuances required for tasks such as image recognition or natural language processing. The most commonly used activation function in CNNs is the Rectified Linear Unit (ReLU), favoured for its computational simplicity and its ability to maintain gradient flow during training, which helps mitigate the vanishing gradient problem. ReLU outputs the input directly if it is positive; otherwise, it outputs zero. Other activation functions such as sigmoid, tanh, and Leaky ReLU are also used, each with advantages depending on the network architecture and the specific application. These functions are applied at specific layers throughout the CNN to help the model differentiate and correctly classify the input signals into outputs [44].

Table 4.2: Activation Functions, Part 2 [10]

| Activation Function | Pros | Cons |
|---------------------|------|------|
| Sigmoid | Smooth gradient; outputs between 0 and 1 | Susceptible to the vanishing gradient problem; outputs not zero-centred |
| Tanh | Outputs between -1 and 1; zero-centred output | Susceptible to the vanishing gradient problem |
| ReLU | Computationally efficient; avoids the vanishing gradient problem | Prone to the dying ReLU problem (neurons become inactive) |
| Leaky ReLU | Addresses the dying ReLU problem; non-zero gradient for negative inputs | More complex than ReLU; output not zero-centred |
| ELU | Avoids the dying ReLU problem; zero-centred output | More computationally expensive than ReLU |
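The formulas in Table 4.1 translate directly into NumPy, as the short sketch below shows; the sample inputs are arbitrary.

```python
import numpy as np

# The activation functions of Table 4.1, written directly in NumPy.
def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(x), 3))
```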
4.5.3 Pooling

Pooling layers in Convolutional Neural Networks (CNNs) reduce the spatial dimensions of the input feature maps, decreasing the number of parameters and the computation required. Pooling also helps detect features that are invariant to scale and orientation changes. This subsampling step improves the network's efficiency and robustness by abstracting higher-level features while retaining the most essential information. The most common types of pooling are max pooling and average pooling. Max pooling returns the maximum value from each cluster of neurons in the prior layer, effectively highlighting the most prominent features; average pooling calculates the average value, smoothing out the feature responses. By reducing the resolution of the feature maps, pooling layers also help prevent overfitting by providing an abstracted form of the representation. These layers are typically placed between successive convolutional layers and play a pivotal role in the architecture of deep learning models designed for tasks such as image and video recognition [40].

Table 4.3: Pooling Functions [36]

| Pooling Function | Formula | Pros | Cons |
|------------------|---------|------|------|
| Max Pooling | $y = \max(x)$ | Preserves dominant features; translation invariance | May discard less dominant features; can lead to overfitting |
| Average Pooling | $y = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Simple and efficient; reduces noise | Can lose important details; not robust to outliers |
| Global Max Pooling | $y = \max(x)$ over all elements | Captures the most important feature; reduces dimensionality | Ignores spatial information; unsuitable for tasks requiring it |
| Global Average Pooling | $y = \frac{1}{n}\sum_{i=1}^{n} x_i$ over all elements | Reduces computational complexity; less prone to overfitting | Loses spatial information; unsuitable for tasks requiring it |

4.5.4 Fully-Connected Layers

Fully-connected layers in Convolutional Neural Networks (CNNs) are crucial components that typically come after the convolutional and pooling layers. These layers are called "fully connected" because every neuron in such a layer is connected to all the neurons in the previous layer. Their main function is to perform high-level reasoning by integrating the localized features extracted by the earlier convolutional and pooling layers into the network's final decision-making process. This is where the abstract features from the entire image or input are used to classify or make predictions about the input data. The fully-connected layers map the extracted features into the final output, such as classification labels, which makes them crucial for tasks like image recognition, where the presence of specific features across the whole image must be determined. In classification tasks, these layers often use softmax activation functions to convert the output into probability distributions over the predicted output classes [19] [44].

4.5.5 Dropout Layer

[Figure 4.3: Dropout Layer. An input layer, a hidden layer, and an output layer, with a random subset of hidden units deactivated.]

Dropout layers in Convolutional Neural Networks (CNNs) are a regularization approach developed to mitigate overfitting, a prevalent issue in deep learning models, especially those with a substantial number of parameters. During the training phase, dropout selectively turns off a random subset of neurons in the layer to which it is applied. These neurons do not participate in the forward pass, and their weights are not adjusted during the backward pass. This random deactivation compels the network to acquire more resilient characteristics that do not depend on a limited group of neurons, diminishing the model's reliance on specific traits and encouraging a more general learning pattern. By employing this technique, dropout helps ensure that the neural network performs well on new, unobserved data rather than solely on the data it was trained on. The dropout rate, which is the likelihood of each neuron being deactivated, is a hyperparameter that can be tuned to enhance performance. Dropout is typically more prevalent in fully-connected layers than in convolutional layers [23].
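The following NumPy sketch shows the standard "inverted dropout" formulation of this idea; the specific rate and array shapes are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero a random subset of units and rescale the survivors,
    so activations keep the same expected value and inference needs no rescaling."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

h = np.ones((2, 8))
print(dropout(h, rate=0.5))        # roughly half the units zeroed, survivors scaled by 2
print(dropout(h, training=False))  # unchanged at inference time
```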
4.5.6 Loss Function

Table 4.4: Loss Functions [6]

| Loss Function | Formula | Pros | Cons |
|---------------|---------|------|------|
| Cross-Entropy Loss | $-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$ | Directly models probability distributions; highly effective for classification | Can be numerically unstable without proper implementation |
| Mean Squared Error (MSE) | $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ | Easy to understand and implement; differentiable | Poor performance in classification; penalizes outliers heavily |
| Mean Absolute Error (MAE) | $\frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers than MSE; easy to understand | Gradient can be constant, which may affect convergence |

Loss functions in Convolutional Neural Networks (CNNs) are critical components that measure the discrepancy between the network's predicted outputs and the actual target values during training. The choice of loss function depends on the specific task the CNN is designed to perform. For classification problems, a common loss function is categorical cross-entropy, which calculates the difference between the predicted probability distribution over classes and the actual distribution (typically represented as a one-hot encoded vector). For regression tasks, mean squared error (MSE) or mean absolute error (MAE) is commonly used; these measure the average of the squared or absolute differences between predicted and actual values, respectively. The role of the loss function is to provide a quantitative assessment of model performance, which the training process seeks to minimize through gradient descent and backpropagation. By continuously adjusting the model parameters to reduce the loss, the CNN learns to make more accurate predictions, effectively tuning itself to the complexity of the data it processes [44].
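As a small illustration, the two losses most relevant here, binary cross-entropy (used later to train the proposed model) and MSE, can be written directly from the formulas in Table 4.4; the eps clipping guards against the numerical instability the table notes for cross-entropy.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.99])
print(binary_cross_entropy(y_true, y_pred))  # small loss for confident, correct predictions
print(mse(y_true, y_pred))
```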
4.5.7 Optimizer

Table 4.5: Optimizers [28] [20]

| Optimizer | Update Rule | Pros | Cons |
|-----------|-------------|------|------|
| SGD | $w_{t+1} = w_t - \eta \nabla J(w_t)$ | Simple and easy to understand; effective on large-scale datasets | Slow convergence; sensitive to hyperparameters |
| Adam | $w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$ | Fast convergence; automatically adjusts the learning rate | More computationally intensive; potential for non-convergence on non-convex functions |
| RMSprop | $w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$ | Diverges less; good in online and non-stationary settings | Less popular than Adam; sensitive to initialization |
| AdaGrad | $w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$, where $G_t$ is the sum of the squares of past gradients | Handles sparse gradients well; good for large, sparse datasets | Gradient scaling can cause learning to stop early |

Optimizers in Convolutional Neural Networks (CNNs) are algorithms used to change the attributes of the neural network, such as the weights and the learning rate, in order to reduce the loss. Optimizers guide the training process by deciding how to update the weights based on the gradients of the loss function with respect to those weights. Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop. SGD is simple and has been the traditional choice; each update is performed using only a subset of the data (a mini-batch) to compute the gradient, making it computationally efficient. Adam (Adaptive Moment Estimation) combines the benefits of two other extensions of SGD, AdaGrad and RMSprop, to handle sparse gradients on noisy problems. RMSprop adjusts the learning rate for each parameter, dividing the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. These optimizers differ mainly in how much past weight-update information (momentum) they use and how they adapt the learning rate during training, making them suited to different types of neural networks and convergence problems [44].

4.6 What are Residual blocks?

[Figure 4.4: Residual Block Cell. The input x passes through two convolutional layers with ReLU activations to produce F(x), while an identity connection carries x around them; the output is H(x) = x + F(x).]

Residual blocks, also known as residual connections or shortcut connections, are integral components of deep neural networks, particularly in architectures like ResNet (Residual Networks), and are aimed at tackling the vanishing gradient problem prevalent when training very deep networks. These blocks incorporate skip connections that enable the output of one or more layers to bypass subsequent layers and be added directly to the output of deeper layers. This design allows the model to learn residuals, representing the difference between the desired output and the input to a specific layer, and facilitates the learning of identity mappings. By providing a direct path for gradient flow during backpropagation, residual blocks mitigate the vanishing gradient problem, making it feasible to train exceedingly deep networks. Consequently, they contribute to easier optimization, improved generalization performance, and faster convergence, rendering them a standard and indispensable element in deep neural network architectures across diverse domains, including computer vision, natural language processing, and speech recognition.
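As an illustration, a 1D residual block of the kind shown later in Figure 4.9 can be written in Keras as follows. This is a minimal sketch under assumed layer sizes, not the thesis's exact block; the 1x1 projection on the shortcut is a standard trick, assumed here, for when the channel count changes.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """1D residual block in the spirit of Figure 4.9: two Conv1D layers plus a
    skip connection, so the block learns F(x) and outputs H(x) = x + F(x)."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    # Project the shortcut if the channel count changes, so shapes match for the add.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    out = layers.Add()([shortcut, y])
    return layers.Activation("relu")(out)

inputs = tf.keras.Input(shape=(83, 1))  # e.g., 83 flow features, one channel
outputs = residual_block(inputs, filters=32)
tf.keras.Model(inputs, outputs).summary()
```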
4.7 What are Long Short-Term Memory Networks?

Long Short-Term Memory Networks (LSTMs) are a specialized kind of Recurrent Neural Network (RNN) designed to address the problem of learning long-term dependencies in sequence data. Traditional RNNs often struggle with vanishing or exploding gradients as the sequence length increases, hampering their ability to learn from data where distant past information is crucial for predicting future states. LSTMs overcome this challenge through their unique structure, which includes memory cells and multiple gates, namely the input, forget, and output gates. These gates manage the flow of information into and out of the memory cell, effectively allowing the network to retain or discard information dynamically. This selective memory capability enables LSTMs to maintain long-range dependencies, making them highly effective for a wide range of sequential tasks such as natural language processing, speech recognition, and time series prediction. Their ability to selectively remember and forget information across long sequences makes them particularly powerful in fields where context and history significantly influence current outcomes [61].

[Figure 4.5: LSTM Cell. The input vector x_t and the memory C_{t-1} and output h_{t-1} from the previous block flow through the forget, input, and output gates, built from sigmoid and tanh units combined by element-wise multiplication and summation, to produce the current block's memory C_t and output h_t.]

[Figure 4.6: LSTM Network. A chain of LSTM cells, each passing its cell state and output to the cell at the next timestep.]

The LSTM updates for timestep $t$, given the inputs $x_t$ (current input), $h_{t-1}$ (previous output), and $C_{t-1}$ (previous cell state), are as follows:

Forget gate:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

Input gate:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

Cell state update:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

Output gate:
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t * \tanh(C_t)$
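As a check on the notation, the following NumPy sketch runs one LSTM cell step exactly as the equations above prescribe. Packing the four gate weight matrices into a single matrix W (in the order forget, input, candidate, output) is an implementation convenience assumed for the example, not something prescribed by the text.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep, following the gate equations above.
    W maps the concatenated [h_prev, x_t] to the four gate pre-activations."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = np.concatenate([h_prev, x_t]) @ W + b     # shape (4 * hidden,)
    f, i, c_hat, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c_t = f * c_prev + i * np.tanh(c_hat)         # cell state update
    h_t = o * np.tanh(c_t)                        # new hidden state / output
    return h_t, c_t

hidden, features = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(hidden + features, 4 * hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):        # a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```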
4.8 What are Attention Layers?

An attention function is a mathematical operation that takes a query and a set of key-value pairs as input and produces an output. The query, keys, values, and result are all represented as vectors. The output is calculated as a weighted sum of the values, where the weight allocated to each value is determined by a compatibility function that compares the query with the corresponding key [76].

The concept of attention in Long Short-Term Memory Networks (LSTMs) enhances these models by allowing them to focus selectively on the parts of the input sequence that are most relevant to the task, improving their ability to manage long-range dependencies in complex sequences. Attention mechanisms dynamically assign weights to different inputs at each time step, indicating the importance of each part of the data for predicting the current output. In practice, this means that instead of treating all parts of the input equally, the model learns to pay "attention" to specific segments of the input that are more informative for the current decision or prediction. This capability is particularly useful in tasks such as machine translation, where the relevance of input words can vary significantly depending on the context within the sequence. By integrating attention with LSTMs, the model not only retains information over long periods but also adapts its focus according to the evolving semantic importance of different parts of the input, leading to improved performance on tasks requiring sophisticated contextual interpretation [68] [49] [38] [48].

[Figure 4.7: Self-Attention Mechanism. Q and K enter a matrix multiplication, the result is scaled and passed through a softmax, and the resulting weights are multiplied (MatMul) with V to produce the output.]

4.8.1 Self-Attention

The self-attention mechanism simulates the human brain's attention by allocating resources to focus on relevant rather than irrelevant information. It compensates for variations in the relevance of hidden information that the LSTM network does not recognize by assigning different weights to hidden features at different phases. Thus, the self-attention mechanism retrieves advanced traits while improving long-term dependency modelling [84].

Input Representation: Let $X \in \mathbb{R}^{n \times d}$ be the input matrix, where $n$ is the sequence length and $d$ is the dimension of the embeddings [18].

Linear Transformations: The input $X$ is linearly transformed into three matrices, Query ($Q$), Key ($K$), and Value ($V$), using learned weight matrices $W_Q$, $W_K$, and $W_V$:

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

where $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, $W_V \in \mathbb{R}^{d \times d_v}$, and $d_k$ and $d_v$ are the dimensions of the key and value vectors, respectively.

Dot-Product Attention: The attention scores are calculated by taking the dot product of the query $Q$ with the transposed key $K$:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Breaking it down into steps:

Score Calculation: $\text{scores} = QK^T$

Scaling: $\text{scaled scores} = \frac{QK^T}{\sqrt{d_k}}$. Dividing the scores by $\sqrt{d_k}$ helps stabilize the gradients.

Softmax: $\text{attention weights} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$. The softmax function converts the scores into a probability distribution.

Weighted Sum: $\text{context vector} = \text{attention weights} \cdot V$

Explanation:

• Input Representation: The input $X$ is a matrix of shape $(n, d)$, where each row represents a token in the sequence and each column represents a feature of the embedding.

• Linear Transformations: The input matrix $X$ is transformed into query $Q$, key $K$, and value $V$ matrices using the weight matrices $W_Q$, $W_K$, and $W_V$. These transformations allow the model to learn different representations for querying, keying, and valuing the input data.

• Dot-Product Attention:
  – Score Calculation: The dot product $QK^T$ yields a score matrix representing the similarity between each query and key.
  – Scaling: Dividing the scores by $\sqrt{d_k}$ helps prevent the softmax function from producing extremely small gradients.
  – Softmax: The softmax function normalizes the scores into probabilities, indicating each token's importance.
  – Weighted Sum: The context vector is computed as the weighted sum of the value vectors, where the weights are the attention scores. This context vector captures the relevant information from the entire sequence for each token.

This mechanism allows the model to focus on different parts of the input sequence dynamically, depending on the context, which is particularly useful for tasks involving long-range dependencies [18].
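The derivation above maps directly onto a few lines of NumPy. The sketch below computes softmax(QK^T / sqrt(d_k)) V for a toy sequence; the random X and weight matrices stand in for learned embeddings and parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention as derived above."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

n, d, d_k = 5, 8, 4                                  # sequence length, embedding dim, key dim
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
out = self_attention(X, rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)))
print(out.shape)  # (5, 4): one context vector per token
```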
4.9 What are SHAP values?

SHAP (SHapley Additive exPlanations) values offer a method for elucidating the outcomes of any machine learning model by employing a game-theoretic approach that gauges each player's contribution to the final result. In machine learning, each feature receives an importance value denoting its impact on the model's output. These SHAP values reveal the individual influence of each feature on every prediction, assess the relative significance of each feature, and ascertain the model's reliance on feature interactions [9] [67].

The SHAP value ($\phi$) for a feature $j$ in a prediction model $f$ is given by:

$\phi_j(f) = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{j\}) - f(S) \right]$

Where:
• $F$ is the set of all features.
• $S$ is a subset of features excluding $j$.
• $f(S)$ is the model's output using only the features in set $S$.
• $|S|$ and $|F|$ are the cardinalities of sets $S$ and $F$, respectively.
• $\phi_j(f)$ is the SHAP value for feature $j$.

Widely utilized in machine learning, SHAP values afford a consistent and unbiased explanation of how each feature influences the model's predictions. They derive from game theory principles, attributing an importance metric to each feature: positive SHAP values denote a positive influence on predictions, negative values imply a negative impact, and the magnitude indicates the strength of the effect [67] [9] [46] [51]. Notably, SHAP values are model-agnostic, making them applicable to diverse machine learning models such as linear regression, decision trees, random forests, gradient boosting models, and neural networks. These values possess several advantageous properties, including additivity, local accuracy, missingness, and consistency. Their additivity allows each feature's contribution to the prediction to be computed independently, facilitating efficient computation even for high-dimensional datasets [67] [9] [46] [51]. Furthermore, SHAP values provide an accurate and localized interpretation of a model's prediction for a given input while remaining robust to missing or irrelevant features. Importantly, they offer a consistent interpretation of a model's behaviour, ensuring stability even in the face of changes in model architecture or parameters. Overall, SHAP values furnish a reliable and objective means to gain insight into a machine learning model's prediction mechanisms and the features exerting the greatest influence [9] [46] [51].

4.9.1 SHAP Feature Importance Scores

SHAP feature importance provides a comprehensive method for understanding the impact of individual features on a machine learning model's predictions. Unlike traditional methods such as permutation importance, which assesses feature importance by measuring the decrease in model performance when a feature's values are randomly permuted, SHAP values consider the interactions between features and provide a more nuanced understanding of their influence on predictions [46] [51].

While permutation importance offers a straightforward measure of feature importance, SHAP values capture the contribution of each feature in the context of the other features, allowing a more accurate and interpretable assessment of their impact. In situations where feature interactions contribute significantly to model predictions, SHAP feature importance provides more insightful and reliable results. Permutation importance may suffice for simpler models with fewer interactions between features and can be computationally less intensive. Ultimately, the choice between the two depends on the complexity of the model and the importance of capturing feature interactions for accurate interpretation [51].

While SHAP values typically provide local explanations for individual predictions, aggregating these values across a dataset offers insight into the global importance of each feature. Global feature importance quantifies the overall impact of features across all predictions, measuring how significant each feature is in the model's decision-making process. The global importance of a feature $j$ is calculated by summing the absolute SHAP values for that feature across all data points in the dataset:

$\mathrm{Global\ Importance}(j) = \sum_{i=1}^{N} \left| \phi_j^{(i)} \right|$

Where:
• $N$ represents the total number of data points in the dataset.
• $\phi_j^{(i)}$ denotes the SHAP value for feature $j$ at the $i$-th instance.
• The absolute values of the SHAP values are summed to account for both positive and negative contributions uniformly.
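As a minimal, self-contained illustration of this aggregation step, the sketch below uses the shap library's TreeExplainer on a toy random-forest regressor (for neural networks such as the proposed model, shap's DeepExplainer plays the analogous role). The data and model are invented for the example; only the final two lines, which implement the global-importance formula above, carry over directly.

```python
import numpy as np
import shap                                   # pip install shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy stand-in for flow features
y = X[:, 0] + 0.5 * X[:, 2]                   # only features 0 and 2 matter

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)   # (200, 5) per-sample attributions

# Global importance: sum of absolute SHAP values over all samples (the formula above).
global_importance = np.abs(shap_values).sum(axis=0)
print(np.argsort(global_importance)[::-1])    # expect features 0 and 2 to rank first
```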
4.10 Proposed Mechanism

The proposed approach consists of two distinct stages. The initial stage involves the conversion of data, extraction of features, generation of datasets for training and testing the model, and the construction of the model itself. The second stage involves training the model on the training dataset using all features and determining feature importance with SHAP values. Subsequently, the model is retrained on a reduced set of features while maintaining performance metrics comparable to the original model. This stage therefore determines the optimal set of features for training the model.

4.10.1 Pre-processing - Stage I

This section provides a detailed analysis of the components involved in the pre-processing stage; Figure 4.1 gives a high-level overview of the pre-processing workflow.

• Data Collection: For data acquisition, we required benign and malicious pcap files sourced from an IoT attack dataset, which significantly facilitates the advancement of security analytics applications within IoT environments. Specifically, we employed the CIC IoT dataset 2023 for training, validation, and testing. Additionally, after completing the training phase, we utilized eight additional IoT attack datasets to evaluate the efficacy of the developed model [53].

• Data Conversion and Feature Extraction: As previously outlined in Section 4.2, we employed CICFlowMeter, a network traffic flow generator and analyzer, to extract temporal statistical features from raw pcap files. We extracted over 83 features from these files and stored them in CSV format, facilitating their use in both the model training and testing phases.

• Data Filtration, Imputation and Creation: We implemented a filtering mechanism on the dataset, retaining only rows with an assigned protocol number of 6, indicative of the TCP protocol [8]. To address missing data within our dataset, we opted for KNN imputation to complete the missing values. Subsequently, by merging the resulting CSV files, we synthesized the necessary dataset.

• Model Creation: At this stage, we examine the components used to construct the model. The model incorporates convolutional layers, residual blocks, LSTM layers with an attention mechanism, and final processing and output layers. Let's break down each module's purpose and workings (a code sketch showing how these modules compose follows the list):

[Figure 4.8: Model Architecture. Two residual blocks, each followed by max pooling and dropout; two LSTM layers with dropout; a self-attention block; then flatten, dense, dropout, and output layers.]

[Figure 4.9: Residual Block. Data passes through two 1D convolution layers with activations, and the block's input is added to the convolutional output.]

– Input Layer: This layer defines the input shape required to process the data for model training and testing. For our model, the input shape is (feature_len, 1), where feature_len is the number of feature columns in one batch.

– Residual Blocks: The model starts with CNN-based residual blocks, which deepen the network without losing the ability to train effectively.

– Max Pooling: Reduces the spatial dimensions of the output from the convolutional layers, summarizing features.

– Dropout: Applied after the pooling and LSTM layers to prevent overfitting by randomly dropping units during training.

– LSTM Layers: Process the output of the convolutional layers to learn from the temporal patterns in the data. Setting return_sequences=True keeps the time dimension for attention processing.

– Attention Layers: Applied to the output of the LSTM layers to focus the model's learning on important temporal elements.

– Flatten: Converts the multi-dimensional output of the attention layer into a one-dimensional array suitable for input to the fully connected layer.

– Dense Layer: A dense layer serves as the intermediate stage where the spatially distributed features extracted by the convolutional, pooling, and LSTM layers are combined into a single vector representation. Dense layers enable the network to learn high-level abstractions and relationships among the extracted features, making them more suitable for classification or regression tasks. These intermediate dense layers typically incorporate non-linear activation functions such as ReLU to introduce non-linearity and capture complex patterns in the data. Additionally, they may be accompanied by dropout or batch normalization layers to regularize the network and prevent overfitting.

– Output Layer: A dense layer that outputs the final prediction of the model. The activation function is chosen based on the task; since we are performing binary classification, we use the sigmoid function.
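The sketch below shows one way these modules could be composed in Keras, following Figures 4.8 and 4.9. It is a minimal illustration, not the thesis's exact implementation: the filter counts, LSTM widths, dropout rate, and the use of Keras's built-in dot-product Attention layer for the self-attention block are all assumptions made for the example (the actual training hyperparameters appear in Table 4.6).

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """Conv1D residual block as in Figure 4.9 (same helper as the Section 4.6 sketch)."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

feature_len = 83                                    # number of flow features per sample
inputs = tf.keras.Input(shape=(feature_len, 1))

x = residual_block(inputs, 32)                      # residual block + pooling + dropout, twice
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(0.2)(x)
x = residual_block(x, 64)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(0.2)(x)

x = layers.LSTM(64, return_sequences=True)(x)       # keep the time axis for attention
x = layers.Dropout(0.2)(x)
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.Dropout(0.2)(x)

x = layers.Attention()([x, x])                      # self-attention: query = key = value
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary benign/malicious output

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```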
Figures 4.8 and 4.9 provide a high-level overview of the model architecture. This architecture is particularly effective for complex sequence-modelling tasks that benefit from spatial feature extraction and from the ability to remember and emphasize important parts of the sequence data over time.

4.10.2 Proposed Architecture - Stage II

• Hyperparameter Tuning: Hyperparameter tuning, also known as hyperparameter optimization, refers to selecting the optimal hyperparameters for a machine learning model to maximize its performance on a given dataset. Hyperparameters are parameters set before the learning process begins; they control aspects such as the complexity of the model, the regularization strength, and the learning rate. Unlike model parameters learned during training, hyperparameters are not learned from the data and must be specified by the user [65].

Grid search is a hyperparameter tuning technique for finding the optimal combination of hyperparameters for a machine learning model. It involves defining a grid of hyperparameter values and exhaustively searching through all possible combinations of these values to identify the combination that yields the best performance on a chosen evaluation metric.

Using grid search, we determined five optimal combinations of hyperparameters for training, validation, and testing. Among the five options, we selected the hyperparameter configuration that performs best for calculating feature importance, retraining the model, and compressing it. This selection was based on testing against external datasets. Table 4.6 lists the hyperparameter combinations used for training the model.

Table 4.6: Hyperparameters List

| No | Activation | Loss | Optimizer | Batch Size | Epochs | Shuffle |
|----|------------|------|-----------|------------|--------|---------|
| 1 | ReLU | Binary Cross-entropy | Adam | 32 | 20 | True |
| 2 | ReLU | Binary Cross-entropy | Adam | 16 | 20 | True |
| 3 | Leaky ReLU | Binary Cross-entropy | Adam | 16 | 20 | True |
| 4 | ReLU | Binary Cross-entropy | RMSProp | 16 | 20 | True |
| 5 | ReLU | Binary Cross-entropy | RMSProp | 8 | 20 | True |

• Model Training and Testing: To facilitate model training, we divided the d