An Efficient Self Attention-Based 1D-CNN-LSTM Network for IoT Attack Detection and Identification Using Network Traffic

by

Tinshu Sasi
Bachelor of Technology (Computer Science and Engineering), Manav Rachna International University, 2014

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of Master of Computer Science in the Graduate Academic Unit of Computer Science

Supervisor(s): Rongxing Lu, PhD, Faculty of Computer Science
Arash Habibi Lashkari, PhD, Faculty of Computer Science
Examining Board: Sajjad Dadkhah, PhD, Faculty of Computer Science
Saqib Hakak, PhD, Faculty of Computer Science
Hamed Asgari Moslehabadi, PhD, Department of Mechanical Engineering

This thesis is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK
June, 2024
© Tinshu Sasi, 2024

Abstract

In the last 10 years, the Internet of Things (IoT) has played a crucial role in the digital transformation of society. However, it is also facing increased security vulnerabilities because of the wide range of devices it encompasses. This research presents a novel mechanism called the Self Attention-Based 1D-CNN-LSTM Network for detecting IoT attacks. The proposed mechanism achieves an impressive accuracy of 99.96% and efficiently differentiates between malicious and benign samples. By employing Shapley Additive Explanations (SHAP), we were able to identify important predictive features from the preprocessed data, which were retrieved using CICFlowMeter. This has strengthened the dependability of the model. In addition, we enhanced the model by training it on a smaller collection of features, resulting in shorter training time while preserving accuracy. We have also generated a novel IoT tabular dataset consisting of nine widely accessible IoT datasets, as specified in Table 5.1, to evaluate the model's robustness and showcase its efficacy in IoT security.

Dedications

I dedicate my thesis to my spouse, Sweety Sinha, for her unwavering support and encouragement.

Acknowledgments

I express my heartfelt gratitude to my professors, Dr. Rongxing Lu and Dr. Arash Habibi Lashkari, for their unwavering assistance and direction. They have provided me with motivation and support throughout the whole process. I express my gratitude to the University of New Brunswick, specifically the Faculty of Computer Science, for providing me with the chance and assistance during my program. I am also thankful to all the professors who imparted their knowledge to me throughout the course of the program. I express my gratitude to the members of my examining committee for their valuable input and suggestions.

Table of Contents

Abstract . . . ii
Dedications . . . iii
Acknowledgments . . . iv
Table of Contents . . . v
List of Tables . . . viii
List of Figures . . . x
Abbreviations . . . xii
1 Introduction . . . 1
1.1 Introduction . . . 1
1.2 What is an IoT architecture? . . . 2
1.3 What are IoT attacks? . . . 5
1.4 What is the difference between IoT and IT attacks? . . . 6
1.5 Problem Statement . . . 7
1.6 Summary of Contributions . . . 8
1.7 Thesis Organization . . . 8
2 Background of IoT Attacks . . . 10
2.1 Technical Terms . . . 10
2.2 Common IoT Vulnerabilities . . . 11
2.3 Summary of IoT Attacks in the Training Dataset . . . 15
2.4 Concluding Remarks . . . 23
3 Literature Review . . . 24
3.1 Overview . . . 24
3.2 Literature Review . . . 24
3.3 Different techniques used to perform IoT attack detection . . . 30
3.4 Concluding Remarks . . . 34
4 Proposed Method . . . 36
4.1 Motivation . . . 36
4.2 CICFlowMeter . . . 37
4.3 Data Pre-Processing . . . 37
4.4 What are Convolutional Neural Networks? . . . 39
4.4.1 1D-CNN . . . 40
4.4.2 2D-CNN . . . 40
4.5 CNN Operations . . . 40
4.5.1 Convolution . . . 41
4.5.2 Activation Function . . . 42
4.5.3 Pooling . . . 43
4.5.4 Fully-Connected Layers . . . 44
4.5.5 Dropout Layer . . . 45
4.5.6 Loss Function . . . 46
4.5.7 Optimizer . . . 47
4.6 What are Residual blocks? . . . 48
4.7 What are Long Short Term Memory Networks? . . . 49
4.8 What are Attention Layers? . . . 51
4.8.1 Self-Attention . . . 52
4.9 What are SHAP values? . . . 55
4.9.1 SHAP Feature Importance Scores . . . 57
4.10 Proposed Mechanism . . . 58
4.10.1 Pre-processing - Stage I . . . 58
4.10.2 Proposed Architecture - Stage II . . . 61
4.11 Concluding Remarks . . . 63
5 Experiments & Results . . . 65
5.1 Overview . . . 65
5.2 Experimental Setup . . . 65
5.3 Datasets . . . 66
5.4 Features . . . 78
5.5 Finalizing the Proposed Model . . . 84
5.5.1 Performance Metrics . . . 85
5.6 Experimental Results . . . 87
5.7 Concluding Remarks . . . 100
6 Conclusion & Future Works . . . 102
6.1 Conclusions . . . 102
6.2 Future Works . . . 103
Bibliography . . . 105
Vita

List of Tables

1 List of Acronyms and Abbreviations . . . xii
1.1 IoT Attacks Vs IT Attacks . . . 6
2.1 Top 10 OWASP IoT Vulnerabilities . . . 11
3.1 Comparison of IoT Attack Surveys . . . 25
4.1 Activation Functions [10] . . . 42
4.2 Activation Functions Part 2 [10] . . . 43
4.3 Pooling Functions [36] . . . 44
4.4 Loss Functions [6] . . . 46
4.5 Optimizers [28] [20] . . . 47
4.6 Hyperparameters List . . . 62
5.1 Augmented Datasets . . . 66
5.2 List Of All Malicious Activities Present In All Datasets in Table 5.1 [60] [77] . . . 73
5.3 CICFlowMeter Extracted Features List . . . 78
5.4 Baseline Models Evaluation Results . . . 88
5.5 Our Model Accuracy List . . . 92
5.6 Confusion Metrics For Initial Model Across All Datasets . . . 92
5.7 Top 7 Best Features For Retraining . . . 95
5.8 Reduced Set's Features Accuracy List . . . 97
5.9 Confusion Metrics For 7 Feature Reduced Model Across All Datasets . . . 100

List of Figures

1.1 IoT Architecture Levels . . . 3
3.1 Overview of Available IoT Attack Detection and Identification Methods . . . 31
4.1 Data Pre-processing . . . 37
4.2 CNN Architecture . . . 39
4.3 Dropout Layer . . . 45
4.4 Residual Block Cell . . . 48
4.5 LSTM Cell . . . 50
4.6 LSTM Network . . . 50
4.7 Self Attention Mechanism . . . 52
4.8 Model Architecture . . . 59
4.9 Residual Block . . . 60
5.1 Comparison between Baseline Models and Our Model . . . 88
5.2 Our Model's Training History Results . . . 90
5.3 Our Model's Confusion Matrix Results . . . 91
5.4 SHAP Summary Plot for Our Model (Trained on CIC-BCCC-NRC-IoT-2023 dataset) . . . 93
5.5 Feature Importance Graph for CIC-BCCC-NRC-IoT-2023 dataset . . . 94
5.6 Reduced Set Feature's Training History Results . . . 99
5.7 Reduced Set Feature's Confusion Matrix Results . . . 99
5.8 Performance Comparison between Initial Model vs Reduced Feature Set Model . . . 101
Abbreviations

Table 1: List of Acronyms and Abbreviations
Acronym/Abbreviation | Description
IoT | Internet of Things
CAN | Controller Area Network
IIoT | Industrial Internet of Things
AGVs | Automated Guided Vehicles
DTPs | Data Transfer Protocols
MQTT | Message Queue Telemetry Transport
CoAP | Constrained Application Protocol
DDS | Data Distribution Service
AMQP | Advanced Message Queuing Protocol
IT | Information Technology
TCP | Transmission Control Protocol
CNN | Convolutional Neural Networks
LSTM | Long Short Term Memory
SHAP | Shapley Additive Explanations
OWASP | Open Worldwide Application Security Project
ACL | Access Control List
DDoS | Distributed Denial-of-Service
ICMP | Internet Control Message Protocol
MTU | Maximum Transmission Unit
UDP | User Datagram Protocol
DNS | Domain Name System
C&C | Command & Control
GRE | Generic Routing Encapsulation
ETH | Ethernet
MitM | Man-in-the-Middle
ARP | Address Resolution Protocol
MAC | Media Access Control Address
HTTPS | Hypertext Transfer Protocol Secure
BC | Blockchain
FC | Fog Computing
EC | Edge Computing
ML | Machine Learning
SIEM | Security Information and Event Management
DL | Deep Learning
RNN | Recurrent Neural Network
KNN | K-Nearest Neighbors
ReLU | Rectified Linear Unit
ELU | Exponential Linear Unit
BN | Batch Normalization
MSE | Mean Squared Error
MAE | Mean Absolute Error
SGD | Stochastic Gradient Descent
Adam | Adaptive Moment Estimation
RMSProp | Root Mean Squared Propagation
AdaGrad | Adaptive Gradient Algorithm
CIC | Canadian Institute for Cybersecurity

Chapter 1
Introduction

1.1 Introduction

The digital revolution has significantly transformed our lives, with the Internet of Things (IoT) playing a pivotal role. However, the rapid development of IoT in most corners of life leads to various emerging cybersecurity threats. Therefore, detecting and preventing potential attacks in IoT networks have recently attracted paramount interest from academia and industry. Among various attack detection approaches, machine learning-based methods, especially deep learning, have demonstrated outstanding potential thanks to their early detection capability [78].

Over the last several years, there has been a substantial increase in the number of attacks specifically aimed against Internet of Things (IoT) devices. These encompass various types of cyber attacks, such as infiltrating wireless webcams to gain unauthorized access to surveillance cameras and violate user privacy; targeting implantable cardiac devices to potentially deplete battery life, disrupt pacing, and cause electric shocks, thereby endangering patients' lives; compromising children's smartwatches to expose their location data and personal information, posing multiple safety risks; and manipulating the Controller Area Network (CAN) bus of vehicles to potentially alter speed and direction, thereby posing threats to public safety. An investigation into smart home hacking found that, over one week, fraudsters and unidentified groups launched over 12,000 attacks against various smart home devices, including TVs, thermostats, smart kettles, and security systems [79] [80].

1.2 What is an IoT architecture?

The IoT architecture refers to the organization and setup of IoT devices to fulfill users' particular requirements and requests. An IoT system is divided into three to seven layers, depending on its complexity, and each layer has a specific function.
The lack of established protocols in the architecture of the Internet of Things (IoT) presents further difficulties regarding interoperability, security, and several other issues. The Internet of Things (IoT) architecture can include up to seven levels [70]. Figure 1.1 provides a detailed illustration of the Internet of Things (IoT) and Industrial Internet of Things (IIoT).

Figure 1.1: IoT Architecture Levels (Perception, Network, Edge, Processing, Application, Business, and Security layers, spanning data collection through control and optimization)

• Perception Layer: The perception layer, also known as the device layer, includes a variety of sensors such as RFID scanners, security cameras, GPS modules, and so on. These devices may be used with industrial gear, such as conveyor systems, industrial robots, and automated guided vehicles (AGVs). These gadgets collect sensory data, monitor the production floor and surroundings, transport raw materials, and so forth [64].

• Transport/Network Layer: The Transport/Network layer is responsible for transferring data to the processing systems of the subsequent layer [64]. IoT gateways must first transform the incoming input from analogue to digital format. Subsequently, the gateway can transmit the data to a local or cloud data center via several data transfer protocols (DTPs). Some of the leading IoT protocols are Bluetooth, Wi-Fi, Zigbee, Z-Wave, 6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks), MQTT (Message Queue Telemetry Transport), CoAP (Constrained Application Protocol), DDS (Data Distribution Service), and AMQP (Advanced Message Queuing Protocol) [70] (a small MQTT publishing sketch follows this list).

• Edge Layer: The edge layer in an IoT configuration consists of the physical hardware, embedded operating system, and device firmware. With the growing number of interconnected devices, latency becomes a prominent issue in more extensive IoT networks. Edge computing, aided by the edge layer, resolves this problem by allowing data processing and analysis close to the data source. Latency, the period of time between the detection of an event and the execution of an action, is a critical concern for devices that are linked to a network. Reducing latency may be accomplished by placing processing resources close to the sensors or at the network edge, enabling prompt connection and data exchange between devices [56].

• Processing Layer: The Processing layer, sometimes called the Middleware layer, comprises servers and databases. Its primary functions include decision-making, executing optimization algorithms, and storing large amounts of data [64]. This layer consists of cloud computing platforms that can analyze and interpret data from the physical environment. The system processes unprocessed sensor data and transforms it into relevant insights using cloud services and comprehensive data modules. Furthermore, the processing layer allows the system to promptly respond to inputs and outputs. It can make assessments and carry out tasks based on the information it receives. The data received in the perception step is used in this layer to generate predictions and provide insights [69].

• Application Layer: The application layer of IoT infrastructure is responsible for data analysis to solve business issues or achieve specific goals. The application layer provides customized functionality to meet the unique requirements of end users. The applications and services in this layer are constructed on top of the processing layer.
Software tools facilitate converting data from the processing layer into meaningful information that humans can understand or that automated processes can use [4].

• Business Layer: The business layer acts as the central point where choices and solutions are developed from the data analysis conducted in the application layer. The application layer may consist of several instances inside this layer. At the business layer, identifiable patterns from the application layer are used to gain a deeper understanding of business insights, predict future trends, and make operational decisions that improve productivity, security, cost-effectiveness, customer satisfaction, and other important business factors. In addition, the business layer is accountable for overseeing commercial transactions and models related to interconnected devices. It includes the administration of business processes, data analysis, and rules implementation. It serves as the basis for controlling business logic and setting up procedures to achieve all the business goals of an IoT system [70] [69] [4].

• Security Layer: The security layer is present across all levels of the IoT architecture and is essential for the efficiency of an IoT solution. Considering that Internet of Things (IoT) devices often deal with confidential information, it is crucial to implement strong security measures [70] [4].
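To make the transport layer concrete, the following minimal sketch shows a device publishing a sensor reading over MQTT, one of the protocols listed above. It is purely illustrative: the broker hostname, topic, and payload fields are hypothetical placeholders, and it assumes the paho-mqtt package is installed.

```python
# Illustrative only: an IoT device publishing a reading over MQTT.
# The broker hostname, topic, and payload fields are hypothetical.
import json
import paho.mqtt.publish as publish

reading = {"device_id": "thermostat-01", "temperature_c": 21.5}
publish.single(
    "home/livingroom/temperature",      # topic
    payload=json.dumps(reading),
    hostname="broker.example.local",    # placeholder broker address
    port=1883,
    qos=1,                              # at-least-once delivery
)
```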
1.3 What are IoT attacks?

IoT attacks constitute cyberattacks leveraging IoT devices to access consumers' sensitive data. Typically, attackers deploy malware on these devices, causing damage or infiltrating additional organizations' data. Due to insufficiently designed security mechanisms, IoT devices emerge as prominent vulnerabilities within organizational infrastructures, posing substantial security risks. Basic IoT devices often lack robust built-in security measures to counter cyber threats. Given their limited functionalities and purposes, security considerations for such devices are frequently overlooked, rendering them susceptible to cyberattacks. Hackers and organizations can use common flaws and "zero-day exploits" to attack IoT devices in various ways [60].

1.4 What is the difference between IoT and IT attacks?

Table 1.1: IoT Attacks Vs IT Attacks
Category | IoT Attacks | IT Attacks
Attack Surface | Limited resources, higher vulnerability. | Robust security, lower vulnerability.
Diversity of Devices | Varied types, complex security. | Standardized, simplified security.
Impact | Harmful physical consequences. | Data theft, service disruption.
Legacy Devices | Older devices, no updates, higher risk. | Regular updates, lower risk.

Unlike conventional Information Technology (IT) attacks, Internet of Things (IoT) attacks present distinct issues that necessitate specific security solutions to fully minimize the associated dangers. The distinction between IT and IoT attacks is outlined in Table 1.1. Some of the differences are:

• Attack surface: IoT devices often possess limited processing capabilities and resources, resulting in potential deficiencies in security features compared to traditional IT systems. Consequently, IoT devices are more susceptible to attacks due to reduced defences [60].
• Diversity of devices: The wide array of IoT device types, varying in form factor, operating systems, and network connectivity, complicates establishing standardized security measures. This diversity renders certain devices more prone to vulnerabilities and targeted attacks [60].
• Physical impact: Many IoT devices are integral to critical infrastructure or life-sustaining systems, such as medical equipment, thus exposing them to cyberattacks with severe physical ramifications. In contrast, typical IT attacks aim to compromise data integrity or disrupt services [60].
• Legacy devices: IoT devices often have extended lifespans, resulting in a proliferation of older, unsupported devices. The inability of legacy devices to receive software updates or security patches renders them particularly vulnerable to exploitation or compromise [60].

1.5 Problem Statement

As discussed in the previous sections, the proliferation of IoT devices across various industries underscores the importance of safeguarding these devices against potential cyberattacks. Identifying and discerning malicious and benign instances by analyzing time-related features extracted from IoT TCP flow network traffic data presents a formidable challenge within the network security domain. Determining the optimal features for detecting malicious and benign instances among the extracted features and establishing their efficacy and reliability constitutes a critical inquiry. Given the escalating complexity and volume of IoT network data, it is imperative to devise robust detection methodologies to distinguish between malicious and benign activities accurately. Addressing this imperative is crucial for fortifying IoT systems and networks against potential security vulnerabilities, thereby upholding the integrity and reliability of interconnected devices and services.

1.6 Summary of Contributions

This research endeavour mainly focuses on safeguarding IoT devices by identifying IoT attacks and analyzing time-related features derived from IoT network traffic TCP flow data. This will be achieved using deep learning algorithms to distinguish between malicious and benign samples. To summarize, this thesis has made the following contributions:

• Our proposal uses a Self Attention-based 1D-CNN and LSTM network to identify IoT attacks by analyzing time-related features collected from TCP flow data.
• To enhance the model's credibility, the model will be evaluated on eight publicly accessible external datasets that have been enhanced and pre-processed using CICFlowMeter.
• Determining the optimal hyperparameter combination for the model with the highest performance metrics.
• Employing Shapley Additive Explanations (SHAP) to compute feature importance scores and utilize them to identify the most optimal features from the extracted feature list.
• Reducing the model size by retraining it on the most optimal features while maintaining comparable performance to the original model.
• Releasing the CIC-BCCC-NRC TabularIoTAttack-2024 dataset, which includes over 80 extracted features from eight IoT datasets.

1.7 Thesis Organization

The structure of the remainder of this thesis is outlined as follows:

1. Chapter 2: Background of IoT Attacks provides a fundamental understanding of the topic, including a discussion of key technical terms associated with the subject matter, mainly focusing on IoT attacks. It also summarizes the IoT attacks mentioned in the training datasets.
2. Chapter 3: Literature Review reviews prior research on IoT attacks, examining the detection methodologies using machine learning and deep learning techniques. This chapter also offers a comprehensive classification of IoT attack types.
3. Chapter 4: Proposed Method outlines the motivation behind the proposed approach, detailing the preprocessing steps and describing the model architecture, including a thorough explanation of each principal component within the architecture.
4. Chapter 5: Experiments & Results presents the outcomes of the implemented research. This chapter covers the dataset, experimental setup, features, metrics, and the results obtained from the proposed method.
5. Chapter 6: Conclusion & Future Works concludes the thesis by summarizing the contributions, discussing encountered challenges, and providing insights into potential future research directions.

Chapter 2
Background of IoT Attacks

2.1 Technical Terms

Before delving deeper into the intricacies of IoT attacks, acquiring a fundamental understanding of the subject is imperative. It is essential to thoroughly comprehend the various technical terminologies associated with this study area [60].

• Vulnerability: It denotes the intrinsic weaknesses of a system or its design, which permit unauthorized entities to execute commands, access data without appropriate authorization, and possibly initiate denial-of-service attacks. Such vulnerabilities can be detected across various domains of IoT systems. They may appear in the system's hardware or software, the protocols and procedures implemented within these systems, and even in the behaviours and actions of the users interacting with the system [60].
• Exposure: It refers to a problem or mistake within the system configuration that allows an unauthorized person to undertake actions to obtain information [60].
• Threats: It refers to an intentional or unintentional action that exploits vulnerabilities within a system [60].
• Attacks: It is defined as deliberate actions to damage a system or disrupt its normal operations by exploiting vulnerabilities using various strategies and tools. Attackers engage in these hostile acts to achieve specific goals, which may be motivated by personal gratification or financial gain [60].

2.2 Common IoT Vulnerabilities

Table 2.1: Top 10 OWASP IoT Vulnerabilities
Rank | Vulnerability | Description
1 | Weak, Guessable, or Hardcoded Passwords | Inadequate credentials susceptible to brute force attacks or publicly accessible, including backdoors in firmware or client software.
2 | Insecure Network Services | Presence of unnecessary or vulnerable network services, particularly online access, threatening information confidentiality, integrity, and availability.
3 | Insecure Ecosystem Interfaces | Presence of insecure online, backend API, cloud, or mobile interfaces outside the device ecosystem, potentially compromising device security.
4 | Lack of Secure Update Mechanism | Inability of devices to undergo secure updates, lacking firmware validation, secure delivery, anti-rollback procedures, or alerts on security changes.
5 | Use of Insecure or Outdated Components | Utilization of outdated or vulnerable software or hardware components, potentially exposing devices to unauthorized access.
6 | Insufficient Privacy Protection | Insecure storage or access of users' personal information within the ecosystem.
7 | Insecure Data Transfer and Storage | Absence of encryption or access control measures for sensitive data throughout the ecosystem.
8 | Lack of Device Management | Absence of adequate security support for production devices, leading to deficiencies in asset and update management.
9 | Insecure Default Settings | Distribution of devices with unsecured default configurations or limited user control over settings.
10 | Lack of Physical Hardening | Absence of physical security measures, enabling attackers to access critical information or assume local control.

Table 2.1 showcases the ten most significant vulnerabilities in IoT according to OWASP that make IoT devices susceptible to IoT attacks [7].

• Weak/Default Passwords: Absence of a strong password recovery system; weak or default passwords; non-implementation of stricter password rules; inability to change the username and password associated with the account.
• Insecure Network Services: Adversaries exploit vulnerabilities in IoT devices' communication protocols and services to gain unauthorized access and undermine the confidentiality of sensitive information transmitted between the device and a server.
• Insecure Ecosystem Interfaces: The vulnerability of the device or its connected components arising from insecure web, backend API, cloud, or mobile interfaces in the external ecosystem.
• Lack of Secure Update Mechanism: Insufficient ability to update devices securely. These deficiencies encompass the absence of device-based firmware validation, the unsecured transmission of data without encryption, the lack of anti-rollback methods, and the failure to provide notifications regarding security changes resulting from updates.
• Use of Insecure or Outdated Components: The utilization of obsolete or insecure software components or libraries that could expose the device to potential attacks. This entails the utilization of external software or hardware obtained from a compromised supply chain, as well as the unsecured modification of system platforms. The security of the IoT ecosystem may be compromised by vulnerabilities in software dependencies or outdated systems.
• Insufficient Privacy Protection: Users' personal information is saved on the device or in the ecosystem and used accidentally, improperly, or illegally. Information about one's health, energy use, and driving habits can fall into this category of privacy concerns. Privacy is at risk without adequate safeguards, and there can be legal consequences for failing to take the necessary precautions.
• Insecure Data Transfer and Storage: The absence of encryption or access control for sensitive data at any point in the ecosystem, including when it is stored, transferred, or processed. Data is critical in ensuring the reliability and integrity of IoT applications since it is used in automated controls and decision-making processes. Unauthorized access or usage will result in adverse consequences.
• Lack of Device Management: Lack of security support for production-ready devices, including asset management, update management, secure decommissioning, systems monitoring, and response capabilities. Unauthorized devices can access business networks, monitor activity, and intercept data if exposed to the IoT ecosystem.
• Insecure Default Settings: Systems or devices that come with insecure default settings or lack the ability to enhance system security by restricting users from modifying configurations. Once the settings have been acquired, attackers can exploit hardcoded default passwords, concealed backdoors, and weaknesses in the device firmware. The user encounters difficulty in simultaneously modifying various parameters.
• Lack of Physical Hardening: The absence of physical safeguards allows potential attackers to get sensitive information that could be utilized in subsequent large-scale attacks or local device takeover. Internet of Things (IoT) devices are deployed in distant and dispersed environments. An attacker can disrupt the services offered by IoT devices by gaining access to the physical layer and making alterations.

2.3 Summary of IoT Attacks in the Training Dataset

1. ACK Fragmentation: A Fragmented ACK attack is a modified version of the ACK & PSH-ACK Flood, where 1500-byte packets are utilized to monopolize the target network's bandwidth while maintaining a modest packet rate. Applying application-level filters on network equipment, such as routers, would require the equipment to reassemble the packets, consuming a significant amount of its resources. Without any filters, these attack packets can traverse various network security devices, such as routers, ACLs, and firewalls, without being noticed. The fragmented packets typically consist of irrelevant or useless data, as the attacker's objective is to fully utilize the target network's available capacity. Similar to other DDoS attacks, the objective of a DDoS Fragmented ACK attack is to obstruct the service for other users by impeding the target or causing it to crash with irrelevant data [57].

2. ICMP Flood: An ICMP flood is a form of denial-of-service (DoS) attack, which exploits the Internet Control Message Protocol (ICMP) by utilizing echo-requests and echo-replies, often known as pings, to assess the operational status and connectivity of a device. An ICMP flood attack, also known as a "ping flood attack," occurs when attackers overrun the bandwidth of a specific network router or IP address. They achieve this by inundating the router or IP address with carefully prepared ICMP packets, causing it to become overloaded and unable to transfer traffic to the next downstream hop. When the device attempts to react, it uses up all of its available resources (such as memory, processing power, and interface rate), which prevents it from fulfilling genuine requests or serving consumers [5].

3. ICMP Fragmentation: An IP/ICMP fragmentation DDoS attack is a prevalent type of volumetric denial-of-service (DoS) attack. Datagram fragmentation algorithms are employed to inundate the network during such an attack. IP fragmentation involves dividing IP datagrams into smaller packets, transmitting them over a network, and then reassembling them back into the original datagram during regular communication. This technique is essential to adhering to the size constraints that each network can handle. The maximum transmission unit (MTU) is an upper bound on the data size that can be transmitted. A packet that exceeds the maximum size must be divided into smaller fragments to ensure successful transmission. As a result, multiple packets are transmitted, one of which includes comprehensive information about the packet, such as the source/destination ports, length, and other relevant details. The remaining pieces lack further components and only contain an IP header and a data payload. These fragments lack information regarding protocol, size, or ports. The attacker might specifically utilize IP fragmentation to target communication systems and security components. ICMP-based fragmentation attacks commonly include submitting forged fragments that cannot be reassembled.
Consequently, the fragments are stored temporarily, occupying memory and potentially depleting all available memory resources. This DDoS attack involves the transmission of counterfeit UDP or ICMP packets. These packets are intentionally crafted to appear larger than the network's maximum transmission unit (MTU). However, only certain portions of the packets are transmitted. Since the packets are counterfeit and cannot be reconstructed, the server's resources are rapidly depleted, resulting in its unavailability to genuine traffic [54].

4. PSH ACK Flood: ACK or PUSH ACK packets are utilized bidirectionally to transmit data until the session is terminated after establishing a connection between the host and the client. A victim server, vulnerable to an ACK flood attack, receives faked ACK packets that do not correspond to any sessions in the server's connection list. The targeted server exhausts its system resources to ascertain the legitimacy of the falsified packets within a session, leading to a decline in performance and limited service availability [24].

5. DDoS RST FIN Flood: To terminate a TCP SYN session, the client and the host exchange RST or FIN packets. During an RST or FIN flood, the targeted server is bombarded with a high volume of faked RST or FIN packets not associated with any of the sessions stored in the server's database. The affected server must commit substantial system resources to correlate incoming packets with existing connections, leading to diminished server performance and partial unavailability [26].

6. SYN Flood: A SYN flood, also known as a half-open attack, is a denial-of-service (DDoS) attack that seeks to render a server inaccessible to genuine traffic by depleting its available resources. Through the repetitive transmission of initial connection request (SYN) packets, the attacker can inundate all accessible ports on the targeted server machine, resulting in the targeted device responding to genuine traffic sluggishly or not at all [14].

7. Synonymous IP Flood: In this attack, the victim server is inundated with a substantial influx of falsified TCP SYN packets whose headers carry identical source and destination addresses, both corresponding to the victim's address. The designated server initiates the utilization of system resources to process every packet [27].

8. DDoS TCP Flood: A TCP SYN Flood attack aims to exploit the TCP three-way handshake procedure, which is fundamental for establishing connections in TCP/IP networks. A TCP SYN Flood attack involves the deliberate sending of several SYN requests to a target server without ever sending the final ACK. As a result, the server remains idle, awaiting a response that it never receives, which leads to the utilization of resources for each of these partially established connections [34].

9. UDP Flood: A UDP Flood attack is a form of volumetric Denial of Service (DoS) attack that takes advantage of the User Datagram Protocol (UDP). UDP, unlike TCP, lacks session and connection features, making it an exceptional target for attackers. A UDP Flood attack involves an attacker inundating the victim system with an enormous volume of UDP packets directed at random ports. This influx of packets compels the host to:
• Scan for active programs on each port.
• Recognize that no active applications are listening on several ports.
• Reply with an ICMP Destination Unreachable packet using the Internet Control Message Protocol (ICMP).
The large quantity of UDP packets forces the targeted system to emit a plethora of ICMP packets. This can cause the system to become inaccessible to authorized customers. To further conceal their harmful actions, attackers may falsify the IP address of the UDP packets. This guarantees that the influx of return ICMP packets is prevented from reaching them, essentially concealing their whereabouts [35].

10. UDP Fragmentation: It is one of the variations of UDP flood. The distinguishing factor lies in utilizing packets of the greatest permissible size to saturate the channel with the least possible number of packets. Given that these packet pieces are counterfeit and unrelated to genuine data, the targeted server that receives them will allocate resources to reconstruct non-existent packets from the counterfeit fragments. Eventually, this will lead to the depletion of system resources and the subsequent server crash, or result in the overflow of channels. Like a UDP flood, this attack is challenging to screen and has a greater risk of channel overflow [25].

11. DNS Spoofing: DNS spoofing is a cyberattack in which a hacker deceives a computer or network into thinking it is interacting with a genuine website or server. Essentially, the computer engages with a counterfeit website or server established by the attacker. This deceitful practice directs people to fraudulent websites, exposing them to the risks of identity theft, financial fraud, malware, and other online concerns [63].

12. HTTP Flood: An HTTP flood is a form of Distributed Denial of Service (DDoS) attack where the attacker takes advantage of apparently valid HTTP GET or POST requests to attack a web server or application. HTTP flood attacks are volumetric attacks that typically involve a botnet, a collection of compromised machines that have been fraudulently taken over, frequently with the help of software such as Trojan Horses. HTTP floods, a type of Layer 7 attack, do not rely on faulty packets, spoofing, or reflection techniques. They can bring down a targeted site or server with less bandwidth than other attacks. Consequently, they require a thorough comprehension of the specific site or application being attacked, and each attack must be meticulously tailored to ensure its effectiveness. This dramatically enhances the difficulty of detecting and obstructing HTTP flood attacks [33].

13. Mirai: The Mirai botnet is malware that was created to take control of Internet of Things (IoT) devices and transform them into remotely operated "bots" that can carry out highly impactful volumetric distributed denial-of-service (DDoS) attacks. The Mirai botnet conducts scans to identify susceptible IoT devices that possess open ports or utilize default usernames and passwords. Upon identifying these susceptible devices, it employs vulnerabilities to acquire entry and contaminates them with its malicious code. Once infected, the device becomes part of the Mirai botnet, enabling the attacker to issue commands from a central "command & control" (C&C) server. Once established, this command and control (C&C) server can be utilized to initiate extensive distributed denial-of-service (DDoS) attacks on websites, networks, and other digital infrastructure by harnessing the collective power of all the bots within the Mirai botnet simultaneously [58].

• Mirai GRE-ETH Flood: GRE, short for Generic Routing Encapsulation, is a protocol for creating virtual point-to-point connections over an IP network.
It allows for the encapsulation of many network layer protocols. DDoS scrubbing providers utilize GRE as part of their mitigation architecture. A GRE-ETH Flood is a network attack that explicitly targets network devices, such as routers and switches, by overwhelming them with excessive GRE (Generic Routing Encapsulation) and Ethernet (ETH) frames. This attack involves the attacker producing many GRE and Ethernet frames and directing them towards the target device to overpower its processing capabilities. The inundation of packets can deplete system resources such as CPU, memory, and bandwidth, making the device unusable or incapable of processing valid data. The primary aim of a GRE-ETH flood attack can differ, but typical objectives involve disrupting network services, inducing denial-of-service (DoS) or distributed denial-of-service (DDoS) scenarios, and perhaps exploiting vulnerabilities in the target device's management of GRE and Ethernet traffic [81].

• Mirai GRE-IP Flood: A GRE-IP Flood is a network attack characterized by the deliberate inundation of a network with a substantial number of GRE (Generic Routing Encapsulation) and IP (Internet Protocol) packets. The attack involves the attacker creating a substantial quantity of GRE-encapsulated IP packets and directing them towards the target network to exhaust its resources. The inundation of packets depletes network bandwidth, router computational capacity, and other resources, resulting in network congestion, deceleration, or even total unavailability of services. The purpose of a GRE-IP Flood attack might vary, but typical objectives include interrupting network operations, inducing denial-of-service (DoS) circumstances, or exploiting weaknesses in network infrastructure [81].

• Mirai UDP Plain: A UDP Plain attack, often referred to as a UDP flood attack, is a network-based denial-of-service (DoS) attack that explicitly targets a server or network infrastructure by overwhelming it with a large number of User Datagram Protocol (UDP) packets. In a UDP Plain attack, the attacker inundates the target server or network with a substantial volume of UDP packets, causing it to become overwhelmed and unable to handle the incoming packets effectively. In contrast to TCP, UDP is a connectionless protocol. It does not necessitate a handshake to establish a connection, allowing for the rapid generation and transmission of many packets. The UDP Plain attack uses UDP's stateless feature, allowing the attacker to transmit packets to the target without establishing a connection beforehand. This facilitates the initiation of extensive attacks by utilizing botnets or other automated mechanisms. Depending on the strength and capacity of the targeted infrastructure, a UDP Plain attack can cause various problems, such as slowing down networks, degrading services, or completely denying access. Furthermore, due to the absence of inherent procedures in UDP for confirming the delivery of packets or maintaining their order, the attack might potentially lead to the loss of packets or their delivery in an incorrect sequence, causing additional communication disruption.

14. MITM ARP Spoofing: ARP spoofing, also known as ARP poisoning, is a type of Man-in-the-Middle (MitM) attack that allows attackers to intercept communication between network devices. The methodology of this attack comprises the subsequent stages: Initially, the attacker acquires entry into the network and does a comprehensive examination of the network to ascertain the IP addresses of a minimum of two devices, usually a workstation and a router. Afterwards, the attacker uses spoofing tools such as Arpspoof or Driftnet to send fake ARP answers. The falsified responses falsely claim that the attacker's MAC address matches the IP addresses of both the router and the workstation, deceiving them into sending their traffic to the attacker's device instead of talking with each other directly. As a result, the ARP cache entries of the targeted devices are modified, redirecting their communication through the attacker's system. This allows the attacker to have access to all conversations. After successfully carrying out an ARP spoofing attack, the attacker can secretly listen in on conversations, except for those that are encrypted using methods like HTTPS; take control of sessions by obtaining session IDs to gain unauthorized entry to logged-in accounts; tamper with communication by, for example, sending harmful files or redirecting users to malicious websites; and initiate Distributed Denial of Service (DDoS) attacks by supplying the MAC address of the target server instead of their own, resulting in overwhelming traffic if done across multiple IPs [32] (a minimal monitoring sketch follows this list).
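As a complement to the attack descriptions above, the short sketch below illustrates how the ARP spoofing pattern just described can be surfaced in practice: it watches ARP replies and flags an IP address whose claimed MAC address changes. This is a minimal, hedged illustration only, not the detection approach proposed in this thesis; it assumes the scapy package is available and that the script runs with the privileges needed to sniff traffic.

```python
# Illustrative only: flag ARP replies in which the MAC address claimed
# for a known IP address changes, a telltale sign of ARP spoofing.
# Not part of the thesis pipeline; requires scapy and sniffing privileges.
from scapy.all import sniff, ARP

ip_to_mac = {}  # last MAC address seen claiming each IP address

def check_arp(pkt):
    if pkt.haslayer(ARP) and pkt[ARP].op == 2:  # op 2 = "is-at" (ARP reply)
        ip, mac = pkt[ARP].psrc, pkt[ARP].hwsrc
        if ip in ip_to_mac and ip_to_mac[ip] != mac:
            print(f"Possible ARP spoofing: {ip} moved from {ip_to_mac[ip]} to {mac}")
        ip_to_mac[ip] = mac

sniff(filter="arp", prn=check_arp, store=False)
```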
2.4 Concluding Remarks

This chapter provided a comprehensive summary of the essential background required for the thesis. Initially, we examined the fundamentals of the subject matter, including a thorough analysis of crucial technical terminology linked to the topic, with a specific emphasis on attacks related to the Internet of Things (IoT). We then extensively examined the IoT attacks present in the training datasets. Having established the context and concepts provided in this chapter, in Chapter 3 we will examine the work conducted in IoT attack detection, specifically concentrating on various machine learning and deep learning techniques.

Chapter 3
Literature Review

3.1 Overview

The rapid proliferation of the Internet of Things across various businesses has increased security problems. The first part of this chapter provides an overview of the research on attacks on the Internet of Things. It then surveys, based on this research, the various methods utilized to carry out Internet of Things attack detection:
• Machine Learning
• Deep Learning

3.2 Literature Review

Although there has been a lot of research on attacks targeting the Internet of Things (IoT), comprehensive literature on classifying these attacks is currently lacking. Therefore, our prior study introduced a new classification system that analyzes existing surveys and taxonomies [60].

Many academic publications thoroughly analyze many aspects of risks and attacks in the IoT field. The literary works [72] [55] [85] [29] [41] extensively examine the classification of risks and intrusions linked to the Internet of Things (IoT). These papers primarily focus on two main categories: the architectural features of the Internet of Things (IoT) and the protocols and standards used in the IoT area. Although there is a wealth of literature on threats and attack taxonomies, it is essential to highlight that just a few studies specifically focus on viable solutions.
Table 3.1: Comparison of IoT Attack Surveys
Year | Title | # Attacks discussed | Taxonomy | Attack to Vulnerability Mapping | Detection methods
2020 | A survey on privacy and security of Internet of Things [55] | 12 | Yes | No | Yes
2017 | A survey of intrusion detection in Internet of Things [85] | 0 | Yes | No | Yes
2019 | Intrusion detection systems in the Internet of things: A comprehensive investigation [29] | 9 | Yes | No | Yes
2018 | Internet of things security: A top-down survey [41] | 0 | Yes | No | Yes
2021 | State-of-the-Art Review on IoT Threats and Attacks: Taxonomy, Challenges and Solutions [42] | 59 | Yes | No | Yes
2020 | Machine learning based solutions for the security of Internet of Things (IoT): A survey [73] | 31 | Yes | No | Yes
2018 | IoT security: Review, blockchain solutions, and open challenges [39] | 19 | Yes | Yes | Yes
2018 | A Comprehensive IoT Attacks Survey Based on a Building-blocked Reference Model [2] | 51 | Yes | No | No
2020 | A Comprehensive Survey on Attacks, Security Issues and Blockchain Solutions for IoT and IIoT [64] | 22 | Yes | No | No
2022 | A Survey on IoT Security: Attacks, Challenges and Countermeasures [12] | 17 | Yes | No | Yes
2021 | A Survey on Security Attacks and Solutions in the IoT Network [45] | 17 | Yes | No | No
2021 | A survey on Classification of Cyber-attacks on IoT and IIoT devices [66] | 32 | Yes | No | Yes
2023 | A Comprehensive Survey on IoT Attacks: Taxonomy, Detection Mechanisms and Challenges [60] | 149 | Yes | Yes | Yes

However, it is essential to mention that the works discussed above do not offer any solutions to the risks and threats that emerge from the extensive use of pervasive technologies like blockchain (BC), fog computing (FC), edge computing (EC), and machine learning (ML). The authors have conducted surveys on diverse pervasive technologies for analyzing risks and attacks in a fragmented manner, as documented in the following sources: [22], [72], [75], [50], [74], and [42].

The authors of the study [15] conducted thorough research on security vulnerabilities in the Internet of Things (IoT) and presented an overview of machine learning methods used to counter these attacks. The authors analyzed 78 publications published until 2017, focusing on the solutions, issues, and areas of research that have not yet been addressed in this field. The authors of [82] conducted a survey in 2018 to explore several attack strategies that target the Internet of Things (IoT). These models include spoofing attacks, denial-of-service attacks, jamming, and eavesdropping. The authors also suggested possible security techniques to reduce these risks, such as IoT authentication, access control, malware detection, and safe offloading. The security solutions proposed in this study prominently included utilizing machine learning techniques [73].

3.3 Different techniques used to perform IoT attack detection

Identifying and mitigating these attacks is crucial to protecting IoT ecosystems' security. Several methods have been developed to address the problems of recognizing and responding to attacks. The following methods apply to the identification of attacks in the IoT domain.
Also, Figure 3.1 provides a thorough overview of several techniques for detecting attacks in the Internet of Things (IoT) domain:

Figure 3.1: Overview of Available IoT Attack Detection and Identification Methods (anomaly detection, behavioural analysis, honeypots, signature-based detection, SIEM, machine learning, and deep learning, with their sub-techniques)

• Anomaly Detection: Anomaly identification, also known as outlier or event detection, is the analytical process of identifying unusual situations inside a particular system. The anomaly detection algorithms assess incoming traffic at multiple levels, ranging from the IoT network level to the data centre. Anomaly detection is crucial because it enables the identification and analysis of anomalies within IoT data. Although rare, these anomalies can offer useful insights and practical information in many sectors, including healthcare, industry, finance, transportation, and energy. Anomaly detection in the Internet of Things (IoT) is employed in the betting and gambling sector to discover insider trading cases by analyzing trade activity patterns [13].

• Behavioural Analysis: Dynamic code analysis refers to identifying and resolving issues with potentially harmful software within a physical or virtual setting. The program's source code is run with different test inputs to identify security vulnerabilities that may occur due to its code when interacting with other programs or systems. Dynamic analysis is a technique used to study the behaviours of attacks on IoT devices [47]. Utilizing behaviour analysis for detecting IoT attacks offers numerous benefits compared to static analysis. Dynamic analysis can identify known and zero-day threats by examining similar patterns of behaviour exhibited by several attackers. Dynamic analysis is performed by employing sandbox tools like Cuckoo Sandbox or CWSandbox, which allow for the monitoring of malware behaviours in real time [47].

• Signature-based Detection: Signature-based detection (SGD) requires security professionals to create predefined rules or signatures to recognize known attack patterns. This method is especially efficient in identifying well-known attacks with signatures stored in the database. On the other hand, the database cannot identify unknown attacks without signatures.

• Honeypots: A honeypot is a cybersecurity tool that creates a realistic and valuable network to lure potential attackers. It functions within a network environment that is both isolated and segregated. The system can be seen as a simulated entity created to imitate a genuine system to attract potential attackers to interact with it. This method allows for the surveillance of the subsequent interaction between the attackers and the compromised device [59].

• Security Information and Event Management: Security Information and Event Management (SIEM) is a security system that assists enterprises in detecting and addressing security threats and weaknesses, preventing potential disruptions to business operations. SIEM systems are essential tools for corporate security teams to detect anomalies in user behaviour.
Moreover, these systems utilize artificial intelligence (AI) to simplify and automate specific labour-intensive operations associated with detecting possible threats and subsequent incident response [1].

• Machine Learning: Machine learning (ML) approaches are crucial in identifying and reducing IoT attacks. These techniques use different algorithms to recognize unusual patterns in network traffic and device activity. The algorithms are trained using extensive datasets, including regular and malicious IoT traffic. This allows them to learn the unique characteristics that distinguish various attacks. Machine learning models, such as decision trees, support vector machines, and random forests, can accurately categorize network traffic as benign or malicious by analyzing patterns they have learned (a minimal baseline sketch follows this list). Furthermore, machine learning algorithms can adjust to changing attack techniques by consistently retraining on updated datasets, thus improving their ability to identify attacks over time. In addition, intrusion detection systems that utilize machine learning may function in real time, promptly notifying security administrators when they identify suspicious behaviours. This allows for quick reactions to possible attacks [60].

• Deep Learning: Deep learning (DL), a branch of machine learning (ML), provides sophisticated capabilities for identifying attacks in the Internet of Things (IoT) by automatically extracting complex features from raw data without requiring manual feature engineering. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are frequently used in deep learning-based intrusion detection systems to secure the Internet of Things (IoT). CNNs are highly effective at identifying spatial patterns in network traffic data, while RNNs are skilled at capturing temporal relationships in sequences of device actions. Using deep learning models, Internet of Things (IoT) security systems can attain enhanced precision in detecting and defending against advanced attack strategies. Deep learning algorithms can acquire hierarchical data representations, enabling them to identify intricate and nuanced attack patterns. Furthermore, deep learning models can process and analyze large-scale datasets efficiently. They can adjust and perform effectively in dynamic Internet of Things (IoT) contexts. This makes them highly suitable for identifying familiar and new IoT attacks [60].
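To make the machine learning route concrete, the minimal sketch below trains a random forest to separate benign from malicious flows. It is an illustrative baseline under stated assumptions, not the model proposed in this thesis: it presumes a CICFlowMeter-style CSV named flows.csv (a hypothetical filename) whose Label column marks each flow as Benign or as an attack type.

```python
# Illustrative baseline only, not the proposed model. Assumes a
# CICFlowMeter-style CSV "flows.csv" with a "Label" column.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("flows.csv")
X = df.drop(columns=["Label"]).select_dtypes("number")  # numeric flow features
y = (df["Label"] != "Benign").astype(int)               # 1 = malicious flow

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```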
The methods include anomaly detection, behavioural analysis, signature-based detection, honeypots, security information and event management (SIEM), machine learning (ML), and deep learning (DL). Machine learning techniques utilize algorithms trained on large datasets to identify abnormal patterns in network traffic and device behaviour, enabling early detection of, and reaction to, potential threats. DL algorithms extract complex features from raw data, allowing for the precise and efficient identification of elaborate attack patterns. By adopting these sophisticated approaches, IoT security systems can enhance their ability to withstand emerging attack strategies, thereby protecting IoT ecosystems and their related assets from possible threats.

Chapter 4

Proposed Method

4.1 Motivation

This chapter presents a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) model, enhanced with a Self-Attention mechanism, to identify Internet of Things (IoT) attacks in TCP network flow data. The suggested architecture enables efficient detection of IoT attacks without requiring feature engineering. We first examine the rationale behind the proposed strategy, emphasizing the fundamental significance of our approach. In the next section, we present a comprehensive outline of the different components of our suggested approach. Subsequently, we present an elaborate outline of the pre-processing stage and the suggested framework for detecting IoT attacks. The pre-processing step involves extracting features from raw benign and malicious pcap files by processing them with CICFlowMeter; the extracted features are then saved into CSV files. Furthermore, the process involves constructing a model that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) with a self-attention network to detect attacks.

4.2 CICFlowmeter

CICFlowMeter is a tool that generates and analyzes network traffic flows. It creates bidirectional flows, where the first packet determines the direction of data transmission. As a result, over 83 statistical network traffic features, such as duration, number of packets, number of bytes, and length of packets, can be calculated independently for both the forward (source to destination) and backward (destination to source) directions. Other capabilities include choosing features from the available feature list, incorporating new features, and managing the duration of the flow timeout. The application generates a CSV file whose first six columns are FlowID, SourceIP, DestinationIP, SourcePort, DestinationPort, and Protocol, followed by the more than 83 network traffic analysis features. It is essential to understand that TCP flows typically end when the connection is torn down with a FIN message, whereas UDP flows are ended by a flow timeout. The individual scheme can provide an arbitrary value for the flow timeout, such as 600 seconds for both TCP and UDP [43].

4.3 Data Pre-Processing

[Figure 4.1: Data Pre-processing. Benign/malicious PCAP files pass through CICFlowMeter, a TCP flow filter, and KNN imputation to produce benign/malicious feature files.]

During the pre-processing stage, we identify a dataset that includes raw benign and malicious pcap files from which to extract features. CICFlowMeter can be executed through the command line or within Java IDEs such as IntelliJ or Eclipse. Upon launching the application, we can choose between offline and online mode.
The offline mode enables us to import raw pcap files and analyze them within the application, extracting features from the pcap files and saving them as CSV files. We can derive 83 temporal statistical features from the pcap files. In addition, we applied a filter to the dataset so that it only includes rows where the assigned protocol number is 6, which corresponds to the TCP protocol [8]. Because our dataset contains missing values, we employed KNN imputation to fill them in.

K-Nearest Neighbours (KNN) is a machine learning technique for classification and regression tasks that can also be used to impute missing data. The KNN imputation approach involves identifying the K nearest neighbours of the observation with missing data and then imputing the missing values from the non-missing values of those neighbours [3].

KNN imputation is a highly favoured technique for replacing missing data in time series because it offers several significant benefits. Firstly, its non-parametric character enables it to accommodate a wide range of data distributions observed in time series without making assumptions about underlying data patterns. KNN imputation relies on the principle that data points with similar characteristics typically have similar values; it estimates missing values from the values of nearby points, thus maintaining the local structure of the data. Unlike approaches that presume linearity, KNN imputation is appropriate for time series data with nonlinear or complex connections between variables. Furthermore, its robustness to outliers guarantees stability in the presence of irregular data points, as it prioritizes nearby clusters rather than overall patterns. KNN imputation can adapt to changing patterns in time series data by considering the nearest neighbours inside a sliding window, which makes it well suited to capturing evolving data dynamics. Finally, its straightforward application and the small number of parameters to adjust, such as the number of neighbours (k) and the distance metric, enhance its attractiveness as a powerful and adaptable method for filling in missing values in time series datasets [71].
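As a concrete illustration of this pre-processing step, the sketch below filters a small, hypothetical flow table down to TCP rows (protocol number 6) and fills its missing values with scikit-learn's KNNImputer. The column names and values are invented for the example; the thesis pipeline applies the same idea to full CICFlowMeter CSV output.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical flow-feature table with missing values (NaN),
# standing in for a CICFlowMeter CSV export.
flows = pd.DataFrame({
    "Protocol": [6, 6, 17, 6],
    "FlowDuration": [120.0, np.nan, 80.0, 95.0],
    "TotFwdPkts": [10.0, 12.0, 7.0, np.nan],
})

# Keep only TCP flows (protocol number 6), as in the filtering step above.
tcp_flows = flows[flows["Protocol"] == 6].drop(columns=["Protocol"])

# Impute each missing value from the k nearest rows (k=2 for this tiny example).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(tcp_flows), columns=tcp_flows.columns)
print(imputed)
```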
4.4 What are Convolutional Neural Networks?

The Convolutional Neural Network (CNN) is a feedforward neural network adept at automatically extracting features from data using convolution structures, eliminating the need for the manual feature extraction of traditional methods. Inspired by biological visual perception, the CNN architecture mirrors the organization of the visual cortex: artificial neurons correspond to biological neurons, CNN kernels simulate different receptors detecting various features, and activation functions mimic the threshold for neural signal transmission. Loss functions and optimizers are designed to guide CNN learning. Compared to fully connected (FC) networks, CNNs offer several advantages. Firstly, they employ local connections, where each neuron connects to only a few neurons in the previous layer, reducing parameters and speeding up convergence. Secondly, weight sharing allows groups of connections to share weights, further reducing parameters. Finally, downsampling through pooling layers leverages the local correlation of images to reduce data volume while retaining crucial information, minimizing parameters and discarding trivial features. These distinctive characteristics establish the CNN as a prominent algorithm in the realm of deep learning [44].

[Figure 4.2: CNN Architecture. An input layer feeds a feature-extraction stage (convolution + ReLU layers with pooling, producing feature maps), followed by a classification stage (flatten layer, fully connected layer, and an output layer producing a probabilistic distribution).]

4.4.1 1D-CNN

A one-dimensional convolutional neural network (1D CNN) is a specific neural network that employs convolutional layers operating in one dimension. It analyzes data that follows a temporal or sequential pattern, where each data point consists of a single set of features. A 1D CNN applies one-dimensional filters to the data, extracting important patterns or features from specific parts of the input sequence.

4.4.2 2D-CNN

A 2D CNN is a prevalent form of convolutional neural network specifically created to analyze two-dimensional data, such as photographs. A 2D CNN utilizes convolutional layers to apply filters that operate in two dimensions on the input data. This allows the network to capture spatial hierarchies and patterns such as edges, textures, and forms. These characteristics become more abstract and less concrete as we move up the network layers.

4.5 CNN Operations

Typically, eight components are required to construct a CNN model:

• Convolution
• Activation Function
• Pooling
• Fully-Connected Layers
• Dropout Layer
• Batch Normalization Layer
• Loss Function
• Optimizer

4.5.1 Convolution

Convolution is a crucial step in the process of extracting features; the results of convolution are referred to as feature maps. When applying a convolution kernel of a specific size, we inevitably lose information at the boundary. Padding enlarges the input by adding zero values, thereby indirectly adjusting the output size. In addition, the stride is used to regulate the density of convolution: as the stride increases, the density decreases. Following the convolution process, the feature maps contain many features, which increases the risk of overfitting. To eliminate this redundancy, pooling (also known as downsampling) is used, which includes techniques such as max pooling and average pooling [44].

The basic 2D convolution operation is defined as follows:

$Y(i, j) = \sum_{m} \sum_{n} X(i+m,\, j+n) \cdot K(m, n)$    (4.1)

Where:
• $Y(i, j)$ is the output of the convolution at position $(i, j)$.
• $X(i+m, j+n)$ represents the pixel values of the input image.
• $K(m, n)$ is the kernel or filter applied to the image.
• The summations over $m$ and $n$ traverse all the rows and columns of the kernel $K$, respectively.

When incorporating stride and padding, the convolution formula is modified to accommodate these parameters, enabling control over the output size and the field of view of the convolution operation:

$Y(i, j) = \sum_{m} \sum_{n} X(s \cdot i + m - p,\; s \cdot j + n - p) \cdot K(m, n)$    (4.2)

Where:
• $s$ is the stride, which dictates the step size of the filter as it slides over the image.
• $p$ is the padding, which adds layers of zeros outside the original image to preserve spatial dimensions.
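To make Eq. (4.2) concrete, the NumPy sketch below implements it directly (as is conventional in deep learning frameworks, the "convolution" is computed as a cross-correlation, i.e., the kernel is not flipped). The toy input and kernel are arbitrary and serve only to show how stride and padding change the output size.

```python
import numpy as np

def conv2d(X, K, stride=1, padding=0):
    """Direct implementation of Eq. (4.2): cross-correlation with stride s and zero padding p."""
    Xp = np.pad(X, padding)                 # add p rows/columns of zeros on every side
    kh, kw = K.shape
    oh = (Xp.shape[0] - kh) // stride + 1   # output height
    ow = (Xp.shape[1] - kw) // stride + 1   # output width
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Y(i, j) = sum_m sum_n X(s*i + m, s*j + n) * K(m, n), on the padded input
            Y[i, j] = np.sum(Xp[i*stride:i*stride+kh, j*stride:j*stride+kw] * K)
    return Y

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d(X, K))                          # 3x3 feature map
print(conv2d(X, K, stride=2, padding=1).shape)  # stride/padding change the output size
```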
4.5.2 Activation Function

Table 4.1: Activation Functions [10]

| Activation Function | Formula |
|---------------------|---------|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ |
| ReLU (Rectified Linear Unit) | $\mathrm{ReLU}(x) = \max(0, x)$ |
| Leaky ReLU | $x$ if $x > 0$; $0.01x$ otherwise |
| ELU (Exponential Linear Unit) | $x$ if $x > 0$; $\alpha(e^x - 1)$ otherwise |

Activation functions in Convolutional Neural Networks (CNNs) play a critical role by introducing non-linearity into the network, allowing it to learn complex patterns in the data. Without activation functions, a CNN would essentially be a linear model, incapable of handling the intricacies and nuances required for tasks such as image recognition or natural language processing. The most commonly used activation function in CNNs is the Rectified Linear Unit (ReLU), favoured for its computational simplicity and its ability to maintain gradient flow during training, which helps mitigate the vanishing gradient problem. ReLU outputs the input directly if it is positive; otherwise, it outputs zero. Other activation functions such as sigmoid, tanh, and Leaky ReLU are also used, each with advantages depending on the network architecture and the specific application. These functions are applied at specific layers throughout the CNN to help the model differentiate and correctly classify the input signals into outputs [44].

Table 4.2: Activation Functions, Part 2 [10]

| Activation Function | Pros | Cons |
|---------------------|------|------|
| Sigmoid | Smooth gradient; outputs between 0 and 1 | Susceptible to the vanishing gradient problem; outputs not zero-centred |
| Tanh | Outputs between -1 and 1; zero-centred output | Susceptible to the vanishing gradient problem |
| ReLU | Computationally efficient; avoids the vanishing gradient problem | Prone to the dying ReLU problem (neurons become inactive) |
| Leaky ReLU | Addresses the dying ReLU problem; non-zero gradient for negative inputs | More complex than ReLU; output not zero-centred |
| ELU | Avoids the dying ReLU problem; zero-centred output | More computationally expensive than ReLU |
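The formulas in Table 4.1 translate directly into NumPy, as the short sketch below shows; the sample inputs are arbitrary.

```python
import numpy as np

# The activation functions of Table 4.1, written directly in NumPy.
def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(x), 3))
```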
4.5.3 Pooling

Pooling layers in Convolutional Neural Networks (CNNs) reduce the spatial dimensions of the input feature maps, decreasing the number of parameters and the computation required. Pooling also helps detect features that are invariant to scale and orientation changes. This subsampling step improves the network's efficiency and robustness by abstracting higher-level features while retaining the most essential information. The most common types of pooling are max pooling and average pooling. Max pooling returns the maximum value from each cluster of neurons in the prior layer, effectively highlighting the most prominent features; average pooling calculates the average value, smoothing out the feature responses. By reducing the resolution of the feature maps, pooling layers also help prevent overfitting by providing an abstracted form of the representation. These layers are typically placed between successive convolutional layers and play a pivotal role in the architecture of deep learning models designed for tasks such as image and video recognition [40].

Table 4.3: Pooling Functions [36]

| Pooling Function | Formula | Pros | Cons |
|------------------|---------|------|------|
| Max Pooling | $y = \max(x)$ | Preserves dominant features; translation invariance | May discard less dominant features; can lead to overfitting |
| Average Pooling | $y = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Simple and efficient; reduces noise | Can lose important details; not robust to outliers |
| Global Max Pooling | $y = \max(x)$ over all elements | Captures the most important feature; reduces dimensionality | Ignores spatial information; unsuitable for tasks requiring it |
| Global Average Pooling | $y = \frac{1}{n}\sum_{i=1}^{n} x_i$ over all elements | Reduces computational complexity; less prone to overfitting | Loses spatial information; unsuitable for tasks requiring it |

4.5.4 Fully-Connected Layers

Fully-connected layers in Convolutional Neural Networks (CNNs) are crucial components that typically come after the convolutional and pooling layers. These layers are called "fully connected" because every neuron in such a layer is connected to all the neurons in the previous layer. Their main function is to perform high-level reasoning by integrating the localized features extracted by the earlier convolutional and pooling layers into the network's final decision-making process. This is where the abstract features from the entire image or input are used to classify or make predictions about the input data. The fully-connected layers map the extracted features into the final output, such as classification labels, which makes them crucial for tasks like image recognition, where the presence of specific features across the whole image must be determined. In classification tasks, these layers often use softmax activation functions to convert the output into probability distributions over the predicted output classes [19] [44].

4.5.5 Dropout Layer

[Figure 4.3: Dropout Layer. An input layer, a hidden layer, and an output layer, with a random subset of hidden units deactivated.]

Dropout layers in Convolutional Neural Networks (CNNs) are a regularization approach developed to mitigate overfitting, a prevalent issue in deep learning models, especially those with a substantial number of parameters. During the training phase, dropout selectively turns off a random subset of neurons in the layer to which it is applied. These neurons do not participate in the forward pass, and their weights are not adjusted during the backward pass. This random deactivation compels the network to acquire more resilient characteristics that do not depend on a limited group of neurons, diminishing the model's reliance on specific traits and encouraging a more general learning pattern. By employing this technique, dropout helps ensure that the neural network performs well on new, unobserved data rather than solely on the data it was trained on. The dropout rate, which is the likelihood of each neuron being deactivated, is a hyperparameter that can be tuned to enhance performance. Dropout is typically more prevalent in fully-connected layers than in convolutional layers [23].
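The following NumPy sketch shows the standard "inverted dropout" formulation of this idea; the specific rate and array shapes are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero a random subset of units and rescale the survivors,
    so activations keep the same expected value and inference needs no rescaling."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

h = np.ones((2, 8))
print(dropout(h, rate=0.5))        # roughly half the units zeroed, survivors scaled by 2
print(dropout(h, training=False))  # unchanged at inference time
```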
4.5.6 Loss Function

Table 4.4: Loss Functions [6]

| Loss Function | Formula | Pros | Cons |
|---------------|---------|------|------|
| Cross-Entropy Loss | $-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$ | Directly models probability distributions; highly effective for classification | Can be numerically unstable without proper implementation |
| Mean Squared Error (MSE) | $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$ | Easy to understand and implement; differentiable | Poor performance in classification; penalizes outliers heavily |
| Mean Absolute Error (MAE) | $\frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i \rvert$ | Less sensitive to outliers than MSE; easy to understand | Gradient can be constant, which may affect convergence |

Loss functions in Convolutional Neural Networks (CNNs) are critical components that measure the discrepancy between the network's predicted outputs and the actual target values during training. The choice of loss function depends on the specific task the CNN is designed to perform. For classification problems, a common loss function is categorical cross-entropy, which calculates the difference between the predicted probability distribution over classes and the actual distribution (typically represented as a one-hot encoded vector). For regression tasks, mean squared error (MSE) or mean absolute error (MAE) is commonly used; these measure the average of the squared or absolute differences between predicted and actual values, respectively. The role of the loss function is to provide a quantitative assessment of model performance, which the training process seeks to minimize through gradient descent and backpropagation. By continuously adjusting the model parameters to reduce the loss, the CNN learns to make more accurate predictions, effectively tuning itself to the complexity of the data it processes [44].
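As a small illustration, the two losses most relevant here, binary cross-entropy (used later to train the proposed model) and MSE, can be written directly from the formulas in Table 4.4; the eps clipping guards against the numerical instability the table notes for cross-entropy.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; clipping avoids log(0)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.99])
print(binary_cross_entropy(y_true, y_pred))  # small loss for confident, correct predictions
print(mse(y_true, y_pred))
```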
4.5.7 Optimizer

Table 4.5: Optimizers [28] [20]

| Optimizer | Update Rule | Pros | Cons |
|-----------|-------------|------|------|
| SGD | $w_{t+1} = w_t - \eta \nabla J(w_t)$ | Simple and easy to understand; effective on large-scale datasets | Slow convergence; sensitive to hyperparameters |
| Adam | $w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$ | Fast convergence; automatically adjusts the learning rate | More computationally intensive; potential for non-convergence on non-convex functions |
| RMSprop | $w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$ | Diverges less; good in online and non-stationary settings | Less popular than Adam; sensitive to initialization |
| AdaGrad | $w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$, where $G_t$ is the sum of the squares of past gradients | Handles sparse gradients well; good for large, sparse datasets | Gradient scaling can cause learning to stop early |

Optimizers in Convolutional Neural Networks (CNNs) are algorithms used to change the attributes of the neural network, such as the weights and the learning rate, in order to reduce the loss. Optimizers guide the training process by deciding how to update the weights based on the gradients of the loss function with respect to those weights. Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop. SGD is simple and has been the traditional choice; each update is performed using only a subset of the data (a mini-batch) to compute the gradient, making it computationally efficient. Adam (Adaptive Moment Estimation) combines the benefits of two other extensions of SGD, AdaGrad and RMSprop, to handle sparse gradients on noisy problems. RMSprop adjusts the learning rate for each parameter, dividing the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. These optimizers differ mainly in how much past weight-update information (momentum) they use and how they adapt the learning rate during training, making them suited to different types of neural networks and convergence problems [44].

4.6 What are Residual blocks?

[Figure 4.4: Residual Block Cell. The input x passes through two convolutional layers with ReLU activations to produce F(x), while an identity connection carries x around them; the output is H(x) = x + F(x).]

Residual blocks, also known as residual connections or shortcut connections, are integral components of deep neural networks, particularly in architectures like ResNet (Residual Networks), and are aimed at tackling the vanishing gradient problem prevalent when training very deep networks. These blocks incorporate skip connections that enable the output of one or more layers to bypass subsequent layers and be added directly to the output of deeper layers. This design allows the model to learn residuals, representing the difference between the desired output and the input to a specific layer, and facilitates the learning of identity mappings. By providing a direct path for gradient flow during backpropagation, residual blocks mitigate the vanishing gradient problem, making it feasible to train exceedingly deep networks. Consequently, they contribute to easier optimization, improved generalization performance, and faster convergence, rendering them a standard and indispensable element in deep neural network architectures across diverse domains, including computer vision, natural language processing, and speech recognition.
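As an illustration, a 1D residual block of the kind shown later in Figure 4.9 can be written in Keras as follows. This is a minimal sketch under assumed layer sizes, not the thesis's exact block; the 1x1 projection on the shortcut is a standard trick, assumed here, for when the channel count changes.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """1D residual block in the spirit of Figure 4.9: two Conv1D layers plus a
    skip connection, so the block learns F(x) and outputs H(x) = x + F(x)."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    # Project the shortcut if the channel count changes, so shapes match for the add.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    out = layers.Add()([shortcut, y])
    return layers.Activation("relu")(out)

inputs = tf.keras.Input(shape=(83, 1))  # e.g., 83 flow features, one channel
outputs = residual_block(inputs, filters=32)
tf.keras.Model(inputs, outputs).summary()
```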
4.7 What are Long Short-Term Memory Networks?

Long Short-Term Memory Networks (LSTMs) are a specialized kind of Recurrent Neural Network (RNN) designed to address the problem of learning long-term dependencies in sequence data. Traditional RNNs often struggle with vanishing or exploding gradients as the sequence length increases, hampering their ability to learn from data where distant past information is crucial for predicting future states. LSTMs overcome this challenge through their unique structure, which includes memory cells and multiple gates, namely the input, forget, and output gates. These gates manage the flow of information into and out of the memory cell, effectively allowing the network to retain or discard information dynamically. This selective memory capability enables LSTMs to maintain long-range dependencies, making them highly effective for a wide range of sequential tasks such as natural language processing, speech recognition, and time series prediction. Their ability to selectively remember and forget information across long sequences makes them particularly powerful in fields where context and history significantly influence current outcomes [61].

[Figure 4.5: LSTM Cell. The input vector x_t and the memory C_{t-1} and output h_{t-1} from the previous block flow through the forget, input, and output gates, built from sigmoid and tanh units combined by element-wise multiplication and summation, to produce the current block's memory C_t and output h_t.]

[Figure 4.6: LSTM Network. A chain of LSTM cells, each passing its cell state and output to the cell at the next timestep.]

The LSTM updates for timestep $t$, given the inputs $x_t$ (current input), $h_{t-1}$ (previous output), and $C_{t-1}$ (previous cell state), are as follows:

Forget gate:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

Input gate:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

Cell state update:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$

Output gate:
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t * \tanh(C_t)$
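As a check on the notation, the following NumPy sketch runs one LSTM cell step exactly as the equations above prescribe. Packing the four gate weight matrices into a single matrix W (in the order forget, input, candidate, output) is an implementation convenience assumed for the example, not something prescribed by the text.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep, following the gate equations above.
    W maps the concatenated [h_prev, x_t] to the four gate pre-activations."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = np.concatenate([h_prev, x_t]) @ W + b     # shape (4 * hidden,)
    f, i, c_hat, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c_t = f * c_prev + i * np.tanh(c_hat)         # cell state update
    h_t = o * np.tanh(c_t)                        # new hidden state / output
    return h_t, c_t

hidden, features = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(hidden + features, 4 * hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):        # a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```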
4.8 What are Attention Layers?

An attention function is a mathematical operation that takes a query and a set of key-value pairs as input and produces an output. The query, keys, values, and result are all represented as vectors. The output is calculated as a weighted sum of the values, where the weight allocated to each value is determined by a compatibility function that compares the query with the corresponding key [76].

The concept of attention in Long Short-Term Memory Networks (LSTMs) enhances these models by allowing them to focus selectively on the parts of the input sequence that are most relevant to the task, improving their ability to manage long-range dependencies in complex sequences. Attention mechanisms dynamically assign weights to different inputs at each time step, indicating the importance of each part of the data for predicting the current output. In practice, this means that instead of treating all parts of the input equally, the model learns to pay "attention" to specific segments of the input that are more informative for the current decision or prediction. This capability is particularly useful in tasks such as machine translation, where the relevance of input words can vary significantly depending on the context within the sequence. By integrating attention with LSTMs, the model not only retains information over long periods but also adapts its focus according to the evolving semantic importance of different parts of the input, leading to improved performance on tasks requiring sophisticated contextual interpretation [68] [49] [38] [48].

[Figure 4.7: Self-Attention Mechanism. Q and K enter a matrix multiplication, the result is scaled and passed through a softmax, and the resulting weights are multiplied (MatMul) with V to produce the output.]

4.8.1 Self-Attention

The self-attention mechanism simulates the human brain's attention by allocating resources to focus on relevant rather than irrelevant information. It compensates for variations in the relevance of hidden information that the LSTM network does not recognize by assigning different weights to hidden features at different phases. Thus, the self-attention mechanism retrieves advanced traits while improving long-term dependency modelling [84].

Input Representation: Let $X \in \mathbb{R}^{n \times d}$ be the input matrix, where $n$ is the sequence length and $d$ is the dimension of the embeddings [18].

Linear Transformations: The input $X$ is linearly transformed into three matrices, Query ($Q$), Key ($K$), and Value ($V$), using learned weight matrices $W_Q$, $W_K$, and $W_V$:

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

where $W_Q \in \mathbb{R}^{d \times d_k}$, $W_K \in \mathbb{R}^{d \times d_k}$, $W_V \in \mathbb{R}^{d \times d_v}$, and $d_k$ and $d_v$ are the dimensions of the key and value vectors, respectively.

Dot-Product Attention: The attention scores are calculated by taking the dot product of the query $Q$ with the transposed key $K$:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Breaking it down into steps:

Score Calculation: $\text{scores} = QK^T$

Scaling: $\text{scaled scores} = \frac{QK^T}{\sqrt{d_k}}$. Dividing the scores by $\sqrt{d_k}$ helps stabilize the gradients.

Softmax: $\text{attention weights} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$. The softmax function converts the scores into a probability distribution.

Weighted Sum: $\text{context vector} = \text{attention weights} \cdot V$

Explanation:

• Input Representation: The input $X$ is a matrix of shape $(n, d)$, where each row represents a token in the sequence and each column represents a feature of the embedding.

• Linear Transformations: The input matrix $X$ is transformed into query $Q$, key $K$, and value $V$ matrices using the weight matrices $W_Q$, $W_K$, and $W_V$. These transformations allow the model to learn different representations for querying, keying, and valuing the input data.

• Dot-Product Attention:
  – Score Calculation: The dot product $QK^T$ yields a score matrix representing the similarity between each query and key.
  – Scaling: Dividing the scores by $\sqrt{d_k}$ helps prevent the softmax function from producing extremely small gradients.
  – Softmax: The softmax function normalizes the scores into probabilities, indicating each token's importance.
  – Weighted Sum: The context vector is computed as the weighted sum of the value vectors, where the weights are the attention scores. This context vector captures the relevant information from the entire sequence for each token.

This mechanism allows the model to focus on different parts of the input sequence dynamically, depending on the context, which is particularly useful for tasks involving long-range dependencies [18].
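The derivation above maps directly onto a few lines of NumPy. The sketch below computes softmax(QK^T / sqrt(d_k)) V for a toy sequence; the random X and weight matrices stand in for learned embeddings and parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention as derived above."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # numerically stable row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

n, d, d_k = 5, 8, 4                                  # sequence length, embedding dim, key dim
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
out = self_attention(X, rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)),
                        rng.normal(size=(d, d_k)))
print(out.shape)  # (5, 4): one context vector per token
```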
4.9 What are SHAP values?

SHAP (SHapley Additive exPlanations) values offer a method for elucidating the outcomes of any machine learning model by employing a game-theoretic approach that gauges each player's contribution to the final result. In machine learning, each feature receives an importance value denoting its impact on the model's output. These SHAP values reveal the individual influence of each feature on every prediction, assess the relative significance of each feature, and ascertain the model's reliance on feature interactions [9] [67].

The SHAP value ($\phi$) for a feature $j$ in a prediction model $f$ is given by:

$\phi_j(f) = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{j\}) - f(S) \right]$

Where:
• $F$ is the set of all features.
• $S$ is a subset of features excluding $j$.
• $f(S)$ is the model's output using only the features in set $S$.
• $|S|$ and $|F|$ are the cardinalities of sets $S$ and $F$, respectively.
• $\phi_j(f)$ is the SHAP value for feature $j$.

Widely utilized in machine learning, SHAP values afford a consistent and unbiased explanation of how each feature influences the model's predictions. They derive from game theory principles, attributing an importance metric to each feature: positive SHAP values denote a positive influence on predictions, negative values imply a negative impact, and the magnitude indicates the strength of the effect [67] [9] [46] [51]. Notably, SHAP values are model-agnostic, making them applicable to diverse machine learning models such as linear regression, decision trees, random forests, gradient boosting models, and neural networks. These values possess several advantageous properties, including additivity, local accuracy, missingness, and consistency. Their additivity allows each feature's contribution to the prediction to be computed independently, facilitating efficient computation even for high-dimensional datasets [67] [9] [46] [51]. Furthermore, SHAP values provide an accurate and localized interpretation of a model's prediction for a given input while remaining robust to missing or irrelevant features. Importantly, they offer a consistent interpretation of a model's behaviour, ensuring stability even in the face of changes in model architecture or parameters. Overall, SHAP values furnish a reliable and objective means to gain insight into a machine learning model's prediction mechanisms and the features exerting the greatest influence [9] [46] [51].

4.9.1 SHAP Feature Importance Scores

SHAP feature importance provides a comprehensive method for understanding the impact of individual features on a machine learning model's predictions. Unlike traditional methods such as permutation importance, which assesses feature importance by measuring the decrease in model performance when a feature's values are randomly permuted, SHAP values consider the interactions between features and provide a more nuanced understanding of their influence on predictions [46] [51].

While permutation importance offers a straightforward measure of feature importance, SHAP values capture the contribution of each feature in the context of the other features, allowing a more accurate and interpretable assessment of their impact. In situations where feature interactions contribute significantly to model predictions, SHAP feature importance provides more insightful and reliable results. Permutation importance may suffice for simpler models with fewer interactions between features and can be computationally less intensive. Ultimately, the choice between the two depends on the complexity of the model and the importance of capturing feature interactions for accurate interpretation [51].

While SHAP values typically provide local explanations for individual predictions, aggregating these values across a dataset offers insight into the global importance of each feature. Global feature importance quantifies the overall impact of features across all predictions, measuring how significant each feature is in the model's decision-making process. The global importance of a feature $j$ is calculated by summing the absolute SHAP values for that feature across all data points in the dataset:

$\mathrm{Global\ Importance}(j) = \sum_{i=1}^{N} \left| \phi_j^{(i)} \right|$

Where:
• $N$ represents the total number of data points in the dataset.
• $\phi_j^{(i)}$ denotes the SHAP value for feature $j$ at the $i$-th instance.
• The absolute values of the SHAP values are summed to account for both positive and negative contributions uniformly.
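As a minimal, self-contained illustration of this aggregation step, the sketch below uses the shap library's TreeExplainer on a toy random-forest regressor (for neural networks such as the proposed model, shap's DeepExplainer plays the analogous role). The data and model are invented for the example; only the final two lines, which implement the global-importance formula above, carry over directly.

```python
import numpy as np
import shap                                   # pip install shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy stand-in for flow features
y = X[:, 0] + 0.5 * X[:, 2]                   # only features 0 and 2 matter

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)   # (200, 5) per-sample attributions

# Global importance: sum of absolute SHAP values over all samples (the formula above).
global_importance = np.abs(shap_values).sum(axis=0)
print(np.argsort(global_importance)[::-1])    # expect features 0 and 2 to rank first
```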
4.10 Proposed Mechanism

The proposed approach consists of two distinct stages. The initial stage involves the conversion of data, extraction of features, generation of datasets for training and testing the model, and the construction of the model itself. The second stage involves training the model on the training dataset using all features and determining feature importance with SHAP values. Subsequently, the model is retrained on a reduced set of features while maintaining performance metrics comparable to the original model. This stage therefore determines the optimal set of features for training the model.

4.10.1 Pre-processing - Stage I

This section provides a detailed analysis of the components involved in the pre-processing stage; Figure 4.1 gives a high-level overview of the pre-processing workflow.

• Data Collection: For data acquisition, we required benign and malicious pcap files sourced from an IoT attack dataset, which significantly facilitates the advancement of security analytics applications within IoT environments. Specifically, we employed the CIC IoT dataset 2023 for training, validation, and testing. Additionally, after completing the training phase, we utilized eight additional IoT attack datasets to evaluate the efficacy of the developed model [53].

• Data Conversion and Feature Extraction: As previously outlined in Section 4.2, we employed CICFlowMeter, a network traffic flow generator and analyzer, to extract temporal statistical features from raw pcap files. We extracted over 83 features from these files and stored them in CSV format, facilitating their use in both the model training and testing phases.

• Data Filtration, Imputation and Creation: We implemented a filtering mechanism on the dataset, retaining only rows with an assigned protocol number of 6, indicative of the TCP protocol [8]. To address missing data within our dataset, we opted for KNN imputation to complete the missing values. Subsequently, by merging the resulting CSV files, we synthesized the necessary dataset.

• Model Creation: At this stage, we examine the components used to construct the model. The model incorporates convolutional layers, residual blocks, LSTM layers with an attention mechanism, and final processing and output layers. Let's break down each module's purpose and workings (a code sketch showing how these modules compose follows the list):

[Figure 4.8: Model Architecture. Two residual blocks, each followed by max pooling and dropout; two LSTM layers with dropout; a self-attention block; then flatten, dense, dropout, and output layers.]

[Figure 4.9: Residual Block. Data passes through two 1D convolution layers with activations, and the block's input is added to the convolutional output.]

– Input Layer: This layer defines the input shape required to process the data for model training and testing. For our model, the input shape is (feature_len, 1), where feature_len is the number of feature columns in one batch.

– Residual Blocks: The model starts with CNN-based residual blocks, which deepen the network without losing the ability to train effectively.

– Max Pooling: Reduces the spatial dimensions of the output from the convolutional layers, summarizing features.

– Dropout: Applied after the pooling and LSTM layers to prevent overfitting by randomly dropping units during training.

– LSTM Layers: Process the output of the convolutional layers to learn from the temporal patterns in the data. Setting return_sequences=True keeps the time dimension for attention processing.

– Attention Layers: Applied to the output of the LSTM layers to focus the model's learning on important temporal elements.

– Flatten: Converts the multi-dimensional output of the attention layer into a one-dimensional array suitable for input to the fully connected layer.

– Dense Layer: A dense layer serves as the intermediate stage where the spatially distributed features extracted by the convolutional, pooling, and LSTM layers are combined into a single vector representation. Dense layers enable the network to learn high-level abstractions and relationships among the extracted features, making them more suitable for classification or regression tasks. These intermediate dense layers typically incorporate non-linear activation functions such as ReLU to introduce non-linearity and capture complex patterns in the data. Additionally, they may be accompanied by dropout or batch normalization layers to regularize the network and prevent overfitting.

– Output Layer: A dense layer that outputs the final prediction of the model. The activation function is chosen based on the task; since we are performing binary classification, we use the sigmoid function.
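The sketch below shows one way these modules could be composed in Keras, following Figures 4.8 and 4.9. It is a minimal illustration, not the thesis's exact implementation: the filter counts, LSTM widths, dropout rate, and the use of Keras's built-in dot-product Attention layer for the self-attention block are all assumptions made for the example (the actual training hyperparameters appear in Table 4.6).

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """Conv1D residual block as in Figure 4.9 (same helper as the Section 4.6 sketch)."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

feature_len = 83                                    # number of flow features per sample
inputs = tf.keras.Input(shape=(feature_len, 1))

x = residual_block(inputs, 32)                      # residual block + pooling + dropout, twice
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(0.2)(x)
x = residual_block(x, 64)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(0.2)(x)

x = layers.LSTM(64, return_sequences=True)(x)       # keep the time axis for attention
x = layers.Dropout(0.2)(x)
x = layers.LSTM(64, return_sequences=True)(x)
x = layers.Dropout(0.2)(x)

x = layers.Attention()([x, x])                      # self-attention: query = key = value
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary benign/malicious output

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```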
Figures 4.8 and 4.9 provide a high-level overview of the model architecture. This architecture is particularly effective for complex sequence-modelling tasks that benefit from spatial feature extraction and from the ability to remember and emphasize important parts of the sequence data over time.

4.10.2 Proposed Architecture - Stage II

• Hyperparameter Tuning: Hyperparameter tuning, also known as hyperparameter optimization, refers to selecting the optimal hyperparameters for a machine learning model to maximize its performance on a given dataset. Hyperparameters are parameters set before the learning process begins; they control aspects such as the complexity of the model, the regularization strength, and the learning rate. Unlike model parameters learned during training, hyperparameters are not learned from the data and must be specified by the user [65].

Grid search is a hyperparameter tuning technique for finding the optimal combination of hyperparameters for a machine learning model. It involves defining a grid of hyperparameter values and exhaustively searching through all possible combinations of these values to identify the combination that yields the best performance on a chosen evaluation metric.

Using grid search, we determined five optimal combinations of hyperparameters for training, validation, and testing. Among the five options, we selected the hyperparameter configuration that performs best for calculating feature importance, retraining the model, and compressing it. This selection was based on testing against external datasets. Table 4.6 lists the hyperparameter combinations used for training the model.

Table 4.6: Hyperparameters List

| No | Activation | Loss | Optimizer | Batch Size | Epochs | Shuffle |
|----|------------|------|-----------|------------|--------|---------|
| 1 | ReLU | Binary Cross-entropy | Adam | 32 | 20 | True |
| 2 | ReLU | Binary Cross-entropy | Adam | 16 | 20 | True |
| 3 | Leaky ReLU | Binary Cross-entropy | Adam | 16 | 20 | True |
| 4 | ReLU | Binary Cross-entropy | RMSProp | 16 | 20 | True |
| 5 | ReLU | Binary Cross-entropy | RMSProp | 8 | 20 | True |

• Model Training and Testing: To facilitate model training, we divided the d