I want to share with you an upcoming IEEE Systems Journal entry, “A Comprehensive Survey of Databases and Deep Learning Methods for Cybersecurity and Intrusion Detection Systems” (Gümüşbaş). This excellent survey provides a detailed overview of recent advancements using deep neural networks (DNN) for intrusion detection systems (IDS).
“What is an IDS?”, you might ask. An IDS is an intelligent monitoring tool, implemented in either hardware or software, that aggregates network-system telemetry (e.g., packets and logs) and applies various algorithms to detect, block, or raise alerts on suspicious activities.
The need for information security can be traced back to the origin of language itself; people, tribes, and nations all had secrets that no one else was meant to know. Fast forward to today, and more and more individuals are embracing the internet, almost as a way of life, with eCommerce, Social Media, Internet-of-Things, and so on.
This has contributed to an alarming (for information security folks) and awesome (for data science folks) increase in the volume, velocity, and variety of data floating on the internet, also known as the three V’s of big data (see Figure 1).
Before reading on, I recommend that you have a basic understanding of what a DNN is and how it fundamentally works.
According to the survey [Gümüşbaş], there are three types of IDS algorithms, each with its own strengths and weaknesses:
- Rule-based: “use prior knowledge of attacks, such as the corresponding data distributions, to create a rule system and perform detection.”
- Statistics-based: “detect anomalies by building a statistical distribution of intrusion patterns.”
- Machine Learning-based: “learning algorithms are adopted to train classifiers that can distinguish among different types of attacks.”
Rule-based implementations are fast and efficient (good for volume and velocity), but struggle with emerging/0-day threats (bad for variety). To address this limitation, statistics-based implementations were introduced to enable the detection of unknown threats, but as you can imagine, they are computationally expensive (bad for volume and velocity).
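To make that trade-off concrete, here is a minimal Python sketch (mine, not the survey's) that contrasts the two approaches. The signature list, the function names `rule_based_alert` and `statistical_alert`, the baseline numbers, and the three-standard-deviation threshold are all illustrative assumptions. For simplicity, the statistical check models a baseline of normal traffic and flags large deviations from it.

```python
import statistics
from typing import Sequence

# Hypothetical signatures of known attacks (assumption, for illustration only).
KNOWN_BAD_PATTERNS = ["' OR 1=1", "../../etc/passwd"]

def rule_based_alert(payload: str) -> bool:
    """Flag a payload if it contains any known attack signature."""
    return any(pattern in payload for pattern in KNOWN_BAD_PATTERNS)

def statistical_alert(baseline: Sequence[float], observed: float,
                      threshold: float = 3.0) -> bool:
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return observed != mean
    return abs(observed - mean) / stdev > threshold

# Made-up requests-per-minute counts observed under normal conditions.
requests_per_minute = [98, 102, 101, 97, 100, 103, 99]

# A known signature is caught, but a URL-encoded variant of it slips through.
print(rule_based_alert("GET /index.php?id=' OR 1=1"))
print(rule_based_alert("GET /index.php?id=%27%20OR%201%3D1"))

# The statistical check needs no signature, only a baseline to compare against.
print(statistical_alert(requests_per_minute, 100))
print(statistical_alert(requests_per_minute, 450))
```

The rule check is a cheap substring scan, which is why it scales well but misses anything not already on the list. The statistical check catches the unfamiliar spike, at the cost of maintaining and recomputing a baseline.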
For the remainder of this article, I will discuss machine learning-based implementations, specifically three flavors of DNN which are fast, efficient, and most importantly, adaptive:
- Convolutional neural network (CNN)
- Long short-term memory (LSTM)
- Generative adversarial network (GAN)
CNN is predominantly used in image-based classification. It uses a sequence of layers to “convolve” multidimensional data (see Figure 2). The way I look at CNN is the same way I look at origami: take a flat image and methodically fold it into a correlated shape. Researchers have developed techniques for transforming network-system telemetry into 2-D greyscale images which, when fed into a CNN, have proven useful for classifying various attack types. Note that a plain 2-D CNN has no built-in memory of earlier inputs, so on its own it is not well suited to time-series data.
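As a rough illustration of the telemetry-to-image idea, the sketch below maps the raw bytes of a captured payload onto a fixed-size grid of pixel intensities. This is only one possible encoding and an assumption on my part; the function name `bytes_to_greyscale`, the 8x8 image size, and the sample HTTP request are hypothetical, and the encodings used in the surveyed papers vary.

```python
def bytes_to_greyscale(payload: bytes, width: int = 8, height: int = 8) -> list:
    """Map raw bytes to a height x width grid of pixel intensities (0-255)."""
    size = width * height
    # Truncate long payloads and zero-pad short ones so every image has the same shape.
    padded = payload[:size].ljust(size, b"\x00")
    return [list(padded[row * width:(row + 1) * width]) for row in range(height)]

# Hypothetical captured payload (assumption, for illustration only).
image = bytes_to_greyscale(b"GET /admin HTTP/1.1\r\nHost: example.com\r\n\r\n")

# Each byte becomes one pixel; the trailing zeros are padding.
for row in image:
    print(" ".join(f"{pixel:3d}" for pixel in row))
```

Every payload ends up with the same shape, which is what a CNN expects as input. The resulting grid would then be passed to a CNN built with whichever deep learning framework you prefer.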
Unlike CNN, LSTM is predominantly used in time-series classification (see Figure 3). It uses a temporal sequence of cells to “memorize” relevant features in streaming data. On paper, LSTM and IDS are a perfect match. However, following recent advancements with CNN and its broader applications in other domains, researchers have shifted much of their attention from LSTM to CNN. Hybrid LSTM-CNN models have also shown promising results.
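To show what “memorizing” means in practice, here is a minimal sketch of a single LSTM cell reduced to scalar inputs and state. It uses the standard forget, input, and output gate equations, but the weights are untrained, hypothetical values chosen only for illustration, and the helper names `lstm_step` and `run_sequence` are my own. A real IDS model would use vector-valued cells from a deep learning framework.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell with scalar input and state."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

def run_sequence(values, w):
    """Feed a sequence through the cell, starting from an empty state."""
    h, c = 0.0, 0.0
    for x in values:
        h, c = lstm_step(x, h, c, w)
    return h, c

# Hypothetical, untrained weights (assumption, for illustration only).
weights = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
weights.update({"bf": 0.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

# Two streams that differ only in the second reading (a traffic burst).
quiet = run_sequence([0.1, 0.2, 0.1, 0.1], weights)
burst = run_sequence([0.1, 5.0, 0.1, 0.1], weights)
print("quiet (h, c):", quiet)
print("burst (h, c):", burst)
```

Both streams end with the same two readings, yet the final states differ because the cell state `c` still carries a trace of the earlier burst. That carried-over state is what makes LSTM a natural fit for streaming telemetry.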
GAN is used to learn the data distribution (i.e., characteristics) under normal, non-suspicious conditions (see Figure 4). Suspicious activities are detected by measuring the distance between normal data and freshly captured data. However, the synthetic data generated from the learned distribution is often unrealistic and requires human intervention to ensure good results.
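Training a GAN is beyond the scope of a short snippet, so the sketch below illustrates only the distance-based scoring step. As a loudly labeled stand-in for the learned distribution, it uses a handful of hard-coded "normal" reference points and scores new samples by their Euclidean distance to the nearest one. This is plain nearest-neighbor scoring, not a GAN, and the feature values, the `anomaly_score` name, and the `THRESHOLD` value are hypothetical.

```python
import math

def anomaly_score(sample, normal_samples):
    """Euclidean distance from `sample` to its closest normal reference point."""
    return min(math.dist(sample, ref) for ref in normal_samples)

# Stand-in for samples drawn from a learned "normal" distribution.
# Features are made up: (normalized packet rate, normalized mean packet size).
normal_samples = [(0.20, 0.50), (0.25, 0.55), (0.22, 0.48)]

THRESHOLD = 0.1  # hypothetical cut-off; in practice this would be tuned on held-out data

for name, sample in [("typical", (0.21, 0.50)), ("suspicious", (0.90, 0.10))]:
    score = anomaly_score(sample, normal_samples)
    print(name, round(score, 3), "ALERT" if score > THRESHOLD else "ok")
```

The further a freshly captured sample sits from everything the model considers normal, the higher its score. Choosing the threshold is where the human intervention mentioned above tends to come in.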
In any machine learning project, one of the most important tasks is collecting reliable datasets to train your model. Because information security is a global problem, many multinational communities (academic, industrial, and governmental) have developed vast datasets containing some of the most frequent attack types today.
Unfortunately, these datasets are often de-identified, simulated, incomplete, and simply not diverse enough. Privacy laws have made it difficult to obtain reliable datasets.
In this article, we explored IDS algorithms, DNN methods, and trainable datasets. I happen to be a software engineer, a security consultant, and a data scientist (…to be), so this topic is very close to the work I do. There are tremendous opportunities in this area!
D. Gümüşbaş, T. Yıldırım, A. Genovese and F. Scotti, “A Comprehensive Survey of Databases and Deep Learning Methods for Cybersecurity and Intrusion Detection Systems,” in IEEE Systems Journal, doi: 10.1109/JSYST.2020.2992966.