

Built a network intrusion detection model
Problem:
Classify incoming traffic to a server as benign, suspicious, or malicious.
Dataset used:
https://huggingface.co/datasets/witfoo/precinct6-cybersecurity-100m
This is a massive labelled dataset with 114 million rows.
Journey:
It took me 5 versions to arrive at a satisfactory result.
Versions 0, 1, 1.1:
This stage was all about exploration. Because the data is already structured and labelled, I used the features more or less blindly and built a two-stage model with models such as random forest, and it didn't work well. I then consulted my TAs, who recommended researching models suited to a massive number of data points. I did, and decided to use a deep neural network (DNN).
The dataset problem:
Upon further investigation, I found that the data is heavily imbalanced: 99.40% is benign traffic, 0.54% suspicious, and 0.06% malicious. I could find the malicious rows only in the last 4 million rows, that is, files 56 and 57. File 57 is entirely malicious traffic.
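To show why plain accuracy is misleading with a split like this, here is a quick sketch that uses only the percentages above. A "model" that always answers benign already scores 99.40% accuracy while catching no threats at all. The variable names are just for illustration.

```python
# Class shares taken from the post (percent of all rows).
class_share = {"benign": 99.40, "suspicious": 0.54, "malicious": 0.06}

# A trivial baseline that always predicts the most common class.
majority = max(class_share, key=class_share.get)

# Its accuracy is simply the share of that class.
baseline_accuracy = class_share[majority]

# Its recall is 100% on the majority class and 0% on everything else.
baseline_recall = {
    label: (100.0 if label == majority else 0.0) for label in class_share
}

print(majority, baseline_accuracy, baseline_recall)
```

So any accuracy number for this dataset needs to be read next to per-class recall.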
Versions 4 and 5:
To deal with the imbalance (the dataset comes as parquet files 0-56, each with 2 million rows), I pulled 10,000 benign rows from all the files, all the suspicious rows from all the files, and a few malicious rows from files 56 and 57. I trained a DNN on this, and the result was literally 100% accuracy and recall. Something was obviously wrong, so I investigated.
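The undersampling idea can be sketched roughly like this. It is a toy illustration, not the code from the repo: the `label` key, the `undersample` helper, and the in-memory rows are all assumptions, and a real version would read the parquet files instead. For simplicity it keeps every non-benign row, whereas I only took a few of the malicious ones.

```python
import random

def undersample(rows, benign_cap, seed=0):
    """Keep every non-benign row and at most `benign_cap` random benign rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    benign = [r for r in rows if r["label"] == "benign"]
    threats = [r for r in rows if r["label"] != "benign"]
    kept_benign = rng.sample(benign, min(benign_cap, len(benign)))
    return kept_benign + threats

# Toy stand-in for one file: 1,000 benign rows and 5 suspicious rows.
toy_file = [{"label": "benign", "id": i} for i in range(1000)]
toy_file += [{"label": "suspicious", "id": 1000 + i} for i in range(5)]

sample = undersample(toy_file, benign_cap=100)
print(len(sample))
```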
Version 6:
Investigating versions 4 and 5, I found a couple of stupid mistakes. I had not set aside a complete file for testing only, and I was using some post-transaction features. That means the model got clues from those features about whether the traffic was an attack, which is a form of data leakage.
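Both fixes can be sketched in a few lines. This is only an illustration under assumptions: the column names in `POST_EVENT_COLUMNS`, the `file_id` key, and the helper names are hypothetical and do not come from the dataset schema. File 56 is used as the hold-out file here.

```python
# Hypothetical column names; the real dataset's fields may differ.
POST_EVENT_COLUMNS = {"analyst_verdict", "incident_id"}
HOLDOUT_FILE = 56  # file reserved for testing only

def strip_leaky(row):
    """Return a copy of the row without post-transaction columns."""
    return {k: v for k, v in row.items() if k not in POST_EVENT_COLUMNS}

def split_by_file(rows):
    """Send every row from the hold-out file to test, the rest to train."""
    train, test = [], []
    for row in rows:
        target = test if row["file_id"] == HOLDOUT_FILE else train
        target.append(strip_leaky(row))
    return train, test

# Two toy rows, one from a training file and one from the hold-out file.
rows = [
    {"file_id": 3, "src_port": 443, "analyst_verdict": "ok", "label": "benign"},
    {"file_id": 56, "src_port": 22, "incident_id": 9, "label": "malicious"},
]
train, test = split_by_file(rows)
print(train, test)
```

Splitting by file instead of by random row keeps the test file completely unseen during training.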
So I rebuilt the dataset for version 6. File 56 was set aside for testing because it is the only file with all three kinds of traffic: benign, suspicious, and malicious. Then I took 10,000 benign rows and all the suspicious rows from the rest of the files, plus 70% of the malicious rows from file 57. I removed the post-transaction features from the training data and trained a two-stage model. Stage_1 classifies the traffic as benign or threat, and stage_2 classifies everything stage_1 flags as a threat as suspicious or malicious.
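The control flow of the two-stage setup looks roughly like this. The `stage_1` and `stage_2` functions below are hypothetical rule-based stand-ins with a made-up `score` field so the example runs on its own; in the real project both stages are trained models.

```python
def two_stage_predict(row, stage_1, stage_2):
    """Run stage_1 first; only rows flagged as threats reach stage_2."""
    if stage_1(row) == "benign":
        return "benign"
    return stage_2(row)  # "suspicious" or "malicious"

# Hypothetical stand-ins for the trained models.
def stage_1(row):
    return "threat" if row["score"] > 0.5 else "benign"

def stage_2(row):
    return "malicious" if row["score"] > 0.9 else "suspicious"

rows = [{"score": 0.1}, {"score": 0.7}, {"score": 0.95}]
predictions = [two_stage_predict(r, stage_1, stage_2) for r in rows]
print(predictions)
```

One consequence of this design is that any threat stage_1 misses never reaches stage_2, so stage_1 recall caps the recall of the whole pipeline.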
Result:
Got realistic results. When tested on a random 500k rows from file 56, only 5.7% of predictions were wrong. As a harder test, I ran stage_2 alone on all the suspicious and malicious traffic from file 56, and only 10.2% of predictions were wrong.
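Since the classes are so imbalanced, an overall error rate is easier to interpret next to per-class recall. Here is a minimal sketch of both metrics on made-up toy labels. These are not my actual predictions, and the helper names are just for illustration.

```python
from collections import Counter

def error_rate(y_true, y_pred):
    """Fraction of predictions that do not match the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def per_class_recall(y_true, y_pred):
    """Per class: correctly predicted rows / all true rows of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Toy labels: 8 benign rows, 1 suspicious row (missed), 1 malicious row (caught).
y_true = ["benign"] * 8 + ["suspicious", "malicious"]
y_pred = ["benign"] * 8 + ["benign", "malicious"]

print(error_rate(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In this toy case the error rate is low even though every suspicious row is missed, which is exactly what a single number can hide.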
Git: https://github.com/Elijah-bino/Intrusion_recog_model_v6
I would love feedback. I have to say, this subreddit is very active and gives honest feedback.