“K-means” Clustering and Its Real-World Use Cases in the Security Domain

Neeraj Nawale
5 min read · Jul 17, 2021

What is Clustering?

“Clustering” is the process of grouping similar entities together. The goal of this unsupervised machine learning technique is to find similarities among the data points and group similar data points together.

Image source: GeeksforGeeks

Need for Clustering

Grouping similar entities together helps profile the attributes of different groups. The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups. Clustering is also used to reduce the dimensionality of the data when you are dealing with a large number of variables.

What does K-means Clustering mean?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data
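
As a minimal sketch of the first result, the snippet below labels a new data point by picking its closest centroid. The centroid values and the helper name `nearest_centroid` are made up for illustration; in practice the centroids would come from an earlier K-means run.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Hypothetical centroids, as if produced by an earlier K-means run with K = 2.
centroids = [(1.0, 1.0), (8.0, 9.0)]

new_point = (7.5, 8.0)
label = nearest_centroid(new_point, centroids)
print(label)  # index of the cluster assigned to the new point
```

The returned index plays the role of the label for the new point.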

How does K-means Clustering work?

Steps:

1. Select the number K to decide the number of clusters.
2. Select K random points as the initial centroids. (They do not have to be points from the input dataset.)
3. Assign each data point to its closest centroid, which forms the K clusters.
4. Recompute the centroid of each cluster as the mean of the points assigned to it, which minimizes the variance within the cluster.
5. Repeat the third step, which means reassigning each data point to the new closest centroid.
6. If any reassignment occurs, go to step 4; otherwise, the clusters are final and the algorithm stops.
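
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the sample points are all assumptions, and the initial centroids are drawn from the input points (one of several possible choices).

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Step 2: pick k input points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Two visibly separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initialization, so only the grouping itself is meaningful, not the index.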

Real-world use cases of K-means clustering in the security domain

1} Intrusion Detection System (IDS)

An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management system. Anomaly detection is one type of intrusion detection. Current anomaly detection is often associated with a high false alarm rate and only moderate accuracy and detection rates, because it is unable to detect all types of attacks correctly.
To overcome this problem, K-means clustering is useful: it clusters all the data into corresponding groups before a classifier is applied, which keeps the false alarm rate reasonable. This approach has been reported to give high accuracy and good detection rates, but with a moderate false alarm rate on novel attacks.
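
To give a feel for how cluster centroids can feed anomaly detection, here is a minimal sketch of a simpler distance-threshold variant. It is not the cluster-then-classify pipeline described above: it only flags a sample when it lies far from every centroid. The centroids, the two features, the threshold value, and the name `is_anomalous` are all hypothetical.

```python
import math

# Hypothetical centroids learned from normal traffic. The two features
# (for example a scaled packet rate and a scaled connection duration)
# are made up for illustration.
centroids = [(0.2, 0.3), (0.7, 0.6)]
THRESHOLD = 0.25  # assumed cut-off; in practice it would be tuned on validation data

def is_anomalous(sample, centroids, threshold):
    """Flag a sample whose distance to every centroid exceeds the threshold."""
    nearest = min(math.dist(sample, c) for c in centroids)
    return nearest > threshold

print(is_anomalous((0.22, 0.31), centroids, THRESHOLD))  # close to a centroid
print(is_anomalous((0.95, 0.05), centroids, THRESHOLD))  # far from both centroids
```

A lower threshold flags more traffic and raises the false alarm rate; a higher one misses more attacks.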

2} Malware Detection

Malware detection refers to the process of detecting the presence of malware on a host system, or of determining whether a specific program is malicious or benign. Malware detection techniques play a vital role in catching malware attacks that can have a high impact on the cyber world. Using clustering, an unsupervised machine learning model is able to detect a malware attack by identifying the behavior of the malware.
A detection model built on K-means clustering groups the data according to the features of the malware. By studying the behavior of the malware and grouping samples with similar characteristics, the model can separate normal and suspicious data into two groups, with a reported detection accuracy of more than 90 percent.

3} Spam Filtering

Electronic mail (email) has become an essential tool for Internet users. Unwanted emails are known as spam. These emails are sent in bulk to a large number of recipients. The growing volume of spam makes it hard to keep an inbox manageable. Spam is a major issue for the internet community because it wastes resources and also pollutes our environment. To prevent these adverse effects, spam filtering is an essential task.
K-means clustering is an effective way of identifying spam. It works by looking at the different sections of the email (header, sender, and content). The emails are then grouped together based on these sections, and the resulting groups can be classified to identify which are spam. Including clustering in the classification process has been reported to improve the accuracy of the filter to 97%.
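
One simple way to classify the groups, sketched below, is a majority vote: each cluster takes the most common label among the few emails in it that were labeled by hand. This is an assumed illustration, not the method behind the 97% figure; the cluster ids, labels, and the name `label_clusters` are hypothetical.

```python
from collections import Counter

def label_clusters(cluster_ids, known_labels):
    """Map each cluster id to the most common known label among its members.

    known_labels[i] is "spam", "ham", or None when email i is unlabeled.
    Clusters without any labeled member are left out of the result.
    """
    votes = {}
    for cid, lab in zip(cluster_ids, known_labels):
        if lab is not None:
            votes.setdefault(cid, Counter())[lab] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

# Hypothetical output of a K-means run over six emails (K = 2) ...
cluster_ids = [0, 0, 1, 1, 0, 1]
# ... of which only four were labeled by hand.
known_labels = ["ham", None, "spam", "spam", "ham", None]

cluster_label = label_clusters(cluster_ids, known_labels)
predictions = [cluster_label[cid] for cid in cluster_ids]
print(predictions)
```

The two unlabeled emails inherit the label of the cluster they fall into.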

4} Crime Analysis

Crime analysis is a law enforcement function that involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder. Crime analysis also plays a role in devising solutions to crime problems, and formulating crime prevention strategies. Analysis of crime is essential for providing safety and security to the civilian population.
K-means clustering is used to extract useful information from high-volume crime datasets and to interpret the data. This helps police identify and analyze crime patterns, reduce further occurrences of similar incidents, and obtain information that helps reduce crime.

5} Android Malware Classification

Android malware is malicious software that targets a specific type of device: the Android device. Android’s less secure platform, including the Play Store where applications are downloaded and users’ ability to sideload content from the internet, creates an environment where malware can thrive. Malware often also harvests fake clicks on ads, doubling up on the value for its makers. Ransomware and scareware are the main malicious activities.
K-means clustering can be used to group this malware into clusters and, in addition, to build a classification model for Android malware in which each cluster prediction becomes an element of the cluster. The clusters constructed by the clustering algorithm are then used to train the classifier.

Conclusion

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply to a clustering task to get an idea of the structure of the dataset. The goal of K-means is to group data points into distinct, non-overlapping subgroups. It does a very good job when the clusters are roughly spherical, and it is very useful in the security domain.

NOTE: K-means clustering internally uses “Euclidean distance” to calculate the distance between two data points.
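
For reference, Euclidean distance is the square root of the sum of squared coordinate differences. A small sketch, with a hand-written helper (the name `euclidean` is just for illustration) next to the standard-library `math.dist`:

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))   # 5.0, a 3-4-5 right triangle
print(math.dist(p, q))   # standard-library equivalent (Python 3.8+)
```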

Thank You…
