K-means clustering and its real use case in the security domain

K-means clustering uses “centroids”: K different randomly initialized points in the data. Every data point is assigned to the nearest centroid. After every point has been assigned, each centroid is moved to the average of all of the points assigned to it. Then the process repeats: every point is assigned to its nearest centroid, and each centroid is moved to the average of the points assigned to it. The algorithm is done when no point changes its assigned centroid.

Shubham kumar
10 min read · Jul 19, 2021

What is K-means Clustering?

Clustering, in general, is an “unsupervised learning” method. That means we don’t have a target variable. We’re just letting the patterns in the data become more apparent.

K-means clustering distinguishes itself from hierarchical clustering by creating K random centroids scattered throughout the data. The algorithm looks a little bit like this (a minimal code sketch follows the list):

  • Initialize K random centroids.
    • You could pick K random data points and make those your starting points.
    • Otherwise, you could pick K random values for each variable.
  • For every data point, find which centroid is nearest to it.
    • Use some sort of distance measure, like Euclidean or cosine distance.
    • Assign the data point to that nearest centroid.
  • For every centroid, move it to the average of the points assigned to it.
  • Repeat the last three steps until the centroid assignments no longer change.
    • The algorithm is said to have “converged” once there are no more changes.
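To make the steps concrete, here is a minimal from-scratch sketch in Python with NumPy. It is not the implementation behind the figures in this post; the function name, the choice of Euclidean distance, and the random-data example at the bottom are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize K centroids by picking K random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # Assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Converged: no point changed its assigned centroid.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the average of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Example: 50 random 2-D points, three clusters.
points = np.random.default_rng(1).random((50, 2))
cluster_labels, centers = kmeans(points, k=3)
```

In practice you would usually reach for scikit-learn’s KMeans instead, which also handles multiple random restarts for you.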

These centroids act as the average representation of the points assigned to them. This gives you a story almost right away: you can compare the centroid values and tell whether one cluster favors a group of variables, or whether the clusters form logical groupings of key variables.

Visualizing K-means Clustering

K-means clustering produces a very nice visual, so here is a quick walkthrough of how each step might look (a sketch for producing a similar plot follows the list).

  • Here are 50 data points with three randomly initialized centroids.
  • Iteration 2 shows the new location of the centroid centers.
  • Iteration 3 has a handful more blue points as the centroids move.
  • Jumping to iteration 6, we see the red centroid has moved further to the right.
  • Iteration 9 shows the green section is much smaller than in iteration 2, blue has taken over the top, and the red centroid is thinner than in iteration 6.
  • The 9th iteration’s results were the same as the 8th iteration’s, so it has “converged”.
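The original figures are not reproduced here, but a similar picture can be generated with scikit-learn and matplotlib. This is only a rough sketch: it uses synthetic data from make_blobs and plots the final assignments rather than every intermediate iteration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 50 synthetic 2-D points in three loose groups (stand-in for the post's data).
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Color each point by its assigned cluster and mark the final centroids.
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap="viridis")
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="x", s=200, c="red", label="centroids")
plt.legend()
plt.title("K-means assignments after convergence")
plt.show()
```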

Pre-processing and Things to Consider for K-means Clustering

There are a handful of pitfalls of k-means clustering that you should address.

Bigger values carry more weight.

This is a problem because not all of your data will be on the same scale. If you’re looking at a data set of real estate listings, the last purchase price and the number of bathrooms will be on dramatically different scales, so the differences between purchase prices will dominate the distance calculation.

The solution is to do some sort of normalization: min-max normalization and z-score normalization are two fairly common options.
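As a quick sketch of those two options with scikit-learn (the tiny real-estate DataFrame below is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical listings: price and bathroom count live on very different scales.
df = pd.DataFrame({"last_purchase_price": [450_000, 720_000, 310_000],
                   "num_bathrooms": [2, 3, 1]})

minmax_scaled = MinMaxScaler().fit_transform(df)     # each column squeezed into [0, 1]
zscore_scaled = StandardScaler().fit_transform(df)   # each column to mean 0, std 1
```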

More variables pull data points further apart.

Imagine you’re in Seattle and you want to go to Mount Rainier. If you could go in a straight line, as the crow flies, it’s a bit over 50 miles. Not too bad in two dimensions (X and Y, two variables). But what if you’d like to get to the top of Mount Rainier? Well, that’s another dimension (X, Y, and now Z).

If you wanted to get to the top of Mt. Rainier (elevation 14,409 ft above sea level) from Seattle (177 ft above sea level), you’d also climb around 14,200 feet. If all three of these variables go into the k-means algorithm, the points will seem roughly twice as far apart as they would using only the two variables.

This happens all the time with real data. Throwing every variable in (even after normalizing) can create wide distances between otherwise-close data points.

The solution is to do some data reduction (such as principal component analysis) to create new variables that pack the data more compactly, or to do some variable selection (such as decision trees or forward selection) to drop variables that aren’t very important or predictive, assuming you have a response variable.
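A minimal sketch of the PCA route with scikit-learn; the wine dataset and the 95%-variance threshold are illustrative choices, not something from the original text:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)   # normalize first
X_reduced = PCA(n_components=0.95).fit_transform(X)    # keep components explaining 95% of variance
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
```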

For more on extra-dimensional space…

Works Best on Numeric Data

Since the k-means algorithm computes distances between points, it doesn’t really work with categorical (low, medium, high) variables: there is no natural distance between categories.

A simple workaround for multiple categorical variables is to calculate the percentage of variables on which each data point matches its cluster centroid (a small sketch follows).
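A tiny sketch of that matching-percentage idea, assuming we represent a cluster’s “centroid” by the most common (modal) value of each categorical variable; the column names and values below are made up:

```python
import pandas as pd

# A hypothetical cluster of rows with two categorical variables.
cluster = pd.DataFrame({"size": ["low", "low", "medium", "low"],
                        "risk": ["high", "high", "high", "medium"]})
centroid = cluster.mode().iloc[0]          # modal value of each categorical variable

# Compare one data point against the cluster's modal values.
row = pd.Series({"size": "low", "risk": "medium"})
match_pct = (row == centroid).mean()       # fraction of variables that match: 0.5
```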

Measuring Performance of K-means

Homogeneity and Completeness — If you have pre-existing class labels that you’re trying to duplicate with k-means clustering, you can use two measures: homogeneity and completeness.

  • Homogeneity means each cluster contains only observations with the same class label.
  • Completeness means all members of the same class are in the same cluster.
  • Scikit-Learn (Python) has an excellent write-up on these two measures (a quick example of the calls follows this list).
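For reference, the scikit-learn calls look like this (the toy label arrays below are just an example):

```python
from sklearn.metrics import completeness_score, homogeneity_score

true_labels    = [0, 0, 1, 1, 2, 2]   # pre-existing class labels
cluster_labels = [0, 0, 1, 2, 2, 2]   # labels produced by k-means

print(homogeneity_score(true_labels, cluster_labels))    # 1.0 only if each cluster holds a single class
print(completeness_score(true_labels, cluster_labels))   # 1.0 only if each class sits in a single cluster
```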

WithinSS — The within-cluster sum of squares measures the sum of squared differences from the cluster center. A smaller WithinSS (or SSW) means there is less variance in that cluster’s data. Here’s an example set of clusters and their data (a small code sketch of the computation follows this list).

  • The Cluster VAR X takes each row and subtracts it from its respective cluster’s average for X.
  • The same goes for Cluster VAR Y.
  • The Var SUM adds up the two Cluster VAR columns.
  • At the bottom, SSW (A, B, C) adds up the rows for its respective cluster.
  • We see that the WithinSS for cluster…
  • A = 13.5
  • B = 6.47
  • C = 8.667
  • The TOTAL WithinSS is the sum of those three values. 13.5+6.47+8.667 = 28.637
  • The TotalSS (Total Sum of Squares) is the same computation, ignoring the cluster groupings.
  • For the table above, the Total sum of squares would be 536.182
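Here is a small sketch of the same WithinSS / TotalSS computation in Python on made-up 2-D data; it will not reproduce the exact 28.637 and 536.182 values from the table above.

```python
import numpy as np

# Made-up points and cluster assignments (three clusters: 0, 1, 2).
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [4.0, 5.0], [5.0, 4.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

within_ss = 0.0
for k in np.unique(labels):
    pts = X[labels == k]
    within_ss += ((pts - pts.mean(axis=0)) ** 2).sum()   # squared distance to the cluster's own mean

total_ss = ((X - X.mean(axis=0)) ** 2).sum()             # same computation ignoring the clusters
print(within_ss, total_ss)
```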

How to Choose K (# of Clusters)

You can use the total within-cluster SS to compare different values of K. Trying several different K’s (2 clusters, 3 clusters, 4 clusters… N clusters), you plot their total within sum of squares and then look for an elbow in the graph (a short sketch of this loop follows the list below).

  • While not perfect, you can definitely see a flattening out around 7 or 8 clusters.
  • You may want to run this plot several times due to the random initialization of starting centroids.
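A sketch of that elbow loop with scikit-learn, which exposes the total within-cluster sum of squares as inertia_; the synthetic blob data here is only for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=7, random_state=0)

ks = range(2, 12)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Total within-cluster sum of squares")
plt.show()
```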

Use Case

Crime Analysis using K-Means Clustering

In today’s world, security is given high priority by politicians and governments worldwide, with the aim of reducing the incidence of crime. Data mining is an appropriate field to apply to high-volume crime datasets, and the knowledge gained from data mining approaches can be useful to and can support police forces. In this paper, crime analysis is done by performing k-means clustering on a crime dataset using the RapidMiner tool.

  1. INTRODUCTION
    In the present scenario, criminals are becoming technologically sophisticated in committing crime, and one challenge faced by intelligence and law enforcement agencies is the difficulty of analyzing the large volumes of data involved in criminal and terrorist activities. Agencies therefore need techniques that help them catch criminals and stay ahead in the eternal race between criminals and law enforcement. An appropriate field needs to be chosen to perform crime analysis; since data mining refers to extracting or mining knowledge from large amounts of data, it is used here on a high-volume crime dataset, and the knowledge gained from data mining approaches is useful to and supports police forces. An appropriate data mining approach also needs to be chosen. Clustering is a data mining approach that groups a set of objects in such a way that objects in the same group are more similar to one another than to those in other groups; it comprises various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. In this paper, the k-means clustering technique is used to extract useful information from the high-volume crime dataset and to interpret the data in a way that assists the police in identifying and analyzing crime patterns, reducing further occurrences of similar incidents, and providing information that helps reduce crime. The k-means clustering is implemented using an open-source data mining tool; among the available open-source data mining suites (R, Tanagra, WEKA, KNIME, ORANGE, RapidMiner), the clustering is done with the RapidMiner tool, an open-source statistical and data mining package written in Java with flexible data mining support options. The dataset used for the crime analysis consists of offences recorded by the police in England and Wales, by offence and police force area, from 1990 to 2011–12. In this paper, homicide, the crime of one human killing another, is analyzed. The paper is divided into seven sections: Introduction, Related Work, Proposed System Architecture, Experimental Setup & Results, Conclusion, Future Scope, and References.
    1.1 Crime analysis
    Crime analysis is defined as the analytical process that provides relevant information on crime patterns and trend correlations to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activity. It is important to analyze crime for the following reasons: 1. to inform law enforcers about general and specific crime trends in a timely manner; 2. to take advantage of the wealth of information that exists in the justice system and the public domain. Crime rates change rapidly, and improved analysis finds hidden patterns of crime, if any, without any explicit prior knowledge of these patterns. (A rough, purely illustrative sketch of this kind of analysis in Python follows the objectives below.)
    The main objectives of crime analysis include:
    1. Extraction of crime patterns by analysis of available crime and criminal data
    2. Prediction of crime based on the spatial distribution of existing data and anticipation of the crime rate using different data mining techniques
    3. Detection of crime
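The paper itself performs the clustering in RapidMiner, but purely as an illustration, a comparable workflow in Python might look like the sketch below. The file name, column layout, and choice of three clusters are assumptions, not the paper’s actual setup.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical table: homicide counts per police force area and year (England & Wales style).
crime = pd.read_csv("homicide_by_force_area.csv", index_col="force_area")

X = StandardScaler().fit_transform(crime)   # put the yearly counts on one scale
crime["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Areas grouped together have similar homicide trends across the years.
print(crime.groupby("cluster").mean())
```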
  2. RELATED WORK
    Data mining in the study and analysis of criminology can be categorized into two main areas: crime control and crime suppression. De Bruin et al. [1] introduced a framework for crime trends that uses a new distance measure to compare all individuals based on their profiles and then clusters them accordingly. Manish Gupta et al. [2] highlight the existing systems used by the Indian police as e-governance initiatives and propose an interactive, query-based interface as a crime analysis tool to assist the police in their activities. The proposed interface is used to extract useful information from the vast crime database maintained by the National Crime Records Bureau (NCRB) and to find crime hot spots using crime data mining techniques such as clustering; its effectiveness has been illustrated on Indian crime records. Nazlena Mohamad Ali et al. [3] discuss the development of a Visual Interactive Malaysia Crime News Retrieval System (i-JEN) and describe its approach, user studies, system architecture, and future plans. Their main objectives were to construct crime-based events; investigate the use of crime-based events in improving classification and clustering; develop an interactive crime news retrieval system; visualize crime news in an effective and interactive way; integrate these into a usable and robust system; and evaluate usability and system performance. The study contributes to a better understanding of crime data consumption in the Malaysian context, and the developed system, with its visualization features, addresses crime data with the eventual goal of combating crime. Sutapat Thiprungsri [4] examines the application of cluster analysis in the accounting domain, particularly discrepancy detection in audits. The purpose of that study is to examine the use of clustering technology to automate fraud filtering during an audit; cluster analysis is used to help auditors focus their efforts when evaluating group life insurance claims.
    A. Malathi et al. [5] look at the use of missing-value handling and clustering algorithms in a data mining approach to help predict crime patterns and speed up the process of solving crimes. Malathi A. et al. [6] used a clustering- and classification-based model to anticipate crime trends; data mining techniques were applied to analyze city crime data from a police department, and the results could potentially be used to lessen and even prevent crime in the forthcoming years. Dr. S. Santhosh Baboo and Malathi A. [7] focused on developing a crime analysis tool for the Indian scenario using different data mining techniques that can help the law enforcement department handle crime investigations efficiently; the proposed tool enables agencies to easily and economically clean, characterize, and analyze crime data to identify actionable patterns and trends. Kadhim B. Swadi Al-Janabi [8] presents a proposed framework for crime and criminal data analysis and detection that uses decision tree algorithms for data classification and the simple k-means algorithm for data clustering; the paper aims to help specialists discover patterns and trends, make forecasts, find relationships and possible explanations, map criminal networks, and identify possible suspects. Aravindan Mahendiran et al. [9] apply a myriad of tools to crime data sets to mine information that is hidden from human perception; with the help of state-of-the-art visualization techniques, they present the patterns discovered by their algorithms in a neat and intuitive way that enables law enforcement departments to channel their resources accordingly. Sutapat Thiprungsri [10] examines the possibility of using clustering technology for auditing; automating fraud filtering can be of great value to continuous audits, and the objective of that study is to examine the use of cluster analysis as an alternative and innovative anomaly-detection technique in the wire transfer system. K. Zakir Hussain et al. [11] tried to capture years of human experience in computer models via data mining and by designing a simulation model.

Reference: https://www.researchgate.net/publication/269667894_Crime_Analysis_using_K-Means_Clustering
