Clustering
Business Applications¶
- E-commerce : Group similar customers based on their purchasing behaviour, spending patterns, products used, geographical location, etc.
- Image Segmentation : Grouping/clustering similar pixels
- Manual labeling : Cluster a large dataset into a small number of clusters and label just the clusters instead of every point
Dunn Index¶
- A good clustering has small intra-cluster distances and large inter-cluster distances
- For \(k\) clusters \(\{C_i\}\) with \(d(i,j)\) as the inter-cluster distance between \(C_i\) & \(C_j\) and \(d'(i)\) as the intra-cluster distance of cluster \(C_i\), a larger Dunn Index means better clustering
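Putting these definitions together, the index is the ratio of the smallest inter-cluster distance to the largest intra-cluster distance:

\(DI = \frac{\min_{1 \le i < j \le k} d(i,j)}{\max_{1 \le m \le k} d'(m)}\)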
Density Based Clustering¶
- Cluster points give rise to dense regions
- Noise points give rise to sparse regions
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)¶
- Density at point `p` is the number of points within a hypersphere of radius `Eps` around `p`
- Dense region : a hypersphere of radius `Eps` that contains at least `Min Points` number of points
- Core point : `p` is a core point if it has >= `Min Points` points within an `Eps` radius around it
- Border point : `p` is not a core point but belongs to the neighbourhood of a core point `q`, i.e. `dist(p,q) <= Eps` and `q` is core
- Noise point : `p` is a noise point if it is neither a core nor a border point
- Density edge : if `p` & `q` are core points and `dist(p,q) <= Eps`, then the edge `p-q` is a density edge
- Density connected points : if there exists a path formed by density edges from a point `p` to `q`, then `p` and `q` are density connected
DBSCAN Algorithm¶
- \(\forall \ x_i \in D\) label all points as core, border or noise points using range queries built on a KD-Tree
- Remove all noise points from the data
- For each core point `p` not assigned to a cluster
    - Create a new cluster with point `p`
    - Add all the points that are density connected to `p` into this cluster
- For each border point, assign it to its nearest core point's cluster
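As a concrete sketch, scikit-learn's `DBSCAN` implements exactly this core/border/noise procedure; the `eps` and `min_samples` values below are illustrative assumptions, not tuned for real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense blobs plus a few sparse noise points
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),   # dense cluster 1
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),   # dense cluster 2
    rng.uniform(low=-2, high=7, size=(10, 2)),      # sparse noise
])

# eps ~ `Eps`, min_samples ~ `Min Points` from the notes above
db = DBSCAN(eps=0.5, min_samples=4).fit(X)

# labels_ holds a cluster id per point; -1 marks noise points
print("clusters found:", set(db.labels_) - {-1})
print("noise points :", np.sum(db.labels_ == -1))
```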
Hyperparameter Tuning¶
- `Min Points`
    - `Min Points` ~~ 2 * dimensionality
    - Choose a larger value of `Min Points` if the dataset has noisy points
- `Eps`
    - \(\forall \ x_i \in D\) calculate \(d_i\), i.e. the distance of \(x_i\) from its `Min Points`-th nearest neighbour
    - Sort \(d_i\) in increasing order and pick `Eps` as the \(d_i\) at the elbow of the plot (sketched below)
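A minimal sketch of this `Eps` heuristic, assuming `min_points` has already been chosen; `k_distance` is a hypothetical helper name, and scikit-learn's `NearestNeighbors` does the range queries:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance(X, min_points):
    """Distance of every point to its `min_points`-th nearest neighbour."""
    # min_points + 1 because each point is its own 0-th neighbour
    nn = NearestNeighbors(n_neighbors=min_points + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])  # sorted d_i, ready to plot

# Pick Eps by eye at the elbow of this curve, e.g. with matplotlib:
# plt.plot(k_distance(X, min_points=4)); plt.ylabel("d_i"); plt.show()
```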
Complexity¶
- Time Complexity : \(O(n \log n)\)
- Space Complexity : \(O(n)\)
K-Means¶
K-Means gives us k centroids, from which k clusters (sets of points) are derived
Optimisation Task¶
For \(D = \{x_1, x_2, ... , x_n \}\) find \(\{C_j\}_k\) and thereby \(\{S_j\}_k\) such that \(\forall_i \ x_i \in S_j\) and \(\forall_{i,j} \ S_i \cap S_j = \phi\), where \(C_i\) is the centroid of the i-th cluster of points \(S_i\) and is given by \(C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j\)

Objective function :

\(\underset{\{C_j\}_k}{\arg\min} \ \sum_{j=1}^{k} \sum_{x_i \in S_j} ||x_i - C_j||^2\)
Lloyd's Algorithm¶
Lloyd's algorithm is used to solve the above optimisation approximately, as optimising it exactly is NP-hard
- Initialization : Randomly pick k points from \(D\) and call them \(\{C_j\}_k\) i.e. \(C_1, C_2, ..., C_k\)
- Assignment : For each \(x_i\) in \(D\) select the nearest \(C_j\) by calculating \(dist(x_i, C_j) \ \ \forall \ j \in [1,k]\) and add \(x_i\) to \(S_j\)
- Recompute/Update Centroid : recalculate the centroids using \(C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j\)
- Repeat Assignment and Update until the change between the old and new centroids is negligible, at which point the algorithm has converged (see the sketch below)
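A minimal NumPy sketch of Lloyd's algorithm as described above; `lloyds` is my own naming, and it assumes Euclidean distance and that no cluster goes empty:

```python
import numpy as np

def lloyds(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random points from D as centroids
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each x_i goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its cluster
        # (assumes every cluster keeps at least one point)
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged once the centroids barely move
        if np.linalg.norm(new_C - C) < tol:
            break
        C = new_C
    return C, labels
```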
K-Means++¶
- Initialization Sensitivity : Final clusters depend on the initialisation done in the first step
- Repeat K-Means with different initializations and pick the best clustering based on smaller intra-cluster and larger inter-cluster distances
- Smart Initialization : K-Means++
- Pick the first centroid \(C_1\) randomly from \(D\)
- Create a distribution \(\forall \ x_i \in D\) given by \(d_i = dist(x_i, \text{nearest centroid})^2\)
- Pick a point from \(D - \{C_1\}\) with probability proportional to \(d_i\) and annotate it as \(C_2\); repeat this distribution-and-pick step until all k centroids are chosen
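In practice scikit-learn's `KMeans` uses this smart initialisation by default (`init='k-means++'`) and also handles the repeated-initialisation trick via `n_init`; a usage sketch, assuming `X` is an (n, d) NumPy array:

```python
from sklearn.cluster import KMeans

# n_init re-runs the whole algorithm with fresh seeds and keeps the
# best result by inertia (the intra-cluster squared-distance loss)
km = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
print(km.cluster_centers_, km.inertia_)
```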
Limitations¶
- A small number of outliers can affect the results of K-Means and K-Means++ hugely
- K-Means has problems when clusters are of different sizes, because it tends to create clusters of similar sizes
- K-Means has problems when clusters are of different densities, because it tends to create clusters of similar densities
- K-Means has problems with non-globular (non-convex) clusters
- A good hack to work around these limitations is to increase k and then combine similar clusters
K-Medoids¶
Instead of giving a mean value as the centroid, which may or may not be a point in the dataset, we can use an actual (and hence interpretable) data point as the cluster centre, called a medoid.
Partitioning around medoids (PAM)¶
- Initialization : Same as K-means++
- Assignment : Closest Medoid (Same as K-means)
- Recompute / Update : Swap each medoid with a non-medoid point; if the loss decreases, keep the swap, else undo it. If a swap is kept, redo the assignment step
- \(Loss = \sum_{j=1}^k \sum_{x_i \in S_j} ||x_i - m_j||^2\)
- The distance metric can be replaced with a kernel, which is again a benefit over K-means
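A brute-force sketch of PAM's swap step under the squared-distance loss above; the function names are my own, not a library API, and a kernelised distance could replace the Euclidean one:

```python
import numpy as np

def pam_loss(X, medoid_idx):
    """Sum of squared distances of every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def pam_swap(X, medoid_idx):
    """One pass of PAM: keep any medoid/non-medoid swap that lowers the loss."""
    medoid_idx = list(medoid_idx)
    best = pam_loss(X, medoid_idx)
    for i in range(len(medoid_idx)):   # try replacing each medoid...
        for p in range(len(X)):        # ...with each non-medoid point
            if p in medoid_idx:
                continue
            candidate = medoid_idx[:i] + [p] + medoid_idx[i + 1:]
            loss = pam_loss(X, candidate)
            if loss < best:            # keep the swap only if the loss drops
                medoid_idx, best = candidate, loss
    return medoid_idx, best
```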
Determining the right K¶
- Elbow Method / Knee Method : Calculate the loss for different values of k and choose the k after which the decrease in loss is insignificant (the elbow point, as sketched below)
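A sketch of the elbow heuristic with scikit-learn's `KMeans`, whose `inertia_` attribute is exactly the loss above; the candidate range of k is an arbitrary assumption:

```python
from sklearn.cluster import KMeans

# Loss (inertia) for each candidate k; plot it and pick the elbow by eye
losses = {k: KMeans(n_clusters=k, n_init=10).fit(X).inertia_
          for k in range(1, 11)}
# e.g. plt.plot(list(losses.keys()), list(losses.values()))
```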
Time Complexity¶
Training Time complexity : \(O(nkdi)\)
- n = number of points
- k = number of clusters
- d = dimensionality of the data
- i = number of iterations
- Typically k and i are small, so it is ~~ \(O(nd)\)
Hierarchical Clustering¶
Agglomerative & Divisive Clustering¶
- Agglomerative clustering starts with every point as its own cluster and merges these clusters in each iteration based on some notion of similarity or distance
- Divisive clustering is the reverse process : it starts with one big cluster and recursively divides it into smaller clusters
- Computation Steps (see the SciPy sketch after this list)
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat
- Merge the two closest clusters
- Update the proximity matrix
- Until only a single cluster remains
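These steps are packaged in SciPy's hierarchical-clustering routines; a minimal sketch, assuming `X` is an (n, d) array and picking the group-average linkage arbitrarily:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

D = pdist(X)                      # the proximity matrix (condensed form)
Z = linkage(D, method="average")  # merge closest clusters until one remains
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```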
Inter-Cluster Similarity or Proximity¶
- Single Link Agglomeration (Min Method)
- Min breaks when data has outliers
- Complete Link Agglomeration (Max Method)
- Max breaks when data has large clusters, different-size clusters or non-circular clusters
- Group Avg Method (Compromise between min and max)
- All Min, Max and Avg methods can be kernelised
- Centroid Method
- Ward's method is the same as the Group Avg method with similarity taken as squared distance
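The four linkage notions above map directly onto scikit-learn's `AgglomerativeClustering`; a comparison sketch, assuming `X` as before:

```python
from sklearn.cluster import AgglomerativeClustering

# 'single' ~ Min, 'complete' ~ Max, 'average' ~ Group Avg, 'ward' ~ Ward's method
for link in ("single", "complete", "average", "ward"):
    agg = AgglomerativeClustering(n_clusters=3, linkage=link).fit(X)
    print(link, agg.labels_[:10])
```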
Complexity¶
- Space complexity : \(O(n^2)\)
- Time complexity : \(O(n^3)\) to \(O(n^2 \log n)\)
- Both are very high and grow with n, which is why hierarchical clustering is not used very much on big data