Divisive Hierarchical Clustering with K-means

Clustering is an important analysis tool in many fields, such as pattern recognition, image classification, the biological sciences, marketing, city planning, and document retrieval. Divisive hierarchical clustering is one of the most widely used clustering methods, and divisive hierarchical clustering with k-means is among the more efficient variants. In this method, a cluster is repeatedly split into k smaller clusters using k-means clustering until every element has its own cluster. The initial centers for each k-means pass are chosen deterministically rather than at random: the m-dimensional data is first converted into one-dimensional data, the one-dimensional data is divided into k parts, each part is sorted, and the id of the middle element of each part is recorded. The m-dimensional elements whose ids match these middle-element ids are taken as the k initial centroids.
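A minimal sketch of this fixed initialization is given below. The text does not say how the m-dimensional data is reduced to one dimension, so the sketch assumes the per-point feature mean; `choose_initial_centroids` is an illustrative name, not a function from any library.

```python
import numpy as np

def choose_initial_centroids(X, k):
    """Pick k initial centroids deterministically, as described above.

    X : (n, m) array of n points in m dimensions.
    Returns a (k, m) array of initial centroids.
    """
    # Reduce each m-dimensional point to one value (assumption: feature mean).
    one_d = X.mean(axis=1)
    # Divide the ids of the 1-D data into k contiguous parts.
    parts = np.array_split(np.arange(len(X)), k)
    centroids = []
    for part in parts:
        # Sort each part by its 1-D value and take the middle element's id.
        order = part[np.argsort(one_d[part])]
        mid_id = order[len(order) // 2]
        # The m-dimensional point with that id becomes an initial centroid.
        centroids.append(X[mid_id])
    return np.array(centroids, dtype=float)
```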

K-Means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows an easy and simple way to partition a given data set into a certain number of clusters (say, k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations lead to different results; the better choice is to place them as far away from each other as possible.

The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done. At this point one needs to recalculate k new centroids as the barycenters of the clusters resulting from the previous step. After acquiring these k new centroids, a new binding is built between the same data set points and the nearest new center, generating a loop. As a result of this loop, one may notice that the k centers change their location step by step until no more changes occur, in other words until the centers no longer move.
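As a rough illustration, one assign-and-update pass of this loop might look like the following NumPy sketch (variable names are illustrative):

```python
import numpy as np

def kmeans_iteration(X, centers):
    """One k-means pass: assign each point to its nearest center,
    then recompute each center as the barycenter of its cluster."""
    # Distance from every point to every center, shape (n, k).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # New centers: mean of the points assigned to each cluster
    # (keep the old center if a cluster received no points).
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers
```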

The goal is to implement the divisive hierarchical clustering algorithm with k-means and then apply agglomerative hierarchical clustering to the result, yielding efficient and accurate clusters. In this hierarchical clustering, the initial k centroids are found in a fixed manner instead of being chosen randomly: the one-dimensional data of a particular cluster is divided into k parts, each part is sorted separately, and the id of the middle element in each part is mapped to the id of the corresponding m-dimensional point. The m-dimensional elements whose ids match are taken as the initial k centroids of that cluster. Once the divisive phase leaves each element in its own cluster, agglomerative hierarchical clustering is applied to the result, merging clusters based on centroid distance until k clusters are finally obtained.

Divisive hierarchical clustering is one of the most important tasks in data mining; it works by grouping objects into a tree of clusters. This top-down strategy starts with all objects in one cluster and subdivides the clusters into smaller and smaller pieces with the k-means algorithm, choosing the initial k centroids in the fixed manner described above, until each object forms a cluster of its own. Applying agglomerative hierarchical clustering to that result then yields k clusters with high accuracy.
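The final agglomerative pass that merges clusters by centroid distance can be approximated with SciPy's centroid linkage, as in the hedged sketch below; cutting the tree at exactly k flat clusters is an assumption about where the merging stops.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerate_to_k(X, k):
    """Merge clusters bottom-up by centroid distance until k remain."""
    # 'centroid' linkage always merges the two clusters whose
    # centroids are closest in Euclidean distance.
    Z = linkage(X, method='centroid')
    # Cut the dendrogram so that exactly k flat clusters remain.
    return fcluster(Z, t=k, criterion='maxclust')
```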

Algorithmic steps for k-means clustering

Let X = {x1, x2, x3, ..., xn} be the given set of data points and let V = {v1, v2, ..., vc} be the set of cluster centers.

  1. Find the initial k centroids by the process described above.
  2. Calculate the Euclidean distance between each data point and the cluster centers.
  3. Assign each data point to the cluster center whose distance from it is least compared to all other cluster centers.
  4. Recalculate each new cluster center using vi = (1/ci) Σ xj, where ci is the number of data points in the i-th cluster and the sum runs over the points assigned to it.
  5. Recalculate the distance between each data point and the newly obtained cluster centers.
  6. If no data point was reassigned then stop; otherwise repeat from step 3 (a code sketch of these steps follows the list).
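Putting the steps together, a minimal k-means loop under these rules might read as follows. It reuses the `choose_initial_centroids` sketch from earlier; the convergence test is the "no reassignment" check of step 6.

```python
import numpy as np

def kmeans_fixed_init(X, k, max_iter=100):
    """K-means using the deterministic initial centroids described above."""
    centers = choose_initial_centroids(X, k)           # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 2-3: Euclidean distances, assign each point to nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no data point is reassigned.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 4-5: recompute each center as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```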

Algorithm

  1. Initially, all items belong to a single cluster Ci with i = 0.
  2. Split Ci into sub-clusters Ci+1 and Ci+2.
  3. Apply k-means on Ci+1 and Ci+2.
  4. Increment the value of i.
  5. Repeat steps 2, 3 and 4 until the desired cluster structure is obtained.
  6. Example: node 0 contains the whole data set.
  7. C1 = 2: nodes 1-2 (node 0 split into sub-groups 1 and 2).
  8. C2 = 3: nodes 2-4 (node 1 split into sub-groups 3 and 4).
  9. C3 = 4: nodes 3-6 (node 2 split into sub-groups 5 and 6).
  10. Continue until Ckmax is reached, where Ckmax is the maximum number of clusters (see the sketch after this list).
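A rough end-to-end driver for this top-down procedure is sketched below. It reuses `kmeans_fixed_init` from the previous sketch, and it assumes that each pass bisects the largest remaining cluster with 2-means until Ckmax clusters exist; the text does not specify which cluster is split at each level.

```python
import numpy as np

def divisive_clustering(X, k_max):
    """Top-down: start with one cluster, repeatedly bisect with 2-means."""
    clusters = [np.arange(len(X))]            # node 0: the whole data set
    while len(clusters) < k_max:
        # Choose the largest cluster to split next (an assumption).
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        if len(clusters[i]) < 2:
            break                             # nothing left to split
        idx = clusters.pop(i)
        labels, _ = kmeans_fixed_init(X[idx], 2)
        # The split produces two sub-clusters (Ci+1 and Ci+2).
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters
```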

Differences between K-Means and Hierarchical clustering

  • Hierarchical clustering can’t handle big data well, but K Means clustering can. This is because the time complexity of K Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
  • In K Means clustering, since we start with a random choice of centers, the results produced by running the algorithm multiple times may differ, whereas results are reproducible in hierarchical clustering.
  • K Means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
  • K Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
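The reproducibility point can be illustrated with a small scikit-learn comparison (a hedged sketch, assuming scikit-learn is available): random-init k-means runs can disagree with each other, while agglomerative clustering on the same data is deterministic.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Two k-means runs with different random initializations may disagree.
a = KMeans(n_clusters=3, init='random', n_init=1, random_state=1).fit_predict(X)
b = KMeans(n_clusters=3, init='random', n_init=1, random_state=2).fit_predict(X)

# Agglomerative clustering has no random component: identical every run.
h1 = AgglomerativeClustering(n_clusters=3).fit_predict(X)
h2 = AgglomerativeClustering(n_clusters=3).fit_predict(X)
assert (h1 == h2).all()
```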