Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects.

There are a large number of clustering algorithms. The major methods can be classified into the following categories.

**Partitioning methods**. A partitioning method constructs K partitions of the data that satisfy two requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Given K, the number of partitions to construct, the method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. There are various criteria for judging the quality of the partitions. The most popular methods include the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.

**Hierarchical methods**. A hierarchical method creates a hierarchical decomposition of the given set of data objects. These methods are either agglomerative or divisive. The agglomerative (bottom-up) approach starts with each object forming a separate group and successively merges the objects or groups closest to one another, until all groups are merged into one. The divisive (top-down) approach starts with all the objects in a single cluster; in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds.

**Density-based methods**. Methods based on the distance between objects can find only spherical-shaped clusters and have difficulty discovering clusters of arbitrary shapes, so other methods have been developed based on the notion of density. The general idea is to keep growing a given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

**Model-based methods**. Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based technique might locate clusters by constructing a density function that reflects the spatial distribution of the data points. Unlike conventional clustering, which primarily identifies groups of like objects, this conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or a class.
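To make the partitioning idea concrete, here is a minimal sketch of k-means in plain Python — not any particular product's implementation, just the two steps described above: assign each object to the cluster whose mean is nearest, then relocate each mean. The sample points are invented for illustration.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Toy k-means: each cluster is represented by the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Relocation step: recompute each cluster's mean.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

# Two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = k_means(points, k=2)
```

A k-medoids variant would differ only in the relocation step: instead of the mean, it would pick the member object that minimizes the total distance to the other members of the cluster.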

Training a hierarchical clustering model typically starts by calculating a **distance matrix** – a matrix of distances between data points in a multidimensional hyperspace, where each input variable defines one dimension of that hyperspace. The distance measure can be a geometric distance or some other, more complex measure. A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are also often used in computational biology to illustrate the clustering of genes or samples. The following set of pictures shows the process of building an agglomerative hierarchical clustering dendrogram.
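The same merge sequence the pictures illustrate can be sketched in code. This toy example builds the distance matrix from geometric (Euclidean) distances and then performs single-linkage agglomerative clustering — one of several possible linkage choices — recording each merge a dendrogram would draw:

```python
from math import dist  # Euclidean distance, Python 3.8+

def distance_matrix(points):
    """Pairwise geometric distances; each input variable is one dimension."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def agglomerative_merges(points):
    """Single-linkage agglomerative clustering: start with each object in its
    own group, repeatedly merge the two closest groups until one remains.
    Returns the merge sequence a dendrogram would illustrate."""
    d = distance_matrix(points)
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        a, b = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(d[p][q]
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]),
        )
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Four points on a line: 0 and 1 merge first, then point 2 joins, then 3.
pts = [(0.0,), (0.1,), (1.0,), (5.0,)]
merges = agglomerative_merges(pts)
```

Note that once the distance matrix `d` is built, the original cases are no longer needed — the merging works purely on distances.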

Cluster analysis segments a heterogeneous population into a number of more homogenous subgroups or clusters. Typical usage scenarios include:

- Discovering distinct groups of customers
- Identifying groups of houses in a city
- In biology, deriving animal and plant taxonomies
- Making predictions once the clusters are built and the distribution of a target variable in the clusters is calculated

## Comments

## Greg Low said:

Hi Dejan,

Great stuff. Thanks for posting.

When calculating the distance between objects, how do they rate the value/importance of individual attributes?

## Dejan Sarka said:

Hierarchical clustering can use any kind of a distance; in fact, it does not need the original cases once the distance matrix is built. Therefore, you can use a distance that takes into account correlations, like the Mahalanobis distance (http://en.wikipedia.org/wiki/Mahalanobis_distance).
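For readers who want to see what a correlation-aware distance looks like, here is a minimal sketch of the Mahalanobis distance for the two-dimensional case, computed as sqrt((x − y)ᵀ · Σ⁻¹ · (x − y)) with a hand-inverted 2×2 covariance matrix (the covariance values are made up for illustration):

```python
from math import sqrt

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-D points given a 2x2 covariance
    matrix: sqrt((x - y)^T * inv(cov) * (x - y))."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - y[0], x[1] - y[1]]
    # Quadratic form (x - y)^T * inv(cov) * (x - y).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return sqrt(q)

# With the identity covariance it reduces to the Euclidean distance: 5.0.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_euclid = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity)

# With positively correlated variables, displacement along the correlation
# direction counts for less than the plain Euclidean distance would suggest.
correlated = [[1.0, 0.8], [0.8, 1.0]]
d_corr = mahalanobis_2d((1.0, 1.0), (0.0, 0.0), correlated)
```

Plugging such a distance into the matrix from the earlier example is all hierarchical clustering needs — the algorithm itself is unchanged.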

MS supports K-Means and Expectation-Maximization clustering (I will describe them in a future blog post). Algorithm parameters include the MAXIMUM_INPUT_ATTRIBUTES parameter. When you have more attributes than this parameter specifies, feature selection is invoked. In Clustering, MS uses the entropy-based interestingness score to select only the most interesting attributes (https://msdn.microsoft.com/en-us/library/ms175382.aspx).