THE SQL Server Blog Spot on the Web

Welcome to SQLblog.com - The SQL Server blog spot on the web Sign in | |
in Search

Dejan Sarka

Data Mining Algorithms – Principal Component Analysis

Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It is often used to make data easy to explore and visualize. It is closely connected to eigenvectors and eigenvalues.

A short definition of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

Initially, variables used in the analysis form a multidimensional space, or matrix, of dimensionality m, if you use m variables. The following picture shows a two-dimensional space. Values of the variables v1 and v2 define cases in this 2-D space. Variability of the cases is spread across both source variables approximately equally.

image

Finding principal components means finding new m axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such way that the most of the variability of the cases is spread over a single new variable, or over a principal component, like shown in the following picture.

image

We can deconstruct the data points matrix into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. A eigenvector is a direction of the line and a eigenvalue is a number, telling how much variance there is in the data in that direction, or how spread out the data is on the line. he eigenvector with the highest eigenvalue is therefore the principal component. Here is an example of calculation of eigenvectors and eigenvalues for a simple two-dimensional matrix.

image

The interpretation of the principal components is up to you and might be pretty complex. This fact might limit PCA usability for business-oriented problems. PCA is used more in machine learning and statistics than in data mining, which is more end user oriented, and the results thus should be easy understandable. You use the PCA to:

  • Explore the data to explain the variability;
  • Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves the most of the variability;
  • Use the residual variability not explained by the PCs for anomaly detection.
Published Tuesday, June 2, 2015 12:15 PM by Dejan Sarka

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

olaf said:

very nice visuals!

August 31, 2015 3:38 PM
 

Dejan Sarka said:

Thanks:-)

September 1, 2015 10:10 AM
 

Alan said:

Does anyone know how to calculate PCA using PL/pgSQL in PostgreSQL DBMS, considering a covariance matrix of potentially high order?

Thanks. Cheers.

October 15, 2015 4:53 PM

Leave a Comment

(required) 
(required) 
Submit

About Dejan Sarka

Dejan Sarka, MCT and SQL Server MVP, is an independent consultant, trainer, and developer focusing on database & business intelligence applications. His specialties are advanced topics like data modeling, data mining, and data quality. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan Sarka is the main author or coauthor of fourteen books about databases and SQL Server. Dejan Sarka also developed and is developing many courses and seminars for SolidQ, Microsoft and Pluralsight. He is a regular speaker at many conferences worldwide for more than 15 years, including conferences like Microsoft TechEd, PASS Summit and others.

This Blog

Syndication

Privacy Statement