Data Mining Algorithms – Principal Component Analysis
Dejan Sarka, Tue, 02 Jun 2015
<p> Principal component analysis (PCA) is a technique used to emphasize the majority of the variation and bring out strong patterns in a dataset. It is often used to make data easier to explore and visualize. It is closely connected to eigenvectors and eigenvalues. </p> <p><a href="http://en.wikipedia.org/wiki/Principal_component_analysis">A short definition</a> of the algorithm: PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. </p> <p>Initially, the variables used in the analysis form a multidimensional space, or matrix, of dimensionality m if you use m variables. The following picture shows a two-dimensional space. Values of the variables v1 and v2 define cases in this 2-D space. The variability of the cases is spread approximately equally across both source variables. 
</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_0E3B86EA.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_4FD5C336.png" width="644" height="436" /></a></p> <p>Finding principal components means finding m new axes, where m is exactly equal to the number of the source variables. However, these new axes are selected in such a way that most of the variability of the cases is spread over a single new variable, or principal component, as shown in the following picture.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_188F3BFB.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_55B2F780.png" width="644" height="436" /></a> </p> <p>We can deconstruct the matrix of data points into eigenvectors and eigenvalues. Every eigenvector has a corresponding eigenvalue. An eigenvector is a direction of a line, and an eigenvalue is a number that tells you how much variance there is in the data in that direction, or how spread out the data is along the line. The eigenvector with the highest eigenvalue is therefore the first principal component. Here is an example of calculating the eigenvectors and eigenvalues for a simple two-dimensional matrix.</p> <p><a href="http://sqlblog.com/blogs/dejan_sarka/image_19F5EF7E.png"><img title="image" style="border-top:0px;border-right:0px;border-bottom:0px;border-left:0px;display:inline;" border="0" alt="image" src="http://sqlblog.com/blogs/dejan_sarka/image_thumb_16E6223A.png" width="644" height="343" /></a> </p> <p>The interpretation of the principal components is up to you and might be pretty complex. This fact might limit the usability of PCA for business-oriented problems. 
PCA is used more in machine learning and statistics than in data mining, which is more end-user oriented and therefore requires results that are easily understandable. You use PCA to:</p> <ul> <li>Explore the data to explain the variability;</li> <li>Reduce the dimensionality – replace the m variables with n principal components, where n < m, in such a way that preserves most of the variability;</li> <li>Use the residual variability not explained by the principal components for anomaly detection.</li> </ul>
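<p>As a sketch of the computation described above, the following Python example (using NumPy, with hypothetical generated data standing in for the two variables v1 and v2 from the pictures) deconstructs the covariance matrix into eigenvectors and eigenvalues, projects the cases onto the new axes, and reports how much of the variability each principal component explains.</p>

```python
import numpy as np

# Hypothetical data: 100 cases described by two correlated variables,
# analogous to v1 and v2 in the pictures above.
rng = np.random.default_rng(42)
v1 = rng.normal(0.0, 1.0, 100)
v2 = 0.8 * v1 + rng.normal(0.0, 0.3, 100)
X = np.column_stack([v1, v2])

# Center the variables, then deconstruct the covariance matrix into
# eigenvectors (directions of the new axes) and eigenvalues (variance
# of the data along each of those directions).
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: cov is symmetric

# Sort descending: the eigenvector with the highest eigenvalue is the
# first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# The orthogonal transformation: project the cases onto the new axes.
scores = Xc @ eigenvectors

# Share of the total variability explained by each component.
explained = eigenvalues / eigenvalues.sum()
print(explained)
```

<p>With strongly correlated inputs like these, most of the variability loads on the first component, so keeping only the first column of the scores (n = 1 &lt; m = 2) preserves most of the information; libraries such as scikit-learn wrap this same decomposition in a ready-made PCA transformer.</p>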