



Principal Component Analysis: Finding the Underlying Dimensionality of Data
Maricella Foster-Molina '07, Swarthmore College, Department of Mathematics & Statistics

Introduction

Principal component analysis is the study of the underlying dimensionality of data sets. Data sets often have many variables, many of which are highly correlated with each other. The principal components of a data set effectively form a new set of variables for the data. This new set is usually dramatically smaller than the original set of variables, reducing the complexity of the analysis. The new variables have very low correlations with each other, and they often have interesting interpretations that indicate the main factors describing the variability of the data.

Anatomical Measurements

One of the classical examples of principal component analysis uses a data set of body measurements. The example I will demonstrate contains the measurements of 28 students, 15 male and 13 female. Seven measurements were taken: hand, wrist, height, forearm, head, chest, and waist. Using principal component analysis, these seven variables were reduced to three principal components that explained 86.5% of the variation in the data set for women and 87% of the variation for men.

The first principal component consisted of positive coefficients for all seven variables and accounted for 53.2% and 59.6% of the variation for males and females, respectively; thus, the greatest variability among the subjects was in their overall size. The second principal component is more interesting to interpret. The coefficients for males and females were very similar: large and positive for the hand and wrist measurements, but large and negative for height. This indicates that the second largest variation in body measurements was in hand and wrist measurements relative to height. More explicitly, a significant number of people had either small hand and wrist measurements relative to their height, or large hand and wrist measurements relative to their height. The third principal component had distinct traits for men and women. In women, the third greatest variation was in head versus forearm measurements; in men, it was in head and wrist versus hand and forearm measurements. Overall, these three principal components account for over 85% of the variation present in this small data set.

Derivation of PCs

Let Σ denote the covariance matrix of a vector of random variables x, and let α_k denote the eigenvector corresponding to the k-th largest eigenvalue of Σ. For the purposes of this derivation, we impose the constraint

α_k^T α_k = 1.

Then the k-th principal component of the data is given by

z_k = α_k^T x.

I will prove this only for the first principal component, z_1 = α_1^T x; the proof can be extended to all k principal components.

Proof (the form of the first principal component): To find the first principal component, we want to maximize var(z_1). Consider the vector of random variables x and a vector α_1, and let z_1 = α_1^T x. Then

var(z_1) = var(α_1^T x) = α_1^T Σ α_1.

To maximize this expression subject to the constraint α_1^T α_1 = 1, we use the technique of Lagrange multipliers. Let λ be a Lagrange multiplier; the expression to be maximized becomes

α_1^T Σ α_1 − λ(α_1^T α_1 − 1).

Differentiating with respect to α_1 and setting the result to zero gives

Σ α_1 − λ α_1 = 0, that is, Σ α_1 = λ α_1.

Thus λ is an eigenvalue of Σ and α_1 is the corresponding eigenvector. Using these facts, we can show that

var(z_1) = α_1^T Σ α_1 = α_1^T λ α_1 = λ α_1^T α_1 = λ.

We are maximizing this variance, so λ must be as large as possible; thus α_1 is the eigenvector corresponding to the largest eigenvalue of Σ. This is the desired form of the principal component z_1.
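The derivation can be checked numerically. The sketch below uses numpy on synthetic data shaped like the anatomical example (28 subjects, 7 measurements); the actual Swarthmore measurements are not reproduced here, so the numbers are illustrative only. It eigendecomposes the sample covariance matrix and confirms that the variance of z_1 = α_1^T x equals the largest eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the 28 x 7 data set (hand, wrist, height,
    # forearm, head, chest, waist); a latent "overall size" factor makes
    # the seven columns positively correlated, as in the real data.
    n, p = 28, 7
    size = rng.normal(0.0, 1.0, n)
    X = np.outer(size, rng.uniform(0.5, 1.5, p)) + rng.normal(0.0, 0.4, (n, p))

    S = np.cov(X, rowvar=False)            # sample covariance matrix Sigma (p x p)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]      # sort so the largest eigenvalue is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    alpha1 = eigvecs[:, 0]                 # unit eigenvector: alpha_1^T alpha_1 = 1
    z1 = (X - X.mean(axis=0)) @ alpha1     # scores of the first principal component

    print(np.var(z1, ddof=1), eigvals[0])  # var(z_1) equals the largest eigenvalue
    print(eigvals[0] / eigvals.sum())      # fraction of variance explained by z_1

Up to floating-point error the two printed variances agree, which is exactly the identity var(z_1) = α_1^T Σ α_1 = λ derived above.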
Derivation Continued

The form of the second principal component: The second principal component again maximizes a variance, var(z_2) = var(α_2^T x), for some vector α_2. This time there is an additional constraint: α_2^T x must be uncorrelated with α_1^T x, that is,

cov(α_1^T x, α_2^T x) = 0.

I will not provide the derivation here; suffice it to say that α_2 is in fact the eigenvector corresponding to the second largest eigenvalue of Σ.

The form of the rest of the principal components: The remaining principal components follow this pattern, but have increasingly complex proofs as additional constraints are added. Thus, we have shown that the k-th principal component of a vector of random variables x is

z_k = α_k^T x,

where α_k is the eigenvector corresponding to the k-th largest eigenvalue of the covariance matrix.

Conclusion

The derivation of principal components shows that each principal component is chosen to maximize the variance. Thus, one can create a relatively small set of variables from the principal components that contains a large amount of the variation present in the data. In summary, principal component analysis provides a rigorous technique with which to:

- reduce the dimensionality of a data set,
- create variables that are uncorrelated,
- retain most of the variation present in the data (between 70% and 99%), and
- provide interesting interpretations of the 'true' variation in the data.

Political Extension: Congressional Dimensionality

Keith Poole and Howard Rosenthal did not use the method of principal components, but they developed a method that effectively finds the principal components of congressional voting patterns. Their most interesting finding was that congressional voting is invariably one- or two-dimensional; that is, voting can principally be explained by only two factors. Figure 1 shows a diagram of the 1971 Senate.

Fig. 1. Senators in two-dimensional issue space in 1971. The two dimensions shown are effectively the principal components gathered from congressional voting records: Republican/conservative versus Democratic/liberal, and civil rights (the Voting Rights Act, school segregation issues, etc.).

Acknowledgements

I want to thank Professor Stromquist and Dr. Leslie Foster for their help with this project.

References

Lay, David C. Linear Algebra and Its Applications, 3rd ed. Boston, MA: Pearson Education, Inc., 2003.

Jolliffe, Ian T. Principal Component Analysis. Secaucus, NJ: Springer-Verlag New York, Inc., 2002.

Poole, Keith T., and Howard Rosenthal. Congress: A Political-Economic History of Roll Call Voting. Oxford, England: Oxford University Press, 1997.

Poole, Keith T., and Howard Rosenthal's software VoteView.

For more information see: http://mathstat.swarthmore.edu/webspot
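As a numerical postscript to the derivation and the conclusion: the sketch below, again using numpy on synthetic data, illustrates the two claims made above, namely that the principal component scores are mutually uncorrelated and that a few components can retain most of the total variance. The 90% retention threshold is an arbitrary choice for illustration, not a figure from the text.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic correlated data; any centered data matrix would do here.
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    Xc = X - X.mean(axis=0)

    S = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    Z = Xc @ eigvecs                                # all principal component scores
    # cov(Z) comes out diagonal: the new variables are uncorrelated,
    # with the eigenvalues on the diagonal.
    print(np.round(np.cov(Z, rowvar=False), 6))

    # Cumulative fraction of variance retained by the first k components.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, 0.90)) + 1  # smallest k retaining 90%
    print(cumulative)
    print(f"{k} of {len(eigvals)} components retain at least 90% of the variance")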

