A Guide to Principal Component Analysis (PCA) for Machine Learning
Learn more about PCA for machine learning in this short guide.
Principal Component Analysis (PCA) is one of the most commonly used unsupervised machine learning algorithms across a variety of applications: exploratory data analysis, dimensionality reduction, information compression, data de-noising, and plenty more!
The intuition behind PCA
Let’s get a better understanding of PCA before we delve into its inner workings. Imagine we have a 2-dimensional dataset. Each dimension can be represented as a feature column:
We can represent the same dataset as a scatterplot:
The main aim of PCA is to find a new set of axes, called principal components, that describe the data points more compactly than the original features do.
The principal components are vectors, but they are not chosen at random. The first principal component is computed so that it explains the greatest amount of variance in the original features. The second component is orthogonal to the first, and it explains the greatest amount of variance left after the first principal component.
The original data can be represented as feature vectors. PCA allows us to go a step further and represent the data as linear combinations of principal components. Getting principal components is equivalent to a linear transformation of data from the feature1 x feature2 axis to a PCA1 x PCA2 axis.
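To make the change of axes concrete, here is a minimal sketch using scikit-learn on a synthetic 2-dimensional dataset. The data, the random seed, and the variable names (`feature1`, `feature2`, `scores`) are assumptions made for illustration; they are not taken from the dataset pictured in this guide.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2-dimensional dataset with two correlated features.
rng = np.random.default_rng(0)
feature1 = rng.normal(size=200)
feature2 = 0.8 * feature1 + rng.normal(scale=0.3, size=200)
X = np.column_stack([feature1, feature2])

# Fit PCA and express every data point on the PCA1 x PCA2 axes.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# The two principal components are orthogonal unit vectors.
pc1, pc2 = pca.components_
dot = float(np.dot(pc1, pc2))

# The transformation is linear: center the data, then project it
# onto the principal components.
manual_scores = (X - pca.mean_) @ pca.components_.T

print("PCA1 direction:", pc1)
print("PCA2 direction:", pc2)
print("Share of variance per component:", pca.explained_variance_ratio_)
```

The dot product of the two components is numerically zero, and the first component accounts for a larger share of the variance than the second, which matches the description above.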
Why is this useful?
In the small 2-dimensional example above, we do not gain much by using PCA, since a vector of the form (first principal component (PCA1), second principal component (PCA2)) has just as many entries as a feature vector of the form (feature1, feature2). But in very large datasets (where the number of dimensions can surpass 100 different variables), PCA reduces a large number of features to just a couple of principal components, which also discards much of the noise. Keeping only the first few components amounts to an orthogonal projection of the data onto a lower-dimensional space.
In theory, PCA produces the same number of principal components as there are features in the training dataset. In practice, though, we do not keep all of the principal components. Each successive principal component explains the variance that is left after its preceding component, so picking just a few of the first components sufficiently approximates the original dataset without the need for additional features.
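The sketch below illustrates this on synthetic data. It assumes a hypothetical dataset of 10 features that are noisy mixtures of 2 hidden factors; the sizes and the noise level are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 observed features driven by 2 hidden factors.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Without an explicit n_components, PCA keeps every component.
pca = PCA().fit(X)
n_components = pca.n_components_

# Share of variance explained by each component, in descending order.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print("Number of components:", n_components)
print("Cumulative share of variance:", np.round(cumulative, 3))
```

PCA returns as many components as there are features, but the cumulative share of variance flattens out after the first two, so the remaining components add very little.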
The result is a new set of features in the form of principal components, which have multiple practical applications.
On its own, PCA is used across a variety of use cases:
Visualize multidimensional data. Data visualizations are a great tool for communicating multidimensional data as 2- or 3-dimensional plots.
Compress information. Principal Component Analysis is used to compress information to store and transmit data more efficiently. For example, it can be used to compress images without losing too much quality, or in signal processing. The technique has successfully been applied across a wide range of compression problems in pattern recognition (specifically face recognition), image recognition, and more.
Simplify complex business decisions. PCA has been employed to simplify traditionally complex business decisions. For example, traders use over 300 financial instruments to manage portfolios. The algorithm has proven successful in the risk management of interest rate derivative portfolios, lowering the number of financial instruments from more than 300 to just 3-4 principal components.
Clarify convoluted scientific processes. The algorithm has been applied extensively to understand the convoluted and multidirectional factors that increase the probability of neural ensembles triggering action potentials.
When PCA is used as part of preprocessing, the algorithm is applied to:
Reduce the number of dimensions in the training dataset.
De-noise the data. Because PCA is computed by finding the components which explain the greatest amount of variance, the leading components tend to capture the signal in the data, while low-variance noise is left in the components we discard.
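As a rough illustration of de-noising, the sketch below builds a hypothetical 20-dimensional signal that varies along a single direction, adds random noise, and reconstructs the data from the first principal component only. The data and the noise level are assumptions, and the approach only works this cleanly when the noise has much lower variance than the signal.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical clean signal: 300 points along one direction in 20 dimensions.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=(300, 1))
direction = rng.normal(size=(1, 20))
clean = t @ direction

# Add low-variance noise to every feature.
noisy = clean + 0.05 * rng.normal(size=clean.shape)

# Keep only the first principal component, then map back to feature space.
pca = PCA(n_components=1).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

# Mean squared error against the clean signal, before and after.
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print("Error before:", err_noisy, "after:", err_denoised)
```

The reconstruction from a single component sits closer to the clean signal than the noisy input does, because most of the noise lives in the 19 discarded directions.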
Let's take a look at how Principal Component Analysis is computed.
3. How is PCA calculated?
There are multiple ways to calculate PCA:
Eigendecomposition of the covariance matrix
Singular value decomposition of the data matrix
Eigenvalue approximation via power iteration
Non-linear iterative partial least squares (NIPALS) computation
… and more.
Let’s take a closer look at the first method - eigendecomposition of the covariance matrix - to gain a deeper appreciation of PCA. There are several steps in computing PCA:
Feature standardization. We standardize each feature to have a mean of 0 and a variance of 1. As we explain later in assumptions and limitations, features with values that are on different orders of magnitude prevent PCA from computing the best principal components.
Compute the covariance matrix. The covariance matrix is a square matrix, of d x d dimensions, where d stands for “dimension” (or feature or column, if our data is tabular). It holds the covariance between every pair of features, which for standardized features equals their pairwise correlation.
Calculate the eigendecomposition of the covariance matrix. We calculate the eigenvectors (unit vectors) and their associated eigenvalues (scalars by which we multiply the eigenvector) of the covariance matrix. If you want to brush up on your linear algebra, this is a good resource to refresh your knowledge of eigendecomposition.
Sort the eigenvectors from the highest eigenvalue to the lowest. The eigenvector with the highest eigenvalue is the first principal component. Higher eigenvalues correspond to greater amounts of shared variance explained.
Select the number of principal components. Select the top N eigenvectors (based on their eigenvalues) to become the N principal components. The optimal number of principal components is both subjective and problem-dependent. Usually, we look at the cumulative amount of shared variance explained by the combination of principal components and pick the smallest number of components that still explains a sufficiently large share of the variance.
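The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on hypothetical random data, not a production implementation, and the 95% cumulative variance threshold is an assumption chosen for the example.

```python
import numpy as np

# Hypothetical dataset: 200 rows, 4 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Step 1: standardize each feature to mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: compute the d x d covariance matrix (features are columns).
cov = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition; eigh is meant for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort the eigenvectors from the highest eigenvalue to the lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: keep the fewest components that reach the assumed 95% threshold.
explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
components = eigenvectors[:, :n_keep]

# Project the standardized data onto the selected principal components.
scores = X_std @ components
print("Components kept:", n_keep, "of", X.shape[1])
```

The variance of each column in `scores` equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by the amount of variance they explain.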
Keep in mind that the majority of data scientists will not calculate PCA by hand, but rather implement it in Python with scikit-learn, or use R to compute it. These mathematical foundations enrich our understanding of PCA but are not necessary for its implementation. Understanding PCA allows us to have a better idea of its advantages and disadvantages.
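For comparison, here is a minimal sketch of the same workflow in scikit-learn. The choice of the bundled Iris dataset and the 95% variance target are assumptions made for illustration. Passing a float between 0 and 1 as `n_components` asks PCA to keep the fewest components that reach that share of explained variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small example dataset bundled with scikit-learn: 150 rows, 4 features.
X = load_iris().data

# Standardize first, then keep enough components for 95% of the variance.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = pipeline.fit_transform(X)

# Inspect the fitted PCA step of the pipeline.
pca = pipeline.named_steps["pca"]
print("Components kept:", pca.n_components_)
print("Share of variance per component:", pca.explained_variance_ratio_)
```

Wrapping both steps in a pipeline means the scaling learned on the training data is reused whenever new data is transformed.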
4. What are the advantages and disadvantages of PCA?
PCA offers multiple benefits, but it also suffers from certain shortcomings.
Advantages of PCA:
Easy to compute. PCA is based on linear algebra, which computers solve efficiently.
Speeds up other machine learning algorithms. Machine learning algorithms often converge faster when trained on a handful of principal components instead of the original dataset.
Counteracts the issues of high-dimensional data. High-dimensional data causes regression-based algorithms to overfit easily. By using PCA beforehand to lower the dimensions of the training dataset, we reduce the risk of the predictive algorithms overfitting.
Disadvantages of PCA:
Low interpretability of principal components. Principal components are linear combinations of the features from the original data, but they are not as easy to interpret. For example, it is difficult to tell which are the most important features in the dataset after computing principal components.
The trade-off between information loss and dimensionality reduction. Although dimensionality reduction is useful, it comes at a cost. Information loss is a necessary part of PCA. Balancing the trade-off between dimensionality reduction and information loss is unfortunately a necessary compromise that we have to make when using PCA.
5. What are the assumptions and limitations of PCA?
PCA is related to the set of operations in the Pearson correlation, so it inherits similar assumptions and limitations:
PCA assumes a correlation between features. If the features (or dimensions or columns, in tabular data) are not correlated, PCA will still return principal components, but they will not summarize the data any better than the original features do.
PCA is sensitive to the scale of the features. Imagine we have two features - one takes values between 0 and 1000, while the other takes values between 0 and 1. PCA will be extremely biased towards the first feature being the first principal component, regardless of the actual maximum variance within the data. This is why it’s so important to standardize the values first.
PCA is not robust against outliers. Similar to the point above, the algorithm will be biased in datasets with strong outliers. This is why it is recommended to remove outliers before performing PCA.
PCA assumes a linear relationship between features. The algorithm is not well suited to capturing non-linear relationships. That’s why it’s advised to turn non-linear features or relationships between features into linear ones, using standard methods such as log transforms.
Technical implementations often assume no missing values. When computing PCA using statistical software tools, they often assume that the feature set has no missing values (no empty rows). Be sure to remove those rows and/or columns with missing values, or impute missing values with a close approximation (e.g. the mean of the column).
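The sensitivity to scale described in the list above is easy to check on synthetic data. The sketch below assumes two hypothetical correlated features, one ranging from 0 to 1000 and one from 0 to 1, and compares the first principal component with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical features: "small" lies in [0, 1], "large" in [0, 1000],
# and the two are correlated by construction.
rng = np.random.default_rng(0)
small = rng.uniform(0, 1, size=500)
large = 1000 * (0.5 * small + 0.5 * rng.uniform(0, 1, size=500))
X = np.column_stack([large, small])

# Without standardization, the first component follows the large feature.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]

# After standardization, both features contribute equally.
X_std = StandardScaler().fit_transform(X)
std_pc1 = PCA(n_components=1).fit(X_std).components_[0]

print("Loadings (large, small) on raw data:         ", raw_pc1)
print("Loadings (large, small) on standardized data:", std_pc1)
```

On the raw data, the loading on the large feature is close to 1 in absolute value, so the first component is almost entirely that feature. Once both features are standardized, the two loadings have the same magnitude.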
6. PCA in practice
The way in which you use PCA in practice depends on how much you know about the entire data science process.
We recommend that beginners start by modeling data on datasets that have already been collected and cleaned, while experienced data scientists can scale their operations by choosing the right software for the task at hand.
6.1 Beginner projects to try out Principal Component Analysis
Countless high-dimensional datasets can be used to try out PCA in practice. Among the best ones are:
Keboola offers a platform for data scientists who want to build their own machine learning models. It comes with one-click deployed Jupyter Notebooks, through which all of the modeling can be done using Julia, R, or Python.
Deep dive into the data science process with Keboola:
Collect the relevant data.
Explore and clean the data to discover patterns.
Preprocess the data with PCA.
Train your machine learning model.
Evaluate the model with a variety of metrics.
Want to take it a step further? Keboola can help you to operationalize your entire data operations pipeline.
Being a data-centric platform, Keboola also allows you to build your own ETL pipelines and orchestrate tasks to get your data ready for machine learning algorithms. You can deploy multiple models with different algorithms to version your work and compare which ones perform best. Start building models today by creating a free account.