(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

This is the second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement principal component analysis (PCA) using only the linear algebra available in R. Previously, we implemented linear regression and logistic regression from scratch, and next time we will deal with K nearest neighbors (KNN).

Principal components analysis

PCA is a dimensionality reduction method that seeks the vectors that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates that represents the points in a more appropriate basis. Keeping only a few of these coordinates is enough to explain a large part of the variance in the dataset.

The mathematics of PCA

Let $\mathbf{x}_1, \dots, \mathbf{x}_n$ be the observations of our dataset; the points are in $\mathbb{R}^p$. We assume that they are centered and of unit variance. We denote by $\mathbf{X}=(\mathbf{x}_1, \dots, \mathbf{x}_n)^T$ the matrix of observations.
Then $\mathbf{X}^T \mathbf{X}$ can be diagonalized and has real, non-negative eigenvalues (it is a symmetric positive semi-definite matrix).
We denote by $\lambda_1 \geq \dots \geq \lambda_p$ its ordered eigenvalues and by $u_1, \dots, u_p$ the associated eigenvectors. It can be shown that $\frac{\sum_{1 \leq i \leq k} \lambda_i}{\sum_{1 \leq i \leq p} \lambda_i}$ is the cumulative share of the variance explained by $u_1, \dots, u_k$.
It can also be shown that $u_1, \dots, u_k$ is the orthonormal family of size $k$ that explains the most variance.
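
As an illustration of the formula above, here is a minimal sketch in base R on the Iris measurements. The names `X`, `eig`, `explained`, `k` and `U` are only illustrative and do not come from the original post; the 90% threshold is an arbitrary choice for the example.

```r
# Center the data and scale each variable to unit variance
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = TRUE)

# X^T X is symmetric, so eigen() returns real eigenvalues in decreasing order
eig <- eigen(crossprod(X), symmetric = TRUE)

# Cumulative share of variance explained by the first k eigenvectors
explained <- cumsum(eig$values) / sum(eig$values)

# Smallest k whose cumulative share reaches the (arbitrary) 90% threshold
k <- which(explained >= 0.9)[1]

# The eigenvectors form an orthonormal basis: U^T U should be the identity
U <- eig$vectors
orthonormal <- isTRUE(all.equal(crossprod(U), diag(ncol(U))))
```

Printing `explained` shows how quickly the cumulative share approaches 1, which is what makes dropping the last coordinates reasonable.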

This is exactly what we wanted! We have a smaller basis that explains as much variance as possible!

PCA in R

The implementation in R has three steps:

  1. We center the data and divide each variable by its standard deviation. Our data now comply with the PCA assumptions.
  2. We diagonalise $\mathbf{X}^T \mathbf{X}$ and store the eigenvectors and eigenvalues.
  3. The cumulative variance is computed, and the number of eigenvectors k required to reach the variance threshold is stored. We only keep the first k eigenvectors.
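
  ## The function header, the assignments for the mean and the standard deviation,
  ## the normalisation step and the eigen() call are missing from this copy of the
  ## post; they are reconstructed from context (the 0.9 default is an assumption).
  my_pca <- function(x, variance_explained = 0.9, center = TRUE, scale = TRUE) {
    my_pca <- list()
    ##Compute the mean of each variable; otherwise, we set the mean to 0
    my_pca[['center']] <- if (center) colMeans(x) else rep(0, ncol(x))
    ##Compute the standard dev of each variable; otherwise, we set the sd to 1
    my_pca[['std']] <- if (scale) apply(x, 2, sd) else rep(1, ncol(x))
    ##Center and standardise the data
    x_std <- sweep(sweep(x, 2, my_pca[['center']], "-"), 2, my_pca[['std']], "/")

    ##Cov matrix (up to a constant factor)
    eigen_cov <- eigen(crossprod(x_std), symmetric = TRUE)
    ##Computing the cumulative variance
    my_pca[['cumulative_variance']] <- cumsum(eigen_cov[['values']])
    ##Number of required components (capped at the number of variables)
    my_pca[['n_components']] <- min(ncol(x), sum((my_pca[['cumulative_variance']] / sum(eigen_cov[['values']])) < variance_explained) + 1)
    ##Selection of the principal components
    my_pca[['transform']] <- eigen_cov[['vectors']][, 1:my_pca[['n_components']], drop = FALSE]
    attr(my_pca, "class") <- "my_pca"
    my_pca
  }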

Now that we have the transformation matrix, we can perform the projection on the new basis.


The predict function applies the change of basis formula and projects the observations onto the k principal components.
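
The code of the predict function is not present in this copy of the post, so the following is only a sketch of what such an S3 method could look like. The field names `center`, `std` and `transform`, as well as the hand-built `toy_pca` object, are assumptions made so that the example runs on its own.

```r
# Sketch of an S3 predict method for objects of class "my_pca".
# Assumes the object stores `center`, `std` and `transform` (hypothetical names).
predict.my_pca <- function(object, newdata, ...) {
  # Apply the same centering and scaling as at fit time
  x_std <- sweep(newdata, 2, object[["center"]], "-")
  x_std <- sweep(x_std, 2, object[["std"]], "/")
  # Change of basis, keeping only the k retained eigenvectors
  x_std %*% object[["transform"]]
}

# Hand-built toy object so that the example is self-contained
x <- as.matrix(iris[, 1:4])
x_scaled <- scale(x)
toy_pca <- structure(
  list(
    center = colMeans(x),
    std = apply(x, 2, sd),
    transform = eigen(crossprod(x_scaled), symmetric = TRUE)$vectors[, 1:2, drop = FALSE]
  ),
  class = "my_pca"
)

# predict() dispatches on the class attribute and calls predict.my_pca()
projected <- predict(toy_pca, x)
dim(projected)  # 150 rows, 2 columns
```

Defining the method as `predict.my_pca` lets the generic `predict()` be used on the fitted object, in the same way as for other R models.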

Plot the PCA projection

Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.

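pca1 <- my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected <- predict(pca1, as.matrix(iris[, 1:4]))  # reconstructed: this line is missing from this copy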
Projection of the Iris dataset on the two main principal components

Comparison with the FactoMineR implementation

We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.

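library(FactoMineR)
pca_stats <- PCA(as.matrix(iris[, 1:4]))
projected_stats <- pca_stats$ind$coord[, 1:2]  # reconstructed: the definition is missing from this copy
ggplot(data = iris) + geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) + xlab('PC1') + ylab('PC2') + ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')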

When running this, you should get a plot very similar to the previous one (the sign of a principal component is arbitrary, hence the minus sign on PC2). This is a good sanity check for our implementation.

Projection of the Iris dataset using the FactoMineR implementation
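
The same kind of sanity check can also be done numerically rather than visually. The sketch below is not from the original post: it swaps FactoMineR for `prcomp()` from base R, so that the example has no extra dependency, and compares the eigenvectors of $\mathbf{X}^T \mathbf{X}$ with the rotation matrix returned by `prcomp()`. The names `ours`, `reference` and `max_abs_diff` are illustrative.

```r
# Eigenvectors of X^T X on the centered and scaled Iris measurements
x <- scale(as.matrix(iris[, 1:4]))
ours <- eigen(crossprod(x), symmetric = TRUE)$vectors

# Reference principal axes from base R (used here in place of FactoMineR)
reference <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)$rotation

# Each principal axis is only defined up to its sign, so compare absolute values
max_abs_diff <- max(abs(abs(ours) - abs(unname(reference))))
max_abs_diff < 1e-8
```

If the two sets of axes agree up to sign, the projections can only differ by a reflection of some axes, which is what the plots suggest.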

Thanks for reading ! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.

The post Create your Machine Learning library from scratch with R ! (2/5) – PCA appeared first on Enhance Data Science.
