How to Perform Regression with more Predictors than Observations
A common scenario in multiple linear regression is to have a large set of observations/examples wherein each example consists of a set of measurements made on a few independent variables, known as predictors, and the corresponding numeric value of the dependent variable, known as the response. These examples are then used to build a regression model of the following form:
\(y_i = \beta_0 + \beta_1 x_{i1} +\beta_2 x_{i2} + \cdots +\beta_d x_{id} + \epsilon_i\)
The equation states that the response \(y_i\) for the ith input vector or record is a function of d+1 constants (the \(\beta \text{s}\)), values of the d predictors for the ith observation, and an error \(\epsilon_i\). The term \(\beta_0\) is known as the bias/intercept. The model is often written in the matrix vector notation as
\(\bf y = \bf {X}\bf {\beta} + \boldsymbol{\epsilon}\text{, where}\)
\(\bf y\) is a column vector of N elements, \(\bf X\) is a \(N\times (d+1)\) matrix of N observations, and \(\boldsymbol{\epsilon}\) is a column vector of error terms. Note that I have absorbed the bias term in the coefficient vector \(\boldsymbol \beta\) and each record now consists of d+1 components with the first component being equal to 1.
The model parameters, \(\beta \text{s}\), are determined by the least squares method to minimize the residual sum of squares (RSS), a measure of difference between the actual and the predicted output. This leads to the following unique solution vector of regression coefficients when we have more observations than features
\(\tilde{\boldsymbol{\beta}} = (\bf X^{T}\bf X)^{-1} \bf X^{T}\bf y\)
The vector of predicted values is then given by the following expression
\(\tilde{\bf y} = \bf X \tilde{\boldsymbol \beta} = \bf X (\bf X^{T}\bf X)^{-1} \bf X^{T}\bf y\)
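To see the formula in action, here is a minimal R sketch on made-up data (the variable names are mine) that computes the OLS coefficients directly from the normal equations and checks them against R's lm().

```r
# OLS via the normal equations on made-up data, checked against lm()
set.seed(1)
N <- 10; d <- 3
X <- cbind(1, matrix(rnorm(N * d), N, d))      # N x (d+1) design matrix; first column of 1s is the intercept
beta_true <- c(2, 1, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(N, sd = 0.1))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
y_hat <- X %*% beta_hat                        # predicted values
cbind(beta_hat, coef(lm(y ~ X[, -1])))         # the two columns should agree
```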
This approach to regression using least squares is known as ordinary least squares (OLS) regression. Some regression applications have multiple outputs. In such cases, the term multivariate regression is used and the above model is expressed as
\(\bf Y = \bf X\bf B + \bf E\text{,}\)
where \(\bf Y\) is the \(N\times K\) response matrix assuming that we have K outputs. The matrix B is the coefficient matrix of size \((d+1)\times K\); each column of B matrix represents a set of regression coefficients for relating one of the outputs to inputs. The matrix E is the \(N\times K\) matrix of errors. The least squares estimates for regression coefficients in this case are given as
\(\tilde{\bf B} = (\bf X^{T}\bf X)^{-1} \bf X^{T}\bf Y\)
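For completeness, here is a minimal sketch (made-up data, my own variable names) showing that the same formula handles a matrix of responses, and that R's lm() accepts a matrix response and fits all K outputs at once.

```r
# Multivariate OLS: one coefficient column per response
set.seed(2)
N <- 10; d <- 3; K <- 2
X <- matrix(rnorm(N * d), N, d)
Y <- cbind(X %*% c(1, -1, 2) + rnorm(N, sd = 0.1),
           X %*% c(0.5, 2, -1) + rnorm(N, sd = 0.1))

Xc <- cbind(1, X)                                  # add the intercept column
B_hat <- solve(t(Xc) %*% Xc) %*% t(Xc) %*% Y       # (d+1) x K coefficient matrix
all.equal(unname(B_hat), unname(coef(lm(Y ~ X))))  # should be TRUE
```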
Fewer Observations and more Predictors
While OLS regression is a very popular method, it becomes unworkable when \(\bf X^T \bf X\) is singular and its inverse does not exist. This happens when the number of predictors, d, exceeds the number of observations, N. The OLS approach also breaks down when the predictors are highly correlated, so that the columns of the X matrix are not linearly independent (or are nearly so); in that case the regression coefficients become unreliable. This situation is known as multicollinearity. Both of these situations arise in many applications in fields such as bioinformatics, chemical and drug analysis, and neuroimaging data analysis.
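A quick way to see the breakdown, using made-up data with d greater than N: the rank of \(\bf X^T\bf X\) cannot exceed N, so solve() fails.

```r
# With more predictors than observations, X'X is rank deficient and has no inverse
set.seed(3)
N <- 5; d <- 10
X <- matrix(rnorm(N * d), N, d)
y <- rnorm(N)

XtX <- t(X) %*% X
qr(XtX)$rank                       # at most N = 5, far below d = 10
try(solve(XtX) %*% t(X) %*% y)     # solve() reports a (computationally) singular system
```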
One potential solution for multicollinearity and fewer observations is to use principal component analysis (PCA) to create a smaller, uncorrelated set of new features and then regress on those. This approach is known as principal component regression (PCR). A drawback is that PCA aims only at capturing the variance of the predictor matrix X; the selected principal components may therefore not be the most relevant ones for predicting the response variables. A better option is to look for those components of X that are most relevant for Y. This is what is done by the method known as partial least squares (PLS). The method was developed in econometrics about fifty years ago and has been used extensively in chemometrics.
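To make the PCR idea concrete, here is a minimal sketch on made-up data (the variable names are mine); the pls package used later in this post automates this, including cross-validation for choosing the number of components.

```r
# Minimal PCR sketch: regress the response on a few principal components of X
set.seed(4)
N <- 20; d <- 50                      # more predictors than observations
X <- matrix(rnorm(N * d), N, d)
y <- X[, 1] - 2 * X[, 2] + rnorm(N, sd = 0.5)

pc <- prcomp(X, center = TRUE, scale. = TRUE)
k <- 3                                # number of components kept (normally chosen by CV)
scores <- pc$x[, 1:k]                 # uncorrelated principal component scores
pcr_fit <- lm(y ~ scores)             # ordinary least squares on the scores
summary(pcr_fit)$r.squared
```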
Basic Idea of Partial Least Squares Regression
In PCA, we decompose \(\bf X^{T}\bf X\) to determine the principal components; in other words, we work with the covariance of the predictors. In PLS, we work with the cross-covariance \(\bf X^{T}\bf Y\) to simultaneously decompose X and Y and find components that explain X as well as Y. The \(N\times d\) matrix X is decomposed/approximated as \(\bf X = \bf T\bf P^T\), with the columns of T mutually orthogonal (in some formulations they are also normalized so that \(\bf T^T\bf T = \bf I\)). The number of columns in T, known as the X-score matrix, determines the number of extracted components; these components are called latent vectors and serve the same purpose as the eigenvectors in PCA. Their number determines the accuracy of the resulting regression model. The matrix P is known as the X-loading matrix. Instead of regressing Y on X, Y is regressed on T and approximated as \(\bf Y = \bf T\bf Q^T\); the matrix Q is called the Y-loading (or Y-weight) matrix. The steps to determine T, P, and Q are described below using an example. Once these matrices are determined, the regression coefficients for predicting Y from X follow.
PLS Regression Steps
There are a few slightly different versions of algorithms for PLS. I am going to describe an iterative algorithm, NIPALS (nonlinear iterative partial least squares), to extract the components, i.e. the combinations of predictors, used in PLS. I will illustrate the steps for performing PLS regression using a small made-up example. The example consists of profiles of three candidates, each described by numerical ratings on four traits. This part of the data, three observations and four predictors, forms the matrix X. Each candidate is evaluated by two experts who assign a score to each applicant; these scores form the response matrix Y. The X and Y matrices are given below, and our objective is to build a regression model that predicts the scores given by the two experts.
Id# | Trait1 | Trait2 | Trait3 | Trait4 | Score1 | Score2 |
---|---|---|---|---|---|---|
1 | 7 | 7 | 6 | 8 | 7 | 8 |
2 | 5 | 4 | 6 | 5 | 5 | 6 |
3 | 8 | 7 | 5 | 3 | 2 | 4 |
Step 1: Initialize two matrices, E0 = X and F0 = Y. Center and normalize each column of E0 and F0.
Step 2: Form cross product matrix \(\bf S=\bf X^T\bf Y\) and perform its singular value decomposition (SVD). Designate the first left singular vector as \(\bf w_1 \). The vector \(\bf w_1\) forms the first column of a weight matrix, W, which is later used to express regression coefficients to relate Y and X. These steps in R and their results are shown below.
X = matrix(c(7,5,8,7,4,7,6,6,5,8,5,3), nrow=3, ncol=4)
Y = matrix(c(7,5,2,8,6,4), nrow=3, ncol=2)
E0 = scale(X)
F0 = scale(Y)
S0 = t(E0)%*%F0
s = svd(S0)
w1 = s$u[,1]
print(w1)
[1] -0.27586226 -0.04227804  0.64530686  0.71111999
Step 3: Using \(\bf w_1 \), next calculate \(\bf t_1 \), the first component/column of matrix T. Follow this up by calculating normalized \(\bf p_1 \text{ and }\bf q_1\). These steps are shown below.
t1 = E0%*%w1
print(t1)
           [,1]
[1,]  1.0414819
[2,]  0.6281868
[3,] -1.6696687
p1 = t(E0)%*%t1/colSums(t(t1)%*%t1) # Normalized p1
q1 = t(F0)%*%t1/colSums(t(t1)%*%t1) # Normalized q1
print(p1)
           [,1]
[1,] -0.4489104
[2,] -0.2549863
[3,]  0.6777327
[4,]  0.6019191
print(q1)
          [,1]
[1,] 0.6604169
[2,] 0.6353619
Step 4: Having extracted the first latent vector and corresponding loading vectors, the matrices E0 and F0 are deflated by subtracting information related to the first latent vector. This produces deflated matrices E1 and F1 as shown in the calculations below.
E1 = E0-t1%*%t(p1)
F1 = F0-t1%*%t(q1)
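As a quick check of what deflation accomplishes, the residual matrices carry no information about the extracted score: t1 is orthogonal to every column of E1 and F1. Continuing the code above, the printed values are numerically zero.

```r
# Deflation removes the direction of t1 from both data blocks
print(t(t1) %*% E1)   # ~ zero row vector
print(t(t1) %*% F1)   # ~ zero row vector
```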
Step 5: Calculate the cross product matrix of E1 and F1 as was done in Step 2. With this new cross product matrix, repeat Steps 3 and 4 and save the resulting \(\bf w, \bf t, \bf p\text{ and } \bf q \) vectors as the next columns of the matrices W, T, P, and Q, respectively. This yields the second component. We continue these steps until the deflated matrices are essentially zero or the desired number of components has been extracted; then the algorithm stops.
Carrying out the steps as shown above, we get the following result.
E1 = E0-t1%*%t(p1)
F1 = F0-t1%*%t(q1)
S1 = t(E1)%*%F1
s1 = svd(S1)
w2 = s1$u[,1]
print(w2)
[1] -0.5827933 -0.7163611  0.1092040 -0.3677679
t2 = E1%*%w2
print(t2)
           [,1]
[1,] -1.1766607
[2,]  1.3882964
[3,] -0.2116356
p2 = t(E1)%*%t2/colSums(t(t2)%*%t2)
q2 = t(F1)%*%t2/colSums(t(t2)%*%t2)
print(p2)
           [,1]
[1,] -0.5827933
[2,] -0.7163611
[3,]  0.1092040
[4,] -0.3677679
print(q2)
           [,1]
[1,] -0.2034234
[2,] -0.2874933
E2 = E1 - t2%*%t(p2)
F2 = F1 - t2%*%t(q2)
S2 = t(E2)%*%F2
print(S2)
              [,1]          [,2]
[1,] -3.296958e-18 -4.659511e-18
[2,]  6.426732e-17  9.082743e-17
[3,]  6.593917e-18  9.319021e-18
[4,]  4.998050e-17  7.063622e-17
We notice that the deflated matrices are now essentially zero, so no more components need to be extracted. Now that we have determined the matrices T and Q (by collecting the t and q vectors as columns), we can predict Y as \(\tilde {\bf Y} = \bf T\bf Q^T\). This calculation is shown below.
T = cbind(t1, t2)   # X-score matrix
Q = cbind(q1, q2)   # Y-loadings (q vectors as columns)
Y_pred = T%*%t(Q)
print(Y_pred)
           [,1]          [,2]
[1,]  0.9271726  1.000000e+00
[2,]  0.1324532 -1.665335e-16
[3,] -1.0596259 -1.000000e+00
Well! These predicted numbers do not look right. But wait. Let's look at the centered and normalized Y, i.e. the matrix F0.
print(F0)
           [,1] [,2]
[1,]  0.9271726    1
[2,]  0.1324532    0
[3,] -1.0596259   -1
attr(,"scaled:center")
[1] 4.666667 6.000000
attr(,"scaled:scale")
[1] 2.516611 2.000000
So, what we have are predicted values that match the actual values well, but in the centered and scaled space. Our interest, however, is in predicting Y from X, so we need a relationship that regresses Y directly against X. This requires some algebraic manipulation: we define a new matrix \(\bf R = \bf W(\bf P^T \bf W)^{-1}\) and use it to relate X and Y through the familiar regression equation \(\tilde {\bf Y}= \bf X\bf B_{PLS}\), where \(\bf B_{PLS}= \bf R \bf Q^T\). Since the matrices we have been working with were derived from normalized data, an adjustment for the scaling is also needed. Finally, we need to adjust for centering, i.e. the intercept term, before we can get the final prediction of Y in terms of X. This is done by subtracting the column means of the current predictions from the original column means of Y and adding the result to the respective columns. Thus, we can express the final model as \(\tilde {\bf Y}= \bf X\bf B_{PLS}^*\), where the matrix X has an additional column of all 1s and \(\bf B_{PLS}^*\) includes the intercept. Calculations for these adjustments are shown below.
W = cbind(w1, w2)                    # weight matrix
P = cbind(p1, p2)                    # X-loading matrix
R = W%*%solve(t(P)%*%W)              # Calculate R matrix
B1 = R%*%t(Q)                        # B matrix in the scaled space, before adjustment
B = diag(1/apply(X,2,sd))%*%B1%*%diag(apply(Y,2,sd))  # final B matrix, scaling undone
Y_int_pred = X%*%B                   # Intermediate result, still missing the intercept
print(Y_int_pred)
         [,1]     [,2]
[1,] 16.52966 14.07664
[2,] 14.52966 12.07664
[3,] 11.52966 10.07664
intercept = apply(Y,2,mean)-apply(Y_int_pred,2,mean)
print(intercept)
[1] -9.529659 -6.076640
YPred = matrix(c(Y_int_pred[,1]+intercept[1], Y_int_pred[,2]+intercept[2]), nrow=3, ncol=2)
print(YPred)
     [,1] [,2]
[1,]    7    8
[2,]    5    6
[3,]    2    4
Finally, we have the prediction which happens to be 100% accurate in this case. Of course, the prediction is being made on the same data that was used to derive the model.
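As a sanity check (not part of the original example), one could fit the same two-component model with the pls package introduced in the next section and compare its predictions with the hand-computed ones; with two components on this tiny data set the predictions should reproduce YPred.

```r
# Hypothetical cross-check of the hand-rolled NIPALS result against the pls package
library(pls)
toy <- data.frame(Y = I(Y), X = I(X))            # keep X and Y as matrices inside a data frame
fit <- plsr(Y ~ X, ncomp = 2, data = toy, scale = TRUE)
predict(fit, ncomp = 2)                          # should match YPred computed above
```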
PCR vs. PLS
Since PCR, i.e. principal component analysis followed by OLS regression, is also applicable when we have more predictors than observations or highly correlated predictors, it's a good idea to compare PCR and PLS on a real data set. That is what we are going to do in this section using the pls package in R. There is another R package, plsdepot, that can be used instead. A PLS regression implementation is also available in Python's scikit-learn library.
I will use the yarn data set available in the pls package. The data set contains 28 near-infrared (NIR) spectra of PET, a type of polyester yarn, measured at 268 wavelengths, which serve as the predictors, with yarn density as the response. A plot of the spectra for all 28 observations is shown below. Although the package can perform cross-validation to select the number of components in the model, I will specify the number of components as 8. That way, we will get results for one through eight components, which will help us compare PLS and PCR. The package offers three different cross-validation (CV) methods; I will use leave-one-out (LOO) cross-validation since the number of examples is small. The following code shows the steps I followed to build the PLS regression model.
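The spectra plot can be reproduced with a couple of lines; a minimal sketch, assuming only that the yarn data from the pls package is loaded:

```r
# One line per observation: 28 NIR spectra across the 268 wavelength indices
library(pls)
data("yarn")
matplot(t(yarn$NIR), type = "l", lty = 1,
        xlab = "Wavelength index", ylab = "NIR intensity",
        main = "NIR spectra of 28 PET yarn samples")
```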
library(pls)
data("yarn")
str(yarn)
'data.frame':   28 obs. of  3 variables:
 $ NIR    : num [1:28, 1:268] 3.07 3.07 3.08 3.08 3.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : NULL
 $ density: num 100 80.2 79.5 60.8 60 ...
 $ train  : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
yarn.pls <- plsr(density ~ NIR, 8, data = yarn, validation = "LOO")
summary(yarn.pls)
Data:   X dimension: 28 268
        Y dimension: 28 1
Fit method: kernelpls
Number of components considered: 8

VALIDATION: RMSEP
Cross-validated using 28 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
CV           27.46    4.600    3.900    2.090   0.7686   0.5004   0.4425   0.2966   0.2643
adjCV        27.46    4.454    3.973    2.084   0.7570   0.4967   0.4398   0.2926   0.2610

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X          46.83    98.38    99.46    99.67    99.85    99.97    99.98    99.99
density    98.12    98.25    99.64    99.97    99.99    99.99   100.00   100.00
We note that there is not much difference between the cross-validated (CV) and adjusted CV (adjCV) error estimates. The model summary shows that the first two components explain over 98% of the variance in both the predictors and the response. This is also clear when we look at the RMSEP (root mean squared error of prediction) for different numbers of components. With four components, the RMSEP value on the training data is 0.41936.
RMSEP(yarn.pls, newdata = yarn)
(Intercept)     1 comps     2 comps     3 comps     4 comps     5 comps     6 comps     7 comps     8 comps
   26.47630     3.62801     3.50180     1.59352     0.41936     0.29223     0.25136     0.10932     0.08832
Now, let's fit a PCR model with the same number of components.
yarn.pcr <- pcr(density ~ NIR, 8, data = yarn, validation = "LOO")
summary(yarn.pcr)
Data:   X dimension: 28 268
        Y dimension: 28 1
Fit method: svdpc
Number of components considered: 8

VALIDATION: RMSEP
Cross-validated using 28 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
CV           27.46    30.82    4.276    2.558    2.612    1.643   0.4891   0.5288   0.3033
adjCV        27.46    31.43    4.264    2.547    2.618    1.609   0.4852   0.5258   0.2994

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X          52.17    98.60    99.47    99.70    99.88    99.97    99.98    99.99
density     5.50    98.15    99.40    99.58    99.95    99.99    99.99   100.00

RMSEP(yarn.pcr, newdata = yarn)
(Intercept)     1 comps     2 comps     3 comps     4 comps     5 comps     6 comps     7 comps     8 comps
    26.4763     25.7380      3.6016      2.0592      1.7244      0.5972      0.2864      0.2582      0.1262
Looking at the PCR model, we find that the RMSEP error value for four components is 1.7244. This means that PCR requires more components to achieve a performance level similar to PLS. Even with eight components, PLS gives slightly lower error than PCR.
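A convenient way to visualize this comparison is to plot the cross-validated RMSEP against the number of components for both models; the pls package provides validationplot() for this. A brief sketch using the two models fitted above:

```r
# Cross-validated RMSEP versus number of components for PLS and PCR
par(mfrow = c(1, 2))
validationplot(yarn.pls, val.type = "RMSEP", main = "PLS")
validationplot(yarn.pcr, val.type = "RMSEP", main = "PCR")
par(mfrow = c(1, 1))
```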
Summary
Let me summarize what I have discussed thus far.
- Both PCR and PLS are able to perform regression when you have highly correlated predictors or more predictors than the number of examples.
- PLS requires fewer components (latent vectors) than PCR to achieve the same level of prediction error. This is because PLS is based on the cross-covariance between predictors and responses, while PCR is based on the covariance of the predictors alone.
- Another way to look at the difference between PLS and PCR is that PLS can be viewed as a supervised method of dimensionality reduction for regression while PCR is an unsupervised method for dimensionality reduction.
- PLS can fit a single model to a multiple-response regression problem, unlike ordinary regression, where a separate model is fitted for each response variable.
- Finally, there is another approach to dealing with many predictors, known as shrinkage. The best known example of shrinkage methods is ridge regression. In shrinkage methods, none of the predictors are discarded and no predictor combinations are formed; instead, an additional term is introduced in the regression objective that constrains the magnitudes of the regression coefficients. PCR, PLS, and ridge regression behave similarly, although ridge regression is often preferred because it is easier to interpret. A brief sketch of ridge regression follows this list.
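As a brief illustration of the shrinkage idea, here is a minimal sketch on made-up data using the glmnet package (not part of the PLS example): alpha = 0 selects the ridge penalty, and cv.glmnet chooses the penalty strength by cross-validation.

```r
# Ridge regression: all predictors kept, coefficients shrunk toward zero
library(glmnet)
set.seed(5)
N <- 20; d <- 50                          # more predictors than observations
X <- matrix(rnorm(N * d), N, d)
y <- X[, 1] - 2 * X[, 2] + rnorm(N, sd = 0.5)

cv_fit <- cv.glmnet(X, y, alpha = 0)      # alpha = 0 gives the ridge penalty
ridge_fit <- glmnet(X, y, alpha = 0, lambda = cv_fit$lambda.min)
head(coef(ridge_fit), 10)                 # intercept plus shrunken coefficients; none are dropped
```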
My simplified example is based on the following paper that you may want to consult for details.