# Dimensionality Reduction via Linear Discriminant Analysis

In my previous post, I had written about principal component analysis (PCA) for dimensionality reduction. In PCA, the class label of each and every example, even if available, is ignored and only the feature values from each example are considered. Thus, PCA is considered an *unsupervised* approach. The emphasis in PCA is to preserve data variability as much as possible while reducing the dimensionality. Linear discriminant analysis (LDA) on the other hand makes use of class labels as well and its focus is on finding a lower dimensional space that emphasizes class separability. In other words, LDA tries to find such a lower dimensional representation of the data where training examples from different classes are mapped far apart. Since LDA uses class labels, it is considered a *supervised* learning approach.

##### Basic Idea Behind LDA

The basic idea behind LDA is best understood by looking at the figure below where data from two classes, shown in red and black, is being projected onto a line. The figure shows only two choices for the line onto which the projections are made. Since our goal is to prefer a projection where the projected data from two classes do not overlap as far as possible, clearly the projection on the right is preferable.

Given that an infinite number of projections are possible, we need to define a measure that can help us select the best projection. Intuitively, we would like the projected points from two classes to be as far apart as possible. We can capture this notion by measuring the difference between means of class 1 and class 2 projected points. Although means can be far apart, it is still possible for projected points from two classes to show a fair amount of overlap if the projected points are spread out. To counter this, we also want the projected points from each class to lie close to each other. This means the mapped points for each class should have as small a spread as possible. Now, combining the two requirements for a good projection we can write a criterion function *J* as

*J(Projection) = (Difference between projected means of class 1 and 2)**2/(Sum of spread of class 1 and class 2 projected points)*

Thus, we look for a line onto which the projection yields a high enough value for the difference in projected means and a small enough value for the sum of variances so as to yield a maximum value for the criterion function. This criterion function was suggested by R.A. Fisher and hence is known as *Fisher’s criterion* and the resulting solution is called *Fisher’s linear discriminant function*.

##### LDA Math and Steps

Lets look at few equations to figure out the steps for applying LDA. We begin with *N **d*-dimensional feature vectors, \(\boldsymbol{x}_i, i= 1, \cdots,N\), from two classes, \(n_1 \text{ from class } c_1 \text{ and }n_2 \text{ from class }c_2\). Let a vector of unit length **w **define the line onto which the N data vectors are mapped to produce projections \(y_i, i= 1, \cdots,N\) via the relationship \(y_i= \bf {w}^t\bf{x}_i\).

The means and spreads, more commonly known as *scatter*, of the projected data points are then:

\(\tilde{m}_j = \frac{1}{n_j}\sum\limits_{y_i\in{c_j}}\limits^{n_j}y_i, j= 1, 2\)

\(\tilde{s}_{j}^2 = \sum\limits_{y_i\in{c_j}}\limits^{n_j}(y_i-m_j)^2, j= 1, 2\)

The sum \(\tilde{s}_{1}^2 + \tilde{s}_{2}^2\) is called the total *within-class scatter* of the projected data points. The Fisher criterion, measuring the goodness of projection onto the line defined by **w** is thus expressed as

\(J(\bf {w}) = \frac{|\tilde{m}_1 – \tilde{m}_2|^2}{ \tilde{s}_{1}^2 + \tilde{s}_{2}^2}\)

Instead of searching for optimal **w** to maximize the criterion, it turns out that a close form solution exists. It is given by the following relationship

\(\bf w = \bf S_{W}^{-1}(\bf {m}_1 -\bf {m}_2)\), where

\(\bf{m}_j = \frac{1}{n_j}\sum\limits_{\bf x_i\in{c_j}}\limits^{n_j}\bf x_i, j= 1, 2\)

are the means of class 1 and class 2 examples in the original space. The matrix \(\bf S_{W}\) is known as the *within-class scatter matrix*; it is formed by summing the *scatter matrices* of class 1 and class2. That is

\(\bf S_{W} = \bf {S}_1 +\bf {S}_2\), and

\(\bf S_j=\sum\limits_{\bf x_i\in{c_j}}\limits^{n_j}(\bf {x}_i -\bf {m}_j)(\bf {x}_i -\bf {m}_j)^T \text{ for }j= 1, 2\)

Thus, the steps for applying LDA are:

- Compute the mean vectors of two classes
- Compute the scatter matrices of two classes
- Add scatter matrices to form the within-class scatter matrix \(\bf S_{W}\) and calculate its inverse
- Obtain the solution vector
**w**by performing matrix multiplication \(\bf S_{W}^{-1}(\bf {m}_1 -\bf {m}_2)\) - Obtain the projections as \(y_i= \bf {w}^t\bf{x}_i\)

##### A Simple LDA Example

Lets consider a business that uses three skill tests to assign its employees to one of two groups: A team and B team. We have test scores of ten employees, five from each group. Thus, we have ten three-dimensional vectors from two classes. We will obtain a one-dimensional representation of these ten three-dimensional examples by following the steps outlined above. Lets first describe the input vectors and calculate the mean vectors of team A and team B.

import numpy as np from numpy import linalg as LA import matplotlib.pyplot as plt Ateam = np.array([[8,9,6],[6,7,5],[9,6,3],[7,8,2],[9,4,4]])# Skill scores for employees in team A Ateam_mean = np.mean(Ateam,0)# compute mean vector print(Ateam_mean)

[7.8 6.8 4. ]

Bteam = np.array([[5,4,7],[3,7,2],[4,5,5],[2,6,4],[4,3,4]])# Skill scores for employees in team B Bteam_mean = np.mean(Bteam,0) print(Bteam_mean)

[3.6 5. 4.4]

Next, we calculate scatter matrices of individual classes and add them to get the within-class scatter matrix.

S1 = np.zeros((3,3)) for j in range(len(Ateam)): S1 += np.outer((Ateam[j]-Ateam_mean),(Ateam[j]-Ateam_mean)) print(S1)

[[ 6.8 -5.2 -1. ] [-5.2 14.8 3. ] [-1. 3. 10. ]]

S2 = np.zeros((3,3)) for j in range(len(Bteam)): S2 += np.outer((Bteam[j]-Bteam_mean),(Bteam[j]-Bteam_mean)) print(S2)

[[ 5.2 -5. 5.8] [-5. 10. -7. ] [ 5.8 -7. 13.2]]

SW = S1+S2 ##Within-Class scatter matrix print(SW)

[[ 12. -10.2 4.8] [-10.2 24.8 -4. ] [ 4.8 -4. 23.2]]

Note that we could have calculated class scatter matrices as shown below. This is because the sample covariance matrix is obtained by dividing every term of the scatter matrix by a factor of number of examples minus one.

S1 = (len(Ateam)-1)*np.cov(Ateam.T) S2 = (len(Bteam)-1)*np.cov(Bteam.T)

Once we have the within-class scatter matrix and mean vectors, we calculate the solution vector to project our three-dimensional data to a line by following the steps shown below.

from numpy.linalg import inv SW_inv = inv(SW) W = np.matmul(SW_inv,(Ateam_mean-Bteam_mean))## Weight vector for projection print(SW_inv) print(W)

[[ 0.13580391 0.05279105 -0.01899546] [ 0.05279105 0.06199744 -0.00023307] [-0.01899546 -0.00023307 0.04699336]]

[ 0.67299849 0.33341102 -0.09899779]

We can now map our ten examples to one dimension. We can also calculate the projected means. Assuming that two classes, team A and B, are equally probable, we can calculate the midway point between the two projected means. This point, designated as AB_cutoff point, can be used to perform classification of new data.

Ateam_projections = np.matmul(W,Ateam.T) print(Ateam_projections) Bteam_projections = np.matmul(W,Bteam.T) print(Bteam_projections)

[7.79070038 5.87687915 7.76045915 7.18028202 6.99463932] [4.00565202 4.15487705 3.86406013 2.95047197 3.29623587]

Ateam_mean_proj = np.matmul(W.T,Ateam_mean) Bteam_mean_proj = np.matmul(W.T,Bteam_mean) AB_cutoff = 0.5*(Ateam_mean_proj+Bteam_mean_proj) print(Ateam_mean_proj,Bteam_mean_proj,AB_cutoff)

7.1205920055937515 3.6542594103251362 5.387425707959444

A graph of the resulting projections is shown below; it clearly shows that team A and team B data gets projected separately.

Now suppose the business in our example hires a new employee whose score on three skills is 5, 5, and 6. Should this employee be assigned to team A or B? We can answer this by finding the new employee’s projection. This turns out to be 4.438 which is less than the AB_cutoff value, i.e. closer to team B’s projection mean. Hence the new employee should be assigned to team B.

##### How to Apply LDA for More than Two Classes?

Not all classification problems are two class problems. So the question is how do we perform dimensionality reduction with LDA when the number of classes is, say, *K*. In such situations, the LDA approach is known as the *multiple discriminant analysis* and it uses *K-1* projections to map the data from the original *d*-dimensional space to a (*K-1)-*dimensional space under the condition that *d > K*.

Lets represent the *K-1* projections through the mapping \(\bf{y}_i= \bf{W}^t\bf{x}_i\), where ** W** is a (

*k-1) x d*projection matrix with each row representing a discriminant function. The projected vectors \(\bf {y}_i, i= 1, \cdots,N\) are of

*K-1*dimensions. Following the requirement that projected points from different classes should be far apart with small spread, the solution for projection matrix

*is given by the eigenvectors corresponding to the top*

**W***k-1*eigenvalues of the following matrix

\(\bf S_W^{-1}\bf S_B\), where

\(\bf S_{W} = \sum \limits_{i=1}\limits^{K}\bf {S}_i \) is the within-class matrix as defined earlier, and

\(\bf S_{B} = \sum \limits_{i=1}\limits^{K}(\bf {m}_i-\bf m)(\bf {m}_i-\bf m)^t \) is known as the *between-class scatter matrix *with ** m **being the mean of all the

*d*-dimensional examples from

*K*classes.

Lets go back to our A and B teams example and assume that we have another set of three-dimensional examples based on three skill scores. These examples represent members of team C. To perform dimensionality reduction on our three teams data, we will now generate two projections using LDA. To do this, we first compute team C mean and scatter.

Cteam = np.array([[3,5,8],[3,4,8],[4,5,9],[4,5,8],[5,4,7]])# Skill scores for employees in team C Cteam_mean = np.mean(Cteam,0) print(Cteam_mean)

[3.8 4.6 8.]

S3 = np.zeros((3,3)) for j in range(len(Cteam)): S3 += np.outer((Cteam[j]-Cteam_mean),(Cteam[j]-Cteam_mean)) print(S3)

S3 = np.zeros((3,3)) for j in range(len(Cteam)): S3 += np.outer((Cteam[j]-Cteam_mean),(Cteam[j]-Cteam_mean)) print(S3)

[[ 2.8 -0.4 -1. ] [-0.4 1.2 1. ] [-1. 1. 2. ]]

To calculate the between-class matrix, we calculate the mean of all team members from teams A, B and C, and then perform the necessary vector multiplications as shown below. The within-class scatter matrix is calculated by adding team A, B and C scatter matrices.

total_mean = np.mean([Ateam_mean,Bteam_mean,Cteam_mean],0) SB = np.zeros((3,3)) SB += np.outer((Ateam_mean - total_mean),(Ateam_mean - total_mean)) SB += np.outer((Bteam_mean - total_mean),(Bteam_mean - total_mean)) SB += np.outer((Cteam_mean - total_mean),(Cteam_mean - total_mean)) print(total_mean) print(SB)

[5.06666667 5.46666667 5.46666667] [[11.22666667 5.42666667 -5.65333333] [ 5.42666667 2.74666667 -3.65333333] [-5.65333333 -3.65333333 9.70666667]]

SW = S1+S2+S3 ##Within-Class scatter matrix print(SW)

[[ 14.8 -10.6 3.8] [-10.6 26. -3. ] [ 3.8 -3. 25.2]]

Having calculated the needed matrices, we can now compute the eigenvalues and eigenvectors of the \(\bold S_W^{-1}\bold S_B\) matrix.

SW_invSB = np.matmul(SW_inv,SB) eig_vals, eig_vecs = np.linalg.eig(SW_invSB) print(eig_vals) print(eig_vecs)

[ 1.96266109e+00 -4.68125067e-17 2.17635603e-01] [[-0.85426543 -0.3810106 0.43488256] [-0.45224032 0.9163546 0.26973255] [ 0.25633818 0.12298443 0.85913998]]

First and third eigenvalues are the two largest eigenvalues. Thus we select the corresponding eigenvectors to form the projection matrix and project the data to two dimensions. The steps for these and a scatter plot of the projected points is shown below.

W = np.array([eig_vecs[:,0],eig_vecs[:,2]]) y_Ateam = Ateam.dot(W.T)# Projections of team A members y_Bteam = Bteam.dot(W.T)# Projections of team B members y_Cteam = Cteam.dot(W.T)# Projections of team C members plt.scatter(y_Ateam[:,0],y_Ateam[:,1],c="r",label='A Team',marker= 'x') plt.scatter(y_Bteam[:,0],y_Bteam[:,1], c="g", label='B Team') plt.scatter(y_Cteam[:,0],y_Cteam[:,1], c="b", label='C Team',marker= 'x') plt.xlabel('Proj1') plt.ylabel('Proj2') plt.legend() plt.show()

It is seen that the projections of different classes are well separated except a point from team B overlapping with a point from team C. Lets use the new employee’s scores again and map the new employee to the 2-dimensional space defined above. In this case, the mapped coordinates are [-4.9944 8.6779]. Since this point is closest to mean of projections for team C, the new employee gets assigned to team C.

##### A Comparison of Dimensionality Reduction via LDA and PCA

We will use a well known data set and apply LDA and PCA to see what kind of results are obtained. We do this by using the *wine* data set available in the scikit-learn machine learning library. This data set has 178 examples from three classes. Each example consists of 13 real-valued features. In this case, we will use the LDA and PCA class from the scikit-learn library.

from sklearn.datasets import load_wine from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.decomposition import PCA wine = load_wine() Wine_Data = wine.data Wine_Class = wine.target lda = LinearDiscriminantAnalysis(n_components=2) lda.fit(Wine_Data, Wine_Class) Proj_lda = lda.transform(Wine_Data) pca = PCA(n_components=2) pca.fit(Wine_Data) Proj_pca = pca.transform(Wine_Data)

The resulting projections in two-dimensions by the both methods are shown below. It is clear from these projections that the dimensionality reduction by the LDA approach maintains class separation well while the reduction by the PCA has data from all three classes mixed up.

To summarize, we should be using LDA for dimensionality reduction if the objective is to maintain class separability. If preserving the data variability, i.e. keeping the data approximation error while reducing dimensionality, is of importance, then PCA should be used. While PCA is meant only for dimensionality reduction, LDA can be used for dimensionality reduction as well as for classification.

Hi, is that correct that LDA/FDA can only generate 2 output?

Not correct. The number of outputs/features that LDA/FLD can generate is c-1 where c is the number of classes and the dimensionality of the data is d with d>c.

Thanks for your reply.

Suppose I have 100 features and have 2 classes (and have 50 samples), I want to reduce to 5 features. Is LDA/FDA only generate 2 output?

Or I can have reduced the features from 100 features to 5 features, and will get (5 features with 50 samples).

Let me know if my question is unclear.

In your scenario, you will use PCA/SVD to reduce data to 5-dimensions. LDA/FLD will give you only one reduced feature. You may also want to check my post on Partial Least Squares (PLS) which considers the situation you have, that is more dimensions than the number of samples.

(Sorry, I unable to reply your message above because of the layout)

I am very confused with your term “feature” and “dimension”.

Let me make it clear:

I have 100 feature/dimension and have 2 classes (and have 50 samples), I want to reduce to 5 features/dimension. Is LDA/FDA only generate 2 feature/dimension?

Or I can have reduced the features from 100 features/dimension to 5 features/dimension, and will get (5 features/dimension with 50 samples).

Sorry for your time, I am very confused.

100 features/dimensions are in your original data. You want to reduce it to 5-dimensions. Each reduced dimension is going to be a new feature. You have 50 examples. This means you have 50 samples of 100-dimensional vectors. By applying PCA/SVD, you can reduce your data to 50 examples of 5-dimensional vectors. Hope this helps.

Thanks for your time and reply.

Let me make it clearer with the terms:

Original dataset:

100 features/dimensions;

2 classes;

50 samples/examples.

(50 samples of 100-dimensional vector)

Using PCA/SVD: I will get 50 examples of 5 dimensional-vectors)

Using LDA/FDA will give you only one reduced feature, will get (50 examples of 99-dimensional vector)

I think Linear Discriminant Analysis (LDA) , Fisher Linear Analysis (FLA), Fisher Discriminant Analysis (FDA) are the same name.

One correction in what you have stated.

Using LDA/FDA, you will get 50 examples of 1-dimensional vectors. This means you will get 50 numbers, one number for each example.

FLA, and FDA are the same and are a specific case of LDA.

Is Linear Discriminant Analysis / Fisher Discriminant Analysis only generate 2 output as dimensional reduction method?

Thanks.

Using LDA/FDA, will get 50 examples of 1-dimensional vectors.

That is strange, is it because I have 2 classes, so “c-1” is my output (2-1 = 1) ?

If the data have 5 classes, so LDA/FDA will get 3-dimensional vector?

If you have 5 classes, LDA/FDA will give you 4 -dimensional vectors.

Good luck!

Thanks. That is a bit strange.

4-dimensional vectors, is that meaning reduce from 100 dimensions to 4 dimensions?

I was thinking LDA/FDA able to take the top k eigenvector.

Start with n dimensions and end with k dimensions, where k < n

(Last question, hopefully, to make it clear)

You are confusing LDA with PCA where you take top k eigenvectors. May I suggest you re-read posts on PCA and LDA and work through the examples to minimize confusion. Good luck!

Thanks. I understand your meaning.

From my understand, LDA/FDA able to take the top k eigenvector.

Start with n dimensions and end with k dimensions, where k c.

I am so sorry, that do I make a huge mistake initially?

No problem. We all sometime get confused. Good luck with your work.

Just a clarification. In PCA, we take top k eigenvectors of the correlation matrix. In LDA (multiple linear discriminant analysis with more than 2 classes), we also take top k eigenvectors of the matrix product of inverse of within-class and between class scatter matrices.

Thanks. So, LDA/FDA able to “Start with n dimensions and end with k dimensions, where k < n".

What my understanding that LDA/FDA able to reduce dimension from n dimension to k dimension (and not restricted to only 2 dimension).

The process between PCA and LDA/FDA is different, but both have k eigenvales, and can "Start with n dimensions and end with k dimensions, where k < n".

LDA/FDA is not restricted to 2 dimensions but the correct term in that situation is MLDA/MFDA (Multiple linear discriminant analysis).

Hi, after I had read a couple of website. You are correct at the beginning. Sorry for confusing.

In the context of dimensionality reduction using LDA/FDA. The output is “c-1” where “c” is the number of classes and the dimensionality of the data is n with “n>c”. and

NOT “LDA/FDA can start with “n” dimensions and end with k dimensions, where “k” less than “n”. “

I am glad you got it.

Do you agree that? Let say my original dataset has 2 classes, the output will be 1 dimensionality ( 2 – 1 =1 ), likewise, if my original dataset has 5 classes, the output will be 4 dimensionality.

Yes. That is what I have been telling you from the beginning.