Understanding Compression of Convolutional Neural Nets: Part 1
There are neural network applications where resources such as compute power, memory, and battery power are limited. In such cases, we may want to simplify a trained network to better fit the available resources while maintaining the network's accuracy as much as possible. Such a network simplification is termed compression of deep neural networks. In this three-part series of blog posts, I am going to share with you some simple examples of neural network compression to help you better understand how deep neural nets are compressed. In Part 1, I will begin with a technique known as singular value decomposition (SVD), which forms the basis of the compression methods covered in this series. Part 2 will show compression of feedforward neural networks, and Part 3 will demonstrate compression of convolution layers in convolutional neural networks (CNNs).
Singular Value Decomposition (SVD)
Before describing SVD, let us recall the notion of matrix rank, the number of linearly independent rows (or columns) of a matrix. For example, the matrix
\(\bf A = \begin{bmatrix}1 & 2\\3 & 4\end {bmatrix}\)
has rank 2 because its rows are linearly independent, whereas the matrix
\(\bf A = \begin{bmatrix}1 & 2\\2 & 4\end {bmatrix}\)
has rank 1 because its second row is simply twice the first row.
The singular value decomposition expresses an m x n matrix A of rank r as the product
\(\bf A = \bf {UDV}^t\)
where \(\bf U\) is an m x m orthogonal matrix of left singular vectors and \(\bf V\) is an n x n orthogonal matrix of right singular vectors. The matrix \(\bf D\) is an m x n diagonal matrix of singular values, arranged in decreasing order. A low-rank approximation to A is obtained by keeping only the top k singular values and the corresponding left and right singular vectors, as given by the following expression. In other words, the approximation is a weighted sum of rank-one matrices.
\(\hat{ \bf A} = \sum\limits_{j=1}\limits^{k} d_{jj}\bf U_j\bf V_j^t,\text{ }k\leq r\)
Let me give you an example by decomposing a matrix and obtaining its different approximations.
from scipy import linalg
import numpy as np

np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)

# Create a 4x3 matrix
A = np.array([[6,6,2],[0,1,3],[4,0,1],[0,6,2]])

# Obtain SVD decomposition
U, d, Vt = linalg.svd(A)
print(' U matrix\n', U, '\n d array\n', d, '\n V-Transpose matrix\n', Vt)
This code creates a 4×3 matrix A and decomposes it to obtain the U and V matrices along with an array d of singular values, as shown below. Note that the array d of singular values represents the diagonal elements of the matrix D in the SVD equation given earlier. In this case, there are three non-zero singular values, which means that the rank of matrix A is three.
 U matrix
 [[-0.82 -0.28  0.21 -0.46]
 [-0.16  0.2  -0.93 -0.26]
 [-0.24 -0.62 -0.28  0.69]
 [-0.5   0.7   0.1   0.5 ]]
 d array
 [10.51  5.06  2.63]
 V-Transpose matrix
 [[-0.56 -0.77 -0.32]
 [-0.83  0.54  0.16]
 [ 0.05  0.36 -0.93]]
Given U, V, and d, we can get back our original matrix A without any error by performing the matrix multiplication \(\bf {UDV}^t\).
D = np.zeros((4,3))  # (4,3) is the size of matrix A
for i in range(len(d)):
    D[i,i] = d[i]
Aa = U@(D@Vt)  # Aa is the reconstructed matrix
print(' Reconstructed Matrix A\n', Aa)
 Reconstructed Matrix A
 [[6. 6. 2.]
 [0. 1. 3.]
 [4. 0. 1.]
 [0. 6. 2.]]
The reconstructed matrix is identical to the original matrix. Suppose we reconstruct matrix A using only the first two singular values. We do that by replacing ‘len(d)’ above with ‘len(d)-1’ and get the following approximated matrix A.
 Reconstructed Matrix A
 [[ 5.97  5.8   2.51]
 [ 0.12  1.87  0.72]
 [ 4.04  0.26  0.31]
 [-0.01  5.91  2.25]]
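This rank-two approximation can also be built directly as the weighted sum of rank-one matrices from the SVD formula given earlier. Here is a minimal sketch that reuses the U, d, and Vt computed above:

A2 = np.zeros((4, 3))
for j in range(2):
    # d[j] * np.outer(U[:, j], Vt[j, :]) is the rank-one term d_jj * U_j V_j^t
    A2 += d[j] * np.outer(U[:, j], Vt[j, :])
print(A2)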
Since this approximation is obtained by summing two rank-one matrices, you can easily verify that its rank is 2. You can also calculate the approximation error by comparing the original and the reconstructed matrices element by element, squaring the differences, and adding them up, as shown below. This measure is the square of the Frobenius norm of the difference between the two matrices.
from numpy.linalg import matrix_rank

print(matrix_rank(Aa))
err = np.sum(np.square(A - Aa))  # squared Frobenius norm of the difference
print(' Approximation Error = ', err)
2
 Approximation Error =  6.904768216213785
If you look at the approximation error, you will find that it equals the square of 2.63, the singular value not used in the reconstruction. You can also approximate matrix A using only one singular value. We do this by replacing ‘len(d)’ with ‘len(d)-2’ and get the following approximation.
 Reconstructed Matrix A
 [[4.79 6.57 2.75]
 [0.96 1.32 0.55]
 [1.43 1.95 0.82]
 [2.92 4.01 1.67]]
You can verify that the error in this case equals the sum of the squares of the second and third singular values, which were not used in creating the approximation.
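A minimal sketch of this check, assuming Aa now holds the reconstruction obtained with a single singular value:

# The rank-1 approximation error equals the sum of the squares of the
# two singular values that were dropped
err = np.sum(np.square(A - Aa))
print(err, d[1]**2 + d[2]**2)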
Truncated SVD
The singular value decomposition described above creates an approximate matrix whose size remains the same. In some cases, the matrix A to be decomposed is treated as a data matrix, with each row representing an observation and each column an observed feature. In such a scenario, the objective is not only to approximate A but also to reduce its size, i.e., the resulting matrix should have the same number of rows (observations) but fewer columns. The matrix approximation is thus aimed at dimensionality reduction, similar to PCA. The use of SVD in this context is known as truncated SVD. To obtain a reduced matrix, we use only the top k ≤ r singular values and the first k columns of the U matrix, and calculate the reduced representation as
\(\bf A_k= \bf U_{k}\bf D_{k},\text{ }k\leq r\)
where \(\bf {U}_k\) is the sub-matrix formed by the first k columns of U and \(\bf {D}_k\) is the sub-matrix of D formed by its first k rows and columns.
An example of calculating the truncated SVD is shown below.
k = 2
Ak = U[:,0:k] @ D[0:k,0:k]
print('Reduced matrix Ak \n', Ak)
Reduced matrix Ak
 [[-8.58 -1.43]
 [-1.73  1.02]
 [-2.55 -3.15]
 [-5.23  3.54]]
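Equivalently, since the columns of V are orthonormal, the same reduced representation can be obtained by projecting A onto the first k right singular vectors, i.e. \(\bf A_k = \bf A\bf V_k\). A quick check:

# Projecting A onto the first k right singular vectors gives the same result
Ak_proj = A @ Vt[0:k, :].T
print(np.allclose(Ak, Ak_proj))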
The Scikit-learn library also provides an implementation of truncated SVD as shown below.
from sklearn.decomposition import TruncatedSVD

A = A.astype(np.float64)  # the arpack solver expects a float matrix
dim_red = TruncatedSVD(n_components=k, algorithm='arpack')
Ak = dim_red.fit_transform(A)
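The values returned by TruncatedSVD should agree with our manual \(\bf U_k\bf D_k\) computation, although individual columns may come back with flipped signs because singular vectors are only determined up to sign. A quick check that compares magnitudes:

# Compare magnitudes to ignore possible per-column sign flips
print(np.allclose(np.abs(Ak), np.abs(U[:, 0:k] @ D[0:k, 0:k])))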
Before concluding this post, let me show you two examples of SVD applications. The first example is image compression, wherein I performed singular value decomposition on an image and reconstructed it using the top 32, 16, and 8 singular values. The results of these reconstructions, along with the original image, are shown below. As you can see, the first two approximations provide good compression without sacrificing much of the image quality.

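If you want to try this kind of reconstruction yourself, here is a minimal sketch of rank-k image approximation. I am using a synthetic 2-D array as a stand-in for a real grayscale image; substitute your own image array to experiment:

from scipy import linalg
import numpy as np

# Stand-in for a grayscale image: any 2-D float array will do
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
img = np.outer(np.sin(8 * x), np.cos(5 * x)) + rng.normal(0, 0.05, (256, 256))

U_img, d_img, Vt_img = linalg.svd(img, full_matrices=False)

def reconstruct(k):
    # Rebuild the image from the top k singular values/vectors
    return (U_img[:, :k] * d_img[:k]) @ Vt_img[:k, :]

for k in (32, 16, 8):
    approx = reconstruct(k)
    rel_err = linalg.norm(img - approx) / linalg.norm(img)
    print('k =', k, ' relative Frobenius error =', round(rel_err, 3))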
The second example is an application of truncated SVD. In this case, truncated SVD has been used to reduce the Iris data matrix to a two-dimensional space, thus performing dimensionality reduction.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = iris.data
iris_target = iris.target

# Reuse the two-component TruncatedSVD transformer defined earlier
iris_transformed = dim_red.fit_transform(iris_data)

plt.scatter(iris_transformed[:,0], iris_transformed[:,1], c=iris_target)
plt.title('Plot of Iris Data in 2-Dimensional Transformed Space')
plt.show()

The dimensionality reduction shows that Iris data can be well separated in the transformed two-dimensional space.
With the above introduction to SVD, we are ready to move on to compressing feedforward and convolutional neural networks in the coming posts.
Hey Krishan,
I was wondering about the truncated SVD part: why did the V^T matrix disappear? We don’t get an accurate reduced representation of matrix A now, right? As I understand it, we want the columns of the original matrix A with the highest singular values (as they hold the most information). This means we have to get that part of the A matrix back, which isn’t the case right now (the values in Ak are nothing like the values in the original A). Please do let me know if I’m wrong or if you have any clarifications!
Best regards,
Klaas
You are correct in that we don’t get an accurate reduced representation of matrix A in truncated SVD. The reason is that truncated SVD transforms the original data, matrix A, into a new feature space of lower dimensionality. Being a different space, the numbers in this space look very different from those in the original space. If we need to go back to the original values, we would also need to use the V matrix. You may see some similarities with PCA here: there, too, the transformed data bears no resemblance to the original numbers, but we can get back to the original numbers, or their approximations, by applying the inverse transformation.
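For instance, continuing with the 4 x 3 example from the post, here is a minimal sketch of mapping the reduced representation back to the original space, using the U, D, and Vt computed manually earlier:

Ak_manual = U[:, 0:k] @ D[0:k, 0:k]   # reduced representation U_k D_k
A_back = Ak_manual @ Vt[0:k, :]       # back to the original 4 x 3 space
print(A_back)                         # approximates the original matrix A

With Scikit-learn, dim_red.inverse_transform(Ak) performs the same mapping back to the original feature space.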
Hope this helps.