Solving PCA without matrix calculus and Lagrange multipliers

Who needs calculus anyway?

1. Introduction

Do you know what PCA is and what it does, but you are unable to derive the results yourself? Maybe you are stuck because of matrix calculus, or maybe you don’t know how to use Lagrange multipliers, but, like me, you understand things best by deriving them yourself. If you understand linear algebra, PCA can be solved without calculus. I will be using a few theorems, which will be proven at the end of this blog. My knowledge of how to solve this problem comes from the great “Linear Algebra and Its Applications” by David C. Lay[1], which I highly recommend. I tried to find similar articles or procedures, but with no success. If you know one that takes the same route as this blog, please link it in the comments (which are disabled, since I haven’t figured them out yet) and I will add it to the references.

1.1. Notation

Vectors are denoted by boldface lowercase letters like \(\bf{x}\), where we use the convention that vectors are column vectors. We assume that any arbitrary vector comes from \(\mathbb{R}^{n}\). Matrices are denoted by boldface uppercase letters like \(\bf{X}\). Scalar values are denoted as \(x\). Random variables are denoted by uppercase letters like \(X\) and random vectors are denoted as \(\mathbb{X}\). The expected value w.r.t. a random variable \(X\) is denoted as \(\mathbb{E}_{X}[ \cdot ]\). The \(\ell_{2}\)-norm is denoted as \(\Vert \bf{x} \Vert_{2}\) and its square is given by the inner product \(\Vert \bf{x} \Vert_{2}^{2} = \langle\bf{x}, \bf{x}\rangle = \bf{x}^{T}\bf{x}\).

2. Outline of the objective problem

The basic premise behind PCA is to project some high-dimensional data onto a subspace generated by a set of vectors. We can define the objective function of PCA in two ways, where the second (maximization) comes naturally from the first (minimization), so we will start by outlining the most basic definition. At the beginning we only consider one data point (a single vector) \(\bf{x} \in \mathbb{R}^{n}\) to build some intuition and to simplify the calculations. Then, we can take any arbitrarily big dataset \(\bf{X} \in \mathbb{R}^{m \times n}\) and proceed in a similar manner. We will also only look at projecting our data point onto a 1-dimensional subspace. Projecting onto higher-dimensional subspaces will be described later.

The PCA objective function tries to minimize the distance between our data vector \(\bf{x}\) and the subspace spanned by some vector \(\bf{v} \in \mathbb{R}^{n}\), which we denote as \(Span\{\bf{v}\}\). We put a constraint that \(\bf{v}\) must be a unit vector, as this ensures a unique solution. Since minimizing the squared distance gives the same result, we will use it to write our objective, which turns out to be:

\[\begin{aligned} (\hat{\bf{v}}, \hat{\alpha}):= \underset{\bf{v}, \ \alpha, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}}\Vert\bf{x}-\alpha \bf{v}\Vert_{2}^2 \\ \end{aligned}\]
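To make this objective concrete, here is a minimal numerical sketch (assuming NumPy; the particular \(\bf{x}\) and \(\bf{v}\) are made up purely for illustration) that evaluates \(\Vert\bf{x}-\alpha \bf{v}\Vert_{2}^2\) for a unit vector \(\bf{v}\) and a few values of \(\alpha\):

```python
import numpy as np

# A made-up data point and a made-up unit direction, purely for illustration.
x = np.array([3.0, 1.0])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit vector, ||v||_2 = 1

def objective(alpha):
    """Squared distance between x and the point alpha * v on Span{v}."""
    return np.sum((x - alpha * v) ** 2)

# The projection coefficient <x, v> will turn out to be the best choice.
for alpha in [0.0, 1.0, 2.0, x @ v]:
    print(f"alpha = {alpha:.3f}  ->  ||x - alpha*v||^2 = {objective(alpha):.3f}")
```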

3. Solving

3.1. Solving for \(\alpha\)

Normally we would start by solving for \(\alpha\), which is simple since it is a scalar, but then we would have to use matrix calculus to find the \(\hat{\bf{v}}\) that minimizes the objective. There is another way. First off, we will use the result of the so-called Best Approximation Theorem (1), which follows from the well-known Pythagorean Theorem. This theorem tells us that:

\[\begin{align} \Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} \le \Vert\bf{x} - \bf{y}\Vert_{2}^{2},\ \forall{\bf{y}} \in Span\{\bf{v}\} \end{align}\]

where \(\hat{\bf{x}}\) is an orthogonal projection of \(\bf{x}\) onto \(Span\{\bf{v}\}\) defined by \(\hat{\bf{x}} = \frac{\langle \bf{x}, \bf{v} \rangle}{\langle \bf{v}, \bf{v} \rangle} \bf{v}\), but we have put a constraint that \(\bf{v}\) is a unit vector, thus \(\hat{\bf{x}} = \langle \bf{x}, \bf{v} \rangle \bf{v}\).

If you look closely you will see that we have solved half of the PCA objective. Since \(\Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} \le \Vert\bf{x} - \bf{y}\Vert_{2}^{2}\) for every \(\bf{y} \in Span\{\bf{v}\}\), and \(\hat{\bf{x}}\) itself lies in \(Span\{\bf{v}\}\), we have \(\Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} = \underset{\bf{y} \in Span\{\bf{v}\}}{\operatorname{min}} \Vert\bf{x} - \bf{y}\Vert_{2}^{2}\), and so \(\hat{\alpha} = \langle \bf{x}, \bf{v} \rangle\) is the solution. Going back to our objective and substituting for \(\hat{\alpha}\) we get:

\[\begin{align} \hat{\bf{v}}: = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}}\Vert\bf{x}-\langle \bf{x}, \bf{v} \rangle \bf{v}\Vert_{2}^2 = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}}\Vert\bf{x}\Vert_{2}^2 + (\langle \bf{x}, \bf{v} \rangle)^2\Vert\bf{v}\Vert_{2}^2 - 2(\langle \bf{x}, \bf{v} \rangle)^2 \end{align}\] \[\begin{align} = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}} - (\langle \bf{x}, \bf{v} \rangle)^2 = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}(\langle \bf{x}, \bf{v} \rangle)^2 = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}\bf{v}^{T} \bf{x} \bf{x}^{T} \bf{v} \end{align}\]

Here we omit \(\Vert\bf{x}\Vert_{2}^{2}\) as it does not depend on \(\bf{v}\); since \(\bf{v}\) is a unit vector, \(\Vert\bf{v}\Vert_{2}^{2}=1\), so the middle term \((\langle \bf{x}, \bf{v} \rangle)^2\Vert\bf{v}\Vert_{2}^{2}\) combines with \(-2(\langle \bf{x}, \bf{v} \rangle)^2\) to leave \(-(\langle \bf{x}, \bf{v} \rangle)^2\). But the question “what is \(\bf{v}\)?” still remains.
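Before answering that, here is a quick numerical sanity check (a sketch with randomly generated \(\bf{x}\) and \(\bf{v}\)) of the result \(\hat{\alpha} = \langle \bf{x}, \bf{v} \rangle\) we just used: no other \(\alpha\) on a grid around it gives a smaller squared distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                    # arbitrary data point
v = rng.normal(size=5)
v /= np.linalg.norm(v)                    # enforce the unit-norm constraint

alpha_hat = x @ v                         # the claimed minimizer <x, v>
best = np.sum((x - alpha_hat * v) ** 2)

# Scan a grid of other alpha values; none should do better than alpha_hat.
alphas = np.linspace(alpha_hat - 3.0, alpha_hat + 3.0, 1001)
others = [np.sum((x - a * v) ** 2) for a in alphas]
assert best <= min(others) + 1e-12
print("alpha_hat =", alpha_hat, " minimal squared distance =", best)
```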

3.2. Expected value of random vector

So, intuitively, \(\bf{v}\) should be somehow related to our data point, but what should it be? Say we got our data point \(\bf{x}\) by means of some measurement during an experiment; then \(\bf{x}\) is a realization of some random vector \(\mathbb{X}\). We are interested not in minimizing the distance between one realization of \(\mathbb{X}\) and \(Span\{\bf{v}\}\) but in minimizing the expected value of this distance, thus the statistical PCA objective is:

\[\begin{align} (\hat{\bf{v}}, \hat{\alpha}):= \underset{\bf{v}, \ \alpha, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}}\ \mathbb{E}_{\mathbb{X}} [\ \Vert\mathbb{X}-\alpha \bf{v}\Vert_{2}^2\ ] \end{align}\]

The simplification we derived above carries over here, because the expected value is a linear operator. Thus, we get:

\[\begin{align} \hat{\bf{v}}: = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \mathbb{E}_{\mathbb{X}} [\bf{v}^{\bf{T}} \mathbb{X} \mathbb{X}^{\bf{T}} \bf{v}] = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{v}^{\bf{T}} \mathbb{E}_{\mathbb{X}} [\mathbb{X} \mathbb{X}^{\bf{T}}]\bf{v} = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{v^{T} \Sigma v} \end{align}\]

Here we take \(\bf{v}\) out of the expectation, as it is constant when the expectation is taken w.r.t. \(\mathbb{X}\). The last step comes from the definition of the covariance of a random vector, where we have assumed that \(\mathbb{E}_{\mathbb{X}}[\mathbb{X}] = \bf{0_{n}}\) and so \(\mathbb{E}_{\mathbb{X}} [\mathbb{X} \mathbb{X}^{T}] =Cov_{\mathbb{X}}(\mathbb{X}) = \bf{\Sigma}\). We know that the covariance matrix is symmetric, and because it is an expectation of the outer product \(\mathbb{X} \mathbb{X}^{T}\) it is also positive semi-definite, which means that \(\forall{\bf{x}} \in \mathbb{R}^n,\ \bf{x^{T} \Sigma x} \ge 0\). Those two properties are the key to solving the PCA objective without any calculus.
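As a small empirical check of those two properties (a sketch with synthetic, zero-mean data), the estimated second-moment matrix is symmetric and all of its eigenvalues are nonnegative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))             # 500 synthetic observations, 4 features
X -= X.mean(axis=0)                       # enforce the zero-mean assumption

Sigma = X.T @ X / X.shape[0]              # estimate of E[X X^T]

print("symmetric:        ", np.allclose(Sigma, Sigma.T))
print("eigenvalues >= 0: ", np.all(np.linalg.eigvalsh(Sigma) >= -1e-12))
```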

3.3. Orthogonal eigendecomposition of covariance matrix

Every symmetric matrix admits an orthogonal eigendecomposition; if the matrix is also positive semi-definite, then all the eigenvalues satisfy \(\lambda_{i} \ge 0, \ i \in [1, n]\). We write this decomposition as \(\bf{\Sigma} = \bf{PDP^{T}}\), where \(\bf{D}\) is a diagonal matrix with the eigenvalues as entries and \(\bf{P}\) is an orthogonal matrix, so that \(\bf{P^{T}P} = \bf{I}_{n \times n}\). The usual convention is that the eigenvectors (the columns of \(\bf{P}\)) are sorted so that their corresponding eigenvalues appear in decreasing order on the diagonal of \(\bf{D}\). It is very important to mention that those eigenvectors form an eigenbasis for \(\mathbb{R}^{n}\), as this will allow us to solve for \(\bf{v}\). With all that in mind we can write:

\[\begin{align} \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{v^{T} \Sigma v} = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{v^{T} PDP^{T} v} \end{align}\]
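In code, this decomposition can be computed (a sketch assuming NumPy) with `np.linalg.eigh`, which is meant for symmetric matrices and returns the eigenvalues in increasing order, so we flip them to match the decreasing-order convention used here:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)
Sigma = X.T @ X / X.shape[0]              # symmetric, positive semi-definite

eigvals, eigvecs = np.linalg.eigh(Sigma)  # ascending order for symmetric input
D = np.diag(eigvals[::-1])                # decreasing order on the diagonal of D
P = eigvecs[:, ::-1]                      # columns reordered to match D

print("Sigma == P D P^T :", np.allclose(Sigma, P @ D @ P.T))
print("P^T P == I       :", np.allclose(P.T @ P, np.eye(P.shape[1])))
```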

3.4. Coordinate mapping

As we have already mentioned, the columns of \(\bf{P}\) span \(\mathbb{R}^{n}\), and this allows us to use the so-called “coordinate mapping”, which we denote as \([\bf{x}] \mapsto \bf{x}\) and define by \(\bf{P} [\bf{x}] = \bf{x}\). Now, since \(Col(\bf{P}) = \mathbb{R}^{n}\), the transformation given by \(\bf{P}\) is onto and one-to-one, thus invertible, meaning that \(\bf{P^{T}} \bf{x} = [\bf{x}]\). This gives us another step forward with the objective:

\[\begin{align} \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{v^{T} PDP^{T} v} = \underset{\bf{[v]}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}} \ \bf{[v]^{T} D [v]} \end{align}\]

But how does the coordinate mapping affect the constraint \(\Vert\bf{v}\Vert_{2}=1\)? It is very easy to show that the length of \(\bf{v}\) is invariant under this mapping.

\[\begin{align} \Vert\bf{v}\Vert_{2}^{2}=\Vert\bf{P} [\bf{v}]\Vert_{2}^{2}=\bf{[v]^{T} P^{T} P [\bf{v}]=[v]^{T}} \bf{I}_{n \times n} [\bf{v}]=\Vert[\bf{v}]\Vert_{2}^{2}=1 \end{align}\]

3.5. Maximization of Quadratic forms

What we have above is called a “quadratic form”, and there exists a Theorem (2) which states that \(\underset{\bf{x}}{\operatorname{max}} \{ \bf{x^{T} A x}: \Vert\bf{x}\Vert_{2}=1 \} = \lambda_{1}\). This means that the maximum is equal to the first eigenvalue (when they are placed in decreasing order) of \(\bf{A}\). The theorem also tells us at which \(\bf{x}\) this maximum is attained, namely the first eigenvector, which we denote as \(\bf{u}_{1}\). With this theorem we can solve the objective function:

\[\begin{align} \hat{\bf{v}}:= \bf{P} \underset{\bf{[v]}, \ s.t. \ \Vert[\bf{v}]\Vert_{2}=1}{\operatorname{argmax}} \ \bf{[v]^{T} D [v]} = \bf{P}\bf{e}_{1} = \bf{u}_1 \end{align}\]

And so the vector \(\bf{v}\) that minimizes our primary objective is the first eigenvector of the covariance matrix! We can finally state our solution:

\[\begin{align} (\hat{\bf{v}}, \hat{\alpha}):= (\bf{u}_{1}, \langle \bf{x}, \bf{u}_{1} \rangle) \end{align}\]
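A numerical sanity check of this solution (a sketch with synthetic data): among many random unit vectors, none exceeds \(\lambda_{1}\) in the quadratic form, and \(\bf{u}_{1}\) attains it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
X -= X.mean(axis=0)
Sigma = X.T @ X / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)
lam1, u1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue and its eigenvector

# v^T Sigma v for 10000 random unit vectors.
trials = rng.normal(size=(10000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
values = np.einsum('ij,jk,ik->i', trials, Sigma, trials)

print("max over random unit vectors:", values.max())
print("lambda_1 =", lam1, "  u1^T Sigma u1 =", u1 @ Sigma @ u1)
assert values.max() <= lam1 + 1e-9
```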

3.6. A few words about eigenvalues

Because of the similarity-invariance of the trace, it is easy to see that \(tr(\bf{\Sigma}) = tr(\bf{PDP^{T}}) = tr(\bf{D}) = \sum_{i=1}^{n} \lambda_{i}\). This means that the eigenvalues capture the total variance of the features. Since we place them in decreasing order, the first eigenvalue captures the most variance, the second a little less, and so on. The eigenvectors are called the principal axes of the covariance matrix \(\bf{\Sigma}\), each eigenvalue measuring how much variance is captured along its axis, and the projections of the data onto those axes are the principal components.
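This is exactly what “explained variance” means in practice. A sketch (with synthetic data whose feature scales are deliberately unequal) showing that the eigenvalues sum to the trace and how each axis’s share of the variance is computed:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data with deliberately unequal feature scales.
X = rng.normal(size=(1000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
Sigma = X.T @ X / X.shape[0]

eigvals = np.linalg.eigvalsh(Sigma)[::-1]          # decreasing order
print("trace(Sigma)       =", np.trace(Sigma))
print("sum of eigenvalues =", eigvals.sum())       # equal up to rounding
print("share of variance per axis:", eigvals / eigvals.sum())
```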

4. What about the dataset \(\bf{X}\)?

4.1. A few words about \(\bf{X}\)

So, in Section 2 we defined \(\bf{X}\) to be an \(m \times n\) matrix of data points. I always treat rows as observations and columns as features. For example, if \(\bf{X}\) describes medical data, rows would be patients and columns would be different parameters, like height, age, blood pressure, etc. I know that some books and courses take a different approach and consider the dataset to be \(\bf{X^{T}}\). So, if \(\bf{x}_{i}\) is the \(i\)th observation in the dataset, we define the dataset as:

\[\begin{align} \bf{X} = \begin{pmatrix} \bf{x}_{1}^{\bf{T}} \\ \vdots \\ \bf{x}_{m}^{\bf{T}} \end{pmatrix} \in \mathbb{R}^{m \times n} \end{align}\]

4.2. Constructing the objective

How do we take the intuition from the single-data-point example and translate it to the whole dataset? The simplest and most obvious way would be to take the mean of all squared distances. Let,

\[\begin{align} (\hat{\bf{v}}, \hat{\alpha}_{i}):= \underset{\bf{v}, \ \alpha_{1}, \ldots, \alpha_{m}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmin}}\ \frac{1}{m} \sum_{i=1}^{m} \Vert\bf{x}_{i}-\alpha_{i} \bf{v}\Vert_{2}^2 \end{align}\]

be our objective function for the dataset. Using the method from 3.1. we can greatly simplify the above result:

\[\begin{align} \hat{\bf{v}}:= \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}\ \frac{1}{m} \sum_{i=1}^{m} (\langle \bf{x}_{i}, \bf{v} \rangle)^2 = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}\ \frac{1}{m} \Vert\bf{X} \bf{v}\Vert_{2}^{2} = \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}\ \frac{1}{m} \bf{v^{T} X^{T} \bf{X} v} \end{align}\]

As explained in 4.1., because we interpret the dataset as an \(observations \times features\) matrix, it can be harder to spot that \(\frac{1}{m} \bf{X^{T} X}\) is the MLE estimator \(\hat{\bf{\Sigma}}\) of the covariance matrix, where again we let \(\mathbb{E_{X}}[\mathbb{X}]=\bf{0_{n}}\). If we were to take \(\bf{X^{T}}\) as the dataset, we would write \(\frac{1}{m} \bf{X X^{T}}\), which is more in line with a lot of statistics books. And so we write:

\[\begin{align} \hat{\bf{v}}:= \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1}{\operatorname{argmax}}\bf{v^{T} \hat{\Sigma} v} \end{align}\]
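A quick check (a sketch with synthetic, centered data) that \(\frac{1}{m} \bf{X^{T} X}\) under the rows-as-observations convention agrees with NumPy's biased covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)                       # zero-mean assumption, as in the text

m = X.shape[0]
Sigma_hat = X.T @ X / m                   # (1/m) X^T X, rows = observations

# np.cov with rowvar=False also treats rows as observations; bias=True divides by m.
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))
```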

Our estimate of the covariance matrix has the same two properties as the true covariance matrix: it is symmetric and positive semi-definite. Because of that, we proceed with the orthogonal eigendecomposition, except that here the eigenvectors and eigenvalues are only estimates. This still allows us to solve the objective in the same manner as in 3.5., thus the solution is:

\[\begin{align} (\hat{\bf{v}}, \hat{\alpha}_{i}):= (\hat{\bf{u}}_{1}, \langle \bf{x}_{i}, \hat{\bf{u}}_{1} \rangle) \end{align}\]
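Putting the dataset solution together (a minimal sketch with synthetic data): the estimated first principal direction \(\hat{\bf{u}}_{1}\) and the coefficients \(\hat{\alpha}_{i} = \langle \bf{x}_{i}, \hat{\bf{u}}_{1} \rangle\), one per observation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
X -= X.mean(axis=0)

Sigma_hat = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
u1_hat = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

alpha_hat = X @ u1_hat                    # one coefficient <x_i, u1_hat> per row
print("u1_hat =", u1_hat)
print("first five coefficients:", alpha_hat[:5])
```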

5. Higher dimensional subspaces

5.1. What do we want?

As mentioned earlier, we have been projecting the data onto a 1-dimensional subspace, namely \(Span\{ \hat{\bf{u}}_{1} \}\), but we want to be able to use higher-dimensional subspaces. For example, we don’t want to project the MNIST dataset[2] onto a line, as it would result in only one feature, which, as we might expect, would not be very useful. So, what can we do about it? We have to place another constraint. Let us assume that we have found \(\hat{\bf{u}}_{1}\) and now we want to find another \(\hat{\bf{v}}\), but this time we require that it is orthogonal to the solution already found, namely we want \(\langle \hat{\bf{u}}_{1}, \bf{v} \rangle =0\). Thus,

\[\begin{align} \hat{\bf{v}}^{(2)}:= \underset{\bf{v}, \ s.t. \ \Vert\bf{v}\Vert_{2}=1 \ \land \ \langle \hat{\bf{u}}_{1}, \bf{v} \rangle =0}{\operatorname{argmax}} \bf{v^{T} \hat{\Sigma} v} \end{align}\]

We place (2) in the superscript to distinguish this solution from the first one. Theorem (2) also provides us with the solution to this problem. It states that if we require our new vector to be orthogonal to the first eigenvector, the solution must be the eigenvector corresponding to the eigenvalue second in magnitude. Mathematically speaking, \(\underset{\bf{x}}{\operatorname{max}} \{ \bf{x^{T} A x}: \Vert\bf{x}\Vert_{2}=1 \ \land \langle \bf{u}_{1}, \bf{x} \rangle =0 \} = \lambda_{2}\) and the vector attaining this maximum is the second eigenvector \(\bf{u}_{2}\). Thus,

\[\begin{align} (\hat{\bf{v}}^{(2)}, \hat{\alpha}_{i}^{(2)}):= (\hat{\bf{u}}_{2}, \langle \bf{x}_{i}, \hat{\bf{u}}_{2} \rangle) \end{align}\]
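The same numerical check as before, now with the orthogonality constraint (a sketch with synthetic data): among random unit vectors orthogonal to \(\hat{\bf{u}}_{1}\), none exceeds \(\lambda_{2}\), and \(\hat{\bf{u}}_{2}\) attains it.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 5))
X -= X.mean(axis=0)
Sigma = X.T @ X / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)
lam2, u1, u2 = eigvals[-2], eigvecs[:, -1], eigvecs[:, -2]

# Random unit vectors forced to be orthogonal to u1.
trials = rng.normal(size=(10000, 5))
trials -= np.outer(trials @ u1, u1)       # remove the component along u1
trials /= np.linalg.norm(trials, axis=1, keepdims=True)

values = np.einsum('ij,jk,ik->i', trials, Sigma, trials)
print("max over constrained unit vectors:", values.max())
print("lambda_2 =", lam2, "  u2^T Sigma u2 =", u2 @ Sigma @ u2)
assert values.max() <= lam2 + 1e-9
```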

Now the subspace \(Span\{\hat{\bf{u}}_{1}, \hat{\bf{u}}_{2}\}\) we are projecting our dataset onto is two-dimensional. Using this procedure we can find a subspace of arbitrary dimension, as long as its dimension does not exceed \(rank(\bf{X}^{T})\). Thus,

\[\begin{align} (\hat{\bf{v}}^{(j)}, \hat{\alpha}_{i}^{(j)}):= (\hat{\bf{u}}_{j}, \langle \bf{x}_{i}, \hat{\bf{u}}_{j} \rangle), \ 1 \le j \le rank(\bf{X}^{T}) \end{align}\]

5.2. Calculating new features

What are those new features that we get from PCA? How are they calculated? We have solved the objective, but what now? If we find, say, \(p\) eigenvectors, we can create a matrix out of them, which we denote as \(\bf{U} = [\hat{\bf{u}}_{i}]_{i=1}^{p} \in \mathbb{R}^{n \times p}\). The columns of this matrix generate a subspace “reduced in size”. Since the columns of \(\bf{U}\) form a basis for this “smaller” subspace, we go back to the coordinate mapping, which tells us that \(\bf{x}_{n \times 1} = \bf{U}_{n \times p} [\bf{x}]_{p \times 1}\) for any \(\bf{x}\) lying in that subspace. We attach the sizes of the related vectors and matrices so you can easily see what is happening. We now left-multiply by \(\bf{U^{T}}\) and get:

\[\begin{align} \bf{U}_{p \times n}^{\bf{T}} \bf{x}_{n \times 1} = \bf{U}_{p \times n}^{\bf{T}} \bf{U}_{n \times p} [\bf{x}]_{p \times 1} = \bf{I}_{p \times p} [\bf{x}]_{p \times 1} = [\bf{x}]_{p \times 1} \end{align}\]

The right-hand side is the observation vector with the new features, and since \(p \le n\), we have created an observation that is reduced in size. We have projected our data point onto the subspace spanned by the columns of \(\bf{U}\)! All that’s left to do is to project the whole dataset onto this subspace, and to accomplish that we left-multiply every data point \(\bf{x}_{i}\) by the matrix \(\bf{U^{T}}\).

\[\begin{align} [[\bf{x}_{1}] \ \dotsi [\bf{x}_{m}]]_{p \times m} = \bf{U}_{p \times n}^{\bf{T}} [\bf{x}_{1} \dotsi \bf{x}_{m}]_{n \times m} \end{align}\]

But \([\bf{x}_{1} \dotsi \bf{x}_{m}]_{n \times m}\) is nothing more than our transposed dataset \(\bf{X^{T}}\), and what we get is the transformed, but most importantly transposed, dataset with the new features. Transposing both sides back, we get:

\[\begin{align} [\bf{X}]_{m \times p} = \bf{X}_{m \times n} \bf{U}_{n \times p} \end{align}\]

There we go, this is exactly what PCA does: it multiplies our dataset by a matrix whose orthonormal columns are the eigenvectors corresponding to the eigenvalues capturing the most variance of the covariance matrix. Tell me how beautiful that is, such a simple yet elegant solution for the seemingly complicated operation of dimensionality reduction. We took something very abstract and converted it into a simple matrix multiplication.
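A minimal sketch of this final formula (synthetic data; \(p = 2\) is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, p = 300, 6, 2
X = rng.normal(size=(m, n))
X -= X.mean(axis=0)

Sigma_hat = X.T @ X / m
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
U = eigvecs[:, ::-1][:, :p]               # top-p eigenvectors as columns, n x p

X_reduced = X @ U                         # the new dataset, m x p
print("original shape:", X.shape, "  reduced shape:", X_reduced.shape)
```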

Note that calculating PCA this way is quite inefficient, and I don’t mean the matrix multiplication, but the eigendecomposition of the covariance matrix. Usually it is obtained by a truncated SVD; for example, the PCA implementation in sklearn[3] uses this approach. Keep in mind that the method described here is also a truncated approach, as we find the first eigenvector, then the second one, the third one, and so on. The only thing we lack is some kind of algorithm for actually finding the eigenvectors.
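For reference, here is a sketch (assuming scikit-learn is installed) comparing the eigendecomposition route with sklearn's SVD-based `PCA`[3]; the data is centered first, since sklearn centers internally, and the recovered directions may differ by an arbitrary sign flip:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 6))
Xc = X - X.mean(axis=0)                   # center, matching sklearn's behaviour

# Eigendecomposition route from this post.
Sigma_hat = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
U = eigvecs[:, ::-1][:, :2]
scores_eig = Xc @ U

# SVD-based route used by sklearn.
scores_svd = PCA(n_components=2).fit_transform(X)

# The two agree column-by-column up to a sign flip.
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd), atol=1e-6))
```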

6. Theorems

6.1. The Best Approximation Theorem (1)

Let \(\mathbb{V}\) be a subspace of \(\mathbb{R}^{n}\), let \(\bf{x} \in \mathbb{R}^{n}\) with \(\bf{x} \notin \mathbb{V}\), and let \(\bf{y} \in \mathbb{V}\). Now, let \(\hat{\bf{x}}\) be the orthogonal projection of \(\bf{x}\) onto \(\mathbb{V}\); then \(\hat{\bf{x}}\) is the closest point in \(\mathbb{V}\) to \(\bf{x}\) in the sense that

\[\begin{align} \Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} \le \Vert\bf{x} - \bf{y}\Vert_{2}^{2},\ \forall{\bf{y}} \in \mathbb{V} \end{align}\]

The proof comes from the Pythagorean Theorem. I will paste a drawing below, so it is clear what is going on. I made the drawing, but it is heavily inspired by David C. Lay’s book[1] mentioned at the beginning.

[Drawing: the orthogonal projection \(\hat{\bf{x}}\) of \(\bf{x}\) onto the subspace \(\mathbb{V}\), with the right triangles formed by \(\bf{x}\), \(\hat{\bf{x}}\), and an arbitrary \(\bf{y} \in \mathbb{V}\).]

So, the Pythagorean Theorem gives us two important relations:

\[\begin{align} \Vert\hat{\bf{x}}\Vert_{2}^{2} + \Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} = \Vert\bf{x}\Vert_{2}^{2} \end{align}\] \[\begin{align} \Vert\hat{\bf{x}} - \bf{y}\Vert_{2}^{2} + \Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} = \Vert\bf{x} - \bf{y}\Vert_{2}^{2} \end{align}\]

We can clearly see both relations in the drawing. Since \(\Vert\hat{\bf{x}} - \bf{y}\Vert_{2}^{2} \ge 0\), the second relation tells us that:

\[\begin{align} \Vert\bf{x} - \hat{\bf{x}}\Vert_{2}^{2} \le \Vert\bf{x} - \bf{y}\Vert_{2}^{2}, \ \forall y \in \mathbb{V} \end{align}\]

Here the equality holds if and only if \(\bf{y} = \hat{\bf{x}}\). And that completes the proof.

6.2. Theorem (2)

First off, as far as I know this theorem doesn’t have a name of its own, hence the lack of one here. Second, the theorem gives us the solution for both 3.5. and 5.1., and both parts will be proved, so nothing is left out. Third, we only need the maximization part of the theorem; it also provides the solution for minimization (of a quadratic form), but that part is omitted to avoid any confusion.

6.2.1. First part

Let \(\bf{A} \in \mathbb{R}^{n \times n}\) be any symmetric matrix, and define \(M = \underset{\bf{x}}{\operatorname{max}} \{ \bf{x^{T} A x}: \Vert\bf{x}\Vert_{2}=1 \}\). Then, \(M\) is the greatest eigenvalue \(\lambda_{1}\) of \(\bf{A}\). The value of \(\bf{x^{T} A x}\) is \(M\) when \(\bf{x}\) is a unit eigenvector \(\bf{u}_{1}\) of \(\bf{A}\) corresponding to \(\lambda_{1}\).

6.2.2. Proof of the first part

Since \(\bf{A}\) is symmetric it admits an orthogonal eigendecomposition \(\bf{A} = \bf{PDP^{T}}\), with \(\bf{P^{T}P} = \bf{I}_{n \times n}\). Because of that and the coordinate mapping (tackled in 3.4.) we can rewrite the quadratic form as:

\[\begin{align} \bf{x^{T} A x} = \bf{x^{T} PDP^{T} x} = \bf{[x]^{T} D [x]} \end{align}\]

Because the length of \(\bf{x}\) is invariant under coordinate mapping using orthogonal matrices, \(\bf{x^{T} A x}\) and \(\bf{[x]^{T} D [x]}\) will attain the same values as \(\bf{x}\) and \(\bf{[x]}\) range over the set of all unit vectors. The matrix \(\bf{D}\) is diagonal and if we let \([x]_{i}\) be the entries of \(\bf{[x]}\), we have,

\[\begin{align} \bf{[x]^{T} D [x]} = \sum_{i=1}^{n} \lambda_{i} [x]_{i}^{2} \le \lambda_{1} \sum_{i=1}^{n} [x]_{i}^{2} = \lambda_{1} \Vert\bf{[x]}\Vert_{2}^{2} = \lambda_{1} \Vert\bf{P^{T}x}\Vert_{2}^{2} = \lambda_{1} \Vert\bf{x}\Vert_{2}^{2} = \lambda_{1} \end{align}\]

Thus, we see that the maximum of \(\bf{x^{T} A x}\) and \(\bf{[x]^{T} D [x]}\) over all unit vectors is \(\lambda_{1}\). Since \(\lambda_{1}\) is the (1,1)-entry of \(\bf{D}\), the \(\bf{[x]}\) that attains this value is \(\bf{[x]} = \bf{e_{1}}\), the first vector of the standard basis for \(\mathbb{R}^{n}\). We recover \(\bf{x}\) by \(\bf{P [x]} = \bf{P e_{1}} = \bf{u_{1}}\). This proves the first part of the theorem.

6.2.3. Second part

Let \(\bf{A}\), \(\lambda_{1}\), and \(\bf{u_{1}}\) be as in the first part of the theorem, and define \(M_{2} = \underset{\bf{x}}{\operatorname{max}} \{ \bf{x^{T} A x}: \Vert\bf{x}\Vert_{2}=1 \ \land \bf{x^{T} u_{1}}=0 \}\). Then \(M_{2}\) is the second greatest eigenvalue \(\lambda_{2}\), and this maximum is attained when \(\bf{x}\) is an eigenvector \(\bf{u_{2}}\) corresponding to \(\lambda_{2}\).

6.2.4. Proof of the second part

We start by working only with the orthogonal constraint \(\bf{x^{T} u_{1}}=0\). We again use coordinate mapping to show that,

\[\begin{align} \bf{x^{T} u_{1}} = \bf{[x]^{T} P^{T} u_{1}} = \bf{[x]^{T}} \begin{pmatrix} \bf{u^{T}_{1}} \\ \vdots \\ \bf{u^{T}_{n}} \end{pmatrix} \bf{u_{1}} = \bf{[x]^{T}} \begin{pmatrix} \bf{u^{T}_{1} u_{1}} \\ \vdots \\ \bf{u^{T}_{n} u_{1}} \end{pmatrix} = \bf{[x]^{T}} \bf{e_{1}} = [x]_{1} = 0 \end{align}\]

So the new constraint we have placed on our quadratic form forces the first entry of \(\bf{[x]}\) to be equal to 0, i.e. \(\bf{x^{T} u_{1}}=0 \iff [x]_{1} = 0\) (in particular, \(\bf{[x]}\) can no longer be \(\bf{e_{1}}\)). Now it is very easy to show why the second part of the theorem is true, so we only write the inequality for the sum:

\[\begin{align} \sum_{i=1}^{n} \lambda_{i} [x]_{i}^{2} = \sum_{i=2}^{n} \lambda_{i} [x]_{i}^{2} \le \lambda_{2} \sum_{i=2}^{n} [x]_{i}^{2} = \lambda_{2} \Vert[\bf{x}]\Vert_{2}^{2} = \lambda_{2} \end{align}\]

The second greatest eigenvalue is attained at \(\bf{[x]} = \bf{e_{2}}\), which in turn gives us \(\bf{x} = \bf{u_{2}}\). This concludes our proof. By induction you could show that this holds for any number of orthogonality constraints.

7. References

[1] D. C. Lay, Linear Algebra and Its Applications, 5th ed., 2011.
[2] “MNIST database.” link
[3] “sklearn.decomposition.PCA.” link
