In an overdetermined linear system, there are typically more equations than unknowns, so an exact solution to
\(Ax=b\) may not exist. Least squares replaces “solve exactly” with “fit as closely as possible” by minimizing the
squared error:
\[
\min_x \ \lVert Ax-b\rVert^2.
\]
The vector \(r=Ax-b\) is called the residual. Least squares chooses \(x\) so that the residual norm
\(\lVert r\rVert\) is as small as possible. A standard derivation sets the gradient of \(\lVert Ax-b\rVert^2\) to zero,
which leads to the normal equations:
\[
\begin{aligned}
\lVert Ax-b\rVert^2 &= (Ax-b)^{\mathsf T}(Ax-b) \\
\frac{\partial}{\partial x}\lVert Ax-b\rVert^2 &= 2A^{\mathsf T}(Ax-b)=0 \\
A^{\mathsf T}A\,x &= A^{\mathsf T}b.
\end{aligned}
\]
Interpretation: the residual at the solution is orthogonal to the columns of \(A\). This is a projection onto the column space of \(A\).