Gradient descent gives one way of minimizing J. Another way is the normal equation, which minimizes J explicitly rather than through an iterative algorithm:
θ = (XᵀX)⁻¹Xᵀy
There is no need to do feature scaling with the normal equation.
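As a sketch, the normal equation can be computed directly. The snippet below uses NumPy rather than the course's Octave, and the toy design matrix and target values are made up for illustration:

```python
import numpy as np

# Toy data (invented for illustration): one row per training example,
# with a leading column of ones for the intercept term.
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0, 1534.0],
              [1.0,  852.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically preferable to forming the
# inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that the raw feature values are used unscaled, consistent with the remark above that the normal equation needs no feature scaling.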
The following is a comparison of gradient descent and the normal equation:
Gradient Descent:
- Need to choose alpha
- Needs many iterations
- O(kn²)
- Works well when n is large

Normal Equation:
- No need to choose alpha
- No need to iterate
- O(n³): need to calculate the inverse of XᵀX
- Slow if n is very large

With the normal equation, computing the inversion has complexity O(n³). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.
When implementing the normal equation in Octave we want to use the 'pinv' function rather than 'inv'. The 'pinv' function will give you a value of θ even if XᵀX is not invertible.
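A minimal sketch of the difference, in NumPy rather than Octave (`np.linalg.pinv` plays the role of Octave's pinv; the data is invented, with a duplicated feature column so that XᵀX is singular):

```python
import numpy as np

# Two identical feature columns make X^T X noninvertible.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])

A = X.T @ X
# np.linalg.inv(A) would fail (or return garbage) here because A is
# singular; pinv returns the Moore-Penrose pseudoinverse instead,
# giving the minimum-norm solution of the normal equations.
theta = np.linalg.pinv(A) @ X.T @ y
```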
If XᵀX is noninvertible, the common causes might be:

- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).

Solutions to the above problems include deleting a feature that is linearly dependent with another, or deleting one or more features when there are too many.
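The two causes above can be checked numerically by looking at the rank of XᵀX. This is an illustrative sketch (ours, in NumPy rather than Octave) with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Cause 1: redundant (linearly dependent) features.
# Here column 2 is exactly twice column 1, so rank(X^T X) < n.
m, n = 5, 3
X = rng.standard_normal((m, n))
X[:, 2] = 2.0 * X[:, 1]
A = X.T @ X                # singular: rank 2 < n = 3

# Fix: delete the redundant feature; X^T X is full rank again.
X_fixed = X[:, :2]

# Cause 2: too many features (m <= n).
# With 2 examples and 3 features, rank(X^T X) <= m = 2 < n = 3.
X_small = rng.standard_normal((2, 3))
```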