Understanding the least-squares cost function

I've been using loss functions in machine learning like least-squares error, logistic loss, softmax loss, etc. for quite some time, but I had never dug deep into them to really understand how they were derived and the motivation behind them. Recently I took a look at Stanford's CS 229 materials and found it mind-blowing to finally gain some understanding of those common loss functions, and I want to share some of that knowledge here.

First, let's take a look at least-squares error. The least-squares cost function is commonly used in regression problems and is defined as

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where $h_\theta$ is our hypothesis function, which estimates the output $y^{(i)}$ from the input $x^{(i)}$, and the cost measures how close our estimates are across the $m$ training examples. Clearly, the higher $J(\theta)$, the further we are from estimating $y$ correctly, so we try to minimize this cost function. But why is the least-squares cost function a natural one to use in this case?
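To make the definition concrete, here is a minimal numpy sketch of this cost for a linear hypothesis $h_\theta(x) = \theta^T x$ (the function name and the toy data are my own, just for illustration):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2, with h_theta(x) = theta^T x."""
    residuals = X @ theta - y   # h_theta(x_i) - y_i for every example at once
    return 0.5 * np.sum(residuals ** 2)

# Toy data: each row of X is one example (first column is an intercept term).
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([1.0, 2.1, 2.9, 4.2])

print(cost(np.array([0.0, 2.0]), X, y))  # cost of one candidate theta
```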

Let's first assume that we model $y$ as a linear function of $x$:

$$y^{(i)} = \theta^T x^{(i)} + e^{(i)}$$

where $e^{(i)}$ is an error term that captures either random noise or unmodeled effects. We can also assume that the $e^{(i)}$ are IID according to a Gaussian distribution with mean $0$ and variance $\sigma^2$. With that said,

$$p(e^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(e^{(i)})^2}{2\sigma^2}\right)$$

or in other words, since $e^{(i)} = y^{(i)} - \theta^T x^{(i)}$, we can substitute for $e^{(i)}$ in the equation above:

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
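As a quick sanity check on that density, here is a small sketch (a known $\sigma$ and the helper name `gaussian_density` are my own assumptions):

```python
import numpy as np

def gaussian_density(y_i, x_i, theta, sigma=1.0):
    """p(y_i | x_i; theta) when y_i = theta^T x_i + e_i, with e_i ~ N(0, sigma^2)."""
    mean = x_i @ theta  # the model's prediction theta^T x_i
    return np.exp(-(y_i - mean) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# The density peaks when y_i equals the prediction theta^T x_i (= 3.5 here).
x_i = np.array([1.0, 2.0])
theta = np.array([0.5, 1.5])
print(gaussian_density(3.5, x_i, theta))  # y_i exactly at the mean: highest density
print(gaussian_density(5.0, x_i, theta))  # further from the mean: lower density
```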
In other words, $p(y^{(i)} \mid x^{(i)}; \theta)$ is a normal distribution with mean $\theta^T x^{(i)}$. Intuitively, we want to maximize this quantity, since we want our model to estimate the value of $y$ correctly. Let $L(\theta)$ be the likelihood function of $\theta$, as our original goal is a "good" $\theta$ (one that maximizes the likelihood of the observed $y$ given the inputs $x$). Since we assume the $e^{(i)}$ are IID, the likelihood factors into a product over the training examples:

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$

We would like to maximize this quantity (maximize the likelihood), and this is equivalent to maximizing its logarithm, since $\log$ is strictly increasing:

$$\ell(\theta) = \log L(\theta) = m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

Since the first term, $m\log\frac{1}{\sqrt{2\pi}\,\sigma}$, is a constant, maximizing this function is equivalent to minimizing the sum in the second term, which (with $h_\theta(x^{(i)}) = \theta^T x^{(i)}$) is exactly our familiar least-squares cost function:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
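The whole argument can also be checked numerically: the $\theta$ returned by a least-squares solver should score a higher log-likelihood than any nearby $\theta$. A small sketch with simulated data (the data, seed, and an assumed known $\sigma$ are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

# Simulate y = theta_true^T x + e, with Gaussian noise e ~ N(0, sigma^2).
theta_true = np.array([1.0, 3.0])
X = np.column_stack([np.ones(200), rng.uniform(0.0, 5.0, 200)])
y = X @ theta_true + rng.normal(0.0, sigma, 200)

def log_likelihood(theta):
    """l(theta) = m*log(1/(sqrt(2*pi)*sigma)) - sum of squared residuals / (2*sigma^2)."""
    m = len(y)
    residuals = y - X @ theta
    return m * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) \
        - np.sum(residuals ** 2) / (2.0 * sigma ** 2)

# theta minimizing the sum of squared residuals, via an off-the-shelf solver.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The least-squares solution attains a higher log-likelihood than perturbed thetas.
print(theta_hat, log_likelihood(theta_hat))
print(log_likelihood(theta_hat + np.array([0.1, -0.1])))
print(log_likelihood(theta_hat - np.array([0.1, -0.1])))
```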
So as we can see, the least-squares cost function arises very naturally once we view the problem through the lens of probability. I'm looking forward to sharing more fascinating things in the next post.

Reference:
Andrew Ng, Stanford CS 229 lecture notes.
