1. 10
benkuhn.net
1.

2. 5

To argue briefly the case for why you might sometimes want squared error (or at least some form of nonlinear penalization of larger errors):

It’s true in the case of automated stock trading that, say, 100 gains of \$1 equal 1 gain of \$100, so absolute measures are what you really care about. But this is not true in many estimation problems. In image labeling, robot navigation, and various other such problems, frequent small errors are preferable to the same total error concentrated in a few large ones. You’d rather have 10 images off by 5 pixels than one image off by 50 pixels, and you’d rather have frequent heading errors of 1° than the occasional heading error of 30°. Small errors are generally “ok”: the results are good enough to be useful and the errors are easy to recover from, while large errors produce Very Bad results that are not usable.
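To put numbers on the pixel example, here is a quick sketch comparing the two penalties on "10 images off by 5 pixels" versus "one image off by 50 pixels". The function names are just made up for illustration.

```python
def total_abs(errors):
    """Sum of absolute errors."""
    return sum(abs(e) for e in errors)

def total_sq(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)

many_small = [5] * 10  # 10 images, each off by 5 pixels
one_large = [50]       # 1 image off by 50 pixels

# Absolute error cannot tell the two cases apart.
print(total_abs(many_small), total_abs(one_large))  # 50 50

# Squared error rates the single large miss as 10x worse.
print(total_sq(many_small), total_sq(one_large))    # 250 2500
```

So under absolute error the two situations score the same, while squared error encodes the preference for many small misses over one big one.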

This is in a way the flip side of the robustness question. Squared error puts more emphasis on avoiding large errors (since they’re penalized quadratically), which makes it perform badly when some of those large errors are spurious (noise in the data, very rare conditions we can’t effectively do anything about, etc.). Absolute error is more robust to outliers, but in turn also makes less effort to avoid actual large errors: an estimator will happily spend a bunch of effort trying to reduce 3-pixel errors to 2-pixel errors even if this costs you an increase in major errors. That is often not what you want, since the difference between 3-pixel and 2-pixel errors is “who cares”, while the large errors are actual problems.
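A tiny way to see both sides of this tradeoff: if you summarize a dataset with a single number, the mean is the value that minimizes squared error and the median is the value that minimizes absolute error. The data below is invented for illustration, with one deliberately extreme point.

```python
import statistics

data = [1, 2, 3, 4, 100]  # hypothetical values; 100 is the extreme point

mean_est = statistics.mean(data)      # minimizer of total squared error
median_est = statistics.median(data)  # minimizer of total absolute error

def worst_error(estimate, values):
    """Largest absolute miss of a single estimate over all values."""
    return max(abs(v - estimate) for v in values)

print(mean_est, median_est)             # 22 3
print(worst_error(mean_est, data))      # 78
print(worst_error(median_est, data))    # 97
```

If the 100 is noise, the median's behavior (ignoring it) is what you want. If the 100 is a real case, the mean at least moves toward it and shrinks the worst miss, at the cost of being further from every other point.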

There are all sorts of compromises you could try, of course (and people do try them). For example, you may want to use squared error up to some threshold, beyond which it turns into a linear penalty (this is the Huber loss). You might even want to just cap the penalty at a maximum, e.g. past some point of mislabeling, the image labeling is simply “way off”.
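Both compromises are only a few lines each. This is a rough sketch: the Huber loss is written in its usual form with a threshold `delta`, and the capped version is one possible reading of "cap the penalty at a max". The default values for `delta` and `cap` are arbitrary assumptions, not recommendations.

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it.

    The linear branch is offset so the two pieces meet at |error| == delta.
    """
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)

def capped_squared(error, cap=100.0):
    """Squared error that stops growing once it reaches `cap`."""
    return min(error * error, cap)

print(huber(0.5))          # 0.125 (quadratic region)
print(huber(3.0))          # 2.5   (linear region)
print(capped_squared(5))   # 25
print(capped_squared(50))  # 100.0 (capped: just "way off")
```

Note the different tail behavior: Huber still penalizes a bigger miss more (just linearly), while the capped loss treats every miss past the cap as equally bad.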

1. 1

Thinking about the derivations for linear regression and LMS, I seem to remember that the squared error has some nice mathematical properties, but I’m too lazy to refresh my mind about them :-P