I found a neat little example in one of my introductory stats books about Bayesian versus maximum-likelihood estimation for the simple problem of estimating a binomial distribution given only one sample.

I was going to try and show the math but since Blogger is not making it possible to actually render MathML I’ll just hand-wave instead. *[Fixed in Whisper. —ed.]*

So let’s say we’re trying to estimate a binomial distribution parameterized by $\theta$, and that we’ve only seen one sample. For example, someone flips a coin once, and we have to decide what the coin’s probability of heads is.

The maximum likelihood estimate for $\theta$ is easy: if your single sample is a 1, then $\hat\theta = 1$, and if your sample is 0, $\hat\theta = 0$. (And if you go through the laborious process of writing the log likelihood, setting the derivative equal to 0, and solving it, you come up with the general rule of (# of 1’s) / (# of 1’s + # of 0’s), which is kinda what you would expect.)
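For the record, the “laborious process” goes roughly like this. The symbols are picked here for illustration: $\theta$ is the probability of a 1, $k$ is the number of 1’s, and $n$ is the total number of samples.

$$
\ell(\theta) = k \log \theta + (n - k) \log(1 - \theta),
\qquad
\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n - k}{1 - \theta} = 0
\;\Longrightarrow\;
\hat\theta = \frac{k}{n}
$$

Strictly speaking, when $k = 0$ or $k = n$ (which is always the case with a single sample) the derivative has no zero inside $(0, 1)$ and the maximum sits on the boundary, but $k/n$ still gives the right answer.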

In the coin case it seems crazy to say, I saw one head, so I’m going to assume that the coin *always* turns up heads, but that’s because of our prior knowledge of how coins behave. If we’re given a black box with a button and two lights, and you press the button, and one of the lights comes on, then maybe estimating that that light always comes on when you press the button makes a little more sense.

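Finding the Bayesian estimate is slightly more complicated. Let’s use a uniform prior. Our conditional distribution is $P(x = 1 \mid \theta) = \theta$ and $P(x = 0 \mid \theta) = 1 - \theta$, and if you work it out, the posterior ends up as $p(\theta \mid x = 1) = 2\theta$ and $p(\theta \mid x = 0) = 2(1 - \theta)$.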
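“Working it out” is just Bayes’ rule with the uniform prior $p(\theta) = 1$ on $[0, 1]$. Here is a sketch, writing $\theta$ for the probability of a 1 and $x$ for the single sample (notation assumed here):

$$
p(\theta \mid x) = \frac{P(x \mid \theta)\, p(\theta)}{\int_0^1 P(x \mid t)\, p(t)\, dt},
\qquad
p(\theta \mid x = 1) = \frac{\theta}{\int_0^1 t \, dt} = \frac{\theta}{1/2} = 2\theta,
\qquad
p(\theta \mid x = 0) = \frac{1 - \theta}{\int_0^1 (1 - t)\, dt} = 2(1 - \theta)
$$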

Now if we were in the world of classification, we’d take the MAP estimate, which is a fancy way of saying the value with the biggest probability, or the mode of the distribution. Since we’re using a uniform prior, that would end up as the same as the MLE. But we’re not. We’re in the world of real numbers, so we can take something better: the expected value, or the mean of the distribution. This is known as the Bayes estimate, and there are some decision-theoretic reasons for using it, but informally, it makes more sense than using the MAP estimate: you can take into account the entire shape of the distribution, not just the mode.

Using the Bayes estimate, we arrive at $\hat\theta = 2/3$ if the sample was a 1, and $\hat\theta = 1/3$ if the sample was a zero. So we’re at a place where Bayesian logic and frequentist logic arrive at very different answers, *even with a uniform prior*.
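Here is a small sketch of the two estimators side by side. It leans on a standard fact: with a uniform prior, the posterior after seeing some 1’s and 0’s is a Beta distribution, whose mean is (# of 1’s + 1) / (# of samples + 2). The function names `mle` and `bayes_uniform_prior` are made up for this example.

```python
from fractions import Fraction


def mle(ones, zeros):
    """Maximum likelihood estimate: the observed fraction of 1's."""
    return Fraction(ones, ones + zeros)


def bayes_uniform_prior(ones, zeros):
    """Posterior mean under a uniform prior.

    The posterior is Beta(ones + 1, zeros + 1), whose mean is
    (ones + 1) / (ones + zeros + 2).
    """
    return Fraction(ones + 1, ones + zeros + 2)


# A single sample that came up 1, then a single sample that came up 0.
for ones, zeros in [(1, 0), (0, 1)]:
    print(ones, zeros, mle(ones, zeros), bayes_uniform_prior(ones, zeros))
```

With a single sample the two disagree sharply (1 versus 2/3, or 0 versus 1/3). With more data the “+1” and “+2” matter less and less, and the estimates drift together.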

Up till now we’ve been talking about “estimation theory”, i.e. the art of estimating shit. But estimation theory is basically decision theory in disguise, where your decision space is the same as your parameter space: you’re deciding on a value for $\theta$, given your input data, and your prior knowledge, if any.

Now what’s cool about moving to the world of decision theory is that we can say: if I have to decide on a particular value for $\theta$, how can I minimize my expected cost, aka my risk? A natural choice for a cost, or loss, function is squared error. If the true value is $\theta$, I’d like to pick my estimate $\hat\theta$ in such a way that $E[(\hat\theta - \theta)^2]$ is minimized. So we don’t have to argue philosophically about MLE versus MAP versus minimax versus Bayes estimates; we can quantify how well each of them does under this framework.

And it turns out that, if you plot the risk for the MLE and for the Bayes estimate under different values of the true value $\theta$, then MOST of the time, the Bayes estimate has lower risk than the MLE. It’s only when $\theta$ is close to 0 or to 1 that MLE has lower risk.
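If you’d rather check that than take my word for it, here is a minimal sketch for the one-sample case. It assumes squared-error loss, and it hard-codes the estimates from the single-flip example (1 or 0 for the MLE, 2/3 or 1/3 for the Bayes estimate). The helper names are invented for this example.

```python
def risk(estimate_if_one, estimate_if_zero, theta):
    """Expected squared error when the true heads probability is theta.

    The single sample is 1 with probability theta and 0 otherwise, so the
    risk is a two-term weighted average of the squared errors.
    """
    return (theta * (estimate_if_one - theta) ** 2
            + (1 - theta) * (estimate_if_zero - theta) ** 2)


def risk_mle(theta):
    return risk(1.0, 0.0, theta)


def risk_bayes(theta):
    return risk(2 / 3, 1 / 3, theta)


# Scan a grid of true values and record where the Bayes estimate wins.
grid = [i / 1000 for i in range(1001)]
bayes_wins = [t for t in grid if risk_bayes(t) < risk_mle(t)]
print(min(bayes_wins), max(bayes_wins), len(bayes_wins) / len(grid))
```

On this grid the Bayes estimate wins for true values between roughly 0.09 and 0.91, and the MLE only wins in the thin slivers near 0 and 1.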

So that’s pretty cool. It seems like the Bayes estimate must be a superior estimate.

Of course, I set this whole thing up. Those “decision-theoretic reasons” for choosing the Bayes estimate I mentioned? Well, they’re theorems that show that the Bayes estimate minimizes risk. And, in fact, taking the mean of the posterior as the Bayes estimate is *specific* to squared-error loss. If we chose another loss function, we could come up with a potentially very different Bayes estimate.
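To make that concrete, here is a rough numerical sketch for the one-heads case, where the posterior density under the uniform prior is $2\theta$. It brute-forces the estimate with the smallest posterior expected loss, once for squared error and once for absolute error. Absolute error is just a second loss picked for illustration, and the function names, grid sizes, and midpoint-rule integration are all choices made for this example.

```python
def posterior_density(theta, sample):
    """Posterior under a uniform prior after one sample (1 or 0)."""
    return 2 * theta if sample == 1 else 2 * (1 - theta)


def expected_loss(estimate, sample, loss, steps=500):
    """Posterior expected loss, integrated with a simple midpoint rule."""
    width = 1 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        total += loss(estimate, theta) * posterior_density(theta, sample) * width
    return total


def best_estimate(sample, loss):
    """Grid search over candidate estimates in [0, 1]."""
    candidates = [i / 1000 for i in range(1001)]
    return min(candidates, key=lambda a: expected_loss(a, sample, loss))


def squared_error(estimate, theta):
    return (estimate - theta) ** 2


def absolute_error(estimate, theta):
    return abs(estimate - theta)


print(best_estimate(1, squared_error))   # close to the posterior mean
print(best_estimate(1, absolute_error))  # close to the posterior median
```

Squared error lands near 2/3, the posterior mean. Absolute error lands near 0.707, the posterior median (the posterior CDF is $\theta^2$, which hits 1/2 at $1/\sqrt{2}$). Same data, same prior, different loss, different estimate.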

But my intention wasn’t really to trick you into believing that Bayes estimates are awesome. (Though they are!) I wanted to show that:

- Bayes and classical approaches can come up with very different estimates, even with a uniform prior.
- If you cast things in decision-theoretic terms, you can make some real quantitative statements about different ways of estimating.

In the decision theory world, you can *customize* your estimates to minimize your particular costs in your particular situation. And that’s an idea that I think is very, very powerful.

Very interesting. But you pushed all the interesting action out to the loss function. If you do 1-0 loss — that is, you get credit if you’re right, but for everything else you’re worthless — then the mode, not the mean, of the posterior is optimal. Therefore MAP.

It’s not at all clear to me, for really general estimation settings, whether 1-0 or squared error is better.

Well 0-1 loss is crazy talk if you’re estimating a continuous value.

But it looks like it magically all works out for both cases. If you’re estimating something discrete, then 0-1 loss means MAP is optimal, and MAP is kind of the only thing you can do anyways.

If you’re doing regression against a continuous value, then squared-error loss means that EV is optimal, and that’s kind of the most natural thing to do too.

Magic.