In a previous post I described how you can cook up a Bayesian framework that results in IMDB’s so-called “true Bayesian estimate”, a formula which, on its face, doesn’t look particularly Bayesian.
As my astute commenters pointed out, this formula has many simpler interpretations without needing to invoke the B word. For example, it’s a linear interpolation between two values:

$$\hat{x} = w(n)\,\bar{x} + (1 - w(n))\,s,$$

where $\bar{x}$ is our mean vote, $s$ is some smoothing target, and $w(n)$ is the smoothing weight. $w$ can be any function of the number of votes $n$, as long as it increases with $n$, stays between 0 and 1, and is 0 when $n$ is 0. Those constraints give you the right behavior: with no votes, your estimate is exactly $s$; as you add votes, it approaches $\bar{x}$, and $w$ controls how fast that happens.
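To make this concrete, here’s a minimal sketch in Python, using the IMDB-style weight $w(n) = n/(n+m)$; the function name and the value of `m` are made up for illustration:

```python
def smoothed_estimate(mean_vote, n_votes, target, m=25):
    """Linear interpolation between the smoothing target and the observed mean.

    Uses w(n) = n / (n + m), which satisfies all three constraints:
    it's 0 when n is 0, it increases with n, and it stays between 0 and 1.
    The constant m (arbitrary here) controls how fast w approaches 1.
    """
    w = n_votes / (n_votes + m)
    return w * mean_vote + (1 - w) * target

# With no votes, the estimate is exactly the target:
print(smoothed_estimate(0.0, 0, 6.9))    # 6.9
# As votes accumulate, the estimate moves toward the observed mean:
print(smoothed_estimate(8.2, 5, 6.9))    # ~7.12
print(smoothed_estimate(8.2, 500, 6.9))  # ~8.14
```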
This formulation naturally leads to the following question: if I’m smoothing like this to deal with paucity-of-data issues, what value of $s$ should I pick? IMDB uses $C$, the global mean vote across all movies. Intuitively that makes sense, but is it the right choice?
What’s nice about the expression for $\hat{x}$ above is that the behavior we’re most interested in is when $n = 0$, i.e. when there are no votes. In that case, $\hat{x} = s$, because of how I’ve constrained $w$. So finding the best $s$ is equivalent to finding the best $\hat{x}$ when $n = 0$.
Happily, we can answer the question of the best $\hat{x}$ analytically, at least if we’re willing to imagine that there is a “true” value $x$ of the movie.
Given $x$, we can define a loss function $L(\hat{x}, x)$ that describes how bad we think a particular estimate $\hat{x}$ is. But we don’t really know what $x$ is for any movie (if we did, we wouldn’t be bothering with any of this). So we can generalize a step further and define a risk function quantifying our expected loss: $R(\hat{x}) = E_x[L(\hat{x}, x)]$, the aggregate of the loss function across all possible values of $x$, weighted by the probability of each value. This gives us the tool we really need to answer the question above: the $\hat{x}$ that minimizes our risk is the winner.
In the absence of any specific notions about errors, we’ll use the standard loss function for reals, squared-error loss: $L(\hat{x}, x) = (\hat{x} - x)^2$. Then it’s just a matter of turning the crank:

$$R(\hat{x}) = E_x\left[(\hat{x} - x)^2\right] = E[x^2] - 2\hat{x}E[x] + \hat{x}^2.$$

We can drop that first term, since we’re only interested in minimizing this as a function of $\hat{x}$. To find the minimum:

$$\frac{d}{d\hat{x}}\left(\hat{x}^2 - 2\hat{x}E[x]\right) = 2\hat{x} - 2E[x] = 0 \quad\Longrightarrow\quad \hat{x} = E[x].$$
Unsurprisingly, we see that the best estimate of $x$ under squared-error loss is the mean of the distribution of $x$. Since we’re interested in the case where $n = 0$, this implies that the best value to use for $s$ is also the mean.
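As a quick sanity check, here’s a sketch that computes the risk under squared-error loss over a made-up discrete distribution of true ratings, and confirms numerically that the minimizer is the mean (the distribution here is entirely hypothetical):

```python
# Hypothetical discrete distribution over the "true" rating x:
values = [2.0, 5.0, 7.0, 9.0]
probs = [0.1, 0.3, 0.4, 0.2]

def risk(x_hat):
    """Expected squared-error loss: E[(x_hat - x)^2]."""
    return sum(p * (x_hat - x) ** 2 for x, p in zip(values, probs))

mean = sum(p * x for x, p in zip(values, probs))
# Scan a fine grid of candidate estimates and pick the risk minimizer:
grid = [i / 100 for i in range(1001)]
best = min(grid, key=risk)
print(mean, best)  # both 6.3: the risk minimizer is the mean
```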
So IMDB’s choice of $s = C$ makes sense: the mean vote over all your movies is a great estimate of the mean of the distribution of $x$.
A couple concluding points:
- This answer is specific to squared-error loss; if you plug in another loss function, the optimal value for $s$ might very well change. And you might actually have a specific model in mind for how “bad” mis-estimates are: maybe over-estimates are worse than under-estimates, or something like that.
- The definition of the distribution of $x$ is left completely vague above. In fact, we don’t even talk about it; we just use it implicitly in our $E[x]$ terms. So you should feel free to plug in (the mean of) whatever distribution you believe most accurately represents your product/movie/whatever. IMDB could arguably do better by plugging in per-category means, or something even fancier.
- IMDB is actually a particularly bad case because movie opinions are extremely subjective. If you’re serious about modeling very subjective things, you should be talking about multinomial models, Dirichlet priors, and the like.
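To illustrate the first point above, here’s a small sketch showing how an asymmetric loss shifts the risk-minimizing estimate away from the mean; the distribution and the 5× penalty are made up for illustration:

```python
# Hypothetical discrete distribution over the "true" rating x:
values = [2.0, 5.0, 7.0, 9.0]
probs = [0.1, 0.3, 0.4, 0.2]

def risk(x_hat, loss):
    """Expected loss of estimate x_hat under the given loss function."""
    return sum(p * loss(x_hat, x) for x, p in zip(values, probs))

def squared(e, x):
    return (e - x) ** 2

def asymmetric(e, x):
    # Made-up loss: over-estimates cost 5x more than under-estimates.
    return 5 * (e - x) ** 2 if e > x else (e - x) ** 2

grid = [i / 100 for i in range(1001)]
best_sq = min(grid, key=lambda e: risk(e, squared))
best_asym = min(grid, key=lambda e: risk(e, asymmetric))
print(best_sq, best_asym)  # the asymmetric optimum sits below the mean
```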
But the take-home message is: in the absence of a specific loss function that you really believe, smoothing towards the mean isn’t just intuitive; it’s minimizing your risk.