Showing only posts with topic "stats" [.rss for this topic]. See all posts.

How to rank products based on user input

This is a post I’ve been meaning to write for over a year, since I first came across an article entitled How Not to Sort by Average Rating (made popular on, amongst other places, Hacker News). Recent mentions of that article have finally spurred me to finish this off. Well, better late than never.

The Context

Let’s say you have a lot of things. Let’s say your users can vote on how much they like those things. And let’s say that you want to rank those things, so that you can list them by how good they are. How should you do it?

You can find examples of this pattern all over the web. Amazon ranks products, Hacker News ranks links, and Yelp ranks businesses, all based on reviews or votes from their users.

These sites all rank the same way: they compile user votes into a score, and items are ranked by that score. So how the score is calculated determines entirely how the items are ranked.1

So how might we generate such a score? The most obvious approach is to simply use the average of the users’ votes. And, in fact, that is what these sites do. (Hacker News also mixes in a time-based measure of freshness.)

Unfortunately, there’s a big problem with using the average: when you only have a few votes, the average can take on extreme values. For example, if you have only one vote, the average is that vote. (And if you have no votes… well, what do you do then?) The result is that low-vote items can easily occupy the extreme points on the list—at the very top or very bottom.

You see this problem every time you go to Amazon and sort by average customer review. The top-most items are a morass of single-vote products, and you have to wade through them before you get to the good stuff.

So what would be a better ranking? Well, we’d probably like to see an item with 50 5-star votes before we see an item with a single 5-star vote. Intuitively speaking, the more votes, the more certain we are of an item’s usefulness, so we should be focusing our attention on high-score, high-vote-count items.

And, although we’re less likely to be concerned with what’s at the bottom of the list, the symmetric case also seems right: an item with 50 1-star votes should be ranked lower than one with a single 1-star vote, since the 50-vote one is more certain to be bad.

Taken together, these two observations suggest that the best ordering is one where the low-vote items are placed somewhere in the middle, and where high-confidence items occupy the extremes—good at the top, and bad at the bottom.

(We could develop this argument more formally, by talking about things like expected utility / risk, but I’m going to leave it just intuitive for this post.)

The Problem with Confidence Intervals

The “How Not To” article linked above suggests that the right way to rank products, avoiding the problem with the average, is to construct a confidence interval around the average vote, and rank by the lower bound of that interval. This confidence interval has the following behavior: when the amount of data is large, it is a tight bound around the average vote; when the amount of data is small, it is a very loose bound around the average.

Taking the lower bound of the confidence interval is fine when the number of votes for an item is large: the lower bound will be close to the average itself. But, just as with the average, the problem with this approach occurs when the number of votes is small. In this case, the lower bound of the confidence interval will be very low.

The result is that new items with few votes will always be ranked alongside the bad items with many votes, at the bottom of the list. In other words, if you rank by the lower bound of a confidence interval, you will rank an item with no votes alongside an item with 1,000 bad votes.

If you use this approach, every new item will start out at the bottom of the list.

Enter Bayesian Statistics

Is there a better alternative? Fortunately, this kind of paucity-of-data problem is tailor-made for Bayesian statistics.

A Bayesian approach gives us a framework to model not only the votes themselves, but also a prior belief about what we think an item looks like, before seeing any votes. We can use this prior to do smoothing: take the average vote, which can be jumpy, and “smooth” it towards the prior, which is steady. And the smoothing is done in such a way that, with a small number of votes, the score mostly reflects the prior, and as more votes arrive, the score moves towards the average vote. In other words, the votes are eventually able to “override” the prior when there are enough of them.

Being able to incorporate a prior has two advantages. First, it gives us a principled way of modeling what should happen with zero votes. We are no longer at the mercy of the confidence interval; we can decide explicitly how low-vote products should be treated. Second, it provides a convenient mechanism for us to plug in any extra information that we have on hand. This extra information becomes the prior belief.

But what kind of extra information do we have? Equivalently, how do we determine the prior? Consider: when you decide to watch a Coen brothers movie, you make that decision based on past Coen brothers movies. When you buy a Sony product, you make a decision based on what you know of the brand.

In general, when you know nothing about an item, you can generalize from information about related items, items from similar sources, etc. We will do the same thing to create a prior.

Let’s see how we can use Bayesian statistics to do smoothing towards a meaningful prior.

Solution #1: the “True Bayesian Average”

The first solution, made popular by IMDB, is the so-called “true Bayesian average” (although so far as I know that terminology does not actually come from statistics). Using TBA, to compute a score $s$, we do:

$$s = \frac{C m + n \bar{v}}{C + n}$$

where $\bar{v}$ is the average vote for the item, $n$ is the number of votes, $m$ is the smoothing target, and $C$ is a tuning parameter that controls how quickly the score moves away from $m$ as the number of votes increases. (You can read more about the Bayesian interpretation of the ‘true Bayesian average’.)

This formula has a nice interpretation: $C$ is a pseudo-count, a number of “pseudo” votes, each for exactly the value $m$. These votes are automatically added to the votes for every item, and then we take the average of the pseudo-votes and “real” votes combined.

In this formula, the prior is $m$, the smoothing target. What value should we choose for it? It turns out that if we set $m$ to the average vote over all items (or over some representative class of items), we get the behavior we wanted above: low-vote items start life near the middle of the herd, not near the bottom, and make their way up or down the list as the votes come in. (You can read a more rigorous argument about why using the global average is a good target for smoothing.)

The TBA is easy to implement, and it’s trivial to adapt an existing system that already uses average votes to use it.
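For concreteness, here’s a minimal Ruby sketch of that formula, using the variable names from above (the example inputs are made up):

## true Bayesian average: v_avg is the item's average vote, n its vote
## count, m the smoothing target, and c the pseudo-count tuning parameter
def tba v_avg, n, m, c
  (c * m + n * v_avg) / (c + n).to_f
end

tba 5.0, 1, 3.0, 10   #=> ~3.18: a single 5-star vote barely moves the score
tba 5.0, 50, 3.0, 10  #=> ~4.67: fifty 5-star votes largely override the prior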

But we can do better.

The Problem with the “True Bayesian Average”

The problem with the TBA is that it assumes a Normal distribution over user votes.

Taken at face value, we know this assumption is bad for two reasons: one, we have discrete, not continuous, votes, and two, we have no actual expectation that votes will be normally distributed, except in the limit. But in reality, neither of these are significant problems. We will always have some modeling assumptions of dubious correctness for the sake of tractability, and these are within the realm of plausibility.

The bigger problem with the assumption of Normality is that it forces us to model items as if they had some “true” score, sitting at the mean of a Normal distribution. But we know some items are simply divisive. Some Amazon products accrue large numbers of both 5-star and 1-star reviews. Some movies are loved and hated in equal measure. Being able to model those kinds of distributions accurately would be to our advantage, and a Normal distribution won’t let us do that.

Ultimately, of course, we need to produce an ordering, so we need to condense everything we know about an item into a single value.2 But it would be to our advantage to do this in an explicit, controllable manner.

This suggests we would like a solution which decomposes scoring into three parts:

  1. Our prior beliefs about an item;
  2. The users’ votes on an item; and
  3. The mapping between the vote histogram and a score.

Such a system could both account for paucity-of-data problems, and provide us with explicit control on how the items are ranked.

Let’s see how we can do this.

Solution #2: Dirichlet Priors and an Explicit Value Function

To accomplish these goals, we have to forsake the normal distribution for the multinomial. A multinomial model will let us represent the complete histogram of the votes received for each item. For example, for Amazon products, a multinomial model will capture how many one-star votes, how many two-star votes, and so on, a product had. If there are $k$ types of votes that users can assign to an item, we can parameterize (i.e. fully specify) the corresponding multinomial with a $k$-dimensional vector.

(I’m glossing over one technicality, which is that a multinomial really measures only the relative proportion, not the actual counts, of the histogram. But this detail won’t matter in our case.)

To fit the multinomial into a Bayesian framework, we will model our prior belief as a Dirichlet distribution. Just as a multinomial distribution is described by a single $k$-dimensional vector, a Dirichlet distribution is a probability distribution over all such $k$-dimensional vectors. In effect, it is a distribution over all possible vote histograms for an item.

We use the Dirichlet because it is a conjugate prior of the multinomial. This means that when we use Bayes’s rule to combine the Dirichlet with the multinomial (we’ll see how to do this below), the resulting distribution is also a Dirichlet. This property is very convenient because it allows us to keep the representation to a single form, and use the same technique iteratively—we start with a Dirichlet, and every new set of votes we incorporate leaves us with a Dirichlet.

The Dirichlet is a complicated distribution. Luckily, the properties we’re interested in make it very simple to use.

For one, it is parameterized by a histogram just as well as a multinomial is. That is, if we write $\mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$ for a Dirichlet distribution and $\mathrm{Mult}(n_1, \ldots, n_k)$ for a multinomial, then $\mathrm{Mult}(7, 3, \ldots)$ describes a multinomial distribution corresponding to a vote histogram for a particular item with 7 one-star votes, 3 two-star votes, etc., and $\mathrm{Dir}(7, 3, \ldots)$ describes a Dirichlet. Of course, the meaning of the Dirichlet is quite different from the meaning of the multinomial, and for the sake of brevity we won’t go into how to interpret it here. But for our purposes, this is a nice property because it means that specifying a Dirichlet given a vote histogram is trivial.

The other very handy property of Dirichlets is that when they’re combined with multinomials using Bayes’s Rule, not only is the result a Dirichlet, it’s a Dirichlet that’s easily specifiable in terms of the two input distributions. Recall that Bayes’s Rule states:

$$P(M \mid V) = \frac{P(V \mid M)\, P(M)}{P(V)}$$

where $V$ is the set of observed votes and $M$ is a possible model of the item. In our case, that means that $P(M)$ is our prior belief, $P(V \mid M)$ is what our actual votes look like given the model, $P(M \mid V)$ is our updated model, and $P(V)$ is some normalizing constant, which we can ignore for now.

If we call our prior belief the Dirichlet $\mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$ and our conditional distribution the multinomial $\mathrm{Mult}(n_1, \ldots, n_k)$, and skip over all the difficult math, then we can turn Bayes’s rule into a rule for updating our model:

$$\mathrm{Dir}(\alpha_1, \ldots, \alpha_k) \times \mathrm{Mult}(n_1, \ldots, n_k) \;\longrightarrow\; \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_k + n_k)$$

In other words, to create the posterior (that is, resulting) distribution, all we need to do is add the two input histograms. (Note that $P(V)$ falls away.)

So now we have a way of taking our prior information, incorporating the user votes, and finding the resulting distribution. The resulting distribution is also a vote histogram for an item, smoothed, just as in the TBA case, towards the prior belief histogram. (And, in fact, the same pseudo-count analogy applies: the $\alpha_i$ are the pseudo-votes, and the $n_i$ the real votes, and we’re simply adding them together. Neat, huh?)

But what do we do with that?

The final step of the puzzle is to transform this distribution into a single score for ranking. The best way to do this is to take a function that describes the score of a particular vote histogram, and compute the expected value of that function under our distribution. The expected value represents the function as evaluated over every possible value in the distribution, weighted by the probability of seeing that value. In effect, it captures the function as applied to the entire distribution, and packages it up nicely into a single number for us, ready to be used as a score.

To compute the expected value, in general, you are required to solve a nasty integral. Happily, we can take advantage of one final property of the Dirichlet, which is that, if your Dirichlet is parameterized by $(\alpha_1, \ldots, \alpha_k)$, the expected value of the proportion of votes in category $i$ is simply:

$$E[p_i] = \frac{\alpha_i}{\sum_{j=1}^{k} \alpha_j}$$

In other words, the expected value of the proportion of votes in the $i$th bucket is simply that parameter’s proportion of the total sum of the parameters.

If we stick to a linear scoring function, we can take advantage of the linearity of expectation and use this result, avoiding anything more complicated than multiplication.

For example, sticking with our Amazon case, let’s use the “obvious” function:

$$f(p_1, \ldots, p_5) = \sum_{i=1}^{5} i \cdot p_i$$

where we give each one-star vote a 1, each 2-star vote a 2, and so on, and just sum these up to produce a score. Because this is linear, the expected value of this function under our Dirichlet is simply:

$$E[f] = \sum_{i=1}^{5} i \cdot \frac{\alpha_i}{\sum_{j=1}^{5} \alpha_j}$$

Of course, this simple function has many issues. For example, are these weights really what you want in practice? They imply that a five-star score is worth exactly 5 times a one-star score, which may not be the case. And it does not do anything special with “divisive” items, which you might want to up- or down-rank.

But, this framework will allow you to plug in any function you want at that point, with the caveat that non-linear functions may involve some nasty integration.3

So there you have it! We’ve achieved all three of our desired goals. We can take a prior belief, any number of user votes (including none at all), and a custom scoring function, and combine them all to produce a ranking.

The final question is what prior belief we should use. Intuitively (and reasoning by analogy from the TBA case above), if we use the mean vote histogram over all items, scaled down by some tuning parameter, we should get the desired behavior introduced in the first section, where low-vote items start near the middle of the list, high-vote high-score items are at the top, and high-vote low-score items at the bottom. Proof of the correctness of this statement is left as an exercise to the reader. (Hint: see the related blog post.)

At Long Last, Some Code

Let’s wrap it up with some Ruby code. If you’ve skipped to this point, congratulations—all of the above was a very long, very discursive way of arriving at something that is, in fact, very simple:

## assumes 5 possible vote categories, but easily adaptable

DEFAULT_PRIOR = [2, 2, 2, 2, 2]

## input is a five-element array of integers
## output is a score between 1.0 and 5.0
def score votes, prior=DEFAULT_PRIOR
  posterior = votes.zip(prior).map { |a, b| a + b }
  sum = posterior.inject { |a, b| a + b }
  posterior.
    map.with_index { |v, i| (i + 1) * v }.
    inject { |a, b| a + b }.
    to_f / sum
end

If you play around with this method, you can see:

  • With no votes, you get a score in the middle of the range, 3.0.
  • As you add high scores, the value increases, and as you add low scores, it decreases.
  • If the score is, say, 4.0, adding more 4-star votes doesn’t change it.
  • If you make the default prior bigger, you need more votes to move away from 3.0.
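Concretely, here’s the kind of session that list describes (the vote arrays are made up; the scores are what the method above returns):

score [0, 0, 0, 0, 0]    #=> 3.0: no votes, so the prior alone decides
score [0, 0, 0, 0, 10]   #=> 4.0: ten 5-star votes pull the score up
score [10, 0, 0, 0, 0]   #=> 2.0: ten 1-star votes pull it down
score [0, 0, 0, 10, 0]   #=> 3.5: ten 4-star votes; more of them push it toward (but never past) 4.0
score [0, 0, 0, 0, 10], [4, 4, 4, 4, 4]  #=> ~3.67: a heavier prior needs more votes to budge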

Enjoy!

1 Strictly speaking, there’s no reason you need to generate an intermediate score if all you’re interested in is a ranking. You could generate an ordering of the items directly and skip the middle step. But on these sites, the intermediate score value has useful semantics and is exposed to the user anyways.

2 Basically. Don’t get nitpicky with me.

3 You can probably get away with just running the function on the posterior histogram directly, i.e. using a “point estimate” instead of an expected value. Be sure to understand the implications.

Smoothing users’ votes

In a previous post I describe how you can cook up a Bayesian framework that results in IMDB’s so-called “true Bayesian estimate”, a formula which, on its face, doesn’t look particularly Bayesian.

As my astute commenters pointed out, this formula has many simpler interpretations without needing to invoke the B word. For example, it’s a linear interpolation between two values:

$$s = \bigl(1 - w(n)\bigr)\, C + w(n)\, \bar{v}$$

where $\bar{v}$ is our mean vote, $C$ is some smoothing target, and $w(n)$ is the smoothing weight. $w$ can be any function of the number of votes $n$, as long as it increases with $n$, stays between 0 and 1, and is 0 when $n$ is 0. Those constraints give you the right behavior: with no votes, your estimate is exactly $C$; as you add votes, it approaches $\bar{v}$, and $w$ controls how fast that happens.

This formulation naturally leads to the following question: if I’m smoothing like this to deal with paucity-of-data issues, what value of $C$ should I pick? IMDB uses the global movie mean. Intuitively that makes sense, but is it the right choice?

What’s nice about the expression for $s$ above is that the behavior we’re most interested in is when $n = 0$, i.e. when there are no votes. In that case, $s = C$, because of how I’ve constrained $w$. So finding the best $C$ is equivalent to finding the best $s$ when $n = 0$.

Happily, we can answer the question of the best $C$ analytically, at least if we’re happy to imagine that there is a “true” value of the movie, which I’ll call $\mu$.

Given $\mu$, we can define a loss function that describes how bad we think a particular estimate $s$ is. But we don’t really know what $\mu$ is for any movie (if we did, we wouldn’t be bothering with any of this). So we can generalize that a step further and define a risk function quantifying our expected loss: the aggregate of the loss function across all possible values of $\mu$, weighted by the probability of each value. This gives us the tool we really need to answer the question above: the $s$ that minimizes our risk is the winner.

In the absence of any specific notions about errors, we’ll use the standard loss function for reals, squared-error loss: $L(\mu, s) = (\mu - s)^2$. Then it’s just a matter of churning the crank:

$$R(s) = E_\mu\bigl[(\mu - s)^2\bigr] = E[\mu^2] - 2\, s\, E[\mu] + s^2$$

We can drop that first term since we’re only interested in minimizing this as a function of $s$. To find the minimum:

$$\frac{d}{ds}\bigl(-2\, s\, E[\mu] + s^2\bigr) = -2\, E[\mu] + 2 s = 0 \quad\Longrightarrow\quad s = E[\mu]$$

Unsurprisingly, we see that the best estimate of $\mu$ under squared-error loss is the mean of the distribution of $\mu$. Since we’re interested in the case where $n = 0$, this implies that the best value to use for $C$ is also that mean.

So IMDB’s choice of $C$ makes sense: the mean vote over all your movies is a great estimate of the mean of the distribution of $\mu$.
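If you’d rather see this numerically than by calculus, here’s a tiny Ruby check against a made-up distribution of “true” movie scores; a grid search over candidate values of $s$ lands on the value closest to the mean:

## hypothetical "true" scores mu for a handful of movies
mus  = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0]
mean = mus.inject { |a, b| a + b } / mus.size

## squared-error risk of guessing s, averaged over the mus
risk = lambda { |s| mus.map { |mu| (mu - s)**2 }.inject { |a, b| a + b } / mus.size }

best = (1..50).map { |i| i / 10.0 }.min_by { |s| risk.call(s) }
puts mean  #=> 3.666...
puts best  #=> 3.7 (the grid point closest to the mean)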

A couple concluding points:

  1. This answer is specific to squared-error loss; if you plug in another loss function, the optimal value for $C$ might very well change. And you might actually have a specific model in mind for how “bad” mis-estimates are. Maybe over-estimates are worse than under-estimates, or something like that.
  2. The distribution of $\mu$ is actually left completely vague above. In fact we don’t even talk about it; we just use it implicitly in our expectation terms. So you should feel free to plug in (the mean of) whatever distribution you believe most accurately represents your product/movie/whatever. IMDB could arguably do better by plugging in per-category means, or something even fancier.
  3. IMDB is actually a particularly bad case because movie opinions are extremely subjective. If you’re serious about modeling very subjective things, we should be talking about multinomial models, Dirichlet priors, and the like.

But the take-home message is: in the absence of a specific loss function that you really believe, smoothing towards the mean isn’t just intuitive, it’s minimizing your risk.

Understanding the “Bayesian Average”

IMDB rates movies using a score they call the true Bayesian estimate (bottom of the page). I’m pretty sure that’s a made-up term. A couple other sites, like BoardGameGeek, use the same thing and call it a “Bayesian average”. I think that’s a made-up term, too, even though there’s a Wikipedia article on it.

Nonetheless, the formula is simple, and it has a nice interpretation. Here it is:

$$W = \frac{v}{v + m}\, R + \frac{m}{v + m}\, C$$

where $C$ is the mean vote across all movies, $v$ is the number of votes, $R$ is the mean rating for the movie, and $m$ is the “minimum number of votes required to be listed in the top 250 (currently 1300)”.

The nice interpretation is this: pretend that, in addition to the votes that users give a movie, you’re also throwing in $m$ votes of score $C$ each. In effect you’re pushing the scores towards the global average, by $m$ votes.
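Here’s a minimal Ruby sketch of that reading: compute the score by literally appending $m$ phantom votes of value $C$ to the real votes and taking a plain average (the numbers below are made up):

## m phantom votes of score c, plus the real votes, then a plain average
def bayesian_average votes, m, c
  (votes.inject(0.0) { |sum, v| sum + v } + m * c) / (votes.size + m)
end

bayesian_average [], 5, 6.8            #=> 6.8: with no votes you get the global mean
bayesian_average [10, 10, 10], 5, 6.8  #=> 8.0: three 10s, dragged toward 6.8 by the phantom votes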

Is this arbitrary? Actually, no. It’s the mean (equivalently the mode, since the posterior is Gaussian) of the posterior distribution you get when you have a Normal prior with mean $C$ and precision $m$, and a Normal conditional with variance 1.0.

In other words, you’re starting with a belief that, in the absence of votes, a movie/boardgame should be ranked as average, and you’re assuming that user votes are normally-distributed around the “true” score with variance 1.0. Then you’re looking at the posterior distribution (i.e. the probability distribution that arises as a result of those assumptions), and you’re picking the most likely value from that, which in the case of Gaussians is the mean.

Let’s see how that works.

To find the posterior distribution, we could work through the math, or we could just look at the Wikipedia article on conjugate priors. We’ll see that the posterior distribution of a Normal, when the prior is also a Normal, is a Normal with mean

$$\mu_{\text{post}} = \frac{\tau_0\, \mu_0 + \tau \sum_{i=1}^{n} x_i}{\tau_0 + n \tau}$$

where $\mu_0$ and $\tau_0$ are the mean and precision of the prior, respectively, $\tau$ is the precision of the vote distribution, and $n$ is the number of votes. In the case of IMDB, we assumed above that $\tau = 1$, so we have

$$\mu_{\text{post}} = \frac{\tau_0\, \mu_0 + n \bar{x}}{\tau_0 + n}$$

where $\bar{x}$ is the mean of the votes.

Comparing the IMDB equation to this, we can see that $C$ above is $\mu_0$ here, $R$ above is $\bar{x}$ here, $v$ above is $n$ here, and $m$ above is the hyperparameter $\tau_0$. So we know that even though IMDB says $m$ is the “minimum number of votes required to be listed in the top 250 list”, that’s an arbitrary decision on their part: it can be anything and the formula still works. $m$ is the precision of the prior distribution; as it gets bigger, the prior distribution gets “sharper”, and thus has more of an effect on the posterior distribution.

Now the assumptions we made to get to this point are almost laughable. If nothing else, we know that Gaussians are unbounded and continuous, and user votes on IMDB are integers in the range of 1-10. The interesting take-away message here is that even though we made a lot of assumptions above that were laughably wrong, the end result is a reasonable formula with a nice, intuitive meaning.

The St. Petersburg Paradox

On the topic of numeric paradoxes, here’s another one that drove a lot of work in economic and decision theory: the St. Petersburg paradox.

Here’s the deal. You’re offered a chance to play a game wherein you repeatedly flip a coin until it comes up heads, at which point the game is over. If the coin comes up heads the first time, you win a dollar. If it takes two flips to come up heads, you win two dollars. The third time, four dollars. The fourth time, eight dollars. And so on; the rule is, if you see heads on the $n$th flip, you win $2^{n-1}$ dollars.

How much would you pay to play this game?

The paradox is: the expected value of this game is infinity, so according to all your pretty formulas, you should immediately pay all your life savings for a single chance at this game. (Each possible outcome has an expected value of 50 cents, and there are an infinite number of them, and expectation distributes over summation, so the expected value is an infinite sum of 50 cents, which works out to be a little thing I like to call infinity dollars.)

Of course that’s a paradox because it’s crazy talk to bet more than a few bucks on such a game. The paradox highlights at least two problems with blithely using positive EV as the reward you’ll get if you play the game:

  1. It assumes that the host of the game actually has infinite funds. The Wikipedia article has a very striking breakdown of what happens to the St. Petersburg paradox when you have finite funds. It turns out that even if your backer has access to the entire GDP of the world in 2007, the expected value is only $23.77, which is quite a bit short of infinity dollars.
  2. It assumes you play the game an infinite number of times. That’s the only way you’ll get the expected value in your pocket. And the St. Petersburg paradox is a great example of just how quickly your actual take-home degenerates when subject to real-world constraints like finite repetitions. It turns out that if you want to make $10, you’ll have to play the game one million times; if you’re satisfied with $5, you’ll still have to play a thousand times.

The classical answer to the paradox has been to talk about utility, marginal utility and things like that; i.e., people with lots of money value more money less than people without very much money. And recent answers to the paradox, e.g. cumulative prospect theory, are along the lines of modeling how humans perceive risk, which (unsurprisingly) is not really in line with the actual probabilities.

But it seems to me that these solutions all involve modeling human behavior and explaining why a human wouldn’t pay a lot of money to play the game, either because money means less as it gets bigger or because they mis-value risks. But the actual paradox is not about human behavior or psychology. It’s the fact that the expected value of a game is not a good estimate of the real-world value of a game, because expected value can make assumptions about infinite funds and infinite plays, and we don’t have those.

So my solution to the St. Petersburg paradox is this: drop all events that have a probability less than some small epsilon, or a value more than some large, um, inverse epsilon. That neatly solves both of the infinity assumptions. (In this particular case one bound would do, because the probabilities drop exponentially as the values rise exponentially, but not in general.) I’ll call this the REV: the realistically expected value.

In this case, if you set the lower probability bound to be .01, and the upper value bound to be one million, then the REV of the St. Petersburg paradox is just about three bucks. (The upper value bound doesn’t even come into play.) And that’s about what I’d pay to play it.
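Here’s a quick Ruby sketch of that computation, using the same cutoffs as above (probability bound 0.01, value bound one million):

## "realistically expected value" of the St. Petersburg game: sum
## probability * payoff, but drop outcomes that are too improbable or
## whose payoff is implausibly large
def rev prob_bound=0.01, value_bound=1_000_000
  total = 0.0
  n = 1
  loop do
    prob   = 0.5 ** n
    payoff = 2 ** (n - 1)
    break if prob < prob_bound || payoff > value_bound
    total += prob * payoff
    n += 1
  end
  total
end

rev  #=> 3.0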

So there you go. Fixed economics for ya.

A philosophical question

Is there really a difference between saying, “I don’t know anything, a priori, about the parameters of this distribution”, and using a uniform prior?

What about, “I don’t know anything about that value” versus “As far as I’m concerned, every possibility for that value is equally likely”?

Bayes vs MLE: an estimation theory fairy tale

I found a neat little example in one of my introductory stats books about Bayesian versus maximum-likelihood estimation for the simple problem of estimating a binomial distribution given only one sample.

I was going to try and show the math but since Blogger is not making it possible to actually render MathML I’ll just hand-wave instead. [Fixed in Whisper. —ed.]

So let’s say we’re trying to estimate a binomial distribution parameterized by $p$, and that we’ve only seen one sample. For example, someone flips a coin once, and we have to decide what the coin’s probability of heads is.

The maximum likelihood estimate for $p$ is easy: if your single sample is a 1, then $\hat{p} = 1$, and if your sample is 0, $\hat{p} = 0$. (And if you go through the laborious process of writing the log likelihood, setting the derivative equal to 0, and solving it, you come up with the general rule of (# of 1’s) / (# of 1’s + # of 0’s), which is kinda what you would expect.)

In the coin case it seems crazy to say, I saw one head, so I’m going to assume that the coin always turns up heads, but that’s because of our prior knowledge of how coins behave. If we’re given a black box with a button and two lights, and you press the button, and one of the lights comes on, then maybe estimating that that light always comes on when you press the button makes a little more sense.

Finding the Bayesian estimate is slightly more complicated. Let’s use a uniform prior. Our conditional distribution is $P(x = 1 \mid p) = p$ and $P(x = 0 \mid p) = 1 - p$, and if you work it out, the posterior ends up as $P(p \mid x = 1) = 2p$ and $P(p \mid x = 0) = 2(1 - p)$.

Now if we were in the world of classification, we’d take the MAP estimate, which is a fancy way of saying the value with the biggest probability, or the mode of the distribution. Since we’re using a uniform prior, that would end up the same as the MLE. But we’re not. We’re in the world of real numbers, so we can take something better: the expected value, or the mean of the distribution. This is known as the Bayes estimate, and there are some decision-theoretic reasons for using it, but informally, it makes more sense than using the MAP estimate: you can take into account the entire shape of the distribution, not just the mode.

Using the Bayes estimate, we arrive at $\hat{p} = 2/3$ if the sample was a 1, and $\hat{p} = 1/3$ if the sample was a zero. So we’re at a place where Bayesian logic and frequentist logic arrive at very different answers, even with a uniform prior.

Up till now we’ve been talking about “estimation theory”, i.e. the art of estimating shit. But estimation theory is basically decision theory in disguise, where your decision space is the same as your parameter space: you’re deciding on a value for , given your input data, and your prior knowledge, if any.

Now what’s cool about moving to the world of decision theory is that we can say: if I have to decide on a particular value for $p$, how can I minimize my expected cost, aka my risk? A natural choice for a cost, or loss, function is squared error. If the true value is $p$, I’d like to choose an estimate $\hat{p}$ in such a way that $E[(p - \hat{p})^2]$ is minimized. So we don’t have to argue philosophically about MLE versus MAP versus minimax versus Bayes estimates; we can quantify how well each of them do under this framework.

And it turns out that, if you plot the risk for the MLE estimate and for the Bayes estimate under different values of the true value $p$, then MOST of the time, the Bayes estimate has lower risk than the MLE. It’s only when $p$ is close to 0 or to 1 that MLE has lower risk.
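You don’t have to take my word for it; here’s a small Ruby sketch that computes both risks over a grid of true values of $p$ (the risk formulas follow directly from the single-sample estimates above):

## risk (expected squared error) of the two single-sample estimators:
## MLE guesses the sample itself; Bayes guesses 2/3 on a 1 and 1/3 on a 0
(0..10).each do |i|
  p = i / 10.0
  mle_risk   = p * (1 - p)**2 + (1 - p) * p**2             # simplifies to p * (1 - p)
  bayes_risk = p * (p - 2.0/3)**2 + (1 - p) * (p - 1.0/3)**2
  puts format("p=%.1f  mle=%.3f  bayes=%.3f", p, mle_risk, bayes_risk)
end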

So that’s pretty cool. It seems like the Bayes estimate must be a superior estimate.

Of course, I set this whole thing up. Those “decision-theoretic reasons” for choosing the Bayes estimate I mentioned? Well, they’re theorems that show that the Bayes estimate minimizes risk. And, in fact, the Bayes estimate being the mean of the posterior distribution is specific to squared-error loss. If we chose another loss function, we could come up with a potentially very different Bayes estimate.

But my intention wasn’t really to trick you into believing that Bayes estimates are awesome. (Though they are!) I wanted to show that:

  1. Bayes and classical approaches can come up with very different estimates, even with a uniform prior.
  2. If you cast things in decision-theoretic terms, you can make some real quantitative statements about different ways of estimating.

In the decision theory world, you can customize your estimates to minimize your particular costs in your particular situation. And that’s an idea that I think is very, very powerful.

Decision theory and approximate randomization

In my earlier post about decision theory I alluded to a superior alternative to the classic t-test. That alternative is approximate randomization. It’s a neat way to do a hypothesis test without having to make any assumptions about the nature of the sampling distribution of the test statistic, in contrast to the assumptions required by Student’s t-test and its brethren.

Approximate randomization is ideal for comparing the result of a complicated metric run on the output of a complicated system, because you don’t have to worry about modeling any of that complexity, or, more likely, praying to the central limit theorem and ignoring the issue. Back in my machine translation days, I used it to test whether the difference between the BLEU scores of two MT systems was significant. This was pretty much the ideal scenario—BLEU is a complicated metric (at least, from the statistical point of view, which is more comfortable with things like the t-statistic, aka “the difference between the two means divided by some stuff”), and MT output is the result of something even more complicated. It worked very well, and I even wrote some slides on it.
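For a flavor of what the test looks like, here’s a minimal Ruby sketch for paired per-item scores; a real MT comparison would recompute a corpus-level metric like BLEU on each shuffle rather than just a mean, but the shuffling idea is the same:

## approximate randomization for paired scores: under the null hypothesis,
## which system produced which score is arbitrary, so swap each pair with
## probability 1/2 and see how often the shuffled difference in means is at
## least as extreme as the one we actually observed
def ar_test scores_a, scores_b, trials=10_000
  mean = lambda { |xs| xs.inject(0.0) { |s, x| s + x } / xs.size }
  observed = (mean.call(scores_a) - mean.call(scores_b)).abs
  hits = trials.times.count do
    pairs = scores_a.zip(scores_b).map { |a, b| rand < 0.5 ? [b, a] : [a, b] }
    (mean.call(pairs.map { |a, _| a }) - mean.call(pairs.map { |_, b| b })).abs >= observed
  end
  hits.to_f / trials  # approximate p-value under the null of "no difference"
end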

(In fact, there’s sometimes an even better reason to use AR over t-tests than just “it makes fewer assumptions”: t-tests tend to be overly conservative when their assumptions are violated. So if you’d be happier with the alternative hypothesis, AR will be more likely to show a difference than a t-test will. There’s a great chapter on this near the beginning of Bayesian Computation with R, where Monte Carlo techniques are used to show how the test statistic sampling distribution changes under different ways of violating the assumptions.)

Something I’ve been thinking about a lot recently is how to apply approximate randomization to the Bayesian, decision-theoretic world of hypothesis tests. Unfortunately it’s not cut and dry. AR gives you a way of directly sampling from the sampling distribution of the test statistic under the null hypothesis. That’s all you need for classical tests, but in the Bayesian world, you also need to sample from the alternative distribution. For the common “two-tailed” case, the null hypothesis is that there’s no difference, and AR says, just shuffle everything around, because that shouldn’t make a difference. The alternative hypothesis is that there IS a difference, so I think you would somehow need to do something analogous, but under every possible way of there being a difference. And I’m not sure what that would really look like.

Bayesian hypothesis testing and decision theory

I’ve been doing a lot of learning at the new job. Not because people here are teaching me stuff, but more because I’m in a good position to spend a significant portion of my day learning about stuff that will help me do my job. (Which is great, and fun, and further reinforces what I know about myself by now—I’m a great self-directed learner and a very poor externally-directed learner.)

One of the things I’ve learned is that when it comes to statistics, I’m a Bayesian. And all the crap I learned about things like hypothesis testing and maximum likelihood estimation in my stats classes now seems horribly clunky and old-fashioned to me.

Let’s take hypothesis testing as an example. In the classical/frequentist world, you pick an arbitrary “small enough” probability (aka 5%), find the sampling distribution of your statistic under your null hypothesis, and if it’s below that threshold, say yea, else say nay.

Here are some things that are wrong/bad with that approach: the 5% threshold is completely arbitrary, the sampling distribution under the alternative hypothesis is not taken into consideration (i.e. you only care about type I errors), and you don’t have any way to balance the cost of type I vs type II errors. (Never mind the fact that people ALWAYS just use t-tests and ignore the fact that their datapoints are not actually distributed Normally and with the same means and variances. That, at least, I can tell you how to fix.)

Compare this with the Bayesian decision theory version of hypothesis testing: you assign a cost to the two types of error, calculate the posterior probability under both conditions, based on the observations and incorporating any prior knowledge if you have it, calculate a threshold that minimizes your expected cost, and accept or reject based on that. Doesn’t that just make more sense?
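As a concrete (if toy) illustration of that last step, here’s a Ruby sketch of the decision rule, assuming you’ve already computed the posterior probabilities of the two hypotheses and picked costs for the two error types (all the numbers are hypothetical):

## accept the alternative hypothesis when its expected cost is lower: the
## cost of a type I error weighted by the posterior probability of the null,
## versus the cost of a type II error weighted by the posterior probability
## of the alternative
def accept_alternative? p_null, p_alt, type1_cost, type2_cost
  type1_cost * p_null < type2_cost * p_alt
end

accept_alternative? 0.2, 0.8, 10.0, 1.0  #=> false: type I errors are expensive, so hold off
accept_alternative? 0.2, 0.8, 1.0, 1.0   #=> true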

I highly recommend the book Bayesian Computation with R. (Although it doesn’t actually talk about decision theory!) It has an associated blog: LearnBayes.

Other things to look at: William H. Jefferys’s Stats 295 class materials (especially these slides, which I’m still working my way through), and his blog for the class.

Simpson’s Paradox

I found a really cool visual explanation of Simpson’s Paradox on the Wikipedier.

Informally, Simpson’s Paradox states that, if you and I are competing, and I do better than you in category A, and I also do better than you in category B, my overall score for both categories combined could actually be worse than yours. The Wikipedia article gives a real-life example:

“In both 1995 and 1996, [David] Justice had a higher batting average […] than [Derek] Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice.”

And there’s also a famous legal case about Berkeley’s admission rates for women from the 70’s, where they were sued because the overall admission rate was lower for women than for men. Turns out that if you break it down by department, most departments actually had a higher admission rate for women.
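If batting averages and admission rates feel too abstract, here’s a tiny Ruby example with made-up numbers that shows the same flip:

## made-up successes/attempts for two people in two categories
me_a,  me_b  = [6, 10],   [10, 100]   # 0.60 in A, 0.10 in B
you_a, you_b = [55, 100], [1, 20]     # 0.55 in A, 0.05 in B

def rate hits_and_tries
  hits, tries = hits_and_tries
  hits.to_f / tries
end

rate(me_a) > rate(you_a)   #=> true: I win category A (0.60 vs 0.55)
rate(me_b) > rate(you_b)   #=> true: I win category B (0.10 vs 0.05)

me_all  = [me_a[0] + me_b[0],   me_a[1] + me_b[1]]     # 16 / 110
you_all = [you_a[0] + you_b[0], you_a[1] + you_b[1]]   # 56 / 120
rate(me_all) > rate(you_all)   #=> false: ~0.15 vs ~0.47, you win overall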

This all sounds crazy until you stare at the picture above for a while. The slopes of the lines are the percentages. Both solid blue vectors have smaller slopes than their corresponding solid red vectors, but when you add them (shown as dashed lines), the combined blue vector has a bigger slope.

What the picture really makes clear is that a ratio or a percentage is not a complete description of the situation. Knowing a percentage is equivalent to knowing the angle of a vector without knowing its magnitude. You can see from the picture that this isn’t a weird corner case; there are many choices for the second blue vector that would have the same result.

It’s been probably 10-12 years since I learned about Simpson’s Paradox in some undergrad stats class. Now I finally really understand it.