In my earlier post about decision theory I alluded to a superior alternative to the classic t-test. That alternative is approximate randomization. It’s a neat way to do a hypothesis test without having to make any assumptions about the nature of the sampling distribution of the test statistic, in contrast to the assumptions required by Student’s t-test and its brethren.
Approximate randomization is ideal for comparing the results of a complicated metric run on the output of a complicated system, because you don’t have to worry about modeling any of that complexity or, more likely, praying to the central limit theorem while ignoring the issue. Back in my machine translation days, I used it to test whether the difference between the BLEU scores of two MT systems was statistically significant. This was pretty much the ideal scenario: BLEU is a complicated metric (at least from the statistical point of view, which is more comfortable with things like the t-statistic, aka “the difference between the two means divided by some stuff”), and MT output is the result of something even more complicated. It worked very well, and I even wrote some slides on it.
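To make the idea concrete, here’s a minimal sketch of a paired approximate randomization test in Python. It is not the code I used for BLEU: the function name, its parameters, and the use of a plain mean over per-item scores are all assumptions made for illustration. BLEU is a corpus-level metric, so a real version would pass in a `metric` that recomputes the score over each shuffled set of outputs.

```python
import random
from statistics import mean


def approximate_randomization(scores_a, scores_b, metric=mean, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference between two systems.

    scores_a[i] and scores_b[i] are assumed to come from the same test item.
    `metric` is a hypothetical stand-in for whatever score is being compared.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(metric(scores_a) - metric(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair of outputs with probability 1/2.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(metric(shuffled_a) - metric(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate from ever being exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

The returned value is the fraction of shuffles that produced a difference at least as large as the one actually observed, which is the p-value estimate. Nothing in it depends on knowing the sampling distribution of the metric.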
(In fact, there’s sometimes an even better reason to use approximate randomization (AR) over t-tests than just “it makes fewer assumptions”: t-tests tend to be overly conservative when their assumptions are violated. So if you’re hoping to reject the null hypothesis, AR will be more likely to show a difference than a t-test will. There’s a great chapter on this near the beginning of Bayesian Computation with R, where Monte Carlo techniques are used to show how the sampling distribution of the test statistic changes under different ways of violating the assumptions.)
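That kind of Monte Carlo check is easy to sketch. The following is my own rough Python version, not a reproduction of the book’s examples: the function names, the choice of a Pareto distribution as the heavy-tailed case, and the sample size are all assumptions. It simulates many experiments in which the null hypothesis is true and counts how often a pooled-variance t-test rejects at the nominal 5% level.

```python
import math
import random
from statistics import mean, variance


def t_statistic(xs, ys):
    """Pooled-variance (equal-variance) two-sample t statistic."""
    n, m = len(xs), len(ys)
    pooled = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled * (1 / n + 1 / m))


def false_positive_rate(draw, n=10, trials=2000, critical=2.101, seed=0):
    """Fraction of null-true experiments in which |t| exceeds `critical`.

    critical=2.101 is the two-sided 5% cutoff for 18 degrees of freedom,
    so it only matches the default of n=10 observations per group.
    `draw` is a hypothetical callback that samples one observation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same distribution, so the null is true.
        xs = [draw(rng) for _ in range(n)]
        ys = [draw(rng) for _ in range(n)]
        if abs(t_statistic(xs, ys)) > critical:
            rejections += 1
    return rejections / trials


# Assumptions satisfied: normally distributed data.
normal_rate = false_positive_rate(lambda rng: rng.gauss(0.0, 1.0))
# Assumptions violated: heavy-tailed data with infinite variance.
heavy_rate = false_positive_rate(lambda rng: rng.paretovariate(1.5))
print(f"normal: {normal_rate:.3f}  heavy-tailed: {heavy_rate:.3f}")
```

Comparing each rate against the nominal 5% shows whether the test is running conservative or liberal for that particular kind of violation. Swapping in a different `draw` function lets you try other violations.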
Something I’ve been thinking about a lot recently is how to apply approximate randomization to the Bayesian, decision-theoretic world of hypothesis tests. Unfortunately it’s not cut and dried. AR gives you a way of directly sampling from the sampling distribution of the test statistic under the null hypothesis. That’s all you need for classical tests, but in the Bayesian world, you also need to sample from the distribution of the test statistic under the alternative hypothesis. For the common “two-tailed” case, the null hypothesis is that there’s no difference, and AR says, just shuffle everything around, because that shouldn’t make a difference. The alternative hypothesis is that there IS a difference, so I think you would somehow need to do something analogous, but under every possible way of there being a difference. And I’m not sure what that would really look like.