Some git-fu

Some git-fu I've been finding particularly useful recently:

  1. Untangling concurrent changes into multiple commits: git add -p is the greatest thing since sliced bread. But did you know it features an 's' command which allows you to split a hunk into smaller hunks? Now you can untangle pretty much anything.

  2. Splitting a previous commit into multiple commits: I've been finding this one useful for quite a while. Start with a git rebase -i, mark the commit(s) as edit, and once you get there, do a git reset HEAD^. All the changes in that commit will be moved out of the staging area, and you can git add/git commit to your heart's content. Finish with a quick git rebase --continue to the throat.

  3. Fixing your email address in previous commits: I often make a new repo and forget to change my email address. (For historical, and now silly, reasons, I like to commit to different projects from different addresses, and I often screw it up.) Here's how to do a mass change: git filter-branch --env-filter "export GIT_AUTHOR_EMAIL=you@example.com" commit..HEAD, where commit is the first commit to be affected and you@example.com stands in for the address you actually want. Of course, changing the email address of a commit changes its id (and the id of all subsequent commits), so be careful if you've published them. (Also note that using --env-filter=... won't work. No equal sign technology.)

  4. A git log that includes a list of files modified by each commit: git log --stat, which also gives you a nice colorized histogram of additions/deletions for each file. This is a nice middle ground between git log and git log -p.

  5. Speaking of git log -p, here's how to make it sane in the presence of moves or renames: git log -p -C -M. Otherwise it doesn't check for moves or copies, and happily gives you the full patch. (These should be on by default.)

  6. Comparing two branches: you can use git log --pretty=oneline one..two for changes in one direction (commits that 'two' has that 'one' doesn't), and git log --pretty=oneline two..one for the opposite direction. You can also use the triple-dot operator to merge those two lists into one, but typically I find it useful to separate the two. Or you can check out git-wtf, which does this for you.

  7. Preview during commit message: git commit -v will paste the diff into your editor so you can review it while composing the commit message. (It won't be included in the final message, of course.)

  8. gitk: don't use it. You'll get obsessive about merge commits, rebasing, etc., and it just doesn't matter in the end. It took me about 4 months to recover from the bad mindset that gitk put me into.

Hope that was helpful to someone else!


Just read a great Steven Pinker article about morality that appeared in the NY Times earlier this year. Being the curmudgeonly contrarian that I am, I most enjoyed the identification and dissection of the moralization so prevalent but so rarely recognized in my peer group:

[W]ith the discovery of the harmful effects of secondhand smoke, smoking is now treated as immoral. Smokers are ostracized; images of people smoking are censored; and entities touched by smoke are felt to be contaminated (so hotels have not only nonsmoking rooms but nonsmoking floors). The desire for retribution has been visited on tobacco companies, who have been slapped with staggering “punitive damages.”


[W]hether an activity flips our mental switches to the “moral” setting isn’t just a matter of how much harm it does. We don’t show contempt to the man who fails to change the batteries in his smoke alarms or takes his family on a driving vacation, both of which multiply the risk they will die in an accident. Driving a gas-guzzling Hummer is reprehensible, but driving a gas-guzzling old Volvo is not; eating a Big Mac is unconscionable, but not imported cheese or crème brûlée. The reason for these double standards is obvious: people tend to align their moralization with their own lifestyles.

There's also the compelling idea that we're not actually less moral than we were in the past (a claim that old people have been making since time immemorial), but rather, our morality has simply shifted to other things:

This wave of amoralization has led the cultural right to lament that morality itself is under assault, as we see in the group that anointed itself the Moral Majority. In fact there seems to be a Law of Conservation of Moralization, so that as old behaviors are taken out of the moralized column, new ones are added to it. Dozens of things that past generations treated as practical matters are now ethical battlegrounds, including disposable diapers, I.Q. tests, poultry farms, Barbie dolls and research on breast cancer.

I'm reminded of one of my favorite Paul Graham essays, What You Can't Say, the thesis of which is that the powerful ideas that define the modern age are often ideas that were completely verboten in earlier times (e.g. Copernicus's claim that the earth revolves around the sun); thus, if we want to identify powerful ideas that will shape the future, we should look to things that are taboos today.

Trollop 1.10 released

I released a new version of Trollop with a couple minor but cool updates.

The best part is the new :io argument type, which uses open-uri to handle filenames and URIs on the commandline. So you can do something like this:

require 'trollop'

opts = Trollop::options do
  opt :source, "Source file (or URI) to print",
      :type => :io,
      :required => true
end

opts[:source].each { |l| puts "> #{l.chomp}" }

Also, when trying to detect the terminal size, Trollop now tries to `stty size` before loading curses. This gives better results when running under screen (for some reason curses clears the terminal when initializing under screen).

I've also cleaned up the documentation quite a bit, expanding the examples on the main page, fixing up the RDoc comments, and generating the RDoc documentation with a modern RDoc, so that things like constants actually get documented.

If you're still using OptionParser, you should really give Trollop a try. I guarantee you'll write far fewer lines of argument parsing code, and you'll get all sorts of nifty features like help page terminal size detection.

The St. Petersburg Paradox

On the topic of numeric paradoxes, here's another one that drove a lot of work in economic and decision theory: the St. Petersburg paradox.

Here's the deal. You're offered a chance to play a game wherein you repeatedly flip a coin until it comes up heads, at which point the game is over. If the coin comes up heads the first time, you win a dollar. If it takes two flips to come up heads, you win two dollars. The third time, four dollars. The fourth time, eight dollars. And so on; the rule is, if you see heads on the ith flip, you win 2^(i-1) dollars.

How much would you pay to play this game?

The paradox is: the expected value of this game is infinity, so according to all your pretty formulas, you should immediately pay all your life savings for a single chance at this game. (Each possible outcome has an expected value of 50 cents, and there are an infinite number of them, and expectation distributes over summation, so the expected value is an infinite sum of 50 cents, which works out to be a little thing I like to call infinity dollars.)
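Spelled out with the payout rule above (first heads on flip i happens with probability 2^(-i) and pays 2^(i-1) dollars), the sum looks like this:

```latex
\mathbb{E}[\text{payout}]
  = \sum_{i=1}^{\infty} \Pr(\text{first heads on flip } i) \cdot 2^{\,i-1}
  = \sum_{i=1}^{\infty} 2^{-i} \cdot 2^{\,i-1}
  = \sum_{i=1}^{\infty} \frac{1}{2}
  = \infty
```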

Of course that's a paradox because it's crazy talk to bet more than a few bucks on such a game. The paradox highlights at least two problems with blithely treating the EV as the reward you'll get if you play the game:

  1. It assumes that the host of the game actually has infinite funds. The Wikipedia article has a very striking breakdown of what happens to the St. Petersburg paradox when you have finite funds. It turns out that even if your backer has access to the entire GDP of the world in 2007, the expected value is only $23.77, which is quite a bit short of infinity dollars.

  2. It assumes you play the game an infinite number of times. That's the only way you'll get the expected value in your pocket. And the St. Petersburg paradox is a great example of just how quickly your actual take-home degenerates when subject to real-world constraints like finite repetitions. It turns out that if you want to make $10, you'll have to play the game one million times; if you're satisfied with $5, you'll still have to play a thousand times.
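To make the finite-funds point concrete, here is a rough Ruby sketch of the expected value when the host can pay out at most some fixed bankroll. It assumes the host simply hands over everything he has once the nominal payout exceeds his funds, and the $54.3 trillion figure for 2007 world GDP is my assumption for illustration, not a number given above.

```ruby
# Expected value of the St. Petersburg game when the host's funds are capped.
# Assumption: if the nominal payout 2^(i-1) exceeds the bankroll, the host
# pays out the whole bankroll and nothing more.
def capped_expected_value(bankroll)
  total = 0.0
  flip = 1
  # Every outcome the host can fully cover contributes exactly 50 cents.
  while 2**(flip - 1) <= bankroll
    total += 0.5
    flip += 1
  end
  # All remaining outcomes (first heads on this flip or later) pay the whole
  # bankroll. Their combined probability is 2^-(flip - 1).
  total + bankroll.to_f / 2**(flip - 1)
end

puts capped_expected_value(100).round(2)      # a friendly game with $100
puts capped_expected_value(54.3e12).round(2)  # assumed 2007 world GDP
```

The expected value grows only logarithmically with the bankroll, which is why even absurdly deep pockets barely move it.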

The classical answer to the paradox has been to talk about utility, marginal utility and things like that; i.e., people with lots of money value more money less than people without very much money. And recent answers to the paradox, e.g. cumulative prospect theory, are along the lines of modeling how humans perceive risk, which (unsurprisingly) is not really in line with the actual probabilities.

But it seems to me that these solutions all involve modeling human behavior and explaining why a human wouldn't pay a lot of money to play the game, either because money means less as it gets bigger or because they mis-value risks. But the actual paradox is not about human behavior or psychology. It's the fact that the expected value of a game is not a good estimate of the real-world value of a game, because expected value can make assumptions about infinite funds and infinite plays, and we don't have those.

So my solution to the St. Petersburg paradox is this: drop all events that have a probability less than some small epsilon, or a value more than some large, um, inverse epsilon. That neatly solves both of the infinity assumptions. (In this particular case one bound would do, because the probabilities drop exponentially as the values rise exponentially, but not in general.) I'll call this the REV: the realistically expected value.

In this case, if you set the lower probability bound to be .01, and the upper value bound to be one million, then the REV of the St. Petersburg paradox is just about three bucks. (The upper value bound doesn't even come into play.) And that's about what I'd pay to play it.
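Here is a minimal Ruby sketch of that calculation. The function and parameter names are made up for illustration, and the early exit relies on a property specific to this game: probabilities only shrink and payouts only grow as the flip count rises.

```ruby
# "Realistically expected value" of the St. Petersburg game: sum
# probability * payout over outcomes, dropping any outcome whose probability
# is below min_probability or whose payout is above max_value.
def realistic_expected_value(min_probability:, max_value:)
  total = 0.0
  flip = 1
  loop do
    probability = 0.5**flip      # chance the first heads lands on this flip
    payout = 2**(flip - 1)       # dollars won in that case
    # Safe to stop at the first excluded outcome here, because every later
    # outcome is both less likely and more valuable.
    break if probability < min_probability || payout > max_value
    total += probability * payout
    flip += 1
  end
  total
end

puts realistic_expected_value(min_probability: 0.01, max_value: 1_000_000)
```

Only the first six outcomes survive the 0.01 cutoff (2^-7 is already below it), and each contributes 50 cents.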

So there you go. I just fixed economics.

The AIG "scandal"

My wife, who knows more about corporate structure than the average Joe, points out that the AIG executives who spent $440k at a lavish retreat shortly after the federal government granted AIG an $85b bailout were, in fact, executives of the profitable, non-bailout-requiring life insurance group, and were unrelated to the bailout-requiring investment insurance and bond-rating companies, except to the extent that both companies are held by the same holding company.

The nature of a holding company corporate structure is fairly strict. Money can't be transferred around between them arbitrarily, so it's very possible for one held company to be successful while another is completely bankrupt. I found a good analogy in the *cough* Reddit comments for the above article:

A family is going through some financial troubles because the dad gambled the money away and is getting welfare checks. However, the son who has been successful in his job is still going to Europe because he paid for it months before and to cancel it would incur penalty fees.

You're blaming the family for going on vacation when they need money for their monthly expenses, when in reality, it's just the son, and he paid for it using his own earnings, not the welfare check.

A philosophical question

Is there really a difference between saying, "I don't know anything, a priori, about the parameters of this distribution", and using a uniform prior?

What about, "I don't know anything about that value" versus "As far as I'm concerned, every possibility for that value is equally likely"?

Bayes vs MLE: an estimation theory fairy tale

I found a neat little example in one of my introductory stats books about Bayesian versus maximum-likelihood estimation for the simple problem of estimating a binomial distribution given only one sample.

I was going to try and show the math but since Blogger is not making it possible to actually render MathML I'll just hand-wave instead.

So let's say we're trying to estimate a binomial distribution parameterized by p, and that we've only seen one sample. For example, someone flips a coin once, and we have to decide what the coin's probability of heads is.

The maximum likelihood estimate for p is easy: if your single sample is a 1, then p=1, and if your sample is 0, p=0. (And if you go through the laborious process of writing the log likelihood, setting the derivative equal to 0, and solving it, you come up with the general rule of (# of 1's) / (# of 1's + # of 0's), which is kinda what you would expect.)

In the coin case it seems crazy to say, I saw one head, so I'm going to assume that the coin always turns up heads, but that's because of our prior knowledge of how coins behave. If we're given a black box with a button and two lights, and you press the button, and one of the lights comes on, then maybe estimating that that light always comes on when you press the button makes a little more sense.

Finding the Bayesian estimate is slightly more complicated. Let's use a uniform prior. Our conditional distribution is f(1|p)=p and f(0|p)=1-p, and if you work it out, the posterior ends up as h(p|1)=2p and h(p|0)=2(1-p).
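For the record, "working it out" is just Bayes' rule with the uniform prior, which I'll write as g(p) = 1 on [0, 1] (g is my notation here):

```latex
h(p \mid x) = \frac{f(x \mid p)\, g(p)}{\int_0^1 f(x \mid t)\, g(t)\, dt},
\qquad g(p) = 1

h(p \mid 1) = \frac{p}{\int_0^1 t \, dt} = \frac{p}{1/2} = 2p,
\qquad
h(p \mid 0) = \frac{1 - p}{\int_0^1 (1 - t) \, dt} = \frac{1 - p}{1/2} = 2(1 - p)
```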

Now if we were in the world of classification, we'd take the MAP estimate, which is a fancy way of saying the value with the biggest probability, or the mode of the distribution. Since we're using a uniform prior, that would end up as the same as the MLE. But we're not. We're in the world of real numbers, so we can take something better: the expected value, or the mean of the distribution. This is known as the Bayes estimate, and there are some decision-theoretic reasons for using it, but informally, it makes more sense than using the MAP estimate: you can take into account the entire shape of the distribution, not just the mode.

Using the Bayes estimate, we arrive at p=2/3 if the sample was a 1, and p=1/3 if the sample was a zero. So we're at a place where Bayesian logic and frequentist logic arrive at very different answers, even with a uniform prior.
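Those two numbers are the means of the posteriors h(p|1) = 2p and h(p|0) = 2(1-p):

```latex
\mathbb{E}[p \mid 1] = \int_0^1 p \cdot 2p \, dp = \frac{2}{3},
\qquad
\mathbb{E}[p \mid 0] = \int_0^1 p \cdot 2(1 - p) \, dp
  = 2\left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{3}
```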

Up till now we've been talking about "estimation theory", i.e. the art of estimating shit. But estimation theory is basically decision theory in disguise, where your decision space is the same as your parameter space: you're deciding on a value for p, given your input data, and your prior knowledge, if any.

Now what's cool about moving to the world of decision theory is that we can say: if I have to decide on a particular value for p, how can I minimize my expected cost, aka my risk? A natural choice for a cost, or loss, function, is squared error. If the true value is q, I'd like to estimate p in such a way that E[(q-p)^2] is minimized. So we don't have to argue philosophically about MLE versus MAP versus minimax versus Bayes estimates; we can quantify how well each of them does under this framework.

And it turns out that, if you plot the risk for the MLE estimate and for the Bayes estimate under different values of the true value q, then MOST of the time, the Bayes estimate has lower risk than the MLE. It's only when q is close to 0 or to 1 that MLE has lower risk.
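Since I can't show the plot, here is a small Ruby sketch that computes the same thing. It assumes the single-sample setup above: the sample is 1 with probability q and 0 otherwise, the MLE answers 1 or 0, and the Bayes estimate answers 2/3 or 1/3. The function names are mine.

```ruby
# Risk = expected squared error over the two possible samples, where the
# sample is 1 with probability q and 0 with probability 1 - q.
def risk(q, estimate_if_one:, estimate_if_zero:)
  q * (q - estimate_if_one)**2 + (1 - q) * (q - estimate_if_zero)**2
end

def mle_risk(q)
  risk(q, estimate_if_one: 1.0, estimate_if_zero: 0.0)
end

def bayes_risk(q)
  risk(q, estimate_if_one: 2.0 / 3, estimate_if_zero: 1.0 / 3)
end

# Scan q over [0, 1] in steps of 0.01 and list where the MLE has lower risk.
grid = (0..100).map { |k| k / 100.0 }
mle_wins = grid.select { |q| mle_risk(q) < bayes_risk(q) }
puts "MLE has lower risk for q in: #{mle_wins.inspect}"
```

Algebraically the MLE risk is q(1-q) and the Bayes risk is 1/9 - q(1-q)/3, so the MLE only wins when q(1-q) < 1/12, i.e. for q below roughly 0.09 or above roughly 0.91.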

So that's pretty cool. It seems like the Bayes estimate must be a superior estimate.

Of course, I set this whole thing up. Those "decision-theoretic reasons" for choosing the Bayes estimate I mentioned? Well, they're theorems that show that the Bayes estimate minimizes risk. And, in fact, the Bayes estimate of the mean of the distribution is specific to squared-error loss. If we chose another loss function, we could come up with a potentially very different Bayes estimate.
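To illustrate with one alternative (my choice of example): under absolute-error loss |q - p|, the Bayes estimate is the posterior median rather than the posterior mean. With the posteriors h(p|1) = 2p and h(p|0) = 2(1-p) from above, that gives:

```latex
\int_0^{m_1} 2p \, dp = m_1^2 = \frac{1}{2}
  \;\Rightarrow\; m_1 = \frac{1}{\sqrt{2}} \approx 0.707,
\qquad
\int_0^{m_0} 2(1 - p) \, dp = 1 - (1 - m_0)^2 = \frac{1}{2}
  \;\Rightarrow\; m_0 = 1 - \frac{1}{\sqrt{2}} \approx 0.293
```

So the estimates move from 2/3 and 1/3 to about 0.71 and 0.29 just by swapping the loss function.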

But my intention wasn't really to trick you into believing that Bayes estimates are awesome. (Though they are!) My intention was to show that:
  1. Bayes and classical approaches can come up with very different estimates, even with a uniform prior.
  2. If you cast things in decision-theoretic terms, you can make some real quantitative statements about different ways of estimating.
In the decision theory world, you can customize your estimates to minimize your particular costs in your particular situation. And that's an idea that I think is very, very powerful.

maff test

It really seems like this should display some kind of equation:

\int_0^1 \theta^2 \, dx

I can't make it work despite all my xhtml'ing. Blogger fail.

Decision theory and approximate randomization

In my earlier post about decision theory I alluded to a superior alternative to the classic t-test. That alternative is approximate randomization. It's a neat way to do a hypothesis test without having to make any assumptions about the nature of the sampling distribution of the test statistic, in contrast to the assumptions required by Student's t-test and its brethren.

Approximate randomization is ideal for comparing the result of a complicated metric run on the output of a complicated system, because you don't have to worry about modeling any of that complexity, or, more likely, praying to the central limit theorem while ignoring the issue. Back in my machine translation days, I used it to calculate the significance of the difference between the BLEU scores of two MT systems. This was pretty much the ideal scenario: BLEU is a complicated metric (at least, from the statistical point of view, which is more comfortable with things like the t-statistic, aka "the difference between the two means divided by some stuff"), and MT output is the result of something even more complicated. It worked very well, and I even wrote some slides on it.
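The basic procedure is short enough to sketch in Ruby. This is a generic two-sample version using the absolute difference in means as the test statistic, not the BLEU setup I actually used. The function names, the trial count, and the fixed seed are all illustrative assumptions.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# Approximate randomization test for a difference in means.
# Under the null hypothesis the group labels are interchangeable, so we
# shuffle the pooled data, re-split it, and count how often the shuffled
# statistic is at least as extreme as the observed one.
def approximate_randomization(a, b, trials: 999, rng: Random.new(42))
  observed = (mean(a) - mean(b)).abs
  pooled = a + b
  at_least_as_extreme = 0
  trials.times do
    shuffled = pooled.shuffle(random: rng)
    fake_a = shuffled.take(a.size)
    fake_b = shuffled.drop(a.size)
    at_least_as_extreme += 1 if (mean(fake_a) - mean(fake_b)).abs >= observed
  end
  # Add-one smoothing, so the estimated p-value is never exactly zero.
  (at_least_as_extreme + 1).to_f / (trials + 1)
end

system_a = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29]  # made-up scores
system_b = [0.27, 0.25, 0.30, 0.26, 0.28, 0.24]  # made-up scores
puts approximate_randomization(system_a, system_b)
```

To use a different metric, swap out the two places that compute the statistic. Nothing else depends on what the statistic is, which is the whole appeal.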

(In fact, there's sometimes an even better reason to use AR over t-tests than just "it makes fewer assumptions": t-tests tend to be overly conservative when their assumptions are violated. So if you'd be happier with the alternative hypothesis, AR will be more likely to show a difference than a t-test will. There's a great chapter on this near the beginning of Bayesian Computation with R, where Monte Carlo techniques are used to show how the test statistic sampling distribution changes under different ways of violating the assumptions.)

Something I've been thinking about a lot recently is how to apply approximate randomization to the Bayesian, decision-theoretic world of hypothesis tests. Unfortunately it's not cut and dried. AR gives you a way of directly sampling from the sampling distribution of the test statistic under the null hypothesis. That's all you need for classical tests, but in the Bayesian world, you also need to sample from the alternative distribution. For the common "two-tailed" case, the null hypothesis is that there's no difference, and AR says, just shuffle everything around, because that shouldn't make a difference. The alternative hypothesis is that there IS a difference, so I think you would somehow need to do something analogous, but under every possible way of there being a difference. And I'm not sure what that would really look like.


I created my first Greasemonkey script yesterday, to ease my wife's Redfin addiction. The idea was simple: map restaurants, grocery stores, coffee shops, etc. near each house.

Starting from not knowing anything more than what Greasemonkey was (including not knowing JavaScript), it took me 45 minutes to produce a working script.

It was fun, and it's a great reminder that, unlike TV, a website is the product of a shared computation between the server and the client. Redfin can send me whatever it wants, but ultimately, I decide how to display it. Not a new idea, but it's nice to finally be a part of it.

Bayesian hypothesis testing and decision theory

I've been doing a lot of learning at the new job. Not because people here are teaching me stuff, but more because I'm in a good position to spend a significant portion of my day learning about stuff that will help me do my job. (Which is great, and fun, and further reinforces what I know about myself by now—I'm a great self-directed learner and a very poor externally-directed learner.)

One of the things I've learned is that when it comes to statistics, I'm a Bayesian. And all the crap I learned about things like hypothesis testing and maximum likelihood estimation in my stats classes now seems horribly clunky and old-fashioned to me.

Let's take hypothesis testing as an example. In the classical/frequentist world, you pick an arbitrary "small enough" probability (usually 5%), find the sampling distribution of your statistic under your null hypothesis, and compute the probability of seeing a value at least as extreme as the one you observed. If that probability is below the threshold, say yea, else say nay.

Here are some things that are wrong/bad with that approach: the 5% threshold is completely arbitrary, the sampling distribution under the alternative hypothesis is not taken into consideration (i.e. you only care about type I errors), and you don't have any way to balance the cost of type I vs type II errors. (Never mind the fact that people ALWAYS just use t-tests and ignore the fact that their datapoints are not actually distributed Normally and with the same means and variances. That, at least, I can tell you how to fix.)

Compare this with the Bayesian decision theory version of hypothesis testing: you assign a cost to the two types of error, calculate the posterior probability under both conditions, based on the observations and incorporating any prior knowledge if you have it, calculate a threshold that minimizes your expected cost, and accept or reject based on that. Doesn't that just make more sense?
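As a toy illustration, here is a Ruby sketch for the simplest possible case: two point hypotheses about a coin. The specific hypotheses (p = 0.5 versus p = 0.7), the 50/50 prior, the costs, and all the names are assumptions I made up for the example.

```ruby
# Likelihood of seeing `heads` heads in `flips` flips if P(heads) = p.
# The binomial coefficient is omitted because it cancels in the posterior.
def likelihood(heads, flips, p)
  p**heads * (1 - p)**(flips - heads)
end

# Posterior probability of the alternative H1 given the data.
def posterior_h1(heads, flips, p0:, p1:, prior_h1: 0.5)
  weight_h0 = likelihood(heads, flips, p0) * (1 - prior_h1)
  weight_h1 = likelihood(heads, flips, p1) * prior_h1
  weight_h1 / (weight_h0 + weight_h1)
end

# Rejecting H0 costs cost_type1 if H0 was true; accepting it costs cost_type2
# if H1 was true. Expected cost is minimized by rejecting H0 exactly when
# P(H1 | data) exceeds cost_type1 / (cost_type1 + cost_type2).
def reject_null?(post_h1, cost_type1:, cost_type2:)
  post_h1 > cost_type1.to_f / (cost_type1 + cost_type2)
end

post = posterior_h1(7, 10, p0: 0.5, p1: 0.7)
puts post.round(3)
puts reject_null?(post, cost_type1: 1, cost_type2: 1)  # errors equally bad
puts reject_null?(post, cost_type1: 9, cost_type2: 1)  # false alarms 9x worse
```

The same data leads to different decisions depending on the costs, and the threshold falls out of the costs instead of being pulled out of a hat.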

I highly recommend the book Bayesian Computation with R. (Although it doesn't actually talk about decision theory!) It has an associated blog: LearnBayes.

Other things to look at: William H. Jefferys's Stats 295 class materials (especially these slides, which I'm still working my way through), and his blog for the class.
