I created my first Greasemonkey script yesterday, to ease my wife’s Redfin addiction. The idea was simple: map restaurants, grocery stores, coffee shops, etc., near each house.

Starting from knowing nothing more than what Greasemonkey was (and not knowing any JavaScript), I had a working script in 45 minutes.

It was fun, and it’s a great reminder that, unlike TV, a website is the product of a shared computation between the server and the client. Redfin can send me whatever it wants, but ultimately, I decide how to display it. Not a new idea, but it’s nice to finally be a part of it.

Bayesian hypothesis testing and decision theory

I’ve been doing a lot of learning at the new job. Not because people here are teaching me stuff, but more because I’m in a good position to spend a significant portion of my day learning about stuff that will help me do my job. (Which is great, and fun, and further reinforces what I know about myself by now—I’m a great self-directed learner and a very poor externally-directed learner.)

One of the things I’ve learned is that when it comes to statistics, I’m a Bayesian. And all the crap I learned about things like hypothesis testing and maximum likelihood estimation in my stats classes now seems horribly clunky and old-fashioned to me.

Let’s take hypothesis testing as an example. In the classical/frequentist world, you pick an arbitrary “small enough” probability (typically 5%), find the sampling distribution of your statistic under your null hypothesis, and compute the probability of seeing a value at least as extreme as the one you observed. If that probability is below your threshold, say yea, else say nay.

Here are some things that are wrong/bad with that approach: the 5% threshold is completely arbitrary, the sampling distribution under the alternative hypothesis is not taken into consideration (i.e. you only care about type I errors), and you don’t have any way to balance the cost of type I vs type II errors. (Never mind the fact that people ALWAYS just use t-tests and ignore the fact that their datapoints are not actually distributed Normally and with the same means and variances. That, at least, I can tell you how to fix.)

Compare this with the Bayesian decision theory version of hypothesis testing: you assign a cost to each of the two types of error, calculate the posterior probability of each hypothesis, based on the observations and incorporating any prior knowledge if you have it, calculate the threshold that minimizes your expected cost, and accept or reject based on that. Doesn’t that just make more sense?
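Here is a minimal sketch of that decision rule in Ruby, assuming the simplest possible setup: two point hypotheses about a coin's heads probability. The method names, priors, costs and data are all made up for illustration. The only real content is the threshold: choosing H1 costs (1 - posterior) * cost_false_positive on average, choosing H0 costs posterior * cost_false_negative, so you pick H1 when the posterior exceeds cost_false_positive / (cost_false_positive + cost_false_negative).

```ruby
## Binomial likelihood of k heads in n flips with heads probability prob.
def binomial_likelihood(k, n, prob)
  choose = (1..k).reduce(1.0) { |acc, i| acc * (n - k + i) / i }
  choose * prob**k * (1 - prob)**(n - k)
end

## Posterior probability of H1, given two simple hypotheses and a prior on H1.
def posterior_h1(k, n, p0:, p1:, prior_h1:)
  weight0 = binomial_likelihood(k, n, p0) * (1 - prior_h1)
  weight1 = binomial_likelihood(k, n, p1) * prior_h1
  weight1 / (weight0 + weight1)
end

## Posterior level above which choosing H1 has the lower expected cost.
def decision_threshold(cost_false_positive:, cost_false_negative:)
  cost_false_positive.to_f / (cost_false_positive + cost_false_negative)
end

## Pick the hypothesis with the lower expected cost.
def decide(k, n, p0:, p1:, prior_h1:, cost_false_positive:, cost_false_negative:)
  posterior = posterior_h1(k, n, p0: p0, p1: p1, prior_h1: prior_h1)
  threshold = decision_threshold(cost_false_positive: cost_false_positive,
                                 cost_false_negative: cost_false_negative)
  posterior > threshold ? :reject_h0 : :accept_h0
end

## Illustrative numbers only: 8 heads in 10 flips, fair coin vs. 80% heads.
puts decide(8, 10, p0: 0.5, p1: 0.8, prior_h1: 0.5,
            cost_false_positive: 1, cost_false_negative: 1)
puts decide(8, 10, p0: 0.5, p1: 0.8, prior_h1: 0.5,
            cost_false_positive: 9, cost_false_negative: 1)
```

Same data, same prior, but the answer flips once a false positive becomes nine times as expensive as a false negative. That is the part the fixed 5% cutoff has no way to express.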

I highly recommend the book Bayesian Computation with R. (Although it doesn’t actually talk about decision theory!) It has an associated blog: LearnBayes.

Other things to look at: William H. Jefferys’s Stats 295 class materials (especially these slides, which I’m still working my way through), and his blog for the class.

Simpson’s Paradox

I found a really cool visual explanation of Simpson’s Paradox on the Wikipedier.

Informally, Simpson’s Paradox states that, if you and I are competing, and I do better than you in category A, and I also do better than you in category B, my overall score for both categories combined could actually be worse than yours. The Wikipedia article gives a real-life example:

“In both 1995 and 1996, [David] Justice had a higher batting average […] than [Derek] Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice.”

And there’s also a famous case about Berkeley’s graduate admission rates from the ’70s, where the overall admission rate was lower for women than for men. Turns out that if you break it down by department, most departments showed no such bias, and several actually had a higher admission rate for women.

This all sounds crazy until you stare at the picture above for a while. The slopes of the lines are the percentages. Both solid blue vectors have smaller slopes than their corresponding solid red vectors, but when you add them (shown as dashed lines), the blue sum has a bigger slope than the red sum.

What the picture really makes clear is that a ratio or a percentage is not a complete description of the situation. Knowing a percentage is equivalent to knowing the angle of a vector without knowing its magnitude. You can see from the picture that this isn’t a weird corner case; there are many choices for the second blue vector that would have the same result.
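The same point in code, as a small Ruby sketch. Each season is a vector of (at-bats, hits), the batting average is its slope, and combining seasons is vector addition rather than averaging the averages. The Season struct is my own invention, and the counts are the Jeter/Justice figures as I remember them from the Wikipedia article, so check them there before quoting.

```ruby
## A season as a vector: at_bats along x, hits along y.
Season = Struct.new(:at_bats, :hits) do
  ## The batting average is the slope of the vector.
  def average
    hits.to_f / at_bats
  end

  ## Combining seasons adds the vectors component-wise.
  def +(other)
    Season.new(at_bats + other.at_bats, hits + other.hits)
  end
end

jeter   = { 1995 => Season.new(48, 12),   1996 => Season.new(582, 183) }
justice = { 1995 => Season.new(411, 104), 1996 => Season.new(140, 45) }

[1995, 1996].each do |year|
  puts format("%d      Jeter %.3f  Justice %.3f",
              year, jeter[year].average, justice[year].average)
end

jeter_total   = jeter.values.reduce(:+)
justice_total = justice.values.reduce(:+)
puts format("combined  Jeter %.3f  Justice %.3f",
            jeter_total.average, justice_total.average)
```

Justice has the steeper vector in each year, but his long vector is the shallow one and Jeter's long vector is the steep one, so the sums come out the other way around.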

It’s been probably 10-12 years since I learned about Simpson’s Paradox in some undergrad stats class. Now I finally really understand it.

A brilliant idea

Filter comments which contain incorrect punctuation or misspellings. In this case it’s YouTube comments (and apparently I’m not the only one who has a problem describing them without using the word “cesspool”), but frankly Reddit could use it just as well. The results are compelling: there’s a very strong correlation between comment stupidity and poor spelling/punctuation.

This is brilliant in its simplicity, and it would be interesting to see how this compares to an approach using StupidFilter, which I guess is some kind of binary SVM classifier.

Advantages of the simple approach:

  1. Far more debuggable and understandable than the output of SVM or a Bayesian classifier.
  2. Much quicker to execute!
  3. Wide-spread usage would have a positive effect on society as a whole.
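
To show how little machinery the simple approach needs, here is a rough Ruby sketch. It is not the actual filter from the link. The word list, the punctuation rules and the 20% cutoff are all assumptions of mine, and a real version would load a proper dictionary.

```ruby
require "set"

## Hypothetical tiny word list; a real filter would load a full dictionary.
KNOWN_WORDS = Set.new(%w[this is a the video was great i really like it song and])

## Fraction of words in the text that are not in the dictionary.
def misspelled_ratio(text, dictionary = KNOWN_WORDS)
  words = text.downcase.scan(/[a-z']+/)
  return 0.0 if words.empty?
  words.count { |word| !dictionary.include?(word) }.to_f / words.size
end

## Crude punctuation checks: all caps, repeated !/?, no leading capital.
def bad_punctuation?(text)
  shouting = text.match?(/[A-Z]/) && !text.match?(/[a-z]/)
  repeated_marks = text.match?(/[!?]{2,}/)
  no_capital_start = !text.match?(/\A\s*[A-Z]/)
  shouting || repeated_marks || no_capital_start
end

## Keep a comment only if it passes both checks.
def keep_comment?(text, max_misspelled: 0.2)
  !bad_punctuation?(text) && misspelled_ratio(text) <= max_misspelled
end

["This video was great.", "OMG THIS IS GR8!!!!", "I realy liek teh song."].each do |comment|
  puts "#{keep_comment?(comment) ? 'keep' : 'drop'}: #{comment}"
end
```

When this drops a comment you can say exactly which rule fired, which is the debuggability point from the list.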

An awesome function

As seen in a random Reddit comment:

def yesterdays_date():
    yesterday = time.localtime()
    return yesterday

Git 1.6.0 changes

Git 1.6.0 (just released) now detects Ruby class, module and method definitions in diff output. Previously it was just class names. (This patch.)

Other things I’m excited about in the new Git:

  1. git-clone --mirror is a handy way to set up a bare mirror repository.
  2. git-diff --check now checks for leftover merge conflict markers.
  3. git-stash save now has a --keep-index option. This lets you stash away the local changes and bring the changes staged in the index to your working tree for examination and testing.
  4. git-stash also has a new branch subcommand to create a new branch out of stashed changes.

Trollop news

Looks like there was a Ruby Inside article featuring Trollop a few weeks ago. Partially as a result of this, I have at least two other people contributing patches. For a project that’s been around for a few years and basically had no one but me use it, that’s a nice change of pace.

I’ve also moved it over from SVN to git (hosted on Gitorious), which probably will help some.

Google textfile auto-titling

If you search for “ditz readme” on the Googles, the correct result, which is a text file and not an HTML page, appears with the title “DitzãŪREADME”. This is probably because there’s a link to it titled as such in this Japanese description of a Ditz emacs mode. Apparently Google prefers that title over the link just called “README” on the the Ditz main page.

Ditz 0.4, and the magic of Ruby DSLs

I’ve just released Ditz 0.4. The big-ticket item in this release is the plugin system, which makes it very easy to tweak Ditz’s models, views and controllers. There’s an included git plugin which does some nice things like linking git commits and git branches to individual Ditz issues.

The new bash completion is pretty nice too. The completion code has been reworked a bit and now ties in very nicely with the argument processing. Check out this code from Ditz’s controller (operator.rb, for those following along from your repo):

operation :start, "Start work on an issue", :unstarted_issue
def start project, config, issue
  ## ...
end

Just by calling the operation method, we get:

  • Argument checking. There must be one argument to ‘ditz start’, and it must be an unstarted issue. Any violations are handled nicely for us without having to invoke the method.
  • Help messages in ‘ditz help’ and ‘ditz help start’.
  • Argument completion. Running ’ditz start ’ outputs a list of possible completions for a command (in this case, all unstarted issues) and then exits. Shell completion scripts can parse this output and present it to you when you hit tab.

So that’s a little DSL that I think turned out well.
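
If you're wondering how one class-level call can buy all three of those, here is a stripped-down sketch of the idea. This is not the code from operator.rb: MiniOperator, its Operation struct and the run/help/completions methods are hypothetical names, and it handles only one argument type. The call to operation just records metadata in a registry, and everything else reads from that registry.

```ruby
## Hypothetical sketch of a declarative command registry; not Ditz's operator.rb.
class MiniOperator
  Operation = Struct.new(:name, :description, :arg_types)

  class << self
    def operations
      @operations ||= {}
    end

    ## Class-level DSL call: records metadata for the command of that name.
    def operation(name, description, *arg_types)
      operations[name] = Operation.new(name, description, arg_types)
    end
  end

  def initialize(unstarted_issues)
    @unstarted_issues = unstarted_issues
  end

  ## One help line per command, generated from the registry.
  def help
    self.class.operations.values.map { |op| "#{op.name}: #{op.description}" }
  end

  ## Possible completions for a command, derived from its declared argument type.
  def completions(name)
    op = self.class.operations.fetch(name)
    op.arg_types.include?(:unstarted_issue) ? @unstarted_issues : []
  end

  ## Checks the arguments against the declaration before invoking the method.
  def run(name, *args)
    op = self.class.operations.fetch(name) { raise ArgumentError, "unknown command #{name}" }
    unless args.size == op.arg_types.size
      raise ArgumentError, "#{name} takes #{op.arg_types.size} argument(s)"
    end
    if op.arg_types.first == :unstarted_issue && !@unstarted_issues.include?(args.first)
      raise ArgumentError, "#{args.first} is not an unstarted issue"
    end
    send(name, *args)
  end

  operation :start, "Start work on an issue", :unstarted_issue
  def start(issue)
    "started #{issue}"
  end
end

operator = MiniOperator.new(["issue-1", "issue-2"])
puts operator.help
puts operator.completions(:start).join(" ")
puts operator.run(:start, "issue-1")
```

The method body never sees a bad argument, because run rejects it first using nothing but the declaration.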

Writing a plugin is also nicely DSLified. Here are some examples from plugin/git.rb:

class Issue
  field :git_branch, :ask => false

  def git_commits
    ## ...
  end
end

Here we reopen the Issue class and add a field called git_branch, and we specify that the UI shouldn’t ask for this field when an issue is created, since I decided that would be too annoying. (We’ll see how we allow the user to explicitly set it below.) We also add a method that’s responsible for actually getting the commits out of git.

Since our configuration file is just a Ditz model object, we can do the same thing to add the configuration parameters we need:

class Config
  field :git_commit_url_prefix,
    :prompt => "URL prefix (if any) to link git commits to"
  field :git_branch_url_prefix,
    :prompt => "URL prefix (if any) to link git branches to"
end

We’ll use those two fields to add some links when we generate HTML.
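
For the curious, a field declaration like the ones above could be backed by something as small as the following. This is a hypothetical stand-in, not Ditz's actual model code: MiniModel, MiniIssue and prompted_fields are names I made up. The only options it understands are :prompt and :ask, and it assumes :ask defaults to true.

```ruby
## Hypothetical stand-in for a model base class; not Ditz's real implementation.
class MiniModel
  ## Per-class registry of declared fields and their options.
  def self.fields
    @fields ||= {}
  end

  ## Declares an attribute and remembers its options (assumed default :ask => true).
  def self.field(name, opts = {})
    fields[name] = { :ask => true }.merge(opts)
    attr_accessor name
  end

  ## Fields the UI should prompt for when an object is created.
  def self.prompted_fields
    fields.select { |_name, opts| opts[:ask] }.keys
  end
end

class MiniIssue < MiniModel
  field :title, :prompt => "Title"
end

## A plugin reopens the class later and adds its own field.
class MiniIssue
  field :git_branch, :ask => false
end

issue = MiniIssue.new
issue.git_branch = "feature-x"
puts MiniIssue.fields.keys.inspect
puts MiniIssue.prompted_fields.inspect
puts issue.git_branch
```

Because the registry lives on the class, a plugin loaded later just appends to it, and the creation prompt skips anything declared with :ask => false.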

Adding commands to Ditz’s controller is just as easy. Just reopen the class:

class Operator
  operation :set_branch, "Set the git feature branch of an issue",
            :issue, :maybe_string
  def set_branch project, config, issue, maybe_string
    ## ...
  end
end

So now we have a set-branch command that takes an issue name, and an optional branch name. And it’s a first-class citizen alongside every other command: shows up in the help page, has argument auto-completion, etc.

Finally, let’s see how we modify the views. One thing we’d like to see is the git branch for an issue, if it’s been set.

class ScreenView
  add_to_view :issue_summary do |issue, config|
    " Git branch: #{issue.git_branch || 'none'}\n"
  end
end

Here we’ve opened up the ScreenView class (which is used for generating the screen output, as opposed to the HTML output) and added a closure to the summary section, which prints out the value of the model field we added above. The HTML version is similar:

class HtmlView
  add_to_view :issue_summary do |issue, config|
    next unless issue.git_branch
    [{ :issue => issue,
       :url_prefix => config.git_branch_url_prefix }, <<EOS]
  Git branch:
  <td class='attrname'>Git branch:</td>
  <td class='attrval'>
    <%= url_prefix ?
      link_to([url_prefix, issue.git_branch].join,
        issue.git_branch) :
      h(issue.git_branch) %>
  </td>
EOS
  end
end

The HTML generation returns ERB, and a hash of variables necessary for resolving it. (In this case we could also have used string substitution, but that’s not always the case—you might want to make use of variables that are only available at generation time.) Note that it also makes use of some of the convenient helper functions (like link_to and h), which I’ve helpfully defined for you in html.rb.
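
Here is a rough sketch of how a view could collect those blocks and resolve the ERB they return. It is an illustration under my own assumptions, not what html.rb does: MiniHtmlView, BranchIssue and render are hypothetical names, and the template only uses plain variables because this sketch does not wire up helpers like link_to and h.

```ruby
require "erb"

## Hypothetical sketch of a view with pluggable sections; not Ditz's html.rb.
class MiniHtmlView
  ## Registry of blocks, keyed by section name.
  def self.view_additions
    @view_additions ||= Hash.new { |hash, key| hash[key] = [] }
  end

  ## Registers a block that contributes to the named section.
  def self.add_to_view(section, &block)
    view_additions[section] << block
  end

  ## Calls each block; a block returns [variables, erb_template], or nil to skip.
  def render(section, issue, config)
    self.class.view_additions[section].map do |block|
      vars, template = block.call(issue, config)
      next "" unless template
      ERB.new(template).result_with_hash(vars)
    end.join
  end
end

## Illustrative issue type with just the fields this example needs.
BranchIssue = Struct.new(:name, :git_branch)

MiniHtmlView.add_to_view :issue_summary do |issue, _config|
  next unless issue.git_branch
  [{ :branch => issue.git_branch }, "<td class='attrval'><%= branch %></td>\n"]
end

view = MiniHtmlView.new
puts view.render(:issue_summary, BranchIssue.new("fix-it", "feature-x"), nil)
```

Resolving the template late is what lets a block refer to variables that only exist at generation time.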

So that’s how to modify Ditz’s models, views and controllers in one easy go. You can add fields and helper methods to model objects (including the configuration object), you can add commands to the controller, and you can add view elements to the screen and HTML output.

For reference, the complete source code to the git plugin is here.

Vim ruby syntax comment reformatting

The vim ruby syntax seems to screw up comments that have multiple hashes. E.g. I like to differentiate

### section heading comments,
## non-inline comments, and
x = a + b # inline comments

But reformatting the comments (e.g. with “gq}”) always screws them up, unless you do:

$ mkdir -p ~/.vim/after/syntax
$ cat > ~/.vim/after/syntax/ruby.vim <<EOF
set comments=n:#
EOF

which tells vim that multiple hash marks are ok.
