R, dataframes, and "by"

If you've used R for any reasonable amount of time, you know that the by() function is frustratingly close to being a fantastically useful tool for dataframe manipulation. It lets you break a dataframe down into subsets, as defined by a factor, and run a function on each of those subsets.

Unfortunately, if you try and actually use it, you will waste several years of your precious life-juices trying to coerce its output into something, anything, you can actually use. For some insane reason by() feels the need to return an object of class "by", which is a list underneath, and which is therefore impossible to actually manipulate in any reasonable way. Trust me. List processing is not the answer for tabular goddamn numeric data processing.
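
If you want to see the problem for yourself, here's a minimal sketch on a made-up three-row dataframe (the names `df`, `g`, and `x` are just for illustration, not from any real dataset):

```r
# Toy data: a grouping column and one numeric column (invented for this example)
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Run a function on each subset, as by() is meant to do
res <- by(df, df$g, function(s) colMeans(s[, "x", drop = FALSE]))

class(res)          # "by"
is.data.frame(res)  # FALSE
```

You put a dataframe in, and what comes back is not a dataframe.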

What you DO want is a function that takes a dataframe, and a factor, and a function, just like by(), but which returns a fucking dataframe. Just like you started with. Dataframes are your friend.

So here's a function that acts just like by(), but it returns a dataframe. It takes the name of the factor (as in, a string) as the second argument, and it helpfully labels a column in the resulting dataframe with those values. Left as an exercise to the reader: extend it to take a list of factors, like aggregate(), instead of the factor name (which is decidedly non-R-like), and to use the list labels as column labels.

One word of caution: do not look up what do.call() does, and do not try and understand why it's necessary below. That way leads to madness.

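wby <- function(data, factor.name, func) {
  f <- data[, factor.name]
  # by() gives one result per factor level; rbind stacks them into rows
  d <- data.frame(do.call(rbind, by(data, f, func)))
  # the row names are the factor levels; keep them as a labelled column
  d <- cbind(d, row.names(d))
  names(d)[length(names(d))] <- factor.name
  row.names(d) <- NULL
  d
}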
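
Here's a quick sanity check on a toy dataframe. Treat it as a sketch: the `scores` data and the `col_means` helper are made up for illustration, and the definition of wby is repeated (ending with `d` and a closing brace) so the snippet runs on its own.

```r
wby <- function(data, factor.name, func) {
  f <- data[, factor.name]
  d <- data.frame(do.call(rbind, by(data, f, func)))
  d <- cbind(d, row.names(d))
  names(d)[length(names(d))] <- factor.name
  row.names(d) <- NULL
  d  # return the assembled dataframe
}

# Invented example data: two teams, two rows each
scores <- data.frame(
  team  = c("red", "red", "blue", "blue"),
  hits  = c(1, 3, 10, 20),
  tries = c(2, 6, 40, 40)
)

# Hypothetical per-subset function: column means of the numeric columns
col_means <- function(s) colMeans(s[, c("hits", "tries")])

out <- wby(scores, "team", col_means)
print(out)
# One row per team: blue has hits 15 and tries 40, red has hits 2 and tries 4,
# with the team labels in a column called "team".
```

Depending on your R version's stringsAsFactors default, the label column may come back as a factor rather than as character strings.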

Simpson's Paradox

I found a really cool visual explanation of Simpson's Paradox on the Wikipedier.

Informally, Simpson's Paradox states that, if you and I are competing, and I do better than you in category A, and I also do better than you in category B, my overall score for both categories combined could actually be worse than yours. The Wikipedia article gives a real-life example:

"In both 1995 and 1996, [David] Justice had a higher batting average [...] than [Derek] Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice."

And there's also a famous case from the 70s about Berkeley's admission rates for women, where the university was accused of bias because the overall admission rate was lower for women than for men. Turns out that if you break it down by department, most departments actually had a higher admission rate for women.

This all sounds crazy until you stare at the picture above for a while. The slopes of the lines are the percentages. Both solid blue vectors have smaller slopes than their corresponding solid red vectors, but when you add them up (the sums are shown as dashed lines), the blue sum has a bigger slope than the red sum.

What the picture really makes clear is that a ratio or a percentage is not a complete description of the situation. Knowing a percentage is equivalent to knowing the angle of a vector without knowing its magnitude. You can see from the picture that this isn't a weird corner case; there are many choices for the second blue vector that would have the same result.
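
To make the vector picture concrete, here's a small sketch with invented numbers (they are not the baseball or Berkeley figures). Each row is one vector: `tries` is the horizontal component, `hits` is the vertical one, and the slope is the percentage.

```r
# Made-up numbers, chosen only to reproduce the effect
blue <- data.frame(tries = c(5, 8), hits = c(1, 6), row.names = c("A", "B"))
red  <- data.frame(tries = c(8, 5), hits = c(2, 4), row.names = c("A", "B"))

# Slope of each individual vector (the per-category percentage)
blue$hits / blue$tries   # 0.20 0.75
red$hits  / red$tries    # 0.25 0.80  (red wins both categories)

# Slope of the summed vector (the combined percentage)
sum(blue$hits) / sum(blue$tries)   # 7/13, about 0.54
sum(red$hits)  / sum(red$tries)    # 6/13, about 0.46  (blue wins overall)
```

Blue's big vector is the steep one and red's big vector is the shallow one, so the sums tilt in opposite directions even though red is steeper in each category.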

It's been probably 10-12 years since I learned about Simpson's Paradox in some undergrad stats class. Now I finally really understand it.
