Thursday, June 7, 2012

You should not use split in production code

Recently I stumbled on a problem with the split function applied to a list of factors: it can produce wrong splits when the levels of the splitting factors contain dots.
Here is an example of the problem. Invoking the following code:

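# stringsAsFactors = TRUE keeps x and y as factors (the default before R 4.0.0)
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)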

split(df, df[,-3])

produces:

$a.b.c
    x   y z
1   a b.c 1
2 a.b   c 2
3   a b.c 3
4 a.b   c 4
5   a b.c 5
6 a.b   c 6

$a.b.b.c
[1] x y z
<0 rows> (or 0-length row.names)

$a.c
[1] x y z
<0 rows> (or 0-length row.names)

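We can see that incorrect splits were produced: all six rows ended up in the single group a.b.c, although the data contain two distinct combinations of x and y. The reason is that split uses interaction to combine the list of factors passed to it, and interaction builds level names by pasting the original levels together with a dot. Here "a" combined with "b.c" and "a.b" combined with "c" both yield "a.b.c". One can see this problem by invoking: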

> interaction(df[,-3])
[1] a.b.c a.b.c a.b.c a.b.c a.b.c a.b.c
Levels: a.b.c a.b.b.c a.c
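
The collision can also be reproduced by hand. The following is only a minimal sketch, not the internals of split: it pastes the two colliding level pairs together and then calls interaction with a separator that does not occur in the data ("|" is an arbitrary choice made for illustration).

```r
# Same data as in the example above; x and y are forced to be factors
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)

# Two different (x, y) pairs give the same label when joined with "."
label1 <- paste("a", "b.c", sep = ".")
label2 <- paste("a.b", "c", sep = ".")
print(identical(label1, label2))

# With the default separator the four combinations collapse into three levels
default_levels <- levels(interaction(df[, -3]))
print(default_levels)

# With a separator that is absent from the data all four stay distinct
safe_levels <- levels(interaction(df[, -3], sep = "|"))
print(safe_levels)
```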

This might not be a huge issue in interactive mode, but in production code such behavior is a problem. There are three obvious ways to improve how split works:
  1. Rewrite the split internals to avoid this problem;
  2. Allow passing a sep parameter to split that would be passed on to interaction;
  3. Issue a warning if the number of levels in the combined factor does not equal the product of the numbers of levels of the factors being combined (assuming drop = FALSE).
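
The check described in the third option can be sketched in user code. The helper name checked_split is hypothetical and not part of base R; the sketch assumes f is a list (or data frame) of grouping variables and simply compares the number of combined levels with the product of the individual level counts before delegating to split.

```r
# Hypothetical helper (not part of base R): split with a collision check
checked_split <- function(x, f, drop = FALSE) {
  f <- lapply(f, as.factor)
  # Combine the factors the same way split does by default
  combined <- interaction(f, drop = FALSE)
  # Number of level combinations we expect if no labels collide
  expected <- prod(vapply(f, nlevels, integer(1)))
  if (nlevels(combined) != expected) {
    warning("combined level names collide; some groups were merged")
  }
  split(x, f, drop = drop)
}

df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)

# Record whether the warning fires for the problematic data
warned <- FALSE
res <- withCallingHandlers(
  checked_split(df, df[, -3]),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
print(warned)
```

The helper only detects the problem; the returned splits are still the merged ones that plain split produces.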
Until this issue is solved, there is a workaround using split, and two alternatives using by and dlply (from the plyr package):

# Workaround: split on the integer level codes, which contain no dots
split(df, lapply(df[,-3], function(f) as.integer(as.factor(f))))

#Alternative 1
by(df, df[,-3], identity)

#Alternative 2
library(plyr)
dlply(df,.(x,y))
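
For reference, here is a sketch of the second option, assuming an R version in which split accepts a sep argument and forwards it to interaction (current R releases do). The separator "|" is an arbitrary choice; any string that cannot occur in the factor levels would work.

```r
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6,
                 stringsAsFactors = TRUE)

# sep is handed on to interaction(), so the group labels no longer collide
res <- split(df, df[, -3], sep = "|")
print(names(res))

# drop = TRUE additionally removes the combinations that have no rows
res_nonempty <- split(df, df[, -3], sep = "|", drop = TRUE)
print(names(res_nonempty))
```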

4 comments:

  1. You're mistaken. There is nothing wrong with split(). You're just expecting it to behave differently than it was intended. split() was _specifically_ intended to split on the interaction of the list of factors you provide it. So the output you see is exactly correct.

    The issue is that you want a function that splits sequentially, first by one factor, and then the resulting pieces by the next factor, etc. As you note, there are other functions that do this. So your problem came from misusing split; it simply shouldn't be used in the manner you're trying to use it!

    Replies
    1. In that case I would like split at least to accept a sep parameter, exactly as interaction does.

      I would also like to see an example of a situation where such behavior is expected. That is, when using the split function, when would we want two factor combinations merged as in my example? I was unable to find one, which was the reason for writing this post.

  2. I often use colsplit from the reshape2 package for similar tasks. Maybe give it a try.

  3. For what it's worth, a sep argument was added to r-devel on June 26 by Brian Ripley in response to a bug report. I'm not sure whether this has made it into a patched/release version yet or not.
