Here is an example of the problem. Running the following code:
df <- data.frame(x = rep(c("a", "a.b"), 3),
y = rep(c("b.c", "c"), 3),
z = 1:6)
split(df, df[,-3])
produces:
$a.b.c
    x   y z
1   a b.c 1
2 a.b   c 2
3   a b.c 3
4 a.b   c 4
5   a b.c 5
6 a.b   c 6
$a.b.b.c
[1] x y z
<0 rows> (or 0-length row.names)
$a.c
[1] x y z
<0 rows> (or 0-length row.names)
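We can see that the splits are incorrect: all six rows land in a single group, although the data contain two distinct (x, y) combinations. The issue is that split uses interaction to combine the list of factors passed to it, and interaction joins the level labels with "." by default, so the pairs ("a", "b.c") and ("a.b", "c") both map to the label a.b.c. One can see this by invoking: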
> interaction(df[,-3])
[1] a.b.c a.b.c a.b.c a.b.c a.b.c a.b.c
Levels: a.b.c a.b.b.c a.c
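The collision can be reproduced with paste alone, because the combined labels are the component levels joined by the separator. A minimal sketch, reusing the df defined above (the variable names label1, label2 and f are only illustrative):

```r
# Two different (x, y) pairs produce the same label when joined with "."
label1 <- paste("a", "b.c", sep = ".")   # "a.b.c"
label2 <- paste("a.b", "c", sep = ".")   # "a.b.c"
identical(label1, label2)

# Same data as in the example above
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6)

# The combined factor has fewer levels than the 2 * 2 possible combinations
f <- interaction(df[, -3])
nlevels(f)
```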
This may not be a huge issue in interactive use, but in production code such behavior is a problem. There are three obvious ways to improve how split works:
- Rewrite split internals to avoid the problem;
- Allow passing a sep parameter to split that is passed on to interaction;
- Issue a warning if the number of levels in the combined factor does not equal the product of the numbers of levels of the individual factors (assuming drop = FALSE).
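The second and third suggestions can be approximated in user code. The following is only a sketch: split_checked is a hypothetical helper, not part of base R, and it assumes that comparing the level count with the product of the individual level counts is an adequate collision test.

```r
# Hypothetical helper (not part of base R): split on several factors and
# warn when distinct combinations collapse into the same label.
split_checked <- function(x, f, sep = ".") {
  f <- lapply(f, as.factor)
  combined <- interaction(f, sep = sep, drop = FALSE)
  # Without collisions there is one level per combination of input levels
  expected <- prod(vapply(f, nlevels, integer(1)))
  if (nlevels(combined) != expected) {
    warning("combined factor has ", nlevels(combined),
            " levels, expected ", expected,
            "; labels collide for sep = \"", sep, "\"")
  }
  split(x, combined)
}

df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6)

# "|" does not occur in any level, so no labels collide here
parts <- split_checked(df, df[, -3], sep = "|")
names(parts)
```

Calling split_checked(df, df[, -3]) with the default separator triggers the warning, since only three of the four expected levels survive.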
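# Workaround: split on integer codes, which cannot contain the separator.
# as.factor() is needed when the columns are character rather than factor.
split(df, lapply(df[,-3], function(col) as.integer(as.factor(col))))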
#Alternative 1
by(df, df[,-3], identity)
#Alternative 2
library(plyr)
dlply(df,.(x,y))
You're mistaken. There is nothing wrong with split(). You're just expecting it to behave differently than it was intended. split() was _specifically_ intended to split on the interaction of the list of factors you provide it. So the output you see is exactly correct.
The issue is that you want a function that splits sequentially, first by one factor, and then the resulting pieces by the next factor, etc. As you note, there are other functions that do this. So your problem came from misusing split; it simply shouldn't be used in the manner you're trying to use it!
In that case I would like split at least to accept a sep parameter, exactly as interaction does.
I would also like to see an example of a situation in which such behavior is expected. That is, when using split, when would we want two factors combined as in my example? I was unable to find one, which was the reason for writing this post.
I often use colsplit from the reshape2 package for similar tasks. Maybe give it a try.
For what it's worth, a sep argument was added to r-devel on June 26 by Brian Ripley in response to a bug report. I'm not sure whether this has made it into a patched/release version yet or not.
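In an R version that includes this change, the separator can be given to split directly. A minimal sketch, assuming such a version and assuming that "|" does not occur in any of the levels:

```r
df <- data.frame(x = rep(c("a", "a.b"), 3),
                 y = rep(c("b.c", "c"), 3),
                 z = 1:6)

# sep is passed on to interaction(), so the labels no longer collide
parts <- split(df, df[, -3], sep = "|")
names(parts)
```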