I used Titanic data set as an example combining age and sex dimensions to get two-way data.
I plot residuals of Chi-squared test (like in assocplot) on the left and probability of survival on the right. A nice feature of geom_tile is that nicely highlights missing data (children were not crew members). Here is a code generating the plots:
library(ggplot2)
library(grid)
library(reshape2)
m <- acast(melt(unclass(Titanic)), Class ~ Age + Sex, sum)
names(dimnames(m)) <- c("Class", "Age_Sex")
df <- melt(unclass(chisq.test(m)$res), value.name = "residuals")
g1 <- ggplot(df, aes(x = Class, y = Age_Sex)) +
geom_tile(aes(fill = residuals)) +
scale_fill_gradientn(colours=c("blue","white","red")) +
theme_bw()
m <- acast(melt(unclass(Titanic)), Class~Age+Sex,
function(x) {x[2] / sum(x)})
names(dimnames(m)) <- c("Class","Age_Sex")
df <- melt(m, value.name = "survived")
g2 <- ggplot(df, aes(x = Class, y = Age_Sex)) +
geom_tile(aes(fill = survived)) +
scale_fill_gradient(low = "blue", high = "red")+theme_bw()
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(g1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(g2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
And the result:
Can you please explain what the 'survived' scale represents.
ReplyDeleteThanks, I really like this idea.
"survived" scale represents probability that a member of given group (intersection of Class/Age/Sex) survived in Titanic accident.
ReplyDeletemelt(unclass(Titanic)) results in a data frame with 5 columns:
> head(melt(unclass(Titanic)))
Class Sex Age Survived value
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
Then if we use formula "Class~Age+Sex" a two element vector is passed to function containing counts for "survived" levels "No" and "Yes".
You can find it out by inspecting the following code:
> acast(melt(unclass(Titanic)), Class~Age+Sex,
+ function(x) { str(x); x[2] / sum(x)})
num(0)
num [1:2] 4 140
num [1:2] 13 80
num [1:2] 89 76
num [1:2] 3 20
num [1:2] 118 57
num [1:2] 154 14
num [1:2] 387 75
num [1:2] 670 192
num [1:2] 0 1
num [1:2] 0 13
num [1:2] 17 14
num [1:2] 0 0
num [1:2] 0 5
num [1:2] 0 11
num [1:2] 35 13
num [1:2] 0 0
Adult_Female Adult_Male Child_Female Child_Male
1st 0.9722222 0.32571429 1.0000000 1.0000000
2nd 0.8602151 0.08333333 1.0000000 1.0000000
3rd 0.4606061 0.16233766 0.4516129 0.2708333
Crew 0.8695652 0.22273782 NaN NaN
Interestingly the first "num(0)" in the output is due to additional call of function in vaggregate because no "fill" argument is passed to acast.
Very cool.
ReplyDeleteI'm having trouble following some of the code, though. In the acast function, why does x[2] return the number of survivors for each group?
x[2] is number of survivors with Survided=="Yes" in each intersection of Class/Age/Sex values.
DeleteTo see that first look at the code in my last explaining comment. Then notice that:
1) In formula:
m <- acast(melt(unclass(Titanic)), Class~Age+Sex,
function(x) {x[2] / sum(x)})
we do not supply value.var argument so "value" column is taken as values for calculations (this is a default column choice of acast).
2) In formula Class~Age+Sex we did not use Survived variable. It is a factor with levels: No Yes (order is important) so our acast will pass to the function two values from column "value" in this order (corresponding to "No" and "Yes" counts). So x[2] is Yes count and sum(x) is total count.
Notice that in acast documentation you have information that: "If the combination of variables you supply does not uniquely identify one row in the original data set, you will need to supply an aggregating function, fun.aggregate. This function should take a vector of numbers and return a single summary statistic.".
This is exactly what happens here.
Thanks for the quick response!
ReplyDeleteI now understand why [2] would yield all the "Yes" values, but I'm still a bit confused as to why x corresponds to the survived column. Is this because acast looks for the first argument not specified in the formula?
Thanks again.
acast looks at column with name "value" by default.
DeleteThe data frame passed to acast looks like this:
> head(melt(unclass(Titanic)))
Class Sex Age Survived value
1 1st Male Child No 0
2 2nd Male Child No 0
3 3rd Male Child No 35
4 Crew Male Child No 0
5 1st Female Child No 0
6 2nd Female Child No 0
so acast gets data from "value" column
ORDERED by values in Survived column because Survived was not used in the formula and in data frame melt(unclass(Titanic)) "No" values are before "Yes" values.
If you used the following code:
v <- melt(unclass(Titanic))
v <- v[order(v$Survived, decreasing=T),]
m <- acast(v, Class~Age+Sex, function(x) {x[2] / sum(x)})
you would get the following table:
> m
Adult_Female Adult_Male Child_Female Child_Male
1st 0.02777778 0.6742857 0.0000000 0.0000000
2nd 0.13978495 0.9166667 0.0000000 0.0000000
3rd 0.53939394 0.8376623 0.5483871 0.7291667
Crew 0.13043478 0.7772622 NaN NaN
with probabilities of "No" because in data frame v now "Yes" values are before "No" values.
Another way to do this (maybe a bit cleaner) is through use of plyr package:
Deletelibrary(plyr)
ddply(melt(unclass(Titanic)),.(Class,Sex,Age), function(x) { sum(subset(x,Survived == "Yes", value)) / sum(x$value)})
Hello, I tried your code but it can not display the graphical class you think you know the problem?
ReplyDeleteit displays black.
Thank you in advance
Maybe your default graphic device is not wide enough.
ReplyDeleteTry widening it using the mouse.