Saturday, May 5, 2012

Visualizing tables in ggplot2

Recently I wanted to recreate  assocplot  using  ggplot2. In the end I propose a simple way to visualize data arranged two-way tables using geom_tile.

I used Titanic data set as an example combining age and sex dimensions to get two-way data.

I plot residuals of Chi-squared test (like in assocplot) on the left and probability of survival on the right. A nice feature of geom_tile is that nicely highlights missing data (children were not crew members). Here is a code generating the plots:


m <- acast(melt(unclass(Titanic)), Class ~ Age + Sex, sum)
names(dimnames(m)) <- c("Class", "Age_Sex")
df <- melt(unclass(chisq.test(m)$res), = "residuals")
g1 <- ggplot(df, aes(x = Class, y = Age_Sex)) +
  geom_tile(aes(fill = residuals)) +
  scale_fill_gradientn(colours=c("blue","white","red")) +

m <- acast(melt(unclass(Titanic)), Class~Age+Sex,
           function(x) {x[2] / sum(x)})
names(dimnames(m)) <- c("Class","Age_Sex")
df <- melt(m, = "survived")
g2 <- ggplot(df, aes(x = Class, y = Age_Sex)) +
  geom_tile(aes(fill = survived)) +
  scale_fill_gradient(low = "blue", high = "red")+theme_bw()

pushViewport(viewport(layout = grid.layout(1, 2)))
print(g1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(g2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))

And the result:


  1. Can you please explain what the 'survived' scale represents.

    Thanks, I really like this idea.

  2. "survived" scale represents probability that a member of given group (intersection of Class/Age/Sex) survived in Titanic accident.

    melt(unclass(Titanic)) results in a data frame with 5 columns:

    > head(melt(unclass(Titanic)))
    Class Sex Age Survived value
    1 1st Male Child No 0
    2 2nd Male Child No 0
    3 3rd Male Child No 35
    4 Crew Male Child No 0
    5 1st Female Child No 0
    6 2nd Female Child No 0

    Then if we use formula "Class~Age+Sex" a two element vector is passed to function containing counts for "survived" levels "No" and "Yes".

    You can find it out by inspecting the following code:

    > acast(melt(unclass(Titanic)), Class~Age+Sex,
    + function(x) { str(x); x[2] / sum(x)})
    num [1:2] 4 140
    num [1:2] 13 80
    num [1:2] 89 76
    num [1:2] 3 20
    num [1:2] 118 57
    num [1:2] 154 14
    num [1:2] 387 75
    num [1:2] 670 192
    num [1:2] 0 1
    num [1:2] 0 13
    num [1:2] 17 14
    num [1:2] 0 0
    num [1:2] 0 5
    num [1:2] 0 11
    num [1:2] 35 13
    num [1:2] 0 0
    Adult_Female Adult_Male Child_Female Child_Male
    1st 0.9722222 0.32571429 1.0000000 1.0000000
    2nd 0.8602151 0.08333333 1.0000000 1.0000000
    3rd 0.4606061 0.16233766 0.4516129 0.2708333
    Crew 0.8695652 0.22273782 NaN NaN

    Interestingly the first "num(0)" in the output is due to additional call of function in vaggregate because no "fill" argument is passed to acast.

  3. Very cool.

    I'm having trouble following some of the code, though. In the acast function, why does x[2] return the number of survivors for each group?

    1. x[2] is number of survivors with Survided=="Yes" in each intersection of Class/Age/Sex values.

      To see that first look at the code in my last explaining comment. Then notice that:
      1) In formula:
      m <- acast(melt(unclass(Titanic)), Class~Age+Sex,
      function(x) {x[2] / sum(x)})
      we do not supply value.var argument so "value" column is taken as values for calculations (this is a default column choice of acast).

      2) In formula Class~Age+Sex we did not use Survived variable. It is a factor with levels: No Yes (order is important) so our acast will pass to the function two values from column "value" in this order (corresponding to "No" and "Yes" counts). So x[2] is Yes count and sum(x) is total count.

      Notice that in acast documentation you have information that: "If the combination of variables you supply does not uniquely identify one row in the original data set, you will need to supply an aggregating function, fun.aggregate. This function should take a vector of numbers and return a single summary statistic.".
      This is exactly what happens here.

  4. Thanks for the quick response!

    I now understand why [2] would yield all the "Yes" values, but I'm still a bit confused as to why x corresponds to the survived column. Is this because acast looks for the first argument not specified in the formula?

    Thanks again.

    1. acast looks at column with name "value" by default.
      The data frame passed to acast looks like this:

      > head(melt(unclass(Titanic)))
      Class Sex Age Survived value
      1 1st Male Child No 0
      2 2nd Male Child No 0
      3 3rd Male Child No 35
      4 Crew Male Child No 0
      5 1st Female Child No 0
      6 2nd Female Child No 0

      so acast gets data from "value" column
      ORDERED by values in Survived column because Survived was not used in the formula and in data frame melt(unclass(Titanic)) "No" values are before "Yes" values.

      If you used the following code:

      v <- melt(unclass(Titanic))
      v <- v[order(v$Survived, decreasing=T),]
      m <- acast(v, Class~Age+Sex, function(x) {x[2] / sum(x)})

      you would get the following table:
      > m
      Adult_Female Adult_Male Child_Female Child_Male
      1st 0.02777778 0.6742857 0.0000000 0.0000000
      2nd 0.13978495 0.9166667 0.0000000 0.0000000
      3rd 0.53939394 0.8376623 0.5483871 0.7291667
      Crew 0.13043478 0.7772622 NaN NaN

      with probabilities of "No" because in data frame v now "Yes" values are before "No" values.

    2. Another way to do this (maybe a bit cleaner) is through use of plyr package:

      ddply(melt(unclass(Titanic)),.(Class,Sex,Age), function(x) { sum(subset(x,Survived == "Yes", value)) / sum(x$value)})

  5. Hello, I tried your code but it can not display the graphical class you think you know the problem?
    it displays black.
    Thank you in advance

  6. Maybe your default graphic device is not wide enough.
    Try widening it using the mouse.