Thursday, November 10, 2011

Applying multiple functions to data frame

A very typical task in data analysis is calculating summary statistics for each variable in a data frame. The standard lapply and sapply functions work very nicely for this, but they apply only a single function. The problem is that I often want to calculate several different statistics of the data. For example, assume that we want to calculate the minimum, maximum and mean value of each variable in a data frame.

The simplest solution for this is to write a function that does all the calculations and returns a vector. The sample code is:

multi.fun <- function(x) {
      c(min = min(x), mean = mean(x), max = max(x))
}

It gives the following result for the cars data set:

> sapply(cars, multi.fun)
     speed   dist
min    4.0   2.00
mean  15.4  42.98
max   25.0 120.00

However, when I work in interactive mode I would prefer to have a function that accepts multiple functions as arguments. I came up with the following solution to this problem:

multi.sapply <- function(...) {
      arglist <- match.call(expand.dots = FALSE)$...
      var.names <- sapply(arglist, deparse)
      has.name <- (names(arglist) != "")
      var.names[has.name] <- names(arglist)[has.name]
      arglist <- lapply(arglist, eval.parent, n = 2)
      x <- arglist[[1]]
      arglist[[1]] <- NULL
      result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)
      colnames(result) <- var.names[-1]
      return(result)
}

My multi.sapply function takes the data (a data frame or a vector) as its first argument; the remaining arguments are the functions to be applied to it. Applying it to the cars data yields:

> multi.sapply(cars, min, mean, max)
      min  mean max
speed   4 15.40  25
dist    2 42.98 120

If a function argument is given a name, that name is used as the column name instead of the deparsed expression. The following example shows this by summarizing several statistics of the EuStockMarkets data set:

> log.returns <- data.frame(diff(log(EuStockMarkets)))
> multi.sapply(log.returns, sd, min,
+              VaR10 = function(x) quantile(x, 0.1))
              sd         min        VaR10
DAX  0.010300837 -0.09627702 -0.010862458
SMI  0.009250036 -0.08382500 -0.009696908
CAC  0.011030875 -0.07575318 -0.012354424
FTSE 0.007957728 -0.04139903 -0.009139666
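
The core of multi.sapply is the nested sapply call: the outer one walks over the functions and the inner one over the variables. A minimal sketch of just that step, assuming the functions are supplied as a named list instead of being captured with match.call (multi.sapply.list and funs are hypothetical names, not part of the function above):

```r
# Hypothetical helper: apply every function in a named list
# to every variable of x (a data frame or a vector)
multi.sapply.list <- function(x, funs) {
      # outer sapply traverses the functions, inner sapply the data
      sapply(funs, function(FUN) sapply(x, FUN))
}

# column names come from the names of the list
res <- multi.sapply.list(cars, list(min = min, mean = mean, max = max))
res
```

This variant is easier to call from other code, but it loses the convenience of typing the functions directly as arguments, which is the point of multi.sapply in interactive use.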

18 comments:

  1. Very cool. It would be really helpful if you added comments in your multi.sapply() function to explain what each step is doing.

  2. Nice idea.

    I think your code could be simpler with a named first argument: function(x,...)

    Then you could easily check the mode of x and have it work for vectors also.
    if(mode(x) == "list") { what you do now} else
    {a bit simpler}

  3. Nice!!! I'm gonna use it, thanks
    Eran

  4. Steen,
    I omitted naming the first argument because that name would then not be allowed as a name for a function. For example, if the function was defined as function(x,...) then a call like
    multi.sapply(1:10, x = function(x) paste("Call:", x))
    would not work properly.
    This code also shows that the function works properly on vectors, as its result is:
    x
    [1,] "Call: 1"
    [2,] "Call: 2"
    [3,] "Call: 3"
    [4,] "Call: 4"
    [5,] "Call: 5"
    [6,] "Call: 6"
    [7,] "Call: 7"
    [8,] "Call: 8"
    [9,] "Call: 9"
    [10,] "Call: 10"
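
    The name clash can be seen in a small sketch. It assumes a hypothetical function named.first with the signature function(x, ...); it is not part of multi.sapply and only reports where each argument ends up:

```r
# Hypothetical variant with a named first argument
named.first <- function(x, ...) {
      list(x.class      = class(x),
           dots.classes = sapply(list(...), class))
}

# the argument named x is matched to the formal x,
# so the data 1:10 falls through into ...
res <- named.first(1:10, x = function(x) paste("Call:", x))
res$x.class       # "function"
res$dots.classes  # "integer"
```

    R matches named arguments before positional ones, so the function meant as a summary statistic is taken as the data.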

  5. Erik,
    The code works as follows:

    # extract function arguments as list
    arglist <- match.call(expand.dots = FALSE)$...

    # deparse the expressions defining the
    # arguments as given in the multi.sapply call
    var.names <- sapply(arglist, deparse)

    # if an argument was given a name then its name is nonempty
    # if no argument names were given then has.name is logical(0)
    has.name <- (names(arglist) != "")

    # for all arguments that had a name substitute the deparsed
    # expression by the given name
    var.names[has.name] <- names(arglist)[has.name]

    # now evaluate the expressions given in arguments
    # go two generations back as we apply eval.parent
    # within the lapply function
    arglist <- lapply(arglist, eval.parent, n = 2)

    # first argument contains data set
    x <- arglist[[1]]

    # and here we remove it from the list
    arglist[[1]] <- NULL

    # we use sapply twice - outer traverses functions and inner data set
    # the outer sapply passes each function as the first argument,
    # so the anonymous function takes (FUN, x) and x is passed as an extra argument
    result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)

    # in defining column names
    # we remove first element as it was name of data set argument
    colnames(result) <- var.names[-1]
    return(result)

  6. In package doBy there is a function summaryBy, although each function has to be defined before the call.

    VaR10 <- function(x) quantile(x,0.1)
    summaryBy(.~1,data=cars,FUN=c(mean,sd,min,VaR10))

  7. Nice piece of code. You inspired me to also have a go at it. I tried to avoid writing a function myself and ended up using the reshape and plyr packages:

    library(reshape)
    library(plyr)
    ddply(melt(cars), .(variable), summarise, min = min(value), mean = mean(value), max = max(value))

    Although your solution is much more elegant for interactive use.

  8. ...and the result looks like this:

      variable min  mean max
    1    speed   4 15.40  25
    2     dist   2 42.98 120

  9. In plyr there is a function each that combines multiple functions into one.

    sapply(cars,each(mean, sd, min, max,
    Var10=function(x)
    unname(quantile(x,0.1))))

  10. I've learned so much from this post. Thanks for sharing your knowledge.

  11. How do I pass an additional variable along with the dataset being passed?

    Replies
    1. Unfortunately the code given here does not support it directly.

  12. If I want to apply four methods (subfunctions) to the same sample (a matrix) to calculate the p-values, what should I do? I notice that your function is only available for vectors.

  13. I do not see your code, but my code works with sapply. Probably you should replace it by other function appropriate for your case.

    Replies
    1. May I ask whether "sapply" is related to parallel computing? I want to assign each subfunction to its own core to compute the p-values at the same time. Do you have any recommendation about that?

    2. You can start by learning the parallel package.
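
      As a starting point, here is a minimal sketch of running several test functions on the same sample with mclapply from the parallel package. The test functions, the simulated sample and the number of cores are assumptions made for illustration; mclapply relies on forking, so on Windows the sketch falls back to a single core.

```r
library(parallel)

# Hypothetical test functions, each returning a p-value for one sample
tests <- list(
      t.test   = function(x) t.test(x)$p.value,
      wilcoxon = function(x) wilcox.test(x)$p.value
)

# simulated sample, used only for illustration
set.seed(1)
x <- rnorm(50, mean = 0.5)

# forking is not available on Windows, so use one core there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# each function is evaluated in its own worker process
p.values <- mclapply(tests, function(FUN) FUN(x), mc.cores = n.cores)
unlist(p.values)
```

      For functions this cheap the overhead of starting workers outweighs any gain; the pattern pays off only when each test is expensive.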

  14. Hi, this is very useful. But what if the data has missing or NA values? How can we modify the code in order to account for NA values?

    Replies
    1. You would have to pass properly wrapped base functions, e.g. pass
      function(x) mean(x, na.rm = TRUE)
      as an argument.
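
      A sketch of this on the built-in airquality data set, whose Ozone and Solar.R columns contain missing values. multi.sapply is copied unchanged from the post; the column names mean and n.missing are chosen here only for illustration.

```r
# multi.sapply as defined in the post
multi.sapply <- function(...) {
      arglist <- match.call(expand.dots = FALSE)$...
      var.names <- sapply(arglist, deparse)
      has.name <- (names(arglist) != "")
      var.names[has.name] <- names(arglist)[has.name]
      arglist <- lapply(arglist, eval.parent, n = 2)
      x <- arglist[[1]]
      arglist[[1]] <- NULL
      result <- sapply(arglist, function (FUN, x) sapply(x, FUN), x)
      colnames(result) <- var.names[-1]
      return(result)
}

# wrap the base functions so that missing values are handled explicitly
res <- multi.sapply(airquality[c("Ozone", "Solar.R")],
                    mean      = function(x) mean(x, na.rm = TRUE),
                    n.missing = function(x) sum(is.na(x)))
res
```

      Without the wrapper, mean would return NA for both columns.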

