Friday, November 25, 2011

Working with isTRUE

This week I was running computations that transform input files into output files. The catch was that the process had to be repeated: whenever new input files were generated or old ones were updated, I needed to calculate new output files. The transformation was time consuming, so I wanted to run the calculations only when required.
My initial code was:

in.files <- list.files(pattern = glob2rx("*.in"))
out.files <- gsub("in$", "out", in.files)
skip <- file.info(in.files)$mtime < file.info(out.files)$mtime
for (i in seq(along.with = in.files)) {
      if (skip[i]) {
            # skip
      } else {
            # generate out.files[i] using in.files[i]
      }
}

It should skip files whose output file has a later modification time than the input file. However, I quickly learned that it fails when the output file is missing: file.info returns NA for missing files, so skip[i] is NA and the if statement stops with an error. The fix I found was to use the isTRUE function, which returns TRUE only if its argument is exactly TRUE and FALSE otherwise.
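The failure and the fix can be reproduced in isolation. The following is only a minimal sketch: missing.file is a hypothetical path created with tempfile() that is never written to disk, and res is a helper variable introduced for the illustration.

```r
# mtime of a file that does not exist is NA
missing.file <- tempfile(fileext = ".out")   # hypothetical path, never created
mtime <- file.info(missing.file)$mtime

# any comparison with NA is NA as well
skip <- Sys.time() < mtime

# a bare if (skip) stops with an error; tryCatch only makes that visible here
res <- tryCatch(if (skip) "skip" else "generate",
                error = function(e) "error")
res

# isTRUE turns NA into FALSE, so the file would be generated
isTRUE(skip)
```

With isTRUE a missing output file is treated the same way as an outdated one, which is what the loop needs.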

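The only shortcoming of the isTRUE function is that it returns FALSE when its argument is TRUE but carries attributes such as names. In the R version I am using, isTRUE(x) is implemented as identical(TRUE, x), so any attribute makes the comparison fail. This can be seen in the following code: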

> x <- TRUE
> attr(x, "color") <- "red"
> names(x) <- "first"
> x
first
 TRUE
attr(,"color")
[1] "red"
> isTRUE(x)
[1] FALSE
> x[1]
first
 TRUE
> isTRUE(x[1])
[1] FALSE
> x[[1]]
[1] TRUE
> isTRUE(x[[1]])
[1] TRUE

The conclusion is that the safe form of the loop in the above example should use [[ indexing, which drops names:

for (i in seq(along.with = in.files)) {
      if (isTRUE(skip[[i]])) {
            # skip
      } else {
            # generate out.files[i] using in.files[i]
      }
}
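To try the loop end to end, here is a self-contained sketch that runs in a temporary directory. The file names a.in and b.in, the work.dir and generated variables, and the toupper "transformation" are all made up for the illustration; Sys.setFileTime is used only to make a.out clearly newer than a.in.

```r
# Hypothetical setup: a.in has an up-to-date a.out, b.in has no output yet
work.dir <- tempfile("istrue-demo")
dir.create(work.dir)
in.files <- file.path(work.dir, c("a.in", "b.in"))
out.files <- sub("in$", "out", in.files)
for (f in in.files) writeLines("input", f)
writeLines("output", out.files[1])
Sys.setFileTime(out.files[1], Sys.time() + 60)  # make a.out newer than a.in

# TRUE for a.in, NA for b.in because b.out is missing
skip <- file.info(in.files)$mtime < file.info(out.files)$mtime

generated <- character(0)
for (i in seq_along(in.files)) {
  if (isTRUE(skip[[i]])) next
  # stand-in for the real, time-consuming transformation
  writeLines(toupper(readLines(in.files[i])), out.files[i])
  generated <- c(generated, basename(out.files[i]))
}
generated
```

Only the file with the missing output is regenerated; the up-to-date one is skipped.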

This brought me to the problem of vector subsetting using logical arguments. See the following example:

> x <- 1:5
> y <- c(FALSE, TRUE, NA, TRUE, FALSE)
> names(y) <- 1:5
> x[y]
[1]  2 NA  4
> x[sapply(y, isTRUE)]
[1] 2 4

When the index vector y contains NA, the corresponding element of the result is NA. Sometimes such behavior is desirable, but there are cases when it is not. For example, we cannot tell whether an NA in x[y] appears because x contained NA at a position where y was TRUE, or simply because y was NA.

However, as the example shows, using sapply(y, isTRUE) solves the problem. Fortunately sapply passes each element of y to isTRUE without its name, so we get the correct result without removing the names beforehand.
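The same selection can also be written without calling isTRUE on every element. This is just a sketch of two vectorized alternatives; keep1 and keep2 are names introduced here for the illustration.

```r
x <- 1:5
y <- c(FALSE, TRUE, NA, TRUE, FALSE)
names(y) <- 1:5

# NA & FALSE is FALSE, so NA entries are never selected; names of y are kept
keep1 <- !is.na(y) & y

# %in% never returns NA; the result carries no names
keep2 <- y %in% TRUE

x[keep1]
x[keep2]
```

Both index vectors select the same elements as sapply(y, isTRUE).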

2 comments:

  1. You can use the which function to skip NA values in a logical vector.

    > which(c(TRUE,FALSE,NA,TRUE))
    [1] 1 4

    files_to_process <- in.files[which(file.info(in.files)$mtime > file.info(out.files)$mtime)]

  2. You can use file.exists() to check whether the file is available.

    Cheers,
