## Saturday, December 28, 2013

### GNU R vs Julia: is it only a matter of devectorization?

Recently I have read a post on benefits of code devectorization in Julia. The examples given there inspired me to perform my own devectorization exercise. I decided to use bootstrapping as a test ground. The results are quite interesting (and not so bad for GNU R).

The task is very simple (and typical):

1. generate 10000 elements sample from uniform distribution
2. 1000 times perform bootstrap sample of the vector and calculate its standard deviation
3. return standard deviation of the bootstrap distribution
4. Perform steps 1-3 four times and record computation time

run  <- function() {
ssize <- 10000
nboot <- 1000
x <- runif(ssize)
y <- replicate(nboot, sd(sample(x, ssize, TRUE)))
sd(y)
}

for (i in 1:4) {
cat(system.time(run()), " ")
}
# result: 0.34  0.32  0.31  0.34

The direct translation to Julia gives:
using Distributions

function run()
ssize = 10000
nboot = 1000
x = rand(ssize)
y = Array(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print("\$(@elapsed run())  ")
end
# result: 0.38965987  0.331032088  0.3208918  0.315452803

So not counting longer Julia start up time - execution time is comparable.

Now we can try devectorizing Julia code:
function run()
ssize = 10000
nboot = 1000
x = rand(ssize)
y = Array(Float64,nboot)
bx = Array(Float64, ssize)
for i in 1:nboot
for j in 1:ssize
bx[j] = x[(rand(Uint32) % ssize) + 1]
end
y[i] = std(bx)
end
std(y)
end

for i in 1:4
print("\$(@elapsed run())  ")
end
# result: 0.072204955  0.082288181  0.072243955  0.073718137

We have over 4-times speedup. So devectorization works perfectly and Julia lives up to the promise of performance gain.

However, this is not a whole story. Remember that GNU R performs extra work behinds the scenes that standard Julia ignores - it handles missing values (NA) out of the box.

Actually we can tell Julia to take missing values into account via DataFrames package. Let us test vectorized and devectorized code.

Vectorized Julia with NAs:
using Distributions
using DataFrames

function run()
ssize = 10000
nboot = 1000
x = DataArray(rand(ssize))
y = DataArray(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print("\$(@elapsed run())  ")
end
# result: 1.320962821  1.305368346  1.250524339  1.272740111

So w are over 4 times slower than GNU R now. How about devectorized Julia with NAs? Let us try:
using DataFrames

function run()
ssize = 10000
nboot = 1000
x = DataArray(rand(ssize))
y = DataArray(Array(Float64,nboot))
bx = DataArray(Array(Float64, ssize))
for i in 1:nboot
for j in 1:ssize
bx[j] = x[(rand(Uint32) % ssize) + 1]
end
y[i] = std(bx)
end
std(y)
end

for i in 1:4
print("\$(@elapsed run())  ")
end
# result: 1.027682685  0.994636538  1.017903246  1.021942776

There is a speed up, but it is minor and we still are 3 times slower.

Probably Julia code could be tuned, but we can draw some preliminary conclusions (at least for current versions of GNU R and Julia).
If you can use default Julia data types in your analysis Julia should easily beat GNU R in performance, especially after devectorization. However if you require handling of missing values in your code and have to use DataFrames package in Julia you can expect GNU R to be quite well optimized.

PS.
While trying to get a grasp of Julia I have collected some notes on its syntax and features from R-programmers perspective. You can have a peek its preliminary version here.

1. I think what you're mostly testing here is the speed of computing the standard deviation of a sample. To see how you might do better than R using Rcpp, I whipped up this quick example: https://gist.github.com/hadley/8162972

I was surprised that Rcpp is about 5x faster than base R for this example, but it might be that's because R's sample does a better job of generating uniform samples than rand() % n (which is somewhat biased). Eliminating the intermediate sample and computing a running variance makes very little difference.

2. I "think" the speed difference is due to DataArray and DataFrame pass by value rather than pass by reference for each slice...probably.

3. Unfortunately, your basic conclusion is right: the current Julia DataArrays implementation is quite inefficient. We're working on it, but we expect to take some time to arrive at a solution that's both generalizable and high-level. In particular, we're trying to avoid taking the solution that R uses for floating point numbers, which is to conflate NaN with NA. That design decision makes R's floating point vectors very fast (since they're doing exactly what normal vectors would do without missing data), but it doesn't apply at all to other types and therefore focuses optimization effort on an area that we'd rather optimize after only we make our general abstraction efficient.

Another complication is that our std() implementation is also far less carefully tuned than it could be: it currently makes two passes (both are which aren't that efficient) through the data when only one tighter pass is required. Fixing this wouldn't be too much work and will probably get done soon.

4. Thank you for the comments. I am waiting for improvement of NA handling in DataFrames.
Just to add one thing there is a typical scenario in data analysis - namely the situation when I know that I will not have missing values in my data (i.e. after performing imputation or data is generated from a simulation).
In such cases switching to "raw" Julia gives a performance boost that is not available in R.

1. We're currently working on giving that performance boost in Julia without switching to a standard Array type. The biggest challenge is the need to introduce some simple mechanism for telling the compiler when the entries of a DataArray are definitely not NA.

5. To say that R "handles missing values (NA) out of the box" is a bit strong. Handling missing values means considering why the observations are missing and developing an imputation strategy based on your best understanding, taking into account the extensive literature on how to do imputation in a bias-minimizing manner. What R does out of the box is listwise deletion, in which elements marked as missing are simply ignored. Authors writing on missing data go between acknowledging that listwise deletion is sometimes a useful last resort to flatly stating that it will produce biased results and should be avoided.

Second, I'm not sure if we should care so much about the difference between missing (NA) and not-a-number (NaN). A year or so ago, I surveyed a lot of very experienced R users (including authors of well-known packages) about how they treat NA vs NaN, and couldn't find any that made any distinction. One or two said they understand the difference and feel guilty that they don't bother to make it throughout their code.

I bring it up because R can represent NaNs as just another floating-point variable (because the IEEE standard for representing floating-point numbers includes NaN markers), but an NA is a different SEXP that can't just be stuck into a C array (any R experts want to fact-check or correct me on that?). But if nobody is making the distinction betwen NaN and NA in practice, then it may not be worth the extra programmer effort and processor time to make the distinction, and we can just use out-of-the box numeric arrays and their NaNs to represent data with missing values.

[It's a digression, but I once had a blog post about putting different types of not-a-number into an array of floats; see http://modelingwithdata.org/arch/00000132.htm . I don't think any stats systems do it this way in practice.]

1. To me, using NaN's as NA's leads to (admittedly rare) erroneous statements about one's data. The example below is one that I personally find very troubling:

> is.na(c(Inf) / Inf)
 TRUE

Personally, I think R often takes an excessively "just get things done" approach. The production of completely incorrect coefficients for linearly separable data sets under logistic regression is one example of the sort of thing that I don't like about this style of pragmatism.

2. Just a small comment. R makes a distinction between is.na and is.nan:

> is.na(NaN)
 TRUE

but:

> is.nan(NA_real_)
 FALSE

So stanard R functions treat NaN as a special case of NA.
For instance you get:

> na.omit(c(1, NaN, NA, 4))
 1 4
attr(,"na.action")
 2 3
attr(,"class")
 "omit"

3. That's a good point. But it's also a somewhat odd design decision for R to have taken, because the bit pattern for NA is actually just one of many possible bit patterns that represent a NaN value. At the level of bits, NA is a special case of NaN, rather than NaN being a special case of NA.

I personally would argue that is.nan(0x7ff00000000007a2) should return TRUE to obey the IEEE floating point standard. In particular, if x is a floating point with binary representation 0x7ff00000000007a2, then x exhibits the defining property of IEEE NaN's, which is that x != x evaluates to true.

R seems to have decided to ignore the IEEE standard here, as seen when checking this defining self-inequality:

> NA_real_ != NA_real_
 NA

4. I guess that R wants to hide this internal design decision. It just borrows one bit pattern from NaN to represent NA.

Similarly for integers - NA is represented by bit pattern
"10000000000000000000000000000000"
in R.
That is why we get
> as.integer(-2^31)
 NA
as it is outside integer range for R.

I think the design was made for simplicity of storage and performance reasons. If I understand it correctly Julia tries to use cleaner approach with separate representation of NA while aiming to avoid performance setbacks. Keep up good work.