R snippets: GNU R vs Julia: is it only a matter of devectorization?

Saturday, December 28, 2013

GNU R vs Julia: is it only a matter of devectorization?

Recently I have read a post on benefits of code devectorization in Julia. The examples given there inspired me to perform my own devectorization exercise. I decided to use bootstrapping as a test ground. The results are quite interesting (and not so bad for GNU R).

The task is very simple (and typical):

generate 10000 elements sample from uniform distribution
1000 times perform bootstrap sample of the vector and calculate its standard deviation
return standard deviation of the bootstrap distribution
Perform steps 1-3 four times and record computation time

Let us start with GNU R implementation:
run <- function() {
ssize <- 10000
nboot <- 1000
x <- runif(ssize)
y <- replicate(nboot, sd(sample(x, ssize, TRUE)))
sd(y)
}

for (i in 1:4) {
cat(system.time(run())[3], " ")
}
# result: 0.34 0.32 0.31 0.34

The direct translation to Julia gives:
using Distributions

function run()
ssize = 10000
nboot = 1000
x = rand(ssize)
y = Array(Float64,nboot)
for i in 1:nboot
y[i] = std(sample(x, ssize))
end
std(y)
end

for i in 1:4
print("$(@elapsed run()) ")
end
# result: 0.38965987 0.331032088 0.3208918 0.315452803

So not counting longer Julia start up time - execution time is comparable.

Now we can try devectorizing Julia code:

function run()

ssize = 10000

nboot = 1000

x = rand(ssize)

y = Array(Float64,nboot)

bx = Array(Float64, ssize)

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print("$(@elapsed run()) ")

end

# result: 0.072204955 0.082288181 0.072243955 0.073718137

We have over 4-times speedup. So devectorization works perfectly and Julia lives up to the promise of performance gain.

However, this is not a whole story. Remember that GNU R performs extra work behinds the scenes that standard Julia ignores - it handles missing values (NA) out of the box.

Actually we can tell Julia to take missing values into account via DataFrames package. Let us test vectorized and devectorized code.

Vectorized Julia with NAs:

using Distributions

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Float64,nboot)

for i in 1:nboot

y[i] = std(sample(x, ssize))

end

std(y)

end

for i in 1:4

print("$(@elapsed run()) ")

end

# result: 1.320962821 1.305368346 1.250524339 1.272740111

So w are over 4 times slower than GNU R now. How about devectorized Julia with NAs? Let us try:

using DataFrames

function run()

ssize = 10000

nboot = 1000

x = DataArray(rand(ssize))

y = DataArray(Array(Float64,nboot))

bx = DataArray(Array(Float64, ssize))

for i in 1:nboot

for j in 1:ssize

bx[j] = x[(rand(Uint32) % ssize) + 1]

end

y[i] = std(bx)

end

std(y)

end

for i in 1:4

print("$(@elapsed run()) ")

end

# result: 1.027682685 0.994636538 1.017903246 1.021942776

There is a speed up, but it is minor and we still are 3 times slower.

Probably Julia code could be tuned, but we can draw some preliminary conclusions (at least for current versions of GNU R and Julia).

If you can use default Julia data types in your analysis Julia should easily beat GNU R in performance, especially after devectorization. However if you require handling of missing values in your code and have to use DataFrames package in Julia you can expect GNU R to be quite well optimized.

PS.

While trying to get a grasp of Julia I have collected some notes on its syntax and features from R-programmers perspective. You can have a peek its preliminary version here.

10 comments:

HadleyDecember 28, 2013 at 11:07 AM
I think what you're mostly testing here is the speed of computing the standard deviation of a sample. To see how you might do better than R using Rcpp, I whipped up this quick example: https://gist.github.com/hadley/8162972

I was surprised that Rcpp is about 5x faster than base R for this example, but it might be that's because R's sample does a better job of generating uniform samples than rand() % n (which is somewhat biased). Eliminating the intermediate sample and computing a running variance makes very little difference.
ReplyDelete
Replies
Stephen HendersonDecember 28, 2013 at 12:07 PM
I "think" the speed difference is due to DataArray and DataFrame pass by value rather than pass by reference for each slice...probably.
ReplyDelete
Replies
John Myles WhiteDecember 28, 2013 at 12:27 PM
Unfortunately, your basic conclusion is right: the current Julia DataArrays implementation is quite inefficient. We're working on it, but we expect to take some time to arrive at a solution that's both generalizable and high-level. In particular, we're trying to avoid taking the solution that R uses for floating point numbers, which is to conflate NaN with NA. That design decision makes R's floating point vectors very fast (since they're doing exactly what normal vectors would do without missing data), but it doesn't apply at all to other types and therefore focuses optimization effort on an area that we'd rather optimize after only we make our general abstraction efficient.

Another complication is that our std() implementation is also far less carefully tuned than it could be: it currently makes two passes (both are which aren't that efficient) through the data when only one tighter pass is required. Fixing this wouldn't be too much work and will probably get done soon.
ReplyDelete
Replies
Bogumił KamińskiDecember 28, 2013 at 2:21 PM
Thank you for the comments. I am waiting for improvement of NA handling in DataFrames.
Just to add one thing there is a typical scenario in data analysis - namely the situation when I know that I will not have missing values in my data (i.e. after performing imputation or data is generated from a simulation).
In such cases switching to "raw" Julia gives a performance boost that is not available in R.
ReplyDelete
Replies
bkDecember 29, 2013 at 8:36 AM
To say that R "handles missing values (NA) out of the box" is a bit strong. Handling missing values means considering why the observations are missing and developing an imputation strategy based on your best understanding, taking into account the extensive literature on how to do imputation in a bias-minimizing manner. What R does out of the box is listwise deletion, in which elements marked as missing are simply ignored. Authors writing on missing data go between acknowledging that listwise deletion is sometimes a useful last resort to flatly stating that it will produce biased results and should be avoided.

Second, I'm not sure if we should care so much about the difference between missing (NA) and not-a-number (NaN). A year or so ago, I surveyed a lot of very experienced R users (including authors of well-known packages) about how they treat NA vs NaN, and couldn't find any that made any distinction. One or two said they understand the difference and feel guilty that they don't bother to make it throughout their code.

I bring it up because R can represent NaNs as just another floating-point variable (because the IEEE standard for representing floating-point numbers includes NaN markers), but an NA is a different SEXP that can't just be stuck into a C array (any R experts want to fact-check or correct me on that?). But if nobody is making the distinction betwen NaN and NA in practice, then it may not be worth the extra programmer effort and processor time to make the distinction, and we can just use out-of-the box numeric arrays and their NaNs to represent data with missing values.

[It's a digression, but I once had a blog post about putting different types of not-a-number into an array of floats; see http://modelingwithdata.org/arch/00000132.htm . I don't think any stats systems do it this way in practice.]
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.