Comments on the "R snippets" blog (Bogumił Kamiński)

---

anspiess (2014-01-28 02:55):

It really is so much better than nls (it rarely has convergence problems)!
See my post here: http://rmazing.wordpress.com/2012/07/05/a-better-nls/

Cheers,
Andrej

---

Bogumił Kamiński (2014-01-27 23:48):

The reason why I used optim is exactly that nls has convergence problems. In fact, even with maxiter = 1000000, nls does not converge.

However, I did not know minpack.lm; it works much better :).

---

anspiess (2014-01-27 23:07):

Very important topic you are touching on there! Since your example does nonlinear regression on an exponential-plus-linear function by minimizing the residual sum of squares, one should know that the standard nls function offers no way of scaling the parameters. However, the excellent minpack.lm package (with its nlsLM function) offers a diag argument for parameter scaling in nls.lm.control. But even without setting parameter scales, nlsLM converges in only six
iterations:

```r
nlsLM(y ~ exp(a * x1) + b * x2, start = list(a = 1, b = 2))
```

```
Nonlinear regression model
  model: y ~ exp(a * x1) + b * x2
   data: parent.frame()
  a   b
200   1
residual sum-of-squares: 0

Number of iterations to convergence: 6
Achieved convergence tolerance: 1.49e-08
```

showing that the Levenberg-Marquardt method may be better suited to this kind of scenario.

Cheers,
Andrej

---

Bogumił Kamiński (2013-12-30 00:14):

I guess that R wants to hide this internal design decision. It simply borrows one bit pattern from NaN to represent NA.

Similarly for integers: NA is represented in R by the bit pattern 10000000000000000000000000000000. That is why we get

```r
> as.integer(-2^31)
[1] NA
```

as -2^31 is outside R's integer range.

I think this design was chosen for simplicity of storage and for performance. If I understand it correctly, Julia tries to take a cleaner approach, with a separate representation of NA, while aiming to avoid performance setbacks. Keep up the good work.

---

John Myles White (2013-12-29 19:25):

That's a good point. But it's also a somewhat odd design decision for R to have taken, because the bit pattern for NA is actually just one of many possible bit patterns that represent a NaN value. At the level of bits, NA is a special case of NaN, rather than NaN being a special case of NA.

I personally would argue that is.nan applied to the value with bit pattern 0x7ff00000000007a2 should return TRUE, to obey the IEEE floating-point standard.
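[Editor's note: the bit patterns under discussion can be inspected directly from R; a small sketch for illustration. The byte output shown in the comments assumes a little-endian machine, and the exact NA payload is an internal implementation detail of R.]

```r
# NA_real_ is a NaN with a special payload; writeBin exposes its bytes.
# On a little-endian machine this typically prints
#   a2 07 00 00 00 00 f0 7f
# i.e. the IEEE-754 pattern 0x7ff00000000007a2.
writeBin(NA_real_, raw())

# NaN itself typically uses the canonical quiet-NaN pattern 0x7ff8000000000000.
writeBin(NaN, raw())

# R distinguishes the two at the predicate level:
is.na(NA_real_)   # TRUE
is.nan(NA_real_)  # FALSE
is.nan(NaN)       # TRUE
```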
In particular, if x is a floating-point value with binary representation 0x7ff00000000007a2, then x exhibits the defining property of IEEE NaNs, which is that x != x evaluates to true.

R seems to have decided to ignore the IEEE standard here, as seen when checking this defining self-inequality:

```r
> NA_real_ != NA_real_
[1] NA
```

---

Bogumił Kamiński (2013-12-29 11:31):

Just a small comment. R makes a distinction between is.na and is.nan:

```r
> is.na(NaN)
[1] TRUE
```

but:

```r
> is.nan(NA_real_)
[1] FALSE
```

So standard R functions treat NaN as a special case of NA. For instance, you get:

```r
> na.omit(c(1, NaN, NA, 4))
[1] 1 4
attr(,"na.action")
[1] 2 3
attr(,"class")
[1] "omit"
```

---

John Myles White (2013-12-29 08:55):

To me, using NaNs as NAs leads to (admittedly rare) erroneous statements about one's data. The example below is one that I personally find very troubling:

```r
> is.na(c(Inf) / Inf)
[1] TRUE
```

Personally, I think R often takes an excessively "just get things done" approach.
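[Editor's note: one such case, logistic regression on linearly separable data, can be reproduced in a few lines. A sketch with made-up data; the exact warning text varies by R version.]

```r
# Perfectly separable data: y is 1 exactly when x is positive,
# so the maximum-likelihood estimate of the slope diverges to infinity.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# glm() stops at a huge finite slope, typically with warnings such as
# "fitted probabilities numerically 0 or 1 occurred".
fit <- glm(y ~ x, family = binomial())
coef(fit)  # the reported coefficients are artifacts of early stopping
```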
The production of completely incorrect coefficients for linearly separable data sets under logistic regression is one example of the sort of thing that I don't like about this style of pragmatism.

---

bk (http://modelingwithdata.org) (2013-12-29 08:36):

To say that R "handles missing values (NA) out of the box" is a bit strong. Handling missing values means considering why the observations are missing and developing an imputation strategy based on your best understanding, taking into account the extensive literature on how to do imputation in a bias-minimizing manner. What R does out of the box is listwise deletion, in which elements marked as missing are simply ignored. Authors writing on missing data range from acknowledging that listwise deletion is sometimes a useful last resort to flatly stating that it will produce biased results and should be avoided.

Second, I'm not sure we should care so much about the difference between missing (NA) and not-a-number (NaN). A year or so ago, I surveyed a lot of very experienced R users (including authors of well-known packages) about how they treat NA vs. NaN, and couldn't find any who made a distinction. One or two said they understand the difference and feel guilty that they don't bother to make it throughout their code.

I bring it up because R can represent NaNs as just another floating-point value (since the IEEE standard for representing floating-point numbers includes NaN markers), but an NA is a different SEXP that can't just be stuck into a C array (any R experts want to fact-check or correct me on that?).
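[Editor's note: the contrast between out-of-the-box listwise deletion and even the simplest imputation can be sketched in a few lines. Mean imputation is shown purely for illustration; it is itself a biased strategy.]

```r
x <- c(1, NA, 3, 5)

# Listwise deletion: the NA observation is simply dropped.
mean(x, na.rm = TRUE)  # 3

# Naive mean imputation: fill NAs with the observed mean instead.
x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)
x_imp  # 1 3 3 5
```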
But if nobody is making the distinction between NaN and NA in practice, then it may not be worth the extra programmer effort and processor time to make it, and we can just use out-of-the-box numeric arrays and their NaNs to represent data with missing values.

[It's a digression, but I once wrote a blog post about putting different types of not-a-number into an array of floats; see http://modelingwithdata.org/arch/00000132.htm . I don't think any stats systems do it this way in practice.]

---

John Myles White (2013-12-28 15:26):

We're currently working on providing that performance boost in Julia without switching to a standard Array type. The biggest challenge is the need to introduce some simple mechanism for telling the compiler when the entries of a DataArray are definitely not NA.

---

Bogumił Kamiński (2013-12-28 14:21):

Thank you for the comments. I am waiting for the improvements to NA handling in DataFrames. Just to add one thing: there is a typical scenario in data analysis, namely the situation when I know that I will not have missing values in my data (i.e.
after performing imputation, or because the data is generated by a simulation). In such cases, switching to "raw" Julia gives a performance boost that is not available in R.

---

John Myles White (2013-12-28 12:27):

Unfortunately, your basic conclusion is right: the current Julia DataArrays implementation is quite inefficient. We're working on it, but we expect it to take some time to arrive at a solution that's both generalizable and high-level. In particular, we're trying to avoid the solution R uses for floating-point numbers, which is to conflate NaN with NA. That design decision makes R's floating-point vectors very fast (since they do exactly what normal vectors would do without missing data), but it doesn't apply at all to other types, and it would therefore focus optimization effort on an area that we'd rather optimize only after we make our general abstraction efficient.

Another complication is that our std() implementation is far less carefully tuned than it could be: it currently makes two passes through the data (neither of which is that efficient) when only one, tighter pass is required.
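[Editor's note: the single-pass alternative alluded to here is usually some variant of Welford's algorithm; a sketch in R for illustration. The Julia version would be analogous.]

```r
# One-pass (Welford) running mean and variance, yielding the sample
# standard deviation without a second sweep over the data.
welford_sd <- function(x) {
  m <- 0; s <- 0; k <- 0
  for (v in x) {
    k <- k + 1
    d <- v - m
    m <- m + d / k          # update the running mean
    s <- s + d * (v - m)    # accumulate the sum of squared deviations
  }
  sqrt(s / (k - 1))
}

welford_sd(c(2, 4, 4, 4, 5, 5, 7, 9))  # matches sd() on the same data
```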
Fixing this wouldn't be too much work and will probably get done soon.

---

Stephen Henderson (2013-12-28 12:07):

I "think" the speed difference is because DataArray and DataFrame pass by value rather than by reference for each slice... probably.

---

Hadley (http://had.co.nz/) (2013-12-28 11:07):

I think what you're mostly testing here is the speed of computing the standard deviation of a sample. To see how you might do better than R using Rcpp, I whipped up this quick example: https://gist.github.com/hadley/8162972

I was surprised that Rcpp is about 5x faster than base R for this example, but it might be because R's sample() does a better job of generating uniform samples than rand() % n (which is somewhat biased). Eliminating the intermediate sample and computing a running variance makes very little difference.

---

Tomasz Olczak (2013-12-12 08:31):

Please note that on modern multi-core machines it is possible to boost R's performance with the "multicore" package. You may have noticed that R runs single-core, even on such machines. This in large part explains the dramatic difference in performance against Julia, which has been designed for parallel processing and (I assume) makes use of all available cores on a host. The "multicore" package enables R to utilize multiple cores in a straightforward manner.
On my 8-core machine I got over a 5x speedup for the above cont model:

```r
> REPS <- 24
> system.time(lapply(rep(1000, REPS), cont.run, 10000, 1000, 0.005, 10.0, 0.01))
   user  system elapsed
 19.054   0.000  19.127
> library(multicore)
> system.time(mclapply(rep(1000, REPS), cont.run, 10000, 1000, 0.005, 10.0, 0.01))
   user  system elapsed
 21.535   0.172   3.669
```

As you can see, the trick was to replace the call to lapply with a call to mclapply, its parallelized version provided by the multicore package. Going further, you can parallelize large chunks of existing code in a transparent, nearly effortless way by redefining basic functions such as sapply and replicate to use mclapply instead of lapply.

You can also gain some additional seconds by enabling implicit (background) JIT compilation in R:

```r
> library(compiler)
> enableJIT(3)
> system.time(lapply(rep(1000, REPS), cont.run, 10000, 1000, 0.005, 10.0, 0.01))
   user  system elapsed
 17.417   0.009  17.512
> system.time(mclapply(rep(1000, REPS), cont.run, 10000, 1000, 0.005, 10.0, 0.01))
   user  system elapsed
 26.099   0.308   3.364
```
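[Editor's note: the "redefine sapply" idea can be sketched as follows, using the parallel package into which multicore was later merged. The name msapply is made up for illustration; forked workers require a Unix-alike OS.]

```r
library(parallel)  # successor to the standalone multicore package

# A hypothetical parallel drop-in for sapply, built on mclapply:
# apply FUN over X on multiple cores, then simplify to a vector/array.
msapply <- function(X, FUN, ...) {
  simplify2array(mclapply(X, FUN, ...))
}

msapply(1:4, function(i) i^2)  # same result as sapply(1:4, function(i) i^2)
```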
---

Jared Huling (2013-12-03 07:34):

Note that (on my computer at least) if you increase n, eventually boot.bare2 becomes faster than boot.qrls.

---

Bogumił Kamiński (2013-12-03 05:01):

Thanks for the comments. To sum up the results of applying Jared's ideas on my PC:

- boot.bare2 gives a ~30% speed improvement over boot.bare
- boot.qrls gives a ~40% speed improvement over boot.bare

---

Z (http://eeecon.uibk.ac.at/~zeileis/) (2013-12-02 22:51):

As Jared already suggested: calling lm(), and analogously glm(), is unnecessary here because the formula processing is then redone in every iteration. Using lm.fit() or glm.fit() is much faster. For the logit example:

```r
boot.glm2 <- function() {
  dm <- data.m[sample.int(n, replace = TRUE), ]
  glm.fit(dm[, 1:3], dm[, 4], family = binomial(), control = glm.control())$coefficients
}
```

And then you can use it as

```r
set.seed(2)
system.time(rg2 <- replicate(10000, boot.glm2()))
```

which reduces the computation time by a bit more than 50% on my machine. This is not quite competitive with boot.irls() in terms of speed, but it is likely to be more precise.
It uses a relative tolerance of 1e-8 on the deviance/log-likelihood (rather than an absolute tolerance of 1e-6 on the gradient).

---

Jared Huling (2013-12-02 18:04):

Here's an improved version of boot.bare(). It takes advantage of R's built-in functions for matrix algebra:

```r
boot.bare2 <- function() {
  dm <- data.m[sample.int(n, replace = TRUE), ]
  X <- dm[, 1:3]
  y <- dm[, 4]
  solve(crossprod(X), crossprod(X, y))
}
```

---

Jared Huling (2013-12-02 17:58):

The "tX <- t(X)" lines can be deleted from both functions, which leads to a further improvement. Missed that.

---

Jared Huling (2013-12-02 17:25):

Note that lm() actually calls lm.fit(), so without all the wrapper code we can use lm.fit directly, and it is much faster:

```r
boot.lm.fit <- function() {
  dm <- data.m[sample.int(n, replace = TRUE), ]
  X <- dm[, 1:3]
  tX <- t(X)
  y <- dm[, 4]
  lm.fit(x = X, y = y)$coef
}
```

Now what is lm.fit actually doing? It's a wrapper around C code that solves the least-squares problem via QR decomposition, so why not just do that ourselves?
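[Editor's note: the normal-equations route in boot.bare2 and the QR route in lm.fit should agree on well-conditioned data; a quick self-contained check with simulated data made up for illustration.]

```r
# Normal equations (solve/crossprod) vs. QR (lm.fit) give the same
# least-squares coefficients when X has full column rank.
set.seed(1)
X <- cbind(1, matrix(rnorm(200), nrow = 100, ncol = 2))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(100))

b_qr <- unname(lm.fit(x = X, y = y)$coefficients)
b_ne <- drop(solve(crossprod(X), crossprod(X, y)))

all.equal(b_qr, b_ne)  # TRUE
```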
This function is the fastest of all four options. The solutions should be the same; check it yourself:

```r
boot.qrls <- function() {
  dm <- data.m[sample.int(n, replace = TRUE), ]
  X <- dm[, 1:3]
  tX <- t(X)
  y <- dm[, 4]
  tol <- 1e-07
  .Call(stats:::C_Cdqrls, X, y, tol)$coefficients
}
```

---

Bogumił Kamiński (2013-11-28 22:50):

It is equivalent.
abs(r[i]) = abs(mul * ari) = abs(mul) * abs(ari) = ari,

since ari is always positive and mul is either 1 or -1.

---

Simon Knapp (2013-11-28 20:35):

That is a pretty awesome improvement in performance! Sorry to be a pedant (this does not affect the performance comparison), but the R code:

```r
r[i] <- mul * sum(sig > tr) / (l * n)
tr[runif(n) < s] <- abs(r[i])
```

is not equivalent to the Julia code:

```julia
ari = aris / (l * n)
r[i] = mul * ari
for j in 1:n
    if rand() < s
        tr[j] = ari
    end
end
```

as you don't multiply tr[j] by mul or take the absolute value.

---

Simon Knapp (2013-11-28 20:21):

The line:

```r
tr[runif(n) < s] <- abs(r[i])
```

makes it difficult.
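[Editor's note: the vectorized R update being discussed and its explicit-loop equivalent can be compared side by side; a self-contained sketch with made-up values.]

```r
# Vectorized masked assignment vs. an explicit loop over the same
# random draws: both set tr[j] to val wherever u[j] < s.
set.seed(42)
n <- 8; s <- 0.5; val <- 0.3
u <- runif(n)

tr1 <- numeric(n)
tr1[u < s] <- val            # vectorized form, as in the R code above

tr2 <- numeric(n)
for (j in 1:n) {
  if (u[j] < s) tr2[j] <- val
}

identical(tr1, tr2)  # TRUE
```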