Comments: R snippets (Bogumił Kamiński)

Bogumił Kamiński (2015-01-09):
Thanks for the comment. However, notice that I use == only to show the results; it is clear that you would not use == in practice.

But consider two things (as an example):
1) Should it matter whether you generate from low to high or from high to low? (I would say it should not matter, and in Python it does.)
2) Should your generated sequence be as evenly spaced as possible? (I would say yes, and in R, for example, you have 'just beyond', which breaks this assumption.)

Bernhard (2015-01-09):
Hi,

well, maybe I miss the point of why you prefer Julia. Julia gives "false" six times and R produces FALSE six times. Equal. The point, however, is that every programmer should consider using "==" or ".==" with floating point numbers bad practice, to be avoided. Python, producing the most unexpected behaviour, has the biggest chance that the programmer notices a bug in his code before delivering it or using it on a real case. +1 for Python.
Once you have learned to avoid "==" with floats, the topic discussed here is of no relevance anymore (or is it?).

Greets,
Bernhard

Bogumił Kamiński (2015-01-07):
If you are interested in the details, you can check out this discussion: https://github.com/JuliaLang/julia/issues/9637.

Anonymous (2015-01-07):
Julia is designed by people who really care about this kind of issue, so it uses a nice algorithm to make floating point ranges work more intuitively.
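For reference, the binary floating point behaviour underlying this whole thread is easy to reproduce; a minimal Python sketch (the same IEEE doubles that R, Python and Julia all use):

```python
import math

# 0.1 has no exact binary double representation, so stepping accumulates drift
step = 0.1 + 0.1 + 0.1
print(step == 0.3)              # False: step is 0.30000000000000004
print(math.isclose(step, 0.3))  # True: compare with a tolerance, not ==
```

This is exactly why == on generated floating point sequences is fragile regardless of which language builds the range.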
You should still note that there are fundamental issues with binary floating point numbers and decimal literals in code, so it is probably possible to find edge cases in Julia too.

Anonymous (2014-07-29):
Thanks for the info and reply.

Bogumił Kamiński (2014-07-29):
I would use the trapezoid rule to approximate the definite integral with uniformly distributed data points (http://en.wikipedia.org/wiki/Trapezoidal_rule).
Note that there you should set x_1 = f(x_1) = 0 in order to get the left boundary of the integral.

Anonymous (2014-07-28):
I am looking for methods to calculate the area under the Gain and Lift curves. The flux package has a method using the trapezoid rule (http://artax.karlin.mff.cuni.cz/r-help/library/flux/html/auc.html).
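For context, for uniformly spaced points the trapezoid rule reduces to a one-liner; a minimal Python sketch (the function name is illustrative, not from any package discussed here):

```python
def trapezoid(y, dx):
    """Trapezoid-rule approximation of an integral from uniformly
    spaced samples y with spacing dx: interior points get full
    weight, the two endpoints half weight."""
    return dx * (sum(y) - (y[0] + y[-1]) / 2)

# Prepending x_1 = f(x_1) = 0, as suggested above, anchors the
# curve at the origin before integrating.
print(trapezoid([0.0, 1.0, 2.0, 3.0], dx=1.0))  # 4.5
```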
What method would you use to add it to your Gain code above?

Bogumił Kamiński (2014-07-28):
And what is your question?

Bogumił Kamiński (2014-07-28):
Unfortunately, the code given here does not support it directly.

JERIN (2014-07-28):
How do I pass an additional variable along with the dataset being passed?

Anonymous (2014-07-27):
I did find this: http://stats.stackexchange.com/questions/24325/lorenz-curve-and-gini-coefficient-for-measuring-classifier-performance

I will also be looking at calculating the area under the Gain curve as well...
(http://www.saedsayad.com/model_evaluation_c.htm)

Bogumił Kamiński (2014-07-27):
If I understand your question correctly, you are asking about the Gini coefficient.

Anonymous (2014-07-26):
Have you used Area Under Lift (AUL), analogous to AUC, for model evaluation/comparison? Do you have suggestions on an approach for calculating AUL?

Bartek Chroł (2014-06-16):
Well, that is another reason why you should always use full argument names. Under R 3.0.1 "scan" doesn't have a "skipNul" argument.

Bogumił Kamiński (2014-06-16):
Yes, it is an error, because "scan" has both "skip" and "skipNul" arguments, so the call is ambiguous (at least under R version 3.1.0).

Bartek Chroł (2014-06-16):
Hmm, I've got no problem with "sk=1". Does it throw an error or something?

Bogumił Kamiński (2014-06-16):
nice :)
However, for my R configuration "sk=1" does not work and I have to spell out the full "skip=1".

Bartek Chroł (2014-06-16):
Actually, my solution can be made alphabet-independent by adding only 3 characters. Now it's almost the same as Ivan's version, but thanks to a few tricks a bit shorter (169 characters):

d=strsplit(scan("NGSL101.txt","",sk=1),"")
s=sapply
a=apply
b=s(s(d,factor,unique(unlist(d))),table)
plot(by(log(a(b,2,function(w)sum(a(b<=w,2,all)))),s(d,length),mean))

Bartek Chroł (2014-06-16):
My solution is also based on counting letters, but a bit shorter (166 characters) and faster. However, it works only for words consisting of letters from the English alphabet (because of "c(letters,LETTERS)"):

d=scan("NGSL101.txt","",sk=1)
s=sapply
a=s(s(strsplit(d,""),factor,c(letters,LETTERS)),table)
plot(by(log(apply(a,2,function(w)sum(colSums(a<=w)>51))),nchar(d),mean))

Иван Назаров (2014-06-14):
Hi,
Your use of regular expressions here is very clever! However, each time you call grepl(), the regexp engine has to "compile" the pattern into an efficient internal representation to parse the input. This is why your code takes so much time. You can speed it up: instead of using many different patterns on a single input, it is better to use one pattern on many inputs. Pay attention to the last two lines (164 characters):

d=scan("NGSL101.txt","",skip=1)
a=(s=sapply)(s(strsplit(d,""),sort),paste,collapse=".*")
y=rep(0,length(a));for(z in a)y=y+grepl(z,a)
plot(by(log(y),nchar(d),mean))

Here grepl(z,a) returns an array of flags, each telling whether all letters of the word "z" occur in the corresponding word of "a". The loop just accumulates the number of sub-words for each word.

I'd like to add that in the original code you don't have to create an anonymous function to apply grepl():

y = log( s( a, function( z ) sum( s( a, grepl, z ) ) ) )

And if you're willing to use more memory (and more time) you can do this:

y = log( apply( s( a, grepl, a ), 1, sum ) )

Here s( a, grepl, a ) computes a useful matrix of the "being-a-superword" binary relation among words, where (R,C) is TRUE if C is a sub-word of R.

My own solution is longer (178 characters) and not as fast:

l=strsplit(scan("NGSL101.txt","",skip=1),"")
x=(s=sapply)(s(l,factor,unique(unlist(l))),summary)
y=(a=apply)(x,2,function(v)sum(a(v>=x,2,all)))
plot(by(log(y),s(l,length),mean))

It builds a matrix of the stock of letters in each word (x), and then counts the words from the dictionary which fit in the current "word".
Bartek Chroł (2014-06-13):
I've managed to do it in 138 characters, including setting rownames, which costs 17 characters. Without these rownames all.equal doesn't return TRUE, although all values are identical.

r=Reduce(rbind,lapply(split(d,d$p),function(x){t=x$t
x$v=sapply(t,function(b)sum(x$v[t>=b&t<b+90]))
x[which.max(x$v),]}))
rownames(r)=1:10

I use Reduce instead of do.call to save one letter :)

Bogumił Kamiński (2014-06-05):
nice ;)

Иван Назаров (2014-06-05):
Here is a faster and shorter version of my solution. It is 157 characters long (with newlines) and, instead of a running summation, it computes an "extended" cumulative sum of ordered sales, then takes the difference between the accumulated sales value at the last date of the window and just before the first day of the window (zero if the window starts at the earliest date).

r<-do.call(rbind,lapply(split(d,d$p),function(d){
d=d[order(d$t),];d$v=(v=c(0,cumsum(d$v)))[1+.bincode(
d$t+90,c(d$t,+Inf))]-head(v,-1);d[which.max(d$v),]}))

However, this cumsum() approach should not be used in practice without modification. The problem is finite precision arithmetic: computing a cumulative sum is prone to growing roundoff errors, which become more severe the longer the input data.
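The accumulation effect is easy to demonstrate with plain Python (math.fsum serves as the correctly rounded reference for the same inputs):

```python
import math

xs = [0.1] * 10
print(sum(xs))        # naive left-to-right accumulation drifts below 1.0
print(math.fsum(xs))  # 1.0, the correctly rounded sum of the same values
```

The longer the accumulated sequence, the more such roundoff can build up, which is exactly the concern with a long cumsum().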
As a result, the total sales in the updated d$v might be numerically inexact, thereby adversely affecting the choice of the maximal sales period.

Cosi (2014-06-04):
Erm, it seems that blogspot has problems with angle brackets. My solution is at http://pastebin.com/UNJp65Kw (159 bytes, including newlines).

It's slow as hell, but hey, it needs to be short, not fast ;-)

Cosi (2014-06-04):
That's my solution:
r=d[c()];sapply(levels(d$p),function(l){s=d[d$p==l,];t=s$t
s$v=sapply(t,function(x)sum(s$v[t>=x&t

> r
    p         v          t
1   a 10.198044 2009-10-11
2   b  9.338709 2002-08-24
3   c  9.735401 2008-02-21
4   d  9.479942 2004-09-07
5   e 11.036041 2010-09-16
6   f 10.602446 2002-04-04
7   g  9.927153 2007-08-11
8   h 10.917856 2007-04-26
9   i 10.027341 2008-10-26
10  j  9.965738 2004-12-28
> all.equal(r, testr)
[1] TRUE