R snippets (comments feed)
Bogumił Kamiński
Updated 2015-01-31T00:09:41.551-08:00

Bogumił Kamiński, 2015-01-31T00:09:41.551-08:00
You can start by learning the parallel package.

Zhenchuan Wang, 2015-01-30T23:40:45.921-08:00
May I ask whether "sapply" is related to parallel computing? I want to assign each subfunction to its own core so that the p-values are computed at the same time. Do you have any recommendations?

Bogumił Kamiński, 2015-01-30T23:22:44.197-08:00
I do not see your code, but my code works with sapply. You should probably replace it with another function appropriate for your case.

Zhenchuan Wang, 2015-01-30T23:07:54.856-08:00
What should I do if I want to apply four methods (subfunctions) to the same sample (a matrix) to calculate the p-values? I notice that your function only works on vectors.

Bogumił Kamiński, 2015-01-09T06:39:00.528-08:00
Thx for the comment. However, notice that I use == only to show the results, and it is clear that you would not use == in practice.

But consider two things (as an example):
1) Should it matter whether you generate from low to high or from high to low? (I would say it should not matter, and in Python it does.)
2) Should your generated sequence be as evenly spaced as possible? (I would say yes, and for example in R you have 'just beyond', which breaks this assumption.)

Anonymous, 2015-01-09T06:06:42.198-08:00
Hi,

Well, maybe I am missing the point of why you prefer Julia. Julia gives "false" 6 times and R produces FALSE 6 times. Equal. The point, however, is that every programmer should consider using "==" or ".==" with floating point numbers to be bad and to be avoided. Python, producing the most unexpected behaviour, has the biggest chance that the programmer notices a bug in his code before delivering it or using it on a real case. +1 for Python.
Once you have learned to avoid "==" with floats, the topic discussed here is of no relevance anymore (or is it?).

Greets,
Bernhard

Bogumił Kamiński, 2015-01-07T23:57:44.587-08:00
If you are interested in the details, you can check out this discussion: https://github.com/JuliaLang/julia/issues/9637.

Anonymous, 2015-01-07T10:46:49.343-08:00
Julia is designed by people who really care about these kinds of issues, so it uses a nice algorithm to make floating point ranges work more intuitively. You should still note that there are fundamental issues with binary floating point numbers and decimal literals in code, so it is probably possible to find edge cases in Julia too.

Anonymous, 2014-07-29T08:00:43.874-07:00
Thanks for the info and reply.

Bogumił Kamiński, 2014-07-29T03:28:17.225-07:00
I would use the trapezoid rule to approximate a definite integral with uniformly distributed data points (http://en.wikipedia.org/wiki/Trapezoidal_rule).
Note that there you should set x_1=f(x_1)=0 in order to get the left boundary of the integral.

Anonymous, 2014-07-28T07:07:09.184-07:00
I am looking for methods to calculate the area under the Gain and Lift curves. The flux package has a method using the trapezoid rule (http://artax.karlin.mff.cuni.cz/r-help/library/flux/html/auc.html). What method would you use to add it to your Gain code above?

Bogumił Kamiński, 2014-07-28T03:15:31.637-07:00
And what is your question?

Bogumił Kamiński, 2014-07-28T03:05:54.825-07:00
Unfortunately the code given here does not support it directly.

JERIN, 2014-07-28T01:36:37.665-07:00
How do I pass an additional variable along with the dataset being passed?

Anonymous, 2014-07-27T18:05:00.072-07:00
I did find this: http://stats.stackexchange.com/questions/24325/lorenz-curve-and-gini-coefficient-for-measuring-classifier-performance

I will also be looking at calculating the area under the Gain curve... (http://www.saedsayad.com/model_evaluation_c.htm)

Bogumił Kamiński, 2014-07-27T01:44:46.230-07:00
If I understand your question correctly, you are asking about the Gini coefficient.

Anonymous, 2014-07-26T17:22:05.822-07:00
Have you used Area Under Lift (AUL), analogous to AUC, for model evaluation/comparison? Do you have suggestions on an approach for calculating AUL?

Bartek Chroł, 2014-06-16T02:33:05.096-07:00
Well, that is another reason why you should always use full argument names. Under R 3.0.1 "scan" doesn't have a "skipNul" argument.

Bogumił Kamiński, 2014-06-16T02:21:21.481-07:00
Yes - it is an error, because "scan" has both "skip" and "skipNul" arguments, so the call is ambiguous (at least under R version 3.1.0).

Bartek Chroł, 2014-06-16T02:16:59.103-07:00
Hmm, I've got no problem with "sk=1". Does it throw an error or something?

Bogumił Kamiński, 2014-06-16T01:44:29.772-07:00
nice :)
However, for my R configuration "sk=1" does not work and I have to write the full "skip=1".

Bartek Chroł, 2014-06-16T01:24:13.113-07:00
Actually my solution can be made alphabet-independent by adding only 3 characters. Now it's almost the same as Ivan's version, but thanks to a few tricks it is a bit shorter (169 characters):

    d=strsplit(scan("NGSL101.txt","",sk=1),"")
    s=sapply
    a=apply
    b=s(s(d,factor,unique(unlist(d))),table)
    plot(by(log(a(b,2,function(w)sum(a(b<=w,2,all)))),s(d,length),mean))

Bartek Chroł, 2014-06-16T01:10:16.416-07:00
My solution is also based on counting letters, but it is a bit shorter (166 characters) and faster. However, it works only for words consisting of letters from the English alphabet (because of "c(letters,LETTERS)").

    d=scan("NGSL101.txt","",sk=1)
    s=sapply
    a=s(s(strsplit(d,""),factor,c(letters,LETTERS)),table)
    plot(by(log(apply(a,2,function(w)sum(colSums(a<=w)>51))),nchar(d),mean))

Иван Назаров, 2014-06-14T23:50:19.631-07:00
Hi,

Your use of regular expressions is very clever in this! However, each time you call grepl(), the regexp engine has to "compile" the pattern into some efficient internal representation to parse the input. This is the reason your code takes so much time. You can speed it up: instead of using many different patterns on a single input, it is better to use one pattern on many inputs. Pay attention to the last two lines (164 characters):

    d=scan("NGSL101.txt","",skip=1)
    a=(s=sapply)(s(strsplit(d,""),sort),paste,collapse=".*")
    y=rep(0,length(a));for(z in a)y=y+grepl(z,a)
    plot(by(log(y),nchar(d),mean))

Here grepl(z,a) returns the array of flags, each telling whether all letters in the word "z" are in the corresponding word in "a". The loop just accumulates the number of sub-words for each word.

I'd like to add that in the original code you don't have to create an anonymous function to apply grepl():

    y = log( s( a, function( z ) sum( s( a, grepl, z ) ) ) )

And if you're willing to use more memory (and more time) you can do this:

    y = log( apply( s( a, grepl, a ), 1, sum ) )

Here s( a, grepl, a ) computes a useful matrix of the "being-a-superword" binary relation among words, where (R,C) is TRUE if C is a sub-word of R.

My own solution is longer (177 characters) and not as fast:

    l=strsplit(scan("NGSL101.txt","",skip=1),"")
    x=(s=sapply)(s(l,factor,unique(unlist(l))),summary)
    y=(a=apply)(x,2,function(v)sum(a(v>=x,2,all)))
    plot(by(log(y),s(l,length),mean))

It builds a matrix of the stock of letters in each word (x), and then counts the words from the dictionary that fit in the current "word".

Bartek Chroł, 2014-06-13T05:47:11.127-07:00
I've managed to do it in 138 characters, including setting rownames, which costs 17 characters. Without these rownames all.equal doesn't return TRUE, although all values are identical.

    r=Reduce(rbind,lapply(split(d,d$p),function(x){t=x$t
    x$v=sapply(t,function(b)sum(x$v[t>=b&t<b+90]))
    x[which.max(x$v),]}))
    rownames(r)=1:10

I use Reduce instead of do.call to save one letter :)
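
For readers who find the 138-character snippet above hard to follow, here is an ungolfed sketch of what it appears to compute. The data frame d is not shown in the thread, so the toy data below is hypothetical, as is the reading of the columns (p as a group id, t as a time, v as a value). The helper name best_window is also made up for illustration. The sketch uses do.call(rbind, ...) where the original uses Reduce(rbind, ...), and numbers the rows from the data rather than hard-coding 1:10.

```r
# Hypothetical data: the original 'd' is not shown in the thread.
# Assumed columns: p = group id, t = time, v = value.
d <- data.frame(
  p = c(1, 1, 1, 2, 2),
  t = c(0, 50, 100, 0, 200),
  v = c(1, 2, 4, 3, 5)
)

# For one group: replace v by the sum of v over the window [t, t + 90)
# starting at each row's t, then keep the row with the largest window sum.
best_window <- function(x) {
  window_sum <- sapply(x$t, function(b) sum(x$v[x$t >= b & x$t < b + 90]))
  x$v <- window_sum
  x[which.max(x$v), ]
}

# Split by group, pick the best window per group, and stack the results.
r <- do.call(rbind, lapply(split(d, d$p), best_window))
rownames(r) <- seq_len(nrow(r))
print(r)
```

Note that the window sums are computed from the original v before it is overwritten, which matches the evaluation order in the golfed version.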