Tuesday, June 27, 2017

Julia vs R and Python: what does Stack Overflow Developer Survey 2017 tell us?

TLDR: Most Julia programmers also use Python. However, among all languages R is the one whose users are most likely to also develop in Julia.

Recently Stack Overflow has made public the results of its Developer Survey 2017. It is definitely an interesting data set. In this post I analyze the answers to the question "Which of the following languages have you done extensive development work in over the past year, and which do you want to work in over the next year?" from the perspective of the Julia language relative to other programming languages.

This gives us two variables of interest: 1) which languages were used and 2) which languages respondents plan to use.

Basic statistics

Let us start off with raw Julia statistics: only 0.18% of developers reported having used Julia, so the current base is small. However, 0.67% say that they want to use it next year, an almost fourfold increase and quite a large expected growth.

As developers usually use several languages, especially in data science, two natural questions arise: 1) which other languages Julia programmers also use and 2) users of which languages also use Julia.

Languages most used by Julia programmers

Last year:

  1. Python      73.0%
  2. JavaScript  40.0%
  3. SQL         39.0%
  4. C++         36.0%
  5. C           34.0%
  6. Java        32.0%
  7. R           29.0%
  8. Matlab      21.0%

Plans for next year:

  1. Python      64.0%
  2. R           38.0%
  3. JavaScript  29.0%
  4. SQL         26.0%
  5. Go          26.0%
  6. C++         26.0%
  7. Scala       24.0%
  8. Haskell     21.0%

Python is clearly the most popular language among Julia developers. However, R can be expected to grow significantly in popularity among them in the near future.

Users of which languages also use Julia

Last year:

  1. R           2.26%
  2. Python      0.79%
  3. C++         0.42%
  4. SQL         0.35%
  5. JavaScript  0.27%

Plans for next year:

  1. R           8.73%
  2. Python      2.70%
  3. C++         1.72%
  4. SQL         1.64%
  5. JavaScript  1.09%

Interestingly, here the situation is reversed: an R developer is much more likely to use Julia than a Python programmer. The pattern is the same for last year and for plans for the coming year.

Here you can get the code I used to produce the above results.
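For reference, below is a minimal sketch of how such conditional shares can be computed (an illustration, not necessarily the exact script linked above; it assumes the public results CSV with its semicolon-separated HaveWorkedLanguage column):

survey <- read.csv("survey_results_public.csv", stringsAsFactors = FALSE)
langs  <- strsplit(survey$HaveWorkedLanguage, "; ")

used  <- function(lang) vapply(langs, function(x) lang %in% x, logical(1))
julia <- used("Julia")

# 1) share of Julia users who also used language X
#    (Julia itself shows up with share 1.0 at the top)
head(sort(table(unlist(langs[julia])) / sum(julia), decreasing = TRUE), 9)

# 2) share of users of language X who also used Julia
mean(julia[used("R")])       # e.g. for R
mean(julia[used("Python")])  # and for Python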

Thursday, June 8, 2017

Test-driving Microsoft Cognitive Toolkit in R using reticulate

New tools for data science pop up constantly, so it is hard to keep up with the changes and to choose those that promise to be useful in the long run.
Recently two new solutions were announced that look very promising: the reticulate package for R and Microsoft Cognitive Toolkit 2.0 (CNTK). Together with Wit Jakuczun we decided to test-drive them on an Azure Data Science Virtual Machine.
CNTK is a very promising deep learning toolkit backed by Microsoft. Unfortunately, it provides bindings to Python and not to R, so it is an ideal opportunity to test the reticulate package, which provides an R interface to Python. We wanted to test CNTK on GPUs, so we decided to use an Azure Data Science Virtual Machine NV6 instance with a Tesla M60 GPU.
For our test drive we used the classic MNIST dataset. You can find all the source code and detailed results of the test on GitHub. The short conclusion is that all the components worked and played together excellently:

  • with CNTK we built an MLP network model, which achieved a 2.3% classification error without any special tuning, which is a pretty decent result;
  • calling a Python library from R using reticulate was simple and stable (a minimal sketch of the pattern closes this post);
  • calculations using the GPU gave a significant speedup, and configuration of the Azure DSVM was really simple.
In summary, we found the combo MSFT CNTK + R/reticulate + Azure DSVM to be simple to set up and powerful. It is definitely worth checking out in your next data science project.
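To give a flavor of what reticulate feels like, here is a minimal sketch of the calling pattern (an illustration only; the full CNTK example lives in the GitHub repository linked above, and the sketch assumes the relevant Python packages are visible to reticulate):

library(reticulate)

cntk <- import("cntk")            # import a Python module into R
cntk$`__version__`                # access Python attributes with $

# Python objects can also be kept unconverted and method-chained:
np <- import("numpy", convert = FALSE)
m  <- np$arange(6L)$reshape(c(2L, 3L))
py_to_r(m)                        # convert back to an R matrix when needed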


Monday, December 5, 2016

Optimization matchup: R's glpkAPI vs Julia's JuMP

tl;dr: although I use R every day and love it, doing mathematical programming using Julia is much simpler and more flexible than anything I know that is currently available in R.

Recently I learned that Iain Dunning, Joey Huchette and Miles Lubin have received the 2016 INFORMS Computing Society Prize for the development of JuMP, a Julia-based domain-specific modeling language. Congratulations!

Together with Wit Jakuczun we decided to test-drive Julia's JuMP against R's glpkAPI package.

The problem we decided to solve is a standard MIP model for finding clusters in data using the k-medoids method (we used this specification of the model, without relaxation). The work breakdown was that Wit wrote the solution in R and I developed the Julia code.
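For reference, the model is the following: given the pairwise distance matrix rho, binary variables y[j] (point j is a medoid) and x[i,j] (point i is assigned to medoid j), we solve

    minimize    sum over i,j of rho[i,j] * x[i,j]
    subject to  sum over j of y[j]   = K              (exactly K medoids)
                sum over j of x[i,j] = 1   for all i  (each point assigned once)
                x[i,j] <= y[j]   for all i, j         (assign only to medoids)

Both implementations below encode this model.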

Data preparation

The data set used for the analysis can be generated using this Julia code:

# generate N points in 2D: standard normal noise around K centers
# evenly spaced on a circle whose radius is controlled by scaling
function gendata(N, K, scaling)
    x = randn(N, 2)
    for i in 1:N
        α = 2π * i / K
        x[i, :] += scaling * [cos(α); sin(α)] / (2 * sin(π/K))
    end
    return x
end

srand(1)
d = gendata(200, 4, 5)   # 200 points in 4 clusters, matching the R code below
writecsv("test.txt", d)

or the equivalent R code:

gen_data <- function(N, K, scaling) {
    alpha <- 2 * pi * (1:N) / K
    sin_pi_K <- 2 * sin(pi / K)

    X <- as.data.frame(matrix(data=rnorm(n=2*N), nrow=N, ncol=2) +
                       scaling*matrix(data = c(cos(alpha), sin(alpha)),
                                      nrow = N, ncol = 2) / sin_pi_K)
    return(data.frame(x = X$V1, y = X$V2))
}

set.seed(1)
write.table(x = gen_data(200, 4, 5), file = "test.txt",
            col.names = TRUE, row.names = FALSE,
            sep = ",", dec = ".")

(The Julia code below assumes that the data set was generated using Julia, and the R code assumes that it was generated in R; the difference is that the file generated in R has a header.)

Comparing the above codes (although they are not the main objective of the exercise) highlights two things about Julia: 1) you can use Unicode literals in your code and 2) using an explicit loop is fine (fast). In this case those two points are of course only a matter of taste, but I find that both of them make the code a bit more readable.

The end result is 200 points placed on a 2D plane in four clusters.

Now we move to the actual challenge: write code that executes the k-medoids algorithm for the number of clusters ranging from 2 to 6 and compare the results.

Here is the solution in Julia:

using JuMP
using GLPKMathProgInterface
using Distances

function find_clusters(K, ds)
    N = size(ds, 1)
    rho = pairwise(Euclidean(), ds')  # distance matrix; pairwise works column-wise, hence the transpose

    m = Model(solver=GLPKSolverMIP())
    @variable(m, y[1:N], Bin)      # y[j] == 1 iff point j is a medoid
    @variable(m, x[1:N,1:N], Bin)  # x[i,j] == 1 iff point i is assigned to medoid j

    # total distance of points to their assigned medoids
    @objective(m, Min, sum{x[i,j]*rho[i,j], i=1:N, j=1:N})

    @constraint(m, sum(y) == K)  # exactly K medoids
    for i in 1:N
        @constraint(m, sum(x[i,:]) == 1)  # each point is assigned to exactly one medoid
    end
    for i in 1:N, j in 1:N
        @constraint(m, x[i,j] <= y[j])  # points may be assigned only to selected medoids
    end

    solve(m)
    return getvalue(x), getvalue(y), getobjectivevalue(m)
end

function exec()
    ds = readcsv("test.txt")
    N = size(ds, 1)
    for K in 6:-1:2
        println("--- K=$K ---")
        @time x, y, obj = find_clusters(K, ds)
        println("Objective: $obj")
        println("Centers:")
        for i in 1:N
            y[i] > 0.0 && println("\t$i:\t", ds[i,:])
        end
    end
end

exec()

and here is the R code:

library(glpkAPI)

find_clusters <- function(K, ds) {
    N <- nrow(ds)
    
    rho <- cbind(expand.grid(i = 1:N, j = 1:N),
        dist = as.numeric(as.matrix(dist(ds, upper=TRUE, diag=TRUE))))
    write.table(x = rho, file = "dist.csv",
                sep = ",", dec = ".",
                col.names = TRUE, row.names = FALSE)

    # write the .dat file with the model parameters
    cat("data;\n", file = "kmedoids.dat", append = FALSE)
    cat(sprintf("param pN := %s;\n", N), file = "kmedoids.dat", append = TRUE)
    cat(sprintf("param pK := %s;\n", K), file = "kmedoids.dat", append = TRUE)
    cat("end;", file = "kmedoids.dat", append = TRUE)

    wk <- mplAllocWkspGLPK()
    termOutGLPK(GLP_OFF)
    model <-  mplReadModelGLPK(wk, "kmedoids.mod", TRUE)
    data <- mplReadDataGLPK(wk, "kmedoids.dat")
    mplGenerateGLPK(wk)
    lp <- initProbGLPK()
    mplBuildProbGLPK(wk, lp)
    scaleProbGLPK(lp, GLP_SF_AUTO)
    setDefaultMIPParmGLPK()
    setSimplexParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    setMIPParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    solveSimplexGLPK(lp)
    solveMIPGLPK(lp)    
    
    obj  <- mipObjValGLPK(lp)
    all_values <- mipColsValGLPK(lp)
    all_names <- rep(NA_character_, length(all_values))
    for (i in 1:length(all_values)) {
        all_names[i] <- getColNameGLPK(lp, i)
    }
    sol <- data.frame(var = all_names, val = all_values)
    
    return(list(
      obj = obj,
      y = sol[grep(pattern = "vSegmentCenter", x = all_names), ],
      x = sol[grep(pattern = "vPointToSegment", x = all_names), ]))
}


ds <- read.table(file = "test.txt", sep = ",",
                 dec = ".", header = TRUE)
N <- nrow(ds)

for (K in seq(6,2)) {
  print(sprintf("--- K = %s ---", K))
  start_time <- Sys.time()
  sol <- find_clusters(K, ds)
  calc_duration <- Sys.time() - start_time
  print(sprintf("    %s sec", calc_duration))
  print(sprintf("Objective: %s", sol$obj))
  print("Centers:")
  print(ds[sol$y$val > 0, ])
}

which requires the auxiliary kmedoids.mod file:

param pN;
param pK;
set pPoints := 1..pN;
set pSegments := 1..pK;

param pDist{i in pPoints, j in pPoints} >= 0;

table T_dist IN "CSV" "dist.csv":
[i,j], pDist ~ dist;

var vPointToSegment{i in pPoints, j in pPoints}, binary;
var vSegmentCenter{i in pPoints}, binary;

minimize total_cost: sum{i in pPoints, j in pPoints} (vPointToSegment[i,j] * pDist[i,j]);

subject to

cPointToOnlyOneSegment{i in pPoints}:
sum{j in pPoints} vPointToSegment[i,j] = 1;

cSegmentsCnt:
sum{i in pPoints} vSegmentCenter[i] = pK;

cPointOnlyToActiveSegment_1{i in pPoints, j in pPoints}:
vPointToSegment[i, j] <= vSegmentCenter[j];

cPointOnlyToActiveSegment_2{j in pPoints}:
sum{i in pPoints} vPointToSegment[i, j] <= card(pPoints)*vSegmentCenter[j];

end;

I do not want to go through the code in detail; please judge it for yourself. However, two things are immediately apparent in my opinion:

  1. the ability to specify the model within Julia beats glpkAPI hands down; with glpkAPI you have to use an external file for the specification of the model and external files to communicate data to it;
  2. in Julia, if we want to switch the solver from GLPK to any other (e.g. CBC), the only line we would have to change is m = Model(solver=GLPKSolverMIP()). In R there are bindings to other solvers, but they are much lower-level and, most importantly, you would have to rewrite most of the code.

Wednesday, October 19, 2016

Deep learning in the cloud with MXNet

Last Friday, together with Przemysław Szufel and Wit Jakuczun, I gave a live introductory demo of deep learning at the Digital Champions conference.

The objective of the workshop was to show how to build a simple predictive model using the MXNet library in a few minutes. In the example we train a model on GPUs using the standard MNIST database of handwritten digits.
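To give an idea of what that looks like, here is a hedged sketch of such a model in R (not the exact demo code, which is linked below; it assumes train.x is a matrix of pixel intensities and train.y holds the digit labels):

library(mxnet)

# a small multi-layer perceptron built with the symbolic API
data <- mx.symbol.Variable("data")
fc1  <- mx.symbol.FullyConnected(data, num_hidden = 128)
act1 <- mx.symbol.Activation(fc1, act_type = "relu")
fc2  <- mx.symbol.FullyConnected(act1, num_hidden = 10)
net  <- mx.symbol.SoftmaxOutput(fc2)

# train.x: N x 784 pixel matrix, train.y: labels 0-9 (assumed prepared earlier)
model <- mx.model.FeedForward.create(net, X = train.x, y = train.y,
    ctx = mx.gpu(),                 # mx.cpu() works as well, just slower
    num.round = 10, array.batch.size = 100,
    learning.rate = 0.05, momentum = 0.9)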

Our post-presentation impression is that many people have no experience with using GPUs in the cloud. Our point is that it is actually simpler and cheaper than having the infrastructure on premises: running the whole tutorial cost us under $1, and using the bitfusion.io Scientific Computing AMI we were able to painlessly get the whole computing environment running on AWS in a few minutes.

On GitHub you can find the instructions on how to set up the computations in the cloud and the R source code for building the model using MXNet.

Friday, September 30, 2016

R+H2O for marketing campaign modeling

My last post about telco churn prediction with R+H2O attracted an unexpectedly high response. It seems that the R+H2O combo currently has very good momentum :). Therefore Wit Jakuczun decided to publish a case study, based on the same technology stack, that he uses in his R boot camps.

You can find the shortened version of the case study material along with the source codes and data sets on GitHub.

I think it is nice not only because it is another example of how to use H2O in R, but also because it is a basic introduction to combining segmentation and predictive modeling for marketing campaign targeting.

Monday, September 12, 2016

Telco churn prediction with R+H2O

Recently, together with my friend Wit Jakuczun, I discussed a blog post on Revolution showing an application of SQL Server R Services to build and run a telco churn model. It is a very nice analysis, and we thought it would be interesting to compare the results to H2O, which is a great tool for automated building of prediction models.

Today Wit published on GitHub his code performing the analysis. You can check it out here. The obtained model is pretty good, with an AUC of 0.947, which is better than any of the models presented in the Revolution blog post. But in my opinion the key value of Wit's work is to show how simple it is to build models with H2O using R (a minimal sketch of the pattern follows below). From his GitHub repository you can download a ready-to-run RStudio project and all the instructions needed to build the model in just a few lines of code.
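For readers who have not tried it, the H2O-from-R pattern looks roughly like this (an illustration only; churn.csv and the Churn column are placeholders, and the actual data, settings and code are in Wit's repository):

library(h2o)
h2o.init()

churn <- h2o.importFile("churn.csv")
churn$Churn <- as.factor(churn$Churn)   # binary classification target
parts <- h2o.splitFrame(churn, ratios = 0.8, seed = 1)

model <- h2o.gbm(y = "Churn", x = setdiff(names(churn), "Churn"),
                 training_frame = parts[[1]], validation_frame = parts[[2]])

h2o.auc(h2o.performance(model, newdata = parts[[2]]))  # validation AUC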

Tuesday, January 6, 2015

Sequence generation in R, Python and Julia

Recently I was comparing the implementations of sequence generation functions in R, Python (numpy) and Julia. Interestingly, even such basic functions differ slightly in their behavior. In my opinion Julia provides the best solution and Python the worst.
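As a reminder of the R side of the comparison, here are a few characteristic calls (the Julia and numpy counterparts differ in details such as endpoint handling):

seq(0, 1, by = 0.25)        # step-based:   0.00 0.25 0.50 0.75 1.00
seq(0, 1, length.out = 5)   # length-based: the same five values
seq_len(0)                  # integer(0); safe where 1:0 would yield c(1, 0)
1:5                         # the colon operator: 1 2 3 4 5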