Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating GLMNet and Lasso #51

Open
sourish-cmi opened this issue Oct 4, 2022 · 11 comments
Open

Integrating GLMNet and Lasso #51

sourish-cmi opened this issue Oct 4, 2022 · 11 comments
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@sourish-cmi
Copy link
Collaborator

@ajaynshah @ayushpatnaikgit @mousum-github @SusanXKDR

I am tempted to integrate GLMNet and Lasso

https://github.com/JuliaStats/GLMNet.jl
https://github.com/JuliaStats/Lasso.jl/blob/master/docs/src/index.md
https://github.com/simonster/LARS.jl

I understand this Julia package wraps the Fortran code from glmnet.

Shall we integrate it to CRRao?

@sourish-cmi sourish-cmi self-assigned this Oct 4, 2022
@sourish-cmi sourish-cmi added the enhancement New feature or request label Oct 4, 2022
@sourish-cmi sourish-cmi added this to the Package Completeness milestone Oct 5, 2022
@sourish-cmi
Copy link
Collaborator Author

@ShouvikGhosh2048 @codetalker7

We need to do a performance check against R

@sourish-cmi sourish-cmi added the good first issue Good for newcomers label Oct 5, 2022
@sourish-cmi
Copy link
Collaborator Author

sourish-cmi commented Oct 12, 2022

Julia Code

using RDatasets

mtcars =  dataset("datasets", "mtcars")

y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:,["HP", "WT", "DRat", "QSec"]])

using GLMNet
using Random
Random.seed!(1234)
cv = glmnetcv(x, y) 

@sourish-cmi
Copy link
Collaborator Author

sourish-cmi commented Oct 12, 2022

R code

attach(mtcars)
y <- mtcars$mpg

x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)
#perform 10-fold cross-validation to find optimal lambda value
set.seed(100)
cv_model <- cv.glmnet(x, y, alpha = 1,nfolds=10)

@ShouvikGhosh2048
Copy link
Collaborator

ShouvikGhosh2048 commented Oct 13, 2022

I tried benchmarking the two programs (on WSL):

using RDatasets, GLMNet, Random, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

@benchmark glmnetcv(x, y)
attach(mtcars)
library(microbenchmark)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)

microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10))

I got:

Julia:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 307.400 μs … 6.012 ms ┊ GC (min … max): 0.00% … 78.04%
Time (median): 321.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 349.233 μs ± 213.283 μs ┊ GC (mean ± σ): 3.52% ± 5.38%

Memory estimate: 178.95 KiB, allocs estimate: 679.

R:
Unit: milliseconds
expr min lq mean median uq max neval
cv.glmnet(x, y, alpha = 1, nfolds = 10) 49.2733 50.6152 53.89493 52.54825 54.24935 172.8879 100

The Julia programs takes 1/100th of the time taken by the R program. This seems wrong, I'm not sure if I made a mistake.

@sourish-cmi
Copy link
Collaborator Author

sourish-cmi commented Oct 14, 2022

Let me check it on my end

@sourish-cmi
Copy link
Collaborator Author

Julia performance

julia> @benchmark glmnetcv(x, y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  293.541 μs    2.972 ms  ┊ GC (min  max): 0.00%  88.98%
 Time  (median):     305.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   313.728 μs ± 135.655 μs  ┊ GC (mean ± σ):  2.22% ±  4.59%

                  ▃▅▇████▅▃▁                                     
  ▂▁▂▂▂▂▂▂▂▂▂▃▃▄▅███████████▇▇▅▅▄▄▄▄▃▄▃▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  294 μs           Histogram: frequency by time          327 μs <

 Memory estimate: 175.67 KiB, allocs estimate: 649.

R Performance

> microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10)
+                ,times = 100)
Unit: milliseconds
                                    expr     min       lq     mean
 cv.glmnet(x, y, alpha = 1, nfolds = 10) 16.8904 17.04585 19.07382
   median       uq      max neval
 17.23759 19.43857 88.26255   100

@sourish-cmi
Copy link
Collaborator Author

Old style benchmarking:

R performance

attach(mtcars)
library(microbenchmark)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)

microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10)
               ,times = 200)

start = Sys.time()
set.seed(10101)
for(i in 1:1000){
  fit =cv.glmnet(x, y, alpha = 1, nfolds = 10)
}
Sys.time()-start

Time difference of 18.86479 secs

Julia Performance

using RDatasets, GLMNet, Random, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

start = Sys.time()
using Random
for i = 1:1000
    Random.seed!(1234)
    cv = glmnetcv(x, y) 
end
Sys.time()-start
0.35875892639160156

Performance gain

> 18.86479/0.35875892
[1] 52.58347

52 times performance gain

@ShouvikGhosh2048
Copy link
Collaborator

using RDatasets, GLMNet, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

@benchmark glmnet(x, y)
attach(mtcars)
library(microbenchmark)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)

microbenchmark(glmnet(x, y))

Julia:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.500 μs …  2.814 ms  ┊ GC (min … max): 0.00% … 98.16%
 Time  (median):     22.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.231 μs ± 47.710 μs  ┊ GC (mean ± σ):  3.34% ±  1.70%

 Memory estimate: 12.50 KiB, allocs estimate: 21.

R:

Unit: milliseconds
         expr    min     lq     mean  median     uq    max neval
 glmnet(x, y) 1.1302 1.1882 1.258889 1.24225 1.2888 2.5004   100

Julia glmnet output:

Least Squares GLMNet Solution Path (58 solutions for 4 predictors in 420 passes):
─────────────────────────────
      df   pct_dev          λ
─────────────────────────────
 [1]   0  0.0       5.14698
 [2]   1  0.127818  4.68974
 [3]   1  0.233934  4.27311
 [4]   1  0.322034  3.8935
 [5]   2  0.395479  3.54761
 [6]   2  0.46873   3.23245
 [7]   2  0.529522  2.94529
 [8]   2  0.579992  2.68364
 [9]   2  0.621893  2.44523
[10]   2  0.656659  2.22801
[11]   2  0.685543  2.03008
[12]   2  0.709524  1.84973
[13]   2  0.729433  1.6854
[14]   2  0.745961  1.53568
[15]   2  0.759684  1.39925
[16]   3  0.77273   1.27495
[17]   3  0.783621  1.16168
[18]   3  0.792664  1.05848
[19]   3  0.800171  0.96445
[20]   3  0.806403  0.878771
[21]   3  0.811577  0.800704
[22]   3  0.815873  0.729571
[23]   4  0.819758  0.664758
[24]   4  0.824109  0.605703
[25]   4  0.827712  0.551894
[26]   4  0.830713  0.502865
[27]   4  0.833204  0.458192
[28]   4  0.835266  0.417488
[29]   4  0.836983  0.380399
[30]   4  0.83841   0.346605
[31]   4  0.83959   0.315814
[32]   4  0.840573  0.287758
[33]   4  0.84139   0.262194
[34]   4  0.842065  0.238902
[35]   4  0.842628  0.217678
[36]   4  0.843096  0.19834
[37]   4  0.843483  0.18072
[38]   4  0.843805  0.164666
[39]   4  0.844074  0.150037
[40]   4  0.844295  0.136708
[41]   4  0.844479  0.124564
[42]   4  0.844633  0.113498
[43]   4  0.84476   0.103415
[44]   4  0.844866  0.0942278
[45]   4  0.844954  0.0858568
[46]   4  0.845026  0.0782295
[47]   4  0.845087  0.0712798
[48]   4  0.845137  0.0649475
[49]   4  0.845179  0.0591778
[50]   4  0.845214  0.0539206
[51]   4  0.845243  0.0491304
[52]   4  0.845267  0.0447658
[53]   4  0.845287  0.0407889
[54]   4  0.845303  0.0371654
[55]   4  0.845317  0.0338637
[56]   4  0.845329  0.0308553
[57]   4  0.845338  0.0281142
[58]   4  0.845346  0.0256166
─────────────────────────────

R glmnet output:

Call:  glmnet(x = x, y = y)

   Df  %Dev Lambda
1   0  0.00 5.1470
2   1 12.78 4.6900
3   1 23.39 4.2730
4   1 32.20 3.8940
5   2 39.55 3.5480
6   2 46.87 3.2320
7   2 52.95 2.9450
8   2 58.00 2.6840
9   2 62.19 2.4450
10  2 65.67 2.2280
11  2 68.55 2.0300
12  2 70.95 1.8500
13  2 72.94 1.6850
14  2 74.60 1.5360
15  2 75.97 1.3990
16  3 77.27 1.2750
17  3 78.36 1.1620
18  3 79.27 1.0580
19  3 80.02 0.9645
20  3 80.64 0.8788
21  3 81.16 0.8007
22  3 81.59 0.7296
23  4 81.98 0.6648
24  4 82.41 0.6057
25  4 82.77 0.5519
26  4 83.07 0.5029
27  4 83.32 0.4582
28  4 83.53 0.4175
29  4 83.70 0.3804
30  4 83.84 0.3466
31  4 83.96 0.3158
32  4 84.06 0.2878
33  4 84.14 0.2622
34  4 84.21 0.2389
35  4 84.26 0.2177
36  4 84.31 0.1983
37  4 84.35 0.1807
38  4 84.38 0.1647
39  4 84.41 0.1500
40  4 84.43 0.1367
41  4 84.45 0.1246
42  4 84.46 0.1135
43  4 84.48 0.1034
44  4 84.49 0.0942
45  4 84.50 0.0859
46  4 84.50 0.0782
47  4 84.51 0.0713
48  4 84.51 0.0649
49  4 84.52 0.0592
50  4 84.52 0.0539
51  4 84.52 0.0491
52  4 84.53 0.0448
53  4 84.53 0.0408
54  4 84.53 0.0372
55  4 84.53 0.0339
56  4 84.53 0.0309
57  4 84.53 0.0281
58  4 84.53 0.0256

@sourish-cmi
Copy link
Collaborator Author

Okay, this looks like only 4.5 times faster. This makes sense. In the cv_glmnet, the function does the sampling using some clever technique, resulting in a super performance gain.

@sourish-cmi
Copy link
Collaborator Author

sourish-cmi commented Jan 12, 2023

As per our discussion and debate at the Goa conference - it appears it is better to have a native development of GLMNet.jl
That is outside the periphery of CRRao. So we are putting it back on the back burner.

However, LASSO.jl is a native Julia package. So we will integrate LASSO.jl to CRRao

https://www.jstatsoft.org/article/view/v033i01

@sourish-cmi
Copy link
Collaborator Author

@ShouvikGhosh2048 @mousum-github @ajaynshah @codetalker7 @ayushpatnaikgit

Lookout this doc: https://juliastats.org/Lasso.jl/stable/lasso/

Lasso.jl depends completely on GLM.jl

That means for all GLM.jl models Lasso.jl will work. Top of that Lasso.jl is native Julia development

Hence we should integrate Lasso.jl as top priority

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants