New release with LightGBM 2.2.1 #390

Closed
superbobry opened this issue Oct 8, 2018 · 30 comments
@superbobry

superbobry commented Oct 8, 2018

Release 2.2.1 of LightGBM (see microsoft/LightGBM#1727) made it possible to run it on older systems (most notably CentOS 7). Could you kindly consider releasing a new version of mmlspark that incorporates this change?

Update: corrected the PR link.

@imatiach-msft
Contributor

@superbobry I think you meant a different link than #727?

@imatiach-msft
Contributor

@superbobry Sure, I can work on releasing a newer version, but I first need to make sure it would indeed work, and I need to figure out how to test it. If I release it, would you be able to test it?

@superbobry
Author

@imatiach-msft thanks, I've updated the link. I will be able to test the release once it's out, yes.

By the way, thank you very much for the superb support! It is very rare to see replies on OSS issues within the same day, let alone within the same hour :)

@imatiach-msft
Contributor

@superbobry sorry, would you be able to try out this build:

--packages com.microsoft.ml.spark:mmlspark_2.11:0.14.dev9+1.g5783ce91
--repositories https://mmlspark.azureedge.net/maven

You need to specify our build repository since I haven't published this to Maven Central yet; I want to verify it first. Hope it fixes the issue! If you run into trouble, let me know and we could get on a Skype call together to debug it; I might be able to help.
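If you happen to consume mmlspark from sbt rather than through spark-shell, the equivalent configuration should look roughly like this (an untested sketch; the resolver name is arbitrary, and the coordinates are the ones from the --packages flag above):

// build.sbt sketch: point sbt at the mmlspark build repository and pull the dev build.
// Assumes scalaVersion := "2.11.x", so %% resolves to the mmlspark_2.11 artifact.
resolvers += "MMLSpark Build Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.14.dev9+1.g5783ce91"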

@superbobry
Author

@imatiach-msft are you sure the artifact is available in the repository?

$ curl https://mmlspark.azureedge.net/maven/Azure/mmlspark_2.11/0.14.dev9+1.g5783ce91/mmlspark_2.11-0.14.dev9+1.g5783ce91.pom 
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:010da925-c01e-0047-2c0d-623285000000
Time:2018-10-12T09:28:28.2599081Z</Message></Error>

@imatiach-msft
Contributor

@superbobry
I can see it here: https://mmlspark.azureedge.net/maven/com/microsoft/ml/spark/mmlspark_2.11/0.14.dev9+1.g5783ce91/mmlspark_2.11-0.14.dev9+1.g5783ce91.pom
It's not under Azure; that path is for the official Maven Central release. This one is just for our builds.

@superbobry
Author

Thanks! @alois-bissuel, could you give it a try on Monday?

@imatiach-msft
Contributor

@alois-bissuel @superbobry please let me know if this resolves the issue or if you run into any other problems

@alois-bissuel

alois-bissuel commented Oct 15, 2018

Yes, I am trying it now. The only trouble is that the repo I pull this custom version from (https://mmlspark.azureedge.net/maven/) doesn't have the full set of dependencies (the lightGBM jar at least), so it would streamline our work if it did. (I have managed to pull it from the repository listed in the pom you mentioned above, but that is not the most practical way.)

@imatiach-msft
Contributor

@alois-bissuel could you please explain that a bit more? It should have the dependency, based on this line in the PR I built:
https://github.com/Azure/mmlspark/pull/391/files#diff-7b450c33e07c4838c2f8f1901b791b77R41

"com.microsoft.ml.lightgbm" %  "lightgbmlib" % "2.2.100"

The dependency is also from a maven repo I created from an Azure blob:
"LightGBM Maven Repo" at "https://azuremlbuild.blob.core.windows.net/maven"

@imatiach-msft
Contributor

@alois-bissuel I can also publish to Maven Central if that is a problem, but I would prefer to do that only after you have validated that the update fixes the issue for you. I'm not actually sure that it will; I think I actually need to publish the lightgbm shared object file together with its glibc shared object dependency to resolve the glibc errors.

@alois-bissuel

@imatiach-msft: no, it is OK, I eventually managed to pull all the dependencies, so do not bother publishing to Maven Central. I will test this tomorrow. Thanks for the quick answer, I will keep you posted!

@alois-bissuel

alois-bissuel commented Oct 16, 2018

I just managed to run with the version you provided. There is still a linking error:
java.lang.UnsatisfiedLinkError: /hdfs/uuid/c19321f1-80c9-4338-8a11-30c98f5fcbd4/yarn/data/usercache/XXXX/appcache/application_1539164595645_1548310/container_e293_1539164595645_1548310_01_000588/tmp/mml-natives6180385213394115619/lib_lightgbm.so: /lib64/libstdc++.so.6: version GLIBCXX_3.4.20 not found (required by /hdfs/uuid/c19321f1-80c9-4338-8a11-30c98f5fcbd4/yarn/data/usercache/XXXX/appcache/application_1539164595645_1548310/container_e293_1539164595645_1548310_01_000588/tmp/mml-natives6180385213394115619/lib_lightgbm.so)

It seems that on CentOS 7 the latest version of GLIBCXX is 3.4.19! I don't know which version of GLIBCXX was targeted by Microsoft/LightGBM#1727.

@alois-bissuel

A small update: the error above corresponds to a problem linking against libstdc++, as indicated in Microsoft/LightGBM#1708. There are no more linking problems with glibc itself.

@julienchamp

julienchamp commented Oct 18, 2018

Thanks for all this really interesting work; it has great potential for users with a Spark cluster running CentOS 7!

I've tested the new jar and I have the same problem as @alois-bissuel:

Caused by: java.lang.UnsatisfiedLinkError: /hdfs/hadoop/yarn/local/usercache/livy/appcache/application_1539859181572_0009/container_e51_1539859181572_0009_01_000009/tmp/mml-natives3222353148039997186/lib_lightgbm.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

Did you find a solution to this libstdc++ problem?

@julienchamp

julienchamp commented Oct 18, 2018

And of course: I can help with testing if required, @imatiach-msft!

@alois-bissuel

It looks like I overcame the linking error by adding a more recent version of libstdc++.so.6 to the LD_LIBRARY_PATH (which I incidentally found in my miniconda installation, in case somebody asks). At least, ldd lib_lightgbm.so shows no linking problem.
For running with Spark, it is simply a matter of using spark.yarn.dist.files (to ship the lib) and spark.executor.extraLibraryPath (to add the current working directory to the LD_LIBRARY_PATH of the executors), as sketched below.
I still have errors afterwards (which seem unrelated), so I can't yet confirm that this fix fully works.
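For reference, a minimal sketch of that setup (the libstdc++ source path is a placeholder for wherever the newer library lives, e.g. a miniconda installation):

import org.apache.spark.sql.SparkSession

// Sketch: ship a newer libstdc++ to each executor's working directory and make
// the dynamic linker look there first. The source path is a placeholder.
val spark = SparkSession.builder()
  .config("spark.yarn.dist.files", "/opt/miniconda/lib/libstdc++.so.6")
  .config("spark.executor.extraLibraryPath", ".")
  .getOrCreate()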

@alois-bissuel

Coming back from the network error in #405, I now have a bad allocation error.

Once again, I added an external libstdc++.so.6 to every executor's LD_LIBRARY_PATH.

See the stacktrace:
java.lang.Exception: Dataset create call failed in LightGBM with error: std::bad_alloc
at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:26)
at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:349)
at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:42)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:211)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)

@imatiach-msft
Contributor

@alois-bissuel it looks like you are running out of memory; would you be able to decrease the size of the dataset or increase the memory in your cluster?

@imatiach-msft
Contributor

@alois-bissuel please see the related explanation here: #406

@imatiach-msft
Contributor

@alois-bissuel copy-pasting for convenience:

sorry, this is an issue with lightgbm - the dataset on each partition is replicated in native memory (so the native lightgbm code can run), so at a minimum lightgbm takes about 2x the dataset size to train.
You could try two things:
1.) increase the memory of the cluster
2.) use incremental training with lightgbm: you can split up your dataset, run lightgbm on the first split, save the native learner, and then retrain on the next split, passing in the saved learner via the lightgbm learner param (see the sketch below)
Sorry about the inconvenience.
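A rough sketch of option 2 (the way the saved learner is passed back in is an assumption here, not a confirmed mmlspark API; df stands for the assembled training DataFrame with "label" and "features" columns):

import com.microsoft.ml.spark.LightGBMClassifier

// Split the training data and fit on the first half.
val Array(part1, part2) = df.randomSplit(Array(0.5, 0.5), seed = 42)
val firstModel = new LightGBMClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(part1)
// Hypothetical second step: hand firstModel's native learner to a new
// LightGBMClassifier via the learner param mentioned above, then fit on part2.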

@alois-bissuel

Thank you for the very quick answer!
I don't think this is a matter of memory available to the workers. I have a pretty small dataset (around 1 GB), and I tried running on 10 executors with a lot of memory available (32 GB off-heap through the YARN memory overhead), and I still have the same error.

@imatiach-msft
Contributor

@alois-bissuel is it the same bad-alloc error:

java.lang.Exception: Dataset create call failed in LightGBM with error: std::bad_alloc

if so, I'm not really sure what else it could be; searching online definitely suggests it is caused by OOM. One sanity check would be to print out the number of partitions of the dataset prior to training with lightgbm and make sure it is not 1 but some reasonable number. Also, if you sample the dataset down, say by 50%, do you still see the error? The error is definitely coming from one of the workers and not the driver, so I don't think increasing driver memory would help. Are you sure the spark.executor.memory or --executor-memory configuration is set to a reasonable size?
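For example (df being the DataFrame passed to the LightGBM estimator):

// Sanity check: make sure the data is spread over a reasonable number of partitions.
println(s"partitions before training: ${df.rdd.getNumPartitions}")
// Try a ~50% sample to see whether the bad_alloc goes away.
val sampled = df.sample(withReplacement = false, fraction = 0.5, seed = 7)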

@alois-bissuel

I have checked it all: the number of partitions is reasonable (or maybe too high, but I also tried setting one partition per executor), and the memory settings look sensible. I did not try subsampling the dataset, as it is already quite small (about 1 GB for ten executors, each with roughly three dozen GB of RAM).

Looking at the profiling tools we have at our company, it seems that the executors are consuming a lot of heap but very little off-heap. Is that expected, given that LightGBM should allocate off-heap?

@imatiach-msft
Contributor

@alois-bissuel hmm, just to make sure I understand correctly: by off-heap do you mean unmanaged memory (e.g. the native C/C++ allocations, NOT the memory on the stack in the process), and by on-heap do you mean the Java managed heap? If so, they should be about the same size, since we take the data from the Java side and allocate a copy on the native side. If the Java on-heap memory is much higher than the native off-heap memory, then this does indeed look like a bug that we should be able to trace in the Scala code.

@alois-bissuel

Yes, this is exactly what I meant (regarding on-heap and off-heap). The off-heap usage stays very low (150 MB) whereas the on-heap usage increases to 7-8 GB. Are there any tests or some sort of profiling I could do?

@imatiach-msft
Contributor

The off-heap usage stays very low (150 MB) whereas the on-heap usage increases to 7-8 GB

@alois-bissuel that is odd; I would assume the off-heap usage would be at least 1 GB, if not more, since that is the size of the dataset. I'm not sure why it would be so low; 150 MB doesn't make sense to me.

I'm not sure how to profile on Spark; do good tools exist for profiling distributed clusters? For C# I've used the JetBrains and Redgate memory profilers a lot, for Python I've used cProfile a lot, and I've also used GNU gdb and WinDbg a lot in the past. I haven't used many Java memory profilers, and certainly haven't used one on a distributed cluster. Based on some quick searching, this one looks good: https://github.com/uber-common/jvm-profiler (a rough sketch of hooking it up follows below).
Ideally it would be nice to see which objects are holding so much memory and what is preventing them from being garbage collected.
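For what it's worth, a rough sketch of attaching that profiler as a Java agent to the executors (the jar path is a placeholder, and the exact reporter options are an assumption; check the project's README for the real ones):

import org.apache.spark.sql.SparkSession

// Sketch: load the profiler agent in every executor JVM. The jar is assumed to
// be available on the executor nodes at the given (placeholder) path.
val spark = SparkSession.builder()
  .config("spark.executor.extraJavaOptions", "-javaagent:/path/to/jvm-profiler.jar")
  .getOrCreate()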

@imatiach-msft
Contributor

Closing, as 2.2.2 has been merged to master; it should be in the next release: #391
