New release with LightGBM 2.2.1 #390

Closed
superbobry opened this issue Oct 8, 2018 · 30 comments
@superbobry

superbobry commented Oct 8, 2018

Release 2.2.1 of LightGBM (see microsoft/LightGBM#1727) made it possible to run it on older systems (most notably CentOS 7). Could you kindly consider releasing a new version of mmlspark that incorporates this change?

Update: corrected the PR link.

@imatiach-msft
Contributor

@superbobry I think you meant a different link than #727?

@imatiach-msft
Contributor

@superbobry Sure, I can work on releasing a newer version, but I first need to make sure it would indeed work, and I need to figure out how to test it. If I release it, would you be able to test it?

@superbobry
Author

@imatiach-msft thanks, I've updated the link. I will be able to test the release once it's out, yes.

By the way, thank you very much for the superb support! It is very rare to see replies on OSS issues within the same day, let alone within the same hour :)

@imatiach-msft
Contributor

@superbobry sorry, would you be able to try out this build:

--packages com.microsoft.ml.spark:mmlspark_2.11:0.14.dev9+1.g5783ce91
--repositories https://mmlspark.azureedge.net/maven

You need to specify our build repository since I haven't published this to Maven Central yet; I want to verify it first. Hope it fixes the issue! If you run into trouble, let me know and we could get on a Skype call together to debug it; I might be able to help.
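If you happen to consume mmlspark from sbt rather than through spark-shell, the equivalent configuration should look roughly like this (an untested sketch; the resolver name is arbitrary, and the coordinates are the ones from the --packages flag above):

// build.sbt sketch: point sbt at the mmlspark build repository and pull the dev build.
// Assumes scalaVersion := "2.11.x", so %% resolves to the mmlspark_2.11 artifact.
resolvers += "MMLSpark Build Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "com.microsoft.ml.spark" %% "mmlspark" % "0.14.dev9+1.g5783ce91"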

@superbobry
Author

@imatiach-msft are you sure the artifact is available in the repository?

$ curl https://mmlspark.azureedge.net/maven/Azure/mmlspark_2.11/0.14.dev9+1.g5783ce91/mmlspark_2.11-0.14.dev9+1.g5783ce91.pom 
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:010da925-c01e-0047-2c0d-623285000000
Time:2018-10-12T09:28:28.2599081Z</Message></Error>

@imatiach-msft
Contributor

@superbobry
I can see it here: https://mmlspark.azureedge.net/maven/com/microsoft/ml/spark/mmlspark_2.11/0.14.dev9+1.g5783ce91/mmlspark_2.11-0.14.dev9+1.g5783ce91.pom
It's not under Azure; that path is for the official Maven Central release. This one is just for our builds.

@superbobry
Author

Thanks! @alois-bissuel, could you give it a try on Monday?

@imatiach-msft
Contributor

@alois-bissuel @superbobry please let me know if this resolves the issue or if you run into any other problems

@alois-bissuel

alois-bissuel commented Oct 15, 2018

Yes, I am trying it now. The only trouble is that the repo I pull this custom version from (https://mmlspark.azureedge.net/maven/) doesn't have the full set of dependencies (the lightGBM jar at least), so it would streamline our work if it did. (I have managed to pull it from the repository listed in the pom you mentioned above, but that is not the most practical way.)

@imatiach-msft
Contributor

@alois-bissuel could you please explain that a bit more? It should have the dependency, based on this line in the PR I built:
https://github.com/Azure/mmlspark/pull/391/files#diff-7b450c33e07c4838c2f8f1901b791b77R41

"com.microsoft.ml.lightgbm" %  "lightgbmlib" % "2.2.100"

The dependency is also from a maven repo I created from an Azure blob:
"LightGBM Maven Repo" at "https://azuremlbuild.blob.core.windows.net/maven"

@imatiach-msft
Contributor

@alois-bissuel I can also publish to Maven Central if that is a problem, but I would prefer to do that only after you have validated that the update fixes the issue for you. I'm not actually sure that it will; I think I actually need to publish the lightgbm shared object file together with its glibc shared object dependency to resolve the glibc errors.

@alois-bissuel

@imatiach-msft: no, it is OK, I eventually managed to pull all the dependencies, so do not bother publishing to Maven Central. I will test this tomorrow. Thanks for the quick answer, I will keep you posted!

@alois-bissuel

alois-bissuel commented Oct 16, 2018

I just managed to run with the version you provided. There is still a linking error:
java.lang.UnsatisfiedLinkError: /hdfs/uuid/c19321f1-80c9-4338-8a11-30c98f5fcbd4/yarn/data/usercache/XXXX/appcache/application_1539164595645_1548310/container_e293_1539164595645_1548310_01_000588/tmp/mml-natives6180385213394115619/lib_lightgbm.so: /lib64/libstdc++.so.6: version GLIBCXX_3.4.20 not found (required by /hdfs/uuid/c19321f1-80c9-4338-8a11-30c98f5fcbd4/yarn/data/usercache/XXXX/appcache/application_1539164595645_1548310/container_e293_1539164595645_1548310_01_000588/tmp/mml-natives6180385213394115619/lib_lightgbm.so)

It seems that on CentOS 7 the latest version of GLIBCXX is 3.4.19! I don't know which version of GLIBCXX was targeted by Microsoft/LightGBM#1727.

@alois-bissuel

A small update: the error above corresponds to a problem linking against libstdc++, as indicated in Microsoft/LightGBM#1708. There are no more linking problems with glibc itself.

@julienchamp

julienchamp commented Oct 18, 2018

Thanks for all this really interesting work; it has great potential for users with a Spark cluster running CentOS 7!

I've tested the new jar and I have the same problem as @alois-bissuel:

Caused by: java.lang.UnsatisfiedLinkError: /hdfs/hadoop/yarn/local/usercache/livy/appcache/application_1539859181572_0009/container_e51_1539859181572_0009_01_000009/tmp/mml-natives3222353148039997186/lib_lightgbm.so: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

Did you find a solution to this libstdc++ problem?

@julienchamp

julienchamp commented Oct 18, 2018

And of course: I can help with testing if required, @imatiach-msft!

@alois-bissuel

It looks like I overcame the linking error by adding a more recent version of libstdc++.so.6 to the LD_LIBRARY_PATH (which I incidentally found in my miniconda installation, in case somebody asks). At least, ldd lib_lightgbm.so shows no linking problem.
For running with Spark, it is simply a matter of using spark.yarn.dist.files (to ship the lib) and spark.executor.extraLibraryPath (to add the current working directory to the LD_LIBRARY_PATH of the executors), as sketched below.
I still have errors afterwards (which seem unrelated), so I can't yet confirm that this fix fully works.
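For reference, a minimal sketch of that setup (the libstdc++ source path is a placeholder for wherever the newer library lives, e.g. a miniconda installation):

import org.apache.spark.sql.SparkSession

// Sketch: ship a newer libstdc++ to each executor's working directory and make
// the dynamic linker look there first. The source path is a placeholder.
val spark = SparkSession.builder()
  .config("spark.yarn.dist.files", "/opt/miniconda/lib/libstdc++.so.6")
  .config("spark.executor.extraLibraryPath", ".")
  .getOrCreate()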

@alois-bissuel

Coming back from the network error in #405, I now have a bad allocation error.

Once again, I added an external libstdc++.so.6 to every executor's LD_LIBRARY_PATH.

See the stacktrace:
java.lang.Exception: Dataset create call failed in LightGBM with error: std::bad_alloc
at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:26)
at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:349)
at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:42)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:211)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$1.apply(LightGBMClassifier.scala:58)

@imatiach-msft
Contributor

@alois-bissuel it looks like you are running out of memory; would you be able to decrease the size of the dataset or increase the memory in your cluster?

@imatiach-msft
Contributor

@alois-bissuel please see the related explanation here: #406

@imatiach-msft
Contributor

@alois-bissuel copy-pasting for convenience:

sorry, this is an issue with lightgbm - the dataset on each partition is replicated in native memory (so the native lightgbm code can run), so at a minimum lightgbm takes about 2x the dataset size to train.
You could try two things:
1.) increase the memory of the cluster
2.) use incremental training with lightgbm: you can split up your dataset, run lightgbm on the first split, save the native learner, and then retrain on the next split, passing in the saved learner via the lightgbm learner param (see the sketch below)
Sorry about the inconvenience.
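A rough sketch of option 2 (the way the saved learner is passed back in is an assumption here, not a confirmed mmlspark API; df stands for the assembled training DataFrame with "label" and "features" columns):

import com.microsoft.ml.spark.LightGBMClassifier

// Split the training data and fit on the first half.
val Array(part1, part2) = df.randomSplit(Array(0.5, 0.5), seed = 42)
val firstModel = new LightGBMClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(part1)
// Hypothetical second step: hand firstModel's native learner to a new
// LightGBMClassifier via the learner param mentioned above, then fit on part2.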

@alois-bissuel

Thank you for the very quick answer!
I don't think this is a matter of memory available to the workers. I have a pretty small dataset (around 1 GB), and I tried running on 10 executors with a lot of memory available (32 GB off-heap through the YARN memory overhead), and I still have the same error.

@imatiach-msft
Contributor

@alois-bissuel is it the same bad-alloc error:

java.lang.Exception: Dataset create call failed in LightGBM with error: std::bad_alloc

if so, I'm not really sure what else it could be; searching online definitely suggests it is caused by OOM. One sanity check would be to print out the number of partitions of the dataset prior to training with lightgbm and make sure it is not 1 but some reasonable number. Also, if you sample the dataset down, say by 50%, do you still see the error? The error is definitely coming from one of the workers and not the driver, so I don't think increasing driver memory would help. Are you sure the spark.executor.memory or --executor-memory configuration is set to a reasonable size?
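For example (df being the DataFrame passed to the LightGBM estimator):

// Sanity check: make sure the data is spread over a reasonable number of partitions.
println(s"partitions before training: ${df.rdd.getNumPartitions}")
// Try a ~50% sample to see whether the bad_alloc goes away.
val sampled = df.sample(withReplacement = false, fraction = 0.5, seed = 7)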

@alois-bissuel

I have checked it all: the number of partitions is reasonable (or maybe too high, but I also tried setting one partition per executor), and the memory settings look sensible. I did not try subsampling the dataset, as it is already quite small (about 1 GB for ten executors, each with roughly three dozen GB of RAM).

Looking at the profiling tools we have at our company, it seems that the executors are consuming a lot of heap but very little off-heap. Is that expected, given that LightGBM should allocate off-heap?

@imatiach-msft
Contributor

@alois-bissuel hmm, just to make sure I understand correctly: by off-heap do you mean unmanaged memory (e.g. the native C/C++ allocations, NOT the memory on the stack in the process), and by on-heap do you mean the Java managed heap? If so, they should be about the same size, since we take the data from the Java side and allocate a copy on the native side. If the Java on-heap memory is much higher than the native off-heap memory, then this does indeed look like a bug that we should be able to trace in the Scala code.

@alois-bissuel

Yes, this is exactly what I meant (regarding on-heap and off-heap). The off-heap usage stays very low (150 MB) whereas the on-heap usage increases to 7-8 GB. Are there any tests or some sort of profiling I could do?

@imatiach-msft
Contributor

The off-heap usage stays very low (150 MB) whereas the on-heap usage increases to 7-8 GB

@alois-bissuel that is odd; I would assume the off-heap usage would be at least 1 GB, if not more, since that is the size of the dataset. I'm not sure why it would be so low; 150 MB doesn't make sense to me.

I'm not sure how to profile on Spark; do good tools exist for profiling distributed clusters? For C# I've used the JetBrains and Redgate memory profilers a lot, for Python I've used cProfile a lot, and I've also used GNU gdb and WinDbg a lot in the past. I haven't used many Java memory profilers, and certainly haven't used one on a distributed cluster. Based on some quick searching, this one looks good: https://github.com/uber-common/jvm-profiler (a rough sketch of hooking it up follows below).
Ideally it would be nice to see which objects are holding so much memory and what is preventing them from being garbage collected.
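For what it's worth, a rough sketch of attaching that profiler as a Java agent to the executors (the jar path is a placeholder, and the exact reporter options are an assumption; check the project's README for the real ones):

import org.apache.spark.sql.SparkSession

// Sketch: load the profiler agent in every executor JVM. The jar is assumed to
// be available on the executor nodes at the given (placeholder) path.
val spark = SparkSession.builder()
  .config("spark.executor.extraJavaOptions", "-javaagent:/path/to/jvm-profiler.jar")
  .getOrCreate()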

@imatiach-msft
Contributor

Closing, as 2.2.2 has been merged to master; it should be in the next release: #391
