New release with LightGBM 2.2.1 #390

Release 2.2.1 of LightGBM (see microsoft/LightGBM#1727) allowed running it on older systems (most notably CentOS 7). Could you kindly consider releasing a new version of mmlspark incorporating this change?

Update: corrected the PR link.

Comments
@superbobry I think you meant a different link than #727?
@superbobry sure, I can work on releasing a newer version, but I first need to make sure it would indeed work, and I need to figure out how to test it. If I release it, would you be able to test it?
@imatiach-msft thanks, I've updated the link. I will be able to test the release once it's out, yes. By the way, thank you very much for the superb support! It is very rare to see replies to OSS issues within the same day, let alone the same hour :)
@superbobry sorry, would you be able to try out this build:
You need to specify our build repository, since I haven't published it to Maven Central yet; I want to verify it first. Hope it fixes the issue! If you have trouble, let me know and we could get on a Skype call together and debug the issue, I might be able to help.
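For anyone else trying the same build, here is a minimal sbt sketch of pulling an artifact from that custom repository. The group/artifact/version coordinates below are only inferred from the blob URL in the next comment; they are assumptions, not confirmed published coordinates.

```scala
// build.sbt sketch: pull the pre-release build from the custom repository.
// The group/artifact/version below are guesses inferred from the blob URL
// in the next comment, not confirmed published coordinates.
resolvers += "MMLSpark Azure blob repo" at "https://mmlspark.azureedge.net/maven"

libraryDependencies += "Azure" % "mmlspark_2.11" % "0.14.dev9+1.g5783ce91"
```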
@imatiach-msft are you sure the artifact is available in the repository?

```
$ curl https://mmlspark.azureedge.net/maven/Azure/mmlspark_2.11/0.14.dev9+1.g5783ce91/mmlspark_2.11-0.14.dev9+1.g5783ce91.pom
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:010da925-c01e-0047-2c0d-623285000000
Time:2018-10-12T09:28:28.2599081Z</Message></Error>
```
@superbobry
Thanks! @alois-bissuel, could you give it a try on Monday?
@alois-bissuel @superbobry please let me know if this resolves the issue or if you run into any other problems.
Yes, I am trying it now. The only trouble is that the repo I pull this custom version from (https://mmlspark.azureedge.net/maven/) doesn't have the full set of dependencies (lightGBM.jar at least), so it would streamline our work if it did. (I have managed to pull it from the repository listed in the pom you mentioned above, but it is not the most practical way.)
@alois-bissuel could you please explain that a bit more? It should have the dependency, based on this in the PR I built:
The dependency also comes from a maven repo I created from an Azure blob:
@alois-bissuel I can also publish to Maven Central if that is a problem, but I would prefer to do that only after you have validated that the update fixes the issue for you. I'm not actually sure it will; I think I actually need to publish the lightgbm shared object file with the glibc shared object dependency to resolve the glibc errors.
@imatiach-msft: no, it is OK, I eventually managed to pull all the dependencies, so do not bother publishing to Maven Central. I will test this tomorrow. Thanks for the quick answer, I will keep you posted!
I just managed to run with the version you provided. There is still a linking error: It seems that on CentOS 7 the latest version of GLIBCXX is 3.4.19! I don't know which version of GLIBCXX was targeted by Microsoft/LightGBM#1727.
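As a side note, one way to see which GLIBCXX symbol versions the system libstdc++ actually exposes is to inspect the shared object directly; a small sketch, where the /usr/lib64 path is the usual CentOS location and is an assumption:

```scala
import scala.sys.process._

// List the GLIBCXX symbol versions exported by the system C++ runtime.
// /usr/lib64/libstdc++.so.6 is the usual location on CentOS 7 (an assumption here);
// on a stock CentOS 7 box the highest version printed is typically GLIBCXX_3.4.19.
val glibcxxVersions =
  Seq("bash", "-c", "strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX").!!
println(glibcxxVersions)
```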
@alois-bissuel ≤2.25, see microsoft/LightGBM#1718 (comment).
A small update: the error above corresponds to a problem linking to libstdc++, as indicated in Microsoft/LightGBM#1708. There are no more linking problems with libc itself.
Thanks for all this really interesting work, on behalf of potential users with a Spark cluster on CentOS 7! I've tested the new jar and I have the same problem as @alois-bissuel.
Did you find a solution to this problem related to libstdc++?
And of course: I can help with some testing if required, @imatiach-msft!
It looks like I overcame the linking error by adding a more recent version of libstdc++.so.6 to the LD_LIBRARY_PATH (which I incidentally found in my miniconda installation, in case somebody asks). At least,
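For reference, the same workaround can be wired into the Spark configuration so that every executor picks up the newer libstdc++ automatically; a sketch, where the miniconda lib directory below is purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the workaround wired into the Spark config, so every executor
// picks up a newer libstdc++.so.6. The miniconda path is a placeholder;
// point it at wherever the newer runtime actually lives on the worker nodes.
val spark = SparkSession.builder()
  .appName("mmlspark-lightgbm")
  .config("spark.executorEnv.LD_LIBRARY_PATH", "/opt/miniconda3/lib:/usr/lib64")
  .getOrCreate()
```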
Coming back from the network error in #405, I now have a bad allocation error. Once again, I added an external libstdc++.so.6 to every executor's LD_LIBRARY_PATH. See the stacktrace:
@alois-bissuel it looks like you are running out of memory; would you be able to decrease the size of the dataset or increase the memory in your cluster?
@alois-bissuel please see the related explanation here: #406
@alois-bissuel copy-pasting for convenience: sorry, this is an issue with lightgbm - the dataset on each partition is replicated in native memory (so the native lightgbm code can run), so at minimum lightgbm takes about 2x the dataset size to train.
Thank you for the very quick answer!
@alois-bissuel is it the same bad-alloc error:
If so, I'm not really sure what else it could be. Searching online definitely suggests it is caused by OOM. One thing to double-check for sanity would be to print out the number of partitions of the dataset prior to training on lightgbm and make sure it is not 1 but some reasonable number. Also, if you sample down the dataset, say by 50%, do you still see the error? The error is definitely coming from one of the workers and not the driver, so I don't think increasing driver memory would help. Are you sure the spark.executor.memory or --executor-memory configuration is set to a reasonable size?
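A short Scala sketch of those sanity checks; the input path and DataFrame name are placeholders, not anything from this thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lightgbm-oom-check").getOrCreate()

// Placeholder input path; substitute the real training data.
val train = spark.read.parquet("/path/to/training/data")

// 1. Make sure the data is split across several partitions, not just one.
println(s"training partitions: ${train.rdd.getNumPartitions}")

// 2. Optionally repartition to roughly one task per available core.
val repartitioned = train.repartition(spark.sparkContext.defaultParallelism)

// 3. Try training on a 50% sample to see whether the bad_alloc goes away.
val half = train.sample(withReplacement = false, fraction = 0.5, seed = 42)
```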
I have checked it all: the number of partitions is reasonable (or maybe too high, but I also tried setting one partition per executor), and the memory settings look sensible. I did not try subsampling the dataset, as it is already quite small (about a GB for ten executors, each with three dozen GB of RAM). When looking at the profiling tools we have at our company, it seems that the executors are consuming a lot of heap, but very little off-heap. Is that expected, given that LightGBM should allocate off-heap?
@alois-bissuel hmm, just to make sure I understand correctly: by off-heap do you mean unmanaged memory (e.g. the native C/C++ allocations, NOT the memory on the stack in the process), and by on-heap do you mean the Java managed heap? If I understand correctly, then they should be about the same, since we take the memory from the Java side and allocate it on the native side. If the Java on-heap memory is much higher than the native off-heap memory, then indeed this seems like a bug that we should be able to trace in the Scala code.
Yes, this is exactly what I meant (regarding on-heap and off-heap). The off-heap usage stays very low (150 MB) whereas the on-heap usage increases to 7-8 GB. Are there any tests or some sort of profiling I could do?
@alois-bissuel that is odd, I would assume the off-heap usage would be at least 1 GB, since that is the size of the dataset, if not more. I'm not sure why it would be so low; 150 MB doesn't make sense to me. I'm also not sure how to profile on Spark; do good tools exist for profiling distributed clusters? For C# I've used the JetBrains and Redgate memory profilers a lot, for Python I've used cProfile a lot, and I've also used GNU gdb and WinDbg a lot in the past. I haven't used many Java memory profilers, and certainly haven't used one on a distributed cluster. This one looks good based on some simple searching: https://github.com/uber-common/jvm-profiler
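For what it's worth, a rough sketch of how such a profiler agent could be attached to the executors through standard Spark configuration; the jar path is a placeholder, and any agent-specific options would come from the profiler's own documentation:

```scala
import org.apache.spark.sql.SparkSession

// Rough sketch: ship a JVM profiler jar (e.g. uber-common/jvm-profiler) to the
// executors and attach it as a -javaagent, so per-executor heap vs. native memory
// can be sampled. The jar path is a placeholder, and any agent-specific options
// would go after the jar name as described in the profiler's documentation.
val spark = SparkSession.builder()
  .appName("lightgbm-memory-profiling")
  .config("spark.jars", "/path/to/jvm-profiler.jar")
  .config("spark.executor.extraJavaOptions", "-javaagent:jvm-profiler.jar")
  .getOrCreate()
```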
Closing, as 2.2.2 has been merged to master and should be in the next release: #391