LightGBM on Spark Bad Allocation memory error #483
Comments
@debadridtt sorry to hear about the trouble you are having. If the dataset is not confidential, would you be able to share the dataset and a code snippet that reproduces the error? I can try to debug and take a look into it. |
Hi, can you please go through the code I have posted on Stack Exchange: https://datascience.stackexchange.com/questions/45144/pyspark-v-pandas-dataframe-memory-issue |
I'm running PySpark v2.2.0 |
@imatiach-msft @debadridtt I get Py4JJavaError: An error occurred while calling o1248.fit. My Spark version is 2.3.2, and I am running PySpark locally on Windows 10. I pip installed mmlspark-0.15.dev2+1.g11ad24d-py2.py3-none-any.whl and launched pyspark with --packages Azure:mmlspark:0.16. I used spark.ml.classification.LogisticRegression for comparison and it works fine, but LightGBM gets stuck. My dataset is only 300k. |
@longyunshen |
@imatiach-msft
import findspark
from mmlspark import LightGBMRegressor
lgb = LightGBMRegressor(alpha=0.3,learningRate=0.3,numIterations=100,numLeaves=31)
import pyspark.sql.types as typ
labels=[
schema=typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
import pyspark.ml.feature as ft
births=births
#import logistic regression for compare
from pyspark.ml import Pipeline
births_train, births_test=births
import pyspark.ml.evaluation as ev
from mmlspark import LightGBMRegressor
from pyspark.ml import Pipeline
model = pipeline.fit(births_train) # IT IS STUCK HERE!!!!!!!
import pyspark.ml.evaluation as ev |
@longyunshen I see, it sounds like there is some error in the native code then. Is the dataset you are using confidential? Wondering if I can reproduce this issue locally. |
@imatiach-msft we are testing our Recommenders repo on a Windows DSVM and we see a similar error. FYI, the notebook works well on a Linux DSVM. To reproduce the error, please run staging/notebooks/02_model/mmlspark_lightgbm_criteo.ipynb on Windows. |
the bug on windows should be fixed now on latest master (available with next release) |
I'm trying to run lightgbm on a small dataset. I'm using this command to launch my notebook: pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.15.dev2+1.g11ad24d --repositories https://mmlspark.azureedge.net/maven
I'm also trying other linear and bagging algorithms, such as Logistic Regression and Random Forest from the PySpark module, and they seem to run fine. But sometimes I get a bad_alloc memory error when I run LightGBM on the same dataset. It happens intermittently, not every time: if I execute the cell 3 times, I might get the memory error on the second run. The dataset is also very small, ~2000 rows in the .csv file. What could the problem be? I don't even notice significant changes in memory usage in my Resource Monitor.
P.S. I'm using Windows 10
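Because the error shows up only on some runs, it may help to repeat the fit in a loop and record which attempts fail, so the frequency and the full traceback can be attached to the report. The following is a minimal sketch in plain Python, not anything provided by mmlspark. `repeat_fit` and `flaky_fit` are hypothetical names, and `flaky_fit` only simulates a failure on the second call; in a real session it would be replaced by something like `lambda: pipeline.fit(births_train)`. It also assumes the failure surfaces as a Python exception rather than crashing the JVM process outright.

```python
import traceback
from typing import Callable, List, Optional


def repeat_fit(fit_once: Callable[[], object], attempts: int = 3) -> List[Optional[str]]:
    """Call fit_once `attempts` times and return one entry per attempt.

    An entry is None if the attempt succeeded, or the formatted
    traceback string if it raised an exception.
    """
    results: List[Optional[str]] = []
    for _ in range(attempts):
        try:
            fit_once()
            results.append(None)
        except Exception:
            # Keep the full traceback so it can be pasted into the issue.
            results.append(traceback.format_exc())
    return results


# Hypothetical stand-in for the real training call: it fails on the
# second call only, imitating the intermittent behaviour described above.
_calls = {"n": 0}


def flaky_fit() -> str:
    _calls["n"] += 1
    if _calls["n"] == 2:
        raise MemoryError("simulated allocation failure")
    return "model"


outcomes = repeat_fit(flaky_fit, attempts=3)
failed = [i + 1 for i, tb in enumerate(outcomes) if tb is not None]
print("failed attempts:", failed)
```

With the simulated function, only attempt 2 is reported as failed. Against a real pipeline, the recorded tracebacks would show whether every failure comes from the same call.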