
LightGBM on Spark Bad Allocation memory error #483

Closed
debadridtt opened this issue Feb 5, 2019 · 9 comments
Labels
area/lightgbm, bug, high priority (high priority issues must be fixed as soon as possible)

Comments

@debadridtt

debadridtt commented Feb 5, 2019

I'm trying to run LightGBM on a small dataset. I'm using this command to launch my notebook: `pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.15.dev2+1.g11ad24d --repositories https://mmlspark.azureedge.net/maven`

I'm also trying other linear and bagging algorithms, like logistic regression and random forest from the PySpark module, and they run fine, but I sometimes get a bad_alloc memory error when I run LightGBM on the same dataset. It happens intermittently, not every time: if I execute the cell three times, say, I might get a memory error on the second run. The dataset is also very small, ~2000 rows in the .csv file.
What could the problem be? I don't even notice significant changes in memory usage in Resource Monitor.

P.S. I'm using Windows 10

debadridtt changed the title from "LightGBM on Sparm Bad Allocation memory error" to "LightGBM on Spark Bad Allocation memory error" on Feb 5, 2019
@imatiach-msft
Contributor

@debadridtt sorry to hear about the trouble you're having. If the dataset is not confidential, would you be able to share it along with a code snippet that reproduces the error? I can take a look and try to debug it.

@debadridtt
Author

> @debadridtt sorry to hear about the trouble you're having. If the dataset is not confidential, would you be able to share it along with a code snippet that reproduces the error? I can take a look and try to debug it.

Hi, can you please go through the code I have posted on Stack Exchange: https://datascience.stackexchange.com/questions/45144/pyspark-v-pandas-dataframe-memory-issue

@debadridtt
Author

> @debadridtt sorry to hear about the trouble you're having. If the dataset is not confidential, would you be able to share it along with a code snippet that reproduces the error? I can take a look and try to debug it.

I'm running PySpark v2.2.0
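For anyone comparing environments, a quick generic way to confirm which versions are actually in play (standard PySpark API, not something from this thread):

```python
import pyspark

print(pyspark.__version__)  # version of the installed PySpark package

# From an active session, the underlying Spark runtime version:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark.version)
```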

@longyunshen

longyunshen commented Mar 28, 2019

@imatiach-msft @debadridtt
Is this problem solved? I have exactly the same issue. The log follows.

```
Py4JJavaError: An error occurred while calling o1248.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 93.0 failed 1 times, most recent failure: Lost task 0.0 in stage 93.0 (TID 132, localhost, executor driver): java.lang.Exception: Dataset create call failed in LightGBM with error: bad allocation
	at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:29)
	at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:380)
	at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:62)
	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:219)
	at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$3.apply(LightGBMRegressor.scala:90)
	at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$3.apply(LightGBMRegressor.scala:90)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
```

My Spark is 2.3.2, running locally on Windows 10 with PySpark. I pip-installed mmlspark-0.15.dev2+1.g11ad24d-py2.py3-none-any.whl and launched pyspark with --packages Azure:mmlspark:0.16. I used spark.ml.classification.LogisticRegression for comparison and it runs fine, but it gets stuck at LightGBM. My dataset is only 300k.

@imatiach-msft
Contributor

@longyunshen
I think we may have failed to create the dataset due to an out-of-memory error:
https://github.com/Azure/mmlspark/blob/0b84a230d1556ced87be9139dd798237711c1158/src/lightgbm/src/main/scala/LightGBMUtils.scala#L343
How large is your cluster? At 300k rows * (assuming) 1000 cols * 8 bytes per col, plus some additional data, that would be around 3 GB total in memory, which doesn't seem like a lot for Spark but may be enough to go out of memory on a local machine. Have you tried downsampling the data?
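For reference, that back-of-envelope estimate works out as follows (a quick sketch; the 1000-column count is the assumption from the comment above, not a figure reported in this issue):

```python
# Rough size of a dense in-memory copy of the dataset (illustrative only).
rows = 300_000        # row count from the report above
cols = 1_000          # assumed column count
bytes_per_value = 8   # one 8-byte double per cell

dense_bytes = rows * cols * bytes_per_value
print(f"~{dense_bytes / 1e9:.1f} GB before overhead")  # ~2.4 GB; bookkeeping pushes it toward ~3 GB
```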

@longyunshen

@imatiach-msft
There is no cluster; it's just my local Windows 10 machine. My dataset is 300 KB as a .gz file, around 1.5 MB extracted: 45,000 rows by 17 columns. So I think it has nothing to do with the dataset itself. The following is the code I ran in Spyder on Windows.

```python
import findspark
findspark.init()
import pyspark

spark = (pyspark.sql.SparkSession.builder.appName("MyApp")
         .config("spark.jars.packages", "Azure:mmlspark:0.16")
         .getOrCreate())

from mmlspark import LightGBMRegressor

lgb = LightGBMRegressor(alpha=0.3, learningRate=0.3, numIterations=100, numLeaves=31)

import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.IntegerType()),
    ('DIABETES_GEST', typ.IntegerType()),
    ('HYP_TENS_PRE', typ.IntegerType()),
    ('HYP_TENS_GEST', typ.IntegerType()),
    ('PREV_BIRTH_PRETERM', typ.IntegerType())
]

schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
births = spark.read.csv('births_transformed.csv.gz',
                        header=True,
                        schema=schema)

import pyspark.ml.feature as ft

births = births.withColumn('BIRTH_PLACE_INT',
                           births['BIRTH_PLACE'].cast(typ.IntegerType()))
encoder = ft.OneHotEncoder(
    inputCol='BIRTH_PLACE_INT',
    outputCol='BIRTH_PLACE_VEC')
featuresCreator = ft.VectorAssembler(
    inputCols=[col[0] for col in labels[2:]] +
              [encoder.getOutputCol()],
    outputCol='features'
)

# Import logistic regression for comparison
import pyspark.ml.classification as cl
logistic = cl.LogisticRegression(
    maxIter=10,
    regParam=0.01,
    labelCol='INFANT_ALIVE_AT_REPORT')

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
    encoder,
    featuresCreator,
    logistic
])

births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)
model = pipeline.fit(births_train)
test_model = model.transform(births_test)

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability',
    labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test_model,
                         {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model,
                         {evaluator.metricName: 'areaUnderPR'}))

from mmlspark import LightGBMRegressor
lgb = LightGBMRegressor(alpha=0.3, learningRate=0.3, numIterations=100,
                        numLeaves=31, labelCol='INFANT_ALIVE_AT_REPORT')

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
    encoder,
    featuresCreator,
    lgb])

model = pipeline.fit(births_train)  # IT IS STUCK HERE!!!!!!!
test_model = model.transform(births_test)

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability',  # note: a regressor outputs 'prediction', not 'probability'
    labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test_model,
                         {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model,
                         {evaluator.metricName: 'areaUnderPR'}))
```
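One thing that may be worth trying in a local setup like this (my assumption, not a fix confirmed in this thread): in local mode the driver JVM and LightGBM's native allocations live in the same process, so giving the driver more memory when building the session can relieve the pressure. A minimal sketch, with an illustrative "8g" value:

```python
import findspark
findspark.init()
import pyspark

# Same session as above, but with a larger driver heap. spark.driver.memory
# must be set before the JVM starts, so it goes in the builder config
# (it cannot be changed on an already-running session).
spark = (pyspark.sql.SparkSession.builder.appName("MyApp")
         .config("spark.jars.packages", "Azure:mmlspark:0.16")
         .config("spark.driver.memory", "8g")  # illustrative; size to your machine
         .getOrCreate())
```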

@imatiach-msft
Contributor

@longyunshen I see, it sounds like there is some error in the native code then. Is the dataset you are using confidential? I'm wondering whether I can reproduce this issue locally.

@loomlike
Contributor

loomlike commented Apr 9, 2019

@imatiach-msft we are testing our Recommenders repo on Windows DSVM and we see a similar error. FYI, the notebook works fine on Linux DSVM.

To reproduce the error, please run `staging/notebooks/02_model/mmlspark_lightgbm_criteo.ipynb` on Windows.

```
java.lang.Exception: Dataset create call failed in LightGBM with error: bad allocation
	at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:29)
	at com.microsoft.ml.spark.LightGBMUtils$.generateSparseDataset(LightGBMUtils.scala:380)
	at com.microsoft.ml.spark.TrainUtils$.translate(TrainUtils.scala:62)
	at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:219)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
	at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:188)
	at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:185)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

@imatiach-msft
Contributor

The bug on Windows should be fixed now on latest master (available with the next release).
