SynapseML version
1.07
System information
Language version (e.g. python 3.8, scala 2.12): 3.8
Spark Version (e.g. 3.2.3): 3.3.2
Spark Platform (e.g. Synapse, Databricks):
Describe the problem
I am training a LightGBMClassifier, and the executor JVM always crashes with a SIGSEGV in native LightGBM code when spark.executor.instances > 1, but training completes fine when spark.executor.instances = 1. Can anyone help me with this issue?
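For reference, this is roughly how the executor count is toggled between the failing and working runs; the resource values below are placeholders, not the actual settings of the failing job:
from pyspark.sql import SparkSession

# Sketch of the session configuration only; executor cores/memory here are
# placeholders and not the values used in the failing job.
spark = (
    SparkSession.builder
    .appName("lightgbm-crash-repro")
    .config("spark.executor.instances", "2")  # any value > 1 crashes; "1" trains fine
    .config("spark.executor.cores", "4")      # placeholder
    .config("spark.executor.memory", "8g")    # placeholder
    .getOrCreate()
)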
Code to reproduce issue
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType, IntegerType
from synapse.ml.lightgbm import LightGBMClassifier

# Cast the feature columns to double and the label column to integer, then assemble
for col in vecCols:
    train = train.withColumn(col, train[col].cast(DoubleType()))
train = train.withColumn(labelCol, train[labelCol].cast(IntegerType()))
assembler = VectorAssembler(inputCols=vecCols, outputCol="features", handleInvalid="keep")
pipeline = Pipeline(stages=[assembler])
train = pipeline.fit(train).transform(train)
classifier = LightGBMClassifier(featuresCol="features", categoricalSlotNames=cateCols, featuresShapCol="importances", labelCol=labelCol, verbosity=10, executionMode="streaming", useSingleDatasetMode=True)
model = classifier.fit(train)  # the SIGSEGV below happens during this fit when more than one executor is used
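For clarity, vecCols, cateCols, and labelCol above are plain column-name lists/strings; the categorical columns are included in the assembled features vector and referenced by slot name. Their real values are not part of this report, so the following is only a hypothetical illustration of their shape:
# Hypothetical column lists for illustration only; the real schema is not included in this report.
vecCols = ["f0", "f1", "f2", "cat_a", "cat_b"]  # all columns assembled into "features"
cateCols = ["cat_a", "cat_b"]                   # subset of vecCols treated as categorical by LightGBM
labelCol = "label"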
Other info / logs
24/10/15 14:26:29 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 2146 records.
24/10/15 14:26:29 INFO InternalParquetRecordReader: at row 0. reading next block
24/10/15 14:26:29 INFO InternalParquetRecordReader: block read in memory in 2 ms. row count = 2146
24/10/15 14:26:30 INFO StreamingPartitionTask: done with data preparation on partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Helper task 29, partition 1 finished processing rows
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Done with cleanup for partition 1, task 29
24/10/15 14:26:30 INFO StreamingPartitionTask: Getting final training Dataset for partition 5.
24/10/15 14:26:30 INFO Executor: Finished task 1.0 in stage 8.0 (TID 29). 1789 bytes result sent to driver
24/10/15 14:26:30 INFO StreamingPartitionTask: Creating LightGBM Booster for partition 5, task 33
24/10/15 14:26:30 INFO StreamingPartitionTask: Beginning training on LightGBM Booster for task 33, partition 5
24/10/15 14:26:30 INFO StreamingPartitionTask: LightGBM task starting iteration 0
[LightGBM] [Info] Number of positive: 61700, number of negative: 1063514
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.844580
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.479102
[LightGBM] [Debug] init for col-wise cost 0.115891 seconds, init for row-wise cost 0.390814 seconds
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.161628 seconds.
You can set force_row_wise=true to remove the overhead.
And if memory is not enough, you can set force_col_wise=true.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 10088
[LightGBM] [Info] Number of data points in the train set: 580970, number of used features: 83
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.054834 -> initscore=-2.847050
[LightGBM] [Info] Start training from score -2.847050
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007fd300c484db, pid=3104731, tid=0x00007fd30bc28700
JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 )
Problematic frame:
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
An error report file with more information is saved as:
/data/disk03/hadoop/yarn/local/usercache/appcache/application_1715260252878_48344/container_e25_1715260252878_48344_02_000003/hs_err_pid3104731.log
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
What component(s) does this bug affect?
- area/cognitive: Cognitive project
- area/core: Core project
- area/deep-learning: DeepLearning project
- area/lightgbm: Lightgbm project
- area/opencv: Opencv project
- area/vw: VW project
- area/website: Website
- area/build: Project build system
- area/notebooks: Samples under notebooks folder
- area/docker: Docker usage
- area/models: models related issue
What language(s) does this bug affect?
- language/scala: Scala source code
- language/python: Pyspark APIs
- language/r: R APIs
- language/csharp: .NET APIs
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/synapse: Azure Synapse integrations
- integrations/azureml: Azure ML integrations
- integrations/databricks: Databricks integrations