Dask fit task just crashes #8341

Closed

dpdrmj opened this issue Nov 13, 2023 · 2 comments

dpdrmj commented Nov 13, 2023

Describe the issue:
Whenever I run this code, the Dask job crashes, all the workers get lost, and then the task hangs forever. If I provide small files (<100 MB), the same code works fine. I'm not sure what the issue is; the error is pasted below in the "Anything else we need to know?" section.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import lightgbm as lgb
from dask.distributed import Client
from dask_yarn import YarnCluster

cluster = YarnCluster(
    environment="venv:////PATH_TO_PYENV/dask-pyenv",
    worker_vcores=2,
    worker_memory="16GiB",
    n_workers=2,
    worker_env={
        'CLASSPATH': 'PATH_TO_HADOOP/lib/native/libhdfs.so',
        'ARROW_LIBHDFS_DIR': 'PATH_TO_HADOOP/lib/native',
    },
    deploy_mode='local',
    name='test_dask',
)

client = Client(cluster.scheduler_address)
dask_model = lgb.DaskLGBMClassifier(
        client=client,
        boosting_type='gbdt',
        objective='binary',
        metric='binary_logloss,auc',
        max_bin=255,
        header=True,
        num_trees=100,
        max_depth=7,
        learning_rate=0.05,
        num_leaves=63,
        tree_learner='data',
        feature_fraction=0.8,
        bagging_freq=5,
        bagging_fraction=0.8,
        min_data_in_leaf=20,
        min_sum_hessian_in_leaf=5.0,
        lambda_l1=3,
        lambda_l2=100,
        cat_l2=20,
        cat_smooth=25,
        is_enable_sparse=True,
        use_two_round_loading=False,
        verbose=2,
        label_column='name:my_label_col',
)

# path_list: list of HDFS CSV paths (defined elsewhere in the user's code)
train_ddf = dd.read_csv(
    path_list, delimiter=',', encoding='utf-8',
    storage_options={'driver': 'libhdfs3'},
).dropna().drop('auction_id', axis=1)

train_features = train_ddf.drop('my_label_col', axis=1)
train_target = train_ddf[['my_label_col']]
dask_model.fit(train_features, train_target,
               categorical_feature=['feature1', 'feature2', ..., 'featuren'])
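
As a point of comparison (this sketch is not from the original report): since small inputs reportedly succeed, one way to narrow the failure threshold is to fit on progressively larger samples of the same DataFrame. This reuses train_ddf and dask_model from the MCVE; dask.dataframe.DataFrame.sample is the only call not already used above:

# Sketch only (not from the original report): small inputs reportedly
# train fine, so fit on increasing fractions of the same data to find
# the size at which the workers start getting lost.
for frac in (0.01, 0.1, 0.5):
    sample = train_ddf.sample(frac=frac, random_state=0)
    dask_model.fit(
        sample.drop('my_label_col', axis=1),
        sample[['my_label_col']],
    )
    print(f"fit succeeded at frac={frac}")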


Anything else we need to know?:
Here is the error log that I see:

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.934990
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.372672
[LightGBM] [Debug] init for col-wise cost 0.708685 seconds, init for row-wise cost 1.673264 seconds
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.912160 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 27836
[LightGBM] [Info] Number of data points in the train set: 4750592, number of used features: 49
[LightGBM] [Debug] Use subset for bagging
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000239 -> initscore=-8.340142
[LightGBM] [Info] Start training from score -8.340142
[LightGBM] [Debug] Re-bagging, using 3801989 data to train
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb4d4013756, pid=944460, tid=0x00007fb55da09640
#
# JRE version: OpenJDK Runtime Environment (8.0_382-b05) (build 1.8.0_382-8u382-ga-1~22.04.1-b05)
# Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [lib_lightgbm.so+0x413756]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xe16
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# PATH_TO_APP_CACHE/appcache/application_1697132938548_0291/container_1697132938548_0291_01_000003/hs_err_pid944460.log
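
Aside (not from the original report): the log above itself suggests a way to skip the row-wise/column-wise auto-test. A minimal sketch of pinning that choice on the same estimator, with force_row_wise=True matching the mode the log auto-chose, or force_col_wise=True, which the log recommends when memory is tight:

# Sketch only: pin LightGBM's histogram-construction mode so the
# auto-test reported in the log is skipped. force_row_wise matches the
# mode the log auto-chose; the log suggests force_col_wise instead
# "if memory is not enough". Remaining parameters as in the MCVE.
dask_model = lgb.DaskLGBMClassifier(
    client=client,
    objective='binary',
    force_row_wise=True,   # or force_col_wise=True when memory-bound
    # ... other parameters unchanged from the MCVE ...
)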

Environment:

  • Dask version: 2023.10.0
  • Python version: 3.10.12
  • Operating System:
  • Install method (conda, pip, source): pip

All the dependencies (Package / Version):

asttokens 2.4.1
bokeh 3.3.0
cffi 1.16.0
click 8.1.7
cloudpickle 3.0.0
comm 0.2.0
contourpy 1.1.1
cryptography 41.0.5
dask 2023.10.0
dask-yarn 0.9+2.g8eed5e2
debugpy 1.8.0
decorator 5.1.1
distributed 2023.10.0
exceptiongroup 1.1.3
executing 2.0.1
fsspec 2023.10.0
grpcio 1.59.0
importlib-metadata 6.8.0
ipython 8.17.2
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
jupyter_client 8.6.0
jupyter_core 5.5.0
lightgbm 4.1.0
locket 1.0.0
lz4 4.3.2
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
msgpack 1.0.7
nest-asyncio 1.5.8
numpy 1.26.1
packaging 23.2
pandas 2.1.1
parso 0.8.3
partd 1.4.1
pexpect 4.8.0
Pillow 10.1.0
pip 22.0.2
platformdirs 4.0.0
prompt-toolkit 3.0.40
protobuf 4.24.4
psutil 5.9.6
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 13.0.0
pycparser 2.21
Pygments 2.16.1
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
scikit-learn 1.3.2
scipy 1.11.3
setuptools 59.6.0
six 1.16.0
skein 0.8.2
sortedcontainers 2.4.0
stack-data 0.6.3
tblib 3.0.0
threadpoolctl 3.2.0
toolz 0.12.0
tornado 6.3.3
traitlets 5.13.0
tzdata 2023.3
urllib3 2.0.7
wcwidth 0.2.9
xyzservices 2023.10.0
zict 3.0.0
zipp 3.17.0

@jacobtomlinson (Member) commented:

I think you probably need to raise this on LightGBM rather than Dask.

@hendrikmakait (Member) commented:

It looks like this problem is being tackled at microsoft/LightGBM#6196, so I will close this issue. Please reopen if anything is left that needs to be done on the distributed side.
