Dask fit task just crashes #8341

Closed

dpdrmj opened this issue Nov 13, 2023 · 2 comments

dpdrmj commented Nov 13, 2023

Describe the issue:
Whenever I run this code, the Dask job crashes, all the workers get lost, and then the task hangs forever. If I provide small files (<100 MB), the same code works fine. I'm not sure what the issue is; the error is pasted below in the "Anything else we need to know?" section.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import lightgbm as lgb
from dask.distributed import Client
from dask_yarn import YarnCluster

cluster = YarnCluster(
    environment="venv:////PATH_TO_PYENV/dask-pyenv",
    worker_vcores=2,
    worker_memory="16GiB",
    n_workers=2,
    worker_env={
        'CLASSPATH': 'PATH_TO_HADOOP/lib/native/libhdfs.so',
        'ARROW_LIBHDFS_DIR': 'PATH_TO_HADOOP/lib/native',
    },
    deploy_mode='local',
    name='test_dask',
)

client = Client(cluster.scheduler_address)
dask_model = lgb.DaskLGBMClassifier(
        client=client,
        boosting_type='gbdt',
        objective='binary',
        metric='binary_logloss,auc',
        max_bin=255,
        header=True,
        num_trees=100,
        max_depth=7,
        learning_rate=0.05,
        num_leaves=63,
        tree_learner='data',
        feature_fraction=0.8,
        bagging_freq=5,
        bagging_fraction=0.8,
        min_data_in_leaf=20,
        min_sum_hessian_in_leaf=5.0,
        lambda_l1=3,
        lambda_l2=100,
        cat_l2=20,
        cat_smooth=25,
        is_enable_sparse=True,
        use_two_round_loading=False,
        verbose=2,
        label_column='name:my_label_col',
)

# path_list: list of HDFS CSV paths (defined elsewhere in the user's code)
train_ddf = dd.read_csv(
    path_list, delimiter=',', encoding='utf-8',
    storage_options={'driver': 'libhdfs3'},
).dropna().drop('auction_id', axis=1)

train_features = train_ddf.drop('my_label_col', axis=1)
train_target = train_ddf[['my_label_col']]
dask_model.fit(train_features, train_target,
               categorical_feature=['feature1', 'feature2', ..., 'featuren'])
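
As a point of comparison (this sketch is not from the original report): since small inputs reportedly succeed, one way to narrow the failure threshold is to fit on progressively larger samples of the same DataFrame. This reuses train_ddf and dask_model from the MCVE; dask.dataframe.DataFrame.sample is the only call not already used above:

# Sketch only (not from the original report): small inputs reportedly
# train fine, so fit on increasing fractions of the same data to find
# the size at which the workers start getting lost.
for frac in (0.01, 0.1, 0.5):
    sample = train_ddf.sample(frac=frac, random_state=0)
    dask_model.fit(
        sample.drop('my_label_col', axis=1),
        sample[['my_label_col']],
    )
    print(f"fit succeeded at frac={frac}")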


Anything else we need to know?:
Here is the error log that I see:

[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.934990
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.372672
[LightGBM] [Debug] init for col-wise cost 0.708685 seconds, init for row-wise cost 1.673264 seconds
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.912160 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Debug] Using Sparse Multi-Val Bin
[LightGBM] [Info] Total Bins 27836
[LightGBM] [Info] Number of data points in the train set: 4750592, number of used features: 49
[LightGBM] [Debug] Use subset for bagging
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.000239 -> initscore=-8.340142
[LightGBM] [Info] Start training from score -8.340142
[LightGBM] [Debug] Re-bagging, using 3801989 data to train
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fb4d4013756, pid=944460, tid=0x00007fb55da09640
#
# JRE version: OpenJDK Runtime Environment (8.0_382-b05) (build 1.8.0_382-8u382-ga-1~22.04.1-b05)
# Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [lib_lightgbm.so+0x413756]  LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xe16
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# PATH_TO_APP_CACHE/appcache/application_1697132938548_0291/container_1697132938548_0291_01_000003/hs_err_pid944460.log
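
Aside (not from the original report): the log above itself suggests a way to skip the row-wise/column-wise auto-test. A minimal sketch of pinning that choice on the same estimator, with force_row_wise=True matching the mode the log auto-chose, or force_col_wise=True, which the log recommends when memory is tight:

# Sketch only: pin LightGBM's histogram-construction mode so the
# auto-test reported in the log is skipped. force_row_wise matches the
# mode the log auto-chose; the log suggests force_col_wise instead
# "if memory is not enough". Remaining parameters as in the MCVE.
dask_model = lgb.DaskLGBMClassifier(
    client=client,
    objective='binary',
    force_row_wise=True,   # or force_col_wise=True when memory-bound
    # ... other parameters unchanged from the MCVE ...
)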

Environment:

  • Dask version: 2023.10.0
  • Python version: 3.10.12
  • Operating System:
  • Install method (conda, pip, source): pip

All the dependencies (Package / Version):

asttokens 2.4.1
bokeh 3.3.0
cffi 1.16.0
click 8.1.7
cloudpickle 3.0.0
comm 0.2.0
contourpy 1.1.1
cryptography 41.0.5
dask 2023.10.0
dask-yarn 0.9+2.g8eed5e2
debugpy 1.8.0
decorator 5.1.1
distributed 2023.10.0
exceptiongroup 1.1.3
executing 2.0.1
fsspec 2023.10.0
grpcio 1.59.0
importlib-metadata 6.8.0
ipython 8.17.2
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
jupyter_client 8.6.0
jupyter_core 5.5.0
lightgbm 4.1.0
locket 1.0.0
lz4 4.3.2
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
msgpack 1.0.7
nest-asyncio 1.5.8
numpy 1.26.1
packaging 23.2
pandas 2.1.1
parso 0.8.3
partd 1.4.1
pexpect 4.8.0
Pillow 10.1.0
pip 22.0.2
platformdirs 4.0.0
prompt-toolkit 3.0.40
protobuf 4.24.4
psutil 5.9.6
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 13.0.0
pycparser 2.21
Pygments 2.16.1
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.1
scikit-learn 1.3.2
scipy 1.11.3
setuptools 59.6.0
six 1.16.0
skein 0.8.2
sortedcontainers 2.4.0
stack-data 0.6.3
tblib 3.0.0
threadpoolctl 3.2.0
toolz 0.12.0
tornado 6.3.3
traitlets 5.13.0
tzdata 2023.3
urllib3 2.0.7
wcwidth 0.2.9
xyzservices 2023.10.0
zict 3.0.0
zipp 3.17.0

@jacobtomlinson (Member) commented:

I think you probably need to raise this on LightGBM rather than Dask.

@hendrikmakait (Member) commented:

It looks like this problem is being tackled at microsoft/LightGBM#6196, so I will close this issue. Please reopen if anything is left that needs to be done on the distributed side.
