[Bug]: RAY error #129

simplew2011 · 2023-12-12T03:27:53Z

Before Reporting

I have pulled the latest code from the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

from source

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

Operator: language_id_score_filter

With --executor_type ray, the run fails with an error.
With --executor_type default, the run works normally.

To Reproduce

python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

Configs

project_name: 'demo-process'
dataset_path: 'demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'

use_cache: false
save_stats_in_one_file: true

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

outputs.zip

Screenshots

Additional

No response

simplew2011 · 2023-12-12T03:28:14Z

#107

zhijianma · 2023-12-12T05:26:02Z

Please use an absolute path for export_path for now. When saving results, Ray currently cannot write to a relative path.
We will improve this later.
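As a stopgap until relative paths are supported, the export path can be resolved to an absolute one before it goes into the config. The following is a minimal sketch using only the Python standard library; the helper name `to_absolute_export_path` is hypothetical and is not part of Data-Juicer.

```python
import os


def to_absolute_export_path(export_path: str) -> str:
    """Return export_path as an absolute, normalized path.

    Relative paths are resolved against the current working directory,
    and a leading "~" is expanded to the user's home directory.
    """
    return os.path.abspath(os.path.expanduser(export_path))


# Example: resolve the relative path from the original config.
resolved = to_absolute_export_path("./outputs/demo-process/demo-processed.jsonl")
print(resolved)
```

The printed value depends on the directory the script is run from, so it should be generated on the machine where the job is launched and then pasted into `export_path`.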

simplew2011 · 2023-12-12T09:39:40Z

After setting export_path to an absolute path, the case above works.
However, another operator, chinese_convert_mapper, fails in ray mode, while default mode works:
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: '/home/wzp/code/LLMData/open_source/data-juicer/demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: '/home/wzp/code/LLMData/open_source/data-juicer/outputs/demo-process/demo-processed.jsonl'

# use_cache: false
# save_stats_in_one_file: true

# process schedule
# a list of several process operators with their arguments
process:
  # - language_id_score_filter:
  #     lang: 'zh'
  # - alphanumeric_filter:
  - chinese_convert_mapper:
      mode: 's2t'

NameError: name 'OPENCC_CONVERTER' is not defined

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
    self._queue.put(str_record)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(NameError)'>: attribute lookup RayTaskError(NameError) on ray.exceptions failed
--- End of logging error ---

outputs2.zip

simplew2011 · 2023-12-12T11:50:42Z

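Third bug:

With language_id_score_filter enabled:
If the number of rows in data-juicer/demos/process_on_ray/data/demo-dataset.json exceeds cpu*2,
Ray splits the data into blocks and runs inference on different CPUs, and the results for the extra rows are wrong: adjacent texts in different languages get the same result.
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
The config and data for this issue are in large_json.zip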

2023-12-12 19:40:46,412 INFO plan.py:757 -- Using autodetected parallelism=192 for stage ReadJSON to satisfy parallelism at least twice the available number of CPUs (96).
2023-12-12 19:40:46,413 INFO plan.py:762 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.

 - language_id_score_filter:
     lang: 'zh'

{"text":"欢迎来到阿里巴巴！","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}
{"text":"This paper proposed a novel method on LLM pretraining.","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}

zhijianma · 2023-12-12T13:39:10Z

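chinese_convert_mapper does currently have a problem. The main cause is that the global model context cannot be passed to the other Ray processes, which is similar to the earlier get_model situation.
We are optimizing this part and will keep adding test cases.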

zhijianma · 2023-12-12T13:39:33Z

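We will reproduce and investigate this issue (the third bug, where language_id_score_filter returns identical results for adjacent texts in different languages).
This issue has been fixed. For details, see PR #173 and RAY #42190.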
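To check whether an exported file still shows the symptom from the third bug (adjacent records with different text but identical `__dj__stats__`), a small script can scan the JSONL output. This is a rough diagnostic sketch, not part of Data-Juicer; the function name `find_suspicious_neighbors` is hypothetical, and identical stats on neighboring rows can occasionally be legitimate, so hits are only candidates for manual review.

```python
import json
from typing import Iterable, List, Tuple


def find_suspicious_neighbors(
    lines: Iterable[str], stats_key: str = "__dj__stats__"
) -> List[Tuple[int, int]]:
    """Return index pairs (i, i + 1) of adjacent records whose text
    differs but whose stats dictionaries are identical."""
    records = [json.loads(line) for line in lines if line.strip()]
    suspicious = []
    for i in range(len(records) - 1):
        a, b = records[i], records[i + 1]
        stats_a, stats_b = a.get(stats_key), b.get(stats_key)
        if stats_a is not None and a["text"] != b["text"] and stats_a == stats_b:
            suspicious.append((i, i + 1))
    return suspicious


# The two records reported in this issue.
sample = [
    '{"text":"欢迎来到阿里巴巴！","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}',
    '{"text":"This paper proposed a novel method on LLM pretraining.","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}',
]
print(find_suspicious_neighbors(sample))  # [(0, 1)]
```

For a real run, pass an open file handle for the exported JSONL file instead of the `sample` list.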

simplew2011 · 2023-12-18T11:25:01Z

For distributed deduplication, could xorbits be used as a reference? https://doc.xorbits.io/zh-cn/latest/reference/experimental/generated/xorbits.experimental.dedup.html

github-actions · 2024-01-30T09:32:03Z

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

github-actions · 2024-02-02T09:39:45Z

Close this stale issue.

simplew2011 added the bug label Dec 12, 2023

HYLcool assigned zhijianma Dec 13, 2023

github-actions bot added the stale-issue label Jan 30, 2024

github-actions bot closed this as completed Feb 2, 2024
