[Bug]: RAY error #129

Closed
3 tasks done
simplew2011 opened this issue Dec 12, 2023 · 9 comments
Labels
bug Something isn't working stale-issue

@simplew2011

Before Reporting

  • I have pulled the latest code of the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

from source

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

With the language_id_score_filter operator:

  • --executor_type ray: fails with an error
  • --executor_type default: works normally

To Reproduce

python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

Configs

project_name: 'demo-process'
dataset_path: 'demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'

use_cache: false
save_stats_in_one_file: true

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

outputs.zip

Screenshots

image

Additional

No response

@simplew2011 simplew2011 added the bug Something isn't working label Dec 12, 2023
@simplew2011
Author

#107

@zhijianma
Collaborator

Please use an absolute path for export_path for now. Ray currently cannot write to a relative path when saving.
We will improve this later.
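
Until relative paths are supported, one workaround is to resolve the path before it goes into the config. The following is a minimal sketch, assuming the config is edited by hand or generated by a script; `to_absolute` is a hypothetical helper and not part of Data-Juicer.

```python
import os


def to_absolute(path, base_dir=None):
    """Resolve a possibly relative path against base_dir (default: cwd)."""
    if os.path.isabs(path):
        return os.path.normpath(path)
    base = base_dir if base_dir is not None else os.getcwd()
    return os.path.normpath(os.path.join(base, path))


# Example: the relative export_path from the config in this report.
print(to_absolute('./outputs/demo-process/demo-processed.jsonl'))
```

The printed value depends on the current working directory, so it should be run from the directory the relative path was meant to be relative to.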

@simplew2011
Author

simplew2011 commented Dec 12, 2023

  • After setting export_path to an absolute path, the case above works.
  • However, with another operator, chinese_convert_mapper, ray mode fails again, while default mode works.
  • python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: '/home/wzp/code/LLMData/open_source/data-juicer/demos/process_on_ray/data/demo-dataset.json'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: '/home/wzp/code/LLMData/open_source/data-juicer/outputs/demo-process/demo-processed.jsonl'

# use_cache: false
# save_stats_in_one_file: true

# process schedule
# a list of several process operators with their arguments
process:
  # - language_id_score_filter:
  #     lang: 'zh'
  # - alphanumeric_filter:
  - chinese_convert_mapper:
      mode: 's2t'
NameError: name 'OPENCC_CONVERTER' is not defined

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
    self._queue.put(str_record)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(NameError)'>: attribute lookup RayTaskError(NameError) on ray.exceptions failed
--- End of logging error ---

outputs2.zip

@simplew2011
Author

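Third bug:

  • Enable language_id_score_filter.
  • The problem appears when the number of data rows in data-juicer/demos/process_on_ray/data/demo-dataset.json exceeds cpu*2.
  • Ray splits the data into blocks for inference on different CPUs, and the rows beyond that count get wrong results: several adjacent texts in different languages receive the same result.
  • python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
  • The config and data for this problem are in large_json.zip.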

2023-12-12 19:40:46,412 INFO plan.py:757 -- Using autodetected parallelism=192 for stage ReadJSON to satisfy parallelism at least twice the available number of CPUs (96).
2023-12-12 19:40:46,413 INFO plan.py:762 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.

 - language_id_score_filter:
     lang: 'zh'
{"text":"欢迎来到阿里巴巴!","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}
{"text":"This paper proposed a novel method on LLM pretraining.","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}
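
A quick way to spot this symptom in an exported file is to look for adjacent records whose text differs but whose stats are identical. The following is a minimal sketch, assuming the output is JSON Lines with the `text` and `__dj__stats__` fields shown above; `find_suspicious_pairs` is a hypothetical helper, not a Data-Juicer function.

```python
import json


def find_suspicious_pairs(lines):
    """Return indices i where records i and i+1 have different text
    but identical stats dictionaries."""
    records = [json.loads(line) for line in lines if line.strip()]
    suspicious = []
    for i in range(len(records) - 1):
        a, b = records[i], records[i + 1]
        same_stats = a.get('__dj__stats__') == b.get('__dj__stats__')
        if a['text'] != b['text'] and same_stats:
            suspicious.append(i)
    return suspicious


# The two records reported above.
sample = [
    '{"text":"欢迎来到阿里巴巴!","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}',
    '{"text":"This paper proposed a novel method on LLM pretraining.","__dj__stats__":{"lang":"zh","lang_score":0.9645169377}}',
]
print(find_suspicious_pairs(sample))
```

Identical stats on different texts are only a hint, since two texts can legitimately share a language label; an identical score down to ten decimal places, as here, is the stronger signal.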

@zhijianma
Collaborator

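chinese_convert_mapper does indeed have a problem at the moment, mainly because the global model context cannot be passed to the other Ray processes, similar to the earlier get_model case.
We are working on optimizing this and will keep adding test cases.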
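
One common way around this class of problem is to build the object lazily inside whichever process first needs it, rather than relying on a module-level global set in the driver. The sketch below illustrates that pattern only and is not Data-Juicer's actual fix. `FakeConverter`, `get_converter`, and `convert_sample` are hypothetical names, and `FakeConverter` is just a stand-in for a real OpenCC converter.

```python
# Per-process cache; each worker process starts with its own empty copy.
_CONVERTERS = {}


class FakeConverter:
    """Hypothetical stand-in for an OpenCC converter (assumption)."""

    def __init__(self, mode):
        self.mode = mode
        # Toy mapping covering a single character, for illustration only.
        self._table = {'s2t': {'汉': '漢'}}.get(mode, {})

    def convert(self, text):
        return ''.join(self._table.get(ch, ch) for ch in text)


def get_converter(mode):
    """Create the converter on first use in the current process."""
    if mode not in _CONVERTERS:
        _CONVERTERS[mode] = FakeConverter(mode)
    return _CONVERTERS[mode]


def convert_sample(sample, mode='s2t'):
    """Function a worker would run; it never reads driver-side globals."""
    converter = get_converter(mode)
    return {**sample, 'text': converter.convert(sample['text'])}


print(convert_sample({'text': '汉字'}))
```

Because the cache is filled on first call, a worker that never saw the driver's initialization still ends up with a usable converter instead of raising NameError.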

@zhijianma
Collaborator

zhijianma commented Dec 12, 2023

We will try to reproduce and investigate this third bug.
It has now been fixed; for details, see PR #173 and RAY #42190.

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

github-actions bot commented Feb 2, 2024

Close this stale issue.

@github-actions github-actions bot closed this as completed Feb 2, 2024