Question about the loading scale of pretraining data #59

Open
williamSYSU opened this issue Mar 15, 2023 · 0 comments


Hello Dr. Lu, thank you very much for releasing the code of the UIE model!

When the program loads the pretraining data I constructed, it raises the following error:


Traceback (most recent call last):
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1874, in _prepare_split_single
    writer.write_table(table)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3315, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
                                                                                                                                                                                         
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_uie_pretrain.py", line 509, in <module>
    main()
  File "run_uie_pretrain.py", line 148, in main
    datasets = load_dataset(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/datasets/builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

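With a dataset of 5 million examples, the error above is raised; when the dataset is reduced to 1 million examples, the program runs normally. Judging from the error, the failure appears to be caused by the dataset being too large to load, even though memory was not full at that point.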
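One possible reading of `offset overflow while concatenating arrays`, offered as an assumption rather than a confirmed diagnosis: Arrow's default `string` type addresses its data with signed 32-bit offsets, so a single array can hold at most 2^31 - 1 bytes, and `combine_chunks()` tries to merge all chunks of a column into one array. The sketch below works through that arithmetic. The helper names and the 1000-byte average are hypothetical and are not taken from `run_uie_pretrain.py`.

```python
# Assumption: the failing column uses Arrow's default string type,
# whose offsets are signed 32-bit integers.
INT32_MAX = 2**31 - 1  # 2_147_483_647 bytes addressable by one array


def max_avg_bytes_per_example(num_examples: int) -> float:
    """Largest average example size (in bytes) that still fits in one array."""
    return INT32_MAX / num_examples


def fits_in_one_array(num_examples: int, avg_bytes: float) -> bool:
    """True if the concatenated column stays within the 32-bit offset range."""
    return num_examples * avg_bytes <= INT32_MAX


# Hypothetical average size; the real value depends on the constructed data.
AVG_BYTES = 1000

for n in (1_000_000, 5_000_000):
    print(n, round(max_avg_bytes_per_example(n)), fits_in_one_array(n, AVG_BYTES))
```

Under this assumption, 1 million examples would fit as long as they average below roughly 2147 bytes each, while 5 million examples would overflow once they average above roughly 429 bytes each. That would be consistent with a failure that depends on dataset size rather than on free memory.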

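So I have a few questions I would like to ask you:

  1. Does the program load the dataset entirely into memory? The paper gives the dataset size as 65M * 3 = 195M, so how was pretraining on data of that scale carried out?
  2. Is there a way to train that processes the data in a streaming fashion?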
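To illustrate what streaming means in question 2, here is a minimal standard-library sketch that reads a JSON Lines file lazily and yields fixed-size batches, so only one batch is held in memory at a time. It is not the UIE training pipeline: the function names, the file layout, and the batch size are assumptions made for illustration. The Hugging Face `datasets` library also offers `load_dataset(..., streaming=True)`, which returns an iterable dataset; whether `run_uie_pretrain.py` works with it unchanged is not confirmed here.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def stream_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line without reading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a record stream into lists of at most batch_size items."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

A training loop could then iterate with `for batch in batched(stream_jsonl(path), 32): ...`, where `path` points to a hypothetical JSON Lines file of pretraining examples.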
williamSYSU changed the title from "Loading of pretraining data" to "Question about the loading scale of pretraining data" on Mar 15, 2023