I would like to get help from Datatrove enthusiasts regarding issues I'm facing while running the example script. #235

Open
barneylogo opened this issue Jul 1, 2024 · 31 comments


@barneylogo

barneylogo commented Jul 1, 2024

Hello, Datatrove enthusiasts,

Nice to meet you all.

Recently, I've been working on the Datatrove library and I'm trying to run a sample script, process_common_crawl_dump.py from the following link: Datatrove GitHub.

I've made a couple of changes to the script: I've reduced the number of tasks from 8000 to 4 and updated randomize_start_duration to randomize_start. However, after running the script, I encountered some issues.

Here is the accounting history that I received:
Accounting History

Additionally, I believe these logs are stored on my S3:
S3 Log 1
image
image
I was expecting to get output as a result, but there are no output directories or files.
I only got log files.
Expected Output

For reference, here is my slurm.conf file:
Slurm.conf

I've tried running the script multiple times, but I always get the same result. I'm not sure if this is the right place to ask for help, but I would appreciate any assistance from fellow Datatrove lovers.

Thank you!

@hynky1999
Contributor

Hi,
I can't see it in the screenshot, but what's the value of MAIN_OUTPUT_PATH?
The resulting files should be saved in {MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}, not in the logs folder.
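A minimal sketch of how that output location is put together, assuming the script builds it with plain string formatting as the template above suggests. The bucket name and dump name below are hypothetical placeholders, not values from this issue.

```python
def expected_output_path(main_output_path: str, dump_to_process: str) -> str:
    """Build {MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}."""
    filtering_output_path = f"{main_output_path}/base_processing"
    return f"{filtering_output_path}/output/{dump_to_process}"


# Hypothetical values, for illustration only.
print(expected_output_path("s3://my-bucket", "CC-MAIN-2023-50"))
```

Checking this exact prefix (rather than the logs folder) is the quickest way to tell whether any documents survived the pipeline.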

@barneylogo
Author

Hi @hynky1999,
Thank you for your reply.
Here is MAIN_OUTPUT_PATH:
image
I mean that after running the script, I can't see an output folder anywhere.
If possible, could we use a communication channel such as Discord or Telegram?
My Discord is barney49 and my Telegram is @raincoin5.
I really hope to meet you.

@hynky1999
Contributor

Then can you check whether s3://data-refine/base_processing/base_processing/output/ contains any folders?

@barneylogo
Author

There is no output folder.
I can only see the logs folder, as shown in the screenshot I shared.

@hynky1999
Contributor

Strange. So if you run aws s3 ls s3://data-refine/base_processing//base_processing/output/ you get no results? (Notice the double //.)
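A small sketch of why a double slash can appear, assuming MAIN_OUTPUT_PATH was set with a trailing slash (the actual value is only visible in a screenshot). S3 object keys are literal strings, so a prefix containing `//` is a different prefix from one with a single `/`. The helper names are hypothetical.

```python
def join_naive(base: str, suffix: str) -> str:
    """Join the way an f-string template does: always insert one slash."""
    return f"{base}/{suffix}"


def join_safe(base: str, suffix: str) -> str:
    """Strip trailing slashes from the base first to avoid an empty segment."""
    return f"{base.rstrip('/')}/{suffix}"


base = "s3://data-refine/base_processing/"  # assumed trailing slash
print(join_naive(base, "base_processing/output/"))
print(join_safe(base, "base_processing/output/"))
```

If the data was written under the double-slash prefix, listing the single-slash prefix would show nothing even though the files exist.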

@barneylogo
Author

It causes an error!

@barneylogo
Author

image

@barneylogo
Author

On AWS, I can only see the logs folder.
image

@barneylogo
Author

barneylogo commented Jul 2, 2024

Hello @hynky1999,
If you don't mind, can we discuss this in more detail via Discord or Telegram?
I really hope to solve this problem ASAP.
Or, where can I find the community?
Thank you.

@barneylogo
Author

Hello @hynky1999,
If possible, could you leave a message?
Anyway, thank you for your help.
I really need to solve this problem.

@hynky1999
Contributor

hynky1999 commented Jul 2, 2024

Hey, we don't have any community forum as of right now.
Could you send the logs you got, please? (Not screenshots.)

@barneylogo
Author

which logs?

@barneylogo
Author

I will send all files

@barneylogo
Author

Hi @hynky1999,
Here are the logs:
https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf
Also, I only have the logs folder in my AWS account, but I can't download it.
image

@hynky1999
Contributor

Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec?
See https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/extractors/trafilatura.py#L26
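To illustrate why a short extraction timeout can leave the output empty, here is a generic sketch. It is not datatrove's implementation: `slow_extract` is a hypothetical stand-in for text extraction, and the delays are made up. In datatrove itself, the change would presumably be passing a larger `timeout` to the `Trafilatura` extractor, per the linked line.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Optional


def slow_extract(html: str, delay: float) -> str:
    """Hypothetical stand-in for an extractor that takes `delay` seconds."""
    time.sleep(delay)
    return html.strip()


def extract_with_timeout(html: str, delay: float, timeout: float) -> Optional[str]:
    """Return the extracted text, or None if extraction exceeds `timeout`."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_extract, html, delay)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            return None  # the document is dropped


# On a slow machine every document may exceed a tight limit and be dropped.
print(extract_with_timeout(" hello ", delay=0.3, timeout=0.05))
print(extract_with_timeout(" hello ", delay=0.3, timeout=1.0))
```

When every document hits the limit, the pipeline finishes without errors but writes nothing, which matches the symptom of logs with no output folder.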

@barneylogo
Author

I will try. thank you

@barneylogo
Author

Hi @hynky1999,
After running the script, I got new error logs.
You can see them in the Google Drive folder above: 1051_0 ~ 1051_3.
image

@barneylogo
Author

barneylogo commented Jul 2, 2024

Hello @hynky1999,
Ultimately, I want to run this script: https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py
But so far I am not getting the expected result.
Could you give me some advice?
I think you may have an idea.

@barneylogo
Author

barneylogo commented Jul 2, 2024

Hi @hynky1999,

I think the server spec may be the problem.
Here is my CPU spec:
image
Or does the datatrove library need a high-end GPU?
Or is my Slurm configuration wrong?
If possible, could you let me know?

@barneylogo
Author

Hello @hynky1999,
How are you today?

@hynky1999
Contributor

I am good, thank you for asking :)
It's not a Slurm problem. How did you install datatrove? From pip or from source?
Can you run the following command and send the output? pip freeze | grep numpy
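The same check can be done from inside the Python environment the job actually uses, which avoids confusion when several interpreters are installed. This is a minimal sketch using the standard library; the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("numpy:", installed_version("numpy"))
print("datatrove:", installed_version("datatrove"))
```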

@barneylogo
Author

barneylogo commented Jul 3, 2024

pip install datatrove[all]
I thought it was because of the Python version. I was using Python 3.12.4,
so I've just reinstalled Python 3.10.12 and am installing datatrove again.
Once that's done, I can send you the output.
Or could you let me know which Python version I should use?
Thank you

@hynky1999
Contributor

Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy.
Can you try installing datatrove from source, like this?
pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove

@barneylogo
Author

barneylogo commented Jul 3, 2024

I will try, thank you.
So you mean any Python version is OK?

@hynky1999
Contributor

3.10+ should be fine
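A quick way to confirm the interpreter meets that minimum before launching a job. This is a sketch; the function name is hypothetical and the 3.10 threshold is taken from the reply above.

```python
import sys


def is_supported(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    """True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


print(sys.version.split()[0], "supported:", is_supported())
```

By this check both versions mentioned in the thread, 3.12.4 and 3.10.12, qualify.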

@barneylogo
Author

barneylogo commented Jul 3, 2024

Hello @hynky1999,
I've installed datatrove like this: pip install 'datatrove[all]'@git+https://github.com/huggingface/datatrove
The script now runs without error logs, but I think we are still not getting the right result.
I've uploaded the log files:
https://drive.google.com/drive/folders/1JjbxAKdsfgAaFm3H9Y_JsD-8O6MwoOmf
If possible, could you give me your opinion on the logs again?
Thank you

@hynky1999
Contributor

Hi, could you try processing more samples? 10k+? (By setting the limit variable in the reader.)
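For readers unfamiliar with what a reader limit does, here is a generic sketch, not datatrove's reader code. It assumes the limit simply caps how many documents are pulled from the source, with a negative value meaning no cap; the generator and function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def fake_documents(total: int) -> Iterator[str]:
    """Hypothetical document source standing in for a WARC reader."""
    for i in range(total):
        yield f"doc-{i}"


def read_with_limit(docs: Iterable[str], limit: int = -1) -> Iterator[str]:
    """Yield at most `limit` documents; a negative limit means no cap."""
    return iter(docs) if limit < 0 else islice(docs, limit)


sample = list(read_with_limit(fake_documents(50_000), limit=10_000))
print(len(sample))
```

With a very small limit, the downstream filters may reject every document, so a larger sample makes it easier to tell a real bug from an unlucky batch.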

@barneylogo
Author

Hello @hynky1999,
How are you doing?
I really hope to meet you.
Can we use Discord or Telegram?
If you'd rather not, could you let me know what you prefer?
Thank you

@hynky1999
Contributor

hynky1999 commented Jul 11, 2024

Hi, I don't want to resolve this issue anywhere outside of GitHub issues. What's the state of your problem now? I can't see any logs in the Google Drive folder you sent.
PS: Could you post logs directly in this issue conversation next time?

@barneylogo
Author

How are you, @hynky1999?
Could you check this status?
image
Also, the first task took 10 hours, and I have 6000 tasks in total.
How can I increase the speed?

@hynky1999
Contributor

hynky1999 commented Jul 11, 2024

So now you can see the output?
Re speed, there is not much you can do to speed it up other than using more CPUs or improving I/O.
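A rough back-of-the-envelope sketch of why more CPUs matter here. It uses the figures reported above (6000 tasks, about 10 hours per task) and assumes every task takes the same time and tasks run in equal-sized waves. The worker counts are hypothetical.

```python
import math


def estimated_wall_clock_hours(total_tasks: int, hours_per_task: float,
                               parallel_workers: int) -> float:
    """Estimate total runtime when tasks run in waves of `parallel_workers`."""
    waves = math.ceil(total_tasks / parallel_workers)
    return waves * hours_per_task


for workers in (4, 64):  # hypothetical numbers of concurrent tasks
    hours = estimated_wall_clock_hours(6000, 10, workers)
    print(f"{workers} workers: ~{hours:.0f} hours (~{hours / 24:.0f} days)")
```

Under these assumptions, 4 concurrent tasks would take about 15,000 hours, while 64 would take about 940, so the runtime scales roughly inversely with the number of tasks that can run at once.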
