
Optimize and parallelize the downloads #32

Closed
Pikrass opened this issue Nov 3, 2019 · 2 comments

Comments

Pikrass commented Nov 3, 2019

Hi!
First of all, thank you for your script, it really helped me!

I wanted to download a 350,000-message group. As you can imagine, it was taking a while.
I realized the script was creating a wget process, and therefore a single connection to Google, for each and every forum page, thread page, and message, which is really inefficient.

So I improved the generated script so that:

  1. a batch of messages is retrieved over a single connection, thanks to curl and its -o $output1 $url1 -o $output2 $url2 syntax
  2. multiple processes run at the same time

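The batching idea in point 1 can be sketched as follows. This is a minimal illustration, not the actual script: the paths are placeholders, and local file:// URLs stand in for the real Google Groups message pages so the snippet runs anywhere curl is installed.

```shell
# Sketch: one curl process downloads several URLs in a single invocation
# by repeating -o OUTPUT URL pairs, instead of spawning one wget per URL.
set -e

tmp=$(mktemp -d)
printf 'message one' > "$tmp/src1"
printf 'message two' > "$tmp/src2"

# One process for the whole batch; over HTTP(S), curl can also reuse
# the connection across the URLs in the batch.
curl -s \
  -o "$tmp/out1" "file://$tmp/src1" \
  -o "$tmp/out2" "file://$tmp/src2"
```

In the real generated script, each -o pair would point at a message URL and its destination file, with the batch size controlled by a parameter.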
I even made a nice text UI to monitor the jobs.

Here's my script. I license it under the same terms as your code.

I'm not doing a PR because:

  • there's still room for improvement (e.g. my script only handles the actual messages, not the thread list and message list)
  • I didn't look that much into the code, only into the generated script, so a few tweaks are probably needed
  • there are a few decisions involved: my version requires curl; there are parameters for the number of processes and the number of URLs per batch; the cookie file is its own parameter instead of going through WGET_OPTIONS; etc.

Hope this proves useful. :)

@ghost mentioned this issue Jan 24, 2020
icy (Owner) commented Apr 12, 2020

That's excellent, @Pikrass. I will update the README to mention your script, and I think that's good enough for the moment.

My intention was to query the Google server slowly so the script wouldn't get blocked; that's why I didn't have parallel support out of the box. Once you have the generated script, you have a few options to continue, e.g., your way, or you can also feed it to the GNU parallel tool.
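That second option can be sketched like this. The script name and job count are hypothetical, and the shape of the generated script may vary; the actual fan-out is demonstrated below with xargs -P on dummy jobs so the snippet runs without GNU parallel installed.

```shell
# Sketch: run the generated script's download commands N at a time instead
# of serially. With GNU parallel the idea is roughly:
#
#   grep '^wget' crawler.sh | parallel -j 4
#
# (crawler.sh and -j 4 are placeholders.) The same fan-out works with
# plain xargs -P, shown here on dummy jobs:
set -e

work=$(mktemp -d)
printf '%s\n' page1 page2 page3 page4 |
  xargs -P 4 -I{} sh -c "printf fetched > '$work/{}'"
```

Note that raising the parallelism works against the original rate-limiting intent, so a modest job count is safer.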

@icy closed this as completed Apr 12, 2020
icy (Owner) commented Apr 12, 2020

I mentioned your script here: https://github.com/icy/google-group-crawler#contributions . Thanks a lot.
