
Optimize and parallelize the downloads #32

Closed
Pikrass opened this issue Nov 3, 2019 · 2 comments

Comments

Pikrass commented Nov 3, 2019

Hi!
First of all, thank you for your script, it really helped me!

I wanted to download a 350,000-message group. As you can imagine, it was taking a while.
I realized the script was creating a wget process, and therefore a single connection to Google, for each and every forum page, thread page, and message, which is really inefficient.

So I improved the generated script so that:

  1. a batch of messages is retrieved over a single connection, thanks to curl and its -o $output1 $url1 -o $output2 $url2 syntax
  2. multiple processes run at the same time

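The batching idea in point 1 can be sketched as follows. This is a minimal illustration, not the actual script: the paths are placeholders, and local file:// URLs stand in for the real Google Groups message pages so the snippet runs anywhere curl is installed.

```shell
# Sketch: one curl process downloads several URLs in a single invocation
# by repeating -o OUTPUT URL pairs, instead of spawning one wget per URL.
set -e

tmp=$(mktemp -d)
printf 'message one' > "$tmp/src1"
printf 'message two' > "$tmp/src2"

# One process for the whole batch; over HTTP(S), curl can also reuse
# the connection across the URLs in the batch.
curl -s \
  -o "$tmp/out1" "file://$tmp/src1" \
  -o "$tmp/out2" "file://$tmp/src2"
```

In the real generated script, each -o pair would point at a message URL and its destination file, with the batch size controlled by a parameter.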
I even made a nice text UI to monitor the jobs.

Here's my script. I license it under the same terms as your code.

I'm not doing a PR because:

  • there's still room for improvement (e.g. my script only handles the actual messages, not the thread list and message list)
  • I didn't look that much into the code, only into the generated script, so a few tweaks are probably needed
  • there are a few decisions involved: my version requires curl; there are parameters for the number of processes and the number of URLs per batch; the cookie file is its own parameter instead of going through WGET_OPTIONS; etc.

Hope this proves useful. :)

@ghost mentioned this issue Jan 24, 2020
icy (Owner) commented Apr 12, 2020

That's excellent, @Pikrass. I will update the README to mention your script, and I think that's good enough for the moment.

My intention was to query the Google server slowly so the script wouldn't get blocked; that's why I didn't have parallel support out of the box. Once you have the generated script, you have a few options to continue, e.g., your way, or you can also feed it to the GNU parallel tool.
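That second option can be sketched like this. The script name and job count are hypothetical, and the shape of the generated script may vary; the actual fan-out is demonstrated below with xargs -P on dummy jobs so the snippet runs without GNU parallel installed.

```shell
# Sketch: run the generated script's download commands N at a time instead
# of serially. With GNU parallel the idea is roughly:
#
#   grep '^wget' crawler.sh | parallel -j 4
#
# (crawler.sh and -j 4 are placeholders.) The same fan-out works with
# plain xargs -P, shown here on dummy jobs:
set -e

work=$(mktemp -d)
printf '%s\n' page1 page2 page3 page4 |
  xargs -P 4 -I{} sh -c "printf fetched > '$work/{}'"
```

Note that raising the parallelism works against the original rate-limiting intent, so a modest job count is safer.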

@icy closed this as completed Apr 12, 2020
icy (Owner) commented Apr 12, 2020

I mentioned your script here: https://github.com/icy/google-group-crawler#contributions . Thanks a lot.
