Scheduling & performance #143

Open · 3 tasks · TobiX opened this issue Dec 2, 2019 · 2 comments

TobiX commented Dec 2, 2019

Currently, dosage downloads comics in a very straightforward, sequential way (roughly sketched below):

  1. Get page
  2. Parse page
  3. Get images
  4. Continue with next page
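
In pseudocode-ish Python, that flow looks something like this (a simplified sketch, not dosage's actual code; `parse` is a hypothetical callable returning the image URLs and the next page URL):

```python
import requests

def download_comic(parse, start_url):
    # Simplified sketch of the current sequential flow.
    url = start_url
    while url:
        page = requests.get(url).text               # 1. get page
        image_urls, url = parse(page)               # 2. parse page
        for image_url in image_urls:                # 3. get images
            data = requests.get(image_url).content
            with open(image_url.rsplit("/", 1)[-1], "wb") as f:
                f.write(data)
        # 4. continue with next page (loop)
```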

For better performance, the user can opt to download multiple comics in parallel (via the -p option), but that's more of a crutch: the threads aren't aware of each other, which can lead to multiple threads fetching comics from the same hoster at once.

We should evaluate a better scheduling system, satisfying at least the following requirements:

  • Parallel downloads from multiple hosts
  • Throttling per host (we don't want to overload a hoster)
  • Image downloads can be handled separately from page parsing

It might be worthwhile to look at asyncio, async/await, or something along those lines; a rough sketch follows.
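
One possible shape for this, using asyncio plus aiohttp (which dosage doesn't currently depend on): per-host semaphores provide the throttling, and a queue decouples image downloads from page parsing. `parse()`, the limits, and the worker count are placeholders.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

import aiohttp

PER_HOST_LIMIT = 2  # assumed limit, tune per hoster
host_locks = defaultdict(lambda: asyncio.Semaphore(PER_HOST_LIMIT))

async def fetch(session, url):
    # Throttle per host: at most PER_HOST_LIMIT concurrent requests per netloc.
    async with host_locks[urlsplit(url).netloc]:
        async with session.get(url) as resp:
            return await resp.read()

async def image_worker(session, queue):
    # Image downloads run independently of page parsing.
    while True:
        url = await queue.get()
        try:
            data = await fetch(session, url)
            # ... write `data` to disk here ...
        finally:
            queue.task_done()

async def crawl_comic(session, queue, parse, start_url):
    url = start_url
    while url:
        page = await fetch(session, url)
        image_urls, url = parse(page)   # parse() is a placeholder
        for image_url in image_urls:
            queue.put_nowait(image_url)

async def main(comics, parse):
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        workers = [asyncio.create_task(image_worker(session, queue))
                   for _ in range(4)]
        await asyncio.gather(*(crawl_comic(session, queue, parse, url)
                               for url in comics))
        await queue.join()
        for w in workers:
            w.cancel()
```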

Null000 commented Dec 2, 2019

We may want to evaluate Scrapy; it seems perfect for the job.

My experience with it is limited (I've only written one spider), but it seemed like a great fit.
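
For what it's worth, Scrapy's built-in settings already cover the throttling requirement; a minimal spider could look roughly like this (the comic name, URLs, and selectors are made-up examples):

```python
import scrapy

class ComicSpider(scrapy.Spider):
    name = "somecomic"                      # made-up example
    start_urls = ["https://example.com/comic/1"]
    custom_settings = {
        # Scrapy throttles per host out of the box:
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "DOWNLOAD_DELAY": 0.5,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        # Selectors are placeholders; each site needs its own.
        for src in response.css("img.comic::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```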

TobiX commented Dec 3, 2019

> We may want to evaluate Scrapy; it seems perfect for the job.

Yes, that might be one solution, but probably a pretty hefty one... I'm certainly not a fan of reinventing the wheel, but this particular wheel seems to bring the whole caravan with it 😉

Just for fun, from an empty virtualenv:

```
Installing collected packages: six, protego, cssselect, w3lib, lxml, parsel, incremental, attrs, Automat, constantly, PyHamcrest, idna, hyperlink, zope.interface, Twisted, pyasn1, pyasn1-modules, pycparser, cffi, cryptography, service-identity, queuelib, PyDispatcher, pyOpenSSL, scrapy
```

Hey, at least we are already using some of those dependencies.

(I'm not totally opposed to using scrapy, but this looks like a daunting task indeed)
