Scheduling & performance #143

Open · 3 tasks · TobiX opened this issue Dec 2, 2019 · 2 comments

TobiX commented Dec 2, 2019

Currently, dosage downloads comics in a very straightforward, sequential way (roughly sketched below):

  1. Get page
  2. Parse page
  3. Get images
  4. Continue with next page
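
In pseudocode-ish Python, that flow looks something like this (a simplified sketch, not dosage's actual code; `parse` is a hypothetical callable returning the image URLs and the next page URL):

```python
import requests

def download_comic(parse, start_url):
    # Simplified sketch of the current sequential flow.
    url = start_url
    while url:
        page = requests.get(url).text               # 1. get page
        image_urls, url = parse(page)               # 2. parse page
        for image_url in image_urls:                # 3. get images
            data = requests.get(image_url).content
            with open(image_url.rsplit("/", 1)[-1], "wb") as f:
                f.write(data)
        # 4. continue with next page (loop)
```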

For better performance, the user can opt to download multiple comics in parallel (via the -p option), but that's more of a crutch: the threads aren't aware of each other, which can lead to multiple threads fetching comics from the same hoster at once.

We should evaluate a better scheduling system, satisfying at least the following requirements:

  • Parallel downloads from multiple hosts
  • Throttling per host (we don't want to overload a hoster)
  • Image downloads can be handled separately from page parsing

It might be worthwhile to look at asyncio, async/await, or something along those lines; a rough sketch follows.
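
One possible shape for this, using asyncio plus aiohttp (which dosage doesn't currently depend on): per-host semaphores provide the throttling, and a queue decouples image downloads from page parsing. `parse()`, the limits, and the worker count are placeholders.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

import aiohttp

PER_HOST_LIMIT = 2  # assumed limit, tune per hoster
host_locks = defaultdict(lambda: asyncio.Semaphore(PER_HOST_LIMIT))

async def fetch(session, url):
    # Throttle per host: at most PER_HOST_LIMIT concurrent requests per netloc.
    async with host_locks[urlsplit(url).netloc]:
        async with session.get(url) as resp:
            return await resp.read()

async def image_worker(session, queue):
    # Image downloads run independently of page parsing.
    while True:
        url = await queue.get()
        try:
            data = await fetch(session, url)
            # ... write `data` to disk here ...
        finally:
            queue.task_done()

async def crawl_comic(session, queue, parse, start_url):
    url = start_url
    while url:
        page = await fetch(session, url)
        image_urls, url = parse(page)   # parse() is a placeholder
        for image_url in image_urls:
            queue.put_nowait(image_url)

async def main(comics, parse):
    async with aiohttp.ClientSession() as session:
        queue = asyncio.Queue()
        workers = [asyncio.create_task(image_worker(session, queue))
                   for _ in range(4)]
        await asyncio.gather(*(crawl_comic(session, queue, parse, url)
                               for url in comics))
        await queue.join()
        for w in workers:
            w.cancel()
```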

Null000 commented Dec 2, 2019

We may want to evaluate Scrapy; it seems perfect for the job.

My experience with it is limited (I've only written one spider), but it seemed like a great fit.
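
For what it's worth, Scrapy's built-in settings already cover the throttling requirement; a minimal spider could look roughly like this (the comic name, URLs, and selectors are made-up examples):

```python
import scrapy

class ComicSpider(scrapy.Spider):
    name = "somecomic"                      # made-up example
    start_urls = ["https://example.com/comic/1"]
    custom_settings = {
        # Scrapy throttles per host out of the box:
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "DOWNLOAD_DELAY": 0.5,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        # Selectors are placeholders; each site needs its own.
        for src in response.css("img.comic::attr(src)").getall():
            yield {"image_url": response.urljoin(src)}
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```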

TobiX commented Dec 3, 2019

> We may want to evaluate Scrapy; it seems perfect for the job.

Yes, that might be one solution, but probably a pretty hefty one... I'm certainly not a fan of reinventing the wheel, but this particular wheel seems to bring the whole caravan with it 😉

Just for fun, from an empty virtualenv:

```
Installing collected packages: six, protego, cssselect, w3lib, lxml, parsel, incremental, attrs, Automat, constantly, PyHamcrest, idna, hyperlink, zope.interface, Twisted, pyasn1, pyasn1-modules, pycparser, cffi, cryptography, service-identity, queuelib, PyDispatcher, pyOpenSSL, scrapy
```

Hey, at least we are already using some of those dependencies.

(I'm not totally opposed to using scrapy, but this looks like a daunting task indeed)
