Implement crawler.teardown (exists in JS version) #651
Labels:
- enhancement: New feature or request.
- t-tooling: Issues with this label are in the ownership of the tooling team.
Implement a way to stop the crawler in an obvious and controlled manner from within the user function. It should properly shut down all resources and immediately stop the crawler from sending any further requests. It should mirror the JS version.
Use case:
The user wants to stop the crawler from within the user function.
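For illustration, here is a sketch of how the requested API might be used, assuming the Python method mirrors the JS `crawler.teardown()` name (the import path matches the crawlee version current at the time of this issue and may differ in yours; the stop condition is hypothetical):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')
    if 'stop-marker' in context.request.url:  # hypothetical stop condition
        # Proposed API (does not exist yet): shut down all resources and
        # stop the crawler from sending any further requests.
        await crawler.teardown()
        return
    await context.enqueue_links()

asyncio.run(crawler.run(['https://crawlee.dev']))
```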
Examples of current workarounds for the user:
Add a flag, check it at the beginning of the user function, and short-circuit the user function's evaluation:

```python
if finished:
    return
...
```

Drawback: currently queued requests are still being sent, but not processed.
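Spelled out as a fuller sketch, reusing the crawler setup from the example above (the stop condition is again hypothetical):

```python
finished = False  # module-level flag, flipped once the stop condition is hit

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    global finished
    if finished:
        # Short-circuit: skip all work for this request. The request was still
        # dequeued and fetched before the handler ran, hence the drawback above.
        return
    if 'stop-marker' in context.request.url:  # hypothetical stop condition
        finished = True
```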
Call some private internals:

```python
await crawler._pool.abort()
```

Drawback: this relies on internal API, and the remaining tasks will still finish.
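In handler form (same setup as above; `_pool` is a private attribute, so this may break across versions):

```python
@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    if 'stop-marker' in context.request.url:  # hypothetical stop condition
        # Abort the internal autoscaled pool. Tasks already in flight
        # still run to completion, which is the drawback noted above.
        await crawler._pool.abort()
```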
Drop the request provider:

```python
await request_provider.drop()
```

Drawback: a bunch of errors, as existing tasks might still try to access the request provider.
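In context, assuming the request queue is opened explicitly and passed to the crawler (the `request_provider` constructor parameter name matches the crawlee version current at the time of this issue; the stop condition is hypothetical):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import RequestQueue

async def main() -> None:
    request_provider = await RequestQueue.open()
    crawler = BeautifulSoupCrawler(request_provider=request_provider)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        if 'stop-marker' in context.request.url:  # hypothetical stop condition
            # Dropping the queue removes the storage out from under in-flight
            # tasks, which then raise errors when they next touch it.
            await request_provider.drop()

    await crawler.run(['https://crawlee.dev'])

asyncio.run(main())
```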
Example of how this is solved in Scrapy:
https://docs.scrapy.org/en/2.11/faq.html#how-can-i-instruct-a-spider-to-stop-itself
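For comparison, the Scrapy pattern from that FAQ entry is to raise the `CloseSpider` exception from a callback (the spider and stop condition below are illustrative):

```python
import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        if 'stop-marker' in response.url:  # hypothetical stop condition
            # Requests a clean, controlled shutdown of the spider.
            raise CloseSpider('stop condition met')
```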