enhancement: add IN_ERROR processed status #112

Open
brbog opened this issue Aug 16, 2022 · 2 comments

brbog commented Aug 16, 2022

Inside WebCrawler.run(), a value "processedSuccess" is received that indicates whether any form of exception handling was needed (the exception may merely have been logged without halting on error). When processing failed, this value could be used to update the processed status from SCHEDULED to IN_ERROR instead of to COMPLETED. Just make sure the query that fetches the next batch of URLs (frontier.getNextURLs(batchReadSize, assignedURLs);) returns only SCHEDULED ones and ignores both COMPLETED and IN_ERROR.
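A minimal sketch of the proposed status handling, assuming an in-memory map in place of a real frontier backend. The class StatusFrontierSketch and its methods schedule, setProcessed and statusOf are hypothetical names for illustration; only getNextURLs is modelled on the call quoted above, and its real signature and behaviour may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a frontier; not the project's actual classes.
class StatusFrontierSketch {

    enum Status { SCHEDULED, COMPLETED, IN_ERROR }

    // LinkedHashMap keeps insertion order, so batches are deterministic.
    private final Map<String, Status> urls = new LinkedHashMap<>();

    // Adds a URL as SCHEDULED unless it is already known.
    void schedule(String url) {
        urls.putIfAbsent(url, Status.SCHEDULED);
    }

    // Modelled on frontier.getNextURLs(batchReadSize, assignedURLs):
    // appends URLs that are still SCHEDULED until 'result' holds 'max' entries.
    // COMPLETED and IN_ERROR entries are skipped.
    void getNextURLs(int max, List<String> result) {
        for (Map.Entry<String, Status> e : urls.entrySet()) {
            if (result.size() >= max) {
                break;
            }
            if (e.getValue() == Status.SCHEDULED) {
                result.add(e.getKey());
            }
        }
    }

    // Assumed hook: called with the processedSuccess value from WebCrawler.run().
    void setProcessed(String url, boolean processedSuccess) {
        urls.put(url, processedSuccess ? Status.COMPLETED : Status.IN_ERROR);
    }

    Status statusOf(String url) {
        return urls.get(url);
    }
}
```

This sketch does not track which URLs are already assigned to a crawler thread, so a real implementation would still need whatever in-process bookkeeping the backend already does.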

Advantages:

  • When the logs show that all IN_ERROR pages failed because of the same error(s), only those pages could be retried later with new (custom) code.
  • Makes it possible to provide reporting on the success rate of the crawl.
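Both advantages could be sketched roughly as follows, assuming the statuses are available as a simple URL-to-status map. CrawlReportSketch, countByStatus, successRate and rescheduleFailed are hypothetical names, and the definition of success rate used here (COMPLETED divided by COMPLETED plus IN_ERROR) is an assumption rather than anything the project defines.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical reporting/retry helpers; not part of the project's API.
class CrawlReportSketch {

    enum Status { SCHEDULED, COMPLETED, IN_ERROR }

    // Counts URLs per status; every status is present in the result, possibly with 0.
    static Map<Status, Integer> countByStatus(Map<String, Status> urls) {
        Map<Status, Integer> counts = new EnumMap<>(Status.class);
        for (Status s : Status.values()) {
            counts.put(s, 0);
        }
        for (Status s : urls.values()) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Share of processed URLs (COMPLETED or IN_ERROR) that completed.
    // Returns 0.0 when nothing has been processed yet.
    static double successRate(Map<String, Status> urls) {
        Map<Status, Integer> counts = countByStatus(urls);
        int completed = counts.get(Status.COMPLETED);
        int processed = completed + counts.get(Status.IN_ERROR);
        return processed == 0 ? 0.0 : (double) completed / processed;
    }

    // Flips every IN_ERROR URL back to SCHEDULED so it is fetched again,
    // and returns how many URLs were re-queued.
    static int rescheduleFailed(Map<String, Status> urls) {
        int requeued = 0;
        for (Map.Entry<String, Status> e : urls.entrySet()) {
            if (e.getValue() == Status.IN_ERROR) {
                e.setValue(Status.SCHEDULED);
                requeued++;
            }
        }
        return requeued;
    }
}
```

Re-queuing everything in IN_ERROR only makes sense once the logs confirm a common cause that the new code addresses; otherwise the same pages would simply fail again.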

Future:
If the above advantages prove useful but insufficient, future work could differentiate between different kinds of errors.

For small sites, I didn't have a need for this feature, but it is a quick win to implement, and time will tell if more work on this feature is useful.


brbog commented Aug 16, 2022

(since it is inside the frontier logic, I suspect it's more fun for you @rzo1 :-))

@rzo1 rzo1 added this to the v5.0.0 milestone Aug 16, 2022
Collaborator

rzo1 commented Aug 16, 2022

It should be easy to have it for hsqldb as it already uses the concept of Status.

It isn't trivial for the sleepycat backend and might require some major refactoring to introduce a Status there.

I don't know whether urlfrontier simply allows setting an IN_ERROR status (and filtering for it). I will have a look at how it is done in SC and check the urlfrontier impl. Otherwise I need to ask Julien N. for some input :)

@rzo1 rzo1 modified the milestones: v5.0.0, v5.0.1 Aug 24, 2022
@rzo1 rzo1 modified the milestones: v5.0.1, v5.0.2 Dec 12, 2022
@rzo1 rzo1 modified the milestones: v5.0.2, v5.0.3 May 23, 2023
@rzo1 rzo1 modified the milestones: v5.0.3, v5.1.1 Oct 24, 2024