enhancement: add IN_ERROR processed status #112

brbog · 2022-08-16T11:54:15Z

Inside WebCrawler.run(), a "processedSuccess" value is available which indicates whether any form of exception handling was needed while processing a page (the exception may merely have been logged, without halting on error). This value could be used to update the processed status from SCHEDULED to IN_ERROR instead of to COMPLETED. The query that fetches the next batch of URLs (frontier.getNextURLs(batchReadSize, assignedURLs);) must then fetch only SCHEDULED URLs and ignore both COMPLETED and IN_ERROR ones.
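A minimal sketch of the idea, assuming a purely in-memory stand-in for the frontier. The names SketchFrontier, ProcessedStatus, schedule, setProcessed and statusOf are hypothetical and only illustrate the status transition; getNextURLs here uses a simplified signature and is not the project's actual method.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical status values; IN_ERROR is the addition proposed in this issue.
enum ProcessedStatus { SCHEDULED, COMPLETED, IN_ERROR }

// Hypothetical in-memory stand-in for the frontier, not the real implementation.
class SketchFrontier {
    // Insertion-ordered so URLs are handed out in the order they were scheduled.
    private final Map<String, ProcessedStatus> statusByUrl = new LinkedHashMap<>();

    // Registers a URL as SCHEDULED unless it is already known.
    void schedule(String url) {
        statusByUrl.putIfAbsent(url, ProcessedStatus.SCHEDULED);
    }

    // Hands out at most `max` URLs; COMPLETED and IN_ERROR entries are skipped.
    List<String> getNextURLs(int max) {
        List<String> batch = new ArrayList<>();
        for (Map.Entry<String, ProcessedStatus> entry : statusByUrl.entrySet()) {
            if (batch.size() >= max) {
                break;
            }
            if (entry.getValue() == ProcessedStatus.SCHEDULED) {
                batch.add(entry.getKey());
            }
        }
        return batch;
    }

    // Called after a page was handled; processedSuccess decides the final status.
    void setProcessed(String url, boolean processedSuccess) {
        statusByUrl.put(url,
                processedSuccess ? ProcessedStatus.COMPLETED : ProcessedStatus.IN_ERROR);
    }

    ProcessedStatus statusOf(String url) {
        return statusByUrl.get(url);
    }
}
```

With this shape, a page whose processing needed exception handling ends up as IN_ERROR and is never handed out again by getNextURLs, exactly like a COMPLETED page.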

Advantages:

If the logs show that all IN_ERROR pages failed because of the same error(s), only those pages could be retried later on with new (custom) code.
It becomes possible to report on the success rate of the crawl.
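Both advantages could be sketched roughly as follows, assuming the statuses are available as a simple map from URL to status. PageStatus, CrawlReport, successRate and urlsToRetry are hypothetical names for illustration only; a real backend would query its own store instead.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical status values, including the proposed IN_ERROR.
enum PageStatus { SCHEDULED, COMPLETED, IN_ERROR }

// Hypothetical reporting helper working on a URL-to-status map.
class CrawlReport {

    // Share of finished pages (COMPLETED or IN_ERROR) that ended up COMPLETED.
    // Returns 0.0 when no page has finished yet.
    static double successRate(Map<String, PageStatus> statusByUrl) {
        long completed = count(statusByUrl, PageStatus.COMPLETED);
        long inError = count(statusByUrl, PageStatus.IN_ERROR);
        long finished = completed + inError;
        return finished == 0 ? 0.0 : (double) completed / finished;
    }

    // URLs that could be retried later, sorted for stable output.
    static List<String> urlsToRetry(Map<String, PageStatus> statusByUrl) {
        return statusByUrl.entrySet().stream()
                .filter(entry -> entry.getValue() == PageStatus.IN_ERROR)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    private static long count(Map<String, PageStatus> statusByUrl, PageStatus wanted) {
        return statusByUrl.values().stream().filter(status -> status == wanted).count();
    }
}
```

Pages that are still SCHEDULED are left out of the rate on purpose, so the number only reflects pages that were actually processed.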

Future:
If the above advantages prove useful but insufficient, future work could differentiate between different kinds of errors.

For small sites, I didn't have a need for this feature, but it is a quick win to implement, and time will tell if more work on this feature is useful.


brbog · 2022-08-16T11:56:37Z

(since it is inside the frontier logic, I suspect it's more fun for you @rzo1 :-))

rzo1 · 2022-08-16T12:46:20Z

It should be easy to have it for hsqldb as it already uses the concept of Status.

It isn't trivial for the sleepycat backend and might require some major refactoring to introduce a Status there.

I don't know whether urlfrontier makes it possible to simply set an IN_ERROR status (and filter on it). I will have a look at how it is done in SC and check the urlfrontier implementation. Otherwise I will need to ask Julien N. for some input :)

rzo1 added the enhancement label Aug 16, 2022

rzo1 added this to the v5.0.0 milestone Aug 16, 2022

rzo1 modified the milestones: v5.0.0, v5.0.1 Aug 24, 2022

rzo1 modified the milestones: v5.0.1, v5.0.2 Dec 12, 2022

rzo1 modified the milestones: v5.0.2, v5.0.3 May 23, 2023

rzo1 added the help wanted label May 23, 2023

rzo1 modified the milestones: v5.0.3, v5.1.1 Oct 24, 2024
