Inside WebCrawler.run() a value processedSuccess is received which indicates whether any exception handling was needed (an exception may be merely logged without halting the crawl). This value could be used to update the processed status from SCHEDULED to IN_ERROR instead of to COMPLETED. Just make sure the query that fetches the next batch of URLs only returns SCHEDULED ones and ignores both the COMPLETED and IN_ERROR statuses (frontier.getNextURLs(batchReadSize, assignedURLs);).
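A minimal sketch of the idea, assuming an in-memory frontier for illustration only: the names Frontier, getNextURLs, and setProcessed mirror the issue text, but their signatures and the Status enum here are hypothetical, not the project's real API.

```java
import java.util.*;

// Hypothetical statuses; the issue only names SCHEDULED, COMPLETED, IN_ERROR.
enum Status { SCHEDULED, COMPLETED, IN_ERROR }

class Frontier {
    private final Map<String, Status> urls = new LinkedHashMap<>();

    void schedule(String url) { urls.put(url, Status.SCHEDULED); }

    // Only SCHEDULED URLs are eligible for the next batch; COMPLETED and
    // IN_ERROR are both skipped, so failed pages are not re-fetched until
    // they are explicitly rescheduled.
    void getNextURLs(int batchReadSize, List<String> assignedURLs) {
        for (Map.Entry<String, Status> e : urls.entrySet()) {
            if (assignedURLs.size() >= batchReadSize) break;
            if (e.getValue() == Status.SCHEDULED) assignedURLs.add(e.getKey());
        }
    }

    // Called after processing: processedSuccess decides the final status.
    void setProcessed(String url, boolean processedSuccess) {
        urls.put(url, processedSuccess ? Status.COMPLETED : Status.IN_ERROR);
    }

    // Per-status counts make success-rate reporting trivial.
    long count(Status s) {
        return urls.values().stream().filter(st -> st == s).count();
    }
}

public class StatusSketch {
    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        frontier.schedule("http://example.com/a");
        frontier.schedule("http://example.com/b");

        List<String> assignedURLs = new ArrayList<>();
        frontier.getNextURLs(10, assignedURLs);
        frontier.setProcessed(assignedURLs.get(0), true);   // processed cleanly
        frontier.setProcessed(assignedURLs.get(1), false);  // handler only logged an error

        System.out.println("completed=" + frontier.count(Status.COMPLETED)
                + " inError=" + frontier.count(Status.IN_ERROR));
        // prints completed=1 inError=1
    }
}
```

A retry pass would then simply flip selected IN_ERROR entries back to SCHEDULED so the normal batch query picks them up again.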
Advantages:
- When the logs show that all IN_ERROR pages are due to the same error(s), only those pages could be retried later with new (custom) code.
- Makes it possible to report on the success rate of the crawl.
Future:
If the above advantages prove useful but insufficient, future work could differentiate between different sorts of errors.
For small sites, I didn't have a need for this feature, but it is a quick win to implement, and time will tell if more work on this feature is useful.
It should be easy to have it for hsqldb as it already uses the concept of Status.
It isn't trivial for the sleepycat backend and might require some major refactoring to allow to introduce a Status here.
I don't know if it is possible with urlfrontier to set an IN_ERROR status (and filter for it). I will have a look at how it is done in SC and check the urlfrontier implementation. Otherwise I will need to ask Julien N. for some input :)