Distinct


FireAwayH edited this page Aug 21, 2018 · 4 revisions

This feature only works when the storage type is set to MySQL.

Captured data can contain duplicates for several reasons, so we need a way to make the data unique.

This feature is enabled (set to true) by default.

If you want to disable this feature to reduce scraping time, enter anything other than true on the Scrape page, as shown in the picture below.

Distinct
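The rule described above ("anything except true disables the feature") can be sketched in a few lines. This is only an illustration, not the extension's actual code: the function name `distinct_enabled` is hypothetical, and the sketch assumes an exact comparison against the string `true`. Whether the real field ignores case or surrounding whitespace is not stated on this page.

```python
def distinct_enabled(field_value: str = "true") -> bool:
    """Hypothetical helper: interpret the Distinct field on the Scrape page.

    Assumption: only the exact string "true" keeps the feature on;
    every other input turns it off. The default mirrors the documented
    default of true.
    """
    return field_value == "true"


print(distinct_enabled())         # default value, feature stays on
print(distinct_enabled("false"))  # any other input disables it
print(distinct_enabled("no"))
```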

Usage

Make sure the value is set to true, then start scraping.

Use cases:

Data duplication can happen in the following situations:

Chrome crashes.

You can't tell which URLs have not been scraped yet, so the only option is to restart the Sitemap that stopped.

Data will be captured again when you restart a Sitemap.

Mistakes in selectors or pages.

Sometimes the scraper will collect duplicate data because there are errors in your Selector or in the page you are scraping.
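To make the restart case concrete, here is a minimal sketch of what making the data unique means. It does not show how the extension implements Distinct with MySQL, which this page does not describe. The function `distinct_rows` and the sample rows are hypothetical, and the sketch assumes two rows are duplicates only when every field matches.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def distinct_rows(rows: Iterable[Row]) -> List[Row]:
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        # Sort the items so the key does not depend on field order.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical data: the first run stopped after two pages,
# and the restarted Sitemap captured those two pages again.
first_run = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
after_restart = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
    {"url": "https://example.com/c", "title": "C"},
]

captured = first_run + after_restart
unique = distinct_rows(captured)
print(len(captured), len(unique))
```

Rows that differ in any field are kept as separate entries, so a page whose content changed between runs would still appear twice under this assumption.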