Filter HTML tags in MAC WebCrawler not working #95
Hi everyone! As documented in the tutorial, I tried to set filter values inside the Global Element Properties of the MAC config to filter out links and text I'm not interested in. I tried it on a random e-commerce site and added "div.product-tile" so that only the links in each product tile would be crawled (I also set maxDepth = 1), but when it ran, it still crawled every other link on the page, including the ones in the header. I don't want to crawl any links outside the product ones, because I'm also getting the usual "Contact Us" and "About" pages in the downloaded JSON files (using the "Crawl website" element). Am I missing something to make it work? I made sure to add the value in the config, saved it, and added the config to the module configuration. Thank you!
Hi @ldostuni , thanks for reaching out! 👍
The global element configuration option for the crawler is used to fetch page content from the HTML elements specified in that configuration. It does not alter the crawl behaviour (i.e. it does not filter which links get crawled). The crawler will still crawl all pages to the specified depth, only retrieving content (text) from the elements specified in the configuration.
If you want to create a custom search, have a look at the other operations provided by the crawler. For example, you can stitch together Generate Sitemap to get all page links to the desired depth, then iterate over those pages, passing each page link to Page Insights - this gives you a list of all urls on a…
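To illustrate the idea behind that kind of post-processing outside of Mule: once you have the per-page links (e.g. from Page Insights output), you can keep only the ones that sit inside `div.product-tile` elements yourself. Here's a minimal stdlib-only Python sketch of that filtering step - the class name and the sample HTML are made up for illustration, and this is not a MAC API, just the general technique:

```python
from html.parser import HTMLParser

class ProductLinkExtractor(HTMLParser):
    """Collect hrefs only from <a> tags nested inside <div class="product-tile">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting depth inside a product-tile div
        self.links = []  # hrefs found inside product tiles

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter a tile, or track nested divs while already inside one.
            if "product-tile" in attrs.get("class", "").split() or self.depth:
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

# Sample page: two product tiles plus navigation links we want to skip.
html = """
<nav><a href="/about">About</a><a href="/contact">Contact Us</a></nav>
<div class="product-tile"><a href="/p/shoe-1">Shoe 1</a></div>
<div class="product-tile"><a href="/p/shoe-2">Shoe 2</a></div>
"""

parser = ProductLinkExtractor()
parser.feed(html)
print(parser.links)  # ['/p/shoe-1', '/p/shoe-2'] - header links are excluded
```

The same filtering could of course live in a DataWeave transform inside the Mule flow instead; the point is that link selection happens after the crawl/sitemap step, not inside the crawler's global element config.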