Filter HTML tags in MAC WebCrawler not working #95
Hi everyone! As documented in the tutorial, I tried to set filter values inside the Global Element Properties of the MAC config to filter out links and text I'm not interested in. I tried it on a random e-commerce site and added "div.product-tile" so that only the links in each product tile would be crawled (I also set maxDepth = 1), but when it ran, it still crawled every other link on the page, including the ones in the header. I don't want to crawl any links outside the product ones, because I'm also getting the usual "Contact Us" and "About" pages in the downloaded JSON files (using the "Crawl website" element). Am I missing something to make it work? I made sure to add the value in the config, saved it, and added the config to the module configuration. Thank you!
Hi @ldostuni , thanks for reaching out! 👍
The global element configuration option for the crawler is used to fetch page content from the HTML elements specified in that configuration. It does not alter the crawl behaviour (i.e. it does not filter which links get crawled). The crawler will still crawl all pages to the specified depth, only retrieving content (text) from the elements specified in the configuration.
If you want to create a custom search, have a look at the other operations provided by the crawler. For example, you can stitch together Generate Sitemap to get all page links to the desired depth, then iterate over those pages, passing each page link to Page Insights - this gives you a list of all urls on a…
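To illustrate the idea behind that kind of post-processing outside of Mule: once you have the per-page links (e.g. from Page Insights output), you can keep only the ones that sit inside `div.product-tile` elements yourself. Here's a minimal stdlib-only Python sketch of that filtering step - the class name and the sample HTML are made up for illustration, and this is not a MAC API, just the general technique:

```python
from html.parser import HTMLParser

class ProductLinkExtractor(HTMLParser):
    """Collect hrefs only from <a> tags nested inside <div class="product-tile">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting depth inside a product-tile div
        self.links = []  # hrefs found inside product tiles

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter a tile, or track nested divs while already inside one.
            if "product-tile" in attrs.get("class", "").split() or self.depth:
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

# Sample page: two product tiles plus navigation links we want to skip.
html = """
<nav><a href="/about">About</a><a href="/contact">Contact Us</a></nav>
<div class="product-tile"><a href="/p/shoe-1">Shoe 1</a></div>
<div class="product-tile"><a href="/p/shoe-2">Shoe 2</a></div>
"""

parser = ProductLinkExtractor()
parser.feed(html)
print(parser.links)  # ['/p/shoe-1', '/p/shoe-2'] - header links are excluded
```

The same filtering could of course live in a DataWeave transform inside the Mule flow instead; the point is that link selection happens after the crawl/sitemap step, not inside the crawler's global element config.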