Update website_crawler.py to keep clean urls #114

nespera · 2024-08-21T16:24:35Z

It looks like the website crawler normalizes the URLs it gathers but then immediately throws them away. This change keeps the normalized list.

ofermend · 2024-08-21T18:54:44Z

crawlers/website_crawler.py

@@ -79,7 +79,6 @@ def crawl(self) -> None:
                                           pos_regex=self.pos_regex, neg_regex=self.neg_regex, 
                                           indexer=self.indexer, visited=set(), verbose=self.indexer.verbose)
                urls = clean_urls(urls_set, keep_query_params)
-                urls = list(set(urls_set))


Good catch, but I think the point of line 82 is really to remove duplicates, so it should have been
"urls = list(set(urls))"
rather than being removed altogether.

But clean_urls also removes duplicates using the same mechanism, doesn't it?

Yes, you are right!
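
To make the exchange above concrete, here is a minimal sketch of why the extra dedupe step is redundant. The real clean_urls is not shown in this thread, so `clean_urls_sketch` below is a hypothetical stand-in that only assumes what the thread states (it normalizes URLs and removes duplicates through a set); the normalization rules and the example URLs are illustrative assumptions, not the project's actual behaviour.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_urls_sketch(urls, keep_query_params=False):
    """Hypothetical stand-in for clean_urls: normalize each URL, dedupe via a set."""
    normalized = set()
    for url in urls:
        parts = urlsplit(url)
        # Assumed normalization: always drop the fragment, and drop the
        # query string unless the caller asks to keep it.
        query = parts.query if keep_query_params else ""
        normalized.add(urlunsplit((parts.scheme, parts.netloc, parts.path, query, "")))
    return list(normalized)


# Illustrative input: three raw URLs that normalize to the same page.
urls_set = {
    "https://example.com/docs?utm_source=a",
    "https://example.com/docs?utm_source=b",
    "https://example.com/docs#intro",
}

cleaned = clean_urls_sketch(urls_set)

# The removed line rebuilt the list from urls_set, discarding the cleaned result.
overwritten = list(set(urls_set))

# The suggested alternative dedupes a list that is already free of duplicates.
deduped_again = list(set(cleaned))
```

Under these assumptions, `overwritten` still holds the three raw URLs, while `deduped_again` is identical to `cleaned`, which is why dropping the line entirely is enough.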

ofermend

LGTM.

Update website_crawler.py to keep clean urls

38201f3

ofermend reviewed Aug 21, 2024


ofermend approved these changes Aug 21, 2024


ofermend merged commit cbdfac4 into vectara:main Aug 21, 2024

nespera deleted the normalize_urls branch August 29, 2024 14:28
