Search on the website using Crawler. #372

Closed
Snegovik777 opened this issue Dec 26, 2024 · 4 comments

Snegovik777 commented Dec 26, 2024

Hi Unclecode. Thanks for your crawl4ai. Very cool library. A small question: can I use your Crawler on the Amazon website (for example) to type "Samsung Galaxy Tab" into the "Search" field and then find the product I need on the results page?

Snegovik777 commented Dec 26, 2024

I understand that it needs to be done using JS... but could you explain it in a little more detail? Using Amazon as an example, at least.

unclecode (Owner) commented Dec 26, 2024

@Snegovik777 Thanks for trying Crawl4ai. You are right that the general approach involves running JavaScript and then continuing to crawl the page, and I think it's a good idea to add some examples; I will do that. For this specific task, however, you don't need JavaScript: you can simply play around with the URL. When you search for the Samsung Galaxy Tab you mentioned, Amazon navigates to this URL:

https://www.amazon.com/s?k=Samsung+Galaxy+Tab
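
As a side note, you can build such a search URL for any query with the standard library. A minimal sketch (amazon_search_url is just a hypothetical helper, and the s?k= pattern is specific to Amazon):

from urllib.parse import urlencode

def amazon_search_url(query: str) -> str:
    # Amazon's search endpoint takes the query in the "k" parameter;
    # urlencode handles spaces and special characters for us.
    return "https://www.amazon.com/s?" + urlencode({"k": query})

print(amazon_search_url("Samsung Galaxy Tab"))
# -> https://www.amazon.com/s?k=Samsung+Galaxy+Tab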

After this, you start crawling. You can use two approaches: an LLM extraction strategy, which extracts the data into JSON using an LLM, or my favorite approach, JsonCssExtractionStrategy (or JsonXPathExtractionStrategy). You create a schema for the repetitive patterns and extract them all. Here I share a code example; an LLM-based sketch follows it:

"""
This example demonstrates how to use JSON CSS extraction to scrape product information 
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json

async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True
    )
    
    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    # Example search URL (you should replace with your actual Amazon URL)
    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
    
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)

if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
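
If you prefer the LLM route, the shape is similar but you describe the fields in natural language instead of a schema. A minimal sketch, assuming an OpenAI key in the environment; the LLMExtractionStrategy parameter names used here (provider, api_token, instruction) are my assumption, so verify them against your installed version:

import asyncio
import json
import os
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def extract_with_llm():
    config = CrawlerRunConfig(
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o-mini",          # assumption: litellm-style provider id
            api_token=os.getenv("OPENAI_API_KEY"),  # assumption: key passed via env var
            instruction="Extract each product as JSON with title, price, and rating.",
        )
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun(url="https://www.amazon.com/s?k=Samsung+Galaxy+Tab",
                                    config=config)
        if result and result.extracted_content:
            print(json.dumps(json.loads(result.extracted_content), indent=2))

if __name__ == "__main__":
    asyncio.run(extract_with_llm())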

unclecode self-assigned this Dec 26, 2024

Snegovik777 commented Dec 27, 2024

WOW!!! Yes indeed, I didn't immediately notice the Amazon address. That should work!!

But I'm still trying to accomplish this task using JS; I may not be as lucky with other sites as I was with Amazon))) And I can't get it to work yet... According to the manual I seem to be doing everything correctly, but I get an error...

Link to the script. I didn't paste the whole code here..

[screenshot of the error]

Maybe I'm making a mistake somewhere... It would be great if someone could also check this task... in case it's a bug...

unclecode (Owner) commented

@Snegovik777 You are on the right path, but your script has some issues, so I will explain a bit more. First, remember that when you run JavaScript code to interact with the page and prepare it for crawling, always wait for a specific condition afterwards, such as the presence of an element. The code runs asynchronously, so it is difficult to know when and how execution ends; the best approach is to wait for a condition like the presence of an element, or to pass JavaScript code that checks those conditions for you.

You do not need to add setTimeout to wait; just run your code. Also, you don't need to split your code into lines and send them as an array.

Finally, an alternative to JS is to use the crawler hooks defined in Crawl4ai, which let you navigate and make those preparations directly through the Playwright page object.

I created three fresh examples and added them to the docs folder; I will push them soon, either today or tomorrow, in the new version 0.4.24. For your reference and future use, I share them here. There are three ways to do this: I already explained one, crawling the final URL directly, and the two code examples below cover the other two, one using JavaScript and one using hooks.

Using JavaScript:

In this code, pay attention to how I created two versions of the JavaScript code: one uses asynchronous methods and the other synchronous methods; you can use either. The second point is the wait_for parameter that I pass to the crawler run config. As you can see, I wait for a specific element, the container of the search results. Note that wait_for starts with the prefix "css:". You can also pass a JavaScript expression that is expected to return true or false; in that case, use the "js:" prefix (a short sketch of this form follows the code). Either way, after Playwright executes your code, we wait until the wait_for condition is satisfied.

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json

async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )
    
    # JavaScript to fill the search box and submit the query (async version)
    js_code_to_search = """
        const task = async () => {
            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
            document.querySelector('#nav-search-submit-button').click();
        }
        await task();
    """
    # Synchronous variant (unused here); either string works as js_code
    js_code_to_search_sync = """
            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
            document.querySelector('#nav-search-submit-button').click();
    """
    # Crawler config: run the search JS, wait for the results, then extract
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code = js_code_to_search,
        wait_for='css:[data-component-type="s-search-result"]',
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    # Start from the home page; js_code performs the search
    url = "https://www.amazon.com/"

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)

if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
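
For reference, here is what the "js:" form of wait_for mentioned above could look like. This is only a sketch; the predicate is my own illustration, and it would replace the "css:" wait_for line in the config above:

# Drop-in alternative for the wait_for above: a JavaScript predicate that
# returns true once at least one search-result container exists.
wait_for="js:() => !!document.querySelector('[data-component-type=\"s-search-result\"]')",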

Using Crawler Hooks:

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext

async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )
    
    # Crawler config with the JSON CSS extraction strategy; the search itself happens in the after_goto hook
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,

        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    # Start from the home page; the after_goto hook performs the search
    url = "https://www.amazon.com/"
    
    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        
        try:
            # Wait for search box to be available
            search_box = await page.wait_for_selector('#twotabsearchtextbox', timeout=1000)
            
            # Type the search query
            await search_box.fill('Samsung Galaxy Tab')
            
            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector('#nav-search-submit-button', timeout=1000)
            
            # Click with navigation waiting
            await search_button.click()
            
            # Wait for search results to load
            await page.wait_for_selector('[data-component-type="s-search-result"]', timeout=10000)
            print("[HOOK] Search completed and results loaded!")
            
        except Exception as e:
            print(f"[HOOK] Error during search operation: {str(e)}")
            
        return page    
    
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)

if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
