Search on the website using Crawler. #372
Hi Unclecode. Thanks for your crawl4ai, very cool library. A small question: can I use your crawler on the Amazon website (for example) to type "Samsung Galaxy Tab" (for example) into the "Search" field and then find the product I need on the results page?
@Snegovik777 Thanks for trying Crawl4ai. You found that the right approach involves running JavaScript and then continuing to crawl the page, and I agree it would be good to add some examples of this; I will do that. For this specific task, though, you don't need JavaScript at all: you can simply play with the URL. If you run the search you mentioned for the Samsung Galaxy Tab, Amazon moves to this URL: https://www.amazon.com/s?k=Samsung+Galaxy+Tab. From there you start crawling. You can use two approaches: an LLM extraction strategy that extracts data into JSON using a language model, or my favorite approach, JsonCssExtractionStrategy (or JsonXPathExtractionStrategy), where you create a schema for the repetitive patterns and extract them all. Here is a code example:

```python
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True
    )

    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    # Example search URL (you should replace with your actual Amazon URL)
    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
```
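A side note on the "play with the URL" idea: for many sites the search page is just a GET endpoint, so you can often build the search URL programmatically instead of driving the UI. Here is a minimal sketch using only the Python standard library; the helper name amazon_search_url is mine, and the k parameter is simply what the Amazon search URL above uses (other sites will use different parameter names):

```python
from urllib.parse import urlencode

def amazon_search_url(query: str) -> str:
    # Amazon's search endpoint takes the query in the "k" parameter;
    # urlencode escapes spaces and special characters for us.
    return "https://www.amazon.com/s?" + urlencode({"k": query})

print(amazon_search_url("Samsung Galaxy Tab"))
# -> https://www.amazon.com/s?k=Samsung+Galaxy+Tab
```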
WOW!!! Yes indeed, I didn't immediately notice the Amazon search URL. That should work!! But I'm still trying to accomplish this task using JS, since I may not be as lucky with other sites as I was with Amazon. So far I can't get it to work: according to the manual I seem to be doing everything correctly, but I get an error. Link to the script (I didn't paste the whole code here). Maybe I'm making a mistake somewhere, but it would be great if someone could check this task too, in case it's a bug.
@Snegovik777 You are on the right path, but your script has a few issues, so let me explain a bit more. First, remember that when you run JavaScript to interact with the page and prepare it for crawling, always wait for a specific criterion afterwards, such as the presence of an element. This code runs asynchronously, which makes it difficult to know when and how execution ends, so the best approach is to wait for a condition like an element appearing, or to pass JavaScript code that checks those conditions for you. Finally, an alternative to raw JavaScript is to use the crawler hooks defined in Crawl4ai, which let you navigate and make those preparations directly through the Playwright page object. I created three fresh examples, added them to the docs folder, and will push them soon, today or tomorrow, in the new version 0.4.24. For your reference and future use, I will share them here. There are three ways to do this: I already explained the first one, crawling the end URL directly, and the two code examples below cover using JavaScript and using hooks.

Using JavaScript

In this code, pay attention to how I created two versions of the JavaScript: one uses asynchronous methods and the other synchronous ones; you can use either. The second point is the wait_for parameter that I pass to the crawler run config: as you can see, I wait for a specific element, the container of the search results.

```python
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext
async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )

    js_code_to_search = """
    const task = async () => {
        document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
        document.querySelector('#nav-search-submit-button').click();
    }
    await task();
    """

    js_code_to_search_sync = """
    document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
    document.querySelector('#nav-search-submit-button').click();
    """

    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=js_code_to_search,
        wait_for='css:[data-component-type="s-search-result"]',
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    url = "https://www.amazon.com/"

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
```

Using Crawler Hooks

```python
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext
async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )

    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin"
                    },
                    {
                        "name": "title",
                        "selector": "h2 a span",
                        "type": "text"
                    },
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href"
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src"
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text"
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text"
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text"
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists"
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True
                    }
                ]
            }
        )
    )

    url = "https://www.amazon.com/"

    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        try:
            # Wait for search box to be available
            search_box = await page.wait_for_selector('#twotabsearchtextbox', timeout=1000)
            # Type the search query
            await search_box.fill('Samsung Galaxy Tab')
            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector('#nav-search-submit-button', timeout=1000)
            # Click with navigation waiting
            await search_button.click()
            # Wait for search results to load
            await page.wait_for_selector('[data-component-type="s-search-result"]', timeout=10000)
            print("[HOOK] Search completed and results loaded!")
        except Exception as e:
            print(f"[HOOK] Error during search operation: {str(e)}")
        return page

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get('delivery_info'):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
```
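One more detail about waiting: besides the css: prefix used in the JavaScript example, wait_for also accepts a js: prefix with a JavaScript expression that must return true before crawling continues. Here is a minimal sketch of just that parameter; the readiness condition is an arbitrary example I chose for this Amazon page:

```python
from crawl4ai.async_configs import CrawlerRunConfig

# Continue only once the page has finished loading and at least one
# search-result container is present in the DOM.
js_wait_config = CrawlerRunConfig(
    wait_for=(
        "js:() => document.readyState === 'complete' && "
        "document.querySelectorAll('[data-component-type=\"s-search-result\"]').length > 0"
    )
)
```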