Regarding scraping of dynamic websites like Skyscanner.net #341

Open
Shuaib11-Github opened this issue Dec 11, 2024 · 11 comments
Comments

@Shuaib11-Github

I was trying to scrape content from Skyscanner.net with the fields Origin, Destination, Price, Departure time, and Arrival time, but it gives the error below:

Please provide the following travel details:
Departure Airport (e.g., JFK): DEL
Date of Departure (YYYY-MM-DD): 2024-12-12
Hour of Departure (24-hour format, e.g., 14:00): 16:05
Destination Airport (e.g., LAX): BLR
Details saved to CSV file successfully.
[INIT].... → Crawl4AI 0.4.1
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
┌───────────────────────────────────────────────────────────────────────────────┐
│ × async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded. │
│ =========================== logs =========================== │
│ "load" event fired │
│ ============================================================ │
└───────────────────────────────────────────────────────────────────────────────┘

Failed to crawl the URL: https://www.skyscanner.co.in/transport/flights/del/blr/241212/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&inboundaltsenabled=false&infants=0&outboundaltsenabled=false&preferdirects=false&ref=home&rtn=0
Error: async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired

How can we fix this so that it works seamlessly? Also, there is a "show more results" button that reveals the remaining data. How can we extract all of the data present on the website using Crawl4AI?

@Navanit-git

+1

@unclecode
Owner

unclecode commented Dec 13, 2024

Hi @Shuaib11-Github (and anyone else facing similar issues),

The problem you’re encountering with Skyscanner and similar dynamic websites is that they employ strong anti-bot and anti-scraping measures. When you try to load the page programmatically, you might pass initial checks like a random user agent, but the website can still detect that it’s not a real browser session or a genuine user. As a result, you hit a “bot detection” wall.

I’ve attached images below to illustrate what happens:

  1. Bot Detection Screen:
    [image]
    Initially, you may see a challenge page or some form of verification step.

  2. Passing the Detection:
    [image]
    If you use a managed browser session and interact with the site as a real browser would, you can get past this stage. The browser retains your state, cookies, and other identifying factors, so once you pass the verification step, subsequent crawls from the same user directory are recognized as a genuine session.

  3. Success & Extracted Data:
    [image]
    After successfully bypassing detection, Crawl4AI can extract the page content as intended.

Because scenarios like this are common, I’m adding this explanation as a reference tutorial. This way, whenever someone encounters a similar problem, they can refer back to these steps and examples.


Tutorial: Dealing with Anti-Bot Measures

Many modern sites, especially those dealing with travel, e-commerce, or finance, have robust anti-bot systems. They detect non-human browsing patterns and headless browsers. While setting a random user agent often works for simpler pages, you may need a more advanced approach for tougher sites.

Key Strategies:

  1. First Step with User Agent Randomization
    Before delving into managed browsers, first try the simplest approach:

    • Set user_agent_mode="random" in BrowserConfig.
    • Run your crawl to see if the site allows you through without additional measures.

    If this step doesn’t work and you still encounter bot detection or challenges, then proceed to the more robust solution using a managed browser and persistent user data.

  2. Use a Managed Browser:
    By enabling use_managed_browser in BrowserConfig, you’re effectively launching a full browser instance with persistent user data. This lets the site identify you as a returning user and not a fresh “bot” each time.

    For example, you might do:

    import asyncio
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
    from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
    
    async def main():
        # Configure the browser
        browser_config = BrowserConfig(
            headless=False,  # Set to False so you can see what's happening
            verbose=True,
            user_agent_mode="random",
            use_managed_browser=True, # Enables persistent browser sessions
            browser_type="chromium",
            user_data_dir="/path/to/your_chrome_user_data"
        )
    
        # Set crawl configuration
        crawl_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator()
        )
    
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(
                url="https://www.skyscanner.co.in/transport/flights/del/",
                config=crawl_config
            )
    
            if result.success:
                print("Raw Markdown Length:", len(result.markdown_v2.raw_markdown))
                print("Citations Markdown Length:", len(result.markdown_v2.markdown_with_citations))
    
    if __name__ == "__main__":
        asyncio.run(main())
  3. First Run - Pass the Challenge Manually:
    The first time you run it, keep headless=False so you can see the browser. If the website shows a CAPTCHA or challenge, solve it manually in the opened browser window. Once done, that session (stored in user_data_dir) will “remember” that you’ve passed the challenge.

  4. Subsequent Crawls - Automatic Access:
On future runs, you can enable headless=True since the site now recognizes your browser session. This gives you full automation for extraction without the bot detection popping up every time (see the sketch after this list).
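
Here is that two-phase workflow as a minimal sketch, assuming you reuse the same user_data_dir on every run; FIRST_RUN is a hypothetical flag you flip to False once the challenge has been solved:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

FIRST_RUN = True  # set to False after passing the verification manually

async def crawl_once():
    browser_config = BrowserConfig(
        headless=not FIRST_RUN,    # visible browser only on the first run
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # keep session state between runs
        browser_type="chromium",
        user_data_dir="/path/to/your_chrome_user_data",  # must be the same dir every run
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )
        if result.success:
            print("Markdown length:", len(result.markdown_v2.raw_markdown))

if __name__ == "__main__":
    asyncio.run(crawl_once())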


In Summary:

  • Basic pages: Try headless=True with a random user agent (default config).
  • Tough anti-bot pages: Use a managed browser with a user data directory and interact with the site once manually.
  • After passing the initial verification step, you can crawl the site as if you were a regular user, allowing you to gather all the data you need.

This approach makes Crawl4AI much more versatile, enabling you to tackle even heavily protected sites.

@blghtr

blghtr commented Dec 16, 2024

So magic mode doesn't currently work in cases like this?

@Shuaib11-Github
Author

Shuaib11-Github commented Dec 16, 2024

@unclecode I got the below when I ran the code:

self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
TypeError: crawl4ai.async_crawler_strategy.AsyncPlaywrightCrawlerStrategy() got multiple values for keyword argument 'browser_config'

@unclecode
Owner

@Shuaib11-Github My bad! In the AsyncWebCrawler constructor it should be config=..., not browser_config=.... I've edited it now!

@blghtr I’ll add this to magic mode as well. When you set magic=True, it will switch to a managed browser, create a temporary user directory, set a random user agent, and then, once everything is done, either remove the directory or reuse it later.
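
For reference, magic mode today is just a flag on the run config; the managed-browser behavior described above is the planned addition, not current behavior. A minimal sketch, assuming crawl4ai 0.4.x:

from crawl4ai import CrawlerRunConfig, CacheMode

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    magic=True,  # currently masks common automation signals and simulates user interaction
)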

@Shuaib11-Github
Author

@unclecode with headless=False, I got the below

[INIT].... → Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ↓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.96s
[SCRAPE].. ◆ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 23ms
[COMPLETE] ● https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.00s
Raw Markdown Length: 371
Citations Markdown Length: 371
[INFO].... ℹ Browser process terminated normally | Code: 1

when changed to headless=True, I got the below

[INIT].... → Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ↓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.27s
[SCRAPE].. ◆ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 9ms
[COMPLETE] ● https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 1.28s
Raw Markdown Length: 371
Citations Markdown Length: 371

How can I extract the flight details and save them in some format? If I can at least store the details in Markdown, I can then convert them to a CSV file. But I only need the data extracted for the respective flights matching the user's input.

@unclecode
Owner

@Shuaib11-Github Look at the following code:

import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def main():
    # Configure the browser
    browser_config = BrowserConfig(
        headless=False,  # Set to False so you can see what's happening
        verbose=True,
        user_agent_mode="random",
        use_managed_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/Users/unclecode/.user_data_dir",
    )

    schema = {
        "name": "Skyscanner Place Cards",
        "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
        "fields": [
            {
                "name": "city_name",
                "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
                "type": "text",
            },
            {
                "name": "country_name",
                "selector": "span[class*='PlaceCard_subName__']",
                "type": "text",
            },
            {
                "name": "description",
                "selector": "span[class*='PlaceCard_advertLabel__']",
                "type": "text",
            },
            {
                "name": "flight_price",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
                "type": "text",
            },
            {
                "name": "flight_type",
                "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--body-default__",
                "type": "text",
            },
            {
                "name": "flight_url",
                "selector": "a[data-testid='flights-link']",
                "type": "attribute",
                "attribute": "href",
            },
            {
                "name": "hotels_url",
                "selector": "a[data-testid='hotels-link']",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }

    # Set crawl configuration
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div[class^='PlaceCard_descriptionContainer__']",
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.skyscanner.co.in/transport/flights/del/",
            config=crawl_config,
        )

        if result.success:
            companies = json.loads(result.extracted_content)
            print(f"Successfully extracted {len(companies)} companies")
            print(json.dumps(companies[0], indent=2))


if __name__ == "__main__":
    asyncio.run(main())

[INIT].... → Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ↓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.88s
[SCRAPE].. ◆ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 265ms
[EXTRACT]. ■ Completed for https://www.skyscanner.co.in/transport/flights/del... | Time: 0.10316416597925127s
[COMPLETE] ● https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.25s
Successfully extracted 9 companies
{
  "country_name": "Saudi Arabia",
  "description": "This land is calling. Step into Saudi, the heart of Arabia.",
  "flight_url": "https://www.skyscanner.co.in/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501",
  "hotels_url": "/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501&hotelsselected=true"
}
[INFO].... ℹ Browser process terminated normally | Code: 0

Just pay attention to something very important about the first time I run this code. When I pass a new user data directory, I set a breakpoint, for example at the line that checks whether the result is successful. When I run the code with headless set to False, the code waits, and I can see the browser asking me to prove that I am human. I complete the verification, and once it is approved, the page displays. Then I stop the whole process and run the code again; from that point on, because it uses the directory I created, which contains my human-verified session state, it works reliably.

As you can see, I use JsonCssExtractionStrategy here, and I was able to extract the data in the JSON format you want. It is worth mentioning that I also used wait_for; it is a must here. You can also use LLM extraction, or just store the Markdown. But the key point is that you understand how to handle the managed browser.
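
To get from there to the CSV you asked about: once extracted_content comes back as JSON, writing it out is standard-library work. A minimal sketch, assuming the result parses to a list of flat dicts like the output above (save_extracted_to_csv is an illustrative helper, not part of Crawl4AI):

import csv
import json

def save_extracted_to_csv(extracted_content: str, path: str = "flights.csv") -> None:
    # Parse the JSON string Crawl4AI returns and write one CSV row per record.
    rows = json.loads(extracted_content)
    if not rows:
        return
    # Take the union of keys so records with missing fields still fit.
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

You would call it right after the result.success check, e.g. save_extracted_to_csv(result.extracted_content).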

@Shuaib11-Github
Author

Shuaib11-Github commented Dec 16, 2024

But I need data in the below format

[
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "08:00",
    "arrival_time": "10:50"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "05:55",
    "arrival_time": "09:05"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "08:00",
    "arrival_time": "10:50"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "03:30",
    "arrival_time": "06:20"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "21:35",
    "arrival_time": "00:25"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "08:10",
    "arrival_time": "13:45"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "21:50",
    "arrival_time": "00:40"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "17:40",
    "arrival_time": "20:30"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "08:10",
    "arrival_time": "13:45"
  },
  {
    "origin": "DEL",
    "destination": "BLR",
    "departure_time": "11:45",
    "arrival_time": "14:35"
  }
]

For the entire month or so. The user gives the flight origin, and the code should fetch the origin, destination, departure, arrival, and price for the entire month, without failing for any provided input and robust to any input, and the results should be saved locally so I can check whether it is working.

@Shuaib11-Github
Author

I got the below when I changed to headless=True for the second run:

[INIT].... → Crawl4AI 0.4.22
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in crawl_web at line 899 in crawl_web (..\anaconda3\envs\crawl\lib\site- │
│ packages\crawl4ai\async_crawler_strategy.py): │
│ Error: Wait condition failed: Timeout after 60000ms waiting for selector │
│ 'div[class^='PlaceCard_descriptionContainer__']' │
│ │
│ Code context: │
│ 894 # Handle wait_for condition │
│ 895 if config.wait_for: │
│ 896 try: │
│ 897 await self.smart_wait(page, config.wait_for, timeout=config.page_timeout) │
│ 898 except Exception as e: │
│ 899 → raise RuntimeError(f"Wait condition failed: {str(e)}") │
│ 900 │
│ 901 # Update image dimensions if needed │
│ 902 if not self.browser_config.text_only: │
│ 903 update_image_dimensions_js = load_js_script("update_image_dimensions") │
│ 904 try:

@unclecode
Owner

unclecode commented Dec 17, 2024

@Shuaib11-Github
1/ Did you start using the managed browser?
2/ Looking at the structure of the data you need, I see that it does not come entirely from the links you provided. Those links are insufficient because they only contain some packages. To obtain your data, you should search for the specific date and time. I will share an example of such a link.

https://www.skyscanner.co.in/transport/flights/del/blr/250101/250201/?adultsv2=1&cabinclass=economy&childrenv2=&inboundaltsenabled=false&outboundaltsenabled=false&preferdirects=false&rtn=1&priceSourceId=&priceTrace=202412151014*I*DEL*BLR*20250101*goib*AI%7C202412151014*I*BLR*DEL*20250201*goib*6E&qp_prevCurrency=INR&qp_prevPrice=16287&qp_prevProvider=ins_month

Are you referring to extracting information from this page? If so, that means you build the URL dynamically in your application and then pass it to Crawl4AI for extraction, is that correct?

[image]
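
If that is the case, the date-stamped URL can be assembled from the user's inputs before each crawl. A minimal sketch, assuming the /transport/flights/<from>/<to>/<yymmdd>/ pattern visible in the URLs above (build_skyscanner_url is an illustrative helper, and the query parameters are copied from this thread, not a documented API):

from datetime import date, timedelta

def build_skyscanner_url(origin: str, destination: str, day: date) -> str:
    # URL pattern inferred from the links shared in this thread.
    return (
        "https://www.skyscanner.co.in/transport/flights/"
        f"{origin.lower()}/{destination.lower()}/{day.strftime('%y%m%d')}/"
        "?adults=1&adultsv2=1&cabinclass=economy&rtn=0"
    )

# One URL per day for a month, each passed to crawler.arun() in turn.
start = date(2025, 1, 1)
urls = [build_skyscanner_url("DEL", "BLR", start + timedelta(days=i)) for i in range(31)]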

@Shuaib11-Github
Author

Shuaib11-Github commented Dec 17, 2024 via email
