Regarding scraping of dynamic websites like Skyscanner.net #341
Hi @Shuaib11-Github (and anyone else facing similar issues). The problem you’re encountering with Skyscanner and similar dynamic websites is that they employ strong anti-bot and anti-scraping measures. When you try to load the page programmatically, you might pass initial checks like a random user agent, but the website can still detect that it’s not a real browser session or a genuine user. As a result, you hit a “bot detection” wall. I’ve attached images below to illustrate what happens.
Because scenarios like this are common, I’m adding this explanation as a reference tutorial, so whenever someone encounters a similar problem they can refer back to these steps and examples.
Tutorial: Dealing with Anti-Bot Measures
Many modern sites, especially those dealing with travel, e-commerce, or finance, have robust anti-bot systems. They detect non-human browsing patterns and headless browsers. While setting a random user agent often works for simpler pages, you may need a more advanced approach for tougher sites. Key strategies (illustrated in the sketch right after this list):
- Use a managed browser (use_managed_browser=True) so the site sees a real, persistent browser session instead of a bare automation context.
- Point it at a persistent user_data_dir so cookies and any “proof of humanity” survive between runs.
- Randomize the user agent (user_agent_mode="random").
- Run headful (headless=False) the first time, solve any human-verification challenge manually, then reuse the same profile on subsequent runs.
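As a minimal configuration sketch of these strategies, using the BrowserConfig options demonstrated in the full example further down (the user_data_dir path is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Minimal sketch of the strategies above; the profile path is a placeholder.
browser_config = BrowserConfig(
    headless=False,            # headful on the first run, so you can pass human checks
    user_agent_mode="random",  # rotate the user agent per session
    use_managed_browser=True,  # persistent, real-browser session
    browser_type="chromium",
    user_data_dir="/path/to/.user_data_dir",  # reused profile keeps your "human" state
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)

asyncio.run(main())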
In summary: this approach makes Crawl4AI much more versatile, enabling you to tackle even heavily protected sites.
So magic mode doesn't currently work in cases like this?
@unclecode I got the below when I ran the code: self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
@Shuaib11-Github My bad! In the AsyncWebCrawler constructor it should be config=..., not browser_config=...; I've edited it now! @blghtr I’ll add this to magic mode as well. When you set magic=True, it will switch to a managed browser, create a temporary user directory, set a random user agent, and then, once everything is done, either remove the directory or reuse it later (a usage sketch follows).
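For reference, usage would look roughly like the following sketch; this assumes magic is accepted on CrawlerRunConfig (the exact parameter placement may differ by version):

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    # Sketch of the planned behavior: magic=True is meant to handle the managed
    # browser, temporary user directory, and random user agent automatically.
    config = CrawlerRunConfig(magic=True, cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.skyscanner.co.in/", config=config)
        print(result.success)

asyncio.run(main())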
@unclecode With headless=False, I got the below: [INIT].... → Crawl4AI 0.4.22 When I changed to headless=True, I got the below: [INIT].... → Crawl4AI 0.4.22 How can I extract the flight details and make sure they are saved in some format? At least if I can store the details in Markdown, I can then save them as a CSV file. But I need only the data for the respective flights extracted, as per the user's input requests.
@Shuaib11-Github Look at the following code:

import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
# Configure the browser
browser_config = BrowserConfig(
headless=False, # Set to False so you can see what's happening
verbose=True,
user_agent_mode="random",
use_managed_browser=True, # Enables persistent browser sessions
browser_type="chromium",
user_data_dir="/Users/unclecode/.user_data_dir",
)
schema = {
"name": "Skyscanner Place Cards",
"baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
"fields": [
{
"name": "city_name",
"selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
"type": "text",
},
{
"name": "country_name",
"selector": "span[class*='PlaceCard_subName__']",
"type": "text",
},
{
"name": "description",
"selector": "span[class*='PlaceCard_advertLabel__']",
"type": "text",
},
{
"name": "flight_price",
"selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
"type": "text",
},
{
"name": "flight_type",
"selector": "a[data-testid='flights-link'] .BpkText_bpk-text--body-default__",
"type": "text",
},
{
"name": "flight_url",
"selector": "a[data-testid='flights-link']",
"type": "attribute",
"attribute": "href",
},
{
"name": "hotels_url",
"selector": "a[data-testid='hotels-link']",
"type": "attribute",
"attribute": "href",
},
],
}
# Set crawl configuration
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
wait_for="css:div[class^='PlaceCard_descriptionContainer__']",
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.skyscanner.co.in/transport/flights/del/",
config=crawl_config,
)
if result.success:
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
if __name__ == "__main__":
    asyncio.run(main())

[INIT].... → Crawl4AI 0.4.23
[WARNING]. ⚠ Both crawler_config and legacy parameters provided. crawler_config will take precedence.
[FETCH]... ↓ https://www.skyscanner.co.in/transport/flights/del... | Status: True | Time: 1.88s
[SCRAPE].. ◆ Processed https://www.skyscanner.co.in/transport/flights/del... | Time: 265ms
[EXTRACT]. ■ Completed for https://www.skyscanner.co.in/transport/flights/del... | Time: 0.10316416597925127s
[COMPLETE] ● https://www.skyscanner.co.in/transport/flights/del... | Status: True | Total: 2.25s
Successfully extracted 9 companies
{
"country_name": "Saudi Arabia",
"description": "This land is calling. Step into Saudi, the heart of Arabia.",
"flight_url": "https://www.skyscanner.co.in/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501",
"hotels_url": "/transport/flights/del/ruha/?adultsv2=1&cabinclass=economy&childrenv2=&ref=home&rtn=0&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&oym=2501&hotelsselected=true"
}
[INFO].... ℹ Browser process terminated normally | Code: 0

Pay attention to something very important about the first time I run this code. When I pass a new user data directory, I set a breakpoint on the line that checks whether the result is successful. With headless set to False, the code waits, and I can see the browser asking me to prove that I am human. I complete the verification, and once it's approved, the page displays. Then I stop the whole process and run the code again; from that point on, because it reuses the directory I created, which now contains my "human" session state, it works reliably. As you can see, I use the JsonCssExtractionStrategy, and I have been able to extract the data you want in JSON format. It's worth mentioning that I also used
But I need data in the below format for the entire month or so: the user gives the origin of the flight, and the code should fetch the origin, destination, departure, arrival, and price of flights for the entire month, without failing for any provided input and robust to any input. It should also be saved locally so I can check whether it is working.
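As a side note on saving locally: once extracted_content comes back as JSON (as in the run above), writing it out as CSV takes a few lines of standard library code. A sketch, with field names taken from whatever schema produced the JSON:

import csv
import json

def save_to_csv(extracted_content: str, path: str = "flights.csv") -> None:
    # Sketch: persist Crawl4AI's JSON extraction output
    # (result.extracted_content) as a CSV file.
    rows = json.loads(extracted_content)
    if not rows:
        return
    # Take the union of keys across rows, since some cards may lack optional fields.
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# e.g. save_to_csv(result.extracted_content)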
I got the below when I changed to headless=True for the second time: [INIT].... → Crawl4AI 0.4.22
@Shuaib11-Github Are you referring to extracting information from this page? If so, that means you build the URL dynamically in your application and then pass it to Crawl4ai for extraction; is that correct?
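For reference, building the search URL dynamically from user input could look like the following sketch. The YYMMDD path segment and query parameters mirror the example links in this thread; treat them as assumptions, not a documented API:

def build_skyscanner_url(origin: str, destination: str, date: str) -> str:
    # Sketch: one-way search URL built from user input.
    # `date` is YYYY-MM-DD; the path encodes it as YYMMDD ("2024-12-12" -> "241212"),
    # per the example links in this thread.
    yymmdd = date[2:4] + date[5:7] + date[8:10]
    return (
        "https://www.skyscanner.co.in/transport/flights/"
        f"{origin.lower()}/{destination.lower()}/{yymmdd}/"
        "?adultsv2=1&cabinclass=economy&rtn=0&preferdirects=false"
    )

# e.g. build_skyscanner_url("DEL", "BLR", "2024-12-12")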
Basically, the user inputs the origin of the flight, and based on that, all available flights for that month to different locations need to be extracted.
So I need the data as below:
Origin, Destination, Departure time, Arrival time, Date, Price
On Tue, 17 Dec 2024, 2:07 pm, UncleCode wrote:
@Shuaib11-Github <https://github.com/Shuaib11-Github>
1/ Did you start to use managed browser?
2/ Looking at the structure of the data you need, I see that it does not
come entirely from the links you provided. Those links are insufficient
because they only contain some packages. To obtain your data, you should
search for that specific date and time. I will share an example of the
links.
https://www.skyscanner.co.in/transport/flights/del/blr/250101/250201/?adultsv2=1&cabinclass=economy&childrenv2=&inboundaltsenabled=false&outboundaltsenabled=false&preferdirects=false&rtn=1&priceSourceId=&priceTrace=202412151014*I*DEL*BLR*20250101*goib*AI%7C202412151014*I*BLR*DEL*20250201*goib*6E&qp_prevCurrency=INR&qp_prevPrice=16287&qp_prevProvider=ins_month
Are you referring to extracting information from this page? If so, this
means you build the URL dynamically in your application and then pass it to
Crawl4ai for extraction, is that correct?
I was trying to scrape the content from Skyscanner.net with the fields Origin, Destination, Price, Departure time, and Arrival time, but it gives the error below:
Please provide the following travel details:
Departure Airport (e.g., JFK): DEL
Date of Departure (YYYY-MM-DD): 2024-12-12
Hour of Departure (24-hour format, e.g., 14:00): 16:05
Destination Airport (e.g., LAX): BLR
Details saved to CSV file successfully.
[INIT].... → Crawl4AI 0.4.1
[ERROR]... × https://www.skyscanner.co.in/transport/flights/del... | Error:
┌───────────────────────────────────────────────────────────────────────────────┐
│ × async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.          │
│ =========================== logs =========================== │
│ "load" event fired │
│ ============================================================ │
└───────────────────────────────────────────────────────────────────────────────┘
Failed to crawl the URL: https://www.skyscanner.co.in/transport/flights/del/blr/241212/?adults=1&adultsv2=1&cabinclass=economy&children=0&childrenv2=&inboundaltsenabled=false&infants=0&outboundaltsenabled=false&preferdirects=false&ref=home&rtn=0
Error: async_crawler_strategy.py:_crawl_web(): Timeout 30000ms exceeded.
=========================== logs ===========================
"load" event fired
How can we fix this so that it runs seamlessly? There is also a "show more results" button that reveals the remaining data. How can we extract all of the data present on the website using Crawl4ai?
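One possible approach, as a sketch: raise the page timeout past the 30s that was exceeded in the error above, and use js_code in CrawlerRunConfig to click the button before extraction. The text-based button lookup below is an assumption about the page; inspect the DOM and adjust the selector:

from crawl4ai import CacheMode, CrawlerRunConfig

# Sketch: click "show more results" a few times, then extract.
load_more_js = """
(async () => {
    for (let i = 0; i < 5; i++) {
        const btn = [...document.querySelectorAll('button')]
            .find(b => b.textContent.toLowerCase().includes('show more results'));
        if (!btn) break;
        btn.click();
        await new Promise(r => setTimeout(r, 1500));  // let new results render
    }
})();
"""

crawl_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    js_code=load_more_js,
    page_timeout=90000,  # raise the navigation timeout that was exceeded
)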