Some websites don't show the results #346

Closed
luisferreira93 opened this issue Dec 13, 2024 · 5 comments

@luisferreira93

Hello, I am integrating Crawl4AI with Scrapy, and for some websites I get different results. For example:

Link: https://www.wafdbank.com/customer-service/faq
Result: 62c877a5211801b8

Link: https://quotes.toscrape.com
Result: #  [Quotes to Scrape](/)

[Login](/login)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein [(about)](/author/Albert-Einstein)

Tags: [change](/tag/change/page/1/) [deep-thoughts](/tag/deep-thoughts/page/1/) [thinking](/tag/thinking/page/1/) [world](/tag/world/page/1/)

etc. (...)

So the results differ: for the second website I actually get the content, while for the first I don't. What I find weird is that this used to work for both websites.
The version I am using is 0.3.74.

My code is below (still in progress):


from crawl4ai import AsyncWebCrawler
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, urls=["https://www.wafdbank.com/customer-service/faq"], *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls

    # Passing arguments and setting the depth, max_links and urls configs here
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.settings.set("DEPTH_LIMIT", 1, priority="spider",)
        return spider

    # To crawl the start_url and avoid duplicating it
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)

    # For max links we probably need a manual implementation for now; maybe override the extract_links method of LinkExtractor
    async def parse_start_url(self, response):
        await self.process_url(response.url)
        if self.should_stop_crawling():
            self.logger.info("DEPTH_LIMIT is 0. Stopping crawl.")
            raise CloseSpider(reason="DEPTH_LIMIT reached 0, stopping spider.")

    async def parse_item(self, response):
        await self.process_url(response.url)

    def write_results_to_file(self, url, markdown):
        file_path = "results_markdown.txt"
        with open(file_path, "a", encoding="utf-8") as file:
            file.write(f"Link: {url}\nResult: {markdown}\n")

    async def process_url(self, url):
        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun(
                url=url,
            )
        self.write_results_to_file(url, result.markdown)

    def should_stop_crawling(self):
        depth_limit = self.settings.getint("DEPTH_LIMIT", default=-1)
        return depth_limit == 0



Can someone help here, please? Thank you.
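A minimal standalone check, outside Scrapy, can show whether the empty result comes from Crawl4AI itself or from the integration. This is only a sketch against the 0.3.x API, assuming arun() accepts the bypass_cache flag and the result exposes success, error_message and markdown:

import asyncio

from crawl4ai import AsyncWebCrawler


async def check(url: str) -> None:
    # Crawl a single URL outside Scrapy, skipping the local cache so a
    # stale or empty cached entry cannot mask the real fetch.
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)
    print("success:", result.success)
    print("error:", result.error_message)
    print("markdown length:", len(result.markdown or ""))


if __name__ == "__main__":
    asyncio.run(check("https://www.wafdbank.com/customer-service/faq"))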

@luisferreira93
Author

After checking #338, I used bypass_cache=True and the content is extracted again. I will try to bump to 0.4.2 once I have a final version of this (since using the cache is a win).
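For reference, a sketch of the cache-bypass call in both API generations, assuming the 0.3.x bypass_cache flag and the 0.4.x CacheMode enum; CacheMode.BYPASS forces a fresh fetch instead of returning a cached copy:

from crawl4ai import AsyncWebCrawler, CacheMode


async def fetch_fresh(url: str) -> str:
    async with AsyncWebCrawler(verbose=False) as crawler:
        # 0.3.x style: the boolean flag on arun().
        # result = await crawler.arun(url=url, bypass_cache=True)

        # 0.4.x style: the CacheMode enum replaces the boolean cache flags.
        result = await crawler.arun(url=url, cache_mode=CacheMode.BYPASS)
    return result.markdown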

@luisferreira93
Author

On another note, I am now getting an error when using AsyncWebCrawler and I don't understand why. Can someone help?

from crawl4ai import AsyncWebCrawler, CacheMode
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler1"
    allowed_domains = ["quotes.toscrape.com"]
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, urls=["https://quotes.toscrape.com"], *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls

    # Passing arguments and setting the crawl_depth config here
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.settings.set("DEPTH_LIMIT", 0, priority="spider",)
        return spider
    
    async def parse_item(self, response):
        await self.process_url(response.url)

    # To crawl the start_url and avoid duplicating it
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)

    # For max links we probably need a manual implementation for now; maybe override the extract_links method of LinkExtractor
    async def parse_start_url(self, response):
        await self.process_url(response.url)
        #if self.should_stop_crawling():
        #    self.logger.info("DEPTH_LIMIT is 0. Stopping crawl.")
        #    raise CloseSpider(reason="DEPTH_LIMIT reached 0, stopping spider.")

    def write_results_to_file(self, url, markdown):
        file_path = "results_markdown.txt"
        with open(file_path, "a", encoding="utf-8") as file:
            file.write(f"Link: {url}\nResult: {markdown}\n")

    async def process_url(self, url):
        # crawler = AsyncWebCrawler(always_bypass_cache=True)
        # result = await crawler.arun(
        #     url=url,
        # )
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url="https://www.kidocode.com/degrees/technology",
                cache_mode=CacheMode.DISABLED,
            )
        self.write_results_to_file(url, result.markdown)

# --- connector script (separate file) that launches the spider ---
import asyncio

from scrapy.crawler import CrawlerProcess

from scrapy_webcrawler.spiders.spider import WebCrawlerSpider


class WebCrawlerConnector:
    
    async def start(self) -> int:
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
        process.crawl(
            WebCrawlerSpider,
            urls=["https://quotes.toscrape.com"],
            crawl_depth=0,
            max_links_per_page=2,
        )
        process.start()
        return 1


async def main() -> None:
    """Start the connector."""
    connector = WebCrawlerConnector()
    await connector.start()


if __name__ == "__main__":
    asyncio.run(main())

And the stacktrace:

ERROR:scrapy.core.scraper:Spider error processing <GET https://quotes.toscrape.com/> (referer: None)
Traceback (most recent call last):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/defer.py", line 295, in aiter_errback
    yield await it.__anext__()
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 355, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 30, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 35, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spiders/crawl.py", line 122, in _parse_response
    cb_res = await cb_res
             ^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/scrapy_webcrawler/scrapy_webcrawler/spiders/spider.py", line 37, in parse_start_url
    await self.process_url(response.url)
  File "/Users/joao.martins/Downloads/test/scrapy_webcrawler/scrapy_webcrawler/spiders/spider.py", line 52, in process_url
    async with AsyncWebCrawler(verbose=True) as crawler:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_webcrawler.py", line 111, in __aenter__
    await self.crawler_strategy.__aenter__()
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 265, in __aenter__
    await self.start()
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 273, in start
    self.playwright = await async_playwright().start()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 51, in start
    return await self.__aenter__()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 40, in __aenter__
    done, _ = await asyncio.wait(
              ^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/.pyenv/versions/3.11.2/lib/python3.11/asyncio/tasks.py", line 418, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/.pyenv/versions/3.11.2/lib/python3.11/asyncio/tasks.py", line 525, in _wait
    await waiter
RuntimeError: await wasn't used with future

I am trying to run it via WebCrawlerConnector, but when the Scrapy spider launches, the Crawl4AI code basically doesn't work. Thank you in advance.
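One detail worth checking in a setup like this (an assumption about the environment, not the fix the maintainer shipped): Playwright, which Crawl4AI drives under the hood, needs an asyncio event loop, and Scrapy only guarantees one when the TWISTED_REACTOR setting selects its asyncio reactor. A sketch of a synchronous launcher using that standard Scrapy setting:

from scrapy.crawler import CrawlerProcess

from scrapy_webcrawler.spiders.spider import WebCrawlerSpider


def run_spider() -> None:
    # Ask Scrapy to install Twisted's asyncio-based reactor so asyncio code
    # awaited inside spider callbacks (Playwright via Crawl4AI) shares the
    # same event loop as the Scrapy engine.
    process = CrawlerProcess({
        "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    })
    process.crawl(WebCrawlerSpider, urls=["https://quotes.toscrape.com"])
    process.start()  # blocking; runs the reactor until the crawl finishes


if __name__ == "__main__":
    run_spider()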

@unclecode
Owner

@luisferreira93 Please update to 0.4.21; this has been resolved in that release. Do you still need my help with your first query, or is that one already resolved?
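If it helps when migrating, a sketch of the same call using the configuration objects introduced around 0.4.2x; BrowserConfig and CrawlerRunConfig are assumed from the 0.4.x API, and parameter names may differ slightly between patch releases:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def fetch(url: str) -> str:
    # Assumed 0.4.2x-style configuration objects; keyword arguments such
    # as verbose=... move from the crawler itself into these configs.
    browser_cfg = BrowserConfig(headless=True, verbose=True)
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
    return result.markdown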

@luisferreira93
Author

I already found a solution, thank you @unclecode

@unclecode
Owner

Today I released 0.4.23; let me know if you face any issues.
