Some websites don't show the results #346

Closed
luisferreira93 opened this issue Dec 13, 2024 · 5 comments

@luisferreira93

Hello, I am integrating Crawl4AI with Scrapy, and for some websites I get different results. For example:

Link: https://www.wafdbank.com/customer-service/faq
Result: 62c877a5211801b8

Link: https://quotes.toscrape.com
Result: #  [Quotes to Scrape](/)

[Login](/login)

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” by Albert Einstein [(about)](/author/Albert-Einstein)

Tags: [change](/tag/change/page/1/) [deep-thoughts](/tag/deep-thoughts/page/1/) [thinking](/tag/thinking/page/1/) [world](/tag/world/page/1/)

etc. (...)

So the results differ: for the second website I actually get the content, while for the first I don't. What I find weird is that this used to work for both websites.
The version I am using is 0.3.74.

My code is below (still in progress):


from crawl4ai import AsyncWebCrawler
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler"
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, urls=["https://www.wafdbank.com/customer-service/faq"], *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls

    # Passing arguments and setting the depth, max_links and urls configs here
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.settings.set("DEPTH_LIMIT", 1, priority="spider",)
        return spider

    # To crawl the start_url and avoid duplicating it
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)

    # For max links we probably need a manual implementation for now; maybe override the extract_links method of LinkExtractor
    async def parse_start_url(self, response):
        await self.process_url(response.url)
        if self.should_stop_crawling():
            self.logger.info("DEPTH_LIMIT is 0. Stopping crawl.")
            raise CloseSpider(reason="DEPTH_LIMIT reached 0, stopping spider.")

    async def parse_item(self, response):
        await self.process_url(response.url)

    def write_results_to_file(self, url, markdown):
        file_path = "results_markdown.txt"
        with open(file_path, "a", encoding="utf-8") as file:
            file.write(f"Link: {url}\nResult: {markdown}\n")

    async def process_url(self, url):
        async with AsyncWebCrawler(verbose=False) as crawler:
            result = await crawler.arun(
                url=url,
            )
        self.write_results_to_file(url, result.markdown)

    def should_stop_crawling(self):
        depth_limit = self.settings.getint("DEPTH_LIMIT", default=-1)
        return depth_limit == 0



Can someone help here, please? Thank you.
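A minimal standalone check, outside Scrapy, can show whether the empty result comes from Crawl4AI itself or from the integration. This is only a sketch against the 0.3.x API, assuming arun() accepts the bypass_cache flag and the result exposes success, error_message and markdown:

import asyncio

from crawl4ai import AsyncWebCrawler


async def check(url: str) -> None:
    # Crawl a single URL outside Scrapy, skipping the local cache so a
    # stale or empty cached entry cannot mask the real fetch.
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)
    print("success:", result.success)
    print("error:", result.error_message)
    print("markdown length:", len(result.markdown or ""))


if __name__ == "__main__":
    asyncio.run(check("https://www.wafdbank.com/customer-service/faq"))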

@luisferreira93
Author

After checking #338, I used bypass_cache=True and the content is extracted again. I will try to bump to 0.4.2 once I have a final version of this (since using the cache is a win).
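For reference, a sketch of the cache-bypass call in both API generations, assuming the 0.3.x bypass_cache flag and the 0.4.x CacheMode enum; CacheMode.BYPASS forces a fresh fetch instead of returning a cached copy:

from crawl4ai import AsyncWebCrawler, CacheMode


async def fetch_fresh(url: str) -> str:
    async with AsyncWebCrawler(verbose=False) as crawler:
        # 0.3.x style: the boolean flag on arun().
        # result = await crawler.arun(url=url, bypass_cache=True)

        # 0.4.x style: the CacheMode enum replaces the boolean cache flags.
        result = await crawler.arun(url=url, cache_mode=CacheMode.BYPASS)
    return result.markdown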

@luisferreira93
Author

On another note, I am now getting an error when using AsyncWebCrawler and I don't understand why. Can someone help?

from crawl4ai import AsyncWebCrawler, CacheMode
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess


class WebCrawlerSpider(CrawlSpider):
    name = "webcrawler1"
    allowed_domains = ["quotes.toscrape.com"]
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def __init__(self, urls=["https://quotes.toscrape.com"], *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = urls

    # Passing arguments and setting the crawl_depth config here
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        spider.settings.set("DEPTH_LIMIT", 0, priority="spider",)
        return spider
    
    async def parse_item(self, response):
        await self.process_url(response.url)

    # To crawl the start_url and avoid duplicating it
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url)

    # For max links we probably need a manual implementation for now; maybe override the extract_links method of LinkExtractor
    async def parse_start_url(self, response):
        await self.process_url(response.url)
        #if self.should_stop_crawling():
        #    self.logger.info("DEPTH_LIMIT is 0. Stopping crawl.")
        #    raise CloseSpider(reason="DEPTH_LIMIT reached 0, stopping spider.")

    def write_results_to_file(self, url, markdown):
        file_path = "results_markdown.txt"
        with open(file_path, "a", encoding="utf-8") as file:
            file.write(f"Link: {url}\nResult: {markdown}\n")

    async def process_url(self, url):
        # crawler = AsyncWebCrawler(always_bypass_cache=True)
        # result = await crawler.arun(
        #     url=url,
        # )
        async with AsyncWebCrawler(verbose=True) as crawler:
            result = await crawler.arun(
                url="https://www.kidocode.com/degrees/technology",
                cache_mode=CacheMode.DISABLED,
            )
        self.write_results_to_file(url, result.markdown)

# --- connector script (separate file) that launches the spider ---
import asyncio

from scrapy.crawler import CrawlerProcess

from scrapy_webcrawler.spiders.spider import WebCrawlerSpider


class WebCrawlerConnector:
    
    async def start(self) -> int:
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
        process.crawl(
            WebCrawlerSpider,
            urls=["https://quotes.toscrape.com"],
            crawl_depth=0,
            max_links_per_page=2,
        )
        process.start()
        return 1


async def main() -> None:
    """Start the connector."""
    connector = WebCrawlerConnector()
    await connector.start()


if __name__ == "__main__":
    asyncio.run(main())

And the stacktrace:

ERROR:scrapy.core.scraper:Spider error processing <GET https://quotes.toscrape.com/> (referer: None)
Traceback (most recent call last):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/defer.py", line 295, in aiter_errback
    yield await it.__anext__()
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 374, in __anext__
    return await self.data.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/python.py", line 355, in _async_chain
    async for o in as_async_generator(it):
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/utils/asyncgen.py", line 14, in as_async_generator
    async for r in it:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/referer.py", line 355, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/urllength.py", line 30, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spidermiddlewares/depth.py", line 35, in process_spider_output_async
    async for r in result or ():
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/core/spidermw.py", line 118, in process_async
    async for r in iterable:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/scrapy/spiders/crawl.py", line 122, in _parse_response
    cb_res = await cb_res
             ^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/scrapy_webcrawler/scrapy_webcrawler/spiders/spider.py", line 37, in parse_start_url
    await self.process_url(response.url)
  File "/Users/joao.martins/Downloads/test/scrapy_webcrawler/scrapy_webcrawler/spiders/spider.py", line 52, in process_url
    async with AsyncWebCrawler(verbose=True) as crawler:
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_webcrawler.py", line 111, in __aenter__
    await self.crawler_strategy.__aenter__()
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 265, in __aenter__
    await self.start()
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py", line 273, in start
    self.playwright = await async_playwright().start()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 51, in start
    return await self.__aenter__()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/Downloads/test/.venv/lib/python3.11/site-packages/playwright/async_api/_context_manager.py", line 40, in __aenter__
    done, _ = await asyncio.wait(
              ^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/.pyenv/versions/3.11.2/lib/python3.11/asyncio/tasks.py", line 418, in wait
    return await _wait(fs, timeout, return_when, loop)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/joao.martins/.pyenv/versions/3.11.2/lib/python3.11/asyncio/tasks.py", line 525, in _wait
    await waiter
RuntimeError: await wasn't used with future

I am trying to run it via WebCrawlerConnector, but when the Scrapy spider launches, the Crawl4AI code basically doesn't work. Thank you in advance.
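One detail worth checking in a setup like this (an assumption about the environment, not the fix the maintainer shipped): Playwright, which Crawl4AI drives under the hood, needs an asyncio event loop, and Scrapy only guarantees one when the TWISTED_REACTOR setting selects its asyncio reactor. A sketch of a synchronous launcher using that standard Scrapy setting:

from scrapy.crawler import CrawlerProcess

from scrapy_webcrawler.spiders.spider import WebCrawlerSpider


def run_spider() -> None:
    # Ask Scrapy to install Twisted's asyncio-based reactor so asyncio code
    # awaited inside spider callbacks (Playwright via Crawl4AI) shares the
    # same event loop as the Scrapy engine.
    process = CrawlerProcess({
        "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    })
    process.crawl(WebCrawlerSpider, urls=["https://quotes.toscrape.com"])
    process.start()  # blocking; runs the reactor until the crawl finishes


if __name__ == "__main__":
    run_spider()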

@unclecode
Owner

@luisferreira93 Please update to 0.4.21; this has been resolved in that release. Do you still need my help with your first query, or is that one already resolved?
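If it helps when migrating, a sketch of the same call using the configuration objects introduced around 0.4.2x; BrowserConfig and CrawlerRunConfig are assumed from the 0.4.x API, and parameter names may differ slightly between patch releases:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def fetch(url: str) -> str:
    # Assumed 0.4.2x-style configuration objects; keyword arguments such
    # as verbose=... move from the crawler itself into these configs.
    browser_cfg = BrowserConfig(headless=True, verbose=True)
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url=url, config=run_cfg)
    return result.markdown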

@luisferreira93
Author

I already found a solution, thank you @unclecode

@unclecode
Owner

Today I released 0.4.23; let me know if you face any issues.
