[BUG] Error: Page.wait_for_function: EvalError due to Content Security Policy restrictions #370

Closed
HamdiBarkous opened this issue Dec 25, 2024 · 3 comments
Labels: bug (Something isn't working)

HamdiBarkous commented Dec 25, 2024

Issue:
The scraper encounters an EvalError while attempting to crawl https://www.tradingview.com/broker/FOREXcom/. The error is triggered by the page's Content Security Policy (CSP), which does not allow 'unsafe-eval' as a script source.

Code snippet to reproduce:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://www.tradingview.com/broker/FOREXcom/',
        )
        print(result.markdown)

asyncio.run(main())

Observed Error:

Error: Page.wait_for_function: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the Content Security Policy directive.

Complete Error Trace:

[ERROR]... × https://www.tradingview.com/broker/FOREXcom/... | Error:
× Unexpected error in _crawl_web at line 528 in wrap_api_call (.venv/lib/python3.10/site-packages/playwright/_impl/_connection.py):
  Error: Page.wait_for_function: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src https://static.tradingview.com/static/ blob: https://*.ampproject.org/ https://*.paypal.com/ https://platform.twitter.com/ https://platform.x.com/ https://songbird.cardinalcommerce.com/edge/v1/ https://checkout.razorpay.com/ https://cdn.checkout.com/ 'nonce-v+WIeNdKFxEFsPPe9saCNA=='".

  at eval (<anonymous>)
  at predicate (eval at evaluate (:234:30), <anonymous>:11:37)
  at next (eval at evaluate (:234:30), <anonymous>:32:31)

  Code context:
  523           parsed_st = _extract_stack_trace_information_from_stack(st, is_internal)
  524           self._api_zone.set(parsed_st)
  525           try:
  526               return await cb()
  527           except Exception as error:
  528 →             raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
  529           finally:
  530               self._api_zone.set(None)
  531
  532       def wrap_api_call_sync(
  533           self, cb: Callable[[], Any], is_internal: bool = False
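For context, here is a minimal Playwright-only sketch of the failure and the usual Playwright-level workaround, bypass_csp=True on the browser context. This is an illustration only, not crawl4ai's actual fix (which is not shown in this thread):

# Minimal Playwright-only sketch (independent of crawl4ai). The page's CSP
# refuses eval'd strings, which is what wait_for_function's string predicate
# relies on; creating the context with bypass_csp=True exempts Playwright's
# injected scripts from CSP enforcement.
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(bypass_csp=True)  # CSP no longer blocks eval
        page = await context.new_page()
        await page.goto("https://www.tradingview.com/broker/FOREXcom/")
        # Without bypass_csp=True this call raises the EvalError shown above.
        await page.wait_for_function("() => document.readyState === 'complete'")
        await browser.close()

asyncio.run(main())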

unclecode self-assigned this Dec 26, 2024

unclecode (Owner) commented:
@HamdiBarkous Thanks for the report. Yes, that's a bug, and I have already resolved it; the fix will ship in the next version, 0.4.24.

unclecode added the bug label Dec 26, 2024
mozou commented Dec 27, 2024

I also encountered the same problem, though my case is a little unusual. I ran the code from https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd?usp=sharing#scrollTo=qUBKGpn3yZQN directly.

The first run with headless=True succeeded, and the second run with headless=False also succeeded, but from the third run onward the execution failed regardless of the headless setting.

import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def crawl_dynamic_content_pages_method_3():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")

    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []

        js_next_page = """
        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
        if (commits.length > 0) {
            window.firstCommit = commits[0].textContent.trim();
        }
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                bypass_cache=True,
                headless=False,
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")


if __name__ == "__main__":
    asyncio.run(crawl_dynamic_content_pages_method_3())

Complete Error Trace:

D:\python-project\Crawl4AI -learning\page_test.py:47: DeprecationWarning: Cache control boolean flags are deprecated and will be removed in version 0.5.0. Use 'cache_mode' parameter instead.
  result = await crawler.arun(
[ERROR]... × https://github.com/microsoft/TypeScript/commits/ma... | Error: 
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 528 in wrap_api_call (E:\python-project\lib\site-                            │
│ packages\playwright\_impl\_connection.py):                                                                            │
│   Error: Page.wait_for_function: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not   │
│ an allowed source of script in the following Content Security Policy directive: "script-src                           │
│ github.githubassets.com".                                                                                             │
│                                                                                                                       │
│   at eval (<anonymous>)                                                                                               │
│   at predicate (eval at evaluate (:234:30), <anonymous>:11:37)                                                        │
│   at next (eval at evaluate (:234:30), <anonymous>:32:31)                                                             │
│                                                                                                                       │
│   Code context:                                                                                                       │
│   523           parsed_st = _extract_stack_trace_information_from_stack(st, is_internal)                              │
│   524           self._api_zone.set(parsed_st)                                                                         │
│   525           try:                                                                                                  │
│   526               return await cb()                                                                                 │
│   527           except Exception as error:                                                                            │
│   528 →             raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None                          │
│   529           finally:                                                                                              │
│   530               self._api_zone.set(None)                                                                          │
│   531                                                                                                                 │
│   532       def wrap_api_call_sync(                                                                                   │
│   533           self, cb: Callable[[], Any], is_internal: bool = False                                                │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
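As an aside on the DeprecationWarning at the top of that trace: below is a minimal sketch of the replacement parameter it points to, assuming the CacheMode enum that recent crawl4ai versions export for this purpose:

# Sketch of the non-deprecated cache parameter named by the warning above.
# Assumes crawl4ai exports a CacheMode enum with a BYPASS member.
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://github.com/microsoft/TypeScript/commits/main",
            cache_mode=CacheMode.BYPASS,  # replaces the deprecated bypass_cache=True
        )
        print(result.success)

asyncio.run(main())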

unclecode (Owner) commented:

@mozou When you run it on Colab, you can't set headless mode to false because no graphical display is available there. I suspect the later failures came from Colab's memory. I checked the Colab notebook, updated some of the code that was still using the old syntax, and tested everything. Everything works well now, so please give it another try.
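Not from this thread, but for completeness: if a headed (headless=False) browser is genuinely needed in a display-less environment such as Colab, one common workaround is a virtual display via xvfb, for example with the pyvirtualdisplay package:

# Hypothetical workaround sketch, not part of this issue's fix: run a virtual
# X display so headless=False works without a real screen. Assumes xvfb
# (apt-get install -y xvfb) and pyvirtualdisplay (pip install pyvirtualdisplay)
# are installed first.
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1920, 1080))
display.start()
# ... run the crawler with headless=False here ...
display.stop()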
