How to "track" async calls? #319

Open
b-sai opened this issue Dec 4, 2024 · 1 comment

b-sai commented Dec 4, 2024

I have a series of links whose redirects I am trying to analyze. A simple 301 redirect check is not working, so I am using Playwright.

I know I can read page.url in a hook to get the final URL, but I need a way to track both the original URL and the final URL for each link I have.

How can I pass/store this metadata in the hooks?

@unclecode (Owner) commented

Hi @b-sai, thanks for trying crawl4ai. You can pass custom metadata as kwargs when triggering your hooks, then retrieve or modify it inside the hook callback. For example, you can include something like original_url=... in the execute_hook() call before navigation, and then read or update the final URL in the after_goto or on_execution_started hooks. The hooks accept arbitrary keyword arguments, so you can pass a dictionary or extra parameters containing the original URL.

For instance:

async def before_goto_hook(page, context=None, **kwargs):
    # kwargs carries whatever metadata you passed to execute_hook(),
    # e.g. original_url, session_id, etc.
    original_url = kwargs.get("original_url")
    # Store original_url somewhere if needed, or just print it
    print(f"Original URL: {original_url}")

async def after_goto_hook(page, context=None, **kwargs):
    original_url = kwargs.get("original_url")
    final_url = page.url  # page.url is the post-redirect URL at this point
    print(f"Original URL: {original_url}, Final URL: {final_url}")
    # You can return these values or store them in shared state

# Register the hooks on your crawler strategy instance
crawler_strategy.set_hook('before_goto', before_goto_hook)
crawler_strategy.set_hook('after_goto', after_goto_hook)

# When calling your crawl method:
await crawler_strategy.execute_hook('before_goto', page, context=context, original_url="http://example.com")
await page.goto("http://example.com")
await crawler_strategy.execute_hook('after_goto', page, context=context, original_url="http://example.com")

By doing this, you are free to pass original_url or any other metadata you need through the execute_hook() calls. Each hook receives the kwargs, so you can store and retrieve the information you need across hooks.
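
For example, to collect the results for many links instead of printing them, you could write into a shared dictionary from the hook (a minimal sketch building on the signatures above; record_final_url and redirect_map are just illustrative names, not crawl4ai APIs):

redirect_map = {}  # original_url -> final_url, one entry per link you crawl

async def record_final_url(page, context=None, **kwargs):
    # Same signature as after_goto_hook above, but it writes into the dict
    # instead of printing.
    original_url = kwargs.get("original_url")
    redirect_map[original_url] = page.url  # page.url has already followed redirects

crawler_strategy.set_hook('after_goto', record_final_url)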

Another approach is to use session IDs. Use a session_id to maintain state for each URL:

# Pass metadata through session IDs
session_id = await crawler.create_session()
result = await crawler.arun(
    url=original_url,
    session_id=session_id,
    # store_original_url / store_final_url are your own helpers, not crawl4ai APIs
    before_goto=lambda page: store_original_url(page, original_url),
    after_goto=lambda page: store_final_url(page, page.url)
)
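
The store_original_url and store_final_url helpers above are placeholders for whatever bookkeeping you want to do. A minimal sketch that records each original/final pair in a shared dict, assuming the links are crawled sequentially (with concurrent sessions you would key the shared state by session_id instead):

results = {}              # original_url -> final_url
_pending_original = None  # set before navigation, consumed after it

async def store_original_url(page, original_url):
    # Runs via the before_goto hook: remember which link we are about to follow.
    global _pending_original
    _pending_original = original_url
    results[original_url] = None

async def store_final_url(page, final_url):
    # Runs via the after_goto hook: final_url is page.url, i.e. after redirects.
    results[_pending_original] = final_url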

Hopefully this provides the help you need.

@unclecode unclecode self-assigned this Dec 9, 2024