agent run prompt=[Scrubbed due to 'Cookie'] #547

Open
IsaaacD opened this issue Dec 26, 2024 · 2 comments
Labels
question Further information is requested

Comments


IsaaacD commented Dec 26, 2024

I can't find where this is coming from or what it means: agent4 run prompt=[Scrubbed due to 'Cookie']. It happens when I run Pydantic AI with logging turned on. I'm using Playwright, but I couldn't find any "Scrubbed" keyword in either the Pydantic AI or Playwright source, nor any hits on Google. Where is this coming from, and could it impact the results? If I prompt Qwen in Ollama directly, it returns the JSON I expect.

This was adapted from this notebook, but I'm attempting Pydantic AI for some extra functionality: https://github.com/curiousily/AI-Bootcamp/blob/master/20.scraping-with-llm.ipynb (the last activity in the notebook).

Any help would be appreciated in understanding whether the prompt is actually being scrubbed and how to remedy it, or maybe there's another way of tackling this issue? TIA.

import asyncio
from pprint import pprint
from typing import List, Optional, TypedDict, Union
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.ollama import OllamaModel
import html2text
import nest_asyncio
import logfire
from logging import basicConfig
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from tqdm import tqdm
from devtools import debug
nest_asyncio.apply()

logfire.configure(send_to_logfire='if-token-present')
logfire.ConsoleOptions.min_log_level ='trace'
logfire.ConsoleOptions.verbose = True
basicConfig(handlers=[logfire.LogfireLoggingHandler()])
# %%
USER_AGENT = "Mozilla/5.01 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"


#MODEL = "llama3.2:latest"
MODEL = "qwen2.5-coder:14b"

llm = OllamaModel(model_name=MODEL)

# %%
SYSTEM_PROMPT = """
You're an expert text extractor. You extract information from webpage content.
Always extract data without changing it and any other output.
Ignore everything but car information.
"""


def create_scrape_prompt(page_content: str) -> str:
    return f"""
Convert the following list of cars into valid JSON format, including details such as model, features, horsepower,
price, mileage, and year for each car. The list includes:
```
{page_content}
```

""".strip()


playwright = None
browser = None
async def fetch_page(url, user_agent=USER_AGENT) -> str:
    global playwright, browser
    if playwright == None:
        playwright = await async_playwright().start()
        browser = await playwright.chromium.launch()

    context = await browser.new_context(user_agent=USER_AGENT)

    page = await context.new_page()
    await page.goto(url, timeout=10000)
    content = await page.content()

    markdown_converter = html2text.HTML2Text()
    markdown_converter.body_width = 0
    markdown_converter.ignore_links = False
    #return content
    return markdown_converter.handle(content)

auto_content = ''
async def inner_fetch_page():
    global auto_content
    print('fetching page')
    auto_content = await fetch_page("https://www.autoscout24.com/lst?atype=C&cy=D%2CA%2CB%2CE%2CF%2CI%2CL%2CNL&desc=0&fregfrom=2018&gear=M&powerfrom=309&powerto=478&powertype=hp&search_id=1tih4oks815&sort=standard&ustate=N%2CU")
    print('fetched page', auto_content)

asyncio.run(inner_fetch_page())
class CarListing(TypedDict):
    """Information about a car listing"""

    make: str | None = Field("Make of the car e.g. Toyota", examples=["Toyota", "Lexus"])
    model: str | None = Field("Model of the car, maximum 3 words e.g. Land Cruiser", examples=["Land Cruiser", "RC F Advantage"])
    horsepower: str | None = Field("Horsepower (HP) of the engine e.g. 231", examples=["231", "467"])
    price: str | None  = Field("Price in euro e.g. 34000", examples=["34,000", "45000"])
    mileage:  str | None  = Field("Number of kilometers on the odometer e.g. 73400", examples=["73400", "12,000"])
    year:  str | None  = Field("Year of registration (if available) e.g. 2015" , examples=["2015", "2020"])
    url: str | None = Field(
        "Url to the listing e.g. https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792", 
        examples=["https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792"]
    )


class CarListings(BaseModel):
    """List of car listings"""
    cars: List[CarListing] = Field("List of cars for sale.")

ollama_model = OllamaModel(
    model_name=MODEL
)
agent4 = Agent(model=ollama_model, result_type=CarListings, retries=3, system_prompt=SYSTEM_PROMPT)
try:
    result3 = agent4.run_sync(create_scrape_prompt(auto_content))
except:
    pass
finally:
    debug(agent4.last_run_messages)

debug(result3)

#rows = [listing.__dict__ for listing in extraction.cars]

#listings_df = pd.DataFrame(car_extract)
# print(car_extract)
# #listings_df["model"] = listings_df.model.apply(filter_model)
# #listings_df

# # %%
# listings_df.to_csv("car-listings.csv", index=None)
asyncio.run(playwright.stop())
asyncio.run(browser.close())
@samuelcolvin
Member

The scrubbing is being performed by Pydantic Logfire, which you're using for "logging" (actually tracing).

You can disable it with

logfire.configure(scrubbing=False)

Hope that helps.
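
Applied to the script above, that would look something like this (a minimal sketch; it assumes the installed Logfire version accepts the scrubbing parameter in configure()):

logfire.configure(
    send_to_logfire='if-token-present',  # unchanged from the original script
    scrubbing=False,                     # disable redaction of values matching built-in patterns such as 'Cookie'
)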

@samuelcolvin samuelcolvin added the question Further information is requested label Dec 26, 2024

IsaaacD commented Dec 27, 2024

Thanks @samuelcolvin, I tried the configuration but I received the following exception:

Exception has occurred: LogfireConfigError
You are not authenticated. Please run `logfire auth` to authenticate.

If you are running in production, you can set the `LOGFIRE_TOKEN` environment variable.
To create a write token, refer to https://logfire.pydantic.dev/docs/guides/advanced/creating_write_tokens/
  File "C:\Users\<<user>>\Code\AI-Bootcamp\test2.py", line 20, in <module>
    logfire.configure(scrubbing=False)
logfire.exceptions.LogfireConfigError: You are not authenticated. Please run `logfire auth` to authenticate.

If you are running in production, you can set the `LOGFIRE_TOKEN` environment variable.
To create a write token, refer to https://logfire.pydantic.dev/docs/guides/advanced/creating_write_tokens/

I did follow the link returned in the exception, but didn't see much about enabling it besides setting up an account to use their SaaS product. Does that sound right? If so, I'm not sure this is the logging solution for me, as the reason to go with local LLMs is to maintain data privacy. I'll keep an eye out for anything else online; a quick search didn't yield much, but I didn't dig too deep.
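
For anyone hitting the same error, a possible workaround (sketched on the assumption that the installed Logfire version accepts both parameters in configure()) is to keep everything local, so no account or token is needed, while still disabling scrubbing:

logfire.configure(
    send_to_logfire=False,  # nothing is exported to the Logfire platform, so no authentication is required
    scrubbing=False,        # prompts appear unredacted in the local console output
)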
