agent run prompt=[Scrubbed due to 'Cookie'] #547

Open
IsaaacD opened this issue Dec 26, 2024 · 2 comments
Labels
question Further information is requested

Comments


IsaaacD commented Dec 26, 2024

I can't find where this is coming from or what it means: agent4 run prompt=[Scrubbed due to 'Cookie']. It happens when I run Pydantic AI with logging turned on. I'm using Playwright, but I couldn't find any "Scrubbed" keyword in either the Pydantic AI or Playwright source, nor any hits on Google. Where is this coming from, and could it impact the results? If I prompt Qwen in Ollama directly, it returns the JSON I expect.

This was adapted from this notebook, but I'm attempting Pydantic AI for some extra functionality: https://github.com/curiousily/AI-Bootcamp/blob/master/20.scraping-with-llm.ipynb (the last activity in the notebook).

Any help would be appreciated in understanding whether the prompt is actually being scrubbed and how to remedy it, or maybe there's another way of tackling this issue? TIA.

import asyncio
from pprint import pprint
from typing import List, Optional, TypedDict, Union
import httpx
from pydantic_ai import Agent
from pydantic_ai.models.ollama import OllamaModel
import html2text
import nest_asyncio
import logfire
from logging import basicConfig
from playwright.async_api import async_playwright
from pydantic import BaseModel, Field
from tqdm import tqdm
from devtools import debug
nest_asyncio.apply()

logfire.configure(send_to_logfire='if-token-present')
logfire.ConsoleOptions.min_log_level ='trace'
logfire.ConsoleOptions.verbose = True
basicConfig(handlers=[logfire.LogfireLoggingHandler()])
# %%
USER_AGENT = "Mozilla/5.01 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"


#MODEL = "llama3.2:latest"
MODEL = "qwen2.5-coder:14b"

llm = OllamaModel(model_name=MODEL)

# %%
SYSTEM_PROMPT = """
You're an expert text extractor. You extract information from webpage content.
Always extract data without changing it and any other output.
Ignore everything but car information.
"""


def create_scrape_prompt(page_content: str) -> str:
    return f"""
Convert the following list of cars into valid JSON format, including details such as model, features, horsepower,
price, mileage, and year for each car. The list includes:
```
{page_content}
```

""".strip()


playwright = None
browser = None
async def fetch_page(url, user_agent=USER_AGENT) -> str:
    global playwright, browser
    if playwright == None:
        playwright = await async_playwright().start()
        browser = await playwright.chromium.launch()

    context = await browser.new_context(user_agent=USER_AGENT)

    page = await context.new_page()
    await page.goto(url, timeout=10000)
    content = await page.content()

    markdown_converter = html2text.HTML2Text()
    markdown_converter.body_width = 0
    markdown_converter.ignore_links = False
    #return content
    return markdown_converter.handle(content)

auto_content = ''
async def inner_fetch_page():
    global auto_content
    print('fetching page')
    auto_content = await fetch_page("https://www.autoscout24.com/lst?atype=C&cy=D%2CA%2CB%2CE%2CF%2CI%2CL%2CNL&desc=0&fregfrom=2018&gear=M&powerfrom=309&powerto=478&powertype=hp&search_id=1tih4oks815&sort=standard&ustate=N%2CU")
    print('fetched page', auto_content)

asyncio.run(inner_fetch_page())
class CarListing(TypedDict):
    """Information about a car listing"""

    make: str | None = Field("Make of the car e.g. Toyota", examples=["Toyota", "Lexus"])
    model: str | None = Field("Model of the car, maximum 3 words e.g. Land Cruiser", examples=["Land Cruiser", "RC F Advantage"])
    horsepower: str | None = Field("Horsepower (HP) of the engine e.g. 231", examples=["231", "467"])
    price: str | None  = Field("Price in euro e.g. 34000", examples=["34,000", "45000"])
    mileage:  str | None  = Field("Number of kilometers on the odometer e.g. 73400", examples=["73400", "12,000"])
    year:  str | None  = Field("Year of registration (if available) e.g. 2015" , examples=["2015", "2020"])
    url: str | None = Field(
        "Url to the listing e.g. https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792", 
        examples=["https://www.autoscout24.com/offers/lexus-rc-f-advantage-coupe-gasoline-grey-19484ec1-ee56-4bfd-8769-054f03515792"]
    )


class CarListings(BaseModel):
    """List of car listings"""
    cars: List[CarListing] = Field("List of cars for sale.")

ollama_model = OllamaModel(
    model_name=MODEL
)
agent4 = Agent(model=ollama_model, result_type=CarListings, retries=3, system_prompt=SYSTEM_PROMPT)
try:
    result3 = agent4.run_sync(create_scrape_prompt(auto_content))
except:
    pass
finally:
    debug(agent4.last_run_messages)

debug(result3)

#rows = [listing.__dict__ for listing in extraction.cars]

#listings_df = pd.DataFrame(car_extract)
# print(car_extract)
# #listings_df["model"] = listings_df.model.apply(filter_model)
# #listings_df

# # %%
# listings_df.to_csv("car-listings.csv", index=None)
asyncio.run(playwright.stop())
asyncio.run(browser.close())
@samuelcolvin
Member

The scrubbing is being performed by Pydantic Logfire, which you're using for "logging" (actually tracing).

You can disable it with

logfire.configure(scrubbing=False)

Hope that helps.
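
Applied to the script above, that would look something like this (a minimal sketch; it assumes the installed Logfire version accepts the scrubbing parameter in configure()):

logfire.configure(
    send_to_logfire='if-token-present',  # unchanged from the original script
    scrubbing=False,                     # disable redaction of values matching built-in patterns such as 'Cookie'
)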

@samuelcolvin samuelcolvin added the question Further information is requested label Dec 26, 2024

IsaaacD commented Dec 27, 2024

Thanks @samuelcolvin, I tried the configuration but I received the following exception:

Exception has occurred: LogfireConfigError
You are not authenticated. Please run `logfire auth` to authenticate.

If you are running in production, you can set the `LOGFIRE_TOKEN` environment variable.
To create a write token, refer to https://logfire.pydantic.dev/docs/guides/advanced/creating_write_tokens/
  File "C:\Users\<<user>>\Code\AI-Bootcamp\test2.py", line 20, in <module>
    logfire.configure(scrubbing=False)
logfire.exceptions.LogfireConfigError: You are not authenticated. Please run `logfire auth` to authenticate.

If you are running in production, you can set the `LOGFIRE_TOKEN` environment variable.
To create a write token, refer to https://logfire.pydantic.dev/docs/guides/advanced/creating_write_tokens/

I did follow the link returned in the exception, but didn't see much about enabling it besides setting up an account to use their SaaS product. Does that sound right? If so, I'm not sure this is the logging solution for me, as the reason to go with local LLMs is to maintain data privacy. I'll keep an eye out for anything else online; a quick search didn't yield much, but I didn't dig too deep.
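
For anyone hitting the same error, a possible workaround (sketched on the assumption that the installed Logfire version accepts both parameters in configure()) is to keep everything local, so no account or token is needed, while still disabling scrubbing:

logfire.configure(
    send_to_logfire=False,  # nothing is exported to the Logfire platform, so no authentication is required
    scrubbing=False,        # prompts appear unredacted in the local console output
)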
