How can I extract text from the CrawlResult? #171
Comments
@unclecode I am new to crawl4ai, please help me. I want the text in chunks from `crawler.run` so that I can use it to store embeddings. How can I do that?
@deepak-hl Thanks for using Crawl4Ai. I'll take a look at your code by tomorrow and will definitely update you soon 🤓
@unclecode thank you!!
@unclecode Can I crawl all the content from a site's sub-URLs by providing only its base URL in crawl4ai? If yes, then how?
@deepak-hl Thank you for using Crawl4ai. Let me go through your questions one by one. First, you are using the old synchronous version, which I am not going to support any longer because I have moved everything to the asynchronous version. Here is a code example showing how you can properly combine all of this together. In this example I am building a knowledge graph from one of Paul Graham's essays.

```python
import asyncio
import os
from typing import List

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class Entity(BaseModel):
    name: str
    description: str


class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str


class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]


async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            chunking_strategy=OverlappingWindowChunking(window_size=2000, overlap=100),
            # magic=True
        )
        # print(result.markdown[:500])
        print(result.extracted_content)
        # __data__ is assumed to be a directory path defined elsewhere in the script.
        with open(os.path.join(__data__, "kb.json"), "w") as f:
            f.write(result.extracted_content)
        print("Done")


if __name__ == "__main__":
    asyncio.run(main())
```

Regarding your next question about passing one URL and getting all of its sub-URLs, which is scraping: the good news is that we are already working on it and it is currently under testing. Within a few weeks we will release the scraper alongside the crawler function. The scraper will handle a graph search: you give it a URL and you can define how many levels deep you want to go, or all of them. I hope I answered your questions; let me know if you have any more.
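To address the original question about getting text chunks for embeddings: a minimal sketch, assuming the `OverlappingWindowChunking` strategy exposes a `chunk(text)` method that accepts a plain string (worth verifying against your installed crawl4ai version), is to crawl the page and then chunk `result.markdown` yourself before passing each chunk to whatever embedding model you use:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking


async def chunks_for_embeddings(url: str) -> list:
    # Crawl the page and get the content back as plain markdown text.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)

    # Split the markdown into overlapping windows of text.
    chunker = OverlappingWindowChunking(window_size=2000, overlap=100)
    return chunker.chunk(result.markdown)


if __name__ == "__main__":
    chunks = asyncio.run(chunks_for_embeddings("https://paulgraham.com/love.html"))
    print(f"Got {len(chunks)} chunks; embed each one and store it in your vector database.")
```

Each element of `chunks` should be a plain string, so it can be passed straight to an embedding API or a sentence-transformers model.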
I want the text in chunks from `crawler.run` so that I can use it for storing embeddings. How can I do that?
It's showing me the error: 'CrawlResult' object has no attribute 'split'
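A likely cause of that error (a guess, since the failing line isn't shown): the `CrawlResult` object itself is being passed where a plain string is expected, for example to a chunker that calls `.split()` on its input. Reusing `result` and `chunker` from the sketch above, the fix would be:

```python
# chunks = chunker.chunk(result)          # fails: 'CrawlResult' object has no attribute 'split'
chunks = chunker.chunk(result.markdown)   # pass the crawled text (a string), not the result object
```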