How can I extract text from the CrawlResult? #171
Comments
@unclecode I am new to crawl4ai, please help me. I want the text in chunks from `crawler.run` so that I can use it to store embeddings. How can I do that?
@deepak-hl Thanks for using Crawl4Ai. I'll take a look at your code by tomorrow and will definitely update you soon 🤓
@unclecode thank you!!
@unclecode Can I crawl all the content from a site's sub-URLs by providing only its base URL in crawl4ai? If yes, then how?
@deepak-hl Thank you for using Crawl4ai. Let me go through your questions one by one. First, you are using the old synchronous version, which I am not going to support any longer because I have moved everything to the asynchronous version. Here is a code example showing how you can properly combine all of this together. In this example I am building a knowledge graph from one of Paul Graham's essays.

```python
import asyncio
import os
from typing import List

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class Entity(BaseModel):
    name: str
    description: str


class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str


class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]


async def main():
    extraction_strategy = LLMExtractionStrategy(
        provider='openai/gpt-4o-mini',
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="""Extract entities and relationships from the given text."""
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            extraction_strategy=extraction_strategy,
            chunking_strategy=OverlappingWindowChunking(window_size=2000, overlap=100),
            # magic=True
        )
        # print(result.markdown[:500])
        print(result.extracted_content)
        # __data__ is assumed to be a directory path defined elsewhere in the script.
        with open(os.path.join(__data__, "kb.json"), "w") as f:
            f.write(result.extracted_content)
        print("Done")


if __name__ == "__main__":
    asyncio.run(main())
```

Regarding your next question about passing one URL and getting all of its sub-URLs, which is scraping: the good news is that we are already working on it and it is currently under testing. Within a few weeks we will release the scraper alongside the crawler function. The scraper will handle a graph search: you give it a URL and you can define how many levels deep you want to go, or all of them. I hope I answered your questions; let me know if you have any more.
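To address the original question about getting text chunks for embeddings: a minimal sketch, assuming the `OverlappingWindowChunking` strategy exposes a `chunk(text)` method that accepts a plain string (worth verifying against your installed crawl4ai version), is to crawl the page and then chunk `result.markdown` yourself before passing each chunk to whatever embedding model you use:

```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking


async def chunks_for_embeddings(url: str) -> list:
    # Crawl the page and get the content back as plain markdown text.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, bypass_cache=True)

    # Split the markdown into overlapping windows of text.
    chunker = OverlappingWindowChunking(window_size=2000, overlap=100)
    return chunker.chunk(result.markdown)


if __name__ == "__main__":
    chunks = asyncio.run(chunks_for_embeddings("https://paulgraham.com/love.html"))
    print(f"Got {len(chunks)} chunks; embed each one and store it in your vector database.")
```

Each element of `chunks` should be a plain string, so it can be passed straight to an embedding API or a sentence-transformers model.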
I want the text in chunks from `crawler.run` so that I can use it for storing embeddings. How can I do that?
It's showing me the error: 'CrawlResult' object has no attribute 'split'
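A likely cause of that error (a guess, since the failing line isn't shown): the `CrawlResult` object itself is being passed where a plain string is expected, for example to a chunker that calls `.split()` on its input. Reusing `result` and `chunker` from the sketch above, the fix would be:

```python
# chunks = chunker.chunk(result)          # fails: 'CrawlResult' object has no attribute 'split'
chunks = chunker.chunk(result.markdown)   # pass the crawled text (a string), not the result object
```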