docs: review and normalize haystack.components.fetchers #7232

Merged
wochinge merged 2 commits into main from docs/review-haystack.components.fetchers
Feb 28, 2024

Conversation

wochinge
Contributor

@wochinge wochinge commented Feb 28, 2024

Related Issues

Proposed Changes:

  • make protected / private what's not needed to be public
  • normalize / clean

How did you test it?

Notes for the reviewer

Checklist

@wochinge wochinge requested a review from a team as a code owner February 28, 2024 09:12
@wochinge wochinge requested review from masci and removed request for a team February 28, 2024 09:12
@github-actions github-actions bot added topic:tests 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Feb 28, 2024
@wochinge
Contributor Author


title: Fetchers
excerpt: Fetches content from a list of URLs and returns a list of extracted content streams.
category: placeholder-haystack-api
slug: fetchers-api
parentDoc:
order: 80
hidden: false

Module link_content

LinkContentFetcher

@component
class LinkContentFetcher()

LinkContentFetcher is a component for fetching and extracting content from URLs.

It supports handling various content types, retries on failures, and automatic user-agent rotation for failed web
requests.

Usage example:

from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data

LinkContentFetcher.__init__

def __init__(raise_on_failure: bool = True,
             user_agents: Optional[List[str]] = None,
             retry_attempts: int = 2,
             timeout: int = 3)

Initializes the component.

Arguments:

  • raise_on_failure: If True, raises an exception if it fails to fetch a single URL.
    For multiple URLs, it logs errors and returns the content it successfully fetched. Default is True.
  • user_agents: User agents to use when fetching content. If None, a default user agent is used.
  • retry_attempts: Number of times to retry fetching a URL's content. Default is 2.
  • timeout: Timeout in seconds for the request. Default is 3.
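
To make the interplay of user_agents, retry_attempts and timeout concrete, here is a minimal standard-library sketch of retrying with user-agent rotation. It is not Haystack's implementation: `fetch_with_retries`, `DEFAULT_USER_AGENT`, and the injected `send` callable are hypothetical names, and the sketch assumes one initial try plus `retry_attempts` retries, which may differ from how the library counts attempts.

```python
from itertools import cycle
from typing import Callable, Dict, List, Optional

DEFAULT_USER_AGENT = "example-agent/1.0"  # hypothetical default, not Haystack's


def fetch_with_retries(
    url: str,
    send: Callable[[str, Dict[str, str], int], bytes],
    user_agents: Optional[List[str]] = None,
    retry_attempts: int = 2,
    timeout: int = 3,
) -> bytes:
    """Try `url` once, then retry, moving to the next user agent after each failure."""
    if retry_attempts < 0:
        raise ValueError("retry_attempts must be >= 0")
    agents = cycle(user_agents or [DEFAULT_USER_AGENT])
    last_error: Optional[Exception] = None
    for _ in range(1 + retry_attempts):  # assumption: initial try + retries
        headers = {"User-Agent": next(agents)}
        try:
            return send(url, headers, timeout)
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
    assert last_error is not None
    raise last_error


# Demo with a fake transport that rejects the first user agent.
seen: List[str] = []


def fake_send(url: str, headers: Dict[str, str], timeout: int) -> bytes:
    seen.append(headers["User-Agent"])
    if headers["User-Agent"] == "agent-a":
        raise ConnectionError("blocked")
    return b"<html>ok</html>"


body = fetch_with_retries(
    "https://example.com", fake_send, user_agents=["agent-a", "agent-b"]
)
```

Injecting `send` keeps the sketch free of network access; in practice it would wrap an HTTP client call that honors the timeout.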

LinkContentFetcher.run

@component.output_types(streams=List[ByteStream])
def run(urls: List[str])

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a ByteStream object containing the extracted content as binary data.
Each ByteStream object in the returned list corresponds to the contents of a single URL.
The content type of each stream is stored in the metadata of the ByteStream object under
the key "content_type". The URL of the fetched content is stored under the key "url".

Arguments:

  • urls: A list of URLs to fetch content from.

Raises:

  • Exception: If the provided list of URLs contains only a single URL, and raise_on_failure is set to
    True, an exception will be raised in case of an error during content retrieval.
    In all other scenarios, any retrieval errors are logged, and a list of successfully retrieved ByteStream
    objects is returned.

Returns:

ByteStream objects representing the extracted content.
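
The error-handling rule described under "Raises" can be sketched as follows. This is an illustration under stated assumptions, not the component's code: `StreamSketch` is a hypothetical stand-in for ByteStream, and `run_sketch` and the injected `fetch` callable are made-up names. Only the "content_type" and "url" metadata keys and the "streams" output key come from the documentation above.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

logger = logging.getLogger(__name__)


@dataclass
class StreamSketch:
    """Hypothetical stand-in for ByteStream: binary data plus metadata."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def run_sketch(
    urls: List[str],
    fetch: Callable[[str], Tuple[str, bytes]],
    raise_on_failure: bool = True,
) -> Dict[str, List[StreamSketch]]:
    """Fetch every URL; re-raise only for a single URL with raise_on_failure set."""
    streams: List[StreamSketch] = []
    for url in urls:
        try:
            content_type, data = fetch(url)
        except Exception as error:
            if raise_on_failure and len(urls) == 1:
                raise
            # In all other cases the error is logged and the URL is skipped.
            logger.warning("Could not fetch %s: %s", url, error)
            continue
        streams.append(
            StreamSketch(data=data, meta={"content_type": content_type, "url": url})
        )
    return {"streams": streams}


def fake_fetch(url: str) -> Tuple[str, bytes]:
    """Fake transport: any URL containing 'broken' fails."""
    if "broken" in url:
        raise ConnectionError("unreachable")
    return "text/html", b"<html></html>"


result = run_sketch(["https://example.com", "https://broken.example"], fake_fetch)
```

With two URLs the failure is only logged, so `result["streams"]` holds the one successful stream; calling `run_sketch(["https://broken.example"], fake_fetch)` would propagate the `ConnectionError` instead.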

@wochinge wochinge requested review from anakin87 and removed request for masci February 28, 2024 09:14
@coveralls
Collaborator

coveralls commented Feb 28, 2024

Pull Request Test Coverage Report for Build 8078330587

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 17 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 90.058%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/fetchers/link_content.py | 17 | 79.52% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 8077602639 | 0.0% |
| Covered Lines | 5281 |
| Relevant Lines | 5864 |

💛 - Coveralls

Member

@anakin87 anakin87 left a comment

In general, it looks good...

Please just remove defaults from docstrings.

haystack/components/fetchers/link_content.py (review thread: outdated, resolved)
@wochinge wochinge requested a review from anakin87 February 28, 2024 09:45
Member

@anakin87 anakin87 left a comment

👍

@wochinge wochinge merged commit ac4f458 into main Feb 28, 2024
20 checks passed
@wochinge wochinge deleted the docs/review-haystack.components.fetchers branch February 28, 2024 10:24
Development

Successfully merging this pull request may close these issues.

  • API Docs - haystack.components.fetchers
  • Docstrings - haystack.components.fetchers
3 participants