docs: review and normalize `haystack.components.fetchers` #7232

wochinge · 2024-02-28T09:12:58Z

Related Issues

Proposed Changes:

Make protected or private whatever does not need to be public
Normalize and clean up the docs

How did you test it?

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

wochinge · 2024-02-28T09:13:25Z

title: Fetchers
excerpt: Fetches content from a list of URLs and returns a list of extracted content streams.
category: placeholder-haystack-api
slug: fetchers-api
parentDoc:
order: 80
hidden: false

Module link_content

LinkContentFetcher

@component
class LinkContentFetcher()

LinkContentFetcher is a component for fetching and extracting content from URLs.

It supports handling various content types, retries on failures, and automatic user-agent rotation for failed web
requests.

Usage example:

from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data

LinkContentFetcher.__init__

def __init__(raise_on_failure: bool = True,
             user_agents: Optional[List[str]] = None,
             retry_attempts: int = 2,
             timeout: int = 3)

Initializes the component.

Arguments:

raise_on_failure: If True, raises an exception if it fails to fetch a single URL. For multiple URLs, it logs errors and returns the content it successfully fetched. Default is True.
user_agents: User agents for fetching content. If None, a default user agent is used.
retry_attempts: Specifies how many times to retry fetching the URL's content. Default is 2.
timeout: Timeout in seconds for the request. Default is 3.
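
The interplay of user_agents, retry_attempts and timeout described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the component's actual implementation: fetch_with_retries, the injected send callable, and DEFAULT_USER_AGENT are hypothetical names, and the sketch assumes that retry_attempts counts retries after one initial attempt and that the user agent changes only after a failed request.

```python
from typing import Callable, List, Optional

# Hypothetical placeholder; the library's real default user agent is not shown in this PR.
DEFAULT_USER_AGENT = "example-agent/1.0"


def fetch_with_retries(
    url: str,
    send: Callable[[str, str, int], bytes],
    user_agents: Optional[List[str]] = None,
    retry_attempts: int = 2,
    timeout: int = 3,
) -> bytes:
    """Call send(url, user_agent, timeout); on failure, retry with the next user agent."""
    if retry_attempts < 0:
        raise ValueError("retry_attempts must be >= 0")
    agents = user_agents or [DEFAULT_USER_AGENT]
    last_error: Optional[Exception] = None
    # Assumption: one initial attempt plus `retry_attempts` retries.
    for attempt in range(retry_attempts + 1):
        agent = agents[attempt % len(agents)]  # advances only because the previous attempt failed
        try:
            return send(url, agent, timeout)
        except Exception as error:  # deliberately broad: this is only a sketch
            last_error = error
    assert last_error is not None
    raise last_error


# Simulated transport that fails twice, then succeeds.
seen_agents: List[str] = []


def flaky_send(url: str, user_agent: str, timeout: int) -> bytes:
    seen_agents.append(user_agent)
    if len(seen_agents) < 3:
        raise ConnectionError("simulated failure")
    return b"<html>ok</html>"


body = fetch_with_retries(
    "https://example.com", flaky_send, user_agents=["agent-a", "agent-b"]
)
```

Injecting send keeps the sketch runnable without network access; a real fetcher would perform the HTTP request there.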

LinkContentFetcher.run

@component.output_types(streams=List[ByteStream])
def run(urls: List[str])

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a ByteStream object containing the extracted content as binary data.
Each ByteStream object in the returned list corresponds to the contents of a single URL.
The content type of each stream is stored in the metadata of the ByteStream object under
the key "content_type". The URL of the fetched content is stored under the key "url".

Arguments:

urls: A list of URLs to fetch content from.

Raises:

Exception: If the provided list of URLs contains only a single URL, and raise_on_failure is set to
True, an exception will be raised in case of an error during content retrieval.
In all other scenarios, any retrieval errors are logged, and a list of successfully retrieved ByteStream
objects is returned.
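
The error-handling rule in this Raises section can be modelled in a few lines. The sketch below only mirrors the documented behavior and is not taken from the library source; fetch_all, fetch_one and fake_fetch are hypothetical names introduced for illustration.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], bytes],
    raise_on_failure: bool = True,
) -> List[bytes]:
    """Model of the documented rule: raise only for a single URL with raise_on_failure=True."""
    if len(urls) == 1 and raise_on_failure:
        # Single URL: let any error propagate to the caller.
        return [fetch_one(urls[0])]
    results: List[bytes] = []
    for url in urls:
        try:
            results.append(fetch_one(url))
        except Exception as error:  # deliberately broad: this is only a sketch
            # All other scenarios: log the error and keep what was fetched.
            logger.warning("Could not fetch %s: %s", url, error)
    return results


def fake_fetch(url: str) -> bytes:
    """Hypothetical fetch function that fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return url.encode()


# Two URLs, one failing: the failure is logged and the good result is returned.
partial = fetch_all(["https://good.example", "https://bad.example"], fake_fetch)
```

Under this model, passing only the failing URL with raise_on_failure=True raises the ValueError, while raise_on_failure=False returns an empty list.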

Returns:

ByteStream objects representing the extracted content.
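
As a rough illustration of how the returned streams might be consumed, the sketch below groups fetched URLs by the "content_type" key in each stream's metadata. FakeByteStream is a hypothetical stand-in that exposes only the data and meta attributes used in the usage example above, so the snippet runs without Haystack installed; urls_by_content_type is likewise an illustrative helper, not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeByteStream:
    """Hypothetical stand-in with only the `data` and `meta` attributes."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def urls_by_content_type(streams: List[FakeByteStream]) -> Dict[str, List[str]]:
    """Group the "url" metadata values by their "content_type" metadata value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for stream in streams:
        content_type = stream.meta.get("content_type", "unknown")
        grouped[content_type].append(stream.meta.get("url", ""))
    return dict(grouped)


streams = [
    FakeByteStream(b"<html></html>", {"content_type": "text/html", "url": "https://a.example"}),
    FakeByteStream(b"%PDF-", {"content_type": "application/pdf", "url": "https://b.example/doc.pdf"}),
    FakeByteStream(b"<html></html>", {"content_type": "text/html", "url": "https://c.example"}),
]
grouped = urls_by_content_type(streams)
```

A grouping like this could be used to route each stream to a converter suited to its content type.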

coveralls · 2024-02-28T09:22:19Z

Pull Request Test Coverage Report for Build 8078330587

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
17 unchanged lines in 1 file lost coverage.
Overall coverage remained the same at 90.058%

Files with Coverage Reduction	New Missed Lines	%
components/fetchers/link_content.py	17	79.52%

Totals
Change from base Build 8077602639:	0.0%
Covered Lines:	5281
Relevant Lines:	5864

💛 - Coveralls

anakin87

In general, it looks good...

Please just remove defaults from docstrings.

haystack/components/fetchers/link_content.py

anakin87

👍

docs: review and normalize haystack.components.fetchers

54de8ff

wochinge requested a review from a team as a code owner February 28, 2024 09:12

wochinge requested review from masci and removed request for a team February 28, 2024 09:12

github-actions bot added the topic:tests, 2.x, and type:documentation labels Feb 28, 2024

wochinge requested review from anakin87 and removed request for masci February 28, 2024 09:14

anakin87 reviewed Feb 28, 2024

haystack/components/fetchers/link_content.py (outdated)

docs: drop defaults

8f97d60

wochinge requested a review from anakin87 February 28, 2024 09:45

anakin87 approved these changes Feb 28, 2024

wochinge merged commit ac4f458 into main Feb 28, 2024
20 checks passed

wochinge deleted the docs/review-haystack.components.fetchers branch February 28, 2024 10:24
