Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new extractor for Dagelijkse kost #28119

Merged
merged 1 commit into from
Feb 11, 2021
Merged

Conversation

paretje
Copy link

@paretje paretje commented Feb 8, 2021

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

This PR adds a new extractor for "Dagelijkse kost" to canvas.py. The extractor is based on the existing extractor for een and canvas.

Comment on lines 356 to 357
mobj = re.match(self._VALID_URL, url)
display_id = mobj.group('id')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _match_id method.

webpage = self._download_webpage(url, display_id)

title = strip_or_none(self._search_regex(
r'<h1[^>]+class="dish-metadata__title headline-1"[^>]*>(.+?)</h1>',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use get_element_by_class function.

'id': video_id,
'display_id': display_id,
'title': title,
'description': self._og_search_description(webpage),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fallback to other available values.


class DagelijkseKostIE(InfoExtractor):
IE_DESC = 'dagelijksekost.een.be'
_VALID_URL = r'https?://dagelijksekost\.een\.be/(?:[^/]+/)*(?P<id>[^/?#&]+)'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match only supported URLs.


return {
'_type': 'url_transparent',
'url': 'https://mediazone.vrt.be/api/v1/dako/assets/%s' % (video_id),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parentheses not needed.

webpage = self._download_webpage(url, display_id)

title = strip_or_none(get_element_by_class(
"dish-metadata__title",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use single quotes consistently.

@@ -15,11 +15,12 @@
merge_dicts,
str_or_none,
url_or_none,
get_element_by_class,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alphabetic order.

description = strip_or_none(get_element_by_class(
"dish-description",
webpage) or self._og_search_description(
webpage, default=None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

twitter:description and description meta tags are also available.

title = strip_or_none(get_element_by_class(
'dish-metadata__title',
webpage) or self._og_search_title(
webpage, default=None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the result of the fallback(in comparision with the primary source).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I drop the fallback, or apply a regex to make up for the differences?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

twitter:title meta tag does not contain | Dagelijkse kost in the end.
if you want to keep og:title then you can use remove_end function to clean the title.

Comment on lines 368 to 370
or self._html_search_meta(
('description', 'twitter:description'), webpage)
or self._og_search_description(webpage, default=None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine into a single call to _html_search_meta.

Comment on lines 366 to 367
get_element_by_class(
'dish-description', webpage)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check that extracted description is what is expected(should not contain html tags).

title = strip_or_none(get_element_by_class(
'dish-metadata__title',
webpage) or self._og_search_title(
webpage, default=None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

twitter:title meta tag does not contain | Dagelijkse kost in the end.
if you want to keep og:title then you can use remove_end function to clean the title.

Comment on lines 366 to 368
description = strip_or_none(

clean_html(get_element_by_class(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip_or_none no longer needed.

or self._html_search_meta(
('description', 'twitter:description', 'og:description'),
webpage,
default=None))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be better not to silence the warning when the description is not found(it's not expcted to not been able to extract description, failing would likely indicate that a change in the website has happened).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I do the same for title?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

@remitamine remitamine merged commit f28f1b4 into ytdl-org:master Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants