NHK metadata fixes/improvements #8388
Merged
if the programme had no entries - e.g. the sumo, the test would fail as there was no playlist title. By extracting from the programme page instead, we prevent this, and gain the programme description "for free" as well.
could also get the thumbnail with:
    thumbnail_div = get_element_by_class('p-programDetail__itemImg', html)
    thumbnail = extract_attributes(thumbnail_div).get('src')
but it's a relative url, cba to urljoin it + too much potential for fatalities/borking
does rather ruin how nice and elegant it was before, I'm afraid, but got to get the metadata somehow
this is purely so I can have the station name embedded as the artist without changing my config. Can be skipped if that's not a valid enough reason
garret1317 added the site-enhancement (Feature request for some website) and site-bug (Issue with a specific website) labels on Oct 19, 2023
bashonly requested changes on Nov 9, 2023
Co-Authored-By: bashonly <[email protected]>
bashonly approved these changes on Nov 9, 2023
bashonly added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label on Nov 9, 2023
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request on Apr 21, 2024
Authored by: garret1317
IMPORTANT: PRs without the template will be CLOSED
Description of your pull request and other information
This PR fixes/improves NHK World's metadata extraction.
I've also snuck in a little change to NHK Radiru (domestic radio) while I'm here.
The titles of one-off programmes, like https://www3.nhk.or.jp/nhkworld/en/ondemand/video/3004952/, were somewhat borked. The API returns the (sub)title as `<p></p>`, so the `title` the extractor returned was `Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island - <p></p>`.
The site doesn't handle it that well either tbh, note the `-` at the start.
I've made it so that the `series` `Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island` becomes the `title`, and the `series` becomes `None` - it's a one-off, so it's not really in a series.
In doing this, I changed the criteria for returning `series`/`episode` slightly. One-offs have no series. If there is no series, there can't be an episode of it, so `episode` isn't returned either.
If a programme had no entries, e.g. the sumo (first test), the extraction (or at least, the test) would fail, as it couldn't get the `series` from the nonexistent entries.
By extracting from the programme page instead, we prevent this, and gain the description "for free" as well.
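To make the one-off handling concrete, here is a minimal sketch of the title/series/episode rule described above. It is not the code from the PR: `strip_tags` is a crude stand-in for an HTML cleaner, and `build_titles` and its parameter names are hypothetical, chosen only for illustration.

```python
import re


def strip_tags(html):
    """Crude stand-in for an HTML cleaner: drop tags, trim, map '' to None."""
    if html is None:
        return None
    return re.sub(r'<[^>]+>', '', html).strip() or None


def build_titles(series_title, sub_title_html):
    """Hypothetical helper: derive title/series/episode from two API fields."""
    series = strip_tags(series_title)
    episode = strip_tags(sub_title_html)
    if series and episode:
        # Regular episode: keep the "<series> - <episode>" title shape.
        return {'title': f'{series} - {episode}', 'series': series, 'episode': episode}
    # One-off: the surviving name becomes the title. With no series there
    # can't be an episode of it, so both are None.
    return {'title': series or episode, 'series': None, 'episode': None}
```

With an empty `<p></p>` subtitle, the programme name ends up as the `title` alone, with no trailing ` - <p></p>`.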
We could also get the thumbnail from the programme page, but it's a relative URL, cba to urljoin it, plus too much potential for fatalities/borking.
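For reference, resolving the relative thumbnail URL in a non-fatal way could look roughly like the sketch below. It uses only the standard library rather than yt-dlp's helpers, and it assumes the `p-programDetail__itemImg` class sits on the element that carries `src`; `ThumbnailFinder` and `extract_thumbnail` are hypothetical names, not part of this PR.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ThumbnailFinder(HTMLParser):
    """Remember the src of the first tag that carries the given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if self.src is None and self.class_name in classes:
            self.src = attrs.get('src')


def extract_thumbnail(html, page_url, class_name='p-programDetail__itemImg'):
    finder = ThumbnailFinder(class_name)
    finder.feed(html)
    # Non-fatal: a missing element or attribute yields None, not an error.
    return urljoin(page_url, finder.src) if finder.src else None
```

Returning `None` when nothing matches keeps the thumbnail optional, which is the main worry raised above.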
The streams API has some other metadata, mostly related to timestamps.
I modified the `_extract_formats_and_subtitles` to extract this additional info as well.
It does rather ruin how nice and elegant it was before, I'm afraid, but got to get the metadata somehow.
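The general shape of that change, a formats helper that also hands back extra stream metadata for the caller to merge, can be sketched as follows. The response keys (`urls`, `published_at`, `modified_at`) and the function name are assumptions made up for this example; the real streams API fields are not shown in this description.

```python
from datetime import datetime


def parse_iso(value):
    """Return a Unix timestamp for an ISO 8601 string, or None if unusable."""
    try:
        return int(datetime.fromisoformat(value).timestamp())
    except (TypeError, ValueError):
        return None


def extract_formats_and_info(stream):
    """Hypothetical: `stream` is a dict already parsed from the streams API."""
    formats = [{'url': url} for url in stream.get('urls', [])]
    subtitles = {}
    extra = {
        # Key names on the API side are assumed, not taken from the PR.
        'timestamp': parse_iso(stream.get('published_at')),
        'modified_timestamp': parse_iso(stream.get('modified_at')),
    }
    # Only return fields that were actually present and parseable.
    extra = {k: v for k, v in extra.items() if v is not None}
    return formats, subtitles, extra
```

The caller can then fold the third value into its info dict, e.g. `info.update(extra)`, so the helper's extra return value is the only visible change at the call site.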
the radiru change
I've set the station name as `uploader` (in addition to `channel`, which it already is).
This is purely so I can have the station name embedded as the artist without changing my config.
That commit can be skipped if that's not a valid enough reason.
Template
Before submitting a pull request make sure you have:
In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:
What is the purpose of your pull request?
Copilot Summary
🤖 Generated by Copilot at 20ed3e0
Summary
🌐🎧📺
This pull request enhances the support for downloading videos and audio from the NHK website and API. It fixes and updates the extraction of titles, descriptions, thumbnails, formats, subtitles, and other metadata for the `NhkVodIE`, `NhkVodProgramIE`, `NhkRadiruIE`, and `NhkRadioNewsPageIE` extractors. It also adds or updates tests for these extractors in the file `yt_dlp/extractor/nhk.py`.
Walkthrough
- `_extract_formats_and_subtitles` function to return additional stream information
- `clean_html` function to sanitize title, sub_title, and description fields in `_extract_episode_info` function
- `_extract_episode_info` function and update info dictionary with stream information
- `NhkVodProgramIE` extractor
- `NhkRadiruIE` extractor
- `clean_html` and `get_element_by_class` functions from utils module
- `NhkVodIE` extractor with additional fields and fix description typo
- `NhkVodProgramIE` extractor
- `NhkRadiruIE` extractor