NHK metadata fixes/improvements #8388

Merged
merged 11 commits into from
Nov 11, 2023

garret1317 (Collaborator) commented Oct 19, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This PR fixes/improves NHK World's metadata extraction.
I've also snuck in a small change to NHK Radiru (domestic radio) while I'm here.


The titles of one-off programmes, like https://www3.nhk.or.jp/nhkworld/en/ondemand/video/3004952/, were somewhat borked. The API returns the (sub)title as <p></p>, so the title the extractor returned was "Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island - <p></p>".

The site doesn't handle it that well either tbh; note the " - " at the start of the page title:
(Screenshot: browser showing the page title as " - Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island | NHK WORLD-JAPAN On Demand")

I've made it so that the series name, "Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island", becomes the title, and the series becomes None. It's a one-off, so it's not really in a series.

In doing this, I changed the criteria for returning series/episode slightly. One-offs have no series. If there is no series, there can't be an episode of it, so episode isn't returned either.
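
The rule described above could look roughly like the sketch below. It is not the extractor's actual code: `strip_html` is a stdlib-only stand-in for an HTML-cleaning helper, and `build_title_fields` and its parameter names are hypothetical, chosen only to illustrate the series/episode criteria.

```python
import re
from html import unescape


def strip_html(text):
    """Stand-in HTML cleaner: drop tags, unescape entities, trim whitespace."""
    if text is None:
        return None
    return unescape(re.sub(r'<[^>]+>', '', text)).strip() or None


def build_title_fields(series_raw, episode_raw):
    """Hypothetical helper: the arguments are the API's title and (sub)title."""
    series = strip_html(series_raw)
    episode = strip_html(episode_raw)
    if series and episode:
        return {'title': f'{series} - {episode}', 'series': series, 'episode': episode}
    # One-off (or missing data): whichever name exists becomes the title,
    # and neither series nor episode is reported.
    return {'title': series or episode, 'series': None, 'episode': None}


one_off = build_title_fields(
    "Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island", '<p></p>')
```

Here the empty `<p></p>` subtitle cleans down to nothing, so `one_off` carries the programme name as `title` with `series` and `episode` both `None`.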


If a programme had no entries, e.g. the sumo (first test), the extraction (or at least the test) would fail, as it couldn't get the series from the nonexistent entries, so there was no playlist title.

By extracting from the programme page instead, we prevent this, and gain the programme description "for free" as well.

We could also get the thumbnail:
thumbnail_div = get_element_by_class('p-programDetail__itemImg', html)
thumbnail = extract_attributes(thumbnail_div).get('src')  # the div contains nothing but an <img> tag

but it's a relative URL and I can't be bothered to urljoin it, plus there's too much potential for fatal errors/borking.
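
For reference, resolving that relative URL without making the lookup fatal might look something like the sketch below. This PR deliberately leaves the thumbnail out, so this is only an illustration: it uses the standard library instead of yt-dlp's helpers, the class name is the one quoted above, and `FirstImgAfterClass`, `thumbnail_url`, the page URL and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstImgAfterClass(HTMLParser):
    """Record the src of the first <img> seen at or after an element with the given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.armed = False  # becomes True once the target element has been seen
        self.src = None

    def handle_starttag(self, tag, attrs):
        if self.src is not None:
            return
        attrs = dict(attrs)
        if self.class_name in (attrs.get('class') or '').split():
            self.armed = True
        if self.armed and tag == 'img':
            self.src = attrs.get('src')


def thumbnail_url(page_url, html):
    finder = FirstImgAfterClass('p-programDetail__itemImg')
    finder.feed(html)
    # Non-fatal by design: a missing element or src yields None, not an exception.
    return urljoin(page_url, finder.src) if finder.src else None


# Hypothetical page URL and markup, for illustration only.
PAGE_URL = 'https://www3.nhk.or.jp/nhkworld/en/ondemand/program/video/example/'
HTML = '<div class="p-programDetail__itemImg"><img src="/nhkworld/images/example.jpg"></div>'
thumb = thumbnail_url(PAGE_URL, HTML)
```

The parser does not check that the `<img>` is actually nested inside the matching element; it relies on the assumption stated above that the div contains nothing but the image.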


The streams API has some other metadata, mostly related to timestamps.
I modified _extract_formats_and_subtitles to extract this additional info as well.
It does rather ruin how nice and elegant it was before, I'm afraid, but we've got to get the metadata somehow.
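
As a rough illustration of the shape of that change: instead of returning only formats and subtitles, the helper returns a dict that also carries timestamp fields and can be merged into the info dict. The sketch below is an assumption-heavy stand-in, not the code in this PR. The streams API's real response layout is not described here, so the keys `url`, `published_at` and `available_from`, the function names and the sample values are all invented.

```python
from datetime import datetime


def parse_iso_timestamp(value):
    """Return a Unix timestamp for an ISO 8601 string, or None if unusable."""
    if not value:
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    # A naive datetime carries no UTC offset, so treat it as unusable.
    if parsed.tzinfo is None:
        return None
    return int(parsed.timestamp())


def extract_stream_info(stream):
    """Hypothetical: `stream` is an already-decoded response with invented keys."""
    formats = [{'url': stream['url']}] if stream.get('url') else []
    return {
        'formats': formats,
        'subtitles': {},
        'timestamp': parse_iso_timestamp(stream.get('published_at')),
        'release_timestamp': parse_iso_timestamp(stream.get('available_from')),
    }


stream = {
    'url': 'https://example.com/master.m3u8',  # placeholder URL
    'published_at': '2023-10-19T12:00:00+09:00',  # made-up sample value
}
info = {'id': '3004952'}
info.update(extract_stream_info(stream))
```

Returning one dict keeps the caller to a single `update`, at the cost of the tidy two-value return the old function had.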


The Radiru change
I've set the station name as uploader (in addition to channel, which it already is).

This is purely so I can have the station name embedded as the artist without changing my config.
That commit can be skipped if that's not a valid enough reason.


Template

Before submitting a pull request make sure you have:

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Copilot Summary

🤖 Generated by Copilot at 20ed3e0

Summary

🌐🎧📺

This pull request enhances the support for downloading videos and audio from the NHK website and API. It fixes and updates the extraction of titles, descriptions, thumbnails, formats, subtitles, and other metadata for the NhkVodIE, NhkVodProgramIE, NhkRadiruIE, and NhkRadioNewsPageIE extractors. It also adds or updates tests for these extractors in the file yt_dlp/extractor/nhk.py.

NhkVodIE and more
Extract fields from website, API
Autumn tests updated

Walkthrough

  • Rename and modify _extract_formats_and_subtitles function to return additional stream information
  • Use clean_html function to sanitize title, sub_title, and description fields in _extract_episode_info function
  • Handle missing or empty series and title fields in _extract_episode_info function and update info dictionary with stream information
  • Extract program title and description from HTML page in NhkVodProgramIE extractor
  • Extract uploader field from API response in NhkRadiruIE extractor
  • Import clean_html and get_element_by_class functions from utils module
  • Add test cases for NhkVodIE extractor with additional fields and fix description typo
  • Add description field and update playlist_mincount for test cases for NhkVodProgramIE extractor
  • Add uploader field for test cases for NhkRadiruIE extractor

@garret1317 garret1317 added site-enhancement Feature request for some website site-bug Issue with a specific website labels Oct 19, 2023
@bashonly bashonly self-requested a review October 21, 2023 13:53
@bashonly bashonly added the pending-fixes PR has had changes requested label Nov 9, 2023
@bashonly bashonly added pending-review PR needs a review and removed pending-fixes PR has had changes requested labels Nov 9, 2023
@bashonly bashonly self-assigned this Nov 11, 2023
@bashonly bashonly removed the pending-review PR needs a review label Nov 11, 2023
@bashonly bashonly merged commit 54579be into yt-dlp:master Nov 11, 2023
16 checks passed
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024