NHK metadata fixes/improvements #8388

garret1317 · 2023-10-19T14:15:57Z

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information

This PR fixes/improves NHK World's metadata extraction.
I've also snuck in a little change to NHK Radiru (domestic radio) while I'm here.

The titles of one-off programmes, like https://www3.nhk.or.jp/nhkworld/en/ondemand/video/3004952/, were somewhat borked. The API returns the (sub)title as <p></p>, so the title the extractor returned was Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island - <p></p>

The site doesn't handle it that well either tbh; note the - at the start.

I've made it so that the series Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island becomes the title, and the series becomes None - it's a one-off, so it's not really in a series.

In doing this, I changed the criteria for returning series/episode slightly. One-offs have no series. If there is no series, there can't be an episode of it, so episode isn't returned either.
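The title/series rule described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the extractor's actual code: `clean_text` is a hypothetical stand-in for an HTML cleaner, `pick_title_fields` is a made-up helper name, and the assumption is that the API's title carries the series name while its sub-title carries the episode name.

```python
import html
import re


def clean_text(value):
    """Hypothetical stand-in for an HTML cleaner: strip tags, unescape entities,
    and return None when nothing is left (so '<p></p>' becomes None)."""
    if not value:
        return None
    text = html.unescape(re.sub(r'<[^>]+>', '', value)).strip()
    return text or None


def pick_title_fields(api_title, api_sub_title):
    """Hypothetical helper: decide title/series/episode from the two API strings."""
    series = clean_text(api_title)
    title = clean_text(api_sub_title)
    if series and title:
        # Regular episode of a series: combine both for the title.
        return {'title': f'{series} - {title}', 'series': series, 'episode': title}
    if series:
        # One-off: the "series" string is really the title, so there is
        # no series and therefore no episode of it either.
        return {'title': series, 'series': None, 'episode': None}
    # No series at all: keep whatever title there is, still no episode.
    return {'title': title, 'series': None, 'episode': None}
```

With the one-off from the description, `pick_title_fields("Barakan Discovers AMAMI OSHIMA: Isson's Treasure Island", '<p></p>')` gives that string as the title and `None` for both series and episode.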

If a programme had no entries, e.g. the sumo (first test), the extraction (or at least, the test) would fail, as it couldn't get the series from the nonexistent entries.

By extracting the title from the programme page instead, we prevent this, and gain the description "for free" as well.

We could also get the thumbnail:

thumbnail_div = get_element_by_class('p-programDetail__itemImg', html)
thumbnail = extract_attributes(thumbnail_div).get('src')  # the div contains nothing but an <img> tag

but it's a relative URL, cba to urljoin it,
+ too much potential for fatal errors/borking
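For reference, resolving the relative URL non-fatally could look something like the sketch below. It uses only the standard library rather than the yt-dlp helpers, so it is an illustration of the idea and not a drop-in patch; the sample markup and image path in the usage note are invented, and `FirstImgInClass` and `extract_thumbnail` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstImgInClass(HTMLParser):
    """Remember the src of the first <img> seen after an element with the given class opens."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.inside = False
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.class_name in (attrs.get('class') or '').split():
            self.inside = True
        elif self.inside and tag == 'img' and self.src is None:
            self.src = attrs.get('src')


def extract_thumbnail(webpage, page_url, class_name='p-programDetail__itemImg'):
    """Return an absolute thumbnail URL, or None if nothing usable is found."""
    parser = FirstImgInClass(class_name)
    parser.feed(webpage)
    # urljoin resolves a relative src against the page URL; a missing src
    # simply yields None instead of raising.
    return urljoin(page_url, parser.src) if parser.src else None
```

Usage, with made-up markup: `extract_thumbnail('<div class="p-programDetail__itemImg"><img src="/images/sumo.jpg"></div>', 'https://www3.nhk.or.jp/nhkworld/en/ondemand/program/video/sumo/')` returns `https://www3.nhk.or.jp/images/sumo.jpg`.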

The streams API has some other metadata, mostly related to timestamps.
I modified _extract_formats_and_subtitles to return this additional info as well.
It does rather ruin how nice and elegant it was before, I'm afraid,
but got to get the metadata somehow.
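The shape of that change, a helper that hands back a third value alongside formats and subtitles, might look roughly like this. It is only a sketch: the key names `urls`, `duration` and `published_at` are placeholders rather than the real streams API fields, and the format list is faked instead of being parsed from a manifest.

```python
from datetime import datetime


def parse_iso_timestamp(value):
    """Return a Unix timestamp for an ISO 8601 string, or None if missing or malformed."""
    if not value:
        return None
    try:
        return int(datetime.fromisoformat(value).timestamp())
    except (TypeError, ValueError):
        return None


def extract_formats_subtitles_and_info(stream_meta):
    """Hypothetical helper: return (formats, subtitles, extra_info) from one API response."""
    # Placeholder: the real extractor would build these from the stream manifest.
    formats = [{'url': url} for url in stream_meta.get('urls', [])]
    subtitles = {}
    extra = {
        'duration': stream_meta.get('duration'),
        'timestamp': parse_iso_timestamp(stream_meta.get('published_at')),
    }
    # Drop fields the API did not provide so callers can merge the dict safely.
    extra_info = {key: value for key, value in extra.items() if value is not None}
    return formats, subtitles, extra_info
```

The caller can then unpack the three values and merge `extra_info` into the info dict it is already building, which is less tidy than a plain two-value return but keeps everything to a single API request.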

the radiru change
I've set the station name as uploader (in addition to channel, which it already is)

This is purely so I can have the station name embedded as the artist without changing my config.
That commit can be skipped if that's not a valid enough reason.
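The reasoning here presumably relies on the embedded artist tag falling back to uploader when no artist is set. A sketch of such a fallback is below; the key order is an assumption for illustration rather than a statement of what yt-dlp does, `pick_artist` is a hypothetical name, and the station name in the usage note is a placeholder.

```python
def pick_artist(info, keys=('artist', 'creator', 'uploader', 'uploader_id')):
    """Return the first non-empty value among the candidate keys (assumed order)."""
    for key in keys:
        value = info.get(key)
        if value:
            return value
    return None
```

With only `channel` set, `pick_artist({'channel': 'Example Station'})` finds nothing; once `uploader` is also set to the station name, the same call returns it without any config change.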

Template

Before submitting a pull request make sure you have:

At least skimmed through contributing guidelines including yt-dlp coding conventions
Searched the bugtracker for similar pull requests
Checked the code with flake8 and ran relevant tests

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check all of the following options that apply:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Fix or improvement to an extractor (Make sure to add/update tests)
New extractor (Piracy websites will not be accepted)
Core bug fix/improvement
New feature (It is strongly recommended to open an issue first)

Copilot Summary

`🤖 Generated by Copilot at 20ed3e0`

Summary

🌐🎧📺

This pull request enhances the support for downloading videos and audio from the NHK website and API. It fixes and updates the extraction of titles, descriptions, thumbnails, formats, subtitles, and other metadata for the NhkVodIE, NhkVodProgramIE, NhkRadiruIE, and NhkRadioNewsPageIE extractors. It also adds or updates tests for these extractors in the file yt_dlp/extractor/nhk.py.

NhkVodIE and more
Extract fields from website, API
Autumn tests updated

Walkthrough

Rename and modify _extract_formats_and_subtitles function to return additional stream information (link, link)
Use clean_html function to sanitize title, sub_title, and description fields in _extract_episode_info function (link)
Handle missing or empty series and title fields in _extract_episode_info function and update info dictionary with stream information (link)
Extract program title and description from HTML page in NhkVodProgramIE extractor (link)
Extract uploader field from API response in NhkRadiruIE extractor (link)
Import clean_html and get_element_by_class functions from utils module (link)
Add test cases for NhkVodIE extractor with additional fields and fix description typo (link, link, link, link)
Add description field and update playlist_mincount for test cases for NhkVodProgramIE extractor (link, link, link)
Add uploader field for test cases for NhkRadiruIE extractor (link, link, link, link, link)

If the programme had no entries (e.g. the sumo), the test would fail as there was no playlist title. By extracting from the programme page instead, we prevent this, and gain the programme description "for free" as well.

Could also get the thumbnail with:

thumbnail_div = get_element_by_class('p-programDetail__itemImg', html)
thumbnail = extract_attributes(thumbnail_div).get('src')

but it's a relative url, cba to urljoin it + too much potential for fatalities/borking

Does rather ruin how nice and elegant it was before, I'm afraid, but got to get the metadata somehow.

This is purely so I can have the station name embedded as the artist without changing my config; can be skipped if that's not a valid enough reason.

yt_dlp/extractor/nhk.py

Co-Authored-By: bashonly <[email protected]>

Authored by: garret1317

garret1317 added 9 commits October 19, 2023 11:22

nhk world: add bad test

a0c3d2d

nhk world: fix one-off titles/lack-of-series

34a9ceb

update NhkVod tests

5cc5b90

update NHKVodProgram tests

7ef0a78

extract additional info from j-stream api

c152949

Does rather ruin how nice and elegant it was before, I'm afraid, but got to get the metadata somehow.

update NHKVod tests

a90417e

radiru: set station name as uploader as well

513f0d0

This is purely so I can have the station name embedded as the artist without changing my config; can be skipped if that's not a valid enough reason.

update radiru tests

20ed3e0

garret1317 added the site-enhancement (Feature request for some website) and site-bug (Issue with a specific website) labels Oct 19, 2023

bashonly self-requested a review October 21, 2023 13:53

bashonly requested changes Nov 9, 2023

yt_dlp/extractor/nhk.py (outdated, resolved)

yt_dlp/extractor/nhk.py (outdated, resolved)

yt_dlp/extractor/nhk.py (outdated, resolved)

yt_dlp/extractor/nhk.py (resolved)

yt_dlp/extractor/nhk.py (resolved)

bashonly added the pending-fixes (PR has had changes requested) label Nov 9, 2023

garret1317 and others added 2 commits November 9, 2023 20:32

apply code review suggestions

a6e8cd1

Co-Authored-By: bashonly <[email protected]>

flake the eighth / fix dumb

78c2244

bashonly approved these changes Nov 9, 2023

bashonly added the pending-review (PR needs a review) label and removed the pending-fixes (PR has had changes requested) label Nov 9, 2023

bashonly self-assigned this Nov 11, 2023

bashonly removed the pending-review (PR needs a review) label Nov 11, 2023

bashonly merged commit 54579be into yt-dlp:master Nov 11, 2023
16 checks passed

aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024

[ie/nhk] Improve metadata extraction (yt-dlp#8388)

8d892e8

Authored by: garret1317
