add lingua libre audio source #1129

StefanVukovic99 · 2024-06-24T20:10:58Z

@hugolpz thanks for the fiddle, it was super helpful 🧎‍♂️ I have some more questions, please answer if you can (or point to documentation) 🙏 :

Search for Hund returns both German and Norwegian entries - can the search filter by language?
- I've noticed the files for German have deu in the title. Is this a hard rule, so that we can filter on it?
- Are languages consistently referred to by their ISO 639-3 code?
```
const searchString = `-${searchWord}.wav`;
```
I've also tried searching for .ogg instead, and that returns results too. What formats are in use? Is there a convenient way to search for all formats at once, or do we need to merge results from multiple requests?
Searching for lesen returns a bunch of PDFs with titles that don't even contain lesen. Can the search be more exact?
We already have reliable audio sources for Japanese, but is there anything in place on LL's side for languages with ambiguous writing systems? E.g. for Japanese, we use a "term+reading" system so we can differentiate between 生業(なりわい) and 生業(せいぎょう).

codspeed-hq · 2024-06-24T20:12:57Z

CodSpeed Performance Report

Merging #1129 will not alter performance

Comparing StefanVukovic99:lingua-libre-audio-source (0d61e83) with master (4e3f23e)

Summary

✅ 5 untouched benchmarks

hugolpz · 2024-06-25T13:43:14Z

Hello @StefanVukovic99 ,

Your links above are missing a `\` before `-hund.wav` and `-lesen.wav`. Without it, the query does the opposite of what you want (EXCLUDE '-hund.wav').
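
A minimal sketch of that escaping, assuming the word is interpolated into an end-of-filename search string as in the snippet quoted earlier. `escapeLeadingHyphen` and `buildSearchString` are hypothetical helper names, not functions from the PR or the fiddle:

```javascript
// Hypothetical helper: a leading "-" is read by the search as "exclude this term",
// so prefix it with a backslash to search for the literal hyphen instead.
function escapeLeadingHyphen(term) {
    return term.startsWith('-') ? `\\${term}` : term;
}

// Hypothetical helper: builds the end-of-filename search string for a word.
function buildSearchString(searchWord) {
    return escapeLeadingHyphen(`-${searchWord}.wav`);
}

console.log(buildSearchString('hund')); // prints \-hund.wav
```

The result still has to be URL-encoded (for example with `encodeURIComponent`) before it goes into the request.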

Wikimedia Commons' lexical audio assets : categories and files.

Commons is a crowdsourced project, so it holds far more audio overall, contributed either in an ad hoc way or in a structured way.

| Dimension | Spontaneous crowdsourcing | Lingualibre crowdsourcing |
|---|---|---|
| Type | Free | Structured |
| Method | Each user follows their own process and their own filename and category naming conventions. | All users record via Lingualibre.org, with conventional filenames and category names (see below). |
| Root category | Category:Pronunciation | Category:Lingua Libre pronunciation |
| Sub-category naming rules | None, free names | If an ISO code exists: `Category:Lingua Libre pronunciation-{iso639-3}`. If no ISO code exists: `Category:Lingua Libre pronunciation-other {language wikidata qid}` |
| Audio filename naming rules | None, free filenames | `File:LL-{Qid}_({iso})-{wikimedia username}-{voice name}-word.wav` |
| Audio filename examples | File:De-blaukarierten.ogg; File:BingBing Chinese.ogg; File:Fedora (fr).oga; File:ENGLISH ELMZ PRONOUNCED BY SAM.wav; ... | File:LL-Q188 (deu)-Djknusper-Hund.wav; File:LL-Q9043 (nor)-Teodor605-hund.wav; File:LL-Q188 (deu)-CamelCaseNick-Hund.wav; File:LL-Q131339 (gsw)-Mathieu Kappler-Hund.wav |
| Wikidata property: language | often P407:language of work or name --> Q188:German | often P407:language of work or name --> Q188:German |
| Wikidata property: word | often P9533:audio transcription --> Hund | often P9533:audio transcription --> Hund |
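
To illustrate how the Lingualibre filename convention could be used on the client side, here is a rough sketch of a parser. The regular expression is an assumption derived only from the example filenames above (which show a space, not an underscore, before the ISO code and a single speaker segment), and `parseLinguaLibreFilename` is a hypothetical name:

```javascript
// Assumed pattern, derived from the example filenames:
// LL-{Qid} ({iso})-{speaker}-{word}.wav
const LL_FILENAME = /^(?:File:)?LL-(Q\d+)[ _]\(([a-z]{3})\)-(.+)-([^-]+)\.wav$/;

// Hypothetical helper: returns the parts of a Lingualibre title, or null if it does not match.
function parseLinguaLibreFilename(title) {
    const match = LL_FILENAME.exec(title);
    if (match === null) { return null; }
    const [, qid, iso, speaker, word] = match;
    return {qid, iso, speaker, word};
}

console.log(parseLinguaLibreFilename('File:LL-Q188 (deu)-Djknusper-Hund.wav'));
console.log(parseLinguaLibreFilename('File:De-blaukarierten.ogg')); // prints null
```

Because the word is taken as everything after the last hyphen, a hyphenated word would be split incorrectly, so this is only a starting point for filtering by language.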

Lingualibre audio, on the other hand, may make up as much as 30% of the word recordings, but it follows strict category and filename conventions.

For the first approach you will need a hard-coded table of category + ISO pairs, together with flexibility on the file extension. For Lingualibre categories and files you can use a simpler rule (common category name string + ISO code), since everything is uploaded as .wav. Which strategy to use is up to you.
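
For the Lingualibre side, the category rule from the table could be sketched roughly as follows. `linguaLibreCategory` is a hypothetical helper, and the spelling `pronunciation` in the category name is assumed here, so it should be checked against the actual Commons category names:

```javascript
// Hypothetical helper following the Lingualibre sub-category rule:
//   with an ISO 639-3 code -> "...pronunciation-{iso}"
//   without one            -> "...pronunciation-other {Wikidata QID}"
function linguaLibreCategory({iso = null, qid = null}) {
    const base = 'Category:Lingua Libre pronunciation'; // assumed spelling
    if (iso) { return `${base}-${iso}`; }
    if (qid) { return `${base}-other ${qid}`; }
    throw new Error('Either an ISO 639-3 code or a Wikidata QID is required');
}

console.log(linguaLibreCategory({iso: 'deu'})); // prints Category:Lingua Libre pronunciation-deu
```

No lookup table is needed in this case, which is the simplification described above.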

Filter per language

As pointed out in #1093 : yes we can. Compare the following:

We need to tailor the API request further to your needs. Allow me to put those examples below so you can explore the file data and try other API request formats.

We could also look for audio pronunciation files with P407:language of work or name --> Q188:German
and P9533:audio transcription --> hund/Hund. (Query still to be found.)

API sandbox

If you want to explore, challenge yourself, and learn, inspect the Commons pages above, looking above all for meaningful patterns in their:

filenames (top of page)
"structured data" (middle of page, you have to click on it)
categories (bottom of page)

Then, play with :

hugolpz · 2024-06-25T19:00:53Z

Commons > Lingua Libre categories > target language approach

@StefanVukovic99 , I was able to refine the query centered on Commons' Lingualibre categories. See :

https://jsfiddle.net/6rbcu1yw/2/
- [classic] Query on Commons:Category:Lingua Libre pronunciation-deu (German) > word: hund, sorted by most relevant
- [regex] Query on Commons:Category:Lingua Libre pronunciation-deu (German) > word: hund, sorted by most relevant
Input : Commons API URL + iso639-3 + word
Response: List of relevant Commons Lingua Libre audio files

Note: this search matches on language + end of string, so a search for eng + -green.wav could also return eng + blue-green.wav.
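
As a rough illustration of the input described above (API URL + ISO 639-3 code + word), the request URL could be assembled like this. The parameters (`action=query`, `list=search`, `srsearch`, `srnamespace`, `origin`) are standard MediaWiki Action API parameters, but the exact search expression used in the jsfiddle is not reproduced here; the `incategory:` + `intitle:` combination, the category spelling, and the function names are assumptions:

```javascript
// Escape regex metacharacters so the word is matched literally.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: builds a search URL restricted to one Lingualibre language category.
function buildCategorySearchUrl(apiUrl, iso, word) {
    const url = new URL(apiUrl);
    url.search = new URLSearchParams({
        action: 'query',
        list: 'search',
        srnamespace: '6', // File: namespace
        srsearch: `incategory:"Lingua Libre pronunciation-${iso}" intitle:/-${escapeRegex(word)}\\.wav/i`,
        format: 'json',
        origin: '*', // anonymous cross-origin request
    }).toString();
    return url.toString();
}

console.log(buildCategorySearchUrl('https://commons.wikimedia.org/w/api.php', 'deu', 'hund'));
```

The regex is not anchored at the start of the word, so the end-of-string caveat in the note still applies to the results.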

Wikibase `P407:Language of work` approach

We could also look for audio pronunciation files with P407:language of work or name --> Q188:German
and P9533:audio transcription --> hund/Hund. (Query still to be found.)

The current structured data is not open to external queries, so this method has no API endpoint at the moment.

Only this approach would allow an exact match on the word while also restricting the search to a language of work.

hugolpz · 2024-06-25T19:39:57Z

@StefanVukovic99 : I added some samples with regex search (&srsearch=intitle:/-hund.wav/i) so you can refine the API request further.

As for performance, the results are as expected:

- The two-step search (1. find the filename; 2. from it, find the full file path) is fast, as shown in the jsfiddle.
- A single-step search returning both filename and file path exists, but it can be slower.
- Regex searches are slower.
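
The two-step search could look roughly like the sketch below. It relies on the standard Action API modules `list=search` and `prop=imageinfo` (with `iiprop=url`); all function names are hypothetical, the `fetchFn` parameter exists only so the network call can be swapped out, and the code merged in the PR may be structured differently:

```javascript
const API = 'https://commons.wikimedia.org/w/api.php';

// Small helper: GET a JSON response from the Action API.
async function apiGet(params, fetchFn = fetch) {
    const query = new URLSearchParams({format: 'json', formatversion: '2', origin: '*', ...params});
    const response = await fetchFn(`${API}?${query}`);
    return await response.json();
}

// Step 1: find the matching file titles.
async function findTitles(srsearch, fetchFn = fetch) {
    const data = await apiGet({action: 'query', list: 'search', srnamespace: '6', srsearch}, fetchFn);
    return (data.query?.search ?? []).map((hit) => hit.title);
}

// Step 2: resolve one title to its full file URL.
async function resolveFileUrl(title, fetchFn = fetch) {
    const data = await apiGet({action: 'query', titles: title, prop: 'imageinfo', iiprop: 'url'}, fetchFn);
    return data.query?.pages?.[0]?.imageinfo?.[0]?.url ?? null;
}

// Both steps together; the per-file lookups run in parallel.
async function findAudioUrls(srsearch, fetchFn = fetch) {
    const titles = await findTitles(srsearch, fetchFn);
    const urls = await Promise.all(titles.map((title) => resolveFileUrl(title, fetchFn)));
    return urls.filter((url) => url !== null);
}
```

Running the second step with `Promise.all` keeps the total latency close to that of a single file lookup instead of growing with the number of results.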

StefanVukovic99 · 2024-06-26T20:21:37Z

Thank you very much 🙏
I'm happy with how this works now, it's ready for a review & merge.

I think that, to make use of some of the other Commons files, it would be best to add another audio source, Wikimedia Commons, that uses different regexes to pick up some of the less systematically named files. That way Lingua Libre can be the 'quality' source and the rest of Commons the 'quantity' source.

Casheeew

lgtm

add lingua libre audio source

2c03299

Merge branch 'master' into lingua-libre-audio-source

a52963a

Merge branch 'master' into lingua-libre-audio-source

5bfa19a

StefanVukovic99 added 2 commits June 26, 2024 22:07

mvp

55762eb

run file requests in parallel

955b460

StefanVukovic99 marked this pull request as ready for review June 26, 2024 20:17

StefanVukovic99 requested a review from a team as a code owner June 26, 2024 20:17

StefanVukovic99 and others added 3 commits June 26, 2024 22:23

remove redundant language var

5caf9fd

redundant api function

b04fe38

Merge branch 'master' into lingua-libre-audio-source

0d61e83

Casheeew approved these changes Jun 27, 2024

Kuuuube approved these changes Jun 27, 2024

StefanVukovic99 added this pull request to the merge queue Jun 27, 2024

Merged via the queue into yomidevs:master with commit 603c2c7 Jun 27, 2024
11 checks passed

StefanVukovic99 deleted the lingua-libre-audio-source branch June 27, 2024 16:12

StefanVukovic99 added the kind/enhancement and area/audio labels Jun 27, 2024

Kuuuube mentioned this pull request Jul 28, 2024

Fix note generator audio #1278

Merged
