add lingua libre audio source #1129

Merged


StefanVukovic99
Collaborator

@StefanVukovic99 StefanVukovic99 commented Jun 24, 2024

Closes #1093

@hugolpz thanks for the fiddle, it was super helpful 🧎‍♂️ I have some more questions, please answer if you can (or point to documentation) 🙏 :

  • Search for Hund returns both German and Norwegian entries - can the search filter by language?

    • I've noticed the files for German have deu in the title. Is this a hard rule, so we can filter on that?
    • Are languages consistently referred to by their ISO 639-3 code?
  • const searchString = `-${searchWord}.wav`;
    

    I've tried searching for .ogg instead and there are also results. What formats are in use? Is there a convenient way to search for all formats at once, or do we need to merge results from multiple requests?

  • Searching for lesen returns a bunch of PDFs with titles that don't even contain lesen. Can the search be more exact?

  • We already have reliable audio sources for Japanese, but is there anything in place on LL's side for languages with ambiguous writing systems? E.g. for Japanese, we use a "term+reading" system so we can differentiate between 生業(なりわい) and 生業(せいぎょう).


codspeed-hq bot commented Jun 24, 2024

CodSpeed Performance Report

Merging #1129 will not alter performance

Comparing StefanVukovic99:lingua-libre-audio-source (0d61e83) with master (4e3f23e)

Summary

✅ 5 untouched benchmarks

@hugolpz

hugolpz commented Jun 25, 2024

Hello @StefanVukovic99 ,

Your links above are missing a \ sign before -hund.wav and -lesen.wav. Without it, the query means the opposite (EXCLUDE '-hund.wav').
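The escaping described above could be sketched as follows. This is only an illustration: `buildSearchString` and its `extension` parameter are hypothetical names, and the behavior of the unescaped leading `-` is taken from the comment above rather than verified here.

```javascript
// Hypothetical helper: builds the filename-suffix search term and escapes the
// leading hyphen so that the search does not read it as an exclusion operator.
function buildSearchString(searchWord, extension = 'wav') {
    // In a JS template literal, '\\' produces a single literal backslash.
    return `\\-${searchWord}.${extension}`;
}

console.log(buildSearchString('hund'));
console.log(buildSearchString('lesen', 'ogg'));
```

Keeping the extension as a parameter leaves room for one request per audio format if more than .wav turns out to be needed.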

Wikimedia Commons' lexical audio assets : categories and files.

Commons is a crowd-sourced project, so it holds much more audio, contributed in either a chaotic or a structured way.

| Dimension | Spontaneous crowd sourcing | Lingualibre crowd sourcing |
| --- | --- | --- |
| Type | Free | Structured |
| Method | Each user has their own process and their own filename and category naming conventions. | All users record via Lingualibre.org.<br>Conventional filenames and category names (see below) |
| Root category | Category:Pronunciation | Category:Lingua Libre pronunciation |
| Sub-categories naming rules | None, free names | If iso exists:<br>Category:Lingua Libre pronunciation-{iso639-3}<br>If no iso:<br>Category:Lingua Libre pronunciation-other {language wikidata qid} |
| Audio filenames naming rules | None, free filenames | File:LL-{Qid}_({iso})-{wikimedia username}-{voice name}-word.wav |
| Audio filename examples | File:De-blaukarierten.ogg<br>File:BingBing Chinese.ogg<br>File:Fedora (fr).oga<br>File:ENGLISH ELMZ PRONOUNCED BY SAM.wav<br>... | File:LL-Q188 (deu)-Djknusper-Hund.wav<br>File:LL-Q9043 (nor)-Teodor605-hund.wav<br>File:LL-Q188 (deu)-CamelCaseNick-Hund.wav<br>File:LL-Q131339 (gsw)-Mathieu Kappler-Hund.wav |
| Wikidata properties: language | often:<br>P407:language of work or name --> Q188:German | often:<br>P407:language of work or name --> Q188:German |
| Wikidata properties: word | often:<br>P9533:audio transcription --> Hund | often:<br>P9533:audio transcription --> Hund |

Lingualibre recordings, on the other hand, may make up as much as 30% of the word audio files, and they follow strict category and filename conventions.

For the first approach, you will need a hard-coded table of category + iso pairs, together with flexibility on the file extension. For Lingualibre categories and files, you can go with a simpler rule (a common category name string + iso), since all files are uploaded as .wav. The strategy is up to you.
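Because the Lingualibre filenames follow a fixed pattern, they can be split into their parts with a single regular expression. The sketch below is based only on the example filenames listed above; `parseLinguaLibreTitle` is a hypothetical name, and the pattern assumes a three-letter iso code and a username without hyphens (the word itself may contain hyphens).

```javascript
// Pattern inferred from the examples, e.g. "File:LL-Q188 (deu)-Djknusper-Hund.wav".
// Assumption: the username contains no hyphen, so the first hyphen after the
// iso code separates the username from the recorded word.
const LL_FILENAME = /^File:LL-(Q\d+)[ _]\(([a-z]{3})\)-(.+?)-(.+)\.wav$/;

// Returns the parts of a Lingualibre file title, or null if the title does
// not follow the convention (e.g. a spontaneously named Commons file).
function parseLinguaLibreTitle(title) {
    const match = LL_FILENAME.exec(title);
    if (match === null) { return null; }
    const [, qid, iso, username, word] = match;
    return {qid, iso, username, word};
}

console.log(parseLinguaLibreTitle('File:LL-Q188 (deu)-Djknusper-Hund.wav'));
console.log(parseLinguaLibreTitle('File:De-blaukarierten.ogg'));
```

Filtering on the parsed `iso` field is one way to separate the German and Norwegian results for the same spelling.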

Filter per language

As pointed out in #1093 : yes we can. Compare the following:

We need to tailor the API request further to your needs. Allow me to put those examples below so you can explore the files' data and try other API request formats.

We could also look for audio pronunciation files with P407:language of work or name --> Q188:German
and P9533:audio transcription --> hund/Hund. (Query still to be found.)

API sandbox

If you want to explore, experiment, and learn, inspect the Commons pages above, looking most importantly for meaningful patterns in their:

  • filenames (top of page)
  • "structured data" (middle of page, you have to click on it)
  • categories (bottom of page)

Then, play with :

@hugolpz

hugolpz commented Jun 25, 2024

Commons > Lingua Libre categories > target language approach

@StefanVukovic99 , I was able to refine the query centered on Commons' Lingualibre categories. See :

Note: this search matches on language + end of string. So a search for eng + -green.wav could also return eng + blue-green.wav.
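One way to deal with such end-of-string matches is to filter the returned titles on the client. The following is a minimal sketch under stated assumptions: `isExactLinguaLibreMatch` and `escapeRegex` are hypothetical helpers, the titles in the usage lines are made up for illustration, and usernames are assumed to contain no hyphen.

```javascript
// Escapes characters that have a special meaning in a JavaScript regex.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Returns true only if the title is a Lingualibre file for the given iso code
// whose recorded word is exactly `word` (case-insensitive).
// Assumption: the username segment contains no hyphen.
function isExactLinguaLibreMatch(title, iso, word) {
    const pattern = new RegExp(
        `^File:LL-Q\\d+[ _]\\(${escapeRegex(iso)}\\)-[^-]+-${escapeRegex(word)}\\.wav$`,
        'i',
    );
    return pattern.test(title);
}

// Made-up titles, for illustration only.
const titles = [
    'File:LL-Q1860 (eng)-SomeUser-green.wav',
    'File:LL-Q1860 (eng)-SomeUser-blue-green.wav',
];
console.log(titles.filter((title) => isExactLinguaLibreMatch(title, 'eng', 'green')));
```

The trade-off is that a username containing a hyphen would be rejected as well, so this filter errs on the side of dropping results.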

Wikibase P407:Language of work approach

We could also look for audio pronunciation files with P407:language of work or name --> Q188:German
and P9533:audio transcription --> hund/Hund. (Query still to be found.)

The current structured data is not open to external queries, so this method doesn't have an API endpoint at the moment.

Only this approach would allow an exact match for word queries while also restricting the search to a language of work.

@hugolpz

hugolpz commented Jun 25, 2024

@StefanVukovic99 : I added some samples with regex search &srsearch=intitle:/-hund.wav/i so you may refine the API request further.

As for performance, the results are as expected:

  • A two-step search is fast (1. find the filename; 2. from it, find the full file path), as in the jsfiddle.
  • A single-step search returning both filename and file path exists, but it can lag more.
  • Regex searches are slower.
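The two-step search above could be sketched with the MediaWiki Action API as follows. This is not necessarily what the jsfiddle or the merged code does: the function names are hypothetical, restricting the search to the File namespace (`srnamespace: '6'`) is an assumption, and the dot in `.wav` is escaped here, which differs slightly from the sample query above.

```javascript
const COMMONS_API = 'https://commons.wikimedia.org/w/api.php';

// Escapes regex metacharacters (and the "/" delimiter) in the searched word.
function escapeRegex(text) {
    return text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Step 1: URL of a title search for files ending in "-{word}.wav".
function buildSearchUrl(word) {
    const params = new URLSearchParams({
        action: 'query',
        list: 'search',
        srsearch: `intitle:/-${escapeRegex(word)}\\.wav/i`,
        srnamespace: '6', // assumption: only search the File namespace
        format: 'json',
        origin: '*', // allows anonymous cross-origin requests
    });
    return `${COMMONS_API}?${params}`;
}

// Step 2: URL of the lookup that resolves a file title to its full file URL.
function buildFileUrlLookup(title) {
    const params = new URLSearchParams({
        action: 'query',
        titles: title,
        prop: 'imageinfo',
        iiprop: 'url',
        format: 'json',
        origin: '*',
    });
    return `${COMMONS_API}?${params}`;
}

console.log(buildSearchUrl('hund'));
console.log(buildFileUrlLookup('File:LL-Q188 (deu)-Djknusper-Hund.wav'));
```

Each URL can be passed to `fetch`; the second request is only needed for the titles that survive any client-side filtering.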

@StefanVukovic99 StefanVukovic99 marked this pull request as ready for review June 26, 2024 20:17
@StefanVukovic99 StefanVukovic99 requested a review from a team as a code owner June 26, 2024 20:17
@StefanVukovic99
Collaborator Author

Thank you very much 🙏
I'm happy with how this works now, it's ready for a review & merge.

I think that, to make use of some of the other Commons files, it would be best to add a separate Wikimedia Commons audio source that uses different regexes to get some of the less systematically named files. That way Lingua Libre can be the 'quality' source and the rest of Commons the 'quantity' source.

Member

@Casheeew Casheeew left a comment


lgtm

@StefanVukovic99 StefanVukovic99 added this pull request to the merge queue Jun 27, 2024
Merged via the queue into yomidevs:master with commit 603c2c7 Jun 27, 2024
11 checks passed
@StefanVukovic99 StefanVukovic99 deleted the lingua-libre-audio-source branch June 27, 2024 16:12
@StefanVukovic99 StefanVukovic99 added kind/enhancement The issue or PR is a new feature or request area/audio The issue or PR is related to audio labels Jun 27, 2024
Development

Successfully merging this pull request may close these issues.

Add Lingua Libre as default audio source
4 participants