add lingua libre audio source #1129
CodSpeed Performance Report: merging #1129 will not alter performance.
Hello @StefanVukovic99, your links above miss Wikimedia Commons' lexical audio assets: categories and files. Commons is a crowdsourced project, so there is much more audio there, but it was contributed in either a chaotic or a structured way.
On the other hand, Lingua Libre audio may make up to 30% of the audio words, but with strict conventions for category names and filenames. You will need a hard-coded table.
Filter per language
As pointed out in #1093: yes, we can. Compare the following:
We need to tailor the API request further to your needs. Allow me to put those examples below so we can explore the data of those files and try other API request formats. We could also look for audio pronunciation files with P407 (language of work or name) --> Q188 (German).
API sandbox
If you want to explore, battle it out yourself and learn, inspect the Commons pages above, most importantly for meaningful patterns in their:
Then, play with:
Commons > Lingua Libre categories > target language approach
@StefanVukovic99, I was able to refine the query centered on Commons' Lingua Libre categories. See:
Note: this search is per language + end of string. So a search for Wikibase
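A minimal sketch of the "language + end of string" search described above, written against the MediaWiki search API on Commons with the CirrusSearch `intitle:/regex/` keyword. The title layout `LL-Q188 (deu)-<speaker>-<term>.wav` is an assumption based on the Lingua Libre naming convention mentioned in this thread, and `build_search_url` and `escape_for_search` are hypothetical helper names, not code from this PR. The function only builds the URL and makes no request.

```python
import re
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def escape_for_search(term: str) -> str:
    """Backslash-escape characters that are likely special in the search regex."""
    return re.sub(r'([.?+*|{}\[\]()"\\#@&<>~/])', r"\\\1", term)


def build_search_url(term: str, qid: str, iso: str, limit: int = 10) -> str:
    """Build a Commons file-search URL for Lingua Libre recordings of `term`."""
    # Assumed title layout: "LL-<qid> (<iso>)-<speaker>-<term>.wav",
    # so the regex pins the language prefix and the end of the title.
    pattern = rf"LL-{qid} \({iso}\)-.+-{escape_for_search(term)}\.wav"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"intitle:/{pattern}/",
        "srnamespace": 6,  # namespace 6 is "File:"
        "srlimit": limit,
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("Hund", "Q188", "deu"))
```

Pinning both the language prefix and the trailing `-<term>.wav` keeps results for other languages and for longer words containing the term out of the response, at the cost of a regex search, which is usually slower than a plain keyword search.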
@StefanVukovic99: I added some samples with regex search. As for performance, we have, as expected:
Thank you very much. I think that, to make use of some of the other Commons files, it would be best to add another,
lgtm
Closes #1093
@hugolpz thanks for the fiddle, it was super helpful. I have some more questions; please answer if you can (or point to documentation):
Searching for "Hund" returns both German and Norwegian entries. Can the search filter by language? "deu" appears in the title: is this a hard rule, so that we can filter on it?
I've tried looking up ".ogg" instead and there are also results. What formats are in use? Is there a convenient way to search for all formats at once, or do we need to merge results from multiple requests?
Searching for "lesen" returns a bunch of PDFs with titles that don't even contain "lesen". Can the search be more exact?
We already have reliable audio sources for Japanese, but is there anything in place on LL's side for languages with ambiguous writing systems? E.g. for Japanese, we use a "term+reading" system so we can differentiate between
ηζ₯(γͺγγγ)
andηζ₯(γγγγγ)
.
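Pending answers to the questions above, one possible client-side fallback is to filter the returned titles by language code, exact term, and audio extension. The sketch below assumes titles of the form `LL-Q188 (deu)-<speaker>-<term>.<ext>`; the extension list, the helper names, and the sample titles are all hypothetical and only illustrate the idea.

```python
import re

# Assumption: the audio formats worth accepting; adjust once the real set is known.
AUDIO_EXTENSIONS = ("wav", "ogg", "oga", "mp3", "flac")

# Assumed Lingua Libre prefix: "LL-Q188 (deu)-", optionally after "File:".
PREFIX_RE = re.compile(r"^(?:File:)?LL-Q\d+ \((?P<iso>[a-z]{3})\)-")


def is_recording_of(title: str, term: str, iso: str) -> bool:
    """True if `title` looks like a Lingua Libre recording of `term` in language `iso`."""
    prefix = PREFIX_RE.match(title)
    if prefix is None or prefix["iso"] != iso:
        return False
    stem, _, ext = title.rpartition(".")
    # The term must be the last hyphen-separated part of the title (case-sensitive).
    return ext.lower() in AUDIO_EXTENSIONS and stem.endswith("-" + term)


def filter_titles(titles: list[str], term: str, iso: str) -> list[str]:
    """Keep only the titles that match the term and language exactly."""
    return [t for t in titles if is_recording_of(t, term, iso)]


if __name__ == "__main__":
    sample = [  # made-up titles for illustration
        "File:LL-Q188 (deu)-SpeakerA-Hund.wav",
        "File:LL-Q9043 (nor)-SpeakerB-Hund.wav",
        "File:LL-Q188 (deu)-SpeakerA-Schäferhund.ogg",
        "File:Lesen lernen.pdf",
    ]
    print(filter_titles(sample, "Hund", "deu"))
```

Matching on the end of the title rather than splitting on hyphens avoids misparsing speaker names that themselves contain hyphens. It does not address the Japanese term+reading case, since a title carries only the written form.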