Will update instructions to use latest wikiextractor #203

Open: wants to merge 1 commit into base: main


raivisdejus

Adjusted the instructions in the README to use a more recent version of wikiextractor. It seems to be able to extract more content. In my tests for Latvian, I get about 5% more sentences when the updated wikiextractor is used.

Member

MichaelKohler left a comment

Hi, and thanks for submitting this PR! Happy to hear that newer versions of the WikiExtractor result in more sentences. Have you noticed any caveats we need to be aware of in terms of the quality of these sentences? I think it might also be worth testing the upgrade on the English dump.

Generally, though, I would prefer referring to a specific commit, as that would ensure that local tests reflect what will eventually be run in the final extraction that gets added to CV. The version used by the automation pipeline is specified in the scripts in https://github.com/common-voice/cv-sentence-extractor/tree/main/scripts/providers. These should use the same commit as the README, so that contributors get a least-surprise experience.

@raivisdejus
Author

It is hard to tell whether the updated wikiextractor is meaningfully better. I noticed that it processes gallery templates in the articles better, so you get some sentences from those. There may be other improvements as well.

If someone follows along and runs tests for another language, this is something to check. I will leave this PR open as a note and will update it to a specific commit if I happen to do further work on this.
