created or.toml #107

Draft
wants to merge 1 commit into
base: main


psubhashish

No description provided.

Member

@MichaelKohler MichaelKohler left a comment


Thanks! You can find the sample extraction here: https://github.com/Common-Voice/cv-sentence-extractor/suites/774458492/artifacts/8112008 (see https://discourse.mozilla.org/t/scraper-automatic-sample-sentences-extracted-in-pull-request/55217 for a full explanation).

I see a few issues at first glance:

  • There are English sentences in the OR Wikipedia? (Having a look at the allowed_symbols config option, instead of using "disallowed_symbols", might help here.)
  • There seem to be more abbreviations
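To illustrate the allow-list idea, here is a minimal Python sketch. It is not the extractor's own rule syntax, only a demonstration of the principle: a sentence passes only if every character is in an explicitly allowed set. The character set below (the Odia Unicode block, the danda signs, whitespace, and a little punctuation) and the name `is_allowed` are assumptions for illustration and would need review by a native speaker.

```python
import re

# Assumption: allowed characters are the Odia Unicode block (U+0B00-U+0B7F),
# the danda and double danda (U+0964, U+0965), whitespace, and basic punctuation.
ALLOWED = re.compile(r"[\u0B00-\u0B7F\u0964\u0965\s.,;:!?'\"()\-]+")


def is_allowed(sentence: str) -> bool:
    """Return True only if every character of the sentence is in the allow-list."""
    return ALLOWED.fullmatch(sentence) is not None
```

With an allow-list, any Latin letter rejects the sentence without having to enumerate every unwanted symbol, which is what a "disallowed" list would require.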

In addition, I highly suggest adding a blocklist as well: https://github.com/Common-Voice/cv-sentence-extractor#using-disallowed-words
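As a rough sketch of what a frequency-based blocklist could look like (the README linked above describes the actual procedure and tooling; this is not it), the Python below counts word occurrences across sentences and lists the rarely used ones. The function name `rare_words`, the punctuation stripped, and the `max_count` threshold are all hypothetical.

```python
from collections import Counter

# Assumption: punctuation to strip from word edges, including the danda.
PUNCTUATION = ".,;:!?\"'()\u0964"


def rare_words(sentences, max_count=2):
    """Return the sorted words that occur at most max_count times overall."""
    counts = Counter(
        word.strip(PUNCTUATION)
        for sentence in sentences
        for word in sentence.split()
    )
    # Drop empty strings left over from tokens that were pure punctuation.
    return sorted(w for w, c in counts.items() if w and c <= max_count)
```

The resulting list would still need a manual look, since a rare word is not necessarily a wrong word.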

Happy to help if you have any questions.

@psubhashish
Author

psubhashish commented Jun 11, 2020

Hey Michael, thanks for flagging these. As a Wikipedia editor myself, I couldn't stop myself from fixing some of the issues that you flagged. :-) So, here goes: I have started checking the English sentences, and some are actual content (the rest are quotes, such as someone saying something about a person, place, or incident; original quotes are kept without translation in some articles), but fixing them will take longer. The good news is that many articles were due for maintenance tags and deletion (oops), and this became a good excuse for some cleanup. Pat yourself on the back, as you indirectly contributed to Wikipedia! I have yet to work on the blocklist.

In the meantime, is it possible to run the code and create such sample text that contains English? Maybe something I can share with the Wikipedia community so more helping hands can clean up. Also, the extractor needs to be told to not collect the citations or footnotes. It's "References" in English Wikipedia and "ଆଧାର" or "ଟୀକା" in Odia. I see some such citations included in the file that you sent.

Your comment says "requested changes". Does that mean that I need to work on the disallowed word list and this article both? I am a bit unsure what is the ask for this very file "or.toml" and would appreciate if you can help.

@MichaelKohler
Member

MichaelKohler commented Jun 11, 2020

> In the meantime, is it possible to run the code and create such sample text that contains English?

You can run the extraction as explained in the README and add the --no_check option, for example:

cargo run -- extract -l or -d ../wikiextractor/text/ --no_check >> wiki.or.all.txt

Note that this will take quite some time, and we will not be able to use that resulting file, as we have a limit of sentences per article we can take.

It might be easier to take the extraction from WikiExtractor and extract the sentences from there; then you don't have to run this script just to identify all sentences. However, you'll need to run it to generate the blocklist, so it's probably a win-win if you do.
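For sharing the English-containing sentences with the Wikipedia community, something along these lines might be enough. It is a minimal Python sketch, assuming a plain-text file with one sentence per line (such as the wiki.or.all.txt output above); the name `lines_with_latin` and the "two or more consecutive Latin letters" heuristic are assumptions, not part of the extractor.

```python
import re

# Assumption: a run of two or more ASCII letters indicates English content.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def lines_with_latin(lines):
    """Yield the stripped, non-empty lines that contain a Latin-letter run."""
    for line in lines:
        text = line.strip()
        if text and LATIN_WORD.search(text):
            yield text
```

It could be fed from an open file handle, e.g. `lines_with_latin(open("wiki.or.all.txt", encoding="utf-8"))`, and the results written to a separate file for review.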

> Also, the extractor needs to be told to not collect the citations or footnotes. It's "References" in English Wikipedia and "ଆଧାର" or "ଟୀକା" in Odia. I see some such citations included in the file that you sent.

As we're using WikiExtractor before running our script, we do not have that info. And as far as I can see there is no such option in WikiExtractor?

> Your comment says "requested changes". Does that mean that I need to work on the disallowed word list and this article both? I am a bit unsure what is the ask for this very file "or.toml" and would appreciate if you can help.

In the end we can merge this PR and run the extraction once the following is achieved:

  • Error rate is below 7% (I think, I always forget the exact number) - Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences, estimate the average error ratio, and comment (or link their comment) in the PR.
  • Rules file is correct from a technical standpoint (this is currently the case, but you most likely will require more changes down the road to decrease the error rate, so I will review again then)
  • (if using a blocklist) We know with what parameters this blocklist got generated
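The review step in the first point could be sketched roughly as follows in Python. This is only an illustration: the function names, the fixed seed, and the idea of averaging per-reviewer rates are assumptions, and the acceptable threshold should be taken from the project documentation rather than from this sketch.

```python
import random


def draw_sample(sentences, size=100, seed=0):
    """Draw a reproducible random sample so every reviewer sees the same set."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(size, len(sentences)))


def average_error_rate(errors_per_reviewer, sample_size):
    """Average the error ratios reported by the individual reviewers."""
    rates = [errors / sample_size for errors in errors_per_reviewer]
    return sum(rates) / len(rates)
```

For example, three reviewers reporting 5, 7, and 9 faulty sentences out of 100 would give an average of about 7%.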

If it's achievable to get the error rate down to an acceptable level only with the rule file and no blocklist, that's an option too, but I heavily doubt that as we've seen a lot of improvement for other languages once a blocklist was added (as described in the README).

@MichaelKohler MichaelKohler marked this pull request as draft September 1, 2020 16:10
@MichaelKohler MichaelKohler changed the base branch from master to main October 27, 2020 17:41
2 participants