
Document how dictionaries are created #102

Closed
juanjoDiaz opened this issue May 31, 2023 · 8 comments · Fixed by #122
Assignees
Labels
documentation Improvements or additions to documentation
Milestone

Comments

@juanjoDiaz
Collaborator

So people can work on them, improve what we have, create custom dictionaries, etc.

@adbar adbar added the documentation Improvements or additions to documentation label May 31, 2023
@adbar adbar self-assigned this May 31, 2023
@adbar
Owner

adbar commented May 31, 2023

Yes, that would be better; I'll work on this.

@adbar
Owner

adbar commented Jun 2, 2023

I'll document it later in a readme in training/, but here are a few conditions:

  • The data is present in two columns: lemma TAB word form
  • Redundant and noisy cases are filtered out, but it's best if the data is clean
  • To do so, you need a source comprising enough words without sacrificing quality; I believe the kaikki.org project is currently the best option for additional languages (I can provide a script later on)
  • The new language (two- or three-letter code) has to be added to the dictionary data (using the dictionary_pickler script); it should then be available in SUPPORTED_LANGUAGES
  • Ideally, the data should be reviewed and, if possible, tested against an authoritative source like UD
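For illustration, the two-column format described above can be loaded with a short script. This is a hedged sketch rather than the project's actual pipeline: the function name, the form-to-lemma orientation, and the filtering rules (skipping malformed lines and redundant form == lemma pairs) are my own assumptions.

```python
from pathlib import Path


def load_lemma_data(path):
    """Read TSV lines of "lemma<TAB>word form" into a form -> lemma dict."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # noisy line: wrong number of columns
        lemma, form = parts
        if not lemma or not form or form == lemma:
            continue  # empty or redundant entry
        table[form] = lemma
    return table
```

A table like this could then be serialized per language code, roughly in the spirit of the dictionary_pickler step mentioned above.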

@adbar adbar added this to the v1.0 milestone Jun 21, 2023
@juanjoDiaz
Collaborator Author

I could create scripts for training, similar to what I did for evaluation in #116.
However, I still don't fully understand where the source data comes from or how you are creating the dictionaries.

Could you elaborate a bit more on that?

@adbar
Owner

adbar commented Sep 18, 2023

@juanjoDiaz Yes, I'll work on the repository in October.

@juanjoDiaz
Collaborator Author

Hi @adbar,

Are you still maintaining this library?
Do you still plan to document this and release v1.0?

@adbar
Owner

adbar commented Mar 6, 2024

Hi, I'm working on something else at the moment but I still plan to work on it.

@adbar adbar linked a pull request Apr 17, 2024 that will close this issue
@juanjoDiaz
Collaborator Author

Hi @adbar ,

I just reviewed the doc that you created and I have some questions:

You mention multiple datasets:

However, I'm unclear about which of those datasets you actually use to create the current dictionaries.
Your documentation only covers Kaikki.
If I want to rebuild the dictionaries from scratch, how could I do it while ensuring the same result?
(This would be needed if we want to re-add to the dictionaries the words removed because they match a rule.)

Most of them seem to just get their data from Wiktionary, don't they?

@adbar
Owner

adbar commented May 17, 2024

This project started as a simple experiment and I didn't implement version control on the data, so what I did is not reproducible, I'm afraid. The lists mentioned in the Readme were first used roughly in that order and with that relative importance.

Small mistakes needed to be fixed in the lists, so there are a lot of small steps involved, which I tried to replicate in the dictionary pickling module, e.g. dropping a pair if the first column is obviously too different from the second one.
Side note: sometimes forms are generated automatically which don't exist in the target language (in the TALP-UPC lists, for example, if I remember correctly). I tried to remove them, but it's not a big deal since they are simply never found, as they don't exist.
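The "obviously too different" check could be approximated with a string-similarity heuristic. This is only a sketch of the idea: the difflib-based ratio and the 0.4 threshold are illustrative assumptions, not the actual rule in the dictionary pickling module.

```python
from difflib import SequenceMatcher


def plausible_pair(lemma, form, threshold=0.4):
    """Heuristic: keep a (lemma, form) pair only if the two strings are
    similar enough that one could plausibly be an inflection of the other."""
    ratio = SequenceMatcher(None, lemma.lower(), form.lower()).ratio()
    return ratio >= threshold
```

Note that such a filter would also reject legitimate suppletive pairs (e.g. go/went), so in practice any threshold would need tuning per language.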

I believe the current state is not a problem as the aggregated lists currently present in the data can be extracted and further refined.

Over time it became clear that Kaikki was the best option for adding new languages, as the data is cleaner and covers more languages. It is not necessarily more comprehensive per language, but it keeps getting better (both the Wiktionary data and the extraction). I expect Kaikki's share to keep growing, since finding and cleaning lists for additional languages is too much of a hassle.
Note: UD lists are good in principle but then you're left with nothing to evaluate the lemmatizer on.
