
Document how dictionaries are created #102

Closed
juanjoDiaz opened this issue May 31, 2023 · 8 comments · Fixed by #122
Assignees
Labels
documentation Improvements or additions to documentation
Milestone

Comments

@juanjoDiaz
Collaborator

So people can work on them, improve what we have, create custom dictionaries, etc.

@adbar adbar added the documentation Improvements or additions to documentation label May 31, 2023
@adbar adbar self-assigned this May 31, 2023
@adbar
Owner

adbar commented May 31, 2023

Yes, that would be better; I'll work on this.

@adbar
Owner

adbar commented Jun 2, 2023

I'll document it later in a readme in training/, but here are a few conditions:

  • The data is present in two columns: lemma TAB word form
  • Redundant and noisy cases are filtered out, but it's best if the data is clean
  • To do so, you need a source comprising enough words without sacrificing quality; I believe the kaikki.org project is currently the best option for additional languages (I can provide a script later on)
  • The new language (two- or three-letter code) has to be added to the dictionary data (using the dictionary_pickler script); it should then be available in SUPPORTED_LANGUAGES
  • Ideally, the data should be reviewed and, if possible, tested against an authoritative source like UD
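For illustration, the two-column format described above can be loaded with a short script. This is a hedged sketch rather than the project's actual pipeline: the function name, the form-to-lemma orientation, and the filtering rules (skipping malformed lines and redundant form == lemma pairs) are my own assumptions.

```python
from pathlib import Path


def load_lemma_data(path):
    """Read TSV lines of "lemma<TAB>word form" into a form -> lemma dict."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # noisy line: wrong number of columns
        lemma, form = parts
        if not lemma or not form or form == lemma:
            continue  # empty or redundant entry
        table[form] = lemma
    return table
```

A table like this could then be serialized per language code, roughly in the spirit of the dictionary_pickler step mentioned above.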

@adbar adbar added this to the v1.0 milestone Jun 21, 2023
@juanjoDiaz
Collaborator Author

I could create scripts for training, similar to what I did for evaluation in #116.
However, I still don't fully understand where the source data comes from or how you are creating the dictionaries.

Could you elaborate a bit more on that?

@adbar
Owner

adbar commented Sep 18, 2023

@juanjoDiaz Yes, I'll work on the repository in October.

@juanjoDiaz
Collaborator Author

Hi @adbar,

Are you still maintaining this library?
Do you still plan to document this and release v1.0?

@adbar
Owner

adbar commented Mar 6, 2024

Hi, I'm working on something else at the moment but I still plan to work on it.

@adbar adbar linked a pull request Apr 17, 2024 that will close this issue
@juanjoDiaz
Collaborator Author

Hi @adbar ,

I just reviewed the doc that you created and I have some questions:

You mention multiple datasets:

However, I'm unclear about which of those datasets you actually use to create the current dictionaries.
Your documentation only covers Kaikki.
If I want to rebuild the dictionaries from scratch, how could I do it while ensuring the same result?
(This would be needed if we want to re-add to the dictionaries the words removed because they match a rule.)

Most of them seem to just get their data from Wiktionary, don't they?

@adbar
Owner

adbar commented May 17, 2024

This project started as a simple experiment and I didn't implement version control on the data, so what I did is not reproducible, I'm afraid. The lists mentioned in the Readme were first used roughly in that order and with that relative importance.

Small mistakes needed to be fixed in the lists, so there are a lot of small steps involved, which I tried to replicate in the dictionary pickling module, e.g. dropping a pair if the first column is obviously too different from the second one.
Side note: sometimes forms are generated automatically which don't exist in the target language (in the TALP-UPC lists, for example, if I remember correctly). I tried to remove them, but it's not a big deal since they are simply never found, as they don't exist.
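The "obviously too different" check could be approximated with a string-similarity heuristic. This is only a sketch of the idea: the difflib-based ratio and the 0.4 threshold are illustrative assumptions, not the actual rule in the dictionary pickling module.

```python
from difflib import SequenceMatcher


def plausible_pair(lemma, form, threshold=0.4):
    """Heuristic: keep a (lemma, form) pair only if the two strings are
    similar enough that one could plausibly be an inflection of the other."""
    ratio = SequenceMatcher(None, lemma.lower(), form.lower()).ratio()
    return ratio >= threshold
```

Note that such a filter would also reject legitimate suppletive pairs (e.g. go/went), so in practice any threshold would need tuning per language.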

I believe the current state is not a problem as the aggregated lists currently present in the data can be extracted and further refined.

Over time it became clear that Kaikki was the best option for adding new languages, as the data is cleaner and covers more languages. It is not necessarily more comprehensive per language, but it keeps getting better (both the Wiktionary data and the extraction). I expect Kaikki's share to keep growing, since finding and cleaning lists for additional languages is too much of a hassle.
Note: UD lists are good in principle but then you're left with nothing to evaluate the lemmatizer on.
