Document how dictionaries are created #102
Comments
Yes, it would be better, I'll work on this. |
I'll document it later in a readme in
|
I could create scripts similar to what I did for evaluation on #116, but for training. Could you elaborate a bit more on that? |
@juanjoDiaz Yes, I'll work on the repository in October. |
Hi @adbar, are you still maintaining this library? |
Hi, I'm working on something else at the moment but I still plan to work on it. |
Hi @adbar, I just reviewed the doc that you created and I have some questions. You mention multiple datasets:
However, it's unclear to me which of those datasets you actually use to create the current dictionary. Most of them seem to just get their data from Wiktionary, don't they? |
This project started as a simple experiment and I didn't put the data under version control, so I'm afraid what I did is not reproducible. The lists mentioned in the Readme were first used roughly in that order and with that importance. Small mistakes needed to be fixed in the lists, so there are a lot of small steps involved, which I tried to replicate in the dictionary pickling module, e.g. handling cases where the first column is obviously too different from the second one. I believe the current state is not a problem, as the aggregated lists currently present in the data can be extracted and further refined. Over time it became clear that Kaikki was the best option for adding new languages, as the data is cleaner and covers more languages. It is not necessarily more comprehensive per language, but it keeps getting better (both the Wiktionary data and the extraction). I expect Kaikki's share to keep growing; finding and cleaning lists for additional languages is too much of a hassle. |
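For anyone who wants to refine the lists themselves, the "columns too different" check mentioned above could look roughly like the sketch below. This is only an illustration, not the code in the pickling module: the tab-separated two-column layout, the function names `plausible_pair` and `filter_pairs`, the use of `difflib` and the `0.4` threshold are all assumptions.

```python
from difflib import SequenceMatcher


def plausible_pair(first: str, second: str, min_ratio: float = 0.4) -> bool:
    """Return True if the two columns look like forms of the same word.

    Hypothetical heuristic: compares the strings with difflib and keeps
    the pair only if the similarity ratio reaches min_ratio.
    """
    if not first or not second:
        return False
    ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
    return ratio >= min_ratio


def filter_pairs(lines, min_ratio: float = 0.4):
    """Yield (first, second) tuples from tab-separated lines.

    Rows that do not have exactly two columns, or whose columns are
    too dissimilar, are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # malformed row
        first, second = (p.strip() for p in parts)
        if plausible_pair(first, second, min_ratio):
            yield first, second


# Made-up sample rows, not taken from the project's data.
sample = ["house\thouses", "run\trunning", "table\txyzzy", "broken line"]
kept = list(filter_pairs(sample))
```

Note that a purely string-based check like this would also drop legitimate irregular pairs (e.g. "go" / "went"), so in practice it would need an exception list or manual review.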
So people can work on them, improve what we have, create custom dictionaries, etc.