Add Swabian Dialect #2937
😅 |
It might be worth taking a broader approach and covering the complete Alemannic dialect group (ISO: als). This would include Badenian and also Swiss dialects, which would greatly increase the number of possible contributors (10 million speakers instead of just 800,000). There is also an Alemannic Wikipedia (https://als.m.wikipedia.org/), so at least some training data for the base model is available, while the number of purely Swabian texts for training a language model is probably very small. I understand that this would create issues with dialects that are not exactly the same, but as long as the dialect isn't too thick this should work in written conversations. BTW, you can make ChatGPT speak pretty good Badenian using this prompt, so it is definitely possible to make a language model speak German dialects: https://github.com/stefangrotz/prompts/blob/main/alemanic-assistant.md |
@stefangrotz according to this Wikipedia article the language code I actually think, while a mixed approach will probably get more labelers and contributors involved in the system, that for training the model it won't make much of a difference overall if you have a solution that tries to merge all dialects into one. It will rather cause massive conflicts between the labelers (since some Swabian and Badenian words and expressions are just different), and then stuff would get up/downvoted between Swabian and Badenian native speakers all the time, reducing overall quality. Since the model is multilingual anyway, it will get e.g. a German prompt and then know that it's supposed to generate in German. The same will be the case for these dialects, if there is enough training data. The key is the multilinguality of our system: it can still learn from closely related languages. If you, for instance, prompt in Norwegian (I think there were ~100 messages trained when I checked), it will sometimes answer in English, Norwegian, Danish or Swedish. These languages are already closely related and there are not many messages currently, so the system cannot distinguish them properly. However, if you take close languages with more trees, it becomes better and better. But what this shows is that there is no point in a My Norwegian friend could understand the answers it generated even though stuff was mixed up with Danish and Swedish, but once more data exists the languages will become more distinct... I think it is better to have one tree per dialect (maybe we generally need to think of a way of handling sub-languages, e.g. all British/German dialects) because then you So I really think that if you can make a separation between the dialects, it really should be done. And yes, even though 800,000 native speakers isn't too many, I think it's a good experiment as well, and I know a bunch of people who would get involved in labeling. 
@stefangrotz I personally suggest that, if you speak any Alemannic dialects, you make a PR with these dialects added and their respective language codes. I'm not a linguist, but I think everything that is distinct enough to have one of these language codes is probably worth adding (if there are contributors who are willing to help). According to the Alemannic German article I think we would probably want to make PRs for all 4 distinct Alemannic languages that used to be under the
Maybe we'll need to discuss more how we should handle dialects generally. Maybe the core team, i.e. @AbdBarho could join the discussion and potentially review this PR. |
Okay, I see your points. If you believe that you can mobilize enough people to create a Swabian dataset, then a separate language version would be fine with me. Adding many small dialects could lead to a very long language list, but this isn't necessarily a bad thing, and Open Assistant is probably the only place where a dialect dialogue dataset can be built up right now. The only real downside is more work during the data export. |
Yeah! I think in the long run we might want to add something like a conditional dialects panel that shows up when dialects exist, especially if there are a lot of dialects that can be attributed to a language. Taking German as an example, I think there are many other dialects that could be added (Plattdeutsch, Sächsisch etc.) So maybe one could then choose Also, if you think for instance about the many Indian and Chinese dialects that are spoken by millions of people, having a good way of dealing with that could be genius for all the native speakers! |
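A minimal sketch of the conditional dialects panel idea from the comment above, assuming a simple lookup from a parent language code to its dialect codes. The names `dialectsByLanguage`, `dialectsFor` and `shouldShowDialectPanel` are hypothetical and not part of the Open Assistant codebase; the entries are only illustrative.

```typescript
// Hypothetical registry: parent language code -> dialect codes attributed to it.
// Only illustrative entries; "swg" is Swabian, "nds" is Low German (Plattdeutsch).
const dialectsByLanguage: Record<string, string[]> = {
  de: ["swg", "nds"],
};

// Returns the dialect codes registered for a language, or an empty list.
function dialectsFor(languageCode: string): string[] {
  return dialectsByLanguage[languageCode] ?? [];
}

// The panel is shown only when at least one dialect exists for the language.
function shouldShowDialectPanel(languageCode: string): boolean {
  return dialectsFor(languageCode).length > 0;
}
```

With this shape, adding a new dialect is a one-line change to the registry, and languages without registered dialects keep the existing UI.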
I believe a simple labeling system for both language variants and specialist topics will be necessary at some point. I also worked on the Common Voice project, and this is an ongoing issue there too. Languages like Portuguese, where the Brazilian variant is very different, should be split up in some way, but doing it is hard. I think a labeling system inside a standardized language is the easiest solution. For dialects a separate corpus might still be a better solution, but there is a thin line between a variant and a dialect. For this PR, just adding Swabian as a new language looks like the only possible solution for now. |
@stefangrotz that's good to know, awesome work! I think handling dialects won't ever be straightforward, especially since it is hard to determine how closely related a language or dialect really is from the original source and some are more closely related and others are further away 🤔 - I think fortunately Swabian is quite distinct and can be added as a separate language for now (most "normal" Germans barely understand it, if even a bit), but in the long run I think it's a good idea to discuss the way we want to integrate dialects, related languages and variants into the Open Assistant 👍 @yk could you potentially review this discussion and merge this PR? |
@Logophoman Chrome currently does not return the correct display name for "swg" (returns just "swg") .. a mapping for
@andreaskoepf Added the Mapping to the Open-Assistant/website/src/lib/languages.ts and also fixed some typos I made in my initial commit. |
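A rough sketch of what such a display-name fallback could look like, assuming the browser's `Intl.DisplayNames` is consulted only when no explicit override exists. The names `missingDisplayNames` and `getLanguageName` and the label "Schwäbisch" are assumptions for illustration; the actual contents of `languages.ts` are not shown in this thread.

```typescript
// Hypothetical override table for codes the browser cannot name (e.g. "swg").
const missingDisplayNames: Record<string, string> = {
  swg: "Schwäbisch",
};

// Resolves a human-readable language name, preferring explicit overrides.
function getLanguageName(code: string, displayLocale: string = code): string {
  const override = missingDisplayNames[code];
  if (override !== undefined) {
    return override;
  }
  try {
    // fallback: "none" makes of() return undefined instead of echoing the code.
    const names = new Intl.DisplayNames([displayLocale], {
      type: "language",
      fallback: "none",
    });
    const name = names.of(code);
    if (name !== undefined) {
      return name;
    }
  } catch {
    // Invalid codes or locales throw a RangeError; fall through to the raw code.
  }
  return code;
}
```

Keeping the override first means the result for `swg` does not depend on which locale data a given browser ships.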
@Logophoman thanks, could you quickly resolve the conflicts? |
Now we just need at least five Swabians to join in.. ;-)
Added a PR as requested in #2877
This is the translation of the interface for the German dialect "Schwäbisch".
According to Wiktionary and ISO 639-3, the language code is swg.
Here is the Language/Dialect Description page on Wikipedia: Swabian German
I would love to see people contributing to this dialect and will be adding prompts and assistant replies as soon as it is pulled 😊