Adding new languages #20

Open
asahi417 opened this issue Nov 10, 2023 · 5 comments

Comments

@asahi417
Owner

asahi417 commented Nov 10, 2023

This thread is for adding more languages to lmqg and https://autoqg.net/. If you would like to contribute, please comment here with a QA dataset we could use to train a QAG model for that language. We need at least 10k QA pairs for model training.

Example:
Language: Turkish
Dataset: https://github.com/TQuad/turkish-nlp-qa-dataset
Size: 8308

@asahi417
Owner Author

asahi417 commented Nov 10, 2023

Language: Bengali
Dataset: https://huggingface.co/datasets/csebuetnlp/squad_bn
Size: 127,771/2,502/2,504

@asahi417
Owner Author

Language: Chinese
Dataset: https://github.com/junzeng-pluto/ChineseSquad

@asahi417
Owner Author

asahi417 commented Nov 13, 2023

> Language: Chinese
> Dataset: https://github.com/junzeng-pluto/ChineseSquad

Chinese QAG is available on https://autoqg.net/ and lmqg now! With lmqg, you can use it as below.

from lmqg import TransformersQG

model = TransformersQG(language="zh")
context = "与转导或结合不同,转化依赖于大量的细菌基因产物,这些基因产物专门相互作用来完成这个复杂的过程,因此转化显然是细菌对DNA转移的适应。为了使细菌结合、吸收供体DNA并将其重组为自己的染色体,它必须首先进入一种称为能力的特殊生理状态(见自然能力)。在枯草芽孢杆菌中,大约40个基因是培养能力所必需的。枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。转化在细菌物种中似乎很常见,到目前为止,已知至少有60种物种具有自然转化能力。自然界能力的发展通常与应激性环境条件有关,似乎是一种促进受体细胞DNA损伤修复的适应。"
# generate_qa returns a list of (question, answer) tuples
print(model.generate_qa(context))
# Output:
# [('在染色体中发现的DNA长度是多少?', '枯草芽孢杆菌转化过程中转移的DNA长度可以在染色体的三分之一到整个染色体之间。')]

@pawanGithub10

Language: Hindi
Dataset: https://github.com/google-deepmind/xquad/blob/master/xquad.hi.json
Please tell me in detail what needs to be done to contribute.

@asahi417
Owner Author

> Language: Hindi
> Dataset: https://github.com/google-deepmind/xquad/blob/master/xquad.hi.json
> Please tell me in detail what needs to be done to contribute.

This dataset is too small. I checked it and there are 1,190 QA pairs in total. Ideally, there should be around 10k pairs, as we are going to train relatively small models (~300M parameters).
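For anyone who wants to check a candidate dataset's size before proposing it, here is a minimal sketch. It assumes the file follows the SQuAD-style JSON layout (`data` → `paragraphs` → `qas`), which XQuAD uses. The names `count_qa_pairs`, `count_qa_pairs_in_file`, and `MIN_QA_PAIRS` are hypothetical helpers for illustration and are not part of lmqg. The sample data is made up.

```python
import json

# Rough threshold mentioned in this thread (an assumption, not an lmqg constant).
MIN_QA_PAIRS = 10_000


def count_qa_pairs(squad: dict) -> int:
    """Count question entries in a SQuAD-style dictionary."""
    return sum(
        len(paragraph["qas"])
        for article in squad["data"]
        for paragraph in article["paragraphs"]
    )


def count_qa_pairs_in_file(path: str) -> int:
    """Load a SQuAD-style JSON file from disk and count its QA pairs."""
    with open(path, encoding="utf-8") as f:
        return count_qa_pairs(json.load(f))


# Tiny in-memory sample in the SQuAD layout (illustrative content only).
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "2",
                            "question": "Paris is the capital of which country?",
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                    ],
                }
            ]
        }
    ]
}

n = count_qa_pairs(sample)
verdict = "enough" if n >= MIN_QA_PAIRS else "too small"
print(f"{n} QA pairs: {verdict}")
```

To check a real file, download it and call `count_qa_pairs_in_file("xquad.hi.json")` with the local path.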
