Not an issue but a question for going forwards #227

Open
thusinh1969 opened this issue Jun 12, 2023 · 1 comment

thusinh1969 commented Jun 12, 2023

Hi,

I see that this repo focuses ONLY on fine-tuning (with LoRA) for Chinese. However, LLaMA was trained mostly on an English corpus, and its vocabulary of about 30,000 tokens is VERY small and English-focused.

How would you describe the quality / perplexity of the results (7B or 13B) with LoRA alone, without expanding the Chinese vocabulary before fine-tuning? Would you suggest that full fine-tuning, or LoRA fine-tuning on a large (non-instruction) corpus, is a better way to go?

I am about to train LLaMA for Vietnamese, so I would like to know more about your experience. I am also referring to https://github.com/ymcui/Chinese-LLaMA-Alpaca, which says that LoRA pre-training on a large corpus plus vocabulary expansion should be done first, so I am a bit confused.

Thanks for any input.
Steve

Facico (Owner) commented Jun 29, 2023

Here is a similar issue: #12

Thank you for your interest in our project. LLaMA is a multilingual model and does have some proficiency in Chinese. Because a strong Chinese base model was lacking, we chose LLaMA as the foundation.

Given sufficient hardware resources, full fine-tuning, as done for FastChat's Vicuna, would certainly yield better results than LoRA.
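To give a rough sense of the hardware gap between the two approaches, here is a minimal sketch that compares trainable parameter counts for a single square projection matrix. The hidden size 4096 matches LLaMA-7B; the rank of 8 and the helper names are assumptions chosen for illustration, not this repo's actual configuration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a LoRA update B @ A,
    where A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


hidden = 4096  # hidden size of LLaMA-7B
rank = 8       # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The much smaller number of trainable weights is why LoRA fits on modest hardware, and also why it has less capacity to absorb a new language than full fine-tuning.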

The vocabulary-expansion method used by Chinese-LLaMA-Alpaca also requires extensive pre-training, which is feasible if the hardware is adequate. LLaMA's tokenizer can encode many Chinese characters, but only a limited number of them map one-to-one to a single token, hence the need for vocabulary expansion.
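The effect of a missing one-to-one encoding can be illustrated with a toy tokenizer. This is only a sketch under stated assumptions: `encode_with_byte_fallback` and `toy_vocab` are hypothetical, and the real LLaMA tokenizer works on subword pieces rather than single characters. The sketch shows only the byte-fallback idea, where a character missing from the vocabulary costs one token per UTF-8 byte.

```python
def encode_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Toy character-level tokenizer (hypothetical, for illustration).

    A character found in `vocab` becomes one token; any other character
    falls back to one token per UTF-8 byte.
    """
    tokens: list[str] = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


# Assumed toy vocabulary: lowercase ASCII letters, space, and one Chinese character.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ") | {"中"}

before = encode_with_byte_fallback("中文", toy_vocab)
# Vocabulary expansion: add the missing character as its own token.
after = encode_with_byte_fallback("中文", toy_vocab | {"文"})

print(before, len(before))
print(after, len(after))
```

In this toy setting the same text takes fewer tokens after expansion, but the new token has no trained embedding yet, which is why expansion is paired with further pre-training.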
