I found that this repo focuses ONLY on fine-tuning (with LoRA) for the Chinese language. However, LLaMA was trained mostly on an English corpus, with a vocabulary of about 30,000 tokens, which is VERY small for an English-focused LLM.
How would you describe the quality / perplexity of the result (7B or 13B) with LoRA only, without expanding the Chinese vocabulary before fine-tuning? Would you suggest that full fine-tuning, or LoRA fine-tuning on a large (non-instruct) corpus, is a better way to go?
I am about to train LLaMA for Vietnamese, so I would like to know more about your experience. I am also referring to https://github.com/ymcui/Chinese-LLaMA-Alpaca, which says that LoRA pre-training on a large corpus plus vocabulary expansion should be done first, so I am a bit confused.
Thanks for any input.
Steve
Thank you for your interest in our project. LLaMA is a multilingual model and does have some proficiency in Chinese. Since no strong Chinese base model was available, we chose LLaMA as the foundation.
Given sufficient hardware resources, full fine-tuning, as done for FastChat's Vicuna, would certainly yield better results than LoRA.
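To give a rough sense of the hardware trade-off, here is a minimal sketch of the trainable-parameter arithmetic for one square projection matrix. The hidden size of 4096 matches LLaMA-7B; the rank of 8 and the function names are assumptions chosen for illustration, not this repo's actual configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Weights updated when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 4096  # LLaMA-7B hidden size
rank = 8       # hypothetical LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions LoRA trains well under 1% of the weights of that matrix, which is why it fits on modest hardware, while full fine-tuning updates every weight.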
The vocabulary expansion used by Chinese-LLaMA-Alpaca also requires extensive pretraining, which is feasible if the hardware allows. LLaMA's own tokenizer can already encode many Chinese characters, but only a limited number of them map one-to-one to a single token, hence the need for vocabulary expansion.
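The effect of limited one-to-one encoding can be sketched with a toy model of byte fallback, where a character missing from the vocabulary is split into its UTF-8 bytes, one token per byte. This is a character-level simplification, not the real LLaMA tokenizer (which also merges multi-character pieces); the vocabularies and the function name below are hypothetical.

```python
def count_tokens_with_byte_fallback(text: str, vocab: set) -> int:
    """Count tokens when out-of-vocabulary characters fall back to UTF-8 bytes."""
    total = 0
    for ch in text:
        if ch in vocab:
            total += 1                       # character has its own token
        else:
            total += len(ch.encode("utf-8"))  # one token per UTF-8 byte
    return total


text = "你好世界"
small_vocab = {"你", "好"}                  # hypothetical: only two characters covered
expanded_vocab = small_vocab | {"世", "界"}  # hypothetical vocabulary after expansion

print(count_tokens_with_byte_fallback(text, small_vocab))     # 2 + 3 + 3 = 8
print(count_tokens_with_byte_fallback(text, expanded_vocab))  # 4
```

Longer token sequences for the same text mean less content per context window and slower generation, which is the cost that vocabulary expansion aims to reduce.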