Additional fixes #4
Conversation
@tokkoro Can I merge this?
inference-mistral.py (outdated)
```diff
@@ -3,6 +3,7 @@
 import torch
 import transformers
+import datasets
```
Is this used somehow?
You should install the pre-commit hooks to run ruff locally before committing. It would catch these kinds of issues and fix them for you when you are about to commit new code.

To install:

```
pip install pre-commit
pre-commit install
```

Then you can run `ruff .` to see all issues, or `ruff . --fix` to fix them. Or run them via pre-commit without committing: `pre-commit run --all-files`. The configs for pre-commit are in https://github.com/valohai/mistral-example/blob/main/.pre-commit-config.yaml
inference-mistral.py (outdated)
```diff
-parser.add_argument('--prompt', type=str, required=True, help='Input prompt for text generation')
+parser.add_argument('--prompt', type=str, help='Input prompt for text generation')
```
The prompt is required, though, isn't it? You can't infer without one, and passing `None` would fail.
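A minimal sketch of the point, using plain argparse and only the flag already shown in the diff: with `required=True` the script fails fast with a clear usage error, instead of letting `prompt=None` reach the model later.

```python
import argparse

parser = argparse.ArgumentParser()
# With required=True, argparse exits with a usage error when --prompt is
# missing, rather than silently setting args.prompt to None.
parser.add_argument('--prompt', type=str, required=True,
                    help='Input prompt for text generation')
args = parser.parse_args()

print(f'Generating from prompt: {args.prompt!r}')
```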
inference-mistral.py (outdated)
```python
parser.add_argument('--base_mistral_model', type=str, default='mistralai/Mistral-7B-v0.1',
                    help='Base mistral from hugging face')
```
Unnecessary change here.
data-preprocess.py (outdated)
```diff
-self.data_path = args.data_path or valohai.inputs('dataset').path()
+self.data_path = args.data_path
 self.model_max_length = args.model_max_length
 self.tokenizer = args.tokenizer
-self.train_dataset = load_dataset(self.data_path, split='train')
-self.eval_dataset = load_dataset(self.data_path, split='validation')
-self.test_dataset = load_dataset(self.data_path, split='test')
+self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
+self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
+self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
```
Don't these changes mean that the `data_path` argument isn't used anymore? If so, it should be removed altogether. However, I think it would be better to allow it, and compose these paths based on it (and default it to the `dataset` input directory)?

Maybe more broadly: what's the reason behind this change? Doesn't this work like the original code did?
In the original code the `dataset` input was pointing to the viggo.py file, which loaded the data from HF for us. We decided that's not an intuitive way to load the data, and that it is better to load from CSV. When loading from CSV you should pass the path to the exact file, so I don't see how we could have one `data_path`.

Maybe it would look better if we had three inputs like train_data, test_data, val_data? Or should I delete the `data_path` arg for good?
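For comparison, a minimal sketch of the three-inputs idea floated above; the input names `train_data`, `val_data`, and `test_data` are hypothetical, not existing config:

```python
import valohai
from datasets import load_dataset

# Hypothetical alternative: one Valohai input per split instead of a single
# 'dataset' input directory. The three input names are illustrative only.
train_dataset = load_dataset('csv', data_files=valohai.inputs('train_data').path())
eval_dataset = load_dataset('csv', data_files=valohai.inputs('val_data').path())
test_dataset = load_dataset('csv', data_files=valohai.inputs('test_data').path())
```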
> When loading from CSV you should pass the path to the exact file, so I don't see how we could have one `data_path`.
It'd be the path to the directory with the files:
Suggested change:

```diff
-self.data_path = args.data_path
+self.data_path = args.data_path or valohai.inputs('dataset').path()
 self.model_max_length = args.model_max_length
 self.tokenizer = args.tokenizer
-self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
-self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
-self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
+self.train_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'train.csv'))
+self.eval_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'validation.csv'))
+self.test_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'test.csv'))
```
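A self-contained sketch of that suggestion; the class name and `__init__` signature are illustrative, only the body mirrors the suggested lines:

```python
import os

import valohai
from datasets import load_dataset


class DataPreprocessor:  # hypothetical name for the surrounding class
    def __init__(self, args):
        # Fall back to the Valohai 'dataset' input directory when no
        # explicit --data_path is given, as the suggestion proposes.
        self.data_path = args.data_path or valohai.inputs('dataset').path()
        self.model_max_length = args.model_max_length
        self.tokenizer = args.tokenizer
        # Compose the per-split CSV paths from the single directory path.
        self.train_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'train.csv'))
        self.eval_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'validation.csv'))
        self.test_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'test.csv'))
```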
valohai/llm-toolkit:

- Moved the viggo.py file from the inputs to the data-prep step; the data is now loaded without the HF script.
- Added a prepare_prompt method to inference to have the proper input to the model when inferencing.
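A hypothetical sketch of what such a prepare_prompt helper could look like; the instruction template below is an assumption, not the repository's actual one:

```python
def prepare_prompt(sentence: str) -> str:
    """Wrap the raw input in the same instruction template used during
    fine-tuning, so the model sees a familiar format at inference time.
    The template here is illustrative only."""
    return (
        '### Instruction:\n'
        'Construct the meaning representation of the target sentence.\n'
        f'### Target sentence:\n{sentence}\n'
        '### Meaning representation:\n'
    )


# Example usage before tokenization (the tokenizer call is an assumption):
# inputs = tokenizer(prepare_prompt(args.prompt), return_tensors='pt')
```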