
Additional fixes #4

Merged
3 commits merged into main on Nov 7, 2023
Conversation

SofiaChar (Collaborator)

  1. New Docker image valohai/llm-toolkit
  2. Fix an error in data-prep
  3. Delete the viggo.py file from the inputs of the data-prep step; the data is now loaded without an HF script
  4. Minor change to the fine-tuning parameters
  5. Add a prepare_prompt method to inference so the model receives properly formatted input at inference time
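
A prepare_prompt method as described in point 5 could look roughly like the sketch below. The function name comes from the PR description, but the template text is an assumption (a typical ViGGO-style instruction format), not the actual diff: the idea is to wrap the raw input in the same instruction layout used during fine-tuning, so inference inputs match the training distribution.

```python
def prepare_prompt(target: str) -> str:
    # Hypothetical template: mirror the instruction format used during
    # fine-tuning so the model sees inputs shaped like its training data.
    return (
        "Given a target sentence, construct the underlying meaning "
        "representation of the input sentence.\n"
        "### Target sentence:\n"
        f"{target}\n"
        "### Meaning representation:\n"
    )

print(prepare_prompt("Dirt: Showdown is a sport racing game."))
```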

@SofiaChar (Collaborator, Author)

@tokkoro Can I merge this?

@SofiaChar SofiaChar requested a review from akx November 7, 2023 11:53
@@ -3,6 +3,7 @@

import torch
import transformers
import datasets
Contributor

Is this used somehow?

You should install the pre-commit hooks to run ruff locally before committing. It would catch these kinds of issues and fix them for you when you're about to commit new code.

To install:

pip install pre-commit
pre-commit install

Then you can run ruff . to see all issues, or ruff . --fix to fix them. Or run them via pre-commit without committing: pre-commit run --all-files. The pre-commit configs are in https://github.com/valohai/mistral-example/blob/main/.pre-commit-config.yaml
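
For reference, a minimal .pre-commit-config.yaml wiring up the ruff hook might look like this (a sketch only; the actual file in the repo and the pinned rev may differ):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.4  # illustrative pin; use the repo's actual rev
    hooks:
      - id: ruff
        args: [--fix]
```

With this in place, pre-commit install registers the hook so ruff runs automatically on every git commit.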

Comment on lines 59 to 72
parser.add_argument('--prompt', type=str, required=True, help='Input prompt for text generation')
parser.add_argument('--prompt', type=str, help='Input prompt for text generation')
Member

The prompt is required, though, isn't it? You can't infer without one, and passing None would fail.
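
To illustrate the point (a minimal argparse sketch mirroring the diff, not the script itself): with required=True, argparse rejects a missing --prompt up front; without it, args.prompt silently becomes None and generation fails later.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--prompt', type=str, required=True,
                    help='Input prompt for text generation')

# argparse exits with an error if --prompt is absent; without
# required=True, args.prompt would silently default to None.
args = parser.parse_args(['--prompt', 'Describe the game Tomb Raider.'])
print(args.prompt)
```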

Comment on lines 67 to 68
parser.add_argument('--base_mistral_model', type=str, default='mistralai/Mistral-7B-v0.1',
help='Base mistral from hugging face')
Member

Unnecessary change here.

inference-mistral.py (review comment resolved)
Comment on lines 15 to 20
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.data_path = args.data_path
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset(self.data_path, split='train')
self.eval_dataset = load_dataset(self.data_path, split='validation')
self.test_dataset = load_dataset(self.data_path, split='test')
self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
Member

Don't these changes mean that the data_path argument isn't used anymore? If so, it should be removed altogether.

However, I think it would be better to allow it, and compose these paths based on it (and default it to the dataset input directory)?

Maybe more broadly – what's the reason behind this change? Doesn't this work like the original code did?

Collaborator Author

In the original code, the dataset input pointed to the viggo.py file, which loaded the data from HF for us. We decided that's not an intuitive way to load the data, and that it's better to load from CSV. When loading from CSV, you have to pass the path to the exact file, so I don't see how we can have a single data_path.
Maybe it would look better if we had three inputs: train_data, test_data, val_data?
Or should I delete the data_path arg for good?

Member

When loading from csv you should pass the path to the exact file. I don’t see how we can have one data_path.

It'd be the path to the directory with the files:

Suggested change
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.data_path = args.data_path
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset(self.data_path, split='train')
self.eval_dataset = load_dataset(self.data_path, split='validation')
self.test_dataset = load_dataset(self.data_path, split='test')
self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'train.csv'))
self.eval_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'validation.csv'))
self.test_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'test.csv'))
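
The core of the suggestion, composing per-split CSV paths from one dataset directory, can be sketched without the Valohai and datasets dependencies (the helper name here is illustrative, not from the diff):

```python
import os

def split_files(data_path: str) -> dict:
    # Compose per-split CSV paths from a single dataset directory,
    # as the suggested change does with os.path.join.
    return {
        split: os.path.join(data_path, f"{split}.csv")
        for split in ("train", "validation", "test")
    }

paths = split_files("/valohai/inputs/dataset")
print(paths["train"])  # /valohai/inputs/dataset/train.csv
```

This keeps a single data_path argument working: it can default to the dataset input directory and still be overridden from the command line.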

@SofiaChar SofiaChar requested a review from akx November 7, 2023 15:09
@akx akx merged commit 1f56c0f into main Nov 7, 2023
1 check passed