
Additional fixes #4

Merged
3 commits merged into main on Nov 7, 2023
Conversation

SofiaChar (Collaborator)

  1. New Docker image valohai/llm-toolkit
  2. Fix an error in data-prep
  3. Delete the viggo.py file from the inputs of the data-prep step; the data is now loaded without an HF script
  4. Minor change to the fine-tuning parameters
  5. Add a prepare_prompt method to inference so the model receives properly formatted input at inference time
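
A prepare_prompt method as described in point 5 could look roughly like the sketch below. The function name comes from the PR description, but the template text is an assumption (a typical ViGGO-style instruction format), not the actual diff: the idea is to wrap the raw input in the same instruction layout used during fine-tuning, so inference inputs match the training distribution.

```python
def prepare_prompt(target: str) -> str:
    # Hypothetical template: mirror the instruction format used during
    # fine-tuning so the model sees inputs shaped like its training data.
    return (
        "Given a target sentence, construct the underlying meaning "
        "representation of the input sentence.\n"
        "### Target sentence:\n"
        f"{target}\n"
        "### Meaning representation:\n"
    )

print(prepare_prompt("Dirt: Showdown is a sport racing game."))
```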

@SofiaChar (Collaborator, Author)

@tokkoro Can I merge this?

@SofiaChar SofiaChar requested a review from akx November 7, 2023 11:53
@@ -3,6 +3,7 @@

import torch
import transformers
import datasets
Contributor

Is this used somehow?

You should install the pre-commit hooks to run ruff locally before committing. It would catch these kinds of issues and fix them for you when you're about to commit new code.

To install:

pip install pre-commit
pre-commit install

Then you can run ruff . to see all issues, or ruff . --fix to fix them. Or run them via pre-commit without committing: pre-commit run --all-files. The pre-commit configs are in https://github.com/valohai/mistral-example/blob/main/.pre-commit-config.yaml
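
For reference, a minimal .pre-commit-config.yaml wiring up the ruff hook might look like this (a sketch only; the actual file in the repo and the pinned rev may differ):

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.1.4  # illustrative pin; use the repo's actual rev
    hooks:
      - id: ruff
        args: [--fix]
```

With this in place, pre-commit install registers the hook so ruff runs automatically on every git commit.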

Comment on lines 59 to 72
parser.add_argument('--prompt', type=str, required=True, help='Input prompt for text generation')
parser.add_argument('--prompt', type=str, help='Input prompt for text generation')
Member

The prompt is required, though, isn't it? You can't infer without one, and passing None would fail.
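
To illustrate the point (a minimal argparse sketch mirroring the diff, not the script itself): with required=True, argparse rejects a missing --prompt up front; without it, args.prompt silently becomes None and generation fails later.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--prompt', type=str, required=True,
                    help='Input prompt for text generation')

# argparse exits with an error if --prompt is absent; without
# required=True, args.prompt would silently default to None.
args = parser.parse_args(['--prompt', 'Describe the game Tomb Raider.'])
print(args.prompt)
```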

Comment on lines 67 to 68
parser.add_argument('--base_mistral_model', type=str, default='mistralai/Mistral-7B-v0.1',
help='Base mistral from hugging face')
Member

Unnecessary change here.

inference-mistral.py (review comment resolved)
Comment on lines 15 to 20
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.data_path = args.data_path
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset(self.data_path, split='train')
self.eval_dataset = load_dataset(self.data_path, split='validation')
self.test_dataset = load_dataset(self.data_path, split='test')
self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
Member

Don't these changes mean that the data_path argument isn't used anymore? If so, it should be removed altogether.

However, I think it would be better to allow it, and compose these paths based on it (and default it to the dataset input directory)?

Maybe more broadly – what's the reason behind this change? Doesn't this work like the original code did?

Collaborator Author

In the original code, the dataset input pointed to the viggo.py file, which loaded the data from HF for us. We decided that's not an intuitive way to load the data, and that it's better to load from CSV. When loading from CSV, you have to pass the path to the exact file, so I don't see how we can have a single data_path.
Maybe it would look better if we had three inputs: train_data, test_data, val_data?
Or should I delete the data_path arg for good?

Member

When loading from csv you should pass the path to the exact file. I don’t see how we can have one data_path.

It'd be the path to the directory with the files:

Suggested change
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.data_path = args.data_path
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset(self.data_path, split='train')
self.eval_dataset = load_dataset(self.data_path, split='validation')
self.test_dataset = load_dataset(self.data_path, split='test')
self.train_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('train.csv'))
self.eval_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('validation.csv'))
self.test_dataset = load_dataset('csv', data_files=valohai.inputs('dataset').path('test.csv'))
self.data_path = args.data_path or valohai.inputs('dataset').path()
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.train_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'train.csv'))
self.eval_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'validation.csv'))
self.test_dataset = load_dataset('csv', data_files=os.path.join(self.data_path, 'test.csv'))
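
The core of the suggestion, composing per-split CSV paths from one dataset directory, can be sketched without the Valohai and datasets dependencies (the helper name here is illustrative, not from the diff):

```python
import os

def split_files(data_path: str) -> dict:
    # Compose per-split CSV paths from a single dataset directory,
    # as the suggested change does with os.path.join.
    return {
        split: os.path.join(data_path, f"{split}.csv")
        for split in ("train", "validation", "test")
    }

paths = split_files("/valohai/inputs/dataset")
print(paths["train"])  # /valohai/inputs/dataset/train.csv
```

This keeps a single data_path argument working: it can default to the dataset input directory and still be overridden from the command line.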

@SofiaChar SofiaChar requested a review from akx November 7, 2023 15:09
@akx akx merged commit 1f56c0f into main Nov 7, 2023
1 check passed