
Adding two notebooks on fine-tuning gemini 1.5 using new experimental google gen ai sdk #1516

Merged: 2 commits merged into GoogleCloudPlatform:main on Dec 12, 2024

Conversation

erwinh85 (Member)

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Ensure your Pull Request title follows the https://www.conventionalcommits.org/ specification.
  • Ensure the tests and linter pass (Run nox -s format from the repository root to format).
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@erwinh85 erwinh85 requested a review from a team as a code owner December 12, 2024 05:29

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use GitHub's automerge feature with squashing, or use the automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/


@code-review-assist bot (Contributor) left a comment

Hi @erwinh85 and reviewers,

I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary to help everyone get up to speed:

This PR adds two Jupyter notebooks demonstrating how to fine-tune the Gemini 1.5 Flash model for question answering using the new experimental Google Gen AI SDK. The notebooks cover various aspects, including:

  • Setting up the environment and installing necessary packages.
  • Authenticating with Google Cloud and initializing the Gen AI SDK.
  • Preparing the SQuAD dataset for fine-tuning (including data normalization and formatting into JSONL).
  • Establishing a baseline by evaluating the performance of the base Gemini model.
  • Running the fine-tuning job on Vertex AI using the specified configurations.
  • Monitoring the training and evaluation metrics using TensorBoard.
  • Evaluating the performance of the fine-tuned model.

The main changes involve the addition of two new .ipynb files (gemini/tuning/gen_ai_sdk_supervised_finetuning_using_gemini_qa.ipynb) containing comprehensive code and explanations for the fine-tuning process. The notebooks guide users through each step, from data preparation to model evaluation, providing a practical example of how to leverage the new SDK for supervised fine-tuning.
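As a quick illustration of the JSONL preparation step listed above, here is a minimal sketch. The record layout follows the Vertex AI supervised tuning schema for Gemini; the example rows and the column names (input_question, output_answer) are assumptions for illustration, not the notebook's exact code.

import json

# Hypothetical example rows; the notebook derives these from the SQuAD dataset.
rows = [
    {"input_question": "What is the capital of France?", "output_answer": "Paris"},
]

# Each JSONL line pairs a user turn with the expected model turn.
with open("squad_train.jsonl", "w") as f:
    for row in rows:
        record = {
            "contents": [
                {"role": "user", "parts": [{"text": row["input_question"]}]},
                {"role": "model", "parts": [{"text": row["output_answer"]}]},
            ]
        }
        f.write(json.dumps(record) + "\n")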

Now for a little fun, a haiku to brighten your day:

Models learn and grow,
Fine-tuning's gentle touch guides,
Answers clear and bright.

Looking forward to the discussion!

@code-review-assist bot (Contributor) left a comment

This pull request introduces a new notebook that demonstrates how to fine-tune Gemini 1.5 Flash for question answering using the Google Gen AI SDK. The notebook is well-structured and provides a comprehensive walkthrough of the fine-tuning process, including data preparation, model training, evaluation, and usage. The use of the SQuAD dataset is appropriate for the task, and the explanations provided throughout the notebook are generally clear and helpful. However, there are a few areas where the notebook could be improved, particularly in terms of code efficiency, error handling, and data handling.

Comment on lines +615 to +630
"source": [
"def get_predictions(question: str, model_version: str) -> str:\n",
"\n",
" prompt = question\n",
" base_model = model_version\n",
"\n",
" response = client.models.generate_content(\n",
" model = base_model,\n",
" contents = prompt,\n",
" config={\n",
" 'system_instruction': systemInstruct,\n",
" 'temperature': 0.3,\n",
" },\n",
" )\n",
"\n",
" return response.text"

medium

The get_predictions function re-sends the same system instruction and generation config on every call. Since the model and system instruction remain constant throughout the notebook, it's better to set them up once outside the function and reuse them for all predictions. This reduces per-call overhead and duplication, especially when making many predictions.

Suggested change
"source": [
"def get_predictions(question: str, model_version: str) -> str:\n",
"\n",
" prompt = question\n",
" base_model = model_version\n",
"\n",
" response = client.models.generate_content(\n",
" model = base_model,\n",
" contents = prompt,\n",
" config={\n",
" 'system_instruction': systemInstruct,\n",
" 'temperature': 0.3,\n",
" },\n",
" )\n",
"\n",
" return response.text"
def get_predictions(question: str, model: genai.GenerativeModel) -> str:
prompt = question
response = model.generate_content(
contents = prompt,
config={
'temperature': 0.3,
},
)
return response.text
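If the notebook keeps the experimental Google Gen AI SDK style (client.models.generate_content) rather than switching to genai.GenerativeModel, the same saving is available by building the client and config once. A minimal sketch, assuming systemInstruct is defined earlier in the notebook and using placeholder project/location values:

from google import genai
from google.genai import types

# Created once and reused by every prediction call (placeholder project/location).
client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")
config = types.GenerateContentConfig(
    system_instruction=systemInstruct,  # defined earlier in the notebook
    temperature=0.3,
)

def get_predictions(question: str, model_version: str) -> str:
    response = client.models.generate_content(
        model=model_version,
        contents=question,
        config=config,
    )
    return response.text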

Comment on lines +672 to +674
"# Apply the get_prediction() function to the 'question_column'\n",
"test_df[\"predicted_answer\"] = test_df[\"input_question\"].apply(get_predictions)\n",
"test_df.head(2)"

medium

The current implementation uses .apply to generate predictions for each question, which can be inefficient for large datasets due to the overhead of individual API calls. Batch prediction is a more efficient approach, as it allows sending multiple questions to the API at once, reducing overhead and improving performance. Consider using the batch_generate_content method for batch prediction.

Suggested change
"# Apply the get_prediction() function to the 'question_column'\n",
"test_df[\"predicted_answer\"] = test_df[\"input_question\"].apply(get_predictions)\n",
"test_df.head(2)"
def batch_get_predictions(questions: list[str], model: genai.GenerativeModel) -> list[str]:
responses = model.batch_generate_content(
contents=questions,
config={
'temperature': 0.3,
},
)
return [response.text for response in responses]
predictions = batch_get_predictions(test_df['input_question'].tolist(),tuned_model)
test_df["predicted_answer"] = predictions
test_df.head(2)
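Note that batch_generate_content may not exist in every SDK version; if it doesn't, a thread pool achieves a similar overlap of the per-question calls with no new API surface. A sketch under that assumption, reusing the notebook's get_predictions helper and a placeholder model name:

from concurrent.futures import ThreadPoolExecutor

def threaded_predictions(questions: list[str], model_version: str, workers: int = 8) -> list[str]:
    # Overlap the individual generate_content calls; map() preserves result order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: get_predictions(q, model_version), questions))

test_df["predicted_answer"] = threaded_predictions(
    test_df["input_question"].tolist(), "gemini-1.5-flash-002"  # placeholder model name
)
test_df.head(2)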

Comment on lines +942 to +949
"train_dataset = f\"\"\"{BUCKET_URI}/squad_train.jsonl\"\"\"\n",
"validation_dataset = f\"\"\"{BUCKET_URI}/squad_train.jsonl\"\"\"\n",
"\n",
"training_dataset= {\n",
" 'gcs_uri': train_dataset,\n",
"}\n",
"\n",
"validation_dataset = types.TuningValidationDataset(gcs_uri=validation_dataset)"

critical

The validation dataset path is the same as the training dataset path. This is likely a mistake. Using the same data for training and validation will not give you a true measure of the model's performance on unseen data. It's crucial to use a separate validation dataset to evaluate how well the model generalizes. Please double-check the paths and ensure they point to the correct training and validation datasets.

Suggested change
"train_dataset = f\"\"\"{BUCKET_URI}/squad_train.jsonl\"\"\"\n",
"validation_dataset = f\"\"\"{BUCKET_URI}/squad_train.jsonl\"\"\"\n",
"\n",
"training_dataset= {\n",
" 'gcs_uri': train_dataset,\n",
"}\n",
"\n",
"validation_dataset = types.TuningValidationDataset(gcs_uri=validation_dataset)"
train_dataset = f"""{BUCKET_URI}/squad_train.jsonl"""
validation_dataset = f"""{BUCKET_URI}/squad_validation.jsonl"""
training_dataset= {
'gcs_uri': train_dataset,
}
validation_dataset = types.TuningValidationDataset(gcs_uri=validation_dataset)
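Whichever paths end up being used, a one-line guard makes this class of mix-up fail fast before a tuning job is launched. A trivial sketch, comparing the URIs before they are wrapped in dataset objects:

train_uri = f"{BUCKET_URI}/squad_train.jsonl"
validation_uri = f"{BUCKET_URI}/squad_validation.jsonl"

# Fail fast if the validation split accidentally reuses the training file.
assert train_uri != validation_uri, "validation must not reuse the training split"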

Comment on lines +1268 to +1270
"# Apply the get_prediction() function to the 'question_column'\n",
"test_df[\"predicted_answer\"] = test_df[\"input_question\"].apply(get_predictions)\n",
"test_df.head(2)"

medium

Similar to a previous comment, using .apply for predictions can be inefficient. Consider batching the requests.

Suggested change
"# Apply the get_prediction() function to the 'question_column'\n",
"test_df[\"predicted_answer\"] = test_df[\"input_question\"].apply(get_predictions)\n",
"test_df.head(2)"
# Apply the get_prediction() function to the 'question_column'
predictions = batch_get_predictions(test_df['input_question'].tolist(),tuned_model)
test_df["predicted_answer"] = predictions
test_df.head(2)

@lavinigam-gcp (Member) left a comment

LGTM

@erwinh85 erwinh85 merged commit 1ad19c5 into GoogleCloudPlatform:main Dec 12, 2024
3 of 5 checks passed