I am using the DDPO logic to fine-tune my own model.
However, I noticed that the example reward function (LLaVA BERTScore) uses a fixed batch size.
After reading the source code in this repo and the TRL DDPOTrainer class, it seems this batch size may be related to sample_batch_size.
I recommend either taking the batch size from the config or adding a comment explaining the choice. That would give people who want to design their own reward functions a more sensible guide (see the sketch after the code below).
Below is the example reward function in this repo that I mentioned above.
def llava_bertscore():
    """Submits images to LLaVA and computes a reward by comparing the responses to the prompts
    using BERTScore. See https://github.com/kvablack/LLaVA-server for server-side code.
    """
    import requests
    from requests.adapters import HTTPAdapter, Retry
    from io import BytesIO
    import pickle

    batch_size = 16
    url = "http://127.0.0.1:8085"
    sess = requests.Session()
    retries = Retry(
        total=1000, backoff_factor=1, status_forcelist=[500], allowed_methods=False
    )
    sess.mount("http://", HTTPAdapter(max_retries=retries))

    def _fn(images, prompts, metadata):
        del metadata
        if isinstance(images, torch.Tensor):
            images = (images * 255).round().clamp(0, 255).to(torch.uint8).cpu().numpy()
            images = images.transpose(0, 2, 3, 1)  # NCHW -> NHWC
        images_batched = np.array_split(images, np.ceil(len(images) / batch_size))
        prompts_batched = np.array_split(prompts, np.ceil(len(prompts) / batch_size))
        ...
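A minimal sketch of what I am suggesting, assuming the reward factory could simply take the batch size as an argument (the batch_size parameter and the return _fn shown here are my additions for illustration, not the repo's current API):

def llava_bertscore(batch_size=16):
    """Same reward as above, except the internal batch size is an argument
    instead of a hard-coded constant (hypothetical change for illustration)."""
    import requests
    from requests.adapters import HTTPAdapter, Retry

    url = "http://127.0.0.1:8085"
    sess = requests.Session()
    retries = Retry(
        total=1000, backoff_factor=1, status_forcelist=[500], allowed_methods=False
    )
    sess.mount("http://", HTTPAdapter(max_retries=retries))

    def _fn(images, prompts, metadata):
        del metadata
        # ... unchanged: convert tensors to uint8 NHWC arrays, then split
        # `images`/`prompts` into chunks of `batch_size` and query the server
        ...

    return _fn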
And here is the code that calls compute_rewards() in the DDPOTrainer class in the TRL repo:
def step(self, epoch: int, global_step: int):
    """
    Perform a single step of training.

    Args:
        epoch (int): The current epoch.
        global_step (int): The current global step.

    Side Effects:
        - Model weights are updated
        - Logs the statistics to the accelerator trackers.
        - If `self.image_samples_callback` is not None, it will be called with the
          prompt_image_pairs, global_step, and the accelerator tracker.

    Returns:
        global_step (int): The updated global step.
    """
    samples, prompt_image_data = self._generate_samples(
        iterations=self.config.sample_num_batches_per_epoch,
        batch_size=self.config.sample_batch_size,
    )

    # collate samples into dict where each entry has shape (num_batches_per_epoch * sample.batch_size, ...)
    samples = {k: torch.cat([s[k] for s in samples]) for k in samples[0].keys()}

    rewards, rewards_metadata = self.compute_rewards(
        prompt_image_data, is_async=self.config.async_reward_computation
    )
    ...
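With a change like the sketch above, the reward's batch size could be wired to the same config the trainer reads, instead of being fixed at 16. A hypothetical usage, assuming TRL's DDPOTrainer(config, reward_function, prompt_function, sd_pipeline) constructor and a prompt_fn/pipeline set up elsewhere:

# hypothetical wiring: reuse the trainer's sampling batch size for the reward
reward_fn = llava_bertscore(batch_size=config.sample_batch_size)
trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)

Alternatively, a comment next to batch_size = 16 clarifying that it only controls how many images are sent to the LLaVA server per request would already make the example much easier to adapt.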