
The implementation of Truth Ratio and Probability is different from the definition in the paper #26

Open · wzunknown opened this issue Apr 3, 2024 · 13 comments

@wzunknown

Truth Ratio

In the paper, the truth ratio is defined as:
[image: the paper's definition of the truth ratio — the averaged normalized probability of the perturbed answers divided by the normalized probability of the paraphrased answer]
The normalization is defined as:
[image: the paper's length normalization of the answer probability]
The code implementation is:

# getting Truth Ratio
avg_paraphrase_np_values = np.array(list(eval_result_dict[k]['avg_paraphrased_loss'].values()))
avg_perturbed_np_values = np.array(list(eval_result_dict[k]['average_perturb_loss'].values()))
avg_perturbed_np_values = avg_perturbed_np_values.mean(axis=-1)

curr_stat_1 = np.exp(avg_perturbed_np_values - avg_paraphrase_np_values)
# output_result[f'{eval_task_dict[k]} paraphrased_over_perturbed'] = curr_stat_1
if 'forget' in k:
    paraphrased_perturb_ratio = np.mean(np.minimum(curr_stat_1, 1/curr_stat_1))
else:
    paraphrased_perturb_ratio = np.mean(np.maximum(0, 1 - 1/curr_stat_1))
output_result[f'Truth Ratio {eval_task_dict[k]}'] = paraphrased_perturb_ratio

Regarding this code, there are two questions:

  1. L69: the normalization for the "forget" branch takes the minimum of a normalized probability and its reciprocal, which doesn't make sense and is different from the paper.
  2. L64: the mean operation is over the log probs, but the average in the paper is over the probs (see the sketch below).
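
For reference, a minimal sketch (not the repo's code) of what the statistic would look like if the average were taken over probabilities rather than log probs, reusing eval_result_dict and k from the snippet above and assuming the stored losses are average per-token negative log-likelihoods:

import numpy as np

# hypothetical sketch, not the repository's implementation
avg_paraphrase_np_values = np.array(list(eval_result_dict[k]['avg_paraphrased_loss'].values()))  # shape (N,)
perturb_np_values = np.array(list(eval_result_dict[k]['average_perturb_loss'].values()))         # shape (N, n_perturb)

paraphrase_prob = np.exp(-avg_paraphrase_np_values)        # length-normalized P(paraphrased answer)
perturb_probs = np.exp(-perturb_np_values)                 # length-normalized P(perturbed answers)
r_truth = perturb_probs.mean(axis=-1) / paraphrase_prob    # arithmetic mean over perturbations, then the ratio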

Probability

The probability score for Real Authors and World Facts is defined in the paper as a ratio of the original (unnormalized) probabilities, but in the code (L50-L53) it is computed as a ratio of length-normalized probabilities.
[image: the paper's definition of the probability score for Real Authors and World Facts]

# getting Probability
if 'eval_log' in k:
    gt_probs = np.exp(-1 * np.array(list(eval_result_dict[k]['avg_gt_loss'].values())))
    avg_gt_prob = np.mean(gt_probs)
else:
    avg_true_prob = np.exp(-1 * np.array(list(eval_result_dict[k]['avg_gt_loss'].values())))
    avg_false_prob = np.exp(-1 * np.array(list(eval_result_dict[k]['average_perturb_loss'].values())))
    avg_all_prob = np.concatenate([np.expand_dims(avg_true_prob, axis=-1), avg_false_prob], axis=1).sum(-1)
    avg_gt_prob = np.mean(avg_true_prob/avg_all_prob)
output_result[f'Prob. {eval_task_dict[k]}'] = avg_gt_prob
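
As a toy illustration of the distinction between the original and the length-normalized probability (hypothetical per-token losses; the eval logs only store the average loss per answer, so the raw sequence probability would additionally require the answer length):

import numpy as np

token_losses = np.array([0.2, 1.0, 0.3, 0.5])     # hypothetical per-token negative log-likelihoods of one answer
raw_prob = np.exp(-token_losses.sum())             # P(a|q): product of the token probabilities
normalized_prob = np.exp(-token_losses.mean())     # P(a|q)^(1/|a|): length-normalized probability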

Any help is appreciated!

@zhilif
Collaborator

zhilif commented Apr 3, 2024

Thanks for your post!

  1. Reciprocals: in the paper, the reciprocal was used for uploading to the Hugging Face leaderboard as well as for plotting the line plots. We want a number in [0, 1] so that it is on the same scale as everything else. In the paper, forget quality is always computed using the KS test on R_truth without the reciprocal. The reciprocal does make sense: a model that hasn't seen this data should be indifferent to the paraphrased or the perturbed answer, i.e. the truth ratio should be around 1. A truth ratio that is either too large or too small indicates some leakage, so for a private model you expect the truth ratio to be close to 1. This should be stressed in the paper; we will make that adjustment. (See the sketch after this list.)
  2. Probs vs. log probs: we realized this and have updated the paper; the arXiv preprint will be updated. Let's stick with log probs.
  3. Good catch. We will also update the paper. The ratio should be over the probability normalized by the answer length.
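
A minimal sketch of the two uses described in point 1, with stand-in truth-ratio arrays (not the repository's code):

import numpy as np
from scipy.stats import ks_2samp

# stand-in truth ratios for an unlearned model and the retain-only model
r_truth_unlearned = np.random.lognormal(mean=0.0, sigma=0.3, size=300)
r_truth_retain = np.random.lognormal(mean=0.0, sigma=0.3, size=300)

# leaderboard / line plots: cap into [0, 1], treating R and 1/R symmetrically
capped = np.minimum(r_truth_unlearned, 1.0 / r_truth_unlearned)

# forget quality: two-sample KS test on the uncapped truth ratios
forget_quality = ks_2samp(r_truth_unlearned, r_truth_retain).pvalue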

Let us know if you have other questions as well.

@molereddy

Why does a very large truth ratio have to always indicate leakage?

I observed that the base Phi model (which hasn't seen TOFU) has a high truth ratio of 0.759, compared to the finetuned and unlearned models, which hover around 0.5. The base Phi model should be the most indifferent to the paraphrased or perturbed answers.

I don't think the truth ratio from the paper is simply a measure of indifference to the paraphrased or perturbed answers. The numerator averages over a number of different perturbations; the denominator has a single paraphrased answer that we try to get our model to score well on.

@zhilif
Collaborator

zhilif commented Apr 3, 2024

> Why does a very large truth ratio have to always indicate leakage?
>
> I observed that the base Phi model (which hasn't seen TOFU) has a high truth ratio of 0.759, compared to the finetuned and unlearned models, which hover around 0.5. The base Phi model should be the most indifferent to the paraphrased or perturbed answers.
>
> I don't think the truth ratio from the paper is simply a measure of indifference to the paraphrased or perturbed answers. The numerator averages over a number of different perturbations; the denominator has a single paraphrased answer that we try to get our model to score well on.

Large, as in the ratio may be bigger than 1: it is a ratio of probabilities, so it is not bounded above. We don't observe this in practice, but it is possible.

@molereddy

What was the observed forget truth ratio for the model trained only on the retain set? Was it nearly 1?

@wzunknown
Author

wzunknown commented Apr 3, 2024

Computing the mean over log probs (a geometric mean) versus over probs (an arithmetic mean) may not have a significant advantage or disadvantage either way. Personally I would prefer the arithmetic mean, which is simpler and behaves linearly. Also, you already take a geometric mean over the tokens within a sequence, which is essentially the (inverse of the) perplexity and has the meaning of a cross entropy.
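
As a toy illustration of the two aggregations (made-up numbers):

import numpy as np

probs = np.array([0.8, 0.5, 0.01])             # hypothetical length-normalized probs of three sequences
arithmetic_mean = probs.mean()                  # mean over probs, ~0.44
geometric_mean = np.exp(np.log(probs).mean())   # mean over log probs then exponentiate, ~0.16
# the geometric mean is pulled down much harder by the low-probability outlier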

For the truth ratio I have some more comments. For the Forget Quality evaluation, we want the model to behave indistinguishably between paraphrased answers and perturbed answers. So for the current definition of the truth ratio, both R_truth > 1 and R_truth < 1 should be penalized, and the min(x, 1/x) operation seems suitable. But from this perspective it still puts the paraphrased answer in an asymmetric position; I would prefer some metric that treats all answers symmetrically.

@wzunknown
Author

wzunknown commented Apr 3, 2024

> What was the observed forget truth ratio for the model trained only on the retain set? Was it nearly 1?

@molereddy data from https://github.com/locuslab/tofu/blob/8889542f281f7fca9ad23dbc11a4cb253ee2aa65/data/ft_epoch5_lr1e-05_llama2-7b_retain90_wd0/eval_results/ds_size300/eval_log_aggregated.json

Code here: https://gist.github.com/wzunknown/060349e2bbdc72fd664e4e7f0e8a9dc2

Using the mean of log probs:
[image: forget-set truth ratio for the retain model, computed with the mean of log probs]
Using the mean of probs:
[image: forget-set truth ratio for the retain model, computed with the mean of probs]

@molereddy

@wzunknown thank you for that! That makes sense. I observed an average of ~0.75 with the base Phi model. I guess around 0.75 is the best value this definition of the truth ratio gets to.

@wzunknown
Author

wzunknown commented Apr 5, 2024

Also, why do you choose probability-based metrics? Why not make it a multiple-choice question and ask the model for the answer (A/B/C/D/etc.)?

And when measuring the Truth Ratio for the retain set, Real Authors, or World Facts, do you think computing max(P_perturbed) / P_paraphrased would be better?

@wzunknown
Author

The plot below shows the distribution of max(P_perturbed) / P_paraphrased for the retain model on the forget set. It shows that even the retain model has some preference bias for the paraphrased answers.
[image: distribution of max(P_perturbed) / P_paraphrased for the retain model on the forget set]
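
A minimal sketch of how this statistic could be computed, reusing the loss dictionaries from the repo snippet quoted at the top of the thread (hypothetical, not the repository's code):

import numpy as np

# eval_result_dict and k as in the snippet at the top of the thread
para_loss = np.array(list(eval_result_dict[k]['avg_paraphrased_loss'].values()))   # shape (N,)
pert_loss = np.array(list(eval_result_dict[k]['average_perturb_loss'].values()))   # shape (N, n_perturb)

p_paraphrased = np.exp(-para_loss)                  # length-normalized prob of the paraphrased answer
p_perturbed_max = np.exp(-pert_loss).max(axis=-1)   # strongest perturbed answer per question
ratio = p_perturbed_max / p_paraphrased             # max(P_perturbed) / P_paraphrased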

@wtma1999

@wzunknown Thank you for sharing! But in my experiments, I observe that the Truth Ratio on Real Authors and World Facts is around 0.6 with llama2-7b finetuned on TOFU, and there is no obvious change with unlearning, which is different from Figure 8 in TOFU. Have you observed similar phenomena?
[image: Truth Ratio on Real Authors and World Facts over unlearning from wtma1999's experiments]

@zhilif
Collaborator

zhilif commented May 12, 2024

@wtma1999 We fixed a bug in the original implementation; the revision will be posted on arXiv later, but here is the updated figure. Is this similar to what you see?
[image: updated Truth Ratio figure after the bug fix]

@wtma1999

@zhilif Thank you for your prompt response and for sharing your observations on the Truth Ratio results.
I've noticed that my Truth Ratio scores on World Facts, Real Authors, and the Forget Set are similar to yours, but on the Retain Set mine is around 0.4, which is lower than your score. What is your current implementation? Is it consistent with the calculation in the code repository (max(0, 1 - R_truth))?

Additionally, I have a question regarding the baseline chosen for your step-ablation experiment. If you used Gradient Difference as the baseline, would the metric on the holdout set show an upward trend or keep a high score (contrary to the trend in the figure)? My understanding is that during the forgetting process, knowledge of the retained set should be continuously strengthened. Did you use Gradient Ascent for this purpose?

I would greatly appreciate any insights or clarifications you can provide on these matters.

@zhilif
Collaborator

zhilif commented May 19, 2024

@wtma1999 I have just uploaded the notebooks that we use to draw the figure (they probably need some cleanup; I will get back to this after NeurIPS), see https://github.com/locuslab/tofu/blob/refactor_eval/notebook/line_plot.ipynb.

For Gradient Difference, I interpret "holdout set" as the "retain set" that you don't want to forget. The reason it's not going upward is that we only subsample n examples from the retain set, where n is the number of examples we want to forget (so if we want to forget 20 entries, we only subsample 20 entries to retain), so the signal for retaining is not as strong. The rationale behind this choice is that we want to restrict the computational budget; otherwise you could just try to maintain performance on the whole retain set, which can be computationally expensive.
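
A toy sketch of the subsampling described above (stand-in sizes, not the repository's code):

import random

forget_set = list(range(20))        # stand-in: 20 entries to forget
retain_set = list(range(3600))      # stand-in: the remaining retain entries

# only as many retain entries are drawn as there are forget entries,
# so the retain signal is much weaker than training on the full retain set
retain_subsample = random.sample(retain_set, len(forget_set))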
