The implementation of Truth Ratio and Probability is different from the definition in the paper #26
Comments
Thanks for your post!
Let us know if you have other questions as well.
Why does a very large truth ratio always have to indicate leakage? I observed that the base Phi model (which hasn't seen TOFU) has a high truth ratio of 0.759, compared to the finetuned and unlearned models, which hover around 0.5. The base Phi model should be the most indifferent to the paraphrased or perturbed answers. I don't think the truth ratio from the paper is simply a measure of indifference to the paraphrased or perturbed answers: the numerator averages over a number of different perturbations, while the denominator has a single paraphrased answer that we want the model to assign a high score to.
A large truth ratio may be bigger than 1: it is a ratio of probabilities, so it is not bounded above. We don't observe this in practice, but it is possible.
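As a toy example with made-up numbers: if the averaged normalized probability of the perturbed answers is 0.4 and the normalized probability of the paraphrased answer is 0.2, then

$$R_{\text{truth}} = \frac{0.4}{0.2} = 2 > 1.$$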
What was the observed forget truth ratio for the retain-only trained model? Was it nearly 1?
The mean computed either on log-probs (geometric mean) or probs (arithmetic mean) may not have significant advantages or disadvantages. Personally I would prefer the arithmetic mean, which is simpler and behaves linearly. Also, you have computed a geometric mean over the tokens of one sequence, which is the same as the perplexity and has the meaning of a cross entropy.

For the truth ratio I have some more comments. For the Forget Quality evaluation, we want the model to behave indistinguishably between paraphrased answers and perturbed answers. So under the current definition of the truth ratio, both R_truth > 1 and R_truth < 1 should be punished, and the min(x, 1/x) operation seems suitable. But from this perspective it still puts the paraphrased answer in an asymmetric position; I would prefer a metric that treats all answers symmetrically.
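For concreteness, a minimal sketch of the two kinds of averaging over one sequence's tokens (plain NumPy with made-up per-token log-probs, not the repository's code):

```python
import numpy as np

# Made-up per-token log-probabilities of one answer given its question.
token_logprobs = np.array([-0.2, -0.5, -0.3])

# Geometric mean of token probabilities = exp(mean log-prob)
# = P(a|q)^(1/|a|), i.e. the inverse of the perplexity.
geometric = np.exp(token_logprobs.mean())

# Arithmetic mean of token probabilities: simpler and linear in the probs.
arithmetic = np.exp(token_logprobs).mean()

print(geometric, arithmetic)  # the two normalizations generally differ
```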
@molereddy data from https://github.com/locuslab/tofu/blob/8889542f281f7fca9ad23dbc11a4cb253ee2aa65/data/ft_epoch5_lr1e-05_llama2-7b_retain90_wd0/eval_results/ds_size300/eval_log_aggregated.json
Code here: https://gist.github.com/wzunknown/060349e2bbdc72fd664e4e7f0e8a9dc2
@wzunknown thank you for that! That makes sense. I observed a ~0.75 average with the base Phi model. I guess the best value this definition of the truth ratio gets to is around 0.75.
Also, why do you choose the probability metrics? Why not make it a multiple-choice question and ask for the answer (A/B/C/D/etc.)? And when measuring the Truth Ratio for the retain set, real authors, or world facts, do you think computing max(P_perturbed) / P_paraphrased would be better?
@wzunknown Thank you for sharing! But in my experiments I observe that the Truth Ratio on Real Authors and World Facts is around 0.6 with the llama2-7b TOFU model, and there is no obvious change with unlearning, which is different from Figure 8 in TOFU. Have you observed similar phenomena?
@wtma1999 We fixed a bug in the original implementation; the revision will be posted on arXiv later. Here is the updated figure: is this similar to what you see?
@zhilif Thank you for your prompt response and for sharing your observations on the Truth Ratio results. Additionally, I have a question regarding the baseline chosen for your step-ablation experiment. If you used Gradient Difference as the baseline, would the metric on the holdout set show an upward trend or stay at a high value (contrary to the trend in the figure)? My understanding is that during the forgetting process, the knowledge of the retain set should be continuously strengthened. Did you use Gradient Ascent for this purpose? I would greatly appreciate any insights or clarifications you can provide on these matters.
@wtma1999 I have just uploaded the notebooks that we use to draw the figure (they probably need some cleanup; I will get back to this after NeurIPS), see https://github.com/locuslab/tofu/blob/refactor_eval/notebook/line_plot.ipynb. For Gradient Difference, I interpret "holdout set" as the "retain set" that you don't want to forget. The reason it is not going upward is that we only subsample n data points from the retain set, where n is the number of data points we want to forget (so if we want to forget 20 entries, we only subsample 20 entries to retain). So the signal for retaining is not as strong. The rationale behind this choice is that we want to restrict the computational cost; otherwise you could simply try to maintain performance on the entire retain set, which can be computationally expensive.
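A minimal sketch of the subsampling described above (illustrative only; `retain_set` and `forget_set` are placeholder names, not the actual data loaders):

```python
import random

def subsample_retain(retain_set, forget_set, seed=42):
    """Keep only as many retain examples as there are forget examples,
    so the retain term in Gradient Difference stays cheap to compute."""
    rng = random.Random(seed)
    return rng.sample(retain_set, k=len(forget_set))

# Forgetting 20 entries means training against only 20 retain entries.
retain_subset = subsample_retain(list(range(3600)), list(range(20)))
print(len(retain_subset))  # 20
```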
Truth Ratio
In the paper, the truth ratio is defined as

$$R_{\text{truth}} = \frac{\frac{1}{|\mathcal{A}_{\text{pert}}|}\sum_{\hat{a}\in\mathcal{A}_{\text{pert}}} P(\hat{a}\mid q)^{1/|\hat{a}|}}{P(\tilde{a}\mid q)^{1/|\tilde{a}|}},$$

where $\tilde{a}$ is the paraphrased answer and $\mathcal{A}_{\text{pert}}$ is the set of perturbed answers.
The normalization is defined as the length-normalized conditional probability

$$P(a\mid q)^{1/|a|},$$

where $|a|$ is the number of tokens in the answer $a$.
The code implementation is in tofu/aggregate_eval_stat.py, lines 60 to 72 at commit 8889542.
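The snippet itself is not reproduced above, so here is a rough sketch of the kind of aggregation being discussed (not the actual lines 60 to 72; the array names, shapes, and toy inputs are assumptions):

```python
import numpy as np

def truth_ratio_stat(paraphrased_nll, perturbed_nll):
    """paraphrased_nll: average per-token NLL of the paraphrased answer, shape (N,).
    perturbed_nll: average per-token NLLs of the perturbed answers, shape (N, K).
    exp(-avg NLL) is the geometric-mean (perplexity-style) probability
    discussed in this thread."""
    p_para = np.exp(-np.asarray(paraphrased_nll, dtype=float))          # (N,)
    p_pert = np.exp(-np.asarray(perturbed_nll, dtype=float)).mean(-1)   # (N,)
    ratio = p_pert / p_para
    # Fold the ratio so that both ratio > 1 and ratio < 1 are penalized.
    return np.minimum(ratio, 1.0 / ratio).mean()

print(truth_ratio_stat([0.4, 0.6], [[1.2, 1.0, 1.5], [0.9, 1.1, 1.3]]))
```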
In the code, there are two questions:
Probability
The probability score for Real Authors and World Facts is defined as the ratio of original probabilities, but in the code (L50-L53) it is computed as the ratio of normalized probabilities.
The relevant code is in tofu/aggregate_eval_stat.py, lines 45 to 54 at commit 8889542.
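To make the distinction concrete, here is a small sketch contrasting the two variants (illustrative only, not a reproduction of lines 45 to 54; the summed-NLL inputs and the candidates-in-the-denominator reading are assumptions):

```python
import numpy as np

def probability_score(correct_nll, perturbed_nlls,
                      normalize_by_length=False,
                      correct_len=1, perturbed_lens=None):
    """Ratio of the correct answer's probability to the total over candidates.

    With normalize_by_length=False the raw sequence probabilities P(a|q) are
    used ("ratio of original probabilities"); with True, each summed NLL is
    divided by its answer length first, i.e. P(a|q)^(1/|a|) ("ratio of
    normalized probabilities").
    """
    perturbed_nlls = np.asarray(perturbed_nlls, dtype=float)
    if normalize_by_length:
        correct_nll = correct_nll / correct_len
        perturbed_nlls = perturbed_nlls / np.asarray(perturbed_lens, dtype=float)
    p_correct = np.exp(-correct_nll)
    p_perturbed = np.exp(-perturbed_nlls)
    return p_correct / (p_correct + p_perturbed.sum())

# Same raw NLLs, different scores depending on the normalization choice.
print(probability_score(5.0, [8.0, 9.0]))
print(probability_score(5.0, [8.0, 9.0], normalize_by_length=True,
                        correct_len=10, perturbed_lens=[8, 12]))
```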
Any help is appreciated!