The implementation of Truth Ratio and Probability is different from the definition in the paper #26
Comments
Thanks for your post!
Let us know if you have other questions as well.
Why does a very large truth ratio always have to indicate leakage? I observed that the base Phi model (which hasn't seen TOFU) has a high truth ratio of 0.759, compared to the finetuned and unlearned models, which hover around 0.5. The base Phi model should be the most indifferent to the paraphrased or perturbed answers. I don't think the truth ratio from the paper is simply a measure of indifference to the paraphrased or perturbed answers: the numerator averages over a number of different perturbations, while the denominator has a single paraphrased answer that we want the model to assign a high score to.
A large truth ratio may be bigger than 1: it is a ratio of probabilities, so it is not bounded above. We don't observe this in practice, but it is possible.
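As a toy example with made-up numbers: if the averaged normalized probability of the perturbed answers is 0.4 and the normalized probability of the paraphrased answer is 0.2, then

$$R_{\text{truth}} = \frac{0.4}{0.2} = 2 > 1.$$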
What was the observed forget truth ratio for the retain-only trained model? Was it nearly 1?
The mean computed either on log-probs (geometric mean) or probs (arithmetic mean) may not have significant advantages or disadvantages. Personally I would prefer the arithmetic mean, which is simpler and behaves linearly. Also, you have computed a geometric mean over the tokens of one sequence, which is the same as the perplexity and has the meaning of a cross entropy.

For the truth ratio I have some more comments. For the Forget Quality evaluation, we want the model to behave indistinguishably between paraphrased answers and perturbed answers. So under the current definition of the truth ratio, both R_truth > 1 and R_truth < 1 should be punished, and the min(x, 1/x) operation seems suitable. But from this perspective it still puts the paraphrased answer in an asymmetric position; I would prefer a metric that treats all answers symmetrically.
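For concreteness, a minimal sketch of the two kinds of averaging over one sequence's tokens (plain NumPy with made-up per-token log-probs, not the repository's code):

```python
import numpy as np

# Made-up per-token log-probabilities of one answer given its question.
token_logprobs = np.array([-0.2, -0.5, -0.3])

# Geometric mean of token probabilities = exp(mean log-prob)
# = P(a|q)^(1/|a|), i.e. the inverse of the perplexity.
geometric = np.exp(token_logprobs.mean())

# Arithmetic mean of token probabilities: simpler and linear in the probs.
arithmetic = np.exp(token_logprobs).mean()

print(geometric, arithmetic)  # the two normalizations generally differ
```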
@molereddy data from https://github.com/locuslab/tofu/blob/8889542f281f7fca9ad23dbc11a4cb253ee2aa65/data/ft_epoch5_lr1e-05_llama2-7b_retain90_wd0/eval_results/ds_size300/eval_log_aggregated.json
Code here: https://gist.github.com/wzunknown/060349e2bbdc72fd664e4e7f0e8a9dc2
@wzunknown thank you for that! That makes sense. I observed a ~0.75 average with the base Phi model. I guess the best value this definition of the truth ratio gets to is around 0.75.
Also, why do you choose the probability metrics? Why not make it a multiple-choice question and ask for the answer (A/B/C/D/etc.)? And when measuring the Truth Ratio for the retain set, real authors, or world facts, do you think computing max(P_perturbed) / P_paraphrased would be better?
@wzunknown Thank you for sharing! But in my experiments I observe that the Truth Ratio on Real Authors and World Facts is around 0.6 with the llama2-7b TOFU model, and there is no obvious change with unlearning, which is different from Figure 8 in TOFU. Have you observed similar phenomena?
@wtma1999 We fixed a bug in the original implementation; the revision will be posted on arXiv later. Here is the updated figure: is this similar to what you see?
@zhilif Thank you for your prompt response and for sharing your observations on the Truth Ratio results. Additionally, I have a question regarding the baseline chosen for your step-ablation experiment. If you used Gradient Difference as the baseline, would the metric on the holdout set show an upward trend or stay at a high value (contrary to the trend in the figure)? My understanding is that during the forgetting process, the knowledge of the retain set should be continuously strengthened. Did you use Gradient Ascent for this purpose? I would greatly appreciate any insights or clarifications you can provide on these matters.
@wtma1999 I have just uploaded the notebooks that we use to draw the figure (they probably need some cleanup; I will get back to this after NeurIPS), see https://github.com/locuslab/tofu/blob/refactor_eval/notebook/line_plot.ipynb. For Gradient Difference, I interpret "holdout set" as the "retain set" that you don't want to forget. The reason it is not going upward is that we only subsample n data points from the retain set, where n is the number of data points we want to forget (so if we want to forget 20 entries, we only subsample 20 entries to retain). So the signal for retaining is not as strong. The rationale behind this choice is that we want to restrict the computational cost; otherwise you could simply try to maintain performance on the entire retain set, which can be computationally expensive.
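A minimal sketch of the subsampling described above (illustrative only; `retain_set` and `forget_set` are placeholder names, not the actual data loaders):

```python
import random

def subsample_retain(retain_set, forget_set, seed=42):
    """Keep only as many retain examples as there are forget examples,
    so the retain term in Gradient Difference stays cheap to compute."""
    rng = random.Random(seed)
    return rng.sample(retain_set, k=len(forget_set))

# Forgetting 20 entries means training against only 20 retain entries.
retain_subset = subsample_retain(list(range(3600)), list(range(20)))
print(len(retain_subset))  # 20
```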
Truth Ratio
In the paper, the truth ratio is defined as

$$R_{\text{truth}} = \frac{\frac{1}{|\mathcal{A}_{\text{pert}}|}\sum_{\hat{a}\in\mathcal{A}_{\text{pert}}} P(\hat{a}\mid q)^{1/|\hat{a}|}}{P(\tilde{a}\mid q)^{1/|\tilde{a}|}},$$

where $\tilde{a}$ is the paraphrased answer and $\mathcal{A}_{\text{pert}}$ is the set of perturbed answers.
The normalization is defined as the length-normalized conditional probability

$$P(a\mid q)^{1/|a|},$$

where $|a|$ is the number of tokens in the answer $a$.
The code implementation is in tofu/aggregate_eval_stat.py, lines 60 to 72 at commit 8889542.
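The snippet itself is not reproduced above, so here is a rough sketch of the kind of aggregation being discussed (not the actual lines 60 to 72; the array names, shapes, and toy inputs are assumptions):

```python
import numpy as np

def truth_ratio_stat(paraphrased_nll, perturbed_nll):
    """paraphrased_nll: average per-token NLL of the paraphrased answer, shape (N,).
    perturbed_nll: average per-token NLLs of the perturbed answers, shape (N, K).
    exp(-avg NLL) is the geometric-mean (perplexity-style) probability
    discussed in this thread."""
    p_para = np.exp(-np.asarray(paraphrased_nll, dtype=float))          # (N,)
    p_pert = np.exp(-np.asarray(perturbed_nll, dtype=float)).mean(-1)   # (N,)
    ratio = p_pert / p_para
    # Fold the ratio so that both ratio > 1 and ratio < 1 are penalized.
    return np.minimum(ratio, 1.0 / ratio).mean()

print(truth_ratio_stat([0.4, 0.6], [[1.2, 1.0, 1.5], [0.9, 1.1, 1.3]]))
```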
In the code, there are two questions:
Probability
The probability score for Real Authors and World Facts is defined as the ratio of original probabilities, but in the code (L50-L53) it is computed as the ratio of normalized probabilities.
The relevant code is in tofu/aggregate_eval_stat.py, lines 45 to 54 at commit 8889542.
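To make the distinction concrete, here is a small sketch contrasting the two variants (illustrative only, not a reproduction of lines 45 to 54; the summed-NLL inputs and the candidates-in-the-denominator reading are assumptions):

```python
import numpy as np

def probability_score(correct_nll, perturbed_nlls,
                      normalize_by_length=False,
                      correct_len=1, perturbed_lens=None):
    """Ratio of the correct answer's probability to the total over candidates.

    With normalize_by_length=False the raw sequence probabilities P(a|q) are
    used ("ratio of original probabilities"); with True, each summed NLL is
    divided by its answer length first, i.e. P(a|q)^(1/|a|) ("ratio of
    normalized probabilities").
    """
    perturbed_nlls = np.asarray(perturbed_nlls, dtype=float)
    if normalize_by_length:
        correct_nll = correct_nll / correct_len
        perturbed_nlls = perturbed_nlls / np.asarray(perturbed_lens, dtype=float)
    p_correct = np.exp(-correct_nll)
    p_perturbed = np.exp(-perturbed_nlls)
    return p_correct / (p_correct + p_perturbed.sum())

# Same raw NLLs, different scores depending on the normalization choice.
print(probability_score(5.0, [8.0, 9.0]))
print(probability_score(5.0, [8.0, 9.0], normalize_by_length=True,
                        correct_len=10, perturbed_lens=[8, 12]))
```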
Any help is appreciated!