tf.metrics.accuracy maintains a running accuracy? #9498

Closed
gongzhitaao opened this issue Apr 27, 2017 · 16 comments

Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower type:feature Feature requests

Comments

@gongzhitaao

I use tf.metrics.accuracy, but it is a bit counter-intuitive in that it maintains a running accuracy (the doc confirms this). The following simple script illustrates the situation:

import os
# suppress tensorflow logging other than errors
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import tensorflow as tf


print(tf.__version__)
# 1.1.0

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])
acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [1, 0, 0, 0, 1]})
print(v)
# [0.0, 0.8]

v = sess.run(acc)
print(v)
# 0.8

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [0, 1, 1, 1, 1]})
print(v)
# [0.8, 0.4]

v = sess.run(acc)
print(v)
# 0.4

My concerns are:

  1. the behavior of accuracy is a bit surprising; are we supposed to construct the normal (per-batch) accuracy manually?
  2. IMHO, it would be better to either
    1. implement the normal accuracy behavior, or
    2. provide a clean way to reset the local variables created by tf.metrics.accuracy, i.e., the count and total (see the sketch below).
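A minimal workaround sketch for concern 2.2, assuming the TF 1.x API from the script above: tf.metrics.accuracy keeps its total and count as local variables, so re-running the local-variables initializer between evaluation rounds resets the running accuracy. Note this blunt approach resets all local variables, not just this metric's.

import tensorflow as tf

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])
acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

sess = tf.InteractiveSession()
sess.run(tf.local_variables_initializer())

# first evaluation round
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(sess.run(acc))  # 0.8

# blunt reset: re-initialize local variables, zeroing total and count
sess.run(tf.local_variables_initializer())

# the second round now starts from scratch
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [0, 1, 1, 1, 1]})
print(sess.run(acc))  # 0.0 instead of the running 0.4
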
@girving
Contributor

girving commented Apr 27, 2017

@sguada Do you have opinions here? The current API is designed around an evaluation process that restarts from scratch each time; it does seem hard to use in other contexts.

@girving girving added the stat:awaiting tensorflower and type:feature labels on Apr 27, 2017
@shoyer
Contributor

shoyer commented Apr 27, 2017

For what it's worth, I agree with @gongzhitaao here, though we'll be stuck with these for a while given their use in TF 1.0. I think the metric names prefaced with streaming (back when this was tf.contrib.metrics) are much more intuitive. I suppose you could also rename the entire module, i.e., tf.streaming_metrics.

@sguada
Member

sguada commented Apr 29, 2017

Yeah, when we created them they were streaming_metrics, so it was streaming_accuracy.

I think streaming metrics should have a simple way to reset them.

FYI: Implementing non-streaming accuracy is simple, e.g.:
tf.reduce_mean(tf.to_float(tf.equal(predictions, labels)))

@martinwicke

@gongzhitaao
Author

@sguada Thanks. That's what I'm using in my code.

@foxik
Contributor

foxik commented May 22, 2017

+1 for a reset_op for all streaming metrics (i.e., everything in tf.metrics) -- that allows both

  • creating a non-streaming metric (by running the reset_op followed by the update_op), and
  • repeatedly evaluating a multi-batch test set (often needed in our work).

Maybe a new named argument could be added that causes the metrics to return an additional (third) result, the reset_op? That would be backward compatible.

I could also imagine another named argument that would cause the metrics to return just one value -- a one-shot metric call (i.e., the reset_op combined with the update_op); but I can easily implement that myself if the reset_op is available.
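In the meantime, such a reset_op can be built by hand. A minimal sketch, assuming TF 1.x: create the metric inside a variable scope and reinitialize only the local variables from that scope, so other metrics are left untouched.

import tensorflow as tf

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])

with tf.variable_scope('my_accuracy'):
    acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

# collect only this metric's total/count variables and build a targeted reset_op
metric_vars = tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='my_accuracy')
reset_op = tf.variables_initializer(metric_vars)

sess = tf.Session()
sess.run(tf.local_variables_initializer())
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(sess.run(acc))  # 0.8
sess.run(reset_op)    # zero out total and count for this metric only
print(sess.run(acc))  # 0.0
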

@foxik
Contributor

foxik commented Jun 2, 2017

Note that this is tightly coupled to #4814.

@lancerts

lancerts commented Jun 9, 2017

Nice suggestion! This has puzzled me for weeks.
I just asked a question about this earlier on Stack Overflow.

Suppose we set every_n_steps = 100, so the monitor is called every 100 steps.

Also, suppose the input_fn function for the validation monitor yields a stream of data. Let's assume 10 batches.

Case 1: the auc state is reset every time the validation monitor is called; therefore the streaming is done over the 10 batches within each validation step.

Case 2: the auc state is NOT reset, so the streaming auc accumulates from the first call of the validation monitor onward. Namely, the first output (at 100 steps) is computed from 10 batches; the second validation output (at 200 steps) is computed from the streaming auc state after the first call plus the 10 batches fed in; the third output (at 300 steps) is computed from the streaming auc state after the second call plus the 10 batches fed in.

Question 1: which of these scenarios is implemented?

Question 2: if we use tf.metrics.auc, what is the difference? In the doc they say:

For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the auc.

So this also computes the streamed auc?!

So, is the streaming from the very beginning, or within each call of the validation monitor?
If it is from the very beginning, I guess we have to initialize manually and do things like

for i in range(n_epochs):
    # reset the metric's local variables before each epoch (pseudocode)
    sess.run(tf.local_variables_initializer())
    m.fit(...)

which is not efficient....

@lancerts

Also, from tensorflow output,

INFO:tensorflow:Saving dict for global step 39807: accuracy = 0.85421, accuracy/baseline_label_mean = 0.14821, accuracy/threshold_0.500000_mean = 0.85421, auc = 0.686321, global_step = 39807, labels/actual_label_mean = 0.14821, labels/prediction_mean = 0.146081, loss = 0.39175, precision/positive_threshold_0.500000_mean = 0.580026, recall/positive_threshold_0.500000_mean = 0.0591728, validate_confusion_matrix = [[84052 699]
[14430 819]], validate_streaing_precision = 0.580026, validate_streaing_recall = 0.0591728, validate_streaming_auc = 0.525859
WARNING:tensorflow:Skipping summary for validate_confusion_matrix, must be a float or np.float32.
WARNING:tensorflow:Skipping summary for global_step, must be a float or np.float32.
INFO:tensorflow:Validation (step 40000): loss = 0.39175, accuracy = 0.85421, labels/prediction_mean = 0.146081, labels/actual_label_mean = 0.14821, accuracy/baseline_label_mean = 0.14821, auc = 0.686321, accuracy/threshold_0.500000_mean = 0.85421, precision/positive_threshold_0.500000_mean = 0.580026, recall/positive_threshold_0.500000_mean = 0.0591728, validate_confusion_matrix = [[84052 699]
[14430 819]], validate_streaing_precision = 0.580026, validate_streaing_recall = 0.0591728, validate_streaming_auc = 0.525859, global_step = 39807

Why do these two AUCs differ so much? When I use model.evaluate on the train, validation, and test data sets, all the output AUCs are very close to the first auc above (auc = 0.686321).
So what is validate_streaming_auc actually calculating? I have reset the local variables in each epoch.

@martinwicke
Member

All the metrics are streaming, which is why we removed the name. The non-streaming use case is typically trivial to implement (see @sguada's example). It is also almost entirely useless if you use these in evaluation, since you're almost never interested in a single batch result.

@lancerts, the ValidationMonitor will start a completely new evaluation each time, which should reset the state.

I will close this issue -- this is working as intended. If you have a specific feature request, please file a new issue. Thanks!

@kashefy

kashefy commented Jul 27, 2017

When running @gongzhitaao's code snippet from the top multiple times, I sometimes get different outputs for acc.

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [1, 0, 0, 0, 1]})
print(v)
# [0.0, 0.8] # shouldn't the first be 0.8 as well?
# [0.8, 0.8] # running this several times will sometimes produce this output, but less frequently

I have this code running as a standalone script; each run is completely independent of the previous run.

Any ideas anyone?

@BloodD

BloodD commented Dec 19, 2017

This fuzzy function behavior sucks!

@martinwicke
Member

@kashefy If you execute both acc and acc_op (the latter being the update op) in the same session.run call, the execution order of the two is undefined.

If acc_op is executed first, you will see [0.8, 0.8]; if acc is executed first, you will see [0.0, 0.8].

This is turning into more of a StackOverflow discussion, I suggest we take additional questions there.
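A small sketch of making the order deterministic, assuming the graph and session from the snippet at the top of the issue: run the update first (acc_op itself returns the updated running accuracy), then read acc in a separate call.

# Run the update op first; its return value is the updated running accuracy.
updated = sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(updated)        # 0.8

# Reading acc in a separate session.run call is now unambiguous.
print(sess.run(acc))  # 0.8
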

@Knight-H

Knight-H commented Jun 19, 2018

For people who are still looking to reset and manage the TensorFlow variables total and count inside tf.metrics.accuracy, please refer to Ronny Restrepo's blog post on how to do that:

Avoiding headaches with tf.metrics (Sept. 11, 2017): http://ronny.rest/blog/post_2017_09_11_tf_metrics/

I find he also provides a nice intuition for the accuracy and update_op values.

@ntvy95

ntvy95 commented Jul 26, 2018

It looks like the article linked by @Knight-H has now moved to https://steemit.com/machine-learning/@ronny.rest/avoiding-headaches-with-tf-metrics

To add to the article: TensorFlow also automatically adds the local variables created by tf.metrics to the tf.GraphKeys.METRIC_VARIABLES collection. Sadly, this fact is not mentioned in the current documentation.
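A short sketch of using that collection to build a reset op for every tf.metrics variable at once, assuming a TF 1.x version recent enough to have tf.GraphKeys.METRIC_VARIABLES and an existing session sess:

# Reset all metric variables (total, count, ...) registered by tf.metrics.*
metric_vars = tf.get_collection(tf.GraphKeys.METRIC_VARIABLES)
reset_all_metrics_op = tf.variables_initializer(metric_vars)

sess.run(reset_all_metrics_op)  # every running metric starts from zero again
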

@arozans

arozans commented Aug 18, 2018

For simple, one-batch accuracy:

def non_streaming_accuracy(predictions, labels):
    return tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))

@sguada's answer uses deprecated functions.
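A tiny usage sketch of the helper above, assuming int32 placeholders and the same data as the original snippet:

x = tf.placeholder(tf.int32, [5])  # labels
y = tf.placeholder(tf.int32, [5])  # predictions
batch_acc = non_streaming_accuracy(y, x)

with tf.Session() as sess:
    print(sess.run(batch_acc, feed_dict={x: [1, 0, 0, 0, 0],
                                         y: [1, 0, 0, 0, 1]}))  # 0.8
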

@hassanzadeh

> @kashefy If you execute both acc and acc_op (the latter being the update op) in the same session.run call, the execution order of the two is undefined.
>
> If acc_op is executed first, you will see [0.8, 0.8]; if acc is executed first, you will see [0.0, 0.8].
>
> This is turning into more of a StackOverflow discussion, I suggest we take additional questions there.

Sorry, but this does not make sense to me. acc and update_op are both part of the computational graph, and no matter where we put them in the run call, i.e. session.run([acc, update_op]) vs session.run([update_op, acc]), the order of execution will be the same. I'd be surprised if it were the other way around. In my view this is a bug.
