tf.metrics.accuracy maintains a running accuracy? #9498

Closed
gongzhitaao opened this issue Apr 27, 2017 · 16 comments

Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower type:feature Feature requests

Comments

@gongzhitaao

I use tf.metrics.accuracy, but it is a bit counter-intuitive in that it maintains a running accuracy (the doc confirms this). The following simple script illustrates the situation:

import os
# suppress tensorflow logging other than errors
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import tensorflow as tf


print(tf.__version__)
# 1.1.0

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])
acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [1, 0, 0, 0, 1]})
print(v)
# [0.0, 0.8]

v = sess.run(acc)
print(v)
# 0.8

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [0, 1, 1, 1, 1]})
print(v)
# [0.8, 0.4]

v = sess.run(acc)
print(v)
# 0.4

My concerns are:

  1. the behavior of accuracy is a bit surprising; are we supposed to construct the normal (per-batch) accuracy manually?
  2. IMHO, it would be better to either
    1. implement the normal accuracy behavior, or
    2. provide a clean way to reset the local variables created by tf.metrics.accuracy, i.e., the count and total (see the sketch below).
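A minimal workaround sketch for concern 2.2, assuming the TF 1.x API from the script above: tf.metrics.accuracy keeps its total and count as local variables, so re-running the local-variables initializer between evaluation rounds resets the running accuracy. Note this blunt approach resets all local variables, not just this metric's.

import tensorflow as tf

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])
acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

sess = tf.InteractiveSession()
sess.run(tf.local_variables_initializer())

# first evaluation round
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(sess.run(acc))  # 0.8

# blunt reset: re-initialize local variables, zeroing total and count
sess.run(tf.local_variables_initializer())

# the second round now starts from scratch
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [0, 1, 1, 1, 1]})
print(sess.run(acc))  # 0.0 instead of the running 0.4
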
@girving
Contributor

girving commented Apr 27, 2017

@sguada Do you have opinions here? The current API is designed around an evaluation process that restarts from scratch each time; it does seem hard to use in other contexts.

@girving girving added the stat:awaiting tensorflower and type:feature labels on Apr 27, 2017
@shoyer
Contributor

shoyer commented Apr 27, 2017

For what it's worth, I agree with @gongzhitaao here, though we'll be stuck with these for a while given their use in TF 1.0. I think the metric names prefaced with streaming (back when this was tf.contrib.metrics) are much more intuitive. I suppose you could also rename the entire module, i.e., tf.streaming_metrics.

@sguada
Member

sguada commented Apr 29, 2017

Yeah, when we created them they were streaming_metrics, so it was streaming_accuracy.

I think streaming metrics should have a simple way to reset them.

FYI: Implementing non-streaming accuracy is simple, e.g.:
tf.reduce_mean(tf.to_float(tf.equal(predictions, labels)))

@martinwicke

@gongzhitaao
Author

@sguada Thanks. That's what I'm using in my code.

@foxik
Contributor

foxik commented May 22, 2017

+1 for a reset_op for all streaming metrics (i.e., everything in tf.metrics) -- that allows both

  • creating a non-streaming metric (by running the reset_op followed by the update_op), and
  • repeatedly evaluating a multi-batch test set (often needed in our work).

Maybe a new named argument could be added that causes the metrics to return an additional (third) result, the reset_op? That would be backward compatible.

I could also imagine another named argument that would cause the metrics to return just one value -- a one-shot metric call (i.e., the reset_op combined with the update_op); but I can easily implement that myself if the reset_op is available.
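In the meantime, such a reset_op can be built by hand. A minimal sketch, assuming TF 1.x: create the metric inside a variable scope and reinitialize only the local variables from that scope, so other metrics are left untouched.

import tensorflow as tf

x = tf.placeholder(tf.int32, [5])
y = tf.placeholder(tf.int32, [5])

with tf.variable_scope('my_accuracy'):
    acc, acc_op = tf.metrics.accuracy(labels=x, predictions=y)

# collect only this metric's total/count variables and build a targeted reset_op
metric_vars = tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='my_accuracy')
reset_op = tf.variables_initializer(metric_vars)

sess = tf.Session()
sess.run(tf.local_variables_initializer())
sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(sess.run(acc))  # 0.8
sess.run(reset_op)    # zero out total and count for this metric only
print(sess.run(acc))  # 0.0
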

@foxik
Contributor

foxik commented Jun 2, 2017

Note that this is tightly coupled to #4814.

@lancerts

lancerts commented Jun 9, 2017

Nice suggestion! This has puzzled me for weeks.
I just asked a question about this earlier on Stack Overflow.

Suppose we set every_n_steps = 100, so the monitor is called every 100 steps.

Also, suppose the input_fn function for the validation monitor yields a stream of data. Let's assume 10 batches.

Case 1: the auc state is reset every time the validation monitor is called; therefore the streaming is done over the 10 batches within each validation step.

Case 2: the auc state is NOT reset, so the streaming auc accumulates from the first call of the validation monitor onward. Namely, the first output (at 100 steps) is computed from 10 batches; the second validation output (at 200 steps) is computed from the streaming auc state after the first call plus the 10 batches fed in; the third output (at 300 steps) is computed from the streaming auc state after the second call plus the 10 batches fed in.

Question 1: which of these scenarios is implemented?

Question 2: if we use tf.metrics.auc, what is the difference? In the doc they say:

For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the auc.

So this also computes the streamed auc?!

So, is the streaming from the very beginning, or within each call of the validation monitor?
If it is from the very beginning, I guess we have to initialize manually and do things like

for i in range(n_epochs):
    # reset the metric's local variables before each epoch (pseudocode)
    sess.run(tf.local_variables_initializer())
    m.fit(...)

which is not efficient....

@lancerts

Also, from tensorflow output,

INFO:tensorflow:Saving dict for global step 39807: accuracy = 0.85421, accuracy/baseline_label_mean = 0.14821, accuracy/threshold_0.500000_mean = 0.85421, auc = 0.686321, global_step = 39807, labels/actual_label_mean = 0.14821, labels/prediction_mean = 0.146081, loss = 0.39175, precision/positive_threshold_0.500000_mean = 0.580026, recall/positive_threshold_0.500000_mean = 0.0591728, validate_confusion_matrix = [[84052 699]
[14430 819]], validate_streaing_precision = 0.580026, validate_streaing_recall = 0.0591728, validate_streaming_auc = 0.525859
WARNING:tensorflow:Skipping summary for validate_confusion_matrix, must be a float or np.float32.
WARNING:tensorflow:Skipping summary for global_step, must be a float or np.float32.
INFO:tensorflow:Validation (step 40000): loss = 0.39175, accuracy = 0.85421, labels/prediction_mean = 0.146081, labels/actual_label_mean = 0.14821, accuracy/baseline_label_mean = 0.14821, auc = 0.686321, accuracy/threshold_0.500000_mean = 0.85421, precision/positive_threshold_0.500000_mean = 0.580026, recall/positive_threshold_0.500000_mean = 0.0591728, validate_confusion_matrix = [[84052 699]
[14430 819]], validate_streaing_precision = 0.580026, validate_streaing_recall = 0.0591728, validate_streaming_auc = 0.525859, global_step = 39807

Why do these two AUCs differ so much? When I use model.evaluate on the train, validation, and test data sets, all the output AUCs are very close to the first auc above (auc = 0.686321).
So what is validate_streaming_auc actually calculating? I have reset the local variables in each epoch.

@martinwicke
Member

All the metrics are streaming, which is why we removed the name. The non-streaming use case is typically trivial to implement (see @sguada's example). It is also almost entirely useless if you use these in evaluation, since you're almost never interested in a single batch result.

@lancerts, the ValidationMonitor will start a completely new evaluation each time, which should reset the state.

I will close this issue -- this is working as intended. If you have a specific feature request, please file a new issue. Thanks!

@kashefy

kashefy commented Jul 27, 2017

When running @gongzhitaao's code snippet from the top multiple times, I sometimes get different outputs for acc.

v = sess.run([acc, acc_op], feed_dict={x: [1, 0, 0, 0, 0],
                                       y: [1, 0, 0, 0, 1]})
print(v)
# [0.0, 0.8] # shouldn't the first be 0.8 as well?
# [0.8, 0.8] # running this several times will sometimes produce this output, but less frequently

I have this code running as a standalone script; each run is completely independent of the previous run.

Any ideas anyone?

@BloodD

BloodD commented Dec 19, 2017

This fuzzy function behavior sucks!

@martinwicke
Member

@kashefy If you execute both acc and acc_op (the latter being the update op) in the same session.run call, the execution order of the two is undefined.

If acc_op is executed first, you will see [0.8, 0.8]; if acc is executed first, you will see [0.0, 0.8].

This is turning into more of a StackOverflow discussion, I suggest we take additional questions there.
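A small sketch of making the order deterministic, assuming the graph and session from the snippet at the top of the issue: run the update first (acc_op itself returns the updated running accuracy), then read acc in a separate call.

# Run the update op first; its return value is the updated running accuracy.
updated = sess.run(acc_op, feed_dict={x: [1, 0, 0, 0, 0], y: [1, 0, 0, 0, 1]})
print(updated)        # 0.8

# Reading acc in a separate session.run call is now unambiguous.
print(sess.run(acc))  # 0.8
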

@Knight-H

Knight-H commented Jun 19, 2018

For people who are still looking to reset and manage the TensorFlow variables total and count inside tf.metrics.accuracy, please refer to Ronny Restrepo's blog post on how to do that:

Avoiding headaches with tf.metrics (Sept. 11, 2017): http://ronny.rest/blog/post_2017_09_11_tf_metrics/

I find he also provides a nice intuition for the accuracy and update_op values.

@ntvy95

ntvy95 commented Jul 26, 2018

It looks like the article linked by @Knight-H has now moved to https://steemit.com/machine-learning/@ronny.rest/avoiding-headaches-with-tf-metrics

To add to the article: TensorFlow also automatically adds the local variables created by tf.metrics to the tf.GraphKeys.METRIC_VARIABLES collection. Sadly, this fact is not mentioned in the current documentation.
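A short sketch of using that collection to build a reset op for every tf.metrics variable at once, assuming a TF 1.x version recent enough to have tf.GraphKeys.METRIC_VARIABLES and an existing session sess:

# Reset all metric variables (total, count, ...) registered by tf.metrics.*
metric_vars = tf.get_collection(tf.GraphKeys.METRIC_VARIABLES)
reset_all_metrics_op = tf.variables_initializer(metric_vars)

sess.run(reset_all_metrics_op)  # every running metric starts from zero again
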

@arozans

arozans commented Aug 18, 2018

For simple, one-batch accuracy:

def non_streaming_accuracy(predictions, labels):
    return tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))

@sguada's answer uses deprecated functions.
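A tiny usage sketch of the helper above, assuming int32 placeholders and the same data as the original snippet:

x = tf.placeholder(tf.int32, [5])  # labels
y = tf.placeholder(tf.int32, [5])  # predictions
batch_acc = non_streaming_accuracy(y, x)

with tf.Session() as sess:
    print(sess.run(batch_acc, feed_dict={x: [1, 0, 0, 0, 0],
                                         y: [1, 0, 0, 0, 1]}))  # 0.8
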

@hassanzadeh

> @kashefy If you execute both acc and acc_op (the latter being the update op) in the same session.run call, the execution order of the two is undefined.
>
> If acc_op is executed first, you will see [0.8, 0.8]; if acc is executed first, you will see [0.0, 0.8].
>
> This is turning into more of a StackOverflow discussion, I suggest we take additional questions there.

Sorry, but this does not make sense to me. acc and update_op are both part of the computational graph, and no matter where we put them in the run call, i.e. session.run([acc, update_op]) vs session.run([update_op, acc]), the order of execution will be the same. I'd be surprised if it were the other way around. In my view this is a bug.
