tensorboard for variational inference #598
Conversation
Add loss as a scalar and gradients and weights as histograms to the summary
Great work so far! You should track the norm of all the gradients too.
Also, see the Travis failure about adhering to PEP 8.
var_list.update(get_variables(qz, collection=trainables))

for x, qx in six.iteritems(self.data):
  if isinstance(x, RandomVariable) and \
      not isinstance(qx, RandomVariable):
    var_list.update(get_variables(x, collection=trainables))
    data_var_list.update(get_variables(x, collection=trainables))

var_list = list(var_list)
To make this more efficient, I think you can rewrite var_list to be the concatenation of latent_var_list and data_var_list.
trainables = tf.trainable_variables()
for z, qz in six.iteritems(self.latent_vars):
  if isinstance(z, RandomVariable):
    var_list.update(get_variables(z, collection=trainables))
    latent_var_list.update(get_variables(z, collection=trainables))
  var_list.update(get_variables(qz, collection=trainables))

for x, qx in six.iteritems(self.data):
A good practice is to instantiate variables closest to where they're used, e.g., instantiate data_var_list above this line.
if self.logging:
  # TODO: when var_list is not None
  tf.summary.scalar("loss", self.loss)
  for var in latent_var_list:
These two double for loops could be made more efficient and readable. For example:
for grad, var in grads_and_vars:
  if var in latent_var_list:
    name = "variational"
  elif var in data_var_list:
    name = "model"
  else:
    name = ""
  with tf.name_scope(name):
    tf.summary.histogram("parameter_" + var.name, var)
    tf.summary.histogram("gradient_" + var.name, grad)
This also solves your TODO above.
Some other tips I added to the above: 1. Use "variational" over "inference", as they're variational parameters in general; "inference" is specific to inference networks in VAEs. 2. Use name scopes if you notice you're using a prefix everywhere.
Also, is there a reason you used CamelCase?
No particular reason for CamelCase. I tried both and decided to stick with it just for the aesthetics. Using name scopes is a better idea though.
On running initialize(), TensorFlow prints these warnings for all variables/gradients:
INFO:tensorflow:Summary name parameter_dense_1/bias:0 is illegal; using parameter_dense_1/bias_0 instead.
INFO:tensorflow:Summary name gradient_gradients/inference_4608532368/0/dense_1/BiasAdd_grad/BiasAddGrad:0 is illegal; using gradient_gradients/inference_4608532368/0/dense_1/BiasAdd_grad/BiasAddGrad_0 instead.
Should we do something about it?
Yes, try iterating them with _0, _1, etc. for each unique name?
Also, can you check that the variables are given their appropriate summary names? parameter_dense_1/bias:0 sounds like a model parameter, but it's not in the model name scope. Same for the second line.
Should I simply replace ':' with '_'?
The names include the scope when they are displayed on TensorBoard; the full names aren't printed there.
But as seen from the screenshot, there is one other problem: not all variables are grouped under scope 'model'. TensorFlow uses 'model_1' when tf.name_scope(scope_name) is called again.
Should I add all the variables and gradients from one scope in one go? Something like:
with tf.name_scope('model'):
  for grad, var in grads_and_vars:
    if var in data_var_list:
      tf.summary.histogram("parameter_" + var.name, var)
      tf.summary.histogram("gradient_" + grad.name, grad)
and similarly for inference.
To your two questions: yes and yes.
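The two fixes agreed on above (replace the ':' that TensorFlow forbids in summary names, and disambiguate repeated names with numeric suffixes) can be sketched in plain Python. The helper name and the seen_counts bookkeeping here are hypothetical, not part of Edward:

```python
def summary_name(raw_name, seen_counts):
    """Sanitize a variable name for use as a summary tag.

    Replaces ':' (illegal in TF summary names) with '_' and appends
    a numeric suffix when the same sanitized name was seen before.
    seen_counts is a dict carried across calls.
    """
    base = raw_name.replace(':', '_')
    count = seen_counts.get(base, 0)
    seen_counts[base] = count + 1
    # First occurrence keeps the base name; repeats get _1, _2, ...
    return base if count == 0 else "%s_%d" % (base, count)

seen = {}
print(summary_name("dense_1/bias:0", seen))  # dense_1/bias_0
print(summary_name("dense_1/bias:0", seen))  # dense_1/bias_0_1
```

This mirrors what TensorFlow itself does when it prints "Summary name ... is illegal; using ... instead", but doing it up front keeps the warnings out of the logs.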
tf.summary.histogram("ModelWeight_" + var.name, var)
tf.summary.histogram("ModelGradient_" + var.name, grad_var_tuple[0])

self.summarize = tf.summary.merge_all()
You define self.summarize here, but it's also defined when using the parent method's initialize (Inference).
You're right that we need to merge the summaries after they're defined and not before. It sounds like we need to decide where to properly put the merge summary op; not sure if it's as simple as moving the call to the parent method after VariationalInference finishes its initialize instead of before.
Yes, it doesn't work when self.summarize is defined before the summary ops are defined.
I tried moving the call to the parent's initialize() to the end of the function. But we use variables like self.logging which are initialized in the parent call. Here it will be easy to modify, but that may not be the case for every class that inherits from Inference.
Alternatively, we could call summary.merge_all() in each of the child classes and remove it from the parent. For now, we can let it be defined in both until we modify the other classes.
Maybe it would even make sense to have the summary writer object outside the inference class? Or else how does it work with composed inferences (http://edwardlib.org/api/inference-compositionality)?
@mariru Good point. It sounds like self.summary shouldn't merge all summaries in the TensorFlow graph, but only summaries added inside its class. That way self.summary is localized and thus compatible with compositions.
I think summary.merge_all() will merge all the summaries in the graph. We could instead use tf.merge(), which takes as input a list of summary buffers. Something like:
self.summarize = tf.merge(summary_objects)
tf.merge requires storing all the summary objects. However, we build summaries in various methods of the Inference class, so this is not very TensorFlow-y.
We can use a unique graph collection for each inference; then we can merge all summaries within that collection. For example:
# summaries are added like this
summary_key = 'summaries_' + str(id(inference))
tf.summary.scalar(..., collections=[summary_key])
tf.summary.scalar(..., collections=[summary_key])

# when forming the merge summary
summary_key = 'summaries_' + str(id(inference))
self.summarize = tf.summary.merge_all(summary_key)
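Why this keeps composed inferences independent can be seen with a plain-Python stand-in for graph collections; the class and registry below are hypothetical, purely to illustrate the per-instance key derived from id():

```python
class ToySummaryRegistry:
    """Stand-in for TF graph collections, keyed per inference instance."""
    collections = {}  # shared "graph": key -> list of summary names

    def add_summary(self, name):
        # Tag the summary with a key unique to this instance.
        key = 'summaries_' + str(id(self))
        self.collections.setdefault(key, []).append(name)

    def merge(self):
        # Merge only this instance's summaries, not everything in the graph.
        key = 'summaries_' + str(id(self))
        return list(self.collections.get(key, []))

a, b = ToySummaryRegistry(), ToySummaryRegistry()
a.add_summary("loss")
b.add_summary("loss_d")
print(a.merge())  # ['loss']
print(b.merge())  # ['loss_d']
```

Each instance's merge sees only its own collection, which is exactly what composed inferences need: two inference objects sharing one graph no longer clobber each other's merged summary op.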
Bump. It would be great to get this merged in before you start working on the Monte Carlo / more specific variational method stuff.
Also removes redundancy in updating var_list twice
edward/inferences/gan_inference.py (outdated)
@@ -101,6 +101,12 @@ def initialize(self, optimizer=None, optimizer_d=None,
    self.train_d = optimizer_d.apply_gradients(grads_and_vars_d,
                                               global_step=global_step_d)

    if self.logging:
      summary_key = 'summaries_' + str(id(self))
      tf.summary.scalar('loss_fake', self.loss_d, collections=[summary_key])
Can you rename these to 'loss_discriminative' and 'loss_generative' respectively?
Done.
Are you including the summary ops for the VI losses here too?
Yes.
@akshaykhatri639 I'm not sure if I follow: do you mean they will be in this PR?
tf.summary.scalar("gradient_norm_" +
                  grad.name.replace(':', '_'),
                  tf.norm(grad), collections=[summary_key])
# replace ':' with '_' because tf does not allow ':' in var names in summaries
@akshaykhatri639 Really looking forward to this feature!
I tried it using MAP inference, but the naming turns out weird.
I would replace grad.name.replace(':', '_') with var.name.replace(':', '_') everywhere, because the gradients are named after the operations. By prefacing it with gradient_, it will still be clear from the name that it's a gradient.
I have been using tf.name_scope('model') for the actual model. Maybe rename to tf.name_scope('training') for logging? But that's minor; I can just rename the scope for my model.
- naming of gradients is messy: see comment above
- tracking gradients is super useful, but it might be even more useful to look at gradients*learning_rate or (parameters - gradients*learning_rate)/parameters. Especially with adaptive learning rates, when the user doesn't know the current learning rate, it is impossible to know from the histograms whether the gradient is in the right ballpark.
- when training blows up and some parameters become NaN or inf (e.g. because the learning rate is too large), I get an error message that is hard to decipher:
InvalidArgumentError (see above for traceback): Nan in summary histogram for: model_1/parameter_model/embeddings/alpha_0
[[Node: model_1/parameter_model/embeddings/alpha_0 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](model_1/parameter_model/embeddings/alpha_0/tag, model/embeddings/alpha/read)]]
Maybe there's a nicer way to catch this? I guess that's more an idiosyncrasy of TensorFlow than of Edward. But still, catching it in a more readable way would be nice.
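The second bullet's relative-update diagnostic is cheap to compute; here is a minimal pure-Python sketch, assuming plain SGD (the function name is hypothetical, and adaptive optimizers like Adam rescale the step internally, so there it is only a rough indicator):

```python
def relative_update(params, grads, learning_rate):
    """Per-element relative update size |lr * grad / param|.

    Assumes a plain SGD step param -= lr * grad. Values around 1e-3
    are a commonly cited healthy ballpark; values near 1 or above
    suggest the learning rate is too large for those parameters.
    """
    return [abs(learning_rate * g / p) if p != 0 else float('inf')
            for p, g in zip(params, grads)]

# Both parameters move by 0.25% of their magnitude per step.
print(relative_update([2.0, -4.0], [0.5, 1.0], 0.01))
```

In the PR's setting, the same quantity could be logged as a histogram next to the raw gradient histograms, giving a learning-rate-independent view of training health.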
- You are right about the names of the summary ops. I will make the required changes.
- Won't this become a bit complicated? Each optimizer modifies weights in a slightly different way and may store one or more cache/momentum matrices per gradient. Should we implement it differently for each optimizer? Otherwise, the values in the summary still won't be close to the values the weights were actually updated with.
- Should I check whether any values become NaN before writing the summary ops?
- There's a self.debug member in Edward inferences for checking whether certain ops blow up. I agree with Maja that it would be nice if we could somehow raise a more informative error (but without running a check every iteration just to raise that error).
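The kind of readable failure discussed here can be sketched in plain Python; the helper name and dict-of-lists input are hypothetical stand-ins for fetched tensor values, and in Edward such a check would presumably sit behind self.debug rather than run every iteration:

```python
import math

def check_finite(named_values):
    """Raise a readable error naming the first tensor with bad values.

    named_values: dict mapping a summary name to a flat list of floats,
    e.g. the fetched parameter/gradient values about to be summarized.
    """
    for name, values in named_values.items():
        bad = [v for v in values if math.isnan(v) or math.isinf(v)]
        if bad:
            raise ValueError(
                "Non-finite value %r in '%s'; "
                "consider lowering the learning rate." % (bad[0], name))

check_finite({"model/weights": [0.1, -0.2]})  # passes silently
```

Compared to TensorFlow's raw InvalidArgumentError, the exception message points directly at the offending summary name and the likely cause.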
Merging this now. I'll personally make changes that revise this slightly in another PR.