Nature '20 | A distributional code for value in dopamine-based reinforcement learning. #14

NorbertZheng · 2022-01-08T14:15:26Z

Dabney W, Kurth-Nelson Z, Uchida N, et al. A distributional code for value in dopamine-based reinforcement learning.

NorbertZheng · 2022-01-09T03:04:25Z

Related References

Hao Liang. Distributional RL.
Lowet A S, Zheng Q, Matias S, et al. Distributional reinforcement learning in the brain.
Bellemare M G, Dabney W, Munos R. A distributional perspective on reinforcement learning.

NorbertZheng · 2022-01-09T03:10:49Z

Markov Decision Process

A classical formalization of sequential decision making.
A mathematically idealized form of the reinforcement learning problem.

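As the agent interacts with the env, taking action At in state St produces a reward (Rt) and moves the state to St+1. Because the MDP is the canonical abstraction of sequential decision-making tasks, we generally use an MDP whenever we model such a task.
The tuple (s,a,r,s') in an MDP is stochastic (this captures fast changes in the env, which may be cyclic or may be fast changes in some main variables of the env dynamics; see the discussion of stochasticity and uncertainty in #13). Its dynamics equation is as follows: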
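The figure with the equation is missing here. Assuming it showed the standard four-argument dynamics function, and keeping the indexing used above (reward Rt emitted together with the transition to St+1), it would read:

```latex
p(s', r \mid s, a) \;\doteq\; \Pr\{\, S_{t+1} = s',\; R_t = r \;\mid\; S_t = s,\; A_t = a \,\},
\qquad
\sum_{s'} \sum_{r} p(s', r \mid s, a) = 1 \quad \text{for all } s, a .
```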

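From the dynamics equation p(s',r|s,a) given here, we can marginalize to obtain directly the reward-sum-dist under a given policy (π), that is, the distribution of the discounted reward sum (written loosely as p(r0)+γp(r1)+...). We can then define on it the statistics we want to optimize, such as dist-mean, dist-var, and CVaR (discussed in detail in #13). The definition of such a statistic on the reward-sum-dist can be decomposed into a single-step form, which is the Bellman equation: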
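The equation image is missing here as well. For the simplest statistic, the dist-mean (the usual state value), the single-step form is the standard Bellman expectation equation. We assume this is the form the figure showed:

```latex
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```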

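The main purpose of this decomposition is to make dynamic programming convenient (that is, the next-step value is expressed as a combination of the current values, which also requires knowing the dynamics equation p(s',r|s,a)). Other TD approximation methods need this decomposition as well, but for modeling purposes DP is usually enough, as in #13.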
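A minimal sketch of how the single-step form is used in DP, assuming a tabular MDP in which the policy has already been folded into a transition matrix `P` and an expected-reward vector `R`. Both names and the two-state example are hypothetical and are not taken from #13.

```python
import numpy as np

def policy_evaluation(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterative policy evaluation for a fixed policy.

    P[s, s2] : state-transition probabilities under the policy (rows sum to 1).
    R[s]     : expected immediate reward in state s under the policy.
    """
    V = np.zeros(len(R))
    for _ in range(max_iter):
        V_new = R + gamma * P @ V          # one Bellman backup for every state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical two-state chain: state 0 pays 1 and may fall into the
# absorbing, reward-free state 1.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
V = policy_evaluation(P, R, gamma=0.9)
```

Because the dynamics are known, the same values can be checked against the closed form `np.linalg.solve(np.eye(2) - 0.9 * P, R)`.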

NorbertZheng · 2022-01-09T03:10:57Z

Distributional Reinforcement Learning

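Consider the following game of Monopoly:

Here the state can be represented as the position on the board, which is deterministic, but the action itself is stochastic (because of the dice). Moreover, because the action is translation-invariant, all states can be merged into a single state, which turns this into a multi-armed bandit problem (the state is reduced to a single case), and the stochasticity of the action is folded into the stochasticity of the reward. This corresponds exactly to the distribution-on-reward category in the Summary section, as opposed to distribution on policy.
We then obtain the reward dist and its dist-mean as: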

NorbertZheng · 2022-01-09T03:52:22Z

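Why should we learn the whole distribution? Rewards are often complex and multimodal. If we only care about the expected reward, we ignore its internal stochasticity, which oversimplifies the internal representation of the task and discards a lot of information that is actually needed. We therefore need to model the whole distribution.
So what does distributional RL do? It simply drops the expectation from the original Bellman equation and updates at the level of distributions: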
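The figure is missing. Assuming it showed the distributional Bellman equation from Bellemare et al. (listed under Related References), where the equality holds in distribution rather than in expectation:

```latex
Z(s, a) \;\overset{D}{=}\; R(s, a) + \gamma\, Z(S', A'),
\qquad S' \sim P(\cdot \mid s, a),\quad A' \sim \pi(\cdot \mid S')
```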

That is, it defines a new variable here, the return-dist:
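A sketch of that definition, assuming the usual notation in which the return-dist is the random discounted reward sum and the ordinary action value is its mean:

```latex
Z^{\pi}(s, a) \;=\; \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t),
\quad s_0 = s,\; a_0 = a,\; s_{t+1} \sim P(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi(\cdot \mid s_{t+1}),
\qquad
Q^{\pi}(s, a) \;=\; \mathbb{E}\bigl[ Z^{\pi}(s, a) \bigr]
```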

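It can capture the randomness in the MDP, which includes 1) the randomness of the reward and 2) the randomness of the state-trans. Both are in fact expressed in the env dynamics equation. In addition, it can capture the randomness in the agent-policy (which is distinct from the randomness in the MDP, because the MDP models the env, and the env and the agent are two parts of the whole system).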

NorbertZheng · 2022-01-09T03:58:35Z

We can then perform distribution-level updates according to the modified Bellman equation:

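Note that for this Bellman equation, the paper (Bellemare et al.) proves convergence under the Wasserstein metric, but it does not show that the control process necessarily converges to a global-optimal policy.
The concrete procedure is (taking the C51 algorithm from distributional RL as an example):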

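Apply Pπ to the current return-distribution table Z to obtain the probability-weighted next-step return-dist.
Then shrink it by γ, contracting toward the origin (0).
Add the sample-reward obtained on the current trial, which shifts it.
Use Πn to align it with the fixed bins, then assign the result to the updated return-distribution table Z'.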
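The shrink, shift, and bin-alignment steps above can be sketched for a single sampled transition as follows. This is a simplified illustration rather than the full C51 algorithm: the Pπ step is replaced by a next-step probability vector `p_next` passed in by the caller, and the function name, argument names, and toy numbers are all hypothetical.

```python
import numpy as np

def categorical_projection(p_next, reward, gamma, v_min, v_max):
    """Shrink, shift, and re-bin a categorical return distribution.

    p_next : probabilities of the next-step return over the fixed atoms.
    Returns the atoms and the projected probabilities on those same atoms.
    """
    p_next = np.asarray(p_next, dtype=float)
    n_atoms = len(p_next)
    atoms = np.linspace(v_min, v_max, n_atoms)
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Shrink toward 0 by gamma, shift by the sampled reward, clip to the range.
    tz = np.clip(reward + gamma * atoms, v_min, v_max)

    # Fractional grid index of every shifted atom (clipped against rounding).
    b = np.clip((tz - v_min) / delta_z, 0, n_atoms - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros(n_atoms)
    for j in range(n_atoms):
        if lower[j] == upper[j]:           # lands exactly on an atom
            projected[lower[j]] += p_next[j]
        else:                              # split mass between both neighbours
            projected[lower[j]] += p_next[j] * (upper[j] - b[j])
            projected[upper[j]] += p_next[j] * (b[j] - lower[j])
    return atoms, projected

# Hypothetical toy case: 5 atoms on [0, 4], all mass on the atom at 2.
atoms, z_new = categorical_projection([0, 0, 1, 0, 0], reward=0.5, gamma=0.5,
                                      v_min=0.0, v_max=4.0)
```

In the toy case the shifted atom lands at 1.5, halfway between the atoms at 1 and 2, so its mass is split equally between them. Whenever a shifted atom falls outside [v_min, v_max], its mass is clipped onto the boundary atom.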

NorbertZheng · 2022-01-09T04:03:52Z

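Of course C51 is not the only way to represent a distribution, and because its value range is fixed in advance, C51 can also cause reward-underestimate (since extreme values are cut off). Some nonparametric representations of the distribution used in distributional RL are: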

Categorical
Quantile Representation
Expectile Representation
...

NorbertZheng · 2022-01-09T04:09:57Z

Does distributional RL perform better than expected-return RL (Yes...)? Our representation is richer, but when we finally compare two policies we still use only the mean of the distribution.

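As the comparison shows, the improvement from distributional RL is large. One thing worth noting is that QR-DQN and IQN are both built on the quantile function: QR-DQN estimates quantile values at a fixed set of fractions, while IQN learns the quantile function (the inverse CDF) itself. Yet IQN performs better than QR-DQN (which honestly feels like black magic).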

NorbertZheng · 2022-01-09T04:12:27Z

Why does it work?

Auxiliary task effect:
- Same signal to learn from but more predictions.
- More Predictions → Richer Signal → Better Representations.
- Reduce state aliasing (disambiguate different states based on return).
Density estimation instead of L2-regressions.
- RL uses same tools as deep learning.
- Lower variance gradient.
Other reasons?

NorbertZheng · 2022-01-09T04:18:46Z

Distributional RL in brain

RW rule

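The RW (Rescorla-Wagner) rule is a simplified version of TD, but it is also more myopic. Its advantage over TD is that its definition involves no bootstrapping, so it is computationally simpler, and its computation can be viewed as an SGD update. In fact the same view also applies to TD: all of these update rules can be seen as optimizing some objective. The expected RW rule is defined as follows: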
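The equation image is missing. Assuming the usual form of the rule, with V the value estimate, r the received reward, and α a learning rate (these symbols are our choice):

```latex
\delta = r - V, \qquad V \;\leftarrow\; V + \alpha\,\delta
```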

This equation is in fact the SGD process that minimizes the MSE:
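A sketch of the objective and of the step it induces, assuming a scalar estimate V and a learning rate α (the factor 1/2 is only a convenience that cancels in the gradient):

```latex
\mathcal{L}(V) = \tfrac{1}{2}\,\mathbb{E}_{r}\bigl[(r - V)^{2}\bigr],
\qquad
-\frac{\partial}{\partial V}\,\tfrac{1}{2}(r - V)^{2} = r - V
\;\;\Longrightarrow\;\;
V \leftarrow V + \alpha\,(r - V)
```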

Setting its derivative to zero gives the fixed point, which shows that it finally converges to the distribution-mean.
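The fixed point can be checked numerically. The sketch below assumes the update V ← V + α(r − V) with a constant learning rate and a hypothetical bimodal reward (0 with probability 0.75, 10 with probability 0.25, so the mean is 2.5). With a constant α the estimate keeps fluctuating, so the second half of the trace is averaged.

```python
import numpy as np

def rw_trace(rewards, alpha, v0=0.0):
    """Run the RW update on a reward sequence and return every estimate."""
    v = v0
    trace = np.empty(len(rewards))
    for t, r in enumerate(rewards):
        v += alpha * (r - v)               # step along the negative MSE gradient
        trace[t] = v
    return trace

rng = np.random.default_rng(0)
# Hypothetical bimodal reward distribution with mean 2.5.
rewards = rng.choice([0.0, 10.0], size=50_000, p=[0.75, 0.25])
trace = rw_trace(rewards, alpha=0.002)
tail_mean = trace[len(trace) // 2:].mean()  # should sit close to 2.5
```

The estimate settles near 2.5 even though no single reward ever equals 2.5, which is exactly the information loss the distributional view is meant to avoid.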

NorbertZheng · 2022-01-09T04:23:10Z

Expectile Representation

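How do we obtain the Expectile Representation? We already have the RW rule for the distribution-mean above, which is just the RW rule for the 50%-expectile. With only a small modification (weighting positive-error and negative-error differently) we obtain the RW rule for the α-expectile (the expectile level is written τ further below):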
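The equation image is missing. A sketch of the asymmetric rule, writing τ for the expectile level so that it does not clash with the learning rates, and assuming separate rates α⁺ and α⁻ for positive and negative errors (symbol names are our choice):

```latex
\delta = r - V, \qquad
V \;\leftarrow\;
\begin{cases}
V + \alpha^{+}\,\delta, & \delta > 0 \\[2pt]
V + \alpha^{-}\,\delta, & \delta \le 0
\end{cases}
\qquad\text{with}\qquad
\tau = \frac{\alpha^{+}}{\alpha^{+} + \alpha^{-}}
```

Setting α⁺ = α⁻ gives τ = 1/2 and recovers the plain RW rule.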

This is in fact the SGD update equation for the following minimization objective:
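A sketch of that objective, assuming the standard asymmetric squared loss that defines expectiles, with τ the expectile level, δ = r − V, and α a base learning rate:

```latex
\mathcal{L}_{\tau}(V) = \tfrac{1}{2}\,\mathbb{E}_{r}\Bigl[\,\bigl|\tau - \mathbb{1}\{r < V\}\bigr|\,(r - V)^{2}\Bigr],
\qquad
-\frac{\partial}{\partial V}\,\tfrac{1}{2}\bigl|\tau - \mathbb{1}\{r < V\}\bigr|\,(r - V)^{2}
=
\begin{cases}
\tau\,\delta, & \delta > 0 \\[2pt]
(1 - \tau)\,\delta, & \delta \le 0
\end{cases}
```

An SGD step with base rate α therefore uses the effective rates ατ for positive errors and α(1 − τ) for negative errors.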

Therefore, different loss functions lead to different statistical estimates of the return-distribution.
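A small simulation makes this concrete. It assumes an RW-style update with different learning rates for positive and negative errors, and a hypothetical reward that is 0 or 10 with equal probability. For this particular distribution the τ-expectile is 10τ, so rates in a 4:1 ratio (τ = 0.8) should settle near 8 and rates in a 1:4 ratio (τ = 0.2) near 2.

```python
import numpy as np

def expectile_rw_trace(rewards, alpha_pos, alpha_neg, v0=0.0):
    """RW update with separate learning rates for positive and negative errors."""
    v = v0
    trace = np.empty(len(rewards))
    for t, r in enumerate(rewards):
        delta = r - v
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
        trace[t] = v
    return trace

rng = np.random.default_rng(0)
# Hypothetical reward: 0 or 10 with equal probability (mean 5).
rewards = rng.choice([0.0, 10.0], size=50_000)

optimist = expectile_rw_trace(rewards, alpha_pos=0.004, alpha_neg=0.001)   # tau = 0.8
pessimist = expectile_rw_trace(rewards, alpha_pos=0.001, alpha_neg=0.004)  # tau = 0.2
v_opt = optimist[25_000:].mean()    # averaged tail, expected near 8
v_pes = pessimist[25_000:].mean()   # averaged tail, expected near 2
```

Both learners see the same reward sequence. Only the weighting of positive versus negative errors differs, and that alone moves the estimate away from the mean of 5.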

NorbertZheng · 2022-01-09T04:25:35Z

Predictions of Expectile Representation

There is ample diversity in asymmetric scaling factors (τ) across dopamine neurons.
This results in optimistic and pessimistic value predictors.
The reversal points should be positively correlated with their τ.
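The third prediction can be illustrated with a toy population. The sketch assumes each unit's reversal point equals the τ-expectile of the reward distribution, solved here by bisection on the asymmetric prediction error. The reward magnitudes and the range of τ values are hypothetical and are not the recorded data.

```python
import numpy as np

def expectile(samples, tau, n_iter=60):
    """tau-expectile of equally weighted samples, found by bisection."""
    samples = np.asarray(samples, dtype=float)
    lo, hi = samples.min(), samples.max()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        diff = samples - mid
        # Mean asymmetric error: positive part weighted tau, negative part 1 - tau.
        g = np.mean(np.where(diff > 0, tau * diff, (1 - tau) * diff))
        if g > 0:          # candidate is too pessimistic, move up
            lo = mid
        else:              # candidate is too optimistic, move down
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical equiprobable reward magnitudes and a population of tau values.
rewards = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])
taus = np.linspace(0.1, 0.9, 9)
reversal_points = np.array([expectile(rewards, t) for t in taus])
corr = np.corrcoef(taus, reversal_points)[0, 1]
```

In this toy population the reversal points increase monotonically with τ, and the unit with τ = 0.5 reverses at the mean reward.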

NorbertZheng · 2022-01-09T04:28:41Z

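In the second item above, an optimistic RPE is always coupled to its corresponding value predictor. This is actually a rather unreasonable assumption. Perhaps other value predictors also project to it, only with very small influence? Or perhaps the weighted combination of value predictors achieves the same effect as the corresponding value predictor alone?
So the view here is that, for the reward of a given task, the brain keeps only one distribution, and the various expectiles of that distribution are scattered across different brain regions.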

NorbertZheng · 2022-01-09T04:30:43Z

Evidence of Expectile Representation in brain

Data from two earlier experiments are used here for testing:

Variable-Magnitude Task: a single cue, followed by a reward of unpredictable magnitude.
Variable-Probability Task: three cues, each signaling a different probability of reward, with the reward magnitude fixed.

NorbertZheng · 2022-01-09T04:33:27Z

The evidence corresponding to the Expectile Representation predictions is:

Different dopamine neurons consistently reverse from positive to negative responses at different reward magnitudes.
Optimistic and pessimistic probability coding occur concurrently in dopamine and VTA GABAergic neurons.
Relative scaling of positive and negative dopamine responses predicts reversal point.

NorbertZheng · 2022-01-09T07:06:20Z

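Of course, all of this first takes one assumption for granted: that the firing rate of dopamine neurons in VTA corresponds to the Reward Prediction Error (RPE). This still needs further testing. Gershman et al. 2020 proposed a unified experiment for testing whether dopamine firing represents RPE, and in that experiment dopamine in some subregions of VTA does represent RPE.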

NorbertZheng · 2022-01-09T07:09:29Z

How much the brain can benefit from such a distributional representation is also a question worth discussing. The discussion in the paper is as follows:

Support efficient representation learning in multilayer neural networks.
- Drive to distinguish these states in lower layers of the network.
- Reducing state aliasing (disambiguate different states based on return).
- Improve performance even with risk-neutral policies.
Whether or not such distributional codes also promote state learning in the brain remains to be tested experimentally.
- Not only distribution, but also uncertainty?

NorbertZheng · 2022-01-09T07:19:32Z

Other Population Code Schemes

Nonparametric Codes
- Quantile-like Code
- Probabilistic Population Code
- Distributed Distributional Code
Parametric Codes

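As for Parametric Codes, I personally do not quite agree with them. After all, the candidate distributions are all human inventions, and the idea is that an entire brain region encodes one parameter, which downstream regions then use for other computations, and those downstream regions have to know what the parameter means within the corresponding distribution. That said, #13 does model the distribution with an Approximate Kalman Filter, without caring about its biological plausibility.
Another point about Parametric Codes: for the brain, a distribution should be a subjective process, or what is called a belief state distribution, and every neural representation can in fact be viewed as one sample from that distribution. Of course, Nonparametric Codes include not only the Quantile-like Code mentioned earlier but also other forms, such as the Probabilistic Population Code. Below is the subjective distribution over direction uncertainty in visual cortex: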

NorbertZheng · 2022-01-09T07:20:15Z

Summary

NorbertZheng mentioned this issue Jan 8, 2022

NeurIPS '21 | Two steps to risk sensitivity. #13

Closed

NorbertZheng added dopamine reinforcement-learning theory labels Jan 8, 2022

NorbertZheng closed this as completed Jan 9, 2022

NorbertZheng mentioned this issue Mar 9, 2022

arXiv '22 | How to build a cognitive map: insights from models of the hippocampal formation. #20

Closed
