# A Primer on Deep Q-Learning - Part 1/2


Before starting to write a blog post I always ask myself: “What is the added value?”. There is a lot of awesome ML material out there. And a lot of duplicates as well. Especially when it comes to all the flavors of Deep Reinforcement Learning. So you might wonder: what is the added value of this two-part blog post on Deep Q-Learning? It is threefold.

1. We will review all the ingredients of modern value-based Reinforcement Learning (RL). Starting with Tabular Q-Learning (Watkins & Dayan, 1992) we will travel through the world of value function approximation and finally end up with all the ingredients of the state-of-the-art Rainbow (Hessel et al., 2018) algorithm, all in a single blog post.

2. In the next post we will gain further insights into the effect of hyperparameter choices in Deep Q-Learning. In order to do so, we will deal with the simplified gridworld shown below. You can find the corresponding code in this repository.

3. After having become experts in ablation studies and fiddling around with hyperparameters, we will have a closer look at three recent papers that shine some more empirical as well as theoretical light on the divergence behavior of Deep Q-Learning dynamics.

Let’s get rolling… or updating! But first a note of caution: we cover quite a lot of ground, so feel free to jump around and skip sections you feel you have already mastered.

# Setting the Stage: The Reinforcement Learning Problem

Our everyday lives are full of perceptual experiences of our surroundings. We constantly make decisions that change how & what we experience. Crucially, we get rewarded for good choices and sometimes punished for bad ones. Feedback is in many ways crucial to our existence. This feedback can be extrinsic (e.g. monetary or the satisfaction of physical needs) or intrinsic (e.g. motivation, curiosity or social aspects). We adapt our behavior to such cues. Why? To fulfill our hedonistic desires or simply to survive. And Reinforcement Learning describes a computational framework for exactly these reward-based adaptation dynamics.

## The World as a Markov Decision Process

Most traditional RL approaches are situated within the more general framework of Markov Decision Processes (MDPs). An MDP consists of the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma, p_0)$ (a small code sketch follows the list), where:

• $\mathcal{S}$: Denotes the state space in which the agent is situated at the current point in time $t$, i.e. $s_t \in \mathcal{S}$.
• $\mathcal{A}$: Denotes the action space from which the agent chooses its next move, i.e. $a_t \in \mathcal{A}$.
• $\mathcal{T}$: Denotes the transition kernel, which provides the probability of transitioning to a state $s'$ given that the agent is currently in state $s$ and takes action $a$, i.e. $p(s' \mid s, a)$.
• $\mathcal{R}$: Denotes the reward function which, given a state transition $(s, a, s')$, issues a reward to the agent. This reward can be stochastic or deterministic.
• $\gamma$: Denotes a factor $\gamma \in [0,1)$ with which rewards accumulated in future timesteps are discounted. From an economics point of view, returns we can obtain today are associated with less uncertainty and can directly be consumed. More fundamentally, the discount factor can be viewed as a survival probability which incorporates some fundamental time uncertainty of the agent.
• $p_0$: Denotes the initial state distribution for $t=0$, i.e. where the agent is initialized.
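
To make the tuple concrete, here is a minimal sketch of how such an MDP could be written down in code. Everything in it (the two-state toy dynamics and the array names `T`, `R`, `p0`) is purely illustrative and not taken from the repository accompanying this post:

```python
import numpy as np

# Purely illustrative 2-state, 2-action MDP; all numbers are made up.
n_states, n_actions = 2, 2

# Transition kernel T[s, a, s'] = p(s' | s, a); each row sums to 1.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.0, 1.0], [0.5, 0.5]],  # transitions from state 1
])

# Reward function R[s, a]: expected reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

gamma = 0.95               # discount factor
p0 = np.array([1.0, 0.0])  # initial state distribution: always start in state 0

# Sanity check: every (s, a) pair defines a proper distribution over s'.
assert np.allclose(T.sum(axis=-1), 1.0)
```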

Given a state, the agent chooses its next action according to a policy $\pi(a \mid s)$. The policy can be deterministic, meaning that in a given state the agent will always choose the same action, or stochastic, in which case the next action is sampled from the distribution $\pi(\cdot \mid s)$.
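
As a small illustration (again just a sketch with made-up numbers), sampling from a stochastic policy versus acting deterministically could look as follows:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stochastic policy pi[s, a] = pi(a | s): one distribution over actions per state.
pi = np.array([
    [0.7, 0.3],
    [0.1, 0.9],
])

def act_stochastic(s: int) -> int:
    # Sample the next action from pi(. | s).
    return int(rng.choice(len(pi[s]), p=pi[s]))

def act_deterministic(s: int) -> int:
    # A deterministic policy always returns the same action for a given state,
    # here simply the most probable one under pi.
    return int(np.argmax(pi[s]))
```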

So what exactly is Markovian about this MDP, and does it restrict us in modeling reward-based learning? The next state transition is assumed to depend only on the state and action of the current time step. So what if the agent had to pick up a key twenty steps ago to open a door right now? As long as the key is part of the current state, that is not a problem. If not, we are in trouble. Velocity in ATARI games is another such example. A single frame cannot tell you in which direction and how fast a ball is moving. Mhmm… that is a bummer. But wait! We can infer velocity from two consecutive frames! So as long as the state is represented cleverly, all dynamics can be described as Markovian. Finding the right composition of states can be challenging, though.
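
A common way to restore the Markov property in practice is to stack the last few frames into a single observation, so that velocity becomes recoverable. The snippet below is only a rough sketch of that idea (standard ATARI preprocessing wrappers do essentially the same, just with a few more details):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Keep the last k frames so that direction and speed of moving objects
    can be inferred from the stacked observation. Illustrative sketch only."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At the start of an episode, fill the buffer with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return np.stack(self.frames, axis=0)

    def step(self, frame: np.ndarray) -> np.ndarray:
        # Append the newest frame; the oldest one drops out automatically.
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```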

## Problem Formulation & Dynamic Programming

So how can we articulate all of our hedonistic desires? We need some maths here! Two key quantities of interest in RL are the state value function $V^\pi(s)$ and the action value function $Q^\pi(s,a)$:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s \right], \qquad Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right],$$

where $V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) Q^\pi(s,a)$. Intuitively, $Q^\pi(s,a)$ denotes the discounted return one expects to obtain when starting from state $s$, first taking action $a$ and only afterwards continuing to follow the current policy $\pi$. $V^\pi(s)$, on the other hand, selects actions according to the policy right from the start. Ultimately, in order to solve the Reinforcement Learning problem we are interested in the optimal behavior policy $\pi^\star$ that is derived by maximizing this expression:

$$\pi^\star = \arg\max_{\pi} V^\pi(s) \quad \text{for all } s \in \mathcal{S}.$$
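
For a tabular MDP like the toy example above, $V^\pi$ and $Q^\pi$ can be computed exactly by iterating the relation between the two until convergence. The function below is a rough sketch of such iterative policy evaluation, reusing the illustrative arrays `T`, `R`, `pi` and `gamma` from the earlier snippets (they are not part of the original post):

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma, tol=1e-8):
    """Iteratively compute V^pi and Q^pi for a tabular MDP.

    T[s, a, s'] = p(s' | s, a), R[s, a] = expected reward, pi[s, a] = pi(a | s).
    Illustrative sketch; array conventions follow the MDP example above."""
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' p(s' | s, a) V^pi(s')
        Q = R + gamma * (T @ V)
        # V^pi(s) = sum_a pi(a | s) Q^pi(s, a)
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Example usage with the toy MDP and policy defined earlier:
# V_pi, Q_pi = policy_evaluation(T, R, pi, gamma)
```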

# The One Equation Summary aka TL;DR

In this blog post we took a deep dive into Deep Q-Learning. Below you will find a figure summarizing the key elements of each algorithm:

See you next time for the second half of this series, in which we cover some hyperparameter experiments and discuss some theoretical foundations of the divergence behavior observed in Deep Q-Learning!

# References

1. Dabney, W., M. Rowland, M. G. Bellemare, and R. Munos (2018): “Distributional Reinforcement Learning with Quantile Regression,” Thirty-Second AAAI Conference on Artificial Intelligence.
2. Dabney, W., G. Ostrovski, D. Silver, and R. Munos (2018): “Implicit Quantile Networks for Distributional Reinforcement Learning,” arXiv preprint arXiv:1806.06923.
3. Thrun, S., and A. Schwartz: “Issues in Using Function Approximation for Reinforcement Learning.”
4. Watkins, C. J. C. H., and P. Dayan (1992): “Q-Learning,” Machine Learning, 8, 279–92.
5. Bellemare, M. G., W. Dabney, and R. Munos (2017): “A Distributional Perspective on Reinforcement Learning,” Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 449–58.
6. Fortunato, M., M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, et al. (2017): “Noisy Networks for Exploration,” arXiv preprint arXiv:1706.10295.
7. Schultz, W., P. Dayan, and P. R. Montague (1997): “A Neural Substrate of Prediction and Reward,” Science, 275, 1593–99.
8. Van Hasselt, H., Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil (2018): “Deep Reinforcement Learning and the Deadly Triad,” arXiv preprint arXiv:1812.02648.
9. Sutton, R. S., and A. G. Barto (2018): Reinforcement Learning: An Introduction.
10. Gordon, G. J. (1996): “Stable Fitted Reinforcement Learning,” Advances in Neural Information Processing Systems.
11. Van Hasselt, H. (2010): “Double Q-Learning,” Advances in Neural Information Processing Systems.
12. Hessel, M., J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018): “Rainbow: Combining Improvements in Deep Reinforcement Learning,” Thirty-Second AAAI Conference on Artificial Intelligence.
13. Lin, L.-J. (1992): “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching,” Machine Learning, 8, 293–321.
14. Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013): “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602.
15. Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, et al. (2015): “Human-Level Control through Deep Reinforcement Learning,” Nature, 518, 529.
16. Schaul, T., J. Quan, I. Antonoglou, and D. Silver (2015): “Prioritized Experience Replay,” arXiv preprint arXiv:1511.05952.
17. Van Hasselt, H., A. Guez, and D. Silver (2016): “Deep Reinforcement Learning with Double Q-Learning,” Thirtieth AAAI Conference on Artificial Intelligence.
18. Wang, Z., T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas (2015): “Dueling Network Architectures for Deep Reinforcement Learning,” arXiv preprint arXiv:1511.06581.
