You might have struggled against some of the CPU opponents while playing arcade-style games back in the day, or you might have heard how Google DeepMind's AI recently scored an overwhelming victory over professional players in the real-time strategy game StarCraft II. From creating a rational agent that can "learn" to solve a series of complex tasks on its own to producing human-like behavior and surpassing human-level performance, reinforcement learning sits at the core of artificial intelligence, enabling machines and systems to discover patterns and strategies that are hard to program or devise in advance.

Google DeepMind's AI vs. Professional StarCraft II Players

To get started, let's consider the classic game of chess.

A rational agent needs to generate the correct (optimal) move for each state/position it encounters over the course of the game. However, even on just an 8x8 chessboard, and supposing there are only 5 different types of chess pieces, the size of the overall state space (the set of all possible configurations of the game) is still enormous:
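As a rough, illustrative upper bound (ignoring move legality and simply letting every square be empty or hold any one of the 5 piece types, independently of the others), the count of board configurations is already astronomical:

$$|\mathcal{S}| \;\le\; (5 + 1)^{64} \;=\; 6^{64} \;\approx\; 6.3 \times 10^{49}$$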

As we can see, as the number of different chess pieces grows, the size of the state space skyrockets exponentially, exceeding the capability of a normal computer to solve the problem in a reasonable amount of time. So how do we actually get around this state-space explosion, which supervised learning often runs up against?

Reinforcement learning is a reward-and-penalty-based approach, much like operant conditioning in psychology, in which an agent is trained to recognize a behavior as desirable or undesirable while being encouraged to explore the environment around it. As the agent experiences episodes with different consequences (mainly rewards and penalties), it learns over time whether a particular state (where it gets rewarded or penalized) should be avoided or frequented.
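To make this concrete, here is a minimal sketch of that reward-and-penalty loop on a made-up one-dimensional toy world. The world, its reward values, and the pure random-exploration policy are illustrative assumptions, not a standard benchmark or library API:

```python
import random

# Hypothetical 1-D world: positions 0..6, a "pit" at 0 (penalty) and a "goal" at 6 (reward).
PIT, GOAL, START = 0, 6, 3
ACTIONS = (-1, +1)  # step left or step right

def run_episode(state_values, alpha=0.1):
    """One episode of pure exploration: act randomly, then credit every visited state."""
    state, visited = START, []
    while state not in (PIT, GOAL):
        visited.append(state)
        state += random.choice(ACTIONS)          # explore: pick a random action
    outcome = 1.0 if state == GOAL else -1.0     # reward at the goal, penalty at the pit
    for s in visited:                            # nudge the value of each visited state
        state_values[s] += alpha * (outcome - state_values[s])
    return outcome

values = {s: 0.0 for s in range(PIT, GOAL + 1)}
for _ in range(2000):
    run_episode(values)

# Intermediate states near the goal end up with positive values (worth frequenting),
# while those near the pit end up negative (worth avoiding).
print({s: round(v, 2) for s, v in values.items()})
```

After enough episodes, the value estimates near the goal drift positive and those near the pit drift negative, which is exactly the "avoid or frequent" signal described above.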

There are two main paradigms when it comes to RL algorithms:

  • Model-based Learning
  • Model-free Learning

In this post, we will focus on Model-free Learning, as it has demonstrated significant potential in recent years: it requires no prior knowledge of the environment's dynamics, it is guaranteed to eventually converge to an optimal policy (in the tabular setting, under suitable exploration and step-size conditions), and it pairs naturally with deep neural networks (which greatly improves the complexity and comprehensiveness of the agent's final decision-making).

Without spending too much time on introductory material, let's go straight into the following two topics, which provide the foundations necessary for a high-level understanding of the later content on deep reinforcement learning.

Markov Decision Process (MDP)

A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. In the simplest terms, an MDP represents the collection of different states a system can be in. It is usually drawn as a directed graph, with each node representing a different state and a directed edge from state A to state B representing a possible action (transition) from the current state.
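As a concrete sketch, that directed-graph view can be written down as a nested mapping from states to actions to weighted outcomes, which is just an adjacency-list form of the graph. All state names, probabilities, and rewards below are made up purely for illustration:

```python
# A toy MDP as an adjacency-list style mapping:
#   state -> action -> list of (probability, next_state, reward)
mdp = {
    "cool": {
        "slow": [(1.0, "cool", +1.0)],
        "fast": [(0.5, "cool", +2.0), (0.5, "warm", +2.0)],
    },
    "warm": {
        "slow": [(0.5, "cool", +1.0), (0.5, "warm", +1.0)],
        "fast": [(1.0, "overheated", -10.0)],
    },
    "overheated": {},  # terminal state: no outgoing actions
}

def successors(state, action):
    """All (probability, next_state, reward) edges for taking `action` in `state`."""
    return mdp[state][action]

print(successors("warm", "fast"))  # [(1.0, 'overheated', -10.0)]
```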

Here are some of the important terms that define an MDP:

  • A set of states $S$ that the system can be in
  • A set of actions $A$ available to the agent in each state
  • A transition function $P(s' \mid s, a)$, the probability of landing in state $s'$ after taking action $a$ in state $s$
  • A reward function $R(s, a, s')$, the immediate reward received for that transition
  • A discount factor $\gamma \in [0, 1)$ that weighs immediate rewards against future ones

Q-learning, based on the Markov decision process model, estimates the state-action value function $Q(s, a)$ for a target policy that deterministically selects the action of highest value.

The standard Q-learning update for the parameters $\theta_t$, after taking action $a_t$ in state $s_t$ and observing the immediate reward $r_{t+1}$ and the resulting state $s_{t+1}$, is as follows:
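$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^{Q} - Q(s_t, a_t; \theta_t) \right) \nabla_{\theta_t} Q(s_t, a_t; \theta_t), \qquad Y_t^{Q} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta_t)$$

where $\alpha$ is the step size (learning rate) and $\gamma \in [0, 1)$ is the discount factor. In the tabular case, where $Q$ is just a lookup table with one entry per state-action pair, this reduces to the familiar update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big)$.

The sketch below shows that tabular special case with $\epsilon$-greedy exploration. The environment interface it assumes (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) is a hypothetical, gym-style convention chosen for illustration, not a specific library's API:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is assumed to expose reset() -> state and step(action) -> (next_state, reward, done);
    this interface is illustrative, not tied to any particular library.
    """
    Q = defaultdict(float)  # Q[(state, action)], default value 0.0

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: explore with probability epsilon, otherwise act greedily.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Target policy is greedy: bootstrap from the best action in the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    return Q
```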