Value-Based Reinforcement Learning

Today, I’ll talk about how to estimate the action-value function in reinforcement learning, which is usually written as $Q(s,a)$.

Actually, this $Q(s,a)$ comes in two types, defined as below.

Action-value function for policy $\pi$: $$ Q_{\pi}(s_t,a_t) = \mathbb{E}[U_t|S_t = s_t, A_t = a_t] $$

It gives the value of each action in each state of the agent, and it can be shown as a table like this:

|       | $a_1$ | $a_2$ | $a_3$ | $a_4$ |
| ----- | ----- | ----- | ----- | ----- |
| $s_1$ | 1     | 2     | 3     | 4     |
| $s_2$ | -2    | -1    | 0     | 5     |
| $s_3$ | -3    | 6     | 8     | 9     |

The other type is the optimal action-value function.

$$ Q^*(s_t,a_t) = \mathop{max}\limits_{\pi}Q_{\pi}(s_t,a_t) $$

As we can see, this function gives the highest value of $Q_{\pi}(s_t,a_t)$ over all policies, so it’s easy to understand how an agent decides between actions depending on its current state and the environment.

The Approximation

To understand this, let’s set up a situation.

Goal: win the game (maximize the total reward).

Question: If we know $Q^*(s,a)$, what is the best action?

  • Obviously, the best action is $a^* = \mathop{argmax}\limits_{a}Q^*(s,a)$.

$Q^*$ is an indication of how good it is for an agent to pick action $a$ while being in state $s$. The agent then chooses the best action and tries its best to achieve the goal.
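To make this concrete, here is a minimal sketch in Python (using NumPy) of picking the greedy action from the small Q-table above; the state and action indexing is only for illustration.

```python
import numpy as np

# Q-table from the example above: rows are states s1..s3, columns are actions a1..a4.
Q = np.array([
    [ 1,  2, 3, 4],   # s1
    [-2, -1, 0, 5],   # s2
    [-3,  6, 8, 9],   # s3
])

def best_action(state_index: int) -> int:
    """Return argmax_a Q(s, a) for the given state index."""
    return int(np.argmax(Q[state_index]))

for s in range(Q.shape[0]):
    print(f"s{s + 1}: best action = a{best_action(s) + 1}, value = {Q[s].max()}")
```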

The way I understand it, this simulates how a person makes decisions, except that the computer can calculate the expected value of the whole process and make the best choice, taking the best action in every state of the environment.

But the challenge is: how do we know $Q^*(s,a)$?

The general solution is the Deep Q-Network (DQN).

That is, we use a neural network $Q(s,a;w)$ to approximate $Q^*(s,a)$.
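As a rough sketch of what such a network might look like (assuming a PyTorch setup; the layer sizes and dimensions here are arbitrary choices, not from any particular paper), the input is a state vector and the output is one score per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q*(s, a): input is a state vector, output is one score per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

# Example: a state with 4 features and 2 possible actions.
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
q_values = q_net(state)           # shape (1, 2): Q(s, a; w) for each action
action = q_values.argmax(dim=1)   # greedy action a* = argmax_a Q(s, a; w)
```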

Solution

For example, suppose we ask the computer to estimate the travel time between Hangzhou and Beijing, and it gives the estimate $Q(w) = 1000$ minutes. But when the agent actually completes the trip for the first time, it only takes 900 minutes. Clearly, 900 minutes is a more reliable estimate than 1000.

So the loss in this situation is: $$ L = \frac{1}{2}(Q(w)-y)^2 $$

The gradient is: $$ \frac{\partial L}{\partial w}=(Q(w)-y) \times \frac{\partial Q(w)}{\partial w}=(1000-900) \times \frac{\partial Q(w)}{\partial w} $$

As we can see, the $(1000-900)$ term is the TD error.
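Here is a minimal sketch of that single gradient step in Python, assuming for simplicity a made-up one-parameter model where the estimate $Q(w)$ is just the parameter $w$ itself:

```python
import torch

# Made-up one-parameter model: the estimated trip time is simply the parameter w.
w = torch.tensor(1000.0, requires_grad=True)   # current estimate: 1000 minutes
y = torch.tensor(900.0)                        # observed trip time: 900 minutes

loss = 0.5 * (w - y) ** 2        # L = 1/2 * (Q(w) - y)^2
loss.backward()                  # dL/dw = (Q(w) - y) * dQ(w)/dw = (1000 - 900) * 1

with torch.no_grad():
    w -= 0.1 * w.grad            # gradient descent step with learning rate 0.1
print(w.item())                  # estimate moves from 1000 toward 900 (here: 990.0)
```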

Then we could make it more general.

Same situation, but this time when we arrive at Hefei our car breaks down, so we have to stop the trip. Still, we now know the actual travel time between Hangzhou and Hefei, and the computer can estimate the time for the remaining leg. Suppose the computer had estimated 400 minutes for Hangzhou to Hefei at the start, but the real time was 300 minutes, so the error is 100.

Then we could give this general equation.

$$ T_{Hangzhou \rightarrow Beijing} \approx T_{Hangzhou \rightarrow Hefei} + T_{Hefei \rightarrow Beijing} $$

So the corresponding equation in terms of $Q$ is: $$ Q(s_t,a_t;w) \approx r_t + \gamma \times Q(s_{t+1},a_{t+1};w) $$
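To connect this back to the travel analogy (taking $\gamma = 1$, and assuming the original 1000-minute estimate split into 400 minutes to Hefei plus 600 minutes onward), the TD target and TD error would be:

$$ y_t = \underbrace{300}_{r_t} + \underbrace{600}_{Q(s_{t+1},a_{t+1};w)} = 900, \qquad Q(s_t,a_t;w) - y_t = 1000 - 900 = 100 $$

which is the same 100-minute error as before.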

Here $r_t$ is the real data collected on this trip. From this, we can also write out the return: $$ U_t = R_t + \gamma \times R_{t+1} + \gamma^2 \times R_{t+2} + \gamma^3 \times R_{t+3} + \dots $$

If we factor out $\gamma$, the equation becomes: $$ U_t = R_t + \gamma \times (R_{t+1} + \gamma \times R_{t+2} + \gamma^2 \times R_{t+3} + \dots) $$

Then we could get the identity: $ U_t = R_t + \gamma \times U_{t+1}$
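A quick sketch of this identity in Python, computing the returns backwards over a hypothetical list of rewards:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute U_t = R_t + gamma * U_{t+1} for every step, working backwards."""
    returns = [0.0] * len(rewards)
    next_return = 0.0  # U_{T+1} = 0 once the episode is over
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns

# Hypothetical 4-step episode: U_3 = 10, U_2 = 9, U_1 = 8.1, U_0 = 1 + 0.9 * 8.1 = 8.29.
print(discounted_returns([1.0, 0.0, 0.0, 10.0]))
```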

Since $Q(s_t,a_t;w)$ is an estimate of $\mathbb{E}[U_t]$ and $Q(s_{t+1},a_{t+1};w)$ is an estimate of $\mathbb{E}[U_{t+1}]$, the method’s equation is:

$$ Q(s_t,a_t;w) \approx \mathbb{E}[R_t+\gamma \times Q(S_{t+1},A_{t+1};w)] $$

The left side of this equation is the prediction and the right side is the TD target. So we get these equations:

Prediction: $Q(s_t,a_t;w_t)$

TD target:

$$ y_t = r_t + \gamma \times Q(s_{t+1},a_{t+1};w_t) = r_t + \gamma \times \mathop{max}\limits_{a}Q(s_{t+1},a;w_t) $$

Loss: $L_t=\frac{1}{2}[Q(s_t,a_t;w_t)-y_t]^2$

The ultimate goal: $$ Q^*(s_t,a_t) = \mathop{max}\limits_{\pi}\mathbb{E}[U_t|S_t = s_t, A_t = a_t] $$

$Q(s,a;w)$ is a neural network parameterized by $w$: we input the observed state $s$ and it outputs a score for every action $a \in \mathcal{A}$.
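Putting the prediction, TD target, and loss together, here is a minimal sketch of a single training step in Python, continuing the hypothetical QNetwork above (replay buffers, target networks, and exploration are all left out; the tensor shapes in the comments are assumptions):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, state, action, reward, next_state, done, gamma=0.99):
    """One TD update: move Q(s_t, a_t; w) toward r_t + gamma * max_a Q(s_{t+1}, a; w).

    state/next_state: float tensors of shape (B, state_dim)
    action: long tensor of shape (B,); reward/done: float tensors of shape (B,)
    """
    # Prediction: Q(s_t, a_t; w) for the action that was actually taken.
    q_pred = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)

    # TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w), no gradient through the target.
    with torch.no_grad():
        q_next = q_net(next_state).max(dim=1).values
        td_target = reward + gamma * q_next * (1.0 - done)

    # Loss: L_t = 1/2 * (Q(s_t, a_t; w) - y_t)^2, averaged over the batch.
    loss = 0.5 * F.mse_loss(q_pred, td_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```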

The TD target is similar to the famous Bellman equation, which is worth remembering because it’s the heart of reinforcement learning.

Bellman equation: $$ Q_{*}(s_t,a_t) = \mathbb{E}_{S_{t+1}\sim p(\cdot|s_t,a_t)}\Big[R_t+\gamma \times \mathop{max}\limits_{A \in \mathcal{A}}Q_{*}(S_{t+1},A)\,\Big|\,S_t = s_t,A_t=a_t\Big] $$

Finally, value-based methods can solve this kind of problem. I keep wondering whether this kind of estimation could also be used for evaluation problems; if anyone has had the same thought, please tell me in the comments.