As beginners, the definitions are very important for us. Only once we are clear about these complex concepts can we understand the later material much more easily.

Note: Lowercase letters denote observed values of the variables; uppercase letters denote the corresponding random variables.


  • Agent
  • Environment
  • State $S$
  • Action $a$
  • Reward $r$
  • Policy $\pi(a|s)$
  • State transition $p(s'|s,a)$
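To make these definitions concrete, here is a minimal sketch of the agent–environment loop. The toy environment, its transition rule, and the random policy below are all hypothetical, invented only for illustration.

```python
import random

class ToyEnv:
    """A made-up environment with states 0, 1, 2; state 2 is terminal."""

    def reset(self):
        self.state = 0          # initial state s
        return self.state

    def step(self, action):
        # state transition p(s'|s, a): action 1 moves right, action 0 stays
        self.state = min(self.state + action, 2)
        reward = 1.0 if self.state == 2 else 0.0   # reward r
        done = self.state == 2
        return self.state, reward, done

def random_policy(state):
    # pi(a|s): this toy agent ignores s and picks an action uniformly
    return random.choice([0, 1])

env = ToyEnv()
s = env.reset()
total_reward = 0.0
for _ in range(10):
    a = random_policy(s)          # agent selects action a ~ pi(a|s)
    s, r, done = env.step(a)      # environment returns s', r
    total_reward += r
    if done:
        break
```

The loop shows how each concept fits together: the agent observes state $s$, samples action $a$ from the policy, and the environment responds with the next state $s'$ and reward $r$.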

Return and Value

  • Return: $$ U_t = R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+ \dots $$

  • Action-value function: $$ Q_\pi(s_t,a_t) = \mathbb{E}[U_t|s_t,a_t] $$

  • Optimal action-value function: $$ Q^*(s_t,a_t) = \max_{\pi} Q_\pi(s_t,a_t) $$

  • State-value function:

$$ V_\pi(s_t) = \mathbb{E}_{A}[Q_{\pi}(s_t,A)] $$
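The return above can be computed directly for a finite reward sequence. The rewards in this sketch are made-up values, used only to show how the discount factor $\gamma$ weights future rewards.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...
    for a finite list of rewards."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

# With rewards [1, 1, 1] and gamma = 0.9:
# U = 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], 0.9))
```

Note how $\gamma < 1$ makes rewards further in the future count for less, which is exactly what the definition of $U_t$ expresses.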


During the interaction process, the agent can be controlled either by sampling actions from the policy $\pi(a|s)$ or by acting greedily with respect to $Q^*(s,a)$.
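These two ways of controlling the agent can be sketched as follows. The probability and Q-value lists below are hypothetical placeholders, standing in for a learned policy or Q-function.

```python
import random

def act_with_policy(pi_probs):
    """Sample an action a ~ pi(a|s), given the probabilities
    pi(a|s) for each action in state s."""
    return random.choices(range(len(pi_probs)), weights=pi_probs)[0]

def act_with_q(q_values):
    """Act greedily with respect to Q*(s, a): pick the action
    with the largest Q-value in state s."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With made-up Q-values for three actions, the greedy choice is action 1:
print(act_with_q([0.2, 1.5, -0.3]))
```

Policy-based control is stochastic (actions are sampled), while $Q^*$-based control is deterministic (always the argmax), which is why estimating either one is enough to control the agent.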

So these two quantities are the targets we need to estimate, and we will learn methods for estimating them in later lessons.