The Big Picture

There are two main components in reinforcement learning, the agent and the environment, and their interaction is described by:

  • A set of states $S$
  • A set of actions $A$
  • A state transition function $\delta(s, a)$, which gives the next state after taking action $a$ in state $s$
  • A reward function $R(s, a)$, which gives the immediate reward after taking action $a$ in state $s$ (see the sketch below)
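
As a rough illustration, here is a minimal sketch of these four components in Python, assuming a made-up 1-D corridor environment with five cells (the states, actions, and reward values here are all hypothetical, not from the lecture):

```python
# Hypothetical 1-D corridor: five cells, the agent moves left or right,
# and earns a reward only for stepping into the rightmost cell.

S = list(range(5))   # set of states: cell indices 0..4
A = [-1, +1]         # set of actions: step left or step right

def delta(s, a):
    """State transition function: next state after taking action a in state s."""
    return min(max(s + a, 0), len(S) - 1)   # clamp to the corridor ends

def R(s, a):
    """Reward function: immediate reward for taking action a in state s."""
    return 1.0 if delta(s, a) == len(S) - 1 else 0.0

# One interaction step: from state 3, stepping right reaches state 4 and earns 1.0.
print(delta(3, +1), R(3, +1))   # -> 4 1.0
```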

Rewards

The most naive way to define the total reward is to simply sum up all the immediate rewards:

$$R_t = \sum_{i=t}^{\infty} r_i = r_t + r_{t+1} + r_{t+2} + \dots + r_{t+n} + \dots$$

However, future rewards are usually less valuable than immediate ones, so we introduce a discount factor $\gamma \in [0, 1]$ to reduce the value of future rewards:

$$R_t = \sum_{i=t}^{\infty} \gamma^i r_i = \gamma^t r_t + \gamma^{t+1} r_{t+1} + \gamma^{t+2} r_{t+2} + \dots + \gamma^{t+n} r_{t+n} + \dots$$
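
For instance, here is a small sketch (with made-up reward values) that evaluates this sum for a finite episode, which is enough to see how the discount shrinks later rewards:

```python
def discounted_return(rewards, gamma, t=0):
    """Discounted return R_t = sum over i >= t of gamma**i * r_i, for a finite episode."""
    return sum(gamma ** i * r for i, r in enumerate(rewards) if i >= t)

# Three steps of reward 1.0 with gamma = 0.9:
# R_0 = 1.0 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```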

Q-Function

The Q-function $Q(s, a)$ captures the expected total future reward for an agent taking action $a$ in state $s$:

$$Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$$
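
Because $Q$ is defined as an expectation, one simple way to approximate it (a Monte Carlo illustration, not something from the lecture) is to average the discounted return over many sampled rollouts that all start by taking action $a$ in state $s$:

```python
import random

def estimate_q(s, a, rollout, gamma=0.9, n_samples=10_000):
    """Monte Carlo estimate of Q(s, a): average the discounted return R_t over
    many rollouts. `rollout(s, a)` is a hypothetical helper that returns the
    list of immediate rewards observed after taking action a in state s."""
    total = 0.0
    for _ in range(n_samples):
        rewards = rollout(s, a)
        total += sum(gamma ** i * r for i, r in enumerate(rewards))
    return total / n_samples

# Toy rollout: reward 1 now, then a 50/50 chance of one more reward next step.
toy_rollout = lambda s, a: [1.0, 1.0 if random.random() < 0.5 else 0.0]

print(estimate_q(s=0, a=0, rollout=toy_rollout))   # roughly 1 + 0.9 * 0.5 = 1.45
```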

Policy

The agent needs a way to choose which action to take in each state. This is done by a policy:

πœ‹βˆ—(𝑠)=argmaxπ‘Žπ‘„(𝑠,π‘Ž)

which means that the policy should choose the action that maximizes the expected future reward.
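
Given a table of Q-values (the entries below are made up), this policy is just an argmax over the actions available in a state:

```python
# Hypothetical Q-table for one state "s0" with three actions.
Q = {
    ("s0", "left"):  0.1,
    ("s0", "stay"):  0.5,
    ("s0", "right"): 0.9,
}

def pi_star(s, actions, Q):
    """Greedy policy: choose the action with the highest Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

print(pi_star("s0", ["left", "stay", "right"], Q))   # -> 'right'
```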

References

MIT 6.S191: Reinforcement Learning