We can gain cool insights by knowing types of RL algorithm:
Policy gradient: Learn by directly adjusting policy to maximize the reward. E.g. Shoot the basketball. If you did not make it (low reward), then you’ll adjust how you shoot the ball.
Value-based: Learn by core values. E.g. we know it is good to be optimistic. So we can still trust ourselves in difficult situations because we have this value function to help us know what to do.
Actor-Critic: Learn by having teachers to teach us how to do well. E.g. when we grow up, our parents teach us to do some things (e.g. to treat others well) and not to do some things (e.g. to bully others). We are actors. Parents are our critics. When being criticized, we might adjust our policy/value function/model.
Model-based: Learn the model of our world. E.g. We know if we release our cup in the air, it will fall to the ground due to gravity on Earth.
It is not difficult to know we actually use all these methods to learn.
The pictures above come from Deep RL course in UCB.