
Why can RL learn from a discrete reward?

One reason is that RL optimizes the expectation of the reward, i.e., reward (r) weighted by probability (p). Because p is a continuous function of the policy parameters, it smooths out the discrete r.
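In symbols, for a single binary outcome (a sketch, with p_θ denoting the success probability under policy parameters θ):

E_θ[r] = p_θ · (+1) + (1 − p_θ) · (−1) = 2p_θ − 1

The right-hand side varies smoothly with p_θ even though r itself only takes the two values ±1.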

For example, in the driving scenario below, r = +1/−1, but p might be 0.899 (if you drive 1,000 times, you fall 101 times). The expectation is then 0.899 · (+1) + 0.101 · (−1) = 0.798: a smoothed value rather than a hard ±1.

[Figure: slide with the driving example]

The picture above comes from the Deep RL course taught by Sergey Levine.
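To make the smoothing concrete, here is a minimal sketch in Python (the sigmoid parameterization p_success and the specific reward values are my assumptions, not from the course slide): the reward stays discrete, but the expected reward changes smoothly with the policy parameter, so gradient-based learning has something to follow.

import numpy as np

# A policy with one parameter theta controls the probability of a
# successful drive. The sigmoid form here is a hypothetical choice.
def p_success(theta):
    return 1.0 / (1.0 + np.exp(-theta))  # continuous in theta

def expected_reward(theta):
    p = p_success(theta)
    # Discrete reward: +1 for staying on the road, -1 for falling off.
    return p * (+1.0) + (1.0 - p) * (-1.0)  # = 2p - 1, smooth in theta

# r only ever takes the values +1/-1, yet E[r] varies smoothly with theta.
for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}  p={p_success(theta):.3f}  E[r]={expected_reward(theta):+.3f}")

Running this shows E[r] moving continuously from about −0.76 at theta = −2 to +0.76 at theta = +2, with no jumps despite the ±1 reward.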
