Pendulum with PPO¶
In this notebook we solve the Pendulum-v0 environment using a TD actor-critic algorithm with PPO policy updates.
We use simple multi-layer perceptrons as function approximators for the state value function \(v(s)\) and the policy \(\pi(a|s)\), the latter implemented as a GaussianPolicy.
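To make the "PPO policy updates" concrete, here is a minimal NumPy sketch of the clipped surrogate objective at the heart of PPO. This is illustrative only, not the notebook's actual implementation; the function name and signature are our own.

```python
import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, advantages, epsilon=0.2):
    """Clipped surrogate objective used in PPO policy updates (sketch).

    The probability ratio r = pi_new(a|s) / pi_old(a|s) is clipped to
    [1 - epsilon, 1 + epsilon], which removes the incentive to push the
    policy too far from the behavior policy in a single update.
    """
    ratio = np.exp(log_pi_new - log_pi_old)          # r = pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # pessimistic (element-wise minimum) bound, averaged over the batch;
    # this quantity is *maximized* by the policy update
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies coincide the ratio is 1, so the objective reduces to the mean advantage; once the ratio leaves \([1-\epsilon, 1+\epsilon]\), clipping caps the objective so gradient steps stop pushing it further.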
This algorithm is slow to converge (if it converges at all). You should start to see improvement in the average return after about 150k timesteps. Below you'll see a particularly successful episode:
To view the notebook in a new tab, click here. To interact with the notebook in Google Colab, hit the “Open in Colab” button below.