Glossary¶
In this package we make heavy use of function approximators built from
keras.Model
objects. In Section 1 we list the available types of
function approximators. Each function approximator uses multiple keras models
to support its full functionality; the different types of keras models are
listed in Section 2. Finally, in Section 3 we list the different kinds of
inputs and outputs that our keras models expect.
1. Function approximator types¶
- function approximator
- A function approximator is any object that can be updated.
- body
- The body is what we call the part of the computation graph that may
be shared between e.g. the policy (actor) and the value function (critic).
It is typically the part of a neural net that does most of the heavy
lifting. One may think of the
body()
as an elaborate automatic feature extractor.
- head
- The head is the part of the computation graph that actually generates
the desired output format/shape. As its input, it takes the output of
body. The different heads that the
FunctionApproximator
class provides are:
- head_v
- This is the state value head. It returns a batch of scalar values V.
- head_q1
- This is the type-I Q-value head. It returns a batch of scalar values Q_sa.
- head_q2
- This is the type-II Q-value head. It returns a batch of vectors Q_s.
- head_pi
- This is the policy head. It returns a batch of distribution parameters Z.
- forward_pass
- This is just the consecutive application of head after body.
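The body/head split can be sketched with plain numpy operations standing in for keras layers. Everything below is made up for illustration (the layer sizes, the single relu body layer, the purely linear heads); it only shows how one shared body feeds several heads with different output shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, hidden, num_actions, batch_size = 8, 16, 4, 32

# body: shared feature extractor (one hypothetical dense + relu layer).
W_body = rng.normal(size=(num_features, hidden))

def body(S):
    return np.maximum(S @ W_body, 0.0)  # relu features

# heads: each maps the body's features to its own output format.
W_v = rng.normal(size=(hidden, 1))             # head_v  -> scalar values V
W_q2 = rng.normal(size=(hidden, num_actions))  # head_q2 -> vectors Q_s
W_pi = rng.normal(size=(hidden, num_actions))  # head_pi -> dist. params Z

# forward_pass = head after body; all three heads share the same body.
S = rng.normal(size=(batch_size, num_features))
X = body(S)                 # the heavy lifting happens once
V = (X @ W_v)[:, 0]         # shape (32,)
Q_s = X @ W_q2              # shape (32, 4)
Z = X @ W_pi                # shape (32, 4)
print(V.shape, Q_s.shape, Z.shape)  # (32,) (32, 4) (32, 4)
```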
In this package we have four distinct types of function approximators:
- state value function
- State value functions \(v(s)\) are implemented by
V
.
- type-I state-action value function
- This is the standard state-action value function \(q(s,a)\). It models
the Q-function as
\[(s, a) \mapsto q(s,a)\ \in\ \mathbb{R}\]
This function approximator is implemented by
QTypeI
.
- type-II state-action value function
- This type of state-action value function differs from type-I in that it
models the Q-function as
\[s \mapsto q(s,.)\ \in\ \mathbb{R}^n\]
where \(n\) is the number of actions. The type-II Q-function is
implemented by
QTypeII
.
- updateable policy
- This function approximator represents a policy directly. It is
implemented by e.g.
SoftmaxPolicy
.
- actor-critic
- This is a special function approximator that allows for the sharing of
parts of the computation graph between a value function (critic) and a
policy (actor).
Note
At the moment, type-II Q-functions and updateable policies are only
implemented for environments with a Discrete
action space.
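To make the type-I vs. type-II distinction concrete, here is a small numpy sketch (all sizes and values are made up): a type-II function produces the full vector \(q(s,.)\) per state, from which the type-I-style scalars \(q(s,a)\) for the actions actually taken can be gathered.

```python
import numpy as np

rng = np.random.default_rng(1)
batch_size, num_actions = 5, 3

# Type-II output: one vector q(s, .) per state in the batch.
Q_s = rng.normal(size=(batch_size, num_actions))  # shape [batch_size, num_actions]

# Type-I output: one scalar q(s, a) per (state, action) pair.
A = rng.integers(num_actions, size=batch_size)    # actions taken
Q_sa = Q_s[np.arange(batch_size), A]              # shape [batch_size]

print(Q_s.shape, Q_sa.shape)  # (5, 3) (5,)
```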
2. Keras model types¶
Each function approximator takes multiple keras.Model
objects. The different models are named according to the role they play in
the function approximator object:
- train_model
- This
keras.Model
is used for training. - predict_model
- This
keras.Model
is used for predicting. - target_model
- This
keras.Model
is a kind of shadow copy of predict_model that is used in off-policy methods. For instance, in DQN we use it for reducing the variance of the bootstrapped target by synchronizing with predict_model only periodically.
Note
The specific inputs depend on the type of function approximator you're
using; they are listed in each individual class's documentation.
3. Keras model inputs/outputs¶
Each keras.Model
object expects specific inputs and outputs. These are
provided in each individual function approximator’s docs.
Below we list the different available arrays that we might use as inputs/outputs to our keras models.
- S
- A batch of (preprocessed) state observations. The shape is
[batch_size, ...]
where the ellipses might be any number of dimensions. - A
- A batch of actions taken, with shape
[batch_size]
- P
- A batch of distribution parameters that allow us to construct action
propensities according to the behavior/target policy \(b(a|s)\).
For instance, the parameters of a
keras_gym.SoftmaxPolicy
(for discrete action spaces) are those of a categorical distribution. On
the other hand, for continuous action spaces we use a
keras_gym.GaussianPolicy
, whose parameters are those of the underlying normal distribution.
- Similar to P, this is a batch of distribution parameters. In contrast to P, however, Z represents the primary updateable policy \(\pi_\theta(a|s)\) instead of the behavior/target policy \(b(a|s)\).
- G
- A batch of (\(\gamma\)-discounted) returns, shape:
[batch_size]
- Rn
- A batch of partial (\(\gamma\)-discounted) returns. For instance, in
n-step bootstrapping these are given by:
\[R^{(n)}_t\ =\ R_t + \gamma\,R_{t+1} + \dots + \gamma^{n-1}\,R_{t+n-1}\]
In other words, it's the part of the n-step return without the
bootstrapping term. The shape is
[batch_size]
.
- In
- A batch of bootstrap factors. For instance, in n-step bootstrapping
these are given by \(I^{(n)}_t=\gamma^n\) when bootstrapping and
\(I^{(n)}_t=0\) otherwise. It is used in bootstrapped updates. For
instance, the n-step bootstrapped target makes use of it as follows:
\[G^{(n)}_t\ =\ R^{(n)}_t + I^{(n)}_t\,Q(S_{t+n}, A_{t+n})\]
The shape is
[batch_size]
.
- A batch of (preprocessed) next-state observations. This is typically
used in bootstrapping (see In). The shape is
[batch_size, ...]
where the ellipses might be any number of dimensions. - A_next
- A batch of next-actions to be taken. These can be actions that were
actually taken (on-policy), but they can also be any other would-be
next-actions (off-policy). The shape is
[batch_size]
. - P_next
- A batch of action propensities according to the policy \(\pi(a|s)\).
- V
- A batch of V-values \(v(s)\) of shape
[batch_size]
. - Q_sa
- A batch of Q-values \(q(s,a)\) of shape
[batch_size]
. - Q_s
- A batch of Q-values \(q(s,.)\) of shape
[batch_size, num_actions]
. - Adv
- A batch of advantages \(\mathcal{A}(s,a) = q(s,a) - v(s)\), which
has shape:
[batch_size]
.
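For a discrete action space, the distribution parameters in P or Z are categorical logits; turning them into action propensities is a softmax. A numpy sketch (the shapes and logit values are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
batch_size, num_actions = 4, 3

# Z: logits of a categorical distribution, shape [batch_size, num_actions].
Z = rng.normal(size=(batch_size, num_actions))

# Numerically stable softmax: subtract the row-wise max before exponentiating.
Z_shifted = Z - Z.max(axis=1, keepdims=True)
P = np.exp(Z_shifted) / np.exp(Z_shifted).sum(axis=1, keepdims=True)

print(P.shape, np.allclose(P.sum(axis=1), 1.0))  # (4, 3) True
```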
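The relations between Rn, In, and the bootstrapped target can be checked numerically for a single transition; the rewards, discount, and bootstrap Q-value below are made up:

```python
import numpy as np

gamma, n = 0.9, 3
rewards = np.array([1.0, 0.0, 2.0])  # hypothetical R_t, R_{t+1}, R_{t+2}

# Partial n-step return: Rn = R_t + gamma R_{t+1} + ... + gamma^{n-1} R_{t+n-1}
Rn = np.sum(gamma ** np.arange(n) * rewards)

# Bootstrap factor: gamma^n when bootstrapping, 0 at episode end.
done = False
In = 0.0 if done else gamma ** n

# Bootstrapped target, using the Q-value n steps ahead.
Q_boot = 5.0  # hypothetical Q-value of the bootstrap state-action pair
Gn = Rn + In * Q_boot

print(round(Rn, 3), round(In, 3), round(Gn, 3))  # 2.62 0.729 6.265
```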
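The Adv entry is just element-wise arithmetic on the V and Q_sa arrays; a minimal sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(2)
batch_size = 4

V = rng.normal(size=batch_size)     # state values v(s),        shape [batch_size]
Q_sa = rng.normal(size=batch_size)  # action values q(s, a),    shape [batch_size]

# Advantage: how much better taking action a is than the state's baseline.
Adv = Q_sa - V                      # shape [batch_size]
print(Adv.shape)  # (4,)
```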