As explained above, Policy Gradient (PG) methods are algorithms that aim to learn the optimal policy function directly in a Markov Decision Process setting (S, A, P, R, γ). In PG, the policy π is represented by a parametric function (e.g., a neural network), so we can control its outputs by changing its parameters. The policy π maps states to actions (or to probability distributions over actions).
The goal of PG is to obtain a policy that maximizes the expected cumulative reward over a trajectory of states and actions (a.k.a. the return). Let us go through how it does so.
Derivation
Consider a policy function π, parameterized by a parameter set θ. Given a state as input, the policy outputs a probability distribution over actions for that state. Knowing this, we first need to define an objective/goal with respect to which we will optimize the policy function's parameters.
Note that this policy function could be approximated with any function approximation technique. In the case of Deep RL, however, we consider it to be approximated by a neural network (with parameter set θ) that takes states (observations) as input and outputs a distribution over actions: a discrete distribution for discrete action spaces or a continuous distribution for continuous ones.
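As an illustration, here is a minimal sketch of such a policy network for a discrete action space. It assumes PyTorch; the class name, layer sizes, and the obs_dim/n_actions arguments are placeholders rather than anything from the original article:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps an observation to a probability distribution over discrete actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Small MLP producing unnormalized action scores (logits).
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        # The Categorical object is the "distribution over actions".
        return Categorical(logits=self.net(obs))

# Usage: sample an action and keep its log-probability for the gradient later.
# policy = PolicyNetwork(obs_dim=4, n_actions=2)
# dist = policy(torch.as_tensor(obs, dtype=torch.float32))
# action = dist.sample()
# log_prob = dist.log_prob(action)
```

For continuous actions, the last layer would instead output the parameters of a continuous distribution, e.g., the mean and standard deviation of a Gaussian.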
In general, the agent's goal is to obtain the maximum cumulative reward over a trajectory of interactions (state-action sequences). So, let τ denote such a trajectory, i.e., a state-action sequence s₀, a₀, s₁, a₁, …, sₕ, aₕ, and let R(τ) denote the cumulative reward over this trajectory, a.k.a. the return.
This trajectory τ is a random variable, because both the policy and the environment are stochastic, and this makes R(τ) a stochastic quantity as well. As we cannot maximize this stochastic quantity directly, we instead maximize its expectation, i.e., the return obtained on average while taking actions with policy π: E[ R(τ); π ].
As a refresher on probability, this is how to calculate the expectation of a function of a discrete random variable:
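For a discrete random variable X with probability mass function P and a function f:

$$\mathbb{E}\big[f(X)\big] \;=\; \sum_{x} P(X = x)\, f(x)$$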
So the final objective function that we want to maximize is rewritten as follows. From now on, we will call this objective function the Utility function (U):
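Writing the expectation as a sum over all possible trajectories:

$$U(\theta) \;=\; \mathbb{E}\big[R(\tau);\,\theta\big] \;=\; \sum_{\tau} P(\tau;\theta)\, R(\tau)$$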
where P(τ; θ) is the probability of trajectory τ taking place under policy parameters θ.
Note that this utility function is written as a function of θ, the set of policy parameters, because it is only the policy that controls which trajectories occur (the environment dynamics are fixed and usually unknown), and it is θ that we want to change in order to optimize the utility. Also note that the reward sequence R over the trajectory does not depend on the policy, and hence not on the parameters θ: it is determined by the environment and is only an experienced scalar value.
Optimization
Now that we have the Utility function U(θ), we would want to optimize it using a stochastic gradient-based optimization algorithm, namely Stochastic Gradient Ascent:
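With a learning rate α, each ascent step updates the parameters in the direction of the gradient:

$$\theta \;\leftarrow\; \theta + \alpha\, \nabla_{\theta} U(\theta)$$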
To be able to perform the parameter update, we have to compute the gradient of the utility with respect to θ (∇U). Attempting to do so, we get:
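Expanding U(θ) as the sum over trajectories:

$$\nabla_{\theta} U(\theta) \;=\; \nabla_{\theta} \sum_{\tau} P(\tau;\theta)\, R(\tau) \;=\; \sum_{\tau} \nabla_{\theta} P(\tau;\theta)\, R(\tau)$$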
Note that the gradient of the summation is equal to the sum of the gradients. We now apply a trick: multiply this expression by P(τ; θ)/P(τ; θ), which equals 1, and recall from calculus that ∇f(x)/f(x) = ∇log f(x). This gives:
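Carrying out these two steps:

$$\nabla_{\theta} U(\theta)
\;=\; \sum_{\tau} P(\tau;\theta)\, \frac{\nabla_{\theta} P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)
\;=\; \sum_{\tau} P(\tau;\theta)\, \nabla_{\theta} \log P(\tau;\theta)\, R(\tau)
\;=\; \mathbb{E}\big[\nabla_{\theta} \log P(\tau;\theta)\, R(\tau)\big]$$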
We do this trick to create the general form of an expectation, Σ P(x) f(x), inside the gradient, so that we can later estimate this expectation with an average over samples of real interaction data.
As we cannot sum over ALL possible trajectories, in this stochastic setting we estimate ∇U(θ) by generating m random trajectories using the current policy parameters θ and averaging the expression inside the expectation over these m trajectories:
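With sampled trajectories τ⁽¹⁾, …, τ⁽ᵐ⁾, the estimate is:

$$\nabla_{\theta} U(\theta) \;\approx\; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log P\big(\tau^{(i)};\theta\big)\, R\big(\tau^{(i)}\big)$$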
So now we have a way of estimating the gradient from trajectory samples, but there is an issue here. This gradient changes the parameters θ in the direction that most increases the probability of trajectories with higher return R(τ). The disadvantage is that the direction of change in a trajectory's probability depends heavily on how the reward function R is designed. For example, if all transitions result in a positive reward, the probability of every trajectory is increased and no trajectory is penalized, and vice versa for cases where R is always negative.
To address this issue, it was proposed to subtract a baseline value b [1] from the return (i.e., change R(τ) to R(τ) − b), where this baseline b should estimate the average return. This centers the returns around zero, so that the probabilities of trajectories that performed better than average are increased and those that performed worse are decreased. There are multiple proposed ways of calculating the baseline b, among which using a neural network is the one we will use later.
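The baseline-adjusted estimator then reads:

$$\nabla_{\theta} U(\theta) \;\approx\; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log P\big(\tau^{(i)};\theta\big)\, \Big(R\big(\tau^{(i)}\big) - b\Big)$$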
Further, we can decompose the trajectory probability P(τ; θ) over timesteps and write it as a product of per-timestep probabilities across all H steps. Since the environment dynamics do NOT depend on the parameters θ, their terms vanish from the gradient with respect to θ, and the gradient of the trajectory's log-probability can be rewritten over timesteps, based solely on the policy function π:
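Concretely, only the policy factors depend on θ:

$$P(\tau;\theta) \;=\; P(s_0)\, \prod_{t=0}^{H} \pi_{\theta}(a_t \mid s_t)\, \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, a_t)
\quad\Longrightarrow\quad
\nabla_{\theta} \log P(\tau;\theta) \;=\; \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)$$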
Also, we can decompose the return R(τ) into a sum of individual timestep rewards Σ rₜ. At each timestep t, we can disregard the rewards from the previous timesteps 0, …, t−1, since they are not affected by the current action aₜ (removing terms that do not depend on the current action lowers variance). Together with the baseline, this gives the following temporal format:
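For a single trajectory, each action's log-probability gradient is weighted only by the rewards that follow it, minus the baseline:

$$\sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left( \sum_{k=t}^{H} r_k \;-\; b(s_t) \right)$$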
So the gradient estimate ‘g’ will be:
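Averaging this temporal form over the m sampled trajectories:

$$\hat{g} \;=\; \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}\big(a_t^{(i)} \mid s_t^{(i)}\big) \left( \sum_{k=t}^{H} r_k^{(i)} \;-\; b\big(s_t^{(i)}\big) \right)$$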
Finally, a proper choice for the baseline b(sₜ) is the expected return from state sₜ onward, which is exactly the Value function of this state, V(sₜ). We will use another neural network, with parameter set ϕ, to approximate this value function from interaction data, regressing it toward either a Monte Carlo return or a Temporal Difference (Bellman) target. The final form is as follows:
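That is, with the learned value network V_ϕ used as the baseline:

$$\hat{g} \;=\; \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H} \nabla_{\theta} \log \pi_{\theta}\big(a_t^{(i)} \mid s_t^{(i)}\big) \left( \sum_{k=t}^{H} r_k^{(i)} \;-\; V_{\phi}\big(s_t^{(i)}\big) \right)$$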
Moreover, it is good to know that the difference between the return from state sₜ onward and the baseline b(sₜ) is called the Advantage function A. In some implementations, since this return estimates the state-action value, also known as the Q function Q(sₜ, aₜ), the advantage is written as A(sₜ, aₜ) = Q(sₜ, aₜ) − V(sₜ), where both Q and V can be approximated with neural networks, possibly even with shared weights.
It is worth mentioning that when we incorporate value estimation into policy gradient methods, the result is also called an Actor-Critic method, where the actor is the policy network and the critic is the value estimator network. These methods are heavily studied nowadays due to their strong performance across various benchmarks and tasks.
The Algorithm
Based on the derived gradient estimator g, the value function approximation network, and the gradient ascent update rule, the overall algorithm to train an agent with the Vanilla (simple) Policy Gradient algorithm is as follows:
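A minimal sketch of this loop, assuming PyTorch, a Gymnasium-style environment with a discrete action space, the PolicyNetwork sketched earlier, and a small value network value_fn mapping states to scalar value estimates; all names and hyperparameters here are illustrative:

```python
import torch
import torch.nn as nn

def reward_to_go(rewards, gamma=0.99):
    # Discounted sum of rewards from each timestep t to the end of the episode.
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def train(env, policy, value_fn, iterations=100, episodes_per_batch=8,
          pi_lr=3e-3, vf_lr=1e-2, gamma=0.99):
    pi_opt = torch.optim.Adam(policy.parameters(), lr=pi_lr)
    vf_opt = torch.optim.Adam(value_fn.parameters(), lr=vf_lr)

    for _ in range(iterations):
        obs_buf, act_buf, ret_buf = [], [], []

        # 1) Collect m trajectories with the current policy (on-policy data).
        for _ in range(episodes_per_batch):
            obs, _ = env.reset()
            rewards, done = [], False
            while not done:
                obs_t = torch.as_tensor(obs, dtype=torch.float32)
                action = policy(obs_t).sample()
                next_obs, reward, terminated, truncated, _ = env.step(action.item())
                obs_buf.append(obs_t)
                act_buf.append(action)
                rewards.append(float(reward))
                obs, done = next_obs, terminated or truncated
            ret_buf.extend(reward_to_go(rewards, gamma))

        obs_b = torch.stack(obs_buf)
        act_b = torch.stack(act_buf)
        ret_b = torch.as_tensor(ret_buf, dtype=torch.float32)

        # 2) Advantage estimate: reward-to-go minus the value baseline V_phi(s_t).
        with torch.no_grad():
            adv_b = ret_b - value_fn(obs_b).squeeze(-1)

        # 3) Actor update: gradient ASCENT on E[log pi * advantage],
        #    implemented as gradient descent on its negative.
        pi_loss = -(policy(obs_b).log_prob(act_b) * adv_b).mean()
        pi_opt.zero_grad()
        pi_loss.backward()
        pi_opt.step()

        # 4) Critic update: regress V_phi toward the observed (Monte Carlo) returns.
        vf_loss = nn.functional.mse_loss(value_fn(obs_b).squeeze(-1), ret_b)
        vf_opt.zero_grad()
        vf_loss.backward()
        vf_opt.step()
```

The policy loss is simply the negative of the estimated objective, so that a standard gradient descent optimizer performs gradient ascent on U(θ).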
Note that in this algorithm, a number of state-action sequences (trajectories) are first collected, and then the updates are carried out. The policy network is updated in an on-policy (online) manner, whereas the value function can be updated off-policy (offline) from batch-sampled data using gradient descent.