2024-07-12
- easy-rl PDF version notes, organized: P5, P10 - P12
- joyrl: comparison and supplement, P11 - P13
- OpenAI Spinning Up documentation ⭐ https://spinningup.openai.com/en/latest/index.html
Download the latest version PDF
Address: https://github.com/datawhalechina/easy-rl/releases
Domestic mirror (recommended for readers in China):
Link: https://pan.baidu.com/s/1isqQnpVRWbb3yh83Vs0kbw Extraction code: us6a
easy-rl online version link (for copying code)
Reference link 2: https://datawhalechina.github.io/joyrl-book/
other:
【Correction record link】
——————
5. Basics of Deep Reinforcement Learning⭐️
Open source content: https://linklearner.com/learn/summary/11
——————————
On-policy: the agent being learned and the agent interacting with the environment are the same.
Off-policy: the agent being learned and the agent interacting with the environment are different.
Policy gradient: sampling data is time-consuming.
On-policy $\overset{\text{importance sampling}}{\Longrightarrow}$ off-policy
PPO: keeps the two distributions from differing too much; it is an on-policy algorithm.
1. Original optimization term: $J(\theta, \theta')$
2. Constraint: the KL divergence between the action distributions output by $\theta$ and $\theta'$ (the more similar $\theta$ and $\theta'$ are, the better).
PPO has a predecessor: trust region policy optimization (TRPO)
TRPO is difficult to work with because it treats the KL divergence as an additional constraint rather than putting it into the objective function, which makes it hard to optimize. Therefore we generally use PPO instead of TRPO: their performance is similar, but PPO is much easier to implement.
KL divergence here measures the distance between the policies' action distributions, i.e. the distance between the probability distributions over actions, not the distance between parameters.
There are two main variants of the PPO algorithm: Proximal Policy Optimization Penalty (PPO-penalty) and Proximal Policy Optimization Clipping (PPO-clip).
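For concreteness, here is a minimal sketch (my own, not from the notes) of what the two surrogate losses look like in PyTorch; `logp_new`, `logp_old`, `adv`, and `kl` are hypothetical tensors computed elsewhere from sampled state-action pairs.

```python
# Minimal sketch of the two PPO surrogate objectives (not a full training loop).
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # Importance-sampling ratio pi_theta(a|s) / pi_theta_k(a|s)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Maximizing the surrogate objective == minimizing its negation
    return -torch.min(ratio * adv, clipped).mean()

def ppo_penalty_loss(logp_new, logp_old, adv, kl, beta=1.0):
    ratio = torch.exp(logp_new - logp_old)
    # KL penalty keeps the new policy close to the old one; beta is adapted in practice
    return -(ratio * adv).mean() + beta * kl.mean()
```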
——————————
P10 Sparse Reward Problem
1. Reward shaping: design rewards manually; domain knowledge is required.
How about distributing the final reward across the relevant intermediate actions?
2. Curiosity
Intrinsic curiosity module (ICM)
Network 1 (forward model):
Input: $a_t, s_t$
Output: $\hat s_{t+1}$, the network's prediction of the next state.
The more the predicted $\hat s_{t+1}$ differs from the true $s_{t+1}$, the larger the intrinsic reward $r_t^i$.
$r_t^i$: the harder the future state is to predict, the larger the reward for the action. This encourages risk-taking and exploration.
Feature extractor
Network 2 (inverse model):
Input: the feature vectors $\phi(s_t)$ and $\phi(s_{t+1})$
Output: the predicted action $\hat a$; the closer it is to the real action, the better.
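A minimal ICM sketch in PyTorch may help tie the pieces together; the module name, layer sizes, and reward scaling below are illustrative assumptions, not from the original notes.

```python
# Minimal sketch of an Intrinsic Curiosity Module (ICM), assuming PyTorch.
import torch
import torch.nn as nn

class ICM(nn.Module):
    def __init__(self, obs_dim, act_dim, feat_dim=64):
        super().__init__()
        # Feature extractor: s_t -> phi(s_t)
        self.phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        # Forward model (network 1): (phi(s_t), a_t) -> predicted phi(s_{t+1})
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + act_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        # Inverse model (network 2): (phi(s_t), phi(s_{t+1})) -> predicted a_t
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, act_dim))

    def intrinsic_reward(self, s_t, a_t, s_next):
        f_t, f_next = self.phi(s_t), self.phi(s_next)
        pred_next = self.forward_model(torch.cat([f_t, a_t], dim=-1))
        # The harder the next state is to predict, the larger the intrinsic reward r_t^i
        return 0.5 * (pred_next - f_next).pow(2).sum(dim=-1)

    def inverse_prediction(self, s_t, s_next):
        f_t, f_next = self.phi(s_t), self.phi(s_next)
        # Trained to match the real action, so phi keeps only action-relevant features
        return self.inverse_model(torch.cat([f_t, f_next], dim=-1))
```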
3. Curriculum learning
Easy -> Difficult
Reverse curriculum learning:
Starting from the final most ideal state (which we call the gold state), go toFind the state closest to the golden stateAs the "ideal" state that we want the intelligent agent to reach at certain stages. Of course, we will intentionally remove some extreme states in the process, that is, states that are too simple or too difficult.
4. Hierarchical reinforcement learning (HRL)
The agent's policy is split into a high-level policy and a low-level policy; the high-level policy decides, based on the current state, how the low-level policy is executed.
————————
P11 Imitation Learning
Used when the reward is not clearly defined.
Imitation learning (IL)
Learning from demonstration
Apprenticeship learning
Learning by watching
There are clear rewards: board games, video games
No clear reward: Chatbots
Collect expert demonstrations: recordings of human driving, human conversations
Conversely: what reward function would make the experts take these actions?
Inverse reinforcement learning first infers the reward function; once the reward function is found, reinforcement learning is used to find the optimal actor.
Third-person imitation learning.
————————
P12 Deep deterministic policy gradient (DDPG)
Uses experience replay.
Ablation experiments (the control-variable method) analyze the impact of each component on the final result.
joyrl:
When a deterministic policy and a continuous action space are required, this class of algorithms is a relatively stable baseline.
DQN for continuous action spaces: deep deterministic policy gradient (DDPG).
The experience replay mechanism can reduce the correlation between samples, improve the effective utilization of samples, and increase the stability of training.
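A minimal replay buffer sketch (plain Python; the class name and capacity are illustrative) showing how uniform random sampling breaks the correlation between consecutive samples:

```python
# Minimal experience replay buffer sketch.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation between samples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```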
Shortcomings:
1. Cannot be used in discrete action spaces.
2. Highly dependent on hyperparameters.
3. Highly sensitive to initial conditions, which affects the convergence and performance of the algorithm.
4. Easily falls into local optima.
The advantage of soft updates is that they are smoother and slower, which avoids oscillations caused by overly abrupt weight updates and reduces the risk of training divergence.
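A minimal sketch of a soft (Polyak) target update, assuming PyTorch modules `target_net` and `net`; `tau` is the target update rate:

```python
# Soft target-network update: theta' <- tau * theta + (1 - tau) * theta'
import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.data.mul_(1.0 - tau)
        p_targ.data.add_(tau * p.data)
```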
Twin delayed deep deterministic policy gradient (TD3)
Three improvements: twin Q networks, delayed updates, noise regularization.
Twin Q networks: two Q networks; take the smaller of the two Q values. This mitigates the overestimation of Q values and improves the stability and convergence of the algorithm.
Delayed updates: make the actor update less frequently than the critic.
The noise acts more like a form of regularization, making the value-function updates smoother.
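As the sketch below shows (assuming PyTorch and hypothetical target networks `actor_targ`, `q1_targ`, `q2_targ`), all three tricks meet in the computation of the critic target:

```python
# Sketch of the TD3 critic target: clipped double-Q plus target policy smoothing noise.
# The delayed update means the actor and target networks are refreshed only every d critic steps.
import torch

def td3_target(r, s_next, done, actor_targ, q1_targ, q2_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        # Noise regularization: clipped Gaussian noise on the target action
        a_next = actor_targ(s_next)
        noise = torch.clamp(sigma * torch.randn_like(a_next), -noise_clip, noise_clip)
        a_next = torch.clamp(a_next + noise, -act_limit, act_limit)
        # Clipped double-Q: take the smaller target Q value to curb overestimation
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        return r + gamma * (1.0 - done) * q_next
```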
OpenAI Gym Library_Pendulum_TD3
OpenAI's documentation interface link for TD3
PPO is the most commonly used algorithm in reinforcement learning.
Works for both discrete and continuous action spaces.
Fast and stable, with easy parameter tuning.
A good baseline algorithm.
PPO
In practice, the clip constraint is generally used because it is simpler, less computationally expensive, and produces better results.
Off-policy algorithms can reuse historical experience; they generally use experience replay to store and reuse past experience, so their data utilization is high.
PPO is an on-policy algorithm
——————————————————
OpenAI Documentation
Paper arXiv interface link: Proximal Policy Optimization Algorithms
PPO: an on-policy algorithm, applicable to discrete or continuous action spaces. May converge to a local optimum.
The motivation for PPO is the same as for TRPO: how can we take the biggest possible policy-improvement step using the data we currently have, without stepping so far that performance accidentally collapses?
TRPO tries to solve this problem with a complicated second-order method, while PPO is a first-order method that uses some other tricks to keep the new policy close to the old one.
The PPO method is much simpler to implement and has been empirically shown to perform at least as well as TRPO.
There are two main variations of PPO: PPO-Penalty and PPO-Clip.
Algorithm: PPO-Clip
1: Input: initial policy parameters $\theta_0$, initial value-function parameters $\phi_0$
2: for $k = 0, 1, 2, \dots$ do:
3:   Collect a set of trajectories $\mathcal{D}_k = \{\tau_i\}$ by running the policy $\pi_k = \pi(\theta_k)$ in the environment
4:   Compute the rewards-to-go $\hat R_t$   ▢ How is $\hat R_t$ computed?
5:   Compute advantage estimates $\hat A_t$ based on the current value function $V_{\phi_k}$ (using any advantage estimation method)   ▢ What advantage estimation methods are currently available?
6:   Update the policy by maximizing the PPO-Clip objective:
   $$\theta_{k+1} = \arg\max_\theta \frac{1}{|\mathcal{D}_k|T} \sum_{\tau\in\mathcal{D}_k} \sum_{t=0}^{T} \min\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} A^{\pi_{\theta_k}}(s_t,a_t),\ g\big(\epsilon, A^{\pi_{\theta_k}}(s_t,a_t)\big)\Big)$$
   ▢ How is this policy update formula obtained?
   $\pi_{\theta_k}$: the policy parameters before the update; importance sampling, with samples drawn from the old policy.
   Typically via stochastic gradient ascent with Adam.
7:   Fit the value function by regression on the mean-squared error:
   $$\phi_{k+1} = \arg\min_\phi \frac{1}{|\mathcal{D}_k|T} \sum_{\tau\in\mathcal{D}_k} \sum_{t=0}^{T} \Big(V_\phi(s_t) - \hat R_t\Big)^2$$
   typically via gradient descent.
8: end for
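Regarding the ▢ about $\hat R_t$ in step 4: a common convention is the discounted reward-to-go computed backwards over a finite trajectory. A minimal sketch of that convention (plain Python, single trajectory, my own assumption rather than the notes' definition):

```python
# Rewards-to-go: R̂_t = r_t + gamma * R̂_{t+1}, computed backwards over one trajectory.
def rewards_to_go(rewards, gamma=0.99):
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards_to_go([1, 1, 1], gamma=0.5) -> [1.75, 1.5, 1.0]
```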
$$g(\epsilon, A) = \begin{cases}(1+\epsilon)A & A \ge 0\\ (1-\epsilon)A & A < 0\end{cases}$$
In the paper, the advantage estimate is:
$$\hat A_t = -V(s_t) + \underbrace{r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)}_{\hat R_t ???}$$
Let
$$\Delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
so that
$$r_t = \Delta_t - \gamma V(s_{t+1}) + V(s_t)$$
Substituting into the expression for $\hat A_t$:
$$\begin{aligned}\hat A_t &= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)\\ &= -V(s_t) + \big(\Delta_t - \gamma V(s_{t+1}) + V(s_t)\big) + \gamma\big(\Delta_{t+1} - \gamma V(s_{t+2}) + V(s_{t+1})\big) + \gamma^2\big(\Delta_{t+2} - \gamma V(s_{t+3}) + V(s_{t+2})\big) + \cdots + \gamma^{T-t+1}\big(\Delta_{T-1} - \gamma V(s_T) + V(s_{T-1})\big) + \gamma^{T-t} V(s_T)\\ &= \Delta_t + \gamma\Delta_{t+1} + \gamma^2\Delta_{t+2} + \cdots + \gamma^{T-t+1}\Delta_{T-1}\end{aligned}$$
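A quick numeric check of the telescoping identity above; this sketch uses the conventional exponents $\gamma^{k-t}$ on the reward terms (plain Python, illustrative values):

```python
# Check that the direct advantage definition and the delta (TD-error) recursion agree.
def advantages_direct(rewards, values, gamma):
    # Â_t = -V(s_t) + sum_{k=t}^{T-1} gamma^(k-t) r_k + gamma^(T-t) V(s_T); len(values) == T + 1
    T = len(rewards)
    adv = []
    for t in range(T):
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        adv.append(-values[t] + ret + gamma ** (T - t) * values[T])
    return adv

def advantages_via_delta(rewards, values, gamma):
    # Â_t = Δ_t + gamma Δ_{t+1} + ..., with Δ_t = r_t + gamma V(s_{t+1}) - V(s_t)
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    adv, running = [0.0] * T, 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * running
        adv[t] = running
    return adv

rewards, values, gamma = [1.0, 0.5, 2.0], [0.3, 0.1, 0.4, 0.2], 0.9
assert all(abs(a - b) < 1e-9 for a, b in zip(
    advantages_direct(rewards, values, gamma),
    advantages_via_delta(rewards, values, gamma)))
```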
Clipping acts as a regularizer by removing the incentive to make drastic policy changes. The hyperparameter $\epsilon$ controls how far the new policy is allowed to move from the old one.
This kind of clipping may still end up with a new policy that is far from the old one. In the implementation here, we use a particularly simple method: early stopping. If the average KL divergence between the new policy and the old policy exceeds a threshold, we stop taking gradient steps.
Simple derivation of PPO objective function link
The objective function of PPO-Clip is:
$$L^{\rm CLIP}_{\theta_k}(\theta) = \underset{s,a\sim\theta_k}{\mathrm{E}}\Bigg[\min\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)} A^{\theta_k}(s,a),\ \mathrm{clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon\Big) A^{\theta_k}(s,a)\Bigg)\Bigg]$$
where $\theta_k$ are the policy parameters at the $k$-th iteration and $\epsilon$ is a small hyperparameter.
Let $\epsilon \in (0,1)$ and define
$$F(r, A, \epsilon) \doteq \min\Big(rA,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)A\Big)$$
When $A \ge 0$:
$$\begin{aligned}F(r,A,\epsilon) &= \min\Big(rA,\ \mathrm{clip}(r,1-\epsilon,1+\epsilon)A\Big)\\ &= A\min\Big(r,\ \mathrm{clip}(r,1-\epsilon,1+\epsilon)\Big)\\ &= A\min\Bigg(r,\ \left\{\begin{aligned}&1+\epsilon && r \ge 1+\epsilon\\ &r && r \in (1-\epsilon, 1+\epsilon)\\ &1-\epsilon && r \le 1-\epsilon\end{aligned}\right\}\Bigg)\\ &= A\left\{\begin{aligned}&\min(r, 1+\epsilon) && r \ge 1+\epsilon\\ &\min(r, r) && r \in (1-\epsilon, 1+\epsilon)\\ &\min(r, 1-\epsilon) && r \le 1-\epsilon\end{aligned}\right\}\\ &= A\left\{\begin{aligned}&1+\epsilon && r \ge 1+\epsilon\\ &r && r \in (1-\epsilon, 1+\epsilon)\\ &r && r \le 1-\epsilon\end{aligned}\right\} \quad \text{(by the range on the right)}\\ &= A\min(r, 1+\epsilon)\\ &= \min\big(rA,\ (1+\epsilon)A\big)\end{aligned}$$
~
When $A < 0$:
$$\begin{aligned}F(r,A,\epsilon) &= \min\Big(rA,\ \mathrm{clip}(r,1-\epsilon,1+\epsilon)A\Big)\\ &= A\max\Big(r,\ \mathrm{clip}(r,1-\epsilon,1+\epsilon)\Big)\\ &= A\max\Bigg(r,\ \left\{\begin{aligned}&1+\epsilon && r \ge 1+\epsilon\\ &r && r \in (1-\epsilon, 1+\epsilon)\\ &1-\epsilon && r \le 1-\epsilon\end{aligned}\right\}\Bigg)\\ &= A\left\{\begin{aligned}&\max(r, 1+\epsilon) && r \ge 1+\epsilon\\ &\max(r, r) && r \in (1-\epsilon, 1+\epsilon)\\ &\max(r, 1-\epsilon) && r \le 1-\epsilon\end{aligned}\right\}\\ &= A\left\{\begin{aligned}&r && r \ge 1+\epsilon\\ &r && r \in (1-\epsilon, 1+\epsilon)\\ &1-\epsilon && r \le 1-\epsilon\end{aligned}\right\} \quad \text{(by the range on the right)}\\ &= A\max(r, 1-\epsilon)\\ &= \min\big(rA,\ (1-\epsilon)A\big)\end{aligned}$$
~
In summary, we can define $g(\epsilon, A)$ as:
$$g(\epsilon, A) = \begin{cases}(1+\epsilon)A & A \ge 0\\ (1-\epsilon)A & A < 0\end{cases}$$
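A quick NumPy check (my own, not from the notes) that $\min\big(rA,\ \mathrm{clip}(r,1-\epsilon,1+\epsilon)A\big)$ and $\min\big(rA,\ g(\epsilon,A)\big)$ agree, which is why the algorithm box can use $g(\epsilon,A)$ in place of the explicit clip:

```python
# Numeric verification of the case analysis above.
import numpy as np

def clipped_form(r, A, eps):
    return np.minimum(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

def g_form(r, A, eps):
    g = np.where(A >= 0, (1 + eps) * A, (1 - eps) * A)
    return np.minimum(r * A, g)

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 2.0, size=10_000)   # importance-sampling ratios
A = rng.normal(size=10_000)              # advantages
assert np.allclose(clipped_form(r, A, 0.2), g_form(r, A, 0.2))
```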
Why does this definition ensure that the new strategy does not stray too far from the old one?
Importance sampling effectively requires that the new policy $\pi_\theta(a|s)$ and the old policy $\pi_{\theta_k}(a|s)$ not be too far apart as distributions.
1. When the advantage is positive:
$$L(s, a, \theta_k, \theta) = \min\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\ 1+\epsilon\Bigg) A^{\pi_{\theta_k}}(s, a)$$
Advantage function: If a state-action pair is found to have a high reward, increase the weight of the state-action pair.
When the advantage of a state-action pair $(s, a)$ is positive, the objective increases if action $a$ becomes more likely, i.e. if $\pi_\theta(a|s)$ increases. The min in this term limits how much the objective can increase:
once $\pi_\theta(a|s) > (1+\epsilon)\pi_{\theta_k}(a|s)$, the min kicks in and this term is capped at $(1+\epsilon)A^{\pi_{\theta_k}}(s, a)$.
The new policy does not benefit by going far away from the old policy.
2. When the advantage is negative:
$$L(s, a, \theta_k, \theta) = \max\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\ 1-\epsilon\Bigg) A^{\pi_{\theta_k}}(s, a)$$
When the advantage of a state-action pair $(s, a)$ is negative, the objective increases if action $a$ becomes less likely, i.e. if $\pi_\theta(a|s)$ decreases. The max in this term limits how much the objective can increase:
once $\pi_\theta(a|s) < (1-\epsilon)\pi_{\theta_k}(a|s)$, the max kicks in and this term is floored at $(1-\epsilon)A^{\pi_{\theta_k}}(s, a)$.
To reiterate: the new policy does not benefit by going far away from the old policy.
While DDPG can sometimes achieve excellent performance, it is often unstable when it comes to hyperparameters and other types of tuning.
A common DDPG failure mode is that the learned Q-function starts to significantly overestimate the Q-values, which then causes the policy to break because it exploits the error in the Q-function.
Twin Delayed DDPG (TD3) is an algorithm that solves this problem by introducing three key tricks:
1. Clipped double-Q learning.
2. Delayed policy updates.
3. Target policy smoothing.
TD3 is an off-policy algorithm; it can only be used in environments with continuous action spaces.
Algorithm: TD3
Initialize the critic networks $Q_{\theta_1}, Q_{\theta_2}$ and the actor network $\pi_\phi$ with random parameters $\theta_1, \theta_2, \phi$
Initialize the target networks $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi' \leftarrow \phi$
Initialize the replay buffer $\mathcal{B}$
for $t = 1$ to $T$:
  Select an action with exploration noise $a \sim \pi_\phi(s) + \epsilon$, $\epsilon \sim \mathcal{N}(0,\sigma)$; observe the reward $r$ and the new state $s'$
  Store the transition tuple $(s, a, r, s')$ in $\mathcal{B}$
  Sample a mini-batch of $N$ transitions $(s, a, r, s')$ from $\mathcal{B}$
  $\tilde a \leftarrow \pi_{\phi'}(s') + \epsilon$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0,\tilde\sigma), -c, c)$
  $y \leftarrow r + \gamma \min\limits_{i=1,2} Q_{\theta_i'}(s', \tilde a)$
  Update the critics: $\theta_i \leftarrow \arg\min_{\theta_i} N^{-1}\sum\big(y - Q_{\theta_i}(s,a)\big)^2$
  if $t \bmod d$:
    Update $\phi$ by the deterministic policy gradient:
    $$\nabla_\phi J(\phi) = N^{-1}\sum \nabla_a Q_{\theta_1}(s,a)\big|_{a=\pi_\phi(s)} \nabla_\phi \pi_\phi(s)$$
    Update the target networks:
    $$\theta_i' \leftarrow \tau\theta_i + (1-\tau)\theta_i' \qquad (\tau:\ \text{target update rate})$$
    $$\phi' \leftarrow \tau\phi + (1-\tau)\phi'$$
  end if
end for
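The delayed actor update in the loop above amounts to a single gradient ascent step on $Q_{\theta_1}(s, \pi_\phi(s))$; a minimal PyTorch sketch, assuming modules `actor`, `q1` and an optimizer `actor_opt` (all hypothetical names):

```python
# Delayed actor update via the deterministic policy gradient.
import torch

def actor_update(actor, q1, actor_opt, states):
    # Maximize Q1(s, pi(s)) <=> minimize its negation
    actor_loss = -q1(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```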
Maximize the entropy of the policy, thus making the policy more robust.
A deterministic policy always chooses the same action in the same state.
A stochastic policy can choose among several possible actions in the same state.
| | Deterministic policy | Stochastic policy |
|---|---|---|
| Definition | Same state, same action | Same state, possibly different actions |
| Advantages | Stable and reproducible | Avoids getting stuck in local optima; better global search |
| Disadvantages | Lacks exploration; easily exploited by an opponent | May slow down convergence, affecting efficiency and performance |
In practice, when conditions allow, we prefer stochastic-policy algorithms such as A2C and PPO, because they are more flexible, more robust, and more stable.
Maximum-entropy reinforcement learning argues that even though we already have mature stochastic-policy algorithms such as actor-critic (AC), their stochasticity is still not optimal. It therefore introduces the concept of information entropy: maximize the entropy of the policy while maximizing the cumulative reward, making the policy more robust and thus achieving the optimal stochastic policy.
——————————————————
OpenAI Documentation_SAC interface link
~
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018-08, ICML 2018
Soft Actor-Critic Algorithms and Applications, Haarnoja et al., 2019-01
Learning to Walk via Deep Reinforcement Learning, Haarnoja et al., 2019-06, RSS 2019
Soft Actor-Critic (SAC) optimizes a stochastic policy in an off-policy manner.
DDPG + stochastic policy optimization.
Not a direct successor to TD3 (the two were published almost simultaneously).
It incorporates the clipped double-Q trick, and because SAC's policy is inherently stochastic, it also ends up benefiting from something like target policy smoothing.
A core feature of SAC is entropy regularization.
The policy is trained to maximize a trade-off between expected return and entropy, where entropy is a measure of the randomness of the policy.
This is closely related to the exploration-exploitation trade-off: increasing entropy leads to more exploration, which can both speed up later learning and prevent the policy from prematurely converging to a bad local optimum.
It can be used in both continuous and discrete action spaces.
In entropy-regularized reinforcement learning, the agent receives, at each time step, an extra reward proportional to the entropy of the policy at that time step.
The RL problem is then described as:
$$\pi^* = \arg\max_\pi \underset{\tau\sim\pi}{\mathrm{E}}\Big[\sum_{t=0}^\infty \gamma^t\Big(R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot|s_t)\big)\Big)\Big]$$
where $\alpha > 0$ is the trade-off coefficient.
The state-value function $V^\pi$, which includes the entropy bonus at every time step, is:
$$V^\pi(s) = \underset{\tau\sim\pi}{\mathrm{E}}\Big[\sum_{t=0}^\infty \gamma^t\Big(R(s_t, a_t, s_{t+1}) + \alpha H\big(\pi(\cdot|s_t)\big)\Big)\,\Big|\, s_0 = s\Big]$$
The action-value function $Q^\pi$, which includes the entropy bonus at every time step except the first, is:
$$Q^\pi(s, a) = \underset{\tau\sim\pi}{\mathrm{E}}\Big[\sum_{t=0}^\infty \gamma^t R(s_t, a_t, s_{t+1}) + \alpha\sum_{t=1}^\infty \gamma^t H\big(\pi(\cdot|s_t)\big)\,\Big|\, s_0 = s, a_0 = a\Big]$$
The relationship between $V^\pi$ and $Q^\pi$ is:
$$V^\pi(s) = \underset{a\sim\pi}{\mathrm{E}}\big[Q^\pi(s, a)\big] + \alpha H\big(\pi(\cdot|s)\big)$$
The Bellman equation for $Q^\pi$ is:
$$Q^\pi(s, a) = \underset{s'\sim P,\ a'\sim\pi}{\mathrm{E}}\Big[R(s, a, s') + \gamma\Big(Q^\pi(s', a') + \alpha H\big(\pi(\cdot|s')\big)\Big)\Big] = \underset{s'\sim P}{\mathrm{E}}\big[R(s, a, s') + \gamma V^\pi(s')\big]$$
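Combining this Bellman formula with the clipped double-Q trick gives the critic target typically used in practice; a minimal PyTorch sketch, assuming a `policy` that returns an action together with its log-probability and hypothetical target critics `q1_targ`, `q2_targ`:

```python
# SAC critic target: clipped double-Q plus the entropy bonus -alpha * log pi(a'|s').
import torch

def sac_target(r, s_next, done, policy, q1_targ, q2_targ, gamma=0.99, alpha=0.2):
    with torch.no_grad():
        # The next action is sampled from the CURRENT policy (not a target policy)
        a_next, logp_next = policy(s_next)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        return r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
```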
SAC simultaneously learns a policy $\pi_\theta$ and two $Q$ functions $Q_{\phi_1}, Q_{\phi_2}$.
There are currently two standard SAC variants: one uses a fixed entropy regularization coefficient $\alpha$, and the other enforces an entropy constraint by adapting $\alpha$ during training.
OpenAI's documentation uses the version with a fixed entropy regularization coefficient, but in practice the entropy-constrained variant is usually preferred.
As the figures in the documentation show, the fixed-$\alpha$ version has a clear advantage only in the last plot and is otherwise roughly on par with the learned-$\alpha$ version; the learned-$\alpha$ version has a more obvious advantage in the two middle plots.
SAC vs. TD3:
~
Similarities:
1. Both Q functions are learned by minimizing the MSBE (mean-squared Bellman error), regressing toward a single shared target.
2. The shared target is computed using target Q networks, which are obtained by Polyak-averaging the Q-network parameters during training.
3. The shared target uses the clipped double-Q trick.
~
Differences:
1. SAC includes an entropy regularization term.
2. The next-state action used in SAC's target comes from the current policy rather than a target policy.
3. There is no explicit target policy smoothing. TD3 trains a deterministic policy and adds random noise to the next-state action; SAC trains a stochastic policy, and the noise from that stochasticity is enough to achieve a similar effect.
Algorithm: Soft Actor-Critic (SAC)
Input: initial parameters $\theta_1, \theta_2, \phi$
Parameter initialization:
  Initialize the target network weights: $\bar\theta_1 \leftarrow \theta_1$, $\bar\theta_2 \leftarrow \theta_2$
  Initialize the replay pool to empty: $\mathcal{D} \leftarrow \emptyset$
for each iteration do:
  for each environment step do:
    Sample an action from the policy: $a_t \sim \pi_\phi(a_t|s_t)$   ▢ How is $\pi_\phi(a_t|s_t)$ defined here?
    Sample a transition from the environment: $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$
    Store the transition in the replay pool: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$
  end for
  for each gradient step do:
    Update the $Q$-function parameters: for $i \in \{1,2\}$, $\theta_i \leftarrow \theta_i - \lambda_Q \hat\nabla_{\theta_i} J_Q(\theta_i)$   ▢ How is $J_Q(\theta_i)$ defined here?
    Update the policy weights: $\phi \leftarrow \phi - \lambda_\pi \hat\nabla_\phi J_\pi(\phi)$   ▢ How is $J_\pi(\phi)$ defined here?
    Adjust the temperature: $\alpha \leftarrow \alpha - \lambda \hat\nabla_\alpha J(\alpha)$   ▢ How is $J(\alpha)$ defined here? How should the temperature be understood?
    Update the target network weights: for $i \in \{1,2\}$, $\bar\theta_i \leftarrow \tau\theta_i + (1-\tau)\bar\theta_i$   ▢ How to understand this $\tau$? → target smoothing coefficient
  end for
end for
Output: $\theta_1, \theta_2, \phi$   (optimized parameters)
$\hat\nabla$: stochastic gradient
In the version from Learning to Walk via Deep Reinforcement Learning:
~
$\alpha$ is the temperature parameter: it determines the relative importance of the entropy term versus the reward, and thereby controls the stochasticity of the optimal policy.
Large $\alpha$: more exploration.
Small $\alpha$: more exploitation.
$$J(\alpha) = \underset{a_t\sim\pi_t}{\mathbb{E}}\big[-\alpha\log\pi_t(a_t|s_t) - \alpha\bar{\mathcal{H}}\big]$$
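A minimal PyTorch sketch of the temperature adjustment implied by $J(\alpha)$; `log_alpha`, the learning rate, and the `target_entropy` heuristic (often minus the action dimension) are assumptions, not from the notes:

```python
# Automatic temperature (alpha) tuning: minimize J(alpha) = E[-alpha * log pi(a|s) - alpha * H_bar].
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # learn log(alpha) so alpha stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(logp, target_entropy):
    # logp: log-probabilities of actions freshly sampled from the current policy
    alpha_loss = (-(log_alpha.exp() * (logp + target_entropy).detach())).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()
```

Intuitively, when the policy's entropy drops below the target $\bar{\mathcal{H}}$, this update raises $\alpha$ (more exploration); when the entropy is above the target, it lowers $\alpha$ (more exploitation).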