
# [0705] Task06 DDPG algorithm, PPO algorithm, SAC algorithm [theory only]

2024-07-12


  • easy-rl PDF notes: P5, P10–P12
  • joyrl for comparison and supplementary material: P11–P13
  • OpenAI Spinning Up documentation ⭐ https://spinningup.openai.com/en/latest/index.html


Download the latest PDF version
Address: https://github.com/datawhalechina/easy-rl/releases
Domestic mirror (recommended for readers in mainland China)
Link: https://pan.baidu.com/s/1isqQnpVRWbb3yh83Vs0kbw Extraction code: us6a

easy-rl online version link (for copying code)
Reference link 2: https://datawhalechina.github.io/joyrl-book/

other:
【Correction record link】
——————
5. Basics of Deep Reinforcement Learning⭐️
Open source content: https://linklearner.com/learn/summary/11
——————————


Proximal policy optimization (PPO)

On-policy: the agent being trained and the agent interacting with the environment are the same.
Off-policy: the agent being trained and the agent interacting with the environment are different.

Policy gradient: sampling data is time-consuming, because the data must be re-collected after every policy update.

On-policy $\overset{\text{importance sampling}}{\Longrightarrow}$ Off-policy

PPO: prevents the two distributions from differing too much; it is an on-policy algorithm.
1. Original objective $J(\theta,\theta')$
2. Constraint: the KL divergence between the action distributions output by $\theta$ and $\theta'$ (the more similar $\theta$ and $\theta'$ are, the better)

PPO has a predecessor: trust region policy optimization (TRPO).
TRPO is hard to work with because it treats the KL divergence as an additional hard constraint instead of putting it into the objective function, which makes it difficult to compute. Therefore PPO is generally used instead of TRPO; their performance is similar, but PPO is much easier to implement.

KL divergence: the distance between the policies in terms of behavior, i.e. the distance between the probability distributions over actions, not the distance between the parameters.

There are two main variants of the PPO algorithm: Proximal Policy Optimization Penalty (PPO-penalty) and Proximal Policy Optimization Clipping (PPO-clip).


Similar performance, but easier to implement:
  • Trust region policy optimization (TRPO)
  • Proximal policy optimization (PPO)
  • Proximal policy optimization with penalty (PPO-penalty)
  • Proximal policy optimization with clipping (PPO-clip)
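
To make the two variants concrete, here is a minimal PyTorch-style sketch of both surrogate losses (not from the notes; `ratio`, `adv`, and `kl` are assumed precomputed tensors):

```python
import torch

def ppo_penalty_loss(ratio, adv, kl, beta):
    # PPO-penalty: importance-weighted advantage minus a KL penalty term.
    # ratio = pi_theta(a|s) / pi_theta_old(a|s), adv = advantage estimate,
    # kl = per-sample KL(pi_old || pi_new), beta = penalty coefficient.
    return -(ratio * adv - beta * kl).mean()

def ppo_clip_loss(ratio, adv, eps=0.2):
    # PPO-clip: clip the ratio to [1 - eps, 1 + eps] and take the pessimistic min.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```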

——————————
P10 Sparse Reward Problem
1. Reward shaping: design auxiliary rewards. Requires domain knowledge.
For example, can the final reward be distributed across the relevant intermediate actions?

2. Curiosity
Intrinsic curiosity module (ICM)
Input: $a_t, s_t$
Output: $\hat s_{t+1}$
The more the network's prediction $\hat s_{t+1}$ differs from the true next state $s_{t+1}$, the larger $r_t^i$.

$r_t^i$: the harder the future state is to predict, the larger the reward for the action. This encourages risk-taking and exploration.

  • This metric is too narrow on its own; the agent may end up learning only useless things.

Feature extractor

Network 2 (inverse model):
Input: the feature vectors $\phi(s_t)$ and $\phi(s_{t+1})$

Output: the predicted action $\hat a$; the closer it is to the real action, the better.
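
A minimal sketch of how an ICM-style intrinsic reward could be computed, assuming a shared feature extractor, a forward model, and an inverse model; the dimensions, the class name, and the float/one-hot action encoding are illustrative assumptions, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class ICM(nn.Module):
    """Minimal intrinsic-curiosity-module sketch (sizes are assumptions)."""
    def __init__(self, obs_dim=4, act_dim=2, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())   # feature extractor phi
        self.forward_model = nn.Linear(feat_dim + act_dim, feat_dim)            # predicts phi(s_{t+1})
        self.inverse_model = nn.Linear(2 * feat_dim, act_dim)                   # "Network 2": predicts a_t

    def intrinsic_reward(self, s_t, a_t, s_next):
        # a_t is assumed to be a float (or one-hot) action vector.
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        phi_pred = self.forward_model(torch.cat([phi_t, a_t], dim=-1))
        # The harder the next feature is to predict, the larger the curiosity bonus r^i_t.
        return 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)

    def predict_action(self, s_t, s_next):
        # Inverse model: predict a_t from phi(s_t) and phi(s_{t+1}).
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        return self.inverse_model(torch.cat([phi_t, phi_next], dim=-1))
```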


3. Curriculum learning

Easy -> difficult

Reverse curriculum learning:
Start from the final, most ideal state (the "gold state"), then look for states close to the gold state and use them as the intermediate "ideal" states the agent should reach at each stage. Along the way, extreme states, i.e. states that are too easy or too difficult, are deliberately removed.

4. Hierarchical reinforcement learning (HRL)
The agent's policy is split into a high-level policy and a low-level policy; the high-level policy decides, based on the current state, how the low-level policy should be executed.

————————
P11 Imitation Learning
For scenarios where the reward is not clearly defined.

Imitation learning (IL)
Learning from demonstration
Apprenticeship learning
Learning by watching

There are clear rewards: board games, video games
No clear reward: Chatbots

Collect expert demonstrations: recordings of human driving, human conversations

Conversely: what reward function would make the experts take these actions?
Inverse reinforcement learning first recovers the reward function; once the reward function is found, reinforcement learning is used to find the optimal actor.

Third-person imitation learning

————————
P12 Deep deterministic policy gradient (DDPG)


DDPG uses experience replay.

Ablation experiments (controlling one variable at a time) analyze the impact of each component on the final result.


joyrl:

DDPG_Continuous

When a deterministic policy and a continuous action space are required, this family of algorithms is a relatively stable baseline.

DDPG can be viewed as extending DQN to continuous action spaces.

Deep deterministic policy gradient (DDPG)

The experience replay mechanism reduces the correlation between samples, improves sample efficiency, and increases training stability.
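
For reference, a minimal replay-buffer sketch along these lines (uniform sampling from a fixed-capacity deque; the class name and capacity are assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay sketch: breaks sample correlation via uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```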

Shortcomings:
1. Cannot be used in discrete action spaces.
2. Highly dependent on hyperparameters.
3. Highly sensitive to initial conditions, which affects the convergence and performance of the algorithm.
4. Easily falls into local optima.

  • Because a deterministic policy is used, the algorithm may get stuck in a local optimum and have difficulty finding the globally optimal policy. To increase exploration, measures such as adding noise to actions or using other exploration methods are needed.

The advantage of soft updates is that they are smoother and slower, which avoids the oscillations caused by overly abrupt weight updates and reduces the risk of training divergence.
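
A minimal sketch of such a soft (Polyak) target update, assuming PyTorch modules; `tau=0.005` is just a typical illustrative value:

```python
def soft_update(target_net, source_net, tau=0.005):
    # Polyak averaging: target <- tau * source + (1 - tau) * target.
    # A small tau makes the target network change slowly, damping oscillations.
    for tgt_param, src_param in zip(target_net.parameters(), source_net.parameters()):
        tgt_param.data.copy_(tau * src_param.data + (1.0 - tau) * tgt_param.data)
```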

Twin delayed DDPG (TD3)

Twin delayed deep deterministic policy gradient algorithm

Three improvements: double Q networks, delayed updates, noise regularization.
Double Q networks: two Q networks; take the smaller of the two Q values. This alleviates the overestimation of Q values and improves the stability and convergence of the algorithm.

Delayed updates: update the actor less frequently than the critic.

  • Think twice

The noise acts more like a regularizer, making the value-function updates smoother.

OpenAI Gym Library_Pendulum_TD3

OpenAI's documentation interface link for TD3

TD3 Paper PDF Link

PPO_Continuous/Discrete Action Space [OpenAI 201708]

PPO is the most commonly used algorithm in reinforcement learning.
Works with both discrete and continuous action spaces.
Fast and stable, easy to tune.
A solid baseline algorithm.

PPO

In practice, the clip constraint is generally used because it is simpler, less computationally expensive, and produces better results.

Off-policy algorithms can use historical experience; they generally use experience replay to store and reuse past experience, giving high data efficiency.

PPO is an on-policy algorithm

  • Although the importance-sampling part uses samples drawn by the old actor, we do not use these samples directly to update the policy; instead, importance sampling corrects the error caused by the difference in data distributions, i.e. it minimizes the gap between the two sample distributions. In other words, although the samples are collected with the old policy, after importance sampling they can be treated as if they came (approximately) from the updated policy, so the actor being optimized and the actor used for sampling are effectively the same.

——————————————————

— OpenAI Documentation_PPO

OpenAI Documentation
Paper arXiv interface link: Proximal Policy Optimization Algorithms

PPO: an on-policy algorithm, applicable to both discrete and continuous action spaces. It may converge to a local optimum.

The motivation for PPO is the same as for TRPO: how can we take the largest possible policy-improvement step using the data we currently have, without stepping so far that performance accidentally collapses?
TRPO tries to solve this problem with a complicated second-order method, while PPO is a first-order method that uses some other tricks to keep the new policy close to the old one.
The PPO method is much simpler to implement and has been empirically shown to perform at least as well as TRPO.

There are two main variations of PPO: PPO-Penalty and PPO-Clip.

  • PPO-Penalty approximately solves the KL-constrained update like TRPO, but penalizes KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient during training so that it scales appropriately.
  • PPO-Clip has no KL-divergence or constraints in the objective function. Instead, it relies on specific clipping of the objective function to remove the incentive for the new policy to move away from the old policy.
    PPO-Clip is the main variant used by OpenAI.


PPO-Clip algorithm pseudocode


Algorithm: PPO-Clip
1: Input: initial policy parameters $\theta_0$, initial value-function parameters $\phi_0$
2: for $k=0,1,2,\dots$ do
3:       Collect a set of trajectories $\mathcal D_k=\{\tau_i\}$ by running policy $\pi_k=\pi(\theta_k)$ in the environment
4:       Compute the rewards-to-go $\hat R_t$      ($\hat R_t$ calculation rules)
5:       Compute advantage estimates $\hat A_t$ based on the current value function $V_{\phi_k}$ (using any advantage-estimation method)      ▢ What advantage-estimation methods are currently used?
6:       Update the policy by maximizing the PPO-Clip objective:

            $$\theta_{k+1}=\arg\max\limits_\theta\frac{1}{|\mathcal D_k|T}\sum\limits_{\tau\in\mathcal D_k}\sum\limits_{t=0}^T\min\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}A^{\pi_{\theta_k}}(s_t,a_t),\ g\big(\epsilon,A^{\pi_{\theta_k}}(s_t,a_t)\big)\Big)$$      ▢ How is this policy-update formula derived?

            $\pi_{\theta_k}$: the policy parameters before the update; importance sampling, with samples drawn from the old policy.

            Typically stochastic gradient ascent with Adam.
7:       Fit the value function by regression on the mean-squared error:

            $$\phi_{k+1}=\arg\min\limits_\phi\frac{1}{|\mathcal D_k|T}\sum\limits_{\tau\in\mathcal D_k}\sum\limits_{t=0}^T\Big(V_\phi(s_t)-\hat R_t\Big)^2$$

            Typically gradient descent.
8: end for
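
A hedged PyTorch-style sketch of steps 6–7 above (maximize the clipped surrogate, then regress the value function); `policy.log_prob`, `value_fn`, and the optimizer objects are assumed interfaces, not a specific library API:

```python
import torch

def ppo_update(policy, value_fn, pi_optim, v_optim, obs, act, logp_old, adv, ret,
               clip_eps=0.2, pi_iters=10, v_iters=10):
    """One PPO-Clip update sketch (interfaces and hyperparameters are assumptions)."""
    # Step 6: maximize the clipped surrogate with Adam (ascent = minimize the negative).
    for _ in range(pi_iters):
        logp = policy.log_prob(obs, act)                 # assumed helper: log pi_theta(a|s)
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
        pi_loss = -torch.min(ratio * adv, clipped).mean()
        pi_optim.zero_grad(); pi_loss.backward(); pi_optim.step()
    # Step 7: fit the value function by MSE regression on the rewards-to-go.
    for _ in range(v_iters):
        v_loss = ((value_fn(obs) - ret) ** 2).mean()
        v_optim.zero_grad(); v_loss.backward(); v_optim.step()
```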


$$g(\epsilon,A)=\begin{cases}(1+\epsilon)A & A\geq 0\\ (1-\epsilon)A & A<0\end{cases}$$


The advantage estimate in the paper:

$$\hat A_t=-V(s_t)+\underbrace{r_t+\gamma r_{t+1}+\cdots+\gamma^{T-t+1}r_{T-1}+\gamma^{T-t}V(s_T)}_{\textcolor{blue}{\hat R_t???}}$$


Let $\Delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$,
so that $r_t=\Delta_t-\gamma V(s_{t+1})+V(s_t)$.

Substituting into the expression for $\hat A_t$:

$$\begin{aligned}\hat A_t&=-V(s_t)+r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\cdots+\gamma^{T-t+1}r_{T-1}+\gamma^{T-t}V(s_T)\\ &=-V(s_t)\\ &\quad+\Delta_t-\gamma V(s_{t+1})+V(s_t)\\ &\quad+\gamma\big(\Delta_{t+1}-\gamma V(s_{t+2})+V(s_{t+1})\big)\\ &\quad+\gamma^2\big(\Delta_{t+2}-\gamma V(s_{t+3})+V(s_{t+2})\big)\\ &\quad+\cdots\\ &\quad+\gamma^{T-t+1}\big(\Delta_{T-1}-\gamma V(s_T)+V(s_{T-1})\big)\\ &\quad+\gamma^{T-t}V(s_T)\\ &=\Delta_t+\gamma\Delta_{t+1}+\gamma^2\Delta_{t+2}+\cdots+\gamma^{T-t+1}\Delta_{T-1}\end{aligned}$$
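
A small NumPy sketch of this telescoped form, accumulating the discounted $\Delta_t$ backwards over a trajectory (this is GAE with $\lambda=1$; the array names are assumptions):

```python
import numpy as np

def advantages_from_deltas(rewards, values, gamma=0.99):
    """A_hat_t = sum_l gamma^l * Delta_{t+l}, with Delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    `values` has length len(rewards) + 1; the last entry is the bootstrap V(s_T)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * running    # discounted backward accumulation
        adv[t] = running
    return adv
```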


Clipping acts as a regularizer by removing the incentive for drastic policy changes. The hyperparameter $\epsilon$ controls how far the new policy is allowed to move away from the old one.

Even with this clipping, the new policy can still end up far from the old one. The implementation here uses a particularly simple trick: early stopping. If the mean KL divergence between the new policy and the old policy exceeds a threshold, we stop taking gradient steps.
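
A minimal sketch of that early-stopping check, using the common one-sample estimate $\mathrm{KL}\approx\mathbb E[\log\pi_{\mathrm{old}}-\log\pi_{\mathrm{new}}]$; the `1.5 *` threshold factor is a convention seen in reference implementations, not a requirement:

```python
import torch

def should_stop(logp_old, logp_new, target_kl=0.01):
    # Approximate KL(old, new) from sampled log-probabilities; stop taking policy
    # gradient steps once it grows past the threshold.
    approx_kl = (logp_old - logp_new).mean().item()
    return approx_kl > 1.5 * target_kl
```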

Link: a simple derivation of the PPO objective function
The objective function of PPO-Clip is:
  ~  
$$L^{\rm CLIP}_{\theta_k}(\theta)=\underset{s,a\sim\theta_k}{\rm E}\Bigg[\min\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}A^{\theta_k}(s,a),\ {\rm clip}\Big(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},1-\epsilon,1+\epsilon\Big)A^{\theta_k}(s,a)\Bigg)\Bigg]$$
  ~  
  ~  
where $\theta_k$ are the policy parameters at the $k$-th iteration and $\epsilon$ is a small hyperparameter.
Let $\epsilon\in(0,1)$ and define
$$F(r,A,\epsilon)\doteq\min\Big(rA,\ {\rm clip}(r,1-\epsilon,1+\epsilon)A\Big)$$
When $A\geq 0$:
$$\begin{aligned}F(r,A,\epsilon)&=\min\Big(rA,\ {\rm clip}(r,1-\epsilon,1+\epsilon)A\Big)\\ &=A\min\Big(r,\ {\rm clip}(r,1-\epsilon,1+\epsilon)\Big)\\ &=A\min\Bigg(r,\ \begin{cases}1+\epsilon & r\geq 1+\epsilon\\ r & r\in(1-\epsilon,1+\epsilon)\\ 1-\epsilon & r\leq 1-\epsilon\end{cases}\Bigg)\\ &=A\begin{cases}\min(r,1+\epsilon) & r\geq 1+\epsilon\\ \min(r,r) & r\in(1-\epsilon,1+\epsilon)\\ \min(r,1-\epsilon) & r\leq 1-\epsilon\end{cases}\\ &=A\begin{cases}1+\epsilon & r\geq 1+\epsilon\\ r & r\in(1-\epsilon,1+\epsilon)\\ r & r\leq 1-\epsilon\end{cases}\quad\textcolor{blue}{\text{(from the range on the right)}}\\ &=A\min(r,1+\epsilon)\\ &=\min\big(rA,\ (1+\epsilon)A\big)\end{aligned}$$
  ~  
When $A<0$:
$$\begin{aligned}F(r,A,\epsilon)&=\min\Big(rA,\ {\rm clip}(r,1-\epsilon,1+\epsilon)A\Big)\\ &=A\max\Big(r,\ {\rm clip}(r,1-\epsilon,1+\epsilon)\Big)\\ &=A\max\Bigg(r,\ \begin{cases}1+\epsilon & r\geq 1+\epsilon\\ r & r\in(1-\epsilon,1+\epsilon)\\ 1-\epsilon & r\leq 1-\epsilon\end{cases}\Bigg)\\ &=A\begin{cases}\max(r,1+\epsilon) & r\geq 1+\epsilon\\ \max(r,r) & r\in(1-\epsilon,1+\epsilon)\\ \max(r,1-\epsilon) & r\leq 1-\epsilon\end{cases}\\ &=A\begin{cases}r & r\geq 1+\epsilon\\ r & r\in(1-\epsilon,1+\epsilon)\\ 1-\epsilon & r\leq 1-\epsilon\end{cases}\quad\textcolor{blue}{\text{(from the range on the right)}}\\ &=A\max(r,1-\epsilon)\\ &=\textcolor{blue}{\min}\big(rA,\ (1-\epsilon)A\big)\end{aligned}$$
  ~  
In summary, we can define $g(\epsilon,A)$ as:
$$g(\epsilon,A)=\begin{cases}(1+\epsilon)A & A\geq 0\\ (1-\epsilon)A & A<0\end{cases}$$
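
A quick numeric sanity check of this equivalence, i.e. that $\min\big(rA,\ {\rm clip}(r,1-\epsilon,1+\epsilon)A\big)=\min\big(rA,\ g(\epsilon,A)\big)$ for arbitrary ratios and advantages:

```python
import numpy as np

def clip_form(r, A, eps):
    return np.minimum(r * A, np.clip(r, 1 - eps, 1 + eps) * A)

def g_form(r, A, eps):
    g = np.where(A >= 0, (1 + eps) * A, (1 - eps) * A)
    return np.minimum(r * A, g)

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 2.0, size=1000)    # probability ratios (non-negative)
A = rng.normal(size=1000)               # advantages
assert np.allclose(clip_form(r, A, 0.2), g_form(r, A, 0.2))
```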

Why does this definition ensure that the new strategy does not stray too far from the old one?
Importance sampling itself requires that the new policy $\pi_\theta(a|s)$ and the old policy $\pi_{\theta_k}(a|s)$ not be too far apart as distributions.

1. When advantage is positive

$$L(s,a,\theta_k,\theta)=\min\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\ 1+\epsilon\Bigg)A^{\pi_{\theta_k}}(s,a)$$
Advantage function: If a state-action pair is found to have a high reward, increase the weight of the state-action pair.

When the advantage of the state-action pair $(s,a)$ is positive, the objective increases if the action $a$ becomes more likely, i.e. if $\pi_\theta(a|s)$ increases.
The min in this term caps how much the objective can increase:
once $\pi_\theta(a|s)>(1+\epsilon)\pi_{\theta_k}(a|s)$, the min kicks in and the term hits a ceiling of $(1+\epsilon)A^{\pi_{\theta_k}}(s,a)$.
"The new policy does not benefit by going far away from the old policy."

2. When the advantage is negative

$$L(s,a,\theta_k,\theta)=\max\Bigg(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\ 1-\epsilon\Bigg)A^{\pi_{\theta_k}}(s,a)$$

When the advantage of a state-action pair $(s,a)$ is negative, the objective increases if the action $a$ becomes less likely, i.e. if $\pi_\theta(a|s)$ decreases. However, the max in this term limits how much the objective can increase:
once $\pi_\theta(a|s)<(1-\epsilon)\pi_{\theta_k}(a|s)$, the max kicks in and the term hits a floor of $(1-\epsilon)A^{\pi_{\theta_k}}(s,a)$.

To reiterate: the new policy does not benefit by going far away from the old policy.

TD3_continuous action spaces only: Twin Delayed Deep Deterministic Policy Gradient [ICML 2018, McGill University (Canada)]


OpenAI Documentation_TD3
Paper link

While DDPG can sometimes achieve excellent performance, it is often unstable when it comes to hyperparameters and other types of tuning.
A common DDPG failure mode is that the learned Q-function starts to significantly overestimate the Q-values, which then causes the policy to break because it exploits the error in the Q-function.
Twin Delayed DDPG (TD3) is an algorithm that solves this problem by introducing three key tricks:
1. Clipped double-Q learning

  • TD3 learns two Q-functions instead of one (hence "twin"), and uses the smaller of the two Q-values ​​to form the target in the Bellman error loss function.

2. Delayed policy updates

  • TD3 updates the policy (and target network) less frequently than the Q function. The paper suggests updating the policy once for every two updates of the Q function.

3. Target policy smoothing

  • TD3 adds noise to the target action, making it harder for the policy to exploit the error in the Q function by smoothing out the changes in the action.

TD3 is an off-policy algorithm; it can only be used in environments with continuous action spaces.

TD3 algorithm pseudocode


Algorithm: TD3
Initialize the critic networks $Q_{\theta_1},Q_{\theta_2}$ and the actor network $\pi_\phi$ with random parameters $\theta_1,\theta_2,\phi$
Initialize the target networks: $\theta_1'\leftarrow\theta_1$, $\theta_2'\leftarrow\theta_2$, $\phi'\leftarrow\phi$
Initialize the replay buffer $\mathcal B$
for $t=1$ to $T$ do
       Select an action with exploration noise $a\sim\pi_\phi(s)+\epsilon$, $\epsilon\sim\mathcal N(0,\sigma)$; observe the reward $r$ and the new state $s'$
       Store the transition tuple $(s,a,r,s')$ in $\mathcal B$
       Sample a mini-batch of $N$ transitions $(s,a,r,s')$ from $\mathcal B$
       $\tilde a\leftarrow\pi_{\phi'}(s')+\epsilon$, $\epsilon\sim{\rm clip}(\mathcal N(0,\tilde\sigma),-c,c)$
       $y\leftarrow r+\gamma\min\limits_{i=1,2}Q_{\theta_i'}(s',\tilde a)$
       Update the critics: $\theta_i\leftarrow\arg\min\limits_{\theta_i}N^{-1}\sum\big(y-Q_{\theta_i}(s,a)\big)^2$
       if $t\bmod d = 0$ then
              Update $\phi$ by the deterministic policy gradient:
                     $\nabla_\phi J(\phi)=N^{-1}\sum\nabla_a Q_{\theta_1}(s,a)\big|_{a=\pi_\phi(s)}\nabla_\phi\pi_\phi(s)$
              Update the target networks:
                     $\theta_i'\leftarrow\tau\theta_i+(1-\tau)\theta_i'$      ($\tau$: target update rate)
                     $\phi'\leftarrow\tau\phi+(1-\tau)\phi'$
       end if
end for
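
A hedged PyTorch-style sketch of one TD3 update that mirrors the pseudocode above (target policy smoothing, clipped double-Q target, delayed actor and target updates); the network and optimizer objects, tensor shapes, and hyperparameter values are assumptions:

```python
import torch
import torch.nn.functional as F

def td3_update(batch, step, actor, actor_targ, critic1, critic2, critic1_targ, critic2_targ,
               critic_optim, actor_optim, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2, max_action=1.0):
    s, a, r, s_next, done = batch    # tensors sampled from the replay buffer (assumed shapes)

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_targ(s_next) + noise).clamp(-max_action, max_action)
        # Clipped double-Q: use the smaller of the two target Q-values.
        q_targ = torch.min(critic1_targ(s_next, a_next), critic2_targ(s_next, a_next))
        y = r + gamma * (1 - done) * q_targ

    # Update both critics toward the shared target y (critic_optim covers both critics).
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()

    # Delayed policy and target-network updates.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()
        # Polyak-average the target networks.
        for net, net_targ in ((actor, actor_targ), (critic1, critic1_targ), (critic2, critic2_targ)):
            for p, p_targ in zip(net.parameters(), net_targ.parameters()):
                p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
```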

Soft Actor-Critic: SAC_Continuous/Discrete Action Space [Google Brain latest version 201906]


Maximize the entropy of the policy, thus making the policy more robust.

A deterministic policy always chooses the same action in the same state.
A stochastic policy may choose different actions in the same state.

| | Deterministic policy | Stochastic policy |
| --- | --- | --- |
| Definition | Same state, same action | In the same state, different actions may be taken |
| Advantages | Stable and reproducible | Avoids local optima, better global search |
| Disadvantages | Lacks exploration; easily exploited by an opponent | May converge more slowly, affecting efficiency and performance |

In practice, when conditions allow, we prefer stochastic policies (e.g. A2C, PPO), because they are more flexible, more robust, and more stable.

Maximum-entropy reinforcement learning argues that even with mature stochastic-policy algorithms such as actor-critic, the randomness is still not optimal. It therefore introduces information entropy into the objective: maximize the cumulative reward while also maximizing the entropy of the policy, which makes the policy more robust and yields the optimal stochastic policy.

——————————————————

— OpenAI Documentation_SAC

OpenAI Documentation_SAC interface link
  ~  
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 201808 ICML 2018
Soft Actor-Critic Algorithms and Applications, Haarnoja et al, 201901
Learning to Walk via Deep Reinforcement Learning, Haarnoja et al, 201906 RSS2019

Soft Actor-Critic (SAC) optimizes a stochastic policy in an off-policy way.

DDPG + stochastic policy optimization

Not a direct successor to TD3 (published almost simultaneously).

It incorporates the clipped double-Q trick, and thanks to the inherent randomness of SAC's policy it also ends up benefiting from something like target policy smoothing.

A core feature of SAC is entropy regularization.
The policy is trained to maximize a trade-off between expected return and entropy; entropy is a measure of the randomness of the policy.
This is closely tied to the exploration-exploitation trade-off: more entropy means more exploration, which can speed up later learning and also prevent the policy from converging prematurely to a bad local optimum.

It can be used in both continuous and discrete action spaces.

In entropy-regularized reinforcement learning, the agent receives at each time step a bonus reward proportional to the entropy of the policy at that time step.
The RL problem is then described as:

$$\pi^*=\arg\max\limits_\pi\underset{\tau\sim\pi}{\rm E}\Big[\sum\limits_{t=0}^\infty\gamma^t\Big(R(s_t,a_t,s_{t+1})\textcolor{blue}{+\alpha H(\pi(\cdot|s_t))}\Big)\Big]$$

where $\alpha>0$ is the trade-off coefficient.
The state-value function $V^\pi$, which includes the entropy bonus at every time step, is:

$$V^\pi(s)=\underset{\tau\sim\pi}{\rm E}\Big[\sum\limits_{t=0}^\infty\gamma^t\Big(R(s_t,a_t,s_{t+1})+\alpha H(\pi(\cdot|s_t))\Big)\Big|s_0=s\Big]$$

The action-value function $Q^\pi$, which includes the entropy bonus at every time step except the first, is:

$$Q^\pi(s,a)=\underset{\tau\sim\pi}{\rm E}\Big[\sum\limits_{t=0}^\infty\gamma^t R(s_t,a_t,s_{t+1})+\alpha\sum\limits_{t=1}^\infty\gamma^t H(\pi(\cdot|s_t))\Big|s_0=s,a_0=a\Big]$$

  • In some papers, $Q^\pi$ also includes the entropy bonus at the first time step.

The relationship between $V^\pi$ and $Q^\pi$ is:

$$V^\pi(s)=\underset{a\sim\pi}{\rm E}\big[Q^\pi(s,a)\big]+\alpha H(\pi(\cdot|s))$$

The Bellman equation for $Q^\pi$ is:

$$\begin{aligned}Q^\pi(s,a)&=\underset{\substack{s'\sim P\\ a'\sim\pi}}{\rm E}\Big[R(s,a,s')+\gamma\big(Q^\pi(s',a')+\alpha H(\pi(\cdot|s'))\big)\Big]\\ &=\underset{s'\sim P}{\rm E}\Big[R(s,a,s')+\gamma V^\pi(s')\Big]\end{aligned}$$
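
A minimal sketch of how this entropy-regularized Bellman backup is typically estimated with a single sampled action, using $H(\pi(\cdot|s'))\approx-\log\pi(a'|s')$ for $a'\sim\pi(\cdot|s')$; `policy.sample` and the target Q-networks are assumed interfaces:

```python
import torch

def sac_q_target(r, s_next, done, policy, q1_targ, q2_targ, alpha=0.2, gamma=0.99):
    """Entropy-regularized target sketch:
    y = r + gamma * (min_i Q_targ_i(s', a') - alpha * log pi(a'|s')),  a' ~ pi(.|s')."""
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)   # assumed helper: action and its log-prob
        q_min = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        return r + gamma * (1 - done) * (q_min - alpha * logp_next)
```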

SAC learns a policy $\pi_\theta$ and two $Q$ functions $Q_{\phi_1},Q_{\phi_2}$ simultaneously.
There are currently two standard SAC variants: one uses a fixed entropy-regularization coefficient $\alpha$, and the other enforces an entropy constraint by adapting $\alpha$ during training.
OpenAI's documentation uses the fixed-coefficient version, but in practice the entropy-constrained variant is usually preferred.

As the figure below shows, the fixed-$\alpha$ version has a clear advantage only in the last plot and is otherwise only slightly better than or on par with the learned-$\alpha$ version, while the learned-$\alpha$ version has a more visible advantage in the two middle plots.

[Figure: benchmark comparison of the fixed-$\alpha$ and learned-$\alpha$ SAC variants]

SAC VS TD3:
  ~  
Similarities:
1. Both Q functions are learned by minimizing the MSBE (mean-squared Bellman error), regressing toward a single shared target.
2. The shared target is computed using target Q-networks, which are obtained by polyak-averaging the Q-network parameters during training.
3. The shared target uses the clipped double-Q trick.
  ~  
Differences:
1. SAC includes an entropy-regularization term.
2. The next-state action used in SAC's target comes from the current policy, not from a target policy.
3. There is no explicit target policy smoothing. TD3 trains a deterministic policy and adds random noise to the next-state action; SAC trains a stochastic policy, and the noise from that randomness is enough to achieve a similar effect.

SAC algorithm pseudocode


Algorithm: Soft Actor-Critic (SAC)
Input: initial parameters $\theta_1,\theta_2,\phi$
Parameter initialization:
       Initialize the target network weights: $\bar\theta_1\leftarrow\theta_1$, $\bar\theta_2\leftarrow\theta_2$
       Initialize the replay pool to empty: $\mathcal D\leftarrow\emptyset$
for each iteration do
       for each environment step do
              Sample an action from the policy: $a_t\sim\pi_\phi(a_t|s_t)$      ▢ How is $\pi_\phi(a_t|s_t)$ defined here?
              Sample a transition from the environment: $s_{t+1}\sim p(s_{t+1}|s_t,a_t)$
              Store the transition in the replay pool: $\mathcal D\leftarrow\mathcal D\cup\{(s_t,a_t,r(s_t,a_t),s_{t+1})\}$
       end for
       for each gradient step do
              Update the $Q$-function parameters: for $i\in\{1,2\}$, $\theta_i\leftarrow\theta_i-\lambda_Q\hat\nabla_{\theta_i}J_Q(\theta_i)$      ▢ How is $J_Q(\theta_i)$ defined?
              Update the policy weights: $\phi\leftarrow\phi-\lambda_\pi\hat\nabla_\phi J_\pi(\phi)$      ▢ How is $J_\pi(\phi)$ defined?
              Adjust the temperature: $\alpha\leftarrow\alpha-\lambda\hat\nabla_\alpha J(\alpha)$      ▢ How is $J(\alpha)$ defined? How should the "temperature" be understood here?
              Update the target network weights: for $i\in\{1,2\}$, $\bar\theta_i\leftarrow\tau\theta_i+(1-\tau)\bar\theta_i$      ▢ How to understand $\tau$? ——> target smoothing coefficient
       end for
end for
Output: optimized parameters $\theta_1,\theta_2,\phi$

$\hat\nabla$: stochastic gradient



[Figures: the corresponding formulations as presented in Learning to Walk via Deep Reinforcement Learning]

$\alpha$ is the temperature parameter; it determines the relative importance of the entropy term versus the reward, and thereby controls the stochasticity of the optimal policy.
Large $\alpha$: more exploration.
Small $\alpha$: more exploitation.

$$J(\alpha)=\underset{a_t\sim\pi_t}{\mathbb E}\big[-\alpha\log\pi_t(a_t|s_t)-\alpha\bar{\mathcal H}\big]$$
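
A minimal sketch of the automatic temperature adjustment suggested by this $J(\alpha)$, optimizing $\log\alpha$ by gradient descent; the target-entropy value, learning rate, and the `logp` input are illustrative assumptions:

```python
import torch

# Temperature adjustment sketch for J(alpha) = E[-alpha * log pi(a_t|s_t) - alpha * H_bar].
log_alpha = torch.zeros(1, requires_grad=True)           # optimize log(alpha) so alpha stays positive
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -2.0                                     # common heuristic: -|action dim| (assumption)

def update_alpha(logp):
    # logp = log pi(a_t|s_t) for the current batch; its gradient is not needed here.
    alpha_loss = -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
    alpha_optim.zero_grad(); alpha_loss.backward(); alpha_optim.step()
    return log_alpha.exp().item()                         # current value of alpha
```

If the policy's entropy drops below the target $\bar{\mathcal H}$, this update increases $\alpha$ (more exploration); if the entropy is above the target, $\alpha$ decreases (more exploitation).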