2026-06-03-SwingUpCartpole

From CartPole-v1 to CartPole Swing-Up: Modifying a Classic RL Environment

The standard CartPole-v1 environment is one of the most common introductory reinforcement learning tasks. In this environment, the pole starts almost upright, and the agent only needs to learn how to keep it balanced by moving the cart left or right.

However, there is a harder and more physically interesting version of this task: CartPole Swing-Up. Instead of starting upright, the pole starts naturally hanging downward. The agent must first learn to swing the pole up by moving the cart back and forth, and then stabilize it near the upright position.

This post summarizes the main changes I made to adapt a standard CartPole-v1 PPO training script to a custom CartPole swing-up environment.


1. Original CartPole-v1

In the original version, the CartPole-v1 environment created directly from Gymnasium has a relatively simple setup:

Observation: [x, x_dot, theta, theta_dot]
Action space: Discrete(2)
Initial state: pole starts near upright
Termination: cart out of bounds or pole angle too large
Reward: usually +1 for every surviving step

Here:

x         = cart position
x_dot     = cart velocity
theta     = pole angle
theta_dot = pole angular velocity

The two actions are:

0 -> push cart left
1 -> push cart right

In this balancing task, the pole already starts near upright, so the agent mainly needs to prevent it from falling. But what if we want a harder setting, where the cart must swing the pole up by itself and then balance it?


2. Why CartPole-v1 Cannot Directly Do Swing-Up

In the swing-up setting, the pole should start near the downward position:

theta = np.pi

However, in standard CartPole-v1, the episode terminates when the pole angle becomes too large. The usual angle threshold is around ±12 degrees from the upright position.

This means that if we reset the pole to the downward position, the episode would terminate immediately.

Therefore, to train a swing-up policy, we need to define a custom environment.
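
To make the immediate-termination argument concrete, here is a minimal sketch that compares the downward angle with the 12-degree limit. It assumes the threshold is stored in radians as 12 * 2 * pi / 360, which is how Gymnasium's CartPole expresses it; the helper name `angle_terminates` is hypothetical.

```python
import math

# Assumed value: 12 degrees expressed in radians, as in Gymnasium's CartPole.
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360


def angle_terminates(theta: float) -> bool:
    """Return True if the standard angle check would end the episode."""
    return abs(theta) > THETA_THRESHOLD_RADIANS


if __name__ == "__main__":
    print(f"threshold = {THETA_THRESHOLD_RADIANS:.4f} rad")
    print("near upright (0.05 rad):", angle_terminates(0.05))
    print("hanging down (pi rad):  ", angle_terminates(math.pi))
```

A pole at theta = pi is about fifteen times past the limit, so the very first step of such an episode would already report termination.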


3. Custom CartPoleSwingUpEnv

The new environment is defined as a custom Gymnasium environment:

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class CartPoleSwingUpEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 50}

    def __init__(self, render_mode=None):
        super().__init__()

        self.render_mode = render_mode

        # Continuous force action: force ∈ [-F_max, F_max]
        self.force_mag = 10.0
        self.action_space = spaces.Box(
            low=np.array([-self.force_mag], dtype=np.float32),
            high=np.array([self.force_mag], dtype=np.float32),
            dtype=np.float32,
        )

        # Observation: [x, x_dot, cos(theta), sin(theta), theta_dot]
        high = np.array(
            [
                np.inf,  # x
                np.inf,  # x_dot
                1.0,     # cos(theta)
                1.0,     # sin(theta)
                np.inf,  # theta_dot
            ],
            dtype=np.float32,
        )
        low = np.array([-np.inf, -np.inf, -1.0, -1.0, -np.inf], dtype=np.float32)
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

        # CartPole physical parameters, same style as Gym CartPole
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masscart + self.masspole
        self.length = 0.5  # actually half the pole length
        self.polemass_length = self.masspole * self.length
        self.tau = 0.02  # seconds between state updates

        # Termination threshold
        self.x_threshold = 4.8

        # Max episode length
        self.max_steps = 1000
        self.step_count = 0

        self.state = None
        self.np_random = None

        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True

    def _get_obs(self):
        x, x_dot, theta, theta_dot = self.state
        return np.array(
            [
                x,
                x_dot,
                np.cos(theta),
                np.sin(theta),
                theta_dot,
            ],
            dtype=np.float32,
        )

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)

        self.step_count = 0

        # Start near downward position.
        # Here theta = 0 means upright, theta = pi means hanging downward.
        x = self.np_random.uniform(low=-0.05, high=0.05)
        x_dot = self.np_random.uniform(low=-0.05, high=0.05)
        theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)
        theta_dot = self.np_random.uniform(low=-0.05, high=0.05)

        self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

        # render() is not shown in this listing; with render_mode=None
        # (the default) it is never called.
        if self.render_mode == "human":
            self.render()

        return self._get_obs(), {}

    def step(self, action):
        self.step_count += 1

        # Convert Box action to scalar force
        force = float(np.clip(action[0], -self.force_mag, self.force_mag))

        x, x_dot, theta, theta_dot = self.state

        costheta = np.cos(theta)
        sintheta = np.sin(theta)

        # Same dynamics form as classic CartPole
        temp = (
            force + self.polemass_length * theta_dot**2 * sintheta
        ) / self.total_mass

        theta_acc = (
            self.gravity * sintheta - costheta * temp
        ) / (
            self.length
            * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
        )

        x_acc = temp - self.polemass_length * theta_acc * costheta / self.total_mass

        # Euler integration
        x = x + self.tau * x_dot
        x_dot = x_dot + self.tau * x_acc
        theta = theta + self.tau * theta_dot
        theta_dot = theta_dot + self.tau * theta_acc

        # Keep theta in [-pi, pi) for numerical stability
        theta = ((theta + np.pi) % (2 * np.pi)) - np.pi

        self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

        # Terminate only if cart leaves track
        terminated = bool(x < -self.x_threshold or x > self.x_threshold)
        # Truncate if max episode length reached
        truncated = bool(self.step_count >= self.max_steps)

        # Reward shaping
        # theta = 0 is upright, theta = pi or -pi is downward
        upright_reward = (np.cos(theta) + 1.0) / 2.0

        cart_penalty = 0.01 * x**2
        velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
        action_penalty = 0.0001 * force**2

        reward = upright_reward - cart_penalty - velocity_penalty - action_penalty

        if self.render_mode == "human":
            self.render()

        return self._get_obs(), float(reward), terminated, truncated, {}

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False
            self.screen = None
            self.clock = None

4. Action Space: From Discrete Left/Right to Continuous Force

In standard CartPole, the action space is discrete:

action_space = Discrete(2)

The agent can only choose between two fixed actions:

push left or push right

For the swing-up task, I changed the action space to a continuous force:

self.force_mag = 10.0

self.action_space = spaces.Box(
    low=np.array([-self.force_mag], dtype=np.float32),
    high=np.array([self.force_mag], dtype=np.float32),
    dtype=np.float32,
)

Now the policy outputs a continuous force:

force ∈ [-F_max, F_max]

This gives the agent more precise control. Instead of only choosing “left” or “right,” it can decide how strongly to push the cart.

The action is then clipped inside step():

force = float(np.clip(action[0], -self.force_mag, self.force_mag))

This turns the task into a continuous-control problem.
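
As a small illustration of the clipping step, the sketch below uses plain Python instead of NumPy. The function name `clip_force` is hypothetical; it mirrors the `np.clip` call above for a scalar action. A policy that samples from an unbounded distribution, such as a Gaussian, can propose values outside the Box limits, so clipping inside the environment is a cheap safeguard.

```python
def clip_force(raw_action: float, force_mag: float = 10.0) -> float:
    """Limit a scalar action to the interval [-force_mag, force_mag]."""
    return max(-force_mag, min(force_mag, raw_action))


if __name__ == "__main__":
    # Values inside the bounds pass through; values outside are saturated.
    for raw in (-25.0, -3.5, 0.0, 7.2, 14.0):
        print(f"{raw:6.1f} -> {clip_force(raw):6.1f}")
```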


5. Observation Space: From 4D to 5D

The original CartPole observation is 4-dimensional:

[x, x_dot, theta, theta_dot]

For swing-up, I changed it to:

[x, x_dot, cos(theta), sin(theta), theta_dot]

This makes the observation 5-dimensional.

The reason is that angle is a periodic variable. For example:

theta =  pi
theta = -pi

represent the same physical pole direction, but numerically they look very different if we feed theta directly into a neural network.

Using cos(theta) and sin(theta) avoids this discontinuity:

def _get_obs(self):
    x, x_dot, theta, theta_dot = self.state
    return np.array(
        [
            x,
            x_dot,
            np.cos(theta),
            np.sin(theta),
            theta_dot,
        ],
        dtype=np.float32,
    )

This representation is more stable for learning because the pole angle can rotate through the full circle during swing-up.
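
The discontinuity can be checked numerically. The following sketch, with hypothetical helper names, takes two angles on either side of the wrap point and compares the gap between the raw values with the gap between their (cos, sin) encodings.

```python
import math


def angle_features(theta: float) -> tuple:
    """Encode an angle as the pair (cos(theta), sin(theta))."""
    return math.cos(theta), math.sin(theta)


def feature_distance(theta_a: float, theta_b: float) -> float:
    """Euclidean distance between the encodings of two angles."""
    cos_a, sin_a = angle_features(theta_a)
    cos_b, sin_b = angle_features(theta_b)
    return math.hypot(cos_a - cos_b, sin_a - sin_b)


if __name__ == "__main__":
    eps = 0.01
    just_below = math.pi - eps    # slightly before the wrap point
    just_above = -math.pi + eps   # slightly after the wrap point

    print("raw angle gap:", abs(just_below - just_above))
    print("cos/sin gap:  ", feature_distance(just_below, just_above))
```

The two poles point in almost the same direction, yet the raw angles differ by nearly 2*pi, while the encoded features differ by only about 0.02.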


6. Initial State: Start from the Downward Position

In standard CartPole, the pole starts near upright. For swing-up, I reset the pole near the downward position:

theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)

The small random noise prevents every episode from starting in exactly the same symmetric state.

The reset function initializes the state as:

x = self.np_random.uniform(low=-0.05, high=0.05)
x_dot = self.np_random.uniform(low=-0.05, high=0.05)
theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)
theta_dot = self.np_random.uniform(low=-0.05, high=0.05)

self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

Here, I define:

theta = 0  -> pole is upright
theta = pi -> pole is hanging downward

So the agent must learn to first inject energy into the system, swing the pole upward, and then balance it.
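
To give a rough sense of how much energy has to be injected, the sketch below computes the potential-energy gap between the downward and upright positions. It uses the pole mass, gravity, and half-length from the environment above and assumes a uniform pole whose center of mass sits at the half-length. The function name is hypothetical, and cart motion and kinetic energy are ignored.

```python
def swing_up_energy_gap(masspole: float = 0.1,
                        gravity: float = 9.8,
                        half_length: float = 0.5) -> float:
    """Potential energy in joules needed to lift the pole from down to up.

    The center of mass rises by 2 * half_length between theta = pi
    (hanging down) and theta = 0 (upright).
    """
    height_gain = 2.0 * half_length
    return masspole * gravity * height_gain


if __name__ == "__main__":
    print(f"energy gap = {swing_up_energy_gap():.2f} J")
```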


7. Termination: Remove Angle-Based Termination

In standard CartPole, the episode terminates when the pole angle exceeds a small threshold. That makes sense for a pure balancing task, but it is not suitable for swing-up.

For swing-up, the pole must be allowed to fall, rotate, and swing through large angles. Therefore, I removed the angle-based termination condition.

The new termination rule is only based on the cart leaving the track:

terminated = bool(x < -self.x_threshold or x > self.x_threshold)

The episode is truncated when it reaches the maximum length:

truncated = bool(self.step_count >= self.max_steps)

In my implementation:

self.x_threshold = 4.8
self.max_steps = 1000

So the task gives the agent enough time to swing the pole up and stabilize it.
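
The two end conditions can be sketched as a small standalone helper. The function `episode_flags` is hypothetical and only restates the logic above with the values from my implementation. It also shows that 1000 steps at tau = 0.02 correspond to roughly 20 seconds of simulated time.

```python
X_THRESHOLD = 4.8   # cart position limit
MAX_STEPS = 1000    # episode length limit
TAU = 0.02          # seconds of simulated time per step


def episode_flags(x: float, step_count: int) -> tuple:
    """Return (terminated, truncated) for a cart position and step count."""
    terminated = x < -X_THRESHOLD or x > X_THRESHOLD
    truncated = step_count >= MAX_STEPS
    return terminated, truncated


if __name__ == "__main__":
    print(episode_flags(0.0, 10))      # mid-episode, nothing triggered
    print(episode_flags(5.0, 10))      # cart left the track
    print(episode_flags(0.0, 1000))    # time limit reached
    print("simulated seconds:", MAX_STEPS * TAU)
```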


8. Dynamics: Simulating Cart-Pole Physics

The environment updates the cart-pole state manually inside step().

The agent outputs a horizontal force applied to the cart. The environment then computes the resulting cart acceleration and pole angular acceleration:

temp = (
    force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass

theta_acc = (
    self.gravity * sintheta - costheta * temp
) / (
    self.length
    * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)

x_acc = temp - self.polemass_length * theta_acc * costheta / self.total_mass

Then the state is updated with Euler integration:

x = x + self.tau * x_dot
x_dot = x_dot + self.tau * x_acc
theta = theta + self.tau * theta_dot
theta_dot = theta_dot + self.tau * theta_acc

Finally, the angle is wrapped into [-pi, pi) for numerical stability:

theta = ((theta + np.pi) % (2 * np.pi)) - np.pi

This means the policy does not directly control the pole. It only controls the cart force. The pole moves as a result of the cart-pole dynamics.
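
The update can also be tried outside the environment. The sketch below is a pure-Python restatement of the same formulas, with hypothetical function names and the physical constants copied from the environment above. It is meant for experimenting with single steps, not as a replacement for step().

```python
import math

GRAVITY = 9.8
MASSCART = 1.0
MASSPOLE = 0.1
TOTAL_MASS = MASSCART + MASSPOLE
LENGTH = 0.5                         # half the pole length
POLEMASS_LENGTH = MASSPOLE * LENGTH
TAU = 0.02                           # integration time step in seconds


def accelerations(theta, theta_dot, force):
    """Return (x_acc, theta_acc) using the same formulas as step()."""
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot**2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        LENGTH * (4.0 / 3.0 - MASSPOLE * cos_t**2 / TOTAL_MASS)
    )
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return x_acc, theta_acc


def wrap_angle(theta):
    """Wrap an angle into the half-open interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    x_acc, theta_acc = accelerations(theta, theta_dot, force)
    x = x + TAU * x_dot
    x_dot = x_dot + TAU * x_acc
    theta = wrap_angle(theta + TAU * theta_dot)
    theta_dot = theta_dot + TAU * theta_acc
    return x, x_dot, theta, theta_dot


if __name__ == "__main__":
    # Push with full force for one step, starting at rest with the pole down.
    state = (0.0, 0.0, math.pi, 0.0)
    print(euler_step(state, force=10.0))
    # An angle slightly past pi wraps to just above -pi.
    print(wrap_angle(math.pi + 0.1))
```

Note that the force changes only the velocities in the first step. Positions move one step later, because explicit Euler updates x and theta from the old velocities.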


9. Reward Design

The standard CartPole reward is usually simple:

+1 for every step the pole remains balanced

This is not enough for swing-up because the pole starts downward. The agent needs a reward signal that encourages it to move toward the upright position.

I used the following shaped reward:

upright_reward = (np.cos(theta) + 1.0) / 2.0

cart_penalty = 0.01 * x**2
velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
action_penalty = 0.0001 * force**2

reward = upright_reward - cart_penalty - velocity_penalty - action_penalty

The main term is:

upright_reward = (np.cos(theta) + 1.0) / 2.0

This gives:

theta = 0    -> upright_reward = 1
theta = pi   -> upright_reward = 0
theta = pi/2 -> upright_reward = 0.5

So the reward becomes larger as the pole gets closer to upright.

The penalty terms encourage stable behavior:

cart_penalty     -> keep the cart near the center
velocity_penalty -> avoid excessive cart and pole velocities
action_penalty   -> avoid unnecessarily large forces

Together, the reward encourages the agent to swing the pole up, keep the cart within bounds, and stabilize the system smoothly.
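
To see how the terms trade off, the reward can be evaluated at a few hand-picked states. The sketch below wraps the same terms and coefficients in a hypothetical standalone function, `swing_up_reward`, using plain Python.

```python
import math


def swing_up_reward(x, x_dot, theta, theta_dot, force):
    """Shaped reward with the same terms and coefficients as step()."""
    upright_reward = (math.cos(theta) + 1.0) / 2.0
    cart_penalty = 0.01 * x**2
    velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
    action_penalty = 0.0001 * force**2
    return upright_reward - cart_penalty - velocity_penalty - action_penalty


if __name__ == "__main__":
    # Upright, centered, at rest, no force: the best possible step reward.
    print(swing_up_reward(0.0, 0.0, 0.0, 0.0, 0.0))
    # Hanging down at rest: no upright reward.
    print(swing_up_reward(0.0, 0.0, math.pi, 0.0, 0.0))
    # Upright but at the track edge, pushing with full force.
    print(swing_up_reward(4.8, 0.0, 0.0, 0.0, 10.0))
```

Because every penalty is non-negative and the upright term never exceeds 1, the per-step reward is at most 1, so a 1000-step episode has a return of at most 1000.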


10. Training with PPO

The PPO training code stays mostly the same. The main change is replacing the original environment with the custom swing-up environment:

env = CartPoleSwingUpEnv()

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    device="cpu",
)

model.learn(
    total_timesteps=600_000,
    callback=[swanlab_cb, LogApproxKL()],
    progress_bar=True,
)

The number of training steps is larger than in the standard CartPole experiment because swing-up is a harder task. The policy has to learn both energy injection and balancing.

After training, the model is saved as:

model.save("output/ppo_cartpole_swingup")

The saved output contains the trained PPO policy, value network, and related model metadata. The custom environment code itself is not saved inside the model file, so inference still needs the same CartPoleSwingUpEnv definition.


11. Summary of Main Changes

| Component | CartPole-v1 | CartPole Swing-Up |
| --- | --- | --- |
| Initial pole position | Near upright | Near downward |
| Action space | Discrete(2) | Continuous force Box([-F_max, F_max]) |
| Observation | [x, x_dot, theta, theta_dot] | [x, x_dot, cos(theta), sin(theta), theta_dot] |
| Pole angle termination | Yes | No |
| Cart position termination | Yes | Yes |
| Reward | Usually +1 per step | Upright reward minus stability penalties |
| Max episode length | 500 steps | 1000 steps |
| Task difficulty | Balance only | Swing-up + balance |

12. Takeaway

The standard CartPole task is mainly a balancing problem. The pole already starts near the correct position, so the agent only needs to keep it there.

CartPole swing-up is a more challenging control problem. The agent must first learn how to create enough momentum to swing the pole upward, and then transition into a balancing policy.

This requires changes to the environment design:

1. Start the pole downward
2. Remove angle-based termination
3. Use a continuous force action
4. Represent the pole angle with sin/cos
5. Design a reward that encourages upright stabilization
6. Use a longer training horizon

After these changes, the same PPO algorithm can be used to train a policy that learns the full swing-up behavior.

Author

Jiangshan Gong

Posted on

2026-06-03

Updated on

2026-06-03
