Posted 2026-06-03 | Updated 2026-07-02 | Study | 13 minutes read (About 1982 words)

2026-06-03-SwingUpCartpole

From CartPole-v1 to CartPole Swing-Up: Modifying a Classic RL Environment

The standard CartPole-v1 environment is one of the most common introductory reinforcement learning tasks. In this environment, the pole starts almost upright, and the agent only needs to learn how to keep it balanced by moving the cart left or right.

However, there is a harder and more physically interesting version of this task: CartPole Swing-Up. Instead of starting upright, the pole starts naturally hanging downward. The agent must first learn to swing the pole up by moving the cart back and forth, and then stabilize it near the upright position.

This post summarizes the main changes I made to turn a standard CartPole-v1 PPO training script into one that trains on a custom CartPole swing-up environment.

1. Original CartPole-v1

In the original version, the CartPole-v1 environment created directly from Gymnasium has a relatively simple setup:

Observation: [x, x_dot, theta, theta_dot]
Action space: Discrete(2)
Initial state: pole starts near upright
Termination: cart out of bounds or pole angle too large
Reward: usually +1 for every surviving step

Here:

x          = cart position
x_dot      = cart velocity
theta      = pole angle
theta_dot  = pole angular velocity

The two actions are:

0 -> push cart left
1 -> push cart right

In this balancing task, the pole already starts near upright, so the agent mainly needs to prevent it from falling. But what if we want to train for a harder setting, where the cart must swing the pole up by itself and then balance it?

2. Why CartPole-v1 Cannot Directly Do Swing-Up

In the swing-up setting, the pole should start near the downward position:

theta = np.pi

However, in standard CartPole-v1, the episode terminates when the pole angle becomes too large. The usual angle threshold is around ±12 degrees from the upright position.

This means that if we reset the pole to the downward position, the episode would terminate immediately.

Therefore, to train a swing-up policy, we need to define a custom environment.
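To make the immediate termination concrete, here is a minimal sketch of a balance-only angle check, using only the standard library. The 12-degree constant mirrors the threshold described above; the helper name `angle_terminates` is made up for illustration and is not part of Gymnasium.

```python
import math

# 12 degrees expressed in radians (roughly 0.2095 rad)
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360


def angle_terminates(theta):
    """Return True if a balance-only angle check would end the episode."""
    return abs(theta) > THETA_THRESHOLD_RADIANS


print(angle_terminates(0.05))     # small tilt near upright: episode continues
print(angle_terminates(math.pi))  # pole hanging downward: episode ends at once
```

With theta = pi the check fails on the very first step, which is why the swing-up task needs its own termination rule.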

3. Custom CartPoleSwingUpEnv

The new environment is defined as a custom Gymnasium environment:

import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CartPoleSwingUpEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 50}

    def __init__(self, render_mode=None):
        super().__init__()

        self.render_mode = render_mode

        # Continuous force action: force ∈ [-F_max, F_max]
        self.force_mag = 10.0
        self.action_space = spaces.Box(
            low=np.array([-self.force_mag], dtype=np.float32),
            high=np.array([self.force_mag], dtype=np.float32),
            dtype=np.float32,
        )

        # Observation: [x, x_dot, cos(theta), sin(theta), theta_dot]
        high = np.array(
            [
                np.inf,  # x
                np.inf,  # x_dot
                1.0,     # cos(theta)
                1.0,     # sin(theta)
                np.inf,  # theta_dot
            ],
            dtype=np.float32,
        )
        low = np.array(
            [-np.inf, -np.inf, -1.0, -1.0, -np.inf],
            dtype=np.float32,
        )
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

        # CartPole physical parameters, same style as Gym CartPole
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masscart + self.masspole
        self.length = 0.5  # actually half the pole length
        self.polemass_length = self.masspole * self.length
        self.tau = 0.02

        # Termination threshold
        self.x_threshold = 4.8

        # Max episode length
        self.max_steps = 1000
        self.step_count = 0

        self.state = None
        self.np_random = None

        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True

    def _get_obs(self):
        x, x_dot, theta, theta_dot = self.state
        return np.array(
            [
                x,
                x_dot,
                np.cos(theta),
                np.sin(theta),
                theta_dot,
            ],
            dtype=np.float32,
        )

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)

        self.step_count = 0

        # Start near downward position.
        # Here theta = 0 means upright, theta = pi means hanging downward.
        x = self.np_random.uniform(low=-0.05, high=0.05)
        x_dot = self.np_random.uniform(low=-0.05, high=0.05)
        theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)
        theta_dot = self.np_random.uniform(low=-0.05, high=0.05)

        self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

        if self.render_mode == "human":
            self.render()

        return self._get_obs(), {}

    def step(self, action):
        self.step_count += 1

        # Convert Box action to scalar force
        force = float(np.clip(action[0], -self.force_mag, self.force_mag))

        x, x_dot, theta, theta_dot = self.state

        costheta = np.cos(theta)
        sintheta = np.sin(theta)

        # Same dynamics form as classic CartPole
        temp = (
            force + self.polemass_length * theta_dot**2 * sintheta
        ) / self.total_mass

        theta_acc = (
            self.gravity * sintheta - costheta * temp
        ) / (
            self.length
            * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
        )

        x_acc = temp - self.polemass_length * theta_acc * costheta / self.total_mass

        # Euler integration
        x = x + self.tau * x_dot
        x_dot = x_dot + self.tau * x_acc
        theta = theta + self.tau * theta_dot
        theta_dot = theta_dot + self.tau * theta_acc

        # Wrap theta into [-pi, pi) for numerical stability
        theta = ((theta + np.pi) % (2 * np.pi)) - np.pi

        self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

        # Terminate only if cart leaves track
        terminated = bool(x < -self.x_threshold or x > self.x_threshold)
        # Truncate if max episode length reached
        truncated = bool(self.step_count >= self.max_steps)

        # Reward shaping
        # theta = 0 is upright, theta = pi or -pi is downward
        upright_reward = (np.cos(theta) + 1.0) / 2.0

        cart_penalty = 0.01 * x**2
        velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
        action_penalty = 0.0001 * force**2

        reward = upright_reward - cart_penalty - velocity_penalty - action_penalty

        if self.render_mode == "human":
            self.render()

        return self._get_obs(), float(reward), terminated, truncated, {}

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False
            self.screen = None
            self.clock = None

4. Action Space: From Discrete Left/Right to Continuous Force

In standard CartPole, the action space is discrete:

action_space = Discrete(2)

The agent can only choose between two fixed actions:

push left or push right

For the swing-up task, I changed the action space to a continuous force:

self.force_mag = 10.0

self.action_space = spaces.Box(
    low=np.array([-self.force_mag], dtype=np.float32),
    high=np.array([self.force_mag], dtype=np.float32),
    dtype=np.float32,
)

Now the policy outputs a continuous force:

force ∈ [-F_max, F_max]

This gives the agent more precise control. Instead of only choosing “left” or “right,” it can decide how strongly to push the cart.

The action is then clipped inside step():

force = float(np.clip(action[0], -self.force_mag, self.force_mag))

This makes the control problem closer to a continuous-control task.
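The difference between the two action conventions can be sketched in a few lines. This is an illustration only: both function names are hypothetical, and the discrete mapping assumes that each of the two actions applies the full force magnitude in its direction.

```python
FORCE_MAG = 10.0  # same magnitude as self.force_mag in the environment


def discrete_to_force(action):
    """Bang-bang control: 0 -> full push left, 1 -> full push right (assumed)."""
    return FORCE_MAG if action == 1 else -FORCE_MAG


def continuous_to_force(action):
    """Continuous control: any force is allowed, clipped to the limits."""
    return max(-FORCE_MAG, min(FORCE_MAG, float(action)))


print(discrete_to_force(0))       # -10.0
print(continuous_to_force(2.5))   # 2.5
print(continuous_to_force(37.0))  # 10.0
```

The clipping step matters because a Gaussian policy can sample values outside the declared bounds, so the environment should not trust the raw action.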

5. Observation Space: From 4D to 5D

The original CartPole observation is 4-dimensional:

[x, x_dot, theta, theta_dot]

For swing-up, I changed it to:

[x, x_dot, cos(theta), sin(theta), theta_dot]

This makes the observation 5-dimensional.

The reason is that angle is a periodic variable. For example:

theta = pi
theta = -pi

represent the same physical pole direction, but numerically they look very different if we feed theta directly into a neural network.

Using cos(theta) and sin(theta) avoids this discontinuity:

def _get_obs(self):
    x, x_dot, theta, theta_dot = self.state
    return np.array(
        [
            x,
            x_dot,
            np.cos(theta),
            np.sin(theta),
            theta_dot,
        ],
        dtype=np.float32,
    )

This representation is more stable for learning because the pole angle can rotate through the full circle during swing-up.
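The discontinuity can be checked numerically. The following sketch uses only the standard library; `encode_angle` and `decode_angle` are hypothetical helper names, not functions from the environment.

```python
import math


def encode_angle(theta):
    """Map an angle to the (cos, sin) pair used in the observation."""
    return (math.cos(theta), math.sin(theta))


def decode_angle(cos_t, sin_t):
    """Recover an angle in [-pi, pi] from its (cos, sin) pair."""
    return math.atan2(sin_t, cos_t)


down_a = encode_angle(math.pi)
down_b = encode_angle(-math.pi)

# Distance between the two "downward" angles in each representation
gap_raw = abs(math.pi - (-math.pi))      # about 6.28: looks like a huge jump
gap_encoded = math.dist(down_a, down_b)  # essentially zero: same direction

print(gap_raw, gap_encoded)
print(decode_angle(*encode_angle(0.3)))  # recovers roughly 0.3
```

No information is lost by the encoding, since `atan2` recovers the angle from the pair.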

6. Initial State: Start from the Downward Position

In standard CartPole, the pole starts near upright. For swing-up, I reset the pole near the downward position:

theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)

The small random noise prevents every episode from starting in exactly the same symmetric state.

The reset function initializes the state as:

x = self.np_random.uniform(low=-0.05, high=0.05)
x_dot = self.np_random.uniform(low=-0.05, high=0.05)
theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)
theta_dot = self.np_random.uniform(low=-0.05, high=0.05)

self.state = np.array([x, x_dot, theta, theta_dot], dtype=np.float32)

Here, I define:

theta = 0  -> pole is upright
theta = pi -> pole is hanging downward

So the agent must learn to first inject energy into the system, swing the pole upward, and then balance it.
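How much energy has to be injected can be estimated from the physical parameters of the environment. The sketch below is a back-of-the-envelope calculation, not part of the environment: it assumes a uniform pole whose center of mass sits at `length = 0.5` from the pivot, and it ignores the kinetic energy of the cart and pole.

```python
import math

GRAVITY = 9.8
MASS_POLE = 0.1
HALF_LENGTH = 0.5  # pivot to center of mass (assumed uniform pole)


def pole_potential_energy(theta):
    """Potential energy of the pole relative to the pivot (theta = 0 is upright)."""
    return MASS_POLE * GRAVITY * HALF_LENGTH * math.cos(theta)


# Energy needed to lift the pole from hanging downward to upright
energy_gap = pole_potential_energy(0.0) - pole_potential_energy(math.pi)
print(energy_gap)  # roughly 0.98 joules
```

The agent cannot add this energy to the pole directly. It has to pump it in through cart motion, which is why the early part of a successful episode looks like repeated back-and-forth pushes.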

7. Termination: Remove Angle-Based Termination

In standard CartPole, the episode terminates when the pole angle exceeds a small threshold. That makes sense for a pure balancing task, but it is not suitable for swing-up.

For swing-up, the pole must be allowed to fall, rotate, and swing through large angles. Therefore, I removed the angle-based termination condition.

The new termination rule is only based on the cart leaving the track:

terminated = bool(x < -self.x_threshold or x > self.x_threshold)

The episode is truncated when it reaches the maximum length:

truncated = bool(self.step_count >= self.max_steps)

In my implementation:

self.x_threshold = 4.8
self.max_steps = 1000

So the task gives the agent enough time to swing the pole up and stabilize it.
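As a standalone illustration, the two end-of-episode flags can be written as a small pure function. The name `episode_flags` is hypothetical; the constants are the ones used in the environment above.

```python
X_THRESHOLD = 4.8  # track half-width
MAX_STEPS = 1000   # step budget per episode
TAU = 0.02         # seconds of simulated time per step


def episode_flags(x, step_count):
    """Return (terminated, truncated) for a cart position and step count."""
    terminated = x < -X_THRESHOLD or x > X_THRESHOLD  # failure: cart left the track
    truncated = step_count >= MAX_STEPS               # time limit, not a failure
    return terminated, truncated


print(episode_flags(0.0, 10))    # (False, False)
print(episode_flags(5.0, 10))    # (True, False)
print(episode_flags(0.0, 1000))  # (False, True)
print(MAX_STEPS * TAU)           # about 20 seconds of simulated time
```

Keeping the two flags separate lets the training code tell a genuine failure apart from an episode that simply ran out of time.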

8. Dynamics: Simulating Cart-Pole Physics

The environment updates the cart-pole state manually inside step().

The agent outputs a horizontal force applied to the cart. The environment then computes the resulting cart acceleration and pole angular acceleration:

temp = (
    force + self.polemass_length * theta_dot**2 * sintheta
) / self.total_mass

theta_acc = (
    self.gravity * sintheta - costheta * temp
) / (
    self.length
    * (4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass)
)

x_acc = temp - self.polemass_length * theta_acc * costheta / self.total_mass

Then the state is updated with Euler integration:

x = x + self.tau * x_dot
x_dot = x_dot + self.tau * x_acc
theta = theta + self.tau * theta_dot
theta_dot = theta_dot + self.tau * theta_acc

Finally, the angle is wrapped into [-pi, pi) for numerical stability:

theta = ((theta + np.pi) % (2 * np.pi)) - np.pi

This means the policy does not directly control the pole. It only controls the cart force. The pole moves as a result of the cart-pole dynamics.
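For experimenting with the physics outside the environment class, the same update can be written as a pure function. This is a sketch that copies the equations and constants above into module-level names; `euler_step` and `wrap_angle` are hypothetical helpers, and only the standard library is used.

```python
import math

GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5
POLEMASS_LENGTH = MASS_POLE * HALF_LENGTH
TAU = 0.02


def wrap_angle(theta):
    """Wrap an angle into [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + POLEMASS_LENGTH * theta_dot**2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t**2 / TOTAL_MASS)
    )
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS

    # Positions use the old velocities, velocities use the new accelerations
    new_x = x + TAU * x_dot
    new_x_dot = x_dot + TAU * x_acc
    new_theta = wrap_angle(theta + TAU * theta_dot)
    new_theta_dot = theta_dot + TAU * theta_acc
    return (new_x, new_x_dot, new_theta, new_theta_dot)


# A pole tilted slightly from upright, with no force, starts to fall further
tilted = euler_step((0.0, 0.0, 0.1, 0.0), force=0.0)
print(tilted)
```

In the tilted example the angular velocity becomes positive after one step, so the pole keeps moving away from upright unless the cart reacts.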

9. Reward Design

The standard CartPole reward is usually simple:

+1 for every step the pole remains balanced

This is not enough for swing-up because the pole starts downward. The agent needs a reward signal that encourages it to move toward the upright position.

I used the following shaped reward:

upright_reward = (np.cos(theta) + 1.0) / 2.0

cart_penalty = 0.01 * x**2
velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
action_penalty = 0.0001 * force**2

reward = upright_reward - cart_penalty - velocity_penalty - action_penalty

The main term is:

upright_reward = (np.cos(theta) + 1.0) / 2.0

This gives:

theta = 0      -> upright_reward = 1
theta = pi     -> upright_reward = 0
theta = pi/2   -> upright_reward = 0.5

So the reward becomes larger as the pole gets closer to upright.

The penalty terms encourage stable behavior:

cart_penalty      -> keep the cart near the center
velocity_penalty  -> avoid excessive cart and pole velocities
action_penalty    -> avoid unnecessarily large forces

Together, the reward encourages the agent to swing the pole up, keep the cart within bounds, and stabilize the system smoothly.
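To get a feel for the relative size of the terms, the reward can be evaluated as a standalone function on a few hand-picked states. The function name `swingup_reward` is hypothetical; the coefficients are the ones listed above.

```python
import math


def swingup_reward(x, x_dot, theta, theta_dot, force):
    """Shaped reward: upright bonus minus position, velocity, and effort penalties."""
    upright_reward = (math.cos(theta) + 1.0) / 2.0
    cart_penalty = 0.01 * x**2
    velocity_penalty = 0.001 * x_dot**2 + 0.001 * theta_dot**2
    action_penalty = 0.0001 * force**2
    return upright_reward - cart_penalty - velocity_penalty - action_penalty


# Perfectly balanced at the center with no effort: the best possible step
print(swingup_reward(0.0, 0.0, 0.0, 0.0, 0.0))       # 1.0
# Hanging downward while pushing at full force: slightly negative
print(swingup_reward(0.0, 0.0, math.pi, 0.0, 10.0))  # roughly -0.01
# Upright but near the edge of the track: the cart penalty dominates
print(swingup_reward(4.0, 0.0, 0.0, 0.0, 0.0))       # roughly 0.84
```

Even at full force the action penalty is only about 0.01 per step, so it discourages wasted effort without outweighing the upright term.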

10. Training with PPO

The PPO training code stays mostly the same. The main change is replacing the original environment with the custom swing-up environment:

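from stable_baselines3 import PPO

env = CartPoleSwingUpEnv()

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    device="cpu",
)

model.learn(
    total_timesteps=600_000,
    # swanlab_cb and LogApproxKL are callbacks defined elsewhere in the training script
    callback=[swanlab_cb, LogApproxKL()],
    progress_bar=True,
)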

The number of training steps is larger than in the standard CartPole experiment because swing-up is a harder task. The policy has to learn both energy injection and balancing.

After training, the model is saved as:

model.save("output/ppo_cartpole_swingup")

The saved output contains the trained PPO policy, value network, and related model metadata. The custom environment code itself is not saved inside the model file, so inference still needs the same CartPoleSwingUpEnv definition.

11. Summary of Main Changes

Component	CartPole-v1	CartPole Swing-Up
Initial pole position	Near upright	Near downward
Action space	`Discrete(2)`	Continuous force, `Box([-F_max, F_max])`
Observation	`[x, x_dot, theta, theta_dot]`	`[x, x_dot, cos(theta), sin(theta), theta_dot]`
Pole angle termination	Yes	No
Cart position termination	Yes	Yes
Reward	Usually `+1` per step	Upright reward minus stability penalties
Max episode length	500 steps	1000 steps
Task difficulty	Balance only	Swing-up + balance

http://gong208.github.io/2026/06/03/2026-06-03-SwingUpCartpole/

Author

Jiangshan Gong

#RL Experiment
