2026-06-03-SwingUpCartpole
From CartPole-v1 to CartPole Swing-Up: Modifying a Classic RL Environment
The standard CartPole-v1 environment is one of the most common introductory reinforcement learning tasks. In this environment, the pole starts almost upright, and the agent only needs to learn how to keep it balanced by moving the cart left or right.
However, there is a harder and more physically interesting version of this task: CartPole Swing-Up. Instead of starting upright, the pole starts naturally hanging downward. The agent must first learn to swing the pole up by moving the cart back and forth, and then stabilize it near the upright position.
This post summarizes the main changes I made to convert a standard CartPole-v1 PPO training script to use a custom CartPole swing-up environment.
1. Original CartPole-v1
In the original version, the CartPole-v1 environment created directly from Gymnasium has a relatively simple setup:
    Observation: [x, x_dot, theta, theta_dot]
Here:
    x         = cart position
    x_dot     = cart velocity
    theta     = pole angle
    theta_dot = pole angular velocity
The two actions are:
    0 -> push cart left
    1 -> push cart right
In this balancing task, the pole already starts near upright, so the agent mainly needs to prevent it from falling. But what if we want a harder setting, where the cart must swing the pole up by itself and then balance it?
2. Why CartPole-v1 Cannot Directly Do Swing-Up
In the swing-up setting, the pole should start near the downward position:
    theta = np.pi
However, in standard CartPole-v1, the episode terminates when the pole angle becomes too large. The usual angle threshold is around ±12 degrees from the upright position.
This means that if we reset the pole to the downward position, the episode would terminate immediately.
Therefore, to train a swing-up policy, we need to define a custom environment.
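The immediate-termination argument above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 12-degree threshold used by Gymnasium's CartPole-v1; the helper name `would_terminate` is made up for illustration.

```python
import math

# CartPole-v1 ends an episode once |theta| exceeds about 12 degrees
# (assumed here to match Gymnasium's threshold, expressed in radians).
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360  # roughly 0.209 rad


def would_terminate(theta: float) -> bool:
    """Return True if the angle-based termination rule would fire."""
    return abs(theta) > THETA_THRESHOLD_RADIANS


upright_start = 0.03      # a typical near-upright initial angle
downward_start = math.pi  # the start we want for swing-up

print(would_terminate(upright_start))   # False
print(would_terminate(downward_start))  # True
```

A downward pole is about 15 times past the threshold, so the angle rule has to go before swing-up can be trained at all.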
3. Custom CartPoleSwingUpEnv
The new environment is defined as a custom Gymnasium environment:
    class CartPoleSwingUpEnv(gym.Env):
4. Action Space: From Discrete Left/Right to Continuous Force
In standard CartPole, the action space is discrete:
    action_space = Discrete(2)
The agent can only choose between two fixed actions:
    push left or push right
For the swing-up task, I changed the action space to a continuous force:
    self.force_mag = 10.0
Now the policy outputs a continuous force:
    force ∈ [-F_max, F_max]
This gives the agent more precise control. Instead of only choosing “left” or “right,” it can decide how strongly to push the cart.
The action is then clipped inside step():
    force = float(np.clip(action[0], -self.force_mag, self.force_mag))
This makes the control problem closer to a continuous-control task.
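To see what the clipping step does to out-of-range policy outputs, here is a dependency-free sketch that mirrors the `np.clip` call above. The function name `clip_force` is hypothetical; the default of 10.0 is the `force_mag` value from this post.

```python
def clip_force(action, force_mag=10.0):
    """Clamp the first action component to [-force_mag, force_mag].

    `action` is a length-1 sequence, matching the shape a continuous
    policy would emit for a one-dimensional action space.
    """
    raw = float(action[0])
    return max(-force_mag, min(force_mag, raw))


print(clip_force([25.0]))   # 10.0  (saturated at +F_max)
print(clip_force([-3.5]))   # -3.5  (passed through unchanged)
print(clip_force([-40.0]))  # -10.0 (saturated at -F_max)
```

Clipping inside `step()` keeps the physics bounded even if the policy samples an action outside the declared range.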
5. Observation Space: From 4D to 5D
The original CartPole observation is 4-dimensional:
    [x, x_dot, theta, theta_dot]
For swing-up, I changed it to:
    [x, x_dot, cos(theta), sin(theta), theta_dot]
This makes the observation 5-dimensional.
The reason is that angle is a periodic variable. For example:
    theta = pi
Angles close to pi and angles close to -pi represent almost the same physical pole direction, but numerically they look very different if we feed theta directly into a neural network.
Using cos(theta) and sin(theta) avoids this discontinuity:
    def _get_obs(self):
This representation is more stable for learning because the pole angle can rotate through the full circle during swing-up.
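The discontinuity can be made concrete by comparing two poles on either side of the downward position. This is a small sketch using only the standard library; `get_obs` is a stand-alone stand-in for the `_get_obs` method, and the 0.01 rad offset is an arbitrary illustration value.

```python
import math


def get_obs(x, x_dot, theta, theta_dot):
    """Build the 5-D observation with the angle encoded as (cos, sin)."""
    return (x, x_dot, math.cos(theta), math.sin(theta), theta_dot)


# Two poles that are physically almost identical: both hang nearly
# straight down, 0.01 rad to either side of the downward direction.
theta_a = math.pi - 0.01
theta_b = -math.pi + 0.01

obs_a = get_obs(0.0, 0.0, theta_a, 0.0)
obs_b = get_obs(0.0, 0.0, theta_b, 0.0)

raw_gap = abs(theta_a - theta_b)                          # about 6.26
enc_gap = max(abs(a - b) for a, b in zip(obs_a, obs_b))   # about 0.02

print(f"raw angle gap:     {raw_gap:.3f}")
print(f"encoded angle gap: {enc_gap:.3f}")
```

With the raw angle the network would see a jump of nearly 2*pi between these two states, while the (cos, sin) encoding keeps them close together.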
6. Initial State: Start from the Downward Position
In standard CartPole, the pole starts near upright. For swing-up, I reset the pole near the downward position:
    theta = np.pi + self.np_random.uniform(low=-0.05, high=0.05)
The small random noise prevents every episode from starting in exactly the same symmetric state.
The reset function initializes the state as:
    x = self.np_random.uniform(low=-0.05, high=0.05)
Here, I define:
    theta = 0  -> pole is upright
    theta = pi -> pole is hanging downward
So the agent must learn to first inject energy into the system, swing the pole upward, and then balance it.
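The reset logic described in this section can be sketched without Gymnasium as follows. The ±0.05 noise on `x` and `theta` comes from the lines shown above; applying the same noise range to `x_dot` and `theta_dot` is an assumption, and `sample_initial_state` is a hypothetical helper that uses `random.Random` in place of `self.np_random`.

```python
import math
import random


def sample_initial_state(rng, noise=0.05):
    """Sample (x, x_dot, theta, theta_dot) near the downward position."""
    x = rng.uniform(-noise, noise)
    x_dot = rng.uniform(-noise, noise)            # assumed noise range
    theta = math.pi + rng.uniform(-noise, noise)  # near downward
    theta_dot = rng.uniform(-noise, noise)        # assumed noise range
    return x, x_dot, theta, theta_dot


rng = random.Random(0)  # seeded so runs are reproducible
state = sample_initial_state(rng)
print(state)
```

Seeding the generator makes a given episode start reproducible, which helps when comparing training runs.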
7. Termination: Remove Angle-Based Termination
In standard CartPole, the episode terminates when the pole angle exceeds a small threshold. That makes sense for a pure balancing task, but it is not suitable for swing-up.
For swing-up, the pole must be allowed to fall, rotate, and swing through large angles. Therefore, I removed the angle-based termination condition.
The new termination rule is only based on the cart leaving the track:
    terminated = bool(x < -self.x_threshold or x > self.x_threshold)
The episode is truncated when it reaches the maximum length:
    truncated = bool(self.step_count >= self.max_steps)
In my implementation:
    self.x_threshold = 4.8
So the task gives the agent enough time to swing the pole up and stabilize it.
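Put together, the two end-of-episode rules look like the sketch below. The helper `episode_flags` is hypothetical; the defaults use the 4.8 track limit shown above and the 1000-step episode length from the summary table.

```python
def episode_flags(x, step_count, x_threshold=4.8, max_steps=1000):
    """Return (terminated, truncated) for the swing-up task.

    terminated: the cart left the track (a failure).
    truncated:  the time limit was reached (not a failure).
    The pole angle is deliberately not checked.
    """
    terminated = bool(x < -x_threshold or x > x_threshold)
    truncated = bool(step_count >= max_steps)
    return terminated, truncated


print(episode_flags(x=0.3, step_count=10))    # (False, False)
print(episode_flags(x=-5.0, step_count=10))   # (True, False)
print(episode_flags(x=0.3, step_count=1000))  # (False, True)
```

Keeping the two flags separate matters for value bootstrapping: a truncated episode did not fail, it only ran out of time.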
8. Dynamics: Simulating Cart-Pole Physics
The environment updates the cart-pole state manually inside step().
The agent outputs a horizontal force applied to the cart. The environment then computes the resulting cart acceleration and pole angular acceleration:
    temp = (
Then the state is updated with Euler integration:
    x = x + self.tau * x_dot
Finally, the angle is wrapped into the interval [-pi, pi) for numerical stability:
    theta = ((theta + np.pi) % (2 * np.pi)) - np.pi
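A quick check of the wrapping formula, written with `math` instead of NumPy; `wrap_angle` is a hypothetical helper name. Note that the result lies in the half-open interval [-pi, pi), so an input of exactly pi maps to -pi.

```python
import math


def wrap_angle(theta):
    """Map any angle to the equivalent angle in [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


print(wrap_angle(0.0))      # 0.0, upright stays upright
print(wrap_angle(math.pi))  # -pi, the same downward direction
print(wrap_angle(3.5))      # 3.5 - 2*pi, about -2.783
```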
This means the policy does not directly control the pole. It only controls the cart force. The pole moves as a result of the cart-pole dynamics.
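The full update can be sketched as a single function. This is not necessarily the exact code used in this post: it assumes the classic cart-pole equations and the physical constants from Gymnasium's CartPole-v1 (gravity 9.8, cart mass 1.0, pole mass 0.1, half pole length 0.5, time step 0.02), and `euler_step` is a hypothetical name.

```python
import math

# Assumed constants (Gymnasium CartPole-v1 defaults).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
HALF_LENGTH = 0.5  # half of the pole length
TAU = 0.02         # seconds per step
TOTAL_MASS = MASS_CART + MASS_POLE
POLEMASS_LENGTH = MASS_POLE * HALF_LENGTH


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one explicit Euler step.

    theta = 0 is upright; `force` is the horizontal force on the cart.
    """
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # Accelerations from the cart-pole equations of motion.
    temp = (force + POLEMASS_LENGTH * theta_dot**2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t**2 / TOTAL_MASS)
    )
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS

    # Explicit Euler: positions use the old velocities.
    x = x + TAU * x_dot
    x_dot = x_dot + TAU * x_acc
    theta = theta + TAU * theta_dot
    theta_dot = theta_dot + TAU * theta_acc

    # Keep the angle in [-pi, pi).
    theta = ((theta + math.pi) % (2 * math.pi)) - math.pi
    return x, x_dot, theta, theta_dot


# Pushing the cart to the right from rest accelerates the cart forward
# and gives the upright pole a negative angular velocity.
print(euler_step((0.0, 0.0, 0.0, 0.0), force=10.0))
```

The force enters only through `temp`, which is how the sketch reflects the point above: the policy acts on the cart, and the pole responds through the coupled dynamics.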
9. Reward Design
The standard CartPole reward is usually simple:
    +1 for every step the pole remains balanced
This is not enough for swing-up because the pole starts downward. The agent needs a reward signal that encourages it to move toward the upright position.
I used the following shaped reward:
    upright_reward = (np.cos(theta) + 1.0) / 2.0
The main term is:
    upright_reward = (np.cos(theta) + 1.0) / 2.0
This gives:
    theta = 0  -> upright_reward = 1
    theta = pi -> upright_reward = 0
So the reward becomes larger as the pole gets closer to upright.
The penalty terms encourage stable behavior:
    cart_penalty -> keep the cart near the center
Together, the reward encourages the agent to swing the pole up, keep the cart within bounds, and stabilize the system smoothly.
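As an illustration of how such a shaped reward could be assembled, here is a minimal sketch. Only `upright_reward` and the idea of a `cart_penalty` come from this post; the velocity and force penalties and all three weights are assumptions chosen for the example, not the values used in training.

```python
import math


def shaped_reward(x, x_dot, theta, theta_dot, force,
                  cart_w=0.01, vel_w=0.001, force_w=0.001):
    """Upright reward minus small stability penalties.

    The weights cart_w, vel_w and force_w are hypothetical.
    """
    upright_reward = (math.cos(theta) + 1.0) / 2.0       # in [0, 1]
    cart_penalty = cart_w * x**2                         # stay near center
    velocity_penalty = vel_w * (x_dot**2 + theta_dot**2) # assumed term
    force_penalty = force_w * force**2                   # assumed term
    return upright_reward - cart_penalty - velocity_penalty - force_penalty


print(shaped_reward(0.0, 0.0, 0.0, 0.0, 0.0))      # 1.0 at rest, upright
print(shaped_reward(0.0, 0.0, math.pi, 0.0, 0.0))  # 0.0 hanging downward
```

Keeping the penalty weights small relative to the upright term means the agent is not discouraged from the large cart motions it needs during the swing-up phase.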
10. Training with PPO
The PPO training code stays mostly the same. The main change is replacing the original environment with the custom swing-up environment:
    env = CartPoleSwingUpEnv()
The number of training steps is larger than in the standard CartPole experiment because swing-up is a harder task. The policy has to learn both energy injection and balancing.
After training, the model is saved as:
    model.save("output/ppo_cartpole_swingup")
The saved output contains the trained PPO policy, value network, and related model metadata. The custom environment code itself is not saved inside the model file, so inference still needs the same CartPoleSwingUpEnv definition.
11. Summary of Main Changes
| Component | CartPole-v1 | CartPole Swing-Up |
|---|---|---|
| Initial pole position | Near upright | Near downward |
| Action space | Discrete(2) | Continuous force Box([-F_max, F_max]) |
| Observation | [x, x_dot, theta, theta_dot] | [x, x_dot, cos(theta), sin(theta), theta_dot] |
| Pole angle termination | Yes | No |
| Cart position termination | Yes | Yes |
| Reward | Usually +1 per step | Upright reward minus stability penalties |
| Max episode length | 500 steps | 1000 steps |
| Task difficulty | Balance only | Swing-up + balance |
12. Takeaway
The standard CartPole task is mainly a balancing problem. The pole already starts near the correct position, so the agent only needs to keep it there.
CartPole swing-up is a more challenging control problem. The agent must first learn how to create enough momentum to swing the pole upward, and then transition into a balancing policy.
This requires changes to the environment design:
    1. Start the pole downward
After these changes, the same PPO algorithm can be used to train a policy that learns the full swing-up behavior.
http://gong208.github.io/2026/06/03/2026-06-03-SwingUpCartpole/