2026-06-01-Slahmr_examples

Preliminary SLAHMR Reproduction and Failure Case Analysis

Motivation

Following the suggestion on Prof. Kanazawa’s webpage, I reproduced SLAHMR on several in-the-wild videos and analyzed its behavior under different types of human and camera motion.

A preliminary observation is that SLAHMR appears more stable on videos with mostly horizontal near-ground motion, such as field sports or ice skating, but struggles more on videos involving strong vertical or airborne motion, such as diving.

Brief Method Reminder

SLAHMR estimates relative camera motion from static-scene pixel motion using DROID-SLAM. It then jointly optimizes human trajectories, camera scale, and a ground plane using 2D keypoint reprojection and learned human motion priors.

Therefore, the final world-frame reconstruction depends on both:

  1. the quality of the estimated relative camera motion, and
  2. whether the learned human motion prior is appropriate for the input sequence.
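
As a rough illustration of how these two factors interact, the sketch below writes a 2D keypoint reprojection error in which the camera translation enters only up to a scale factor. The pinhole model, the function names, and all numbers are assumptions made for this note; it is not SLAHMR's actual code.

```python
def project(point_cam, focal, center):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point_cam
    return (focal * x / z + center[0], focal * y / z + center[1])


def world_to_camera(point_world, rotation, translation, scale):
    """p_cam = R @ p_world + scale * t, where t is in arbitrary SLAM units."""
    return tuple(
        sum(rotation[i][j] * point_world[j] for j in range(3)) + scale * translation[i]
        for i in range(3)
    )


def reprojection_error(joints_world, keypoints_2d, rotation, translation, scale,
                       focal=1000.0, center=(640.0, 360.0)):
    """Sum of squared pixel distances between projected joints and detections."""
    error = 0.0
    for joint, (u_obs, v_obs) in zip(joints_world, keypoints_2d):
        u, v = project(world_to_camera(joint, rotation, translation, scale), focal, center)
        error += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return error


# Hypothetical toy data: two joints, identity rotation, SLAM translation in unknown units.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
slam_translation = (0.1, 0.0, 0.0)
joints = [(0.0, 0.0, 4.0), (0.5, 1.0, 4.0)]
true_scale = 2.0

# Simulate detections by projecting the joints with the true scale.
detections = [
    project(world_to_camera(j, identity, slam_translation, true_scale), 1000.0, (640.0, 360.0))
    for j in joints
]

error_at_true_scale = reprojection_error(joints, detections, identity, slam_translation, 2.0)
error_at_wrong_scale = reprojection_error(joints, detections, identity, slam_translation, 1.0)
print(error_at_true_scale, error_at_wrong_scale)
```

In this toy the joints are fixed in the world, so the reprojection term alone identifies the scale. In the real problem the human trajectory is also unknown, so a wrong scale can be absorbed by moving the human instead, which is why the motion prior matters.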

Examples

Horizontal / Near-Ground Motion

Observation
For horizontal near-ground motion, the recovered human trajectory is relatively smooth and appears consistent with the scene motion. This may be because the person moves near a stable ground plane and the motion pattern is closer to the distribution modeled by the human motion prior. In the ice-skating example, the quality of the human mesh may have suffered from the rapid, complicated rotation of the two athletes. In the stair-climbing examples, even though the movement has a large vertical component, the model recovers the slope of the stairs and the human and camera motion up that slope well. I believe this is because the motion stays grounded on a support surface rather than becoming airborne.

Vertical / Airborne Motion

Observation
For diving-like vertical motion, SLAHMR can fail after world-frame optimization even when the PHALP+ initialization estimates a reasonable body mesh. The sequence itself proceeds as follows: the diver starts on the platform, jumps, stays airborne for several seconds with no valid support surface, and then enters the water, which is a different physical plane. However, SLAHMR uses a static world coordinate system, a single time-invariant ground plane, and contact-height constraints across the whole sequence. The ground plane is effectively anchored by the early platform frames, while later airborne and water-entry frames are still optimized under the same support assumption. Moreover, given the data the priors were trained on, I believe the HuMoR/contact priors are better suited to near-ground locomotion than to airborne actions such as diving. Consequently, the optimization can distort the recovered human and camera trajectories despite a plausible local pose initialization.
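
To make the single-plane issue concrete, here is a toy calculation with made-up foot heights for a dive. The penalty form, the contact weights, and the heights are all assumptions for illustration and are not taken from SLAHMR's implementation.

```python
def contact_height_penalty(foot_heights, contact_weights, ground_height):
    """Penalize squared foot-to-ground distance on frames treated as contact."""
    return sum(
        w * (h - ground_height) ** 2
        for h, w in zip(foot_heights, contact_weights)
    )


# Hypothetical dive: two platform frames at 10 m, two airborne frames, water entry at 0 m.
foot_heights = [10.0, 10.0, 7.5, 4.0, 0.0]
# Assumption: a contact predictor marks the platform frames and the water-entry frame as contact.
contact_weights = [1.0, 1.0, 0.0, 0.0, 1.0]

penalty_platform_plane = contact_height_penalty(foot_heights, contact_weights, 10.0)
penalty_water_plane = contact_height_penalty(foot_heights, contact_weights, 0.0)

# Scan candidate plane heights from 0 m to 10 m in 0.5 m steps.
candidates = [0.5 * k for k in range(21)]
best_height = min(
    candidates,
    key=lambda g: contact_height_penalty(foot_heights, contact_weights, g),
)
best_penalty = contact_height_penalty(foot_heights, contact_weights, best_height)
print(penalty_platform_plane, penalty_water_plane, best_height, best_penalty)
```

Under these assumptions no single plane height brings the penalty to zero, so the optimizer has to compensate elsewhere, for example by bending the human or camera trajectory.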

Pattern

SLAHMR appears more reliable when:

  • the human moves near a stable ground plane,
  • the video contains sufficient static-scene parallax,
  • the motion resembles common near-ground human locomotion,
  • and the initial pose/tracking estimates are reliable.

It becomes more fragile when:

  • the movement is mostly airborne,
  • the motion is strongly vertical,
  • the background has weak parallax,
  • or the subject undergoes unusual poses and fast rotations.

Possible hypotheses for why SLAHMR struggles on videos with vertical movement:

1. Scale Recovery Relies on Human Motion Plausibility

The relative camera translation from monocular SLAM is only available up to an unknown scale. SLAHMR estimates this scale by optimizing the camera and human trajectories together under human motion priors.

For common horizontal motion, such as running or skating, the human displacement is likely closer to the prior distribution. For diving, the vertical motion, fast rotation, and airborne phase may be out of distribution, making scale recovery less reliable.
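
The dependence of scale on motion plausibility can be sketched with a toy search over candidate scales. Everything here is hypothetical: the identity camera rotation, the "expected step length" standing in for a learned prior, and the numbers. SLAHMR itself optimizes with a learned motion prior rather than this grid search.

```python
import math


def world_positions(camera_positions_slam, human_in_camera, scale):
    """World position of the human, assuming identity camera rotation:
    p_world = p_cam + scale * camera_position (camera position in SLAM units)."""
    return [
        tuple(h[i] + scale * c[i] for i in range(3))
        for c, h in zip(camera_positions_slam, human_in_camera)
    ]


def mean_step_length(positions):
    """Average per-frame displacement of a trajectory."""
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    return sum(steps) / len(steps)


def pick_scale(camera_positions_slam, human_in_camera, expected_step, candidates):
    """Toy stand-in for a motion prior: choose the scale whose mean step
    length is closest to what the prior expects."""
    def mismatch(scale):
        trajectory = world_positions(camera_positions_slam, human_in_camera, scale)
        return abs(mean_step_length(trajectory) - expected_step)
    return min(candidates, key=mismatch)


# Hypothetical tracking shot: the camera moves 0.1 SLAM units per frame and
# keeps the person at a fixed position in the camera frame.
camera_positions = [(0.1 * k, 0.0, 0.0) for k in range(10)]
human_in_camera = [(0.0, 0.0, 5.0)] * 10
candidates = [0.5 * k for k in range(1, 11)]  # 0.5, 1.0, ..., 5.0

# A prior that expects about 0.2 m per frame selects scale 2.0,
# whatever the true scale was.
recovered_scale = pick_scale(camera_positions, human_in_camera, 0.2, candidates)
print(recovered_scale)
```

If the true motion were faster than the prior expects, say 0.5 m per frame during a dive, the same procedure would still return 2.0 rather than 5.0. That is the kind of bias this hypothesis describes.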

2. Ground and Contact Priors Are Less Suitable for Airborne Motion

SLAHMR uses a ground plane and contact-related constraints to encourage physically plausible human motion. These assumptions are useful for near-ground locomotion, but they are less appropriate for diving, where the person may have no stable ground contact for most of the sequence.

As a result, the optimizer may have difficulty deciding whether image motion should be explained by camera motion, vertical human displacement, zoom, or pose-estimation error.
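
A small pinhole-camera calculation shows why this ambiguity gets worse with weak parallax. It is a simplified sketch that ignores camera rotation and zoom, and the focal length and depths are invented for illustration.

```python
def pixel_shift(focal, delta_y, depth):
    """Vertical pixel shift of a point at `depth` (m) whose camera-frame
    height changes by `delta_y` (m), under a pinhole model."""
    return focal * delta_y / depth


focal = 1000.0        # hypothetical focal length in pixels
human_depth = 10.0    # hypothetical camera-to-diver distance in meters
displacement = 1.0    # one meter of relative vertical motion

# The diver's image shift is the same whether the diver drops by 1 m or the
# camera rises by 1 m: only the relative motion enters the projection.
human_shift = pixel_shift(focal, displacement, human_depth)

# Only camera motion moves the static background, and that cue shrinks with depth.
near_background_shift = pixel_shift(focal, displacement, 20.0)
far_background_shift = pixel_shift(focal, displacement, 2000.0)
print(human_shift, near_background_shift, far_background_shift)
```

With a background 20 m away, a 1 m camera translation moves the scene by tens of pixels and is easy to separate from human motion. With a distant background the shift falls below a pixel, so the static scene offers little evidence either way.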

Summary

The preliminary results suggest that SLAHMR works best when the video provides reliable static-scene parallax and the human motion can be explained as plausible near-ground locomotion.

Diving-like videos violate both assumptions: the camera motion may be geometrically ambiguous, and the human motion is often airborne, vertical, and out of distribution for ground-contact-based human motion priors.

These cases may therefore provide an interesting direction for improving human-camera motion reconstruction in more general in-the-wild videos.

Author

Jiangshan Gong

Posted on

2026-06-01

Updated on

2026-06-01
