Kinemation

A computer vision pipeline for converting human movement in video into temporally coherent, smooth animated stick figures.

Kinemation detects body joints from monocular video, enforces temporal consistency across frames, and renders the resulting motion as a stick figure animation.

---

Project Overview

Standard pose estimation models operate frame-by-frame, producing skeletal estimates that are accurate in isolation but temporally incoherent in sequence - joints flicker, limbs snap between positions, and the resulting animation is visually noisy. Kinemation addresses this by treating video as a temporal signal rather than a collection of independent images.

The longer-term goal extends beyond geometric accuracy. By integrating body-language-based emotion recognition, Kinemation aims to produce stick figures that not only move like the subject but also express like them - mapping inferred emotional state to visual properties of the animation such as posture, joint expressiveness, motion dynamics, and rendering style.

---

Pipeline Architecture

INPUT VIDEO
     |
     v
2D Pose Estimator (MediaPipe BlazePose)
     |  N frames x 33 landmarks
     v
Keypoint Adapter (MediaPipe -> H36M format)
     |  N x 17 x 2 keypoint sequence
     v
Temporal Smoothing Layer (VideoPose3D TCN)
     |  smooth 3D poses: N x 17 x 3
     v
Stick Figure Renderer
     |  visual parameters
     v
OUTPUT VIDEO
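
The Keypoint Adapter stage reduces 33 BlazePose landmarks to 17 joints per frame. The sketch below shows one way this could look. It is not the project's adapter: the function names, the 17-joint ordering (the Human3.6M ordering commonly used with VideoPose3D), and the choice to synthesize the pelvis, spine, thorax and head as midpoints are all assumptions made for illustration.

```python
# Hypothetical adapter sketch: joint order and synthesized joints are assumptions.

# BlazePose landmark indices used below.
NOSE, L_EAR, R_EAR = 0, 7, 8
L_SHOULDER, R_SHOULDER = 11, 12
L_ELBOW, R_ELBOW = 13, 14
L_WRIST, R_WRIST = 15, 16
L_HIP, R_HIP = 23, 24
L_KNEE, R_KNEE = 25, 26
L_ANKLE, R_ANKLE = 27, 28


def midpoint(a, b):
    """Average two (x, y) points."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def blazepose_to_h36m(landmarks):
    """Map one frame of 33 (x, y) landmarks to 17 (x, y) joints."""
    if len(landmarks) != 33:
        raise ValueError("expected 33 BlazePose landmarks")
    lm = landmarks
    pelvis = midpoint(lm[L_HIP], lm[R_HIP])
    thorax = midpoint(lm[L_SHOULDER], lm[R_SHOULDER])
    spine = midpoint(pelvis, thorax)
    head = midpoint(lm[L_EAR], lm[R_EAR])
    return [
        pelvis,          # 0  hip centre (synthesized)
        lm[R_HIP],       # 1
        lm[R_KNEE],      # 2
        lm[R_ANKLE],     # 3
        lm[L_HIP],       # 4
        lm[L_KNEE],      # 5
        lm[L_ANKLE],     # 6
        spine,           # 7  (synthesized)
        thorax,          # 8  (synthesized)
        lm[NOSE],        # 9  neck/nose slot
        head,            # 10 (synthesized from the ears)
        lm[L_SHOULDER],  # 11
        lm[L_ELBOW],     # 12
        lm[L_WRIST],     # 13
        lm[R_SHOULDER],  # 14
        lm[R_ELBOW],     # 15
        lm[R_WRIST],     # 16
    ]


def adapt_sequence(frames):
    """Apply the per-frame mapping to N frames, giving an N x 17 x 2 nested list."""
    return [blazepose_to_h36m(frame) for frame in frames]
```

Whichever ordering is chosen has to match what the downstream checkpoint was trained on, so the index table is the part most worth verifying against the model in use.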

---

Approaches Explored

Pose Estimation

The following methods were surveyed and evaluated for suitability in the Kinemation pipeline. Evaluation criteria included real-time performance, keypoint accuracy, ease of integration with downstream modules, and community support.

Tools and Libraries Evaluated

Tool	Type	Keypoints	Status
MediaPipe BlazePose	2D/3D, real-time	33	In use
OpenPose	2D, bottom-up	25	Surveyed
OpenCV DNN	2D, lightweight	18	Surveyed
AlphaPose	2D, top-down	17/26	Surveyed
RTMPose	2D, real-time	17	Planned
HRNet	2D, top-down	17	Surveyed
ViTPose	2D, transformer	17	Surveyed

Key Papers Reviewed

Cao et al. (2017) - Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields (OpenPose) - arXiv:1611.08050
Sun et al. (2019) - Deep High-Resolution Representation Learning for Human Pose Estimation (HRNet) - arXiv:1902.09212
Newell et al. (2016) - Stacked Hourglass Networks for Human Pose Estimation - arXiv:1603.06937
Yang et al. (2021) - TransPose: Keypoint Localization via Transformer - arXiv:2012.14214
Xu et al. (2022) - ViTPose: Simple Vision Transformer Baselines - arXiv:2204.12484
Jiang et al. (2023) - RTMPose: Real-Time Multi-Person Pose Estimation - arXiv:2303.07399
Bazarevsky et al. (2020) - BlazePose: On-device Real-time Body Pose Tracking - arXiv:2006.10204

---

Temporal Smoothing and Coherence

Temporal coherence is the primary active research focus. Three papers were studied in depth, representing distinct approaches to the problem.

Paper 1 - Temporal Bundle Adjustment

Arnab, Doersch & Zisserman (CVPR 2019) - Exploiting Temporal Context for 3D Human Pose Estimation in the Wild - arXiv:1905.04668

Treats the entire video as a single global optimization problem. Per-frame 2D keypoints are fed into an HMR model to produce SMPL mesh estimates (beta - body shape, theta - joint angles). Bundle Adjustment then jointly minimizes a compound loss E = E_R + E_T + E_P across all frames simultaneously using L-BFGS. Body shape beta is held constant across frames to enforce anatomical consistency. Robust to noisy detections via Huber loss and camera shake via hinge loss.

Strengths: globally consistent, principled, handles real-world noise well
Limitations: requires full video upfront, computationally heavy, not suitable for real-time use
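
To illustrate why a Huber loss helps with noisy detections, the sketch below compares it with a plain squared loss on a few residuals. It uses the standard Huber definition. The threshold `delta` and the residual values are made up for illustration and are not taken from the paper.

```python
def huber(residual, delta=1.0):
    """Quadratic for small residuals, linear once |residual| exceeds delta."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def squared(residual):
    """Plain squared loss, for comparison."""
    return 0.5 * residual * residual


# Illustrative reprojection residuals in pixels; the last mimics a bad detection.
residuals = [0.5, -0.8, 0.3, 20.0]
huber_total = sum(huber(r) for r in residuals)
squared_total = sum(squared(r) for r in residuals)
```

Under the squared loss the single outlier dominates the total, while under the Huber loss its contribution grows only linearly, so the optimizer is pulled less strongly toward a bad detection.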

Paper 2 - Temporal Convolutional Network (Primary Implementation Target)

Pavllo et al. (CVPR 2019) - 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training - arXiv:1811.11742

Takes a sequence of 2D keypoints (T x J x 2) and uses a dilated 1D TCN with exponentially increasing dilation rates (1, 3, 9, 27, ...) to infer a 3D pose for the center frame of each window. Temporal smoothness is learned from data rather than explicitly optimized. Includes a semi-supervised extension using a back-projection loss, enabling training on unlabeled video without 3D ground truth. Inference is a single forward pass, making it significantly more practical than bundle adjustment for a real pipeline.

Strengths: near-real-time, detector-agnostic, plug-and-play with MediaPipe outputs, semi-supervised capability
Limitations: less globally consistent than bundle adjustment; receptive field of 243 frames requires padding on short clips
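
The 243-frame figure and the padding it implies can be checked with a little arithmetic. The sketch below assumes five convolution layers of filter width 3, each with a dilation equal to the product of the preceding widths, which is one configuration consistent with 243 frames. `pad_edges` is a hypothetical helper that replicates the first and last frame. Edge replication is one common choice, not necessarily the one this project will use.

```python
def receptive_field(filter_widths):
    """Number of input frames seen by one output of a dilated 1D conv stack."""
    frames = 1
    dilation = 1
    for width in filter_widths:
        frames += (width - 1) * dilation
        dilation *= width  # the next layer's dilation is the product of widths so far
    return frames


def pad_edges(sequence, rf):
    """Replicate the end frames so every original frame can sit at a window centre."""
    pad = (rf - 1) // 2
    return [sequence[0]] * pad + list(sequence) + [sequence[-1]] * pad


rf = receptive_field([3, 3, 3, 3, 3])
clip = list(range(50))            # stand-in for 50 frames of keypoints
padded = pad_edges(clip, rf)
num_outputs = len(padded) - rf + 1  # one output per original frame
```

With this configuration, a clip shorter than 243 frames consists mostly of replicated frames after padding, which is why short clips are listed as a limitation.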

Paper 3 - Bidirectional 2D Temporal Refinement

Liu et al. (CVPR 2021) - Deep Dual Consecutive Network for Human Pose Estimation (DCPose) - arXiv:2103.07254

Operates purely in 2D. Takes a triplet of frames (t-k, t, t+k). A Pose Temporal Merger (PTM) aggregates the keypoint heatmaps of the three frames to narrow the search region for each joint, a Pose Residual Fusion (PRF) module computes weighted pose residuals in both temporal directions, and a Pose Correction Network (PCN) built on deformable convolutions refines the merged heatmaps into the final estimate. Occlusion recovery is an emergent property - joints hidden at frame t but visible at t+/-k are recovered through this pipeline without any explicit occlusion modeling.

Strengths: stays in 2D, natural occlusion handling, architecturally elegant
Limitations: operates on intermediate feature maps rather than final keypoint coordinates, making it non-trivial to integrate with arbitrary 2D detectors

Planned Upgrade - MotionBERT

Zhu et al. (ICCV 2023) - MotionBERT: A Unified Perspective on Learning Human Motion Representations - arXiv:2210.06551

Transformer-based temporal model accepting the same N x 17 x 2 input format as VideoPose3D, making it a straightforward upgrade once the base temporal pipeline is established. Demonstrates consistent accuracy improvements over TCN-based approaches, particularly on fast motion and occluded joints.

---

Work Completed

Base 2D pose estimation pipeline using MediaPipe BlazePose, with per-frame keypoint extraction on arbitrary video input
Stick figure renderer connecting skeletal joint detections to a 2D drawing layer with full bone connectivity
Literature survey covering 35+ papers across pose estimation paradigms including top-down, bottom-up, heatmap-based, regression-based, transformer-based, 3D, and video-based approaches
In-depth study of three temporal smoothing papers (Arnab et al., Pavllo et al., Liu et al.) covering global optimization, learned TCN-based smoothing, and bidirectional 2D refinement

---

Work In Progress

Keypoint adapter mapping MediaPipe's 33-landmark format to the Human3.6M 17-joint format required by VideoPose3D
VideoPose3D repository setup, with dependencies resolved and pretrained checkpoint downloaded
Full video-to-3D-pose inference pipeline: MediaPipe extraction -> H36M adapter -> dilated TCN -> smooth 3D output
Jitter metric for quantitative evaluation of temporal smoothness (mean acceleration magnitude across joints) for before/after comparison
3D-to-2D back-projection to feed smooth poses back into the existing stick figure renderer
Edge case handling: interpolation for missing detections, sequence padding for short clips, causal mode for faster inference
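
As an illustration of the interpolation item above, the sketch below fills gaps in a single joint's track by linear interpolation between the nearest detections on either side. `fill_missing` is a hypothetical helper, and representing a missed detection as `None` is an assumption. The real pipeline may well flag missing joints differently, for example through a visibility score.

```python
def fill_missing(track):
    """Linearly interpolate None entries in a list of (x, y) positions for one joint.

    Gaps at the start or end are filled by repeating the nearest detection.
    """
    known = [i for i, p in enumerate(track) if p is not None]
    if not known:
        raise ValueError("no detections to interpolate from")
    filled = list(track)
    for i in range(len(track)):
        if filled[i] is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = track[nxt]
        elif nxt is None:
            filled[i] = track[prev]
        else:
            w = (i - prev) / (nxt - prev)  # fraction of the way from prev to nxt
            (x0, y0), (x1, y1) = track[prev], track[nxt]
            filled[i] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return filled


track = [None, (0.0, 0.0), None, None, (3.0, 6.0), None]
filled = fill_missing(track)
```

Linear interpolation is only reasonable for short gaps. Over long gaps it produces straight-line motion that the temporal model may then treat as real.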

---

Roadmap

Phase 1 - Temporal Smoothing (current): Integrate VideoPose3D as the temporal smoothing backbone. Benchmark against raw MediaPipe output using the jitter metric. Produce side-by-side demo videos. Evaluate MotionBERT as an upgrade path once the baseline is stable.
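
The jitter metric is described earlier as mean acceleration magnitude across joints. A minimal sketch of that definition follows, assuming acceleration is approximated by the second finite difference of joint positions and that poses are plain nested lists. The function name and data layout are illustrative, not the project's API.

```python
import math


def jitter(poses):
    """Mean acceleration magnitude over all joints and interior frames.

    poses: list of frames, each a list of joints, each joint a tuple of coordinates.
    Acceleration is the second finite difference p[t+1] - 2*p[t] + p[t-1],
    in coordinate units per frame squared.
    """
    if len(poses) < 3:
        raise ValueError("need at least 3 frames")
    total, count = 0.0, 0
    for t in range(1, len(poses) - 1):
        for prev, cur, nxt in zip(poses[t - 1], poses[t], poses[t + 1]):
            accel = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
            total += math.sqrt(sum(a * a for a in accel))
            count += 1
    return total / count


# One joint moving at constant velocity, then the same path with one frame perturbed.
steady = [[(float(t), 0.0)] for t in range(6)]
noisy = [[(float(t), 1.0 if t == 3 else 0.0)] for t in range(6)]
```

Constant-velocity motion scores zero, so the metric penalizes frame-to-frame flicker rather than movement itself. A fair before/after comparison needs both sequences expressed in the same coordinate units.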

Phase 2 - 3D Stick Figure Rendering: Extend the stick figure renderer to work in 3D space, enabling viewpoint manipulation and more expressive animation using Open3D or matplotlib 3D axes.
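
Viewpoint manipulation can be prototyped before committing to a 3D library. The sketch below rotates 3D joints about a vertical axis and projects them orthographically into 2D drawing coordinates that a 2D renderer could consume. The axis convention (y vertical), the orthographic camera, and the `scale` and `offset` defaults are all assumptions made for illustration.

```python
import math


def rotate_about_vertical(joints, angle_deg):
    """Rotate (x, y, z) joints about the y axis (assumed vertical) by angle_deg."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]


def project_orthographic(joints, scale=1.0, offset=(0.0, 0.0)):
    """Drop the depth coordinate and map to 2D drawing coordinates."""
    return [(scale * x + offset[0], scale * y + offset[1]) for x, y, _ in joints]


def view_points(joints3d, angle_deg, scale=100.0, offset=(320.0, 240.0)):
    """2D points for one frame as seen from a camera orbited by angle_deg."""
    rotated = rotate_about_vertical(joints3d, angle_deg)
    return project_orthographic(rotated, scale, offset)


joints = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
front = view_points(joints, 0.0)
side = view_points(joints, 90.0)
```

Image coordinates usually have y pointing down, so a real renderer would likely need to flip the vertical axis. A perspective camera would additionally divide by depth.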

Phase 3 - Applying Temporal Smoothing Techniques to Achieve Temporal Consistency of Video Outputs

Phase 4 - Evaluation and Demo: End-to-end evaluation on a curated test set. Quantitative benchmarks on temporal smoothness, 3D accuracy, and emotion classification. Final demo production.

---

Setup and Installation

# Clone the repository
git clone https://github.com/your-org/kinemation
cd kinemation

# Install dependencies
pip install -r requirements.txt

# Clone and set up VideoPose3D
git clone https://github.com/facebookresearch/VideoPose3D
cd VideoPose3D
pip install -r requirements.txt

# Download pretrained checkpoint
wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin

Running the Base Pipeline

python run_pipeline.py --input path/to/video.mp4 --output path/to/output.mp4

Running with Temporal Smoothing

python run_pipeline.py --input path/to/video.mp4 --output path/to/output.mp4 --smooth --checkpoint pretrained_h36m_detectron_coco.bin

---

Team

Mentor: Maaya Mohan

Mentees:

---

References

Paper	Venue	Link
Cao et al. - OpenPose	CVPR 2017	arXiv:1611.08050
Newell et al. - Stacked Hourglass Networks	ECCV 2016	arXiv:1603.06937
Sun et al. - HRNet	CVPR 2019	arXiv:1902.09212
Xu et al. - ViTPose	NeurIPS 2022	arXiv:2204.12484
Jiang et al. - RTMPose	2023	arXiv:2303.07399
Bazarevsky et al. - BlazePose	2020	arXiv:2006.10204
Arnab, Doersch & Zisserman - Temporal Bundle Adjustment	CVPR 2019	arXiv:1905.04668
Pavllo et al. - VideoPose3D	CVPR 2019	arXiv:1811.11742
Liu et al. - DCPose	CVPR 2021	arXiv:2103.07254
Zhu et al. - MotionBERT	ICCV 2023	arXiv:2210.06551
