Generating text-editable and pose-controllable character videos is in high demand for creating diverse digital humans. However, this task has been hindered by the absence of a comprehensive dataset with paired video-pose captions and by the lack of generative prior models for videos. In this work, we design a novel two-stage training scheme that leverages easily obtained data (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to produce pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used for controllable text-to-image generation: we learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the above network for motion on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by these designs, our method generates continuously pose-controllable character videos while preserving the editing and concept-composition abilities of the pre-trained T2I model. The code and models will be made publicly available.
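To make the first-stage idea concrete, below is a minimal PyTorch sketch of a zero-initialized convolutional pose encoder. It is an illustrative assumption, not the released implementation: the module names, channel sizes, and the exact point where the pose residual is added to the frozen T2I U-Net features are all hypothetical. The key property it demonstrates is that zero-initializing the final projection makes the pose branch output zeros at the start of training, so the pre-trained T2I model is initially unaffected.

```python
import torch
import torch.nn as nn


def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the branch starts as a no-op residual."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module


class PoseEncoder(nn.Module):
    """Illustrative keypoint encoder: a small conv stack whose final projection
    is zero-initialized, so adding its output to the T2I U-Net features leaves
    the pre-trained model unchanged at initialization."""

    def __init__(self, in_channels: int = 3, hidden: int = 64, out_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),  # 512 -> 256
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),  # 256 -> 128
            nn.SiLU(),
            zero_module(nn.Conv2d(hidden, out_channels, 3, stride=2, padding=1)),  # 128 -> 64
        )

    def forward(self, pose_map: torch.Tensor) -> torch.Tensor:
        # pose_map: (B, 3, H, W) rendered keypoint-skeleton image
        return self.net(pose_map)


# Hypothetical usage: the pose residual is added to an early U-Net feature map.
pose = torch.randn(2, 3, 512, 512)         # rendered skeletons for 2 samples
unet_feats = torch.randn(2, 320, 64, 64)   # assumed latent features from the frozen T2I U-Net
residual = PoseEncoder()(pose)             # (2, 320, 64, 64); all zeros at init
conditioned = unet_feats + residual
```

In the second stage, analogous lightweight modules (temporal self-attention and cross-frame self-attention) would be inserted and trained on pose-free videos while the pose encoder and T2I backbone stay largely intact.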
@article{ma2023follow,
title={Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos},
author={Ma, Yue and He, Yingqing and Cun, Xiaodong and Wang, Xintao and Shan, Ying and Li, Xiu and Chen, Qifeng},
journal={arXiv preprint arXiv:2304.01186},
year={2023}
}