EasyCache

Abstract

Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3× compared to the original baselines while maintaining high visual fidelity with a significant up to 36% PSNR improvement compared to the previous SOTA method. This improvement makes our EasyCache a efficient and highly accessible solution for high-quality video generation in both research and practical applications.

Method Comparison

Video synchronization issues may occur due to network load. Please consider refreshing the page.

Prompt: "Grassland at dusk, wild horses galloping, golden light flickering across manes."

HunyuanVideo (baseline) / 1142s

Ours / 501s / 2.3x

TeaCache (CVPR 25)/ 680s / 1.7x

PAB (ICLR 25)/ 959s / 1.2x

Prompt: "A piece of paper with 'AI' handwritten as distinct characters on it, gently floating down a calm river, surrounded by autumn leaves."

Wan2.1-14B (baseline) / 928s

Ours / 324s / 2.9x

TeaCache (CVPR 25)/ 628s / 1.5x

PAB (ICLR 25)/ 697s / 1.3x

Prompt: "A joyful dog leaps excitedly on a green park lawn, trying to catch a red frisbee."

HunyuanVideo (baseline) / 1140s

Ours / 448s / 2.5x

TeaCache (CVPR 25)/ 676s / 1.7x

PAB (ICLR 25)/ 963s / 1.2x

Guess the Method

Can you identify which method generated each video?

Prompt: "A top-down view of a barista creating latte art, skillfully pouring milk to form the letters 'TPAMI' on coffee."

HunyuanVideo Original (919s)

A

B

C

Prompt: "On an empty city street, a zebra runs in the sunshine."

HunyuanVideo Original (1141s)

A

B

C

Video Demonstrations

Hover over videos to see prompts (*Different devices and resolution have different inference times)

Wan2.1-14B / 6874s on H20 / 1.9×

A close up view of a glass sphere that has a zen garden within it. There is a small dwarf in the sphere who is raking the zen garden and creating patterns in the sand.

Wan2.1-14B / 6827s on H20 / 2.1×

A flock of paper airplanes flutters through a dense jungle, weaving around trees as if they were migrating birds.

HunyuanVideo / 433s on A800 / 2.7×

A strawberry ice cream rapidly melts on a colorful tabletop under direct sunlight, forming a small puddle.

HunyuanVideo / 489s on A800 / 2.4×

Sunrise on a palm beach, turquoise waves lapping golden sands, palm fronds rustling in breeze.

HunyuanVideo / 507s on A800 / 2.3×

Starry mountain meadow, grass gently rippling, fireflies blinking rhythmically under the Milky Way.

HunyuanVideo / 527s on A800 / 2.2×

Autumn maple forest by a stream, red and orange leaves drifting onto mossy stones, water flowing softly.

HunyuanVideo / 465s on A800 / 2.6×

A campfire burns in a sunlit forest clearing, with bright sparks occasionally leaping out.

Wan2.1-14B / 576s on A800 / 2.7×

An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.

Wan2.1-1.3B / 77s on A800 / 2.3×

Moonlight revealing invisible ink.

Wan2.1-14B / 1531s on A800 / 2.4×

Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

Compatibility with Other Acceleration Techniques

Demo video is tested on NVIDIA H20.

Our method maintains compatibility with other acceleration techniques while still achieving high visual retention.

Cross-Architecture Compatibility

Our method demonstrates consistent performance across different GPU architectures.
(*Prompt: A dog running in front of a bench.)

NVIDIA A800

HunyuanVideo (baseline) 3046s

NVIDIA H20

HunyuanVideo (baseline) 6597s

NVIDIA A800

SVG (ICML 25)

1939s 1.5x

NVIDIA A800

Ours

1442s 2.1x

NVIDIA H20

SVG (ICML 25)

3473s 1.9x

NVIDIA H20

Ours

2998s 2.2x

BibTeX

@article{zhou2025easycache,
  title={Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching},
  author={Zhou, Xin and Liang, Dingkang and Chen, Kaijin and and Feng, Tianrui and Chen, Xiwu and Lin, Hongkai and Ding, Yikang and Tan, Feiyang and Zhao, Hengshuang and Bai, Xiang},
  journal={arXiv preprint arXiv:2507.02860},
  year={2025}
}