Chipmunk Introduces Training-Free Acceleration for Diffusion Transformers


Ted Hisokawa
Apr 22, 2025 02:14

Chipmunk leverages dynamic sparsity to accelerate diffusion transformers, achieving significant speed-ups in video and image generation without additional training.

Together.ai has introduced Chipmunk, a novel approach to accelerating diffusion transformers that promises substantial speed improvements in video and image generation. According to the company, the method uses dynamic column-sparse deltas and requires no additional training.

Dynamic Sparsity for Faster Processing

Chipmunk caches attention weights and MLP activations from previous diffusion steps and dynamically computes sparse deltas against those cached values. This allows it to generate video up to 3.7x faster on models such as HunyuanVideo than traditional dense inference, with a 2.16x speed-up in certain configurations, and delivers up to 1.6x faster image generation on FLUX.1-dev.
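A minimal sketch of the cross-step delta idea, written in PyTorch for illustration (the class name, the simple top-k column selection, and the keep_fraction parameter are assumptions for exposition, not Chipmunk’s actual implementation):

```python
import torch

class DeltaCachedMLP:
    """Hypothetical cross-step delta cache for a two-layer MLP:
    only the hidden columns that changed most since the cached step
    are pushed through the second layer; the rest of the output is
    reused from the cache."""

    def __init__(self, w1, w2, keep_fraction=0.1):
        self.w1, self.w2 = w1, w2           # (d, h) and (h, d) weights
        self.keep_fraction = keep_fraction  # fraction of columns to refresh
        self.cached_hidden = None
        self.cached_out = None

    def forward(self, x):
        hidden = torch.relu(x @ self.w1)    # a real kernel would sparsify this too
        if self.cached_hidden is None:
            # First diffusion step: dense compute, populate the cache.
            self.cached_hidden = hidden
            self.cached_out = hidden @ self.w2
            return self.cached_out

        # Rank hidden columns by how much they changed since the cached step.
        delta = hidden - self.cached_hidden              # mostly near-zero
        k = max(1, int(self.keep_fraction * delta.shape[1]))
        cols = delta.abs().sum(dim=0).topk(k).indices

        # Column-sparse delta matmul: only k rows of w2 participate.
        out = self.cached_out + delta[:, cols] @ self.w2[cols, :]

        # Write the refreshed columns back into the cache.
        self.cached_hidden[:, cols] = hidden[:, cols]
        self.cached_out = out
        return out

# Two steps: the first is dense, the second reuses the cache with a sparse update.
mlp = DeltaCachedMLP(torch.randn(64, 256), torch.randn(256, 64))
out1 = mlp.forward(torch.randn(8, 64))
out2 = mlp.forward(torch.randn(8, 64))
```

Refreshing only the most-changed columns keeps the cache consistent while skipping work on columns that barely moved; stale columns contribute a small approximation error, which is the price paid for the speed-up.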

Addressing Diffusion Transformer Challenges

Diffusion Transformers (DiTs) are widely used for video generation, but their high compute time and cost have limited their accessibility. Chipmunk addresses these challenges by building on two key observations: model activations change slowly from one diffusion step to the next, and they are inherently sparse. Reformulating the computation around cross-step deltas, rather than raw activations, makes that sparsity far more pronounced and therefore cheaper to exploit.
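A quick synthetic check makes the intuition concrete (the tensors and numbers below are simulated for illustration, not measurements from Chipmunk):

```python
import torch

torch.manual_seed(0)
act_prev = torch.randn(1024, 4096)
# Simulate slow-changing activations: the next step perturbs only slightly.
act_next = act_prev + 0.01 * torch.randn(1024, 4096)

threshold = 0.05
raw_sparsity = (act_next.abs() < threshold).float().mean()
delta_sparsity = ((act_next - act_prev).abs() < threshold).float().mean()
print(f"near-zero entries in raw activations: {raw_sparsity:.1%}")   # few
print(f"near-zero entries in the delta:       {delta_sparsity:.1%}")  # nearly all
```

Because nearly every entry of the delta is negligible, computing only the delta (and only its largest columns) replaces a dense matrix multiply with a much smaller one.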

Hardware-Aware Optimization

Chipmunk’s design includes a hardware-aware sparsity pattern: columns that are non-contiguous in global memory are gathered into dense shared-memory tiles. Combined with fast kernels, this lets the sparse computation run as the large, dense blocks that GPUs compute most efficiently, aligned with native tile sizes for optimal performance.
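The gather step can be pictured in plain PyTorch; the shapes and indices below are invented for illustration, and Chipmunk’s released kernels perform this staging on-GPU into shared memory rather than with host-side tensor ops:

```python
import torch

x = torch.randn(1024, 4096)
w = torch.randn(4096, 4096)

# Suppose the sparsity pattern selected these scattered hidden columns.
active_cols = torch.tensor([3, 17, 250, 1023, 2048, 4090])

# Gather the scattered columns into one contiguous, dense tile...
x_tile = x.index_select(1, active_cols)   # (1024, 6)
w_tile = w.index_select(0, active_cols)   # (6, 4096)

# ...so the sparse update runs as a single dense matmul, the block shape
# that tensor cores execute most efficiently.
partial = x_tile @ w_tile
```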

Kernel Optimizations

To further enhance performance, Chipmunk incorporates several kernel optimizations: fast sparsity identification through custom CUDA kernels, efficient cache writeback using the CUDA driver API, and warp-specialized persistent kernels. Together these reduce computation time and resource usage.
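As a rough picture of the sparsity-identification step alone, a PyTorch approximation might look like the following; the function name and scoring rule are assumptions, and the released code fuses this into a custom CUDA kernel rather than running separate ops:

```python
import torch

def identify_active_columns(delta, keep_fraction=0.1):
    # Score each column of the cross-step delta by total magnitude and
    # keep the top fraction; sorting the indices favors coalesced access.
    scores = delta.abs().sum(dim=0)
    k = max(1, int(keep_fraction * scores.numel()))
    return scores.topk(k).indices.sort().values

# Mostly-zero columns stand in for a sparse cross-step delta.
delta = torch.randn(1024, 4096) * (torch.rand(4096) < 0.05)
cols = identify_active_columns(delta)
```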

Open Source and Community Engagement

Together.ai has embraced the open-source community by releasing Chipmunk’s resources on GitHub, inviting developers to explore and leverage these advancements. This initiative is part of a broader effort to accelerate model performance across various architectures, such as FLUX.1-dev and DeepSeek R1.

For more detailed insights and technical documentation, interested readers can access the full blog post on Together.ai.




