token clustering

Temporal Cluster Assignment for Efficient Real-Time Video Segmentation

Vision Transformers, especially Swin, are strong backbones for video segmentation but remain computationally heavy, which limits their use in real-time settings. Conventional token pruning struggles with Swin’s window-based attention, which expects a fixed number of tokens in each window, while existing clustering methods ignore the temporal redundancy between consecutive frames. To address these limitations, the proposed Temporal Cluster Assignment (TCA) leverages temporal coherence across frames to refine token clusters without fine-tuning, improving efficiency and accuracy across multiple video benchmarks, including surgical data.
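
To make the idea of exploiting temporal coherence concrete, the following is a minimal sketch and not the authors' implementation. It assumes one simple, hypothetical rule: a token whose feature vector barely changes between consecutive frames keeps the cluster it had in the previous frame, and only tokens that changed are reassigned to their nearest centroid. The function names, the cosine-similarity test, and the `threshold` parameter are all illustrative assumptions; the summary above does not specify how TCA performs the refinement.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_centroid(token, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to token."""
    def sq_dist(centroid):
        return sum((t - c) ** 2 for t, c in zip(token, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def assign_with_temporal_reuse(prev_tokens, curr_tokens, prev_assign,
                               centroids, threshold=0.9):
    """Hypothetical rule: reuse the previous frame's cluster for stable tokens.

    Tokens are matched by position, so prev_tokens[i] and curr_tokens[i]
    are assumed to describe the same spatial location in adjacent frames.
    """
    assignments = []
    for prev, curr, old_cluster in zip(prev_tokens, curr_tokens, prev_assign):
        if cosine_similarity(prev, curr) >= threshold:
            # Feature is temporally stable: keep the earlier assignment.
            assignments.append(old_cluster)
        else:
            # Feature changed: recompute the assignment for this token only.
            assignments.append(nearest_centroid(curr, centroids))
    return assignments


def merge_tokens(tokens, assignments):
    """Average the tokens of each cluster; returns {cluster_id: mean_vector}."""
    sums, counts = {}, {}
    for token, k in zip(tokens, assignments):
        if k not in sums:
            sums[k] = [0.0] * len(token)
            counts[k] = 0
        sums[k] = [s + t for s, t in zip(sums[k], token)]
        counts[k] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}
```

In this sketch the saving comes from skipping the nearest-centroid search for stable tokens, and `merge_tokens` shows how a cluster assignment reduces the number of tokens passed to later layers. A real Swin-based pipeline would additionally have to keep the token count per attention window fixed, which this toy example does not attempt.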