Partially-Structured Transformer Pruning with Patch-Limited XOR-Gate Compression for Stall-Free Sparse-Model Access
Description
Pruning-based model compression is regarded as an essential technique for deploying recent large transformer models in practical services; however, accessing sparse transformer models falls far short of the ideal speed because irregular memory-access patterns cause frequent memory stalls. Building on the recent XOR-gate compression, which reduces the amount of irregular access, this work presents a novel partially-structured transformer pruning method dedicated to this interface-friendly compression format. Stall-free memory access is first derived by limiting the number of patches per weight, introducing a new trade-off between model quality and effective memory bandwidth.
Then, partially-structured pruning patterns are deployed to provide a better accuracy-bandwidth trade-off by significantly reducing the number of correction patches. By adjusting the patch distribution per weight aggressively, the number of allowed patches can be made even smaller than the number of weight bits, further increasing the effective bandwidth while achieving similar model accuracy. We demonstrate the proposed stall-free XOR-gate compression schemes on pruned DeiT/BERT models with the ImageNet/SQuAD datasets, achieving the highest effective bandwidth for accessing sparse transformers compared with existing stall-based solutions.
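
Below is a minimal illustrative sketch in Python of the patch-limiting idea described above. The function name patch_limited_prune, the per-weight bit layout, and the XOR reference stream are assumptions made for illustration only, not the authors' implementation or data format.

import numpy as np

def patch_limited_prune(weight_bits, xor_stream, max_patches=2):
    # weight_bits: (num_weights, bits_per_weight) 0/1 array of quantized weights (assumed layout)
    # xor_stream:  same shape, the cheap XOR-based reconstruction of each weight (assumed)
    # A "patch" is a bit position where the XOR reconstruction disagrees with the
    # true weight and must be corrected. Capping patches per weight keeps the
    # number of correction reads fixed, which is what enables stall-free access.
    kept = np.ones(len(weight_bits), dtype=bool)
    patches = []
    for i, bits in enumerate(weight_bits):
        mismatch = np.flatnonzero(bits ^ xor_stream[i])
        if mismatch.size > max_patches:
            kept[i] = False                  # prune: correcting this weight needs too many patches
            patches.append(np.empty(0, dtype=int))
        else:
            patches.append(mismatch)         # bounded correction patches for this weight
    return kept, patches

# Example: 8 random 4-bit weights checked against a random XOR reference
rng = np.random.default_rng(0)
w = rng.integers(0, 2, size=(8, 4))
x = rng.integers(0, 2, size=(8, 4))
kept, patches = patch_limited_prune(w, x, max_patches=2)

Tightening max_patches raises the effective bandwidth (fewer correction reads per weight) at the cost of pruning more weights, which is the accuracy-bandwidth trade-off the abstract refers to.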
Event Type
Research Manuscript
Time
Thursday, June 27, 11:30am - 11:45am PDT
Location
3003, 3rd Floor
Topics
AI
Design
Keywords
AI/ML Architecture Design