
FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference
Description
With the fast evolution of large language models (LLMs), privacy concerns arise over user queries, as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query must be formulated as an HE-based matrix-vector multiplication problem and hence suffers from enormous computation and communication overhead. We observe that this overhead mainly stems from neglecting 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low-precision quantization noise. Hence, in this paper, we propose FastQuery, a private embedding table query optimization framework. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., CrypTFlow2, Iron, Cheetah, and CHAM, FastQuery achieves 2.7∼4.5× computation reduction and 75.1∼84.4× communication reduction on both LLAMA-7B and LLAMA-30B.
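The following minimal sketch (not from the paper) illustrates the observation the abstract builds on: a plaintext embedding table lookup is equivalent to multiplying a one-hot query vector with the embedding matrix, so a generic HE matrix-vector product spends work and ciphertext traffic on the all-but-one zero entries that a one-hot-aware scheme can avoid. The sizes and variable names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sizes only; e.g., LLAMA-7B uses a 32000 x 4096 embedding table.
vocab_size, hidden_dim = 8, 4
table = np.random.randn(vocab_size, hidden_dim).astype(np.float32)

# A user query token, represented as a one-hot vector over the vocabulary.
token_id = 5
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[token_id] = 1.0

# Generic formulation: a dense matrix-vector product over the whole table.
# Under HE, this dense formulation is what drives the cost the paper targets.
embedding_via_matvec = one_hot @ table

# Plaintext equivalent: the one-hot structure simply selects a single row.
embedding_via_lookup = table[token_id]

assert np.allclose(embedding_via_matvec, embedding_via_lookup)
```

Per the abstract, FastQuery's one-hot-aware dense packing exploits exactly this structure, and its communication-aware quantization exploits the table's tolerance to low-precision noise, to cut both computation and communication.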
Event Type
Research Manuscript
Time
Wednesday, June 26, 5:15pm - 5:30pm PDT
Location
3002, 3rd Floor
Topics
AI
Security
Keywords
AI/ML Security/Privacy