TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh
Google Research
[email protected]

Majid Daliri
New York University
[email protected]

Majid Hadian
Google DeepMind
[email protected]

Vahab Mirrokni
Google Research
[email protected]

Abstract

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers to each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on the best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

Executive Summary: Vector quantization compresses high-dimensional data, such as embeddings in artificial intelligence models, by converting continuous values into low-bit integers. This reduces memory use and speeds up computations, which is critical today as large language models (LLMs) grow massive and demand ever-longer contexts for tasks like chatbots or document analysis. However, existing methods either require time-consuming data tuning, which is incompatible with real-time needs, or fail to preserve key geometric properties—like distances and similarities—leading to errors in model performance or search accuracy. With AI deployment scaling rapidly, efficient compression without quality loss addresses bottlenecks in inference speed and costs, especially for key-value (KV) caches in transformers and vector databases for retrieval.

This document introduces TurboQuant, a set of algorithms designed to compress vectors with near-optimal distortion under two measures: mean-squared error (MSE), which tracks overall shape preservation, and inner product error, which ensures accurate similarity estimates without bias. The goal is to achieve these for any bit-width in high dimensions, using data-oblivious methods that work instantly without preprocessing, suitable for online AI applications.

The approach relies on simple, randomized transformations and proven scalar techniques. Vectors are first randomly rotated to standardize their statistics, making coordinates nearly independent in high dimensions—a property from concentration of measure. For MSE, each coordinate is then quantized using precomputed optimal thresholds, like solving a one-dimensional clustering problem, minimizing reconstruction error. For inner products, a two-stage process applies the MSE quantizer at one fewer bit, then adds a single-bit transform on the small residual error to eliminate bias. Theoretical analysis proves bounds close to Shannon's information limits, using worst-case inputs over datasets of any size. Experiments test on real embeddings from sources like DBpedia (up to 3072 dimensions, 100,000 samples), focusing on KV cache tasks with Llama models and nearest-neighbor search.

Key findings highlight TurboQuant's efficiency. First, it matches theoretical distortions: MSE error is about 0.36 at 1 bit per coordinate, dropping to 0.009 at 4 bits, within a 2.7 factor of the fundamental lower bound across all bit-widths; inner product variance scales as 1.57/d at 1 bit to 0.047/d at 4 bits, also near-optimal and unbiased. Second, in KV cache compression for long-context LLMs, 3.5 bits per channel achieves quality identical to full precision (16 bits), enabling 4x memory reduction with perfect recall in needle-in-a-haystack tests up to 104k tokens; at 2.5 bits, degradation is minimal (under 5% score drop on LongBench tasks). Third, for nearest-neighbor search, TurboQuant outperforms product quantization baselines by 10-20% in recall at top-10 results across dimensions, while indexing time is nearly zero (under 0.002 seconds) versus minutes for competitors. Experiments confirm these on diverse datasets, aligning closely with proofs.

These results mean TurboQuant unlocks practical gains: it cuts KV cache memory by over 4.5x in LLMs, slashing inference costs and latency for serving large models, while preserving accuracy for tasks like generation or retrieval-augmented AI. Unlike prior methods, it avoids bias in similarities, vital for reliable search in vector databases powering recommendation systems. It improves on expectations by being fully online and accelerator-friendly, exponentially better in bit efficiency than grid-based or k-means approaches, especially at low bits where others degrade sharply. This matters for scaling AI sustainably, reducing hardware needs without retraining.

Leaders should prioritize integrating TurboQuant into LLM serving frameworks and vector databases for immediate deployment in production, starting with KV cache pilots at 3-4 bits to test cost savings. For nearest-neighbor applications, adopt it to replace slower indexing methods, trading off minor storage for speed. If bit budgets vary, the MSE variant suits general compression, while the inner-product one fits similarity-focused tasks. Further work includes broader testing on proprietary models and hybrid setups with outlier handling; a small-scale rollout could quantify ROI before full adoption.

Confidence in the results is high, backed by rigorous proofs and consistent experiments across benchmarks, though limitations include reliance on high dimensions (d > 1000) for independence assumptions and unit-norm vectors (mitigated by storing norms separately). Worst-case analysis may overstate errors in typical data, but users should validate on specific workloads for edge cases like very low dimensions.

1. Introduction

Section Summary: Vector quantization is a technique that compresses high-dimensional data by converting detailed numerical values into simpler, low-bit codes, helping to reduce errors in calculations like those used in AI models and search systems while saving memory and speed. It plays a vital role in large language models by shrinking the size of model weights and memory caches for longer texts, and in databases for quick similarity searches. The section introduces TurboQuant, a new two-step method that rotates vectors, applies optimal error-minimizing quantization per coordinate, and adds a one-bit refinement to ensure accurate inner product results with better efficiency than existing approaches.

Vector quantization (VQ) in Euclidean space is crucial for efficiently handling high-dimensional vectors across a spectrum of computational domains, from training and deploying large-scale AI and deep learning models to powering vector databases for search/retrieval systems. The core objective is to compress high-dimensional vectors by quantizing them–converting floating-point coordinate values to low-bitwidth integers–while minimizing distortion, quantified by metrics such as mean-squared error (MSE) or inner product errors. By preserving these properties, inner product queries can be answered rapidly, with minimal latency, and using reduced computational and communication resources.

This problem's roots trace back to Shannon’s seminal work on Source Coding theory [1, 2], which established that the least distortion achievable by block source codes, now known as vector quantizers, is defined by the Shannon distortion-rate function, determined by the statistical properties of the source and the chosen distortion measure, such as MSE. Today, VQ plays a critical role in fundamental computational domains, including AI, deep learning, and search systems.

A key application of VQ is in the deployment of AI models, including large language models (LLMs) [3, 4, 5, 6]. As LLM capabilities depend heavily on their model size and context length [7], serving them requires substantial memory demands and increased inference latency. This latency is primarily attributed to communication bottlenecks between HBM and SRAM on accelerators, or across distributed clusters. By compressing or quantizing model weights and activations, we can effectively mitigate these bottlenecks, resulting in significant reductions in inference costs. Inner product operations between activations and weights are at the core of deep learning models. Thus, model quantization schemes strive to compress weights and/or activation vectors while accurately preserving these inner products.

Decoder based transformer models [8] present another compelling use case. These models must store key/value (KV) embeddings from previously generated tokens in the KV cache, the size of which scales with both model size (number of layers and attention heads) and context length. This scaling is a significant bottleneck in terms of memory usage and computational speed, especially for long context models. Therefore, reducing the KV cache size without compromising accuracy is essential. In this context, the preservation of the Euclidean structure of these embedding vectors–their inner products and distances–is crucial for maintaining model performance. VQ emerges as the most suitable framework for addressing this challenge, offering a robust approach to compressing high-dimensional embeddings while preserving their essential geometric properties.

Additionally, nearest neighbor (NN) search in high-dimensional spaces with inner product or cosine similarity [9, 10] is a cornerstone of vector databases [11, 12, 13]. These databases are fundamental for retrieval-augmented generation [14, 15] and information retrieval [16, 17]. VQ, a.k.a. product quantization (PQ), plays a critical role in these applications. It enables efficient compression of database vectors, optimizes memory usage, and facilitates low-latency, accurate estimations of inner products with query vectors, thereby enabling fast and precise nearest neighbor searches.

Existing VQ algorithms present a trade-off: either they lack accelerator (vectorization) compatibility and exhibit slow computation, making them unsuitable for real-time AI applications like KV cache quantization, or they suffer from suboptimal distortion bounds relative to bit-width. Our objective is to introduce an algorithm that addresses these limitations. Specifically, we design TurboQuant: a lightweight algorithm that can be applied online (crucial for scenarios like KV cache quantization) and is highly accelerator-friendly, a critical attribute for modern AI workloads.

The core of TurboQuant is a two-stage process. First, we develop a vector quantizer with optimal distortion rate in terms of mean-squared error (MSE). Subsequently, we apply a 1-bit quantizer to the residual, resulting in an unbiased and low-distortion inner product quantizer. We demonstrate that quantizers optimized for MSE do not produce unbiased estimators for inner products, and our two-stage solution effectively bridges this gap. Our MSE-optimal quantizer starts by randomly rotating $d$-dimensional input vectors. Observing the key fact that each coordinate in the rotated vectors follows a Beta distribution, we design optimal Lloyd-Max quantizer [18, 19] for each coordinate by solving a continuous k-means problem. This method gives optimal MSE distortion bound and minimizes the L2 norm of the residual. To obtain an unbiased and low-distortion quantizer for inner products, we compose our quantizer with the recently developed Quantized Johnson-Lindenstrauss (QJL) transform [20], which quantizes each coordinate of the residual vector to a single bit. Our algorithm offers provably optimal distortion bounds for both MSE and inner products, achieving an exponential improvement over existing methods in terms of bit-width dependence.
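A minimal sketch of this rotate-then-quantize pipeline in NumPy may help fix ideas. All names below are ours, the rotation is sampled via QR decomposition of a Gaussian matrix, and a uniform grid stands in for the Lloyd-Max codebook described above; this is an illustration of the idea under those assumptions, not the paper's implementation.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Sample a Haar-random rotation via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution uniform

def quantize(x, rotation, codebook):
    """Rotate the input, then snap each coordinate to its nearest codebook level."""
    z = rotation @ x
    return np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)  # integer codes

def dequantize(codes, rotation, codebook):
    """Look up codebook levels and undo the rotation."""
    return rotation.T @ codebook[codes]

# Toy usage with a 2-bit uniform grid (a stand-in for the optimal Lloyd-Max
# codebook); rotated coordinates of a unit vector concentrate at scale 1/sqrt(d).
d = 1024
codebook = np.linspace(-2.5, 2.5, 4) / np.sqrt(d)
R = random_rotation(d)
x = np.random.default_rng(1).standard_normal(d)
x /= np.linalg.norm(x)
x_hat = dequantize(quantize(x, R, codebook), R, codebook)
mse = float(np.sum((x - x_hat) ** 2))
```

Even with this non-optimal grid, the reconstruction's squared error is a small fraction of $\|{\bm x}\|_2^2 = 1$; replacing the grid with Lloyd-Max levels tightens it further.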

1.1 Problem Definition

Formally, our goal is to design a quantization map, denoted as $Q: \mathbb{R}^d \to \{ 0, 1 \}^B$, that transforms $d$-dimensional vectors to a binary string of $B$ bits. If we set $B = b \cdot d$ for some $b \ge 0$, this quantizer will have a bit-width of $b$, representing the average number of bits used to encode each real-valued coordinate of $\mathbb{R}^d$. Crucially, we require an inverse map, $Q^{-1}: \{ 0, 1 \}^B \to \mathbb{R}^d$, that performs dequantization, approximately reconstructing original vectors from their quantized representations. Of course, this transformation is inherently lossy, as $Q$ is not a bijection. So, our primary objective is to minimize distortion, with a specific focus on mean-squared error (MSE) and inner product distortion.

We make no assumptions about the input vector dataset, considering the worst-case scenario. We allow the quantizer $Q(\cdot)$ to be randomized, leading to stochastic outputs. Considering randomized quantizers, it is more appropriate to define the expected distortion over the randomness of the quantizer's output. Thus, we aim to design quantizers that for any desired bit-width $b$ minimize the following expected distortion measures for any (worst-case) vectors ${\bm x}, {\bm y} \in \mathbb{R}^d$:

$ \begin{align} \textbf{(MSE)}\quad & D_{\tt mse} := \mathop{{\mathbb{E}}}_{Q}\left[\left\| {\bm x} - Q^{-1}\left(Q({\bm x}) \right) \right\|_2^2 \right] \tag{1}\\ \textbf{(inner-prod error)}\quad & D_{\tt prod} := \mathop{{\mathbb{E}}}_{Q}\left[\left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, Q^{-1}\left(Q({\bm x}) \right) \rangle \right|^2 \right]. \tag{2} \end{align} $

The expectations above are taken with respect to the randomness of the quantizer $Q(\cdot)$. Furthermore, for inner-product quantizers, we require unbiasedness of the inner product estimator, a desirable property for numerous applications. More precisely, we require:

$ \begin{align*} \textbf{(unbiased inner-prod)}\quad & \mathop{{\mathbb{E}}}_{Q}\left[\langle {\bm y}, Q^{-1}\left(Q({\bm x}) \right) \rangle\right] = \langle {\bm y}, {\bm x} \rangle. \end{align*} $
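To make these distortion definitions concrete, the sketch below (our own toy example, not the paper's construction) Monte Carlo-estimates $D_{\tt mse}$, $D_{\tt prod}$, and the inner-product bias for a simple randomized scalar quantizer, stochastic rounding, which is unbiased by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, step, trials = 64, 0.25, 20_000

def stochastic_round(v):
    """Randomized rounding of each coordinate to a multiple of `step`, with the
    up/down probabilities chosen so that E[Q(v)] = v (an unbiased quantizer)."""
    lo = np.floor(v / step) * step
    p_up = (v - lo) / step                 # probability of rounding up
    return lo + step * (rng.random(v.shape) < p_up)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d)

recon = np.array([stochastic_round(x) for _ in range(trials)])
d_mse = float(np.mean(np.sum((recon - x) ** 2, axis=1)))   # estimates Eq. (1)
ip_est = recon @ y
d_prod = float(np.mean((ip_est - x @ y) ** 2))             # estimates Eq. (2)
bias = float(np.mean(ip_est) - x @ y)                      # ~0: unbiasedness holds
```

The bias estimate lands near zero, while both distortions stay bounded; TurboQuant's contribution is driving these distortions down to the information-theoretic rate.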

We aim to design computationally efficient quantizers $Q_{\tt mse}$ and $Q_{\tt prod}$, that achieve optimal bounds for the distortion measures defined above, for any given bit-width $b$. Additionally, we aim for $Q_{\tt prod}$ to provide unbiased inner product estimates. In particular, assume that we are given $n$ real-valued vectors ${\bm x}_1, {\bm x}_2, \ldots, {\bm x}_n \in \mathbb{R}^d$. We design the following primitives:

  • Quant: efficiently quantizes the dataset and computes $Q({\bm x}_1), Q({\bm x}_2), \ldots Q({\bm x}_n)$.
  • DeQuant: given a quantized dataset, can efficiently reconstruct original vectors by computing $Q^{-1}\left(Q({\bm x}_i) \right)$ for any $i \in [n]$.

1.2 Related Work

Beginnings of VQ.

Vector quantization theory began with Shannon's seminal work [1, 2] on achievable distortion-rate functions. In 1963, Zador [21] made significant advances by employing high-resolution methods to derive the limiting operational distortion-rate function for fixed-rate quantization at high rates, which closely matches Shannon's distortion-rate function. However, Zador did not specifically consider implementable algorithms. Gersho's influential paper [22] further advanced vector quantization by popularizing high-resolution theory, simplifying Zador's results, introducing lattice vector quantization, and proposing a key conjecture that shaped the field. Despite these theoretical advancements, the practical applicability of vector quantization remained unclear in the early years. The most straightforward encoding method, brute-force nearest neighbor search, was computationally expensive, hindering the adoption of VQ in practice.

Online vs Offline Quantization.

Online (data-oblivious) quantization methods apply instantly without needing data-specific tuning or calibrations [23, 24, 25, 26, 27]. In contrast, offline (data-dependent) methods require heavy preprocessing and learning to adapt the quantization map to the data, making them unsuitable for dynamic data scenarios [28]. For instance, methods such as those presented in [29, 30, 31, 32] use second-order (Hessian) information to tune the quantization map which requires heavy preprocessing and even in some cases post processing as well.

Online KV Cache Compression.

Several approaches have been proposed to compress the KV cache. These include architectural modifications [33, 34, 35] which restructure the transformer to minimize the number of stored key-value pairs. Additionally, pruning or evicting redundant or less critical tokens has emerged as another approach [36, 37, 38, 39, 40, 41, 42].

A simple yet effective approach to reducing KV cache size is quantizing the KV cache. Several quantization techniques have been developed specifically for this purpose [43, 44, 45, 46, 47, 25, 48, 49, 27]. Recently, a new quantization scheme called QJL [20] introduced an efficient, data-oblivious 1-bit quantization approach based on sketching techniques, which provides unbiased estimates for inner product queries. This method does not require tuning or adaptation to the input data, and we make use of this technique in our quantizer optimized for inner product distortion.

Product Quantization (PQ).

In Near Neighbor (NN) search problem with Euclidean datasets, the index size poses a significant memory bottleneck, often mitigated by quantization techniques, commonly referred to as Product Quantization (PQ) in the NN literature. Many of these algorithms rely on constructing a quantization codebook using variations of k-means during the indexing phase [50, 51, 52, 53, 10]. Therefore, these methods are ill-suited for online settings due to their requirement for extensive preprocessing.

Recently, a grid-based PQ method was introduced in [54], eliminating the need for preprocessing. This approach operates by projecting a uniform grid onto the unit sphere and conducting a search to identify the nearest projection to the data points. The paper's theoretical guarantees are suboptimal, likely due to loose analysis, since practical performance surpasses the theoretical bounds. Moreover, the grid-projection and binary-search procedure is computationally slow and particularly inefficient on accelerators like GPUs, because its inherent lack of vectorization prevents parallel processing.

1.3 Overview of Techniques and Contributions

MSE Optimized TurboQuant.

Our first VQ algorithm is designed to minimize the MSE distortion defined in Eq. (1). To achieve this, we apply a random rotation to the input vectors, thereby inducing a Beta distribution on each coordinate, irrespective of the input vectors themselves. In high dimensions $d$, the distribution of each coordinate converges to a Gaussian distribution $\mathcal{N}(0, 1/d)$ due to concentration of measure and the central limit theorem. Furthermore, any two distinct coordinates become nearly uncorrelated and, more importantly, almost independent (a deeper result that goes beyond just correlation). This near-independence is a crucial aspect that simplifies our quantization design. It allows us to quantize each coordinate using optimal scalar quantization, disregarding interactions or correlations between different coordinates, while still achieving near-optimal distortion.

We find optimal scalar quantizers for random variables with Beta distributions by solving a continuous 1-dimensional k-means problem using the Max-Lloyd algorithm. We precompute and store these optimal codebooks for a range of practically useful bit-widths, to enable efficient subsequent invocations of our TurboQuant algorithm.
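The per-coordinate codebooks come from a 1-D Lloyd-Max (k-means) iteration. The sketch below is our own, not the paper's code: it runs the iteration on samples from the standard Gaussian (the standardized high-dimensional limit of the Beta coordinate distribution) rather than solving the continuous problem exactly.

```python
import numpy as np

def lloyd_max(samples, b, iters=30):
    """1-D Lloyd-Max (k-means on an empirical sample): alternate nearest-level
    assignment and centroid updates until the 2^b-level codebook settles."""
    levels = np.quantile(samples, (np.arange(2 ** b) + 0.5) / 2 ** b)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(len(levels)):
            if np.any(idx == j):
                levels[j] = samples[idx == j].mean()
    return levels

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)     # high-d limit of the Beta coordinates
levels_1bit = lloyd_max(samples, b=1)
# reconstruct and measure the scalar MSE of the 1-bit codebook
q = levels_1bit[np.abs(samples[:, None] - levels_1bit[None, :]).argmin(axis=1)]
mse_1bit = float(np.mean((samples - q) ** 2))
```

For a standard Gaussian the optimal 1-bit codebook is $\pm\sqrt{2/\pi} \approx \pm 0.798$ with MSE $1 - 2/\pi \approx 0.363$, which the iteration recovers; higher bit-widths follow the same recipe.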

In Theorem 1 we prove that the $b$-bit MSE optimized TurboQuant $Q_{\tt mse}: \mathbb{R}^d \to \{ 0, 1 \}^{b \cdot d}$ achieves the following distortion for any worst-case vector ${\bm x} \in \mathbb{R}^d$ with $\|{\bm x}\|_2 = 1$:

  • $D_{\tt mse}(Q_{\tt mse}) := \mathop{{\mathbb{E}}}\left[\left\| {\bm x} - Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right) \right\|_2^2 \right] \le \frac{\sqrt{3}\pi}{2} \cdot \frac{1}{4^b}$ for any $b \ge 0$.
  • For small bit-widths the above distortion upper bound can be further refined. Specifically, for $b = 1, 2, 3, 4$ we have $D_{\tt mse}(Q_{\tt mse}) \approx {\bf 0.36}, {\bf 0.117}, {\bf 0.03}, {\bf 0.009}$, respectively.

Note that the unit norm assumption, $|{\bm x}|_2 = 1$, is standard and not restrictive. For datasets that do not satisfy this assumption we can compute and store the L2 norms in floating-point precision and rescale the dequantized points using these stored norms.
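As a sanity check on the refined $b = 1$ figure, we can simulate the setting directly. The snippet below is ours: it assumes each coordinate of a randomly rotated unit vector is approximately $\mathcal{N}(0, 1/d)$ and uses the known optimal 1-bit Gaussian codebook $\pm\sqrt{2/\pi}\,\sigma$, so the measured distortion should land near $1 - 2/\pi \approx 0.3634$.

```python
import numpy as np

# Monte Carlo check of the b = 1 distortion above, under our Gaussian
# simplification of the rotated coordinates.
rng = np.random.default_rng(0)
d, trials = 4096, 200
level = np.sqrt(2 / np.pi) / np.sqrt(d)   # optimal 1-bit codebook for N(0, 1/d)
errs = []
for _ in range(trials):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                # uniform point on the unit sphere
    x_hat = np.sign(x) * level            # 1-bit quantization per coordinate
    errs.append(np.sum((x - x_hat) ** 2))
mse_b1 = float(np.mean(errs))             # should sit near 1 - 2/pi
```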

Inner Product TurboQuant.

We show that MSE-optimized quantizers are biased for inner product estimation, and thus a different VQ scheme is needed to obtain an unbiased inner product quantizer. Our solution is a two-stage algorithm that first applies the aforementioned MSE-optimized quantizer with a bit-width one less than our target budget and then applies QJL [20] to the residual error. This scheme is provably unbiased and also achieves a nearly optimal inner product error rate.

In Theorem 2 we prove that the $b$-bit inner product optimized TurboQuant $Q_{\tt prod}: \mathbb{R}^d \to \{ 0, 1 \}^{b \cdot d}$ achieves the following distortion for any worst-case vectors ${\bm x}, {\bm y} \in \mathbb{R}^d$ with $\|{\bm x}\|_2 = 1$:

  • $\mathop{{\mathbb{E}}}\left[\left< {\bm y}, Q_{\tt prod}^{-1}\left(Q_{\tt prod}({\bm x}) \right) \right> \right] = \langle {\bm y}, {\bm x} \rangle$
  • $D_{\tt prod}(Q_{\tt prod}) := \mathop{{\mathbb{E}}}\left[\left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, Q_{\tt prod}^{-1}\left(Q_{\tt prod}({\bm x}) \right) \rangle \right|^2 \right] \le \frac{\sqrt{3}\pi}{2} \cdot \frac{\|{\bm y}\|_2^2}{d} \cdot \frac{1}{4^b}$ for any $b \ge 0$.
  • For small bit-widths the above distortion upper bound can be further refined. Specifically, for $b = 1, 2, 3, 4$ we have $D_{\tt prod}(Q_{\tt prod}) \approx \frac{\bf 1.57}{d}, \frac{\bf 0.56}{d}, \frac{\bf 0.18}{d}, \frac{\bf 0.047}{d}$, respectively.

Lower Bound.

In Theorem 3, we leverage Shannon's lower bound and Yao's minimax principle to prove that for any randomized quantization algorithm $Q: \mathbb{R}^d \to \{ 0, 1 \}^{b \cdot d}$ with bit-width $b$, there exist hard input instances ${\bm x}, {\bm y} \in \mathbb{R}^d$ with $\|{\bm x}\|_2 = 1$ such that the following lower bounds hold:

  • $D_{\tt mse}(Q) := \mathop{{\mathbb{E}}}\left[\left\| {\bm x} - Q^{-1}\left(Q({\bm x}) \right) \right\|_2^2 \right] \ge \frac{1}{4^b}$
  • $D_{\tt prod}(Q) = \mathop{{\mathbb{E}}}\left[\left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, Q^{-1}\left(Q({\bm x}) \right) \rangle \right|^2 \right] \ge \frac{\|{\bm y}\|_2^2}{d} \cdot \frac{1}{4^b}$

As demonstrated by our lower bounds, TurboQuant's MSE distortion is provably within a factor of at most $\frac{\sqrt{3}\pi}{2} \approx {\bf 2.7}$ of the information-theoretic lower bound. Notably, for smaller bit-widths, this factor decreases significantly. For instance, at a bit-width of $b=1$, TurboQuant achieves a distortion that is only a factor of approximately ${\bf 1.45}$ away from optimal, which is also confirmed by our experimental results, indicating its efficiency in low-bit-width scenarios.
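These two constants follow from the bounds by direct arithmetic; a quick check, where we take the refined $b = 1$ distortion $\approx 0.36$ to be $1 - 2/\pi$ (the 1-bit Gaussian quantizer error, our assumption for this computation):

```python
import math

# Theorem 1 upper-bound constant vs. the Theorem 3 lower bound 4^{-b}:
# the asymptotic gap is sqrt(3)*pi/2 at every bit-width.
upper_const = math.sqrt(3) * math.pi / 2          # ~2.72
# At b = 1, the refined distortion 1 - 2/pi against the lower bound 1/4:
gap_b1 = (1 - 2 / math.pi) / (1 / 4)              # ~1.45
```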

Experimental Results.

In Section 4.1, we empirically validate our theoretical distortion bounds, demonstrating that TurboQuant's observed distortions closely align with our predictions across various real-world datasets, approaching the established lower bounds.

Furthermore, in Section 4.2 and Section 4.3, we showcase TurboQuant's efficacy in online KV cache quantization. Specifically, we achieve perfect long-context retrieval in needle-in-a-haystack tasks and maintain high performance on other long-context downstream tasks, all while compressing the KV cache by a factor exceeding $4.5\times$.

Finally in Section 4.4 we apply TurboQuant to various high-dimensional near neighbor search tasks. TurboQuant consistently outperforms data-dependent product quantization (PQ), while reducing the indexing time to essentially zero.

2. Preliminaries

Section Summary: This section establishes basic mathematical notations, using bold letters for vectors and matrices, and defines terms like expectations, variances, and special functions such as sign and quantizers for points on hyperspheres. It presents a key lemma showing that the coordinates of a random point uniformly distributed on a high-dimensional unit hypersphere follow a Beta distribution, which approximates a normal distribution as dimensions increase. Additionally, it introduces the Shannon Lower Bound, a fundamental limit on compression distortion for mean-squared error, and specializes it to hypersphere points, yielding a simple bound that distortion decreases exponentially with bits used per dimension.

We use boldface lowercase letters, such as ${\bm x} \in \mathbb{S}^{d-1}$ and ${\bm y} \in \mathbb{R}^d$, to denote vectors, and boldface uppercase letters, like ${\bm S}$, to denote matrices. To denote a slice of a vector ${\bm x}$ between the coordinate indices $i$ and $j$, inclusive of the endpoints, we use the notation ${\bm x}_{[i:j]}$. For a matrix ${\bm S}$, we write ${\bm s}_i$ to denote its $i$-th row vector.

We use the notation $\mathbb{S}^{d-1}$ to denote the unit hypersphere in $\mathbb{R}^d$. For a random variable $X$ we denote its differential entropy as $h(X)$. For random variables $X$ and $Y$, the mutual information between them is denoted as $I(X; Y)$.

Given that TurboQuant employs random rotation to mitigate worst-case input scenarios, understanding the statistical properties of random points on a hypersphere is essential. The following lemma outlines one such property that we will need for analysis and design purposes:

Lemma 1: coordinate distribution of random point on hypersphere

For any positive integer $d$, if ${\bm x}$ is a random vector uniformly distributed over the unit hypersphere $\mathbb{S}^{d-1}$, then for any $j \in [d]$ the coordinate ${\bm x}_j$ follows the following (scaled/shifted) Beta distribution:

$ {\bm x}_j \sim f_{X}(x) := \frac{\Gamma(d/2)}{\sqrt{\pi} \cdot \Gamma((d-1)/2)} \left(1 - x^2 \right)^{(d-3)/2}. $

In high dimensions this Beta distribution converges to the normal distribution $\mathcal{N}(0, 1/d)$.

Proof: The density $f_X(x)$ equals the ratio of the surface area of a sphere of radius $\sqrt{1-x^2}$ in dimension $d-1$ to the surface area of a unit sphere in dimension $d$, scaled by $1/\sqrt{1-x^2}$ (by the Pythagorean theorem). Therefore,

$ f_X(x) = \frac{\frac{2 \pi^{(d-1)/2}}{\Gamma((d-1)/2)} \cdot (1-x^2)^{(d-2)/2}}{\frac{2 \pi^{d/2}}{\Gamma(d/2)}} \cdot 1/\sqrt{1-x^2}= \frac{\Gamma(d/2)}{\sqrt{\pi} \cdot \Gamma((d-1)/2)} \left(1 - x^2 \right)^{(d-3)/2}. $
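A quick empirical check of Lemma 1 (our own snippet): coordinates of uniformly random points on the unit hypersphere should have mean $0$ and variance exactly $1/d$, matching the $\mathcal{N}(0, 1/d)$ high-dimensional limit.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 50_000
pts = rng.standard_normal((n, d))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # uniform points on S^{d-1}
coord = pts[:, 0]                                   # any fixed coordinate

emp_mean = float(coord.mean())   # the Beta density is symmetric: mean 0
emp_var = float(coord.var())     # should match 1/d
```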

2.1 Shannon Lower Bound on Distortion

The Shannon Lower Bound (SLB) is a powerful tool, derived from Shannon's lossy source coding theorem [2], that provides a universal lower bound on the optimal achievable distortion rate for any lossy compression scheme. Specifically, we use a version of SLB tailored for the mean-squared error (MSE) distortion measure applied to general $d$-dimensional sources.

Lemma 2: SLB

Let ${\bm x} \in \mathbb{R}^d$ be a random vector with an arbitrary probability distribution $p_X$ and finite differential entropy $h({\bm x})$.

Define the MSE distortion-rate function $D(p_X, B)$ for total bit complexity $B$ as:

$ D(p_X, B) := \inf \left\{ \mathop{{\mathbb{E}}} \left[\left\| {\bm x} - {\bm y} \right\|_2^2 \right] : I({\bm x}; {\bm y}) \le B \right\}, $

where the infimum is taken over all joint distributions of ${\bm x}$ and a reconstruction random vector ${\bm y}$ such that the mutual information $I({\bm x}; {\bm y})$ is at most $B$, and $\mathop{{\mathbb{E}}}\left[\left\| {\bm x} - {\bm y} \right\|_2^2 \right]$ is the expected MSE distortion, calculated with respect to the joint distribution of ${\bm x}$ and ${\bm y}$.

Then, for any bit complexity $B \ge 0$, the following Shannon Lower Bound holds:

$ D(p_X, B) \ge \frac{d}{2 \pi e} \cdot 2^{(2/d) (h({\bm x}) - B)}. $

This is a classic result proved using the backward Gaussian test channel (for a proof see [55]). Our lower bound result uses a corollary of SLB that corresponds to uniformly distributed random points on the unit hypersphere. We present this in the following lemma:

Lemma 3: SLB for random point on hypersphere

Let ${\bm x}$ be a random vector uniformly distributed over the unit hypersphere $\mathbb{S}^{d-1}$ and define the MSE distortion-rate function $D(B)$ for total bit complexity $B$ as per Lemma 2.

Then, for any bit complexity $B \ge 0$, the following distortion lower bound holds:

$ D(B) \ge 2^{-2B/d}. $

Proof: If we let $A_d := \frac{2 \pi^{d/2}}{\Gamma(d/2)}$ denote the area of the hypersphere $\mathbb{S}^{d-1}$, the entropy of the uniform distribution over the hypersphere is $h({\bm x}) = \log_2 A_d$. Plugging this into the SLB from Lemma 2 we get $D(B) \ge \frac{d}{2\pi e} \cdot A_d^{2/d} \cdot 2^{-2B/d}$. Using Stirling's approximation formula for the Gamma function we have $\frac{d}{2\pi e} \cdot A_d^{2/d} \ge 1$. By substituting this into the inequality we get the desired lower bound.
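The Stirling step can be verified numerically: with $A_d = 2\pi^{d/2}/\Gamma(d/2)$, the prefactor $\frac{d}{2\pi e} \cdot A_d^{2/d}$ stays at least $1$ and approaches $1$ as $d$ grows. A short check in log-space (ours):

```python
import math

def slb_prefactor(d):
    """(d / (2*pi*e)) * A_d^(2/d) for A_d = 2*pi^(d/2) / Gamma(d/2), computed
    in log-space to avoid overflow of the Gamma function."""
    log_area = math.log(2) + (d / 2) * math.log(math.pi) - math.lgamma(d / 2)
    return (d / (2 * math.pi * math.e)) * math.exp((2 / d) * log_area)

factors = {d: slb_prefactor(d) for d in (64, 256, 1024, 4096)}
```

Since the prefactor only decays toward $1$ from above, dropping it loosens the bound, as the proof claims.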

2.2 QJL: 1-bit inner product quantization

As previously stated, we design two VQ algorithms: one optimized for minimizing MSE and the other for minimizing inner product error. We show that MSE-optimal quantizers do not necessarily provide unbiased inner product estimates, particularly exhibiting significant bias at lower bit-widths. Our solution for inner product quantization is a two-stage algorithm. First, we apply the MSE-optimal quantizer using one less bit than the desired bit-width budget, thus minimizing the L2 norm of the residuals. Next we apply an unbiased and optimal single-bit quantizer to the residual. For the single-bit inner product quantizer, we utilize the recently proposed Quantized Johnson-Lindenstrauss (QJL) algorithm [20], which is an optimal inner product quantizer with a bit-width of one. Here, we present the QJL algorithm and its essential theoretical guarantees.

Definition 4: QJL

For any positive integer $d$, the QJL map $Q_{\tt qjl}: \mathbb{R}^d \to \{-1, +1\}^d$ is defined as:

$ Q_{\tt qjl}({\bm x}) := {\tt sign} \left({\bm S} \cdot {\bm x} \right) \quad ~\text{ for any } {\bm x} \in \mathbb{R}^d, $

where ${\bm S} \in \mathbb{R}^{d \times d}$ is a random matrix with i.i.d. entries sampled from the normal distribution $\mathcal{N}(0,1)$ and the ${\tt sign}$ function is applied entry-wise to its vector input.

The inverse/dequantization map $Q_{\tt qjl}^{-1}: \{-1, +1\}^d \to \mathbb{R}^d$ is defined as:

$ Q_{\tt qjl}^{-1}({\bm z}) := \frac{\sqrt{\pi/2}}{d} \cdot {\bm S}^\top \cdot {\bm z} \quad ~\text{ for any } {\bm z} \in \{-1, +1\}^d. $

In the next lemma we restate the results from [20] that show the QJL is unbiased and also has small inner product distortion:

Lemma 5: performance guarantee: QJL

Let $Q_{\tt qjl}$ and $Q_{\tt qjl}^{-1}$ be defined as per Definition 4.

For any vector ${\bm x} \in \mathbb{S}^{d-1}$ and any ${\bm y} \in \mathbb{R}^d$ we have the following:

  • Unbiased: $\mathop{{\mathbb{E}}}\left[\left< {\bm y}, Q_{\tt qjl}^{-1}\left(Q_{\tt qjl}({\bm x}) \right) \right> \right] = \langle {\bm y}, {\bm x} \rangle$.
  • Variance Bound: $\mathtt{Var} \left(\left< {\bm y}, Q_{\tt qjl}^{-1}\left(Q_{\tt qjl}({\bm x}) \right) \right> \right) \le \frac{\pi}{2 d} \cdot \left| {\bm y} \right|_2^2$.

Proof: The unbiasedness immediately follows from Lemma 3.2 of [20]. To show the variance bound, let ${\bm s}_1, {\bm s}_2, \ldots, {\bm s}_d$ denote the rows of the random matrix ${\bm S}$ in Definition 4. We have:

$ \left< {\bm y}, Q_{\tt qjl}^{-1}\left(Q_{\tt qjl}({\bm x}) \right) \right> = \frac{1}{d} \sum_{i\in[d]} \sqrt{\pi/2} \cdot {\bm s}_i^\top {\bm y} \cdot {\tt sign}({\bm s}_i^\top {\bm x}). $

Since ${\bm s}_i$ 's are i.i.d. the above is indeed the average of $d$ i.i.d. random samples defined as $z_i := \sqrt{\pi/2} \cdot {\bm s}_i^\top {\bm y} \cdot {\tt sign}({\bm s}_i^\top {\bm x})$ for $i \in [d]$. Let us now upper bound the variance of a single $z_i$ using Fact 3.4 from [20]:

$ \mathtt{Var} \left(z_i \right) = \pi/2 \cdot \mathtt{Var} \left({\bm s}_i^\top {\bm y} \cdot {\tt sign}({\bm s}_i^\top {\bm x}) \right) \le \pi/2 \cdot \mathop{{\mathbb{E}}} \left[({\bm s}_i^\top {\bm y})^2 \right] = \pi/2 \cdot \left| {\bm y} \right|_2^2,\tag{3} $

where the last equality above follows because ${\bm s}_i^\top {\bm y}$ is a Gaussian random variable with mean zero and variance $| {\bm y}|_2^2$. Now the variance of the average of $d$ i.i.d. random samples $z_1, z_2, \ldots z_d$ is:

$ \mathtt{Var} \left(\left< {\bm y}, Q_{\tt qjl}^{-1}\left(Q_{\tt qjl}({\bm x}) \right) \right> \right) = \frac{1}{d^2} \sum_{i\in[d]} \mathtt{Var} (z_i) \le \frac{\pi}{2 d} \cdot \left| {\bm y} \right|_2^2. $
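The guarantees above are easy to check numerically. Below is a minimal sketch of the QJL quantize/dequantize pair from Definition 4; the dimension and the test vector are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048  # illustrative dimension

S = rng.standard_normal((d, d))  # i.i.d. N(0,1) projection matrix

def qjl_quantize(x):
    # Q_qjl(x) = sign(S x): one bit per projected coordinate
    return np.sign(S @ x)

def qjl_dequantize(z):
    # Q_qjl^{-1}(z) = sqrt(pi/2)/d * S^T z
    return np.sqrt(np.pi / 2) / d * (S.T @ z)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # unit-norm data vector

# estimate <x, x> = 1; Lemma 5 bounds the std by sqrt(pi/(2d)) ~ 0.028 here
est = x @ qjl_dequantize(qjl_quantize(x))
```

With the variance bound of Lemma 5, the estimate should land within a few hundredths of the true inner product at this dimension.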

3. TurboQuant: High Performance Quantization

Section Summary: TurboQuant introduces two advanced vector quantization methods designed for high-dimensional data compression, with one focused on minimizing the mean squared error between original and reconstructed vectors, and the other aimed at ensuring accurate, unbiased estimates of inner products despite typical quantization biases. The MSE-optimal version works by randomly rotating the input vector to make its coordinates independent, then applying efficient scalar quantizers to each one individually, which can be precomputed for speed. Theoretical analysis shows that TurboQuant performs nearly as well as the mathematical limits allow, with error rates that drop sharply as more bits are used for encoding.

We developed two VQ algorithms, each tailored to a specific objective. The first algorithm is designed to minimize the MSE between the original and reconstructed vectors after quantization. The second algorithm is optimized for unbiased inner product estimation, addressing the bias inherent in MSE-optimal quantizers. These algorithms are detailed in the following subsections.

Furthermore, in Section 3.3, we establish information-theoretic lower bounds on the best achievable distortion rates for any vector quantizer. This analysis demonstrates that TurboQuant achieves near-optimality, differing from the lower bound by only a small constant factor across all bit-widths.

3.1 MSE Optimal TurboQuant

Let ${\bm x} \in \mathbb{S}^{d-1}$ be a (worst-case) vector on the unit sphere in dimension $d$. We aim to quantize ${\bm x}$ to $b$ bits per coordinate while minimizing the reconstruction MSE. We start by randomizing this vector by multiplying it with a random rotation matrix $\boldsymbol{\Pi} \in \mathbb{R}^{d \times d}$, which can be generated by applying a QR decomposition to a random matrix with i.i.d. normal entries.

The resulting rotated vector, $\boldsymbol{\Pi} \cdot {\bm x}$, is uniformly distributed on the unit sphere $\mathbb{S}^{d-1}$. Each coordinate of $\boldsymbol{\Pi} \cdot {\bm x}$ therefore follows a Beta-type distribution, which converges to a normal distribution in high dimensions. Furthermore, in high dimensions, distinct coordinates of $\boldsymbol{\Pi} \cdot {\bm x}$ become nearly independent [56], allowing us to apply optimal scalar quantizers to each coordinate independently. Our task thus reduces to designing a scalar quantizer for random variables with the density $f_{X}(x) = \frac{\Gamma(d/2)}{\sqrt{\pi} \cdot \Gamma((d-1)/2)} \left(1 - x^2 \right)^{(d-3)/2}$ for $x \in [-1, 1]$.

The optimal scalar quantization problem, given a known probability distribution, can be framed as a continuous k-means problem in dimension one. Specifically, we aim to partition the interval $[-1, 1]$ into $2^b$ clusters/buckets. The optimal solution adheres to a Voronoi tessellation [18], meaning interval boundaries are the midpoints between consecutive centroids, when arranged in sorted order. Therefore, with $c_i$ 's denoting the centroids in ascending order, we can formulate the scalar quantization as the following k-means optimization problem:

$ \mathcal{C}(f_X, b) := \min_{-1 \le c_1 \le c_2 \le \ldots \le c_{2^b} \le 1} \sum_{i=1}^{2^b} \int_{\frac{c_{i-1} + c_i}{2}}^{\frac{c_i + c_{i+1}}{2}} |x - c_i|^2 \cdot f_X(x)~dx,\tag{4} $

with the boundary conventions $\frac{c_0 + c_1}{2} := -1$ and $\frac{c_{2^b} + c_{2^b+1}}{2} := 1$.

Note that $\mathcal{C}(f_X, b)$ in Eq. (4) denotes the optimal MSE cost function for bit-width $b$, a quantity we will bound to prove the upper bound on the end-to-end MSE of TurboQuant. The problem in Eq. (4) can be solved using iterative numerical methods to achieve any desired precision. We solve it once for a range of practically relevant bit-widths $b$ and store the results for future use by the quantizer.

For example, in moderately high dimensions $d$, where the distribution $f_X(x)$ closely approximates a normal distribution, the optimal quantization centroids for bit-widths $b = 1, 2$ are $\left\{ \pm \frac{\sqrt{2/\pi}}{\sqrt{d}} \right\}$ and $\left\{ \pm \frac{0.453}{\sqrt{d}}, \pm \frac{1.51}{\sqrt{d}} \right\}$, respectively.
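The centroids above can be reproduced with a short Lloyd-Max iteration in the Gaussian limit $f_X \to \mathcal{N}(0, 1/d)$. The sketch below works with the standardized density $\mathcal{N}(0,1)$ on a grid, so the resulting centroids should be divided by $\sqrt{d}$; grid size and iteration count are arbitrary choices:

```python
import numpy as np

def lloyd_max_gaussian(b, iters=200):
    """Fixed-rate scalar quantizer of Eq. (4) for N(0,1), on a discretized grid."""
    xs = np.linspace(-8.0, 8.0, 200_001)
    w = np.exp(-xs ** 2 / 2)  # unnormalized N(0,1) density
    w /= w.sum()
    k = 2 ** b
    # initialize centroids at evenly spaced quantiles
    c = np.interp((np.arange(k) + 0.5) / k, np.cumsum(w), xs)
    for _ in range(iters):
        # Voronoi boundaries are midpoints of consecutive centroids
        cell = np.searchsorted((c[:-1] + c[1:]) / 2, xs)
        # centroid update: conditional mean of each cell
        c = np.array([np.average(xs[cell == i], weights=w[cell == i])
                      for i in range(k)])
    cell = np.searchsorted((c[:-1] + c[1:]) / 2, xs)
    mse = float(np.sum(w * (xs - c[cell]) ** 2))  # C(f_X, b), up to the 1/d scale
    return c, mse

c1, m1 = lloyd_max_gaussian(1)  # centroids near +-sqrt(2/pi) ~ +-0.798
c2, m2 = lloyd_max_gaussian(2)  # centroids near +-0.453, +-1.510
```

The recovered per-coordinate costs (about $0.36$ and $0.117$ before the $1/d$ scaling) match the distortion values quoted in Theorem 6 below.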

Therefore the quantizer $Q_{\tt mse}: \mathbb{R}^d \to \{0, 1\}^{b \cdot d}$ first computes $\boldsymbol{\Pi} \cdot {\bm x}$ and then computes and stores the indices of the nearest centroids to each coordinate of this vector. The dequantization map $Q_{\tt mse}^{-1}: \{0, 1\}^{b \cdot d} \to \mathbb{R}^d$ reconstructs the vector by retrieving the centroids corresponding to the stored indices and then rotating the result back to the original basis through multiplication with $\boldsymbol{\Pi}^\top$. A pseudocode for these procedures is given in Algorithm 1.

1: **input:** dimension $d$ and bit-width $b$
   // Global Parameters for Setting up $\textsc{TurboQuant}_{\tt mse}$
2: Generate a random rotation matrix $\boldsymbol{\Pi} \in \mathbb{R}^{d \times d}$
3: Construct codebook by finding centroids $c_1, c_2, \ldots c_{2^b} \in [-1, 1]$ that minimize MSE cost in Eq. (4)

4: **Procedure** $\textsc{Quant}_{\tt mse}({\bm x})$
5:     ${\bm y} \gets \boldsymbol{\Pi} \cdot {\bm x}$
6:     ${\tt idx}_j \gets \arg\min_{k \in [2^b]} |{\bm y}_j - c_k|$ for every $j \in [d]$    { ${\tt idx}_j$ 's are $b$-bit integers }
7:     **output:** ${\tt idx}$

8: **Procedure** $\textsc{DeQuant}_{\tt mse}({\tt idx})$
9:     $\tilde{{\bm y}}_j \gets c_{{\tt idx}_j}$ for every $j \in [d]$
10:    $\tilde{{\bm x}} \gets \boldsymbol{\Pi}^\top \cdot \tilde{{\bm y}}$
11:    **output:** $\tilde{{\bm x}}$
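A direct NumPy transcription of the two procedures above, shown for $b = 1$ with the closed-form codebook $\pm\sqrt{2/\pi}/\sqrt{d}$ quoted earlier (the dimension and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128
codebook = np.array([-1.0, 1.0]) * np.sqrt(2 / np.pi) / np.sqrt(d)  # b = 1

# line 2: random rotation via QR decomposition of a Gaussian matrix
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quant_mse(x):
    y = Pi @ x                                                         # line 5
    return np.argmin(np.abs(y[:, None] - codebook[None, :]), axis=1)   # line 6

def dequant_mse(idx):
    return Pi.T @ codebook[idx]                                        # lines 9-10

# empirical MSE on random unit vectors; Theorem 6 predicts ~0.36 for b = 1
errs = []
for _ in range(50):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    errs.append(np.sum((x - dequant_mse(quant_mse(x))) ** 2))
mse = float(np.mean(errs))
```

At this dimension the empirical MSE should already concentrate near the $1 - 2/\pi \approx 0.36$ value stated in the theorem below.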

We are now ready to prove our main theorem for $\textsc{TurboQuant}_{\tt mse}$.

Theorem 6: performance guarantee: $\textsc{TurboQuant}_{\tt mse}$

For any bit-width $b \ge 1$ and any vector ${\bm x} \in \mathbb{S}^{d-1}$, the procedure $\textsc{Quant}_{\tt mse}({\bm x})$ in Algorithm 1 outputs an index vector ${\tt idx} \in [2^b]^d$.

When this index vector is passed to the primitive $\textsc{DeQuant}_{\tt mse}({\tt idx})$, it produces a reconstructed vector $\tilde{{\bm x}} \in \mathbb{R}^d$ that satisfies the following distortion bounds:

  • MSE defined as $D_{\tt mse} := \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}}\left[{\left| {\bm x} - \tilde{{\bm x}} \right|}_2^2\right]$ is bounded by $D_{\tt mse} \le \frac{\sqrt{3} \pi}{2} \cdot \frac{1}{4^{b}}$ for any $b \ge 0$.
  • For small bit-widths, specifically $b = 1, 2, 3, 4$ the MSE exhibits finer-grained distortion values: $D_{\tt mse} \approx {\bf 0.36}, {\bf 0.117}, {\bf 0.03}, {\bf 0.009}$, respectively.

Proof: We start the proof by showing that $D_{\tt mse} = d \cdot \mathcal{C}(f_X, b)$, where $\mathcal{C}(f_X, b)$ is the optimal MSE cost of the scalar quantizer defined in Eq. (4). Let $\tilde{{\bm y}}$ be defined as per line 9 of Algorithm 1. Since $\boldsymbol{\Pi}$ is a rotation matrix we can write ${\left| {\bm x} - \tilde{{\bm x}} \right|}_2 = {\left| \boldsymbol{\Pi} \cdot {\bm x} - \tilde{{\bm y}} \right|}_2$. Using the notation ${\bm y} = \boldsymbol{\Pi} \cdot {\bm x}$ as per line 5 of Algorithm 1 and plugging this into the definition of $D_{\tt mse}$ we can write:

$$ \begin{align*} D_{\tt mse} &= \mathop{{\mathbb{E}}} \left[{\left| {\bm y} - \tilde{{\bm y}} \right|}_2^2\right] \\ &= \sum_{j \in [d]} \mathop{{\mathbb{E}}}\left[| {\bm y}_j - \tilde{{\bm y}}_j|^2 \right] \\ &= \sum_{j \in [d]} \mathop{{\mathbb{E}}}\left[| {\bm y}_j - c_{{\tt idx}_j}|^2 \right] \\ &= d \cdot \mathop{{\mathbb{E}}}\left[| {\bm y}_1 - c_{{\tt idx}_1}|^2 \right] \\ &= d \cdot \min_{-1 \le c_1 \le c_2 \le \ldots \le c_{2^b} \le 1} \sum_{i=1}^{2^b} \int_{\frac{c_{i-1} + c_i}{2}}^{\frac{c_i + c_{i+1}}{2}} |x - c_i|^2 \cdot f_X(x)~dx \\ &= d \cdot \mathcal{C}(f_X, b). \end{align*} $$

The third equality above follows from the definition of $\tilde{{\bm y}}$ in line 9 of Algorithm 1, and the fourth equality follows because all ${\bm y}_j$'s have the identical distribution ${\bm y}_j \sim f_X(\cdot)$. The last two lines follow because $c_{{\tt idx}_j}$ is chosen to be the nearest centroid to each coordinate ${\bm y}_j$ in line 6, and the centroids themselves are chosen to minimize the cost in Eq. (4).

Now we must bound the optimal k-means cost $\mathcal{C}(f_X, b)$. For moderate values of $d$, $f_X \to \mathcal{N}(0, 1/d)$. By numerically solving the optimization problem in Eq. (4) for values $b = 1, 2, 3, 4$ we get that $\mathcal{C}(f_X, b) \approx \frac{0.36}{d}, \frac{0.117}{d}, \frac{0.03}{d}, \frac{0.009}{d}$, respectively. For larger bit-widths $b > 4$, we can apply the Panter-Dite [57] high-resolution formula for the distortion of a fixed-rate scalar quantizer, yielding the following bound:

$ \mathcal{C}(f_X, b) \le \frac{1}{12} \cdot \left(\int f_X(x)^{1/3}~dx \right)^3 \cdot \frac{1}{4^b} = \frac{\sqrt{3} \pi }{2 d} \cdot \frac{1}{4^b}. $

This completes the proof.

Entropy Encoding Codebook Pointers.

TurboQuant's efficiency can be further increased by applying entropy encoding to the indices that point to the closest codebook elements. Specifically, the probability of each codeword index appearing in the quantized vectors can be computed as $p_\ell := \int_{\frac{c_{\ell-1} + c_\ell}{2}}^{\frac{c_\ell + c_{\ell+1}}{2}} f_X(x)~dx$. Optimally coding the indices reduces the average bit-width to nearly the entropy of the distribution $\{ p_\ell \}_{\ell \in [2^b]}$. This lossless compression does not affect the distortion and provides a bit-width reduction at no cost. The most significant reduction occurs for $b=4$, where the entropy of $\{ p_\ell \}_{\ell \in [2^b]}$ is approximately $3.8$ bits. Detailed calculations for optimal prefix codes reveal that the average bit-width can be reduced by $5\%$. However, given the limited gain, we have chosen not to incorporate this technique into TurboQuant to maintain simplicity and speed.
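The $\approx 3.8$-bit figure can be sanity-checked in the Gaussian limit: build the $b=4$ Lloyd-Max quantizer for $\mathcal{N}(0,1)$ on a grid, read off the cell probabilities $p_\ell$, and compute their entropy (a rough sketch; grid size and iteration count are arbitrary choices):

```python
import numpy as np

xs = np.linspace(-8.0, 8.0, 100_001)
w = np.exp(-xs ** 2 / 2)
w /= w.sum()  # discretized N(0,1), the large-d limit of f_X

k = 16  # b = 4 gives 2^b = 16 codewords
c = np.interp((np.arange(k) + 0.5) / k, np.cumsum(w), xs)
for _ in range(100):  # Lloyd iterations: midpoint boundaries, then conditional means
    cell = np.searchsorted((c[:-1] + c[1:]) / 2, xs)
    c = np.array([np.average(xs[cell == i], weights=w[cell == i]) for i in range(k)])

cell = np.searchsorted((c[:-1] + c[1:]) / 2, xs)
p = np.array([w[cell == i].sum() for i in range(k)])  # codeword probabilities p_l
H = float(-(p * np.log2(p)).sum())                    # entropy of the index stream
```

The computed entropy falls noticeably below the nominal 4 bits, consistent with the modest saving discussed above.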

3.2 Inner-product Optimal TurboQuant

For important applications like nearest neighbor search, having an unbiased inner product estimator is essential. However, $\textsc{TurboQuant}_{\tt mse}$ does not provide unbiased inner product estimates with query vectors. To illustrate this, consider the case with a bit-width of $b=1$. In this scenario, the optimal codebook that solves the optimization problem in Eq. (4), for sufficiently large $d$, is $\left\{ \pm \sqrt{\frac{2}{\pi d}} \right\}$. This implies that the quantization map for $\textsc{TurboQuant}_{\tt mse}$ is $Q_{\tt mse}({\bm x}) = {\tt sign} \left(\boldsymbol{\Pi} \cdot {\bm x} \right)$ for any ${\bm x} \in \mathbb{R}^d$, and the dequantization map is $Q_{\tt mse}^{-1}({\bm z}) = \sqrt{\frac{2}{\pi d}} \cdot \boldsymbol{\Pi}^\top \cdot {\bm z}$ for any ${\bm z} \in \{-1, +1\}^d$. Therefore, for large enough $d$, we have $\mathop{{\mathbb{E}}}\left[\left< {\bm y}, Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right) \right> \right] = \frac{2}{\pi} \cdot \langle {\bm y}, {\bm x} \rangle$, which has a multiplicative bias of $2/\pi$. This bias diminishes with increasing bit-width $b$, as we empirically demonstrate in Section 4.1.
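This $2/\pi$ bias is easy to observe empirically with the $b=1$ maps just described; the dimension and the correlation level of the query are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2048
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
n = rng.standard_normal(d)
y = x + 0.5 * n / np.linalg.norm(n)  # query vector correlated with x
y /= np.linalg.norm(y)

# b = 1: Q_mse(x) = sign(Pi x) and Q_mse^{-1}(z) = sqrt(2/(pi d)) Pi^T z
est = y @ (np.sqrt(2 / (np.pi * d)) * Pi.T @ np.sign(Pi @ x))
true = y @ x
ratio = float(est / true)  # concentrates near 2/pi ~ 0.64, not near 1
```

The estimate systematically undershoots the true inner product by roughly the $2/\pi$ factor rather than fluctuating around it.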

To address this bias, we propose a solution that combines $\textsc{TurboQuant}_{\tt mse}$ with an instance of QJL [20]. Specifically, let $Q_{\tt mse}$ be the quantization map corresponding to $\textsc{TurboQuant}_{\tt mse}$ with a bit-width of $b-1$. For any ${\bm x} \in \mathbb{S}^{d-1}$ the residual vector, defined as ${\bm r} := {\bm x} - Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right)$, has a small L2 norm: by Theorem 6, $\mathop{{\mathbb{E}}}\left[\left| {\bm r} \right|_2^2\right] = d \cdot \mathcal{C}(f_X, b-1)$. We can then apply the QJL quantization map $Q_{\tt qjl}$ to this residual vector, resulting in an overall bit-width of $b$ and providing the following unbiased inner product estimator:

$ \left< {\bm y}, Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right) \right> + \left| {\bm r} \right|_2 \cdot \left< {\bm y}, Q_{\tt qjl}^{-1}\left(Q_{\tt qjl}({\bm r}) \right) \right>. $

More formally, the quantization map $Q_{\tt prod}: \mathbb{S}^{d-1} \to [2^{b-1}]^d \times \{-1, 1\}^d \times \mathbb{R}$ is defined as:

$ Q_{\tt prod}({\bm x}) = \left[Q_{\tt mse}({\bm x}), ~Q_{\tt qjl}\left({\bm x} - Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right) \right), ~\left| {\bm x} - Q_{\tt mse}^{-1}\left(Q_{\tt mse}({\bm x}) \right) \right|_2 \right]. $

A pseudocode for this procedure is given in Algorithm 2.

1: **input:** dimension $d$ and bit-width $b$
   // Global Parameters for Setting up $\textsc{TurboQuant}_{\tt prod}$
2: Instantiate a $\textsc{TurboQuant}_{\tt mse}$ with bit-width $b-1$ as per Algorithm 1
3: Generate a random projection matrix ${\bm S} \in \mathbb{R}^{d \times d}$ with i.i.d. entries ${\bm S}_{i,j} \sim \mathcal{N}(0, 1)$

4: **Procedure** $\textsc{Quant}_{\tt prod}({\bm x})$
5:     ${\tt idx} \gets \textsc{Quant}_{\tt mse}({\bm x})$
6:     ${\bm r} \gets {\bm x} - \textsc{DeQuant}_{\tt mse}({\tt idx})$    {residual vector}
7:     ${\tt qjl} \gets {\tt sign} \left( {\bm S} \cdot {\bm r} \right)$    {QJL on residual vector}
8:     **output:** $({\tt idx}, {\tt qjl}, \|{\bm r}\|_2)$

9: **Procedure** $\textsc{DeQuant}_{\tt prod}({\tt idx}, {\tt qjl}, \gamma)$
10:    $\tilde{{\bm x}}_{\tt mse} \gets \textsc{DeQuant}_{\tt mse}({\tt idx})$
11:    $\tilde{{\bm x}}_{\tt qjl} \gets \frac{\sqrt{\pi/2}}{d} \cdot \gamma \cdot {\bm S}^\top \cdot {\tt qjl}$
12:    **output:** $\tilde{{\bm x}}_{\tt mse} + \tilde{{\bm x}}_{\tt qjl}$
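A compact NumPy transcription of the pseudocode above, shown for total bit-width $b = 2$ so the MSE stage uses the 1-bit codebook $\pm\sqrt{2/\pi}/\sqrt{d}$; the dimension is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1024
codebook = np.array([-1.0, 1.0]) * np.sqrt(2 / np.pi) / np.sqrt(d)  # 1-bit MSE stage
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # rotation for the MSE stage
S = rng.standard_normal((d, d))                    # QJL projection matrix (line 3)

def quant_prod(x):
    idx = (Pi @ x > 0).astype(int)        # nearest of the two centroids (line 5)
    r = x - Pi.T @ codebook[idx]          # residual vector (line 6)
    return idx, np.sign(S @ r), np.linalg.norm(r)  # (idx, qjl, gamma)

def dequant_prod(idx, qjl, gamma):
    x_mse = Pi.T @ codebook[idx]                          # line 10
    x_qjl = np.sqrt(np.pi / 2) / d * gamma * (S.T @ qjl)  # line 11
    return x_mse + x_qjl                                  # line 12

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
est = float(x @ dequant_prod(*quant_prod(x)))  # unbiased estimate of <x, x> = 1
```

In contrast to the $b=1$ MSE-only quantizer, the residual QJL stage removes the multiplicative bias, so the estimate fluctuates tightly around the true value.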

We prove the main result for $\textsc{TurboQuant}_{\tt prod}$ in the following theorem.

Theorem 7: performance guarantee: $\textsc{TurboQuant}_{\tt prod}$

For any bit-width $b \ge 1$ and any vector ${\bm x} \in \mathbb{S}^{d-1}$, the procedure $\textsc{Quant}_{\tt prod}({\bm x})$ in Algorithm 2 outputs an index vector ${\tt idx} \in [2^{b-1}]^d$ along with a sign vector ${\tt qjl} \in \{-1, 1\}^d$ and a positive number $\gamma \ge 0$.

When these vectors and the scalar value are passed to the primitive $\textsc{DeQuant}_{\tt prod}({\tt idx}, {\tt qjl}, \gamma)$, it produces a reconstructed vector $\tilde{{\bm x}} \in \mathbb{R}^d$ that for any vector ${\bm y} \in \mathbb{R}^d$ satisfies the following properties:

  • Expected inner-product $\mathop{{\mathbb{E}}}_{\tilde{{\bm x}}}\left[\left< {\bm y}, \tilde{{\bm x}} \right> \right] = \langle {\bm y}, {\bm x} \rangle$
  • Inner-product distortion defined as $D_{\tt prod} := \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}}\left[\left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, \tilde{{\bm x}} \rangle \right|^2 \right]$ is bounded by $D_{\tt prod} \le \frac{\sqrt{3} \pi^2 \cdot \left| {\bm y} \right|_2^2}{d} \cdot \frac{1}{4^{b}}$ for any $b \ge 1$.
  • For small bit-widths, specifically $b = 1, 2, 3, 4$, $D_{\tt prod}$ exhibits finer-grained distortion values: $D_{\tt prod} \approx \frac{\bf 1.57}{d}, \frac{\bf 0.56}{d}, \frac{\bf 0.18}{d}, \frac{\bf 0.047}{d}$, respectively.

Proof: First we compute the conditional expectation of the inner product estimate $\langle {\bm y}, \tilde{{\bm x}} \rangle$ conditioned on $\tilde{{\bm x}}_{\tt mse}$ as follows:

$$ \begin{align*} \mathop{{\mathbb{E}}} \left[\langle {\bm y}, \tilde{{\bm x}} \rangle \,|\, \tilde{{\bm x}}_{\tt mse} \right] &= \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt qjl}} \left[\langle {\bm y}, \tilde{{\bm x}}_{\tt mse} + \tilde{{\bm x}}_{\tt qjl} \rangle \,|\, \tilde{{\bm x}}_{\tt mse} \right] \\ &= \langle {\bm y}, \tilde{{\bm x}}_{\tt mse} \rangle + \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt qjl}} \left[\langle {\bm y}, \tilde{{\bm x}}_{\tt qjl} \rangle \,|\, \tilde{{\bm x}}_{\tt mse} \right] \\ &= \langle {\bm y}, \tilde{{\bm x}}_{\tt mse} \rangle + \langle {\bm y}, {\bm r} \rangle \\ &= \langle {\bm y}, {\bm x} \rangle, \end{align*} $$

where the first equality follows from the definition of $\tilde{{\bm x}}$ in line 12 of Algorithm 2. The third equality follows from the unbiasedness of QJL (Lemma 5), and the last line follows from the definition of the residual vector ${\bm r} = {\bm x} - \tilde{{\bm x}}_{\tt mse}$ in line 6. Now we can compute the unconditional expectation using the law of total expectation: $\mathop{{\mathbb{E}}}_{\tilde{{\bm x}}}\left[\left< {\bm y}, \tilde{{\bm x}} \right> \right] = \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt mse}} \left[\mathop{{\mathbb{E}}} \left[\langle {\bm y}, \tilde{{\bm x}} \rangle \,|\, \tilde{{\bm x}}_{\tt mse} \right] \right] = \mathop{{\mathbb{E}}}[\langle {\bm y}, {\bm x} \rangle] = \langle {\bm y}, {\bm x} \rangle$, which proves the first claim of the theorem.

We apply the same conditioning on $\tilde{{\bm x}}_{\tt mse}$, when computing the distortion, and then compute the resulting conditional distortion:

$$ \begin{align*} \mathop{{\mathbb{E}}}\left[\left. \left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, \tilde{{\bm x}} \rangle \right|^2 \right| \tilde{{\bm x}}_{\tt mse} \right] &= \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt qjl}}\left[\left. \left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, \tilde{{\bm x}}_{\tt mse} + \tilde{{\bm x}}_{\tt qjl} \rangle \right|^2 \right| \tilde{{\bm x}}_{\tt mse} \right] \\ &= \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt qjl}}\left[\left. \left| \langle {\bm y}, {\bm r} \rangle - \langle {\bm y}, \tilde{{\bm x}}_{\tt qjl} \rangle \right|^2 \right| \tilde{{\bm x}}_{\tt mse} \right] \\ &= \mathtt{Var} \left(\left. \langle {\bm y}, \tilde{{\bm x}}_{\tt qjl} \rangle \right| \tilde{{\bm x}}_{\tt mse} \right) \\ &\le \frac{\pi}{2d} \cdot \left| {\bm r} \right|_2^2 \left| {\bm y} \right|_2^2, \end{align*} $$

where the second equality above follows from the definitions of ${\bm r}$ and $\tilde{{\bm x}}_{\tt mse}$ in lines 6 and 10 of Algorithm 2. The third line follows because $\mathop{{\mathbb{E}}}[\langle {\bm y}, \tilde{{\bm x}}_{\tt qjl} \rangle] = \langle {\bm y}, {\bm r} \rangle$ by Lemma 5. The last line follows from the variance bound of the QJL estimator in Lemma 5, using the fact that $\tilde{{\bm x}}_{\tt qjl}$ in line 11 is re-scaled by $\gamma = \left| {\bm r} \right|_2$.

Now by law of total expectation along with the fact that ${\bm r} = {\bm x} - \tilde{{\bm x}}_{\tt mse}$ we can bound the inner product distortion as follows:

$$ \begin{align*} D_{\tt prod} &= \mathop{{\mathbb{E}}}_{\tilde{{\bm x}}_{\tt mse}} \left[\mathop{{\mathbb{E}}}\left[\left. \left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, \tilde{{\bm x}} \rangle \right|^2 \right| \tilde{{\bm x}}_{\tt mse} \right] \right] \\ &\le \frac{\pi}{2 d} \cdot \left| {\bm y} \right|_2^2 \cdot \mathop{{\mathbb{E}}}\left[{\left| {\bm x} - \tilde{{\bm x}}_{\tt mse} \right|}_2^2\right] \\ &= \frac{\pi}{2 d} \cdot \left| {\bm y} \right|_2^2 \cdot D_{\tt mse}. \end{align*} $$

The theorem follows by invoking the MSE bounds from Theorem 6 with bit-width $b-1$.

3.3 Lower Bounds

We show that TurboQuant achieves an optimal distortion rate, up to a small constant factor, for any bit-width by proving lower bounds on the best achievable distortion for any compression algorithm. Our lower bound proof leverages Yao's minimax principle. This principle allows us to relate the lower bound for randomized algorithms with worst-case deterministic input vectors to the lower bound for deterministic algorithms with randomized input vectors. Subsequently, we derive a lower bound on the achievable distortion rate for the latter using the corollary of Shannon's lower bound (SLB) presented in Lemma 3. Formally, we prove the following theorem.

Theorem 8: lower bound on best achievable compression distortion

For any randomized quantization algorithm $Q: \mathbb{S}^{d-1} \to \{0, 1\}^{b \cdot d}$ with bit-width $b$ and any reconstruction map $Q^{-1}: \{0, 1\}^{b \cdot d} \to \mathbb{R}^d$, there exists a hard input instance ${\bm x} \in \mathbb{S}^{d-1}$ such that:

$ D_{\tt mse}(Q) := \mathop{{\mathbb{E}}}\left[\left| {\bm x} - Q^{-1}\left(Q({\bm x}) \right) \right|_2^2 \right] \ge \frac{1}{4^{b}}. $

Furthermore, there exists a ${\bm y} \in \mathbb{S}^{d-1}$ such that:

$ D_{\tt prod}(Q) = \mathop{{\mathbb{E}}}\left[\left| \langle {\bm y}, {\bm x} \rangle - \langle {\bm y}, Q^{-1}\left(Q({\bm x}) \right) \rangle \right|^2 \right] \ge \frac{1}{d} \cdot \frac{1}{4^{b}}. $

Proof: By Yao's minimax principle the expected MSE of the optimal randomized compression algorithm for worst-case inputs ($D_{\tt mse}$) is equal to the expected MSE of the optimal deterministic compression algorithm when applied to inputs drawn from a maximally difficult randomized distribution. By definition, the MSE of the latter scenario is lower-bounded by the best achievable MSE for inputs uniformly distributed on the unit hypersphere.

The best achievable MSE for a compression algorithm with total bit budget $b \cdot d$, operating on uniformly distributed inputs from the sphere $\mathbb{S}^{d-1}$, is lower bounded in Lemma 3. Therefore, by invoking Lemma 3 with $B = b \cdot d$ we conclude that $D_{\tt mse} \ge \frac{1}{4^b}$.

Furthermore, using the definition of $D_{\tt mse}$ and expanding it coordinate-wise, we can write:

$$ \begin{align*} D_{\tt mse} &= \sum_{j=1}^d \mathop{{\mathbb{E}}} \left[\left| {\bm x}_j - \left[Q^{-1}\left(Q({\bm x}) \right) \right]_j \right|^2 \right] \\ &= \sum_{j=1}^d \mathop{{\mathbb{E}}} \left[\left| \langle {\bm e}_j, {\bm x} \rangle - \langle {\bm e}_j, Q^{-1}\left(Q({\bm x}) \right) \rangle \right|^2 \right] \\ &\ge \frac{1}{4^b}. \end{align*} $$

By the pigeonhole principle there exists an index $j \in [d]$ such that $\mathop{{\mathbb{E}}} \left[\left| \langle {\bm e}_j, {\bm x} \rangle - \langle {\bm e}_j, Q^{-1}\left(Q({\bm x}) \right) \rangle \right|^2 \right] \ge \frac{1}{d} \cdot \frac{1}{4^b}$, which completes the proof.

We note that a comparable lower bound for the worst-case distortion in vector quantization can be derived using "sphere packing" arguments (indeed, with larger constants as this is a harder problem) [58]. However, Theorem 8 offers a more robust and relevant lower bound for our analysis. This is because it establishes a lower bound on the expected distortion, rather than the worst-case error, and aligns seamlessly with our upper bounds presented in Theorems 6 and 7.

4. Experiments

Section Summary: The experiments, run on a single NVIDIA A100 GPU, are split into validating theoretical predictions and testing the methods on real-world tasks like compressing key-value caches in AI models and searching for similar vectors. Using a dataset of text embeddings, researchers compared two quantization techniques: one optimized for inner product calculations that stays unbiased across bit levels, and another for overall error that shows bias but improves with more bits, with results matching the theory closely. In practical tests, the TurboQuant method excelled in retrieving hidden information from long documents and handling extended text generation, matching full-precision performance even when compressed fourfold, and outperforming other compression approaches.

All experiments are performed using a single NVIDIA A100 GPU. The experimental section is divided into two parts: one to empirically validate the theoretical results, and another to evaluate the performance of our methods on downstream tasks, specifically KV cache quantization and nearest neighbor vector search.

4.1 Empirical Validation

**Figure 1:** Error distribution of $\textsc{TurboQuant}_\text{\tt prod}$ and $\textsc{TurboQuant}_\text{\tt mse}$ for Inner Product Estimation.

In this section, we verify the theoretical results established in previous sections. We conduct our experiments using the DBpedia Entities dataset, which has been encoded into a 1536-dimensional space using OpenAI3 embeddings. To perform our experiments, we randomly sample 100,000 data points from the dataset, denoted as the training set, which serves as our primary dataset. Additionally, we extract 1,000 distinct entries, denoted as the query set, to be used as query points.

We evaluate two quantization methods: $\textsc{TurboQuant}_\text{\tt prod}$ and $\textsc{TurboQuant}_\text{\tt mse}$. The method $\textsc{TurboQuant}_\text{\tt mse}$ is optimized for minimizing the mean squared error (MSE) between the quantized and original vectors. In contrast, $\textsc{TurboQuant}_\text{\tt prod}$ is unbiased for estimating inner products with the original vectors.

**Figure 2:** The variance of Inner-product error remains constant for $\textsc{TurboQuant}_{\tt prod}$, while in $\textsc{TurboQuant}_{\tt mse}$ increases with the average inner product. Bit-width is $b=2$.

Both methods are applied to the task of inner product estimation by quantizing the training set and analyzing the distortion in inner product calculations across different bit-widths. As shown in Figure 1, increasing the bit-width reduces variance in both methods. However, when used for inner product estimation, $\textsc{TurboQuant}_\text{\tt mse}$ introduces bias. This bias diminishes as the bit-width increases and eventually converges to zero.

The experimental results, illustrated in Figure 1, confirm that $\textsc{TurboQuant}_\text{\tt prod}$ remains unbiased for inner product estimation across all bit-widths, while $\textsc{TurboQuant}_\text{\tt mse}$ gradually improves with increasing bit-width.

As observed in Figure 2, when quantizing to 2 bits, the variance remains constant regardless of the inner product of the original vectors in the $\textsc{TurboQuant}_\text{\tt prod}$ approach. However, the same plot indicates that the error of the $\textsc{TurboQuant}_\text{\tt mse}$ approach depends on the average inner product: as the average inner product increases, so does the error.

**Figure 3:** Comparison of inner-product error and MSE against theoretical bounds across different bit ratios.

Along with the histograms, we also plot in Figure 3 the average inner product error and MSE between the original and quantized vectors across different bit ratios. These plots are drawn alongside the upper and lower bounds established in our theoretical analysis. Our observations confirm that the results align with the theoretical predictions. Specifically, for inner product estimation, the $\textsc{TurboQuant}_\text{\tt prod}$ approach performs better at lower bit ratios. However, as the bit count increases, $\textsc{TurboQuant}_\text{\tt mse}$ reduces bias and ultimately achieves superior performance in inner product estimation.

4.2 Needle-In-A-Haystack

**Figure 4:** Evaluation of `Llama-3.1-8B-Instruct` on the "Needle-In-A-Haystack" test, where a model must retrieve a hidden sentence from long-context sequences. While some methods struggle with recall, TurboQuant, despite being more than **$4 \times$** quantized, achieves the same exact performance as the uncompressed baseline.

The "Needle-In-A-Haystack" test [59] is a benchmark designed to evaluate a model's ability to retrieve specific information embedded within a long document. The test involves placing a unique sentence (the "needle") at an arbitrary location within a much larger text (the "haystack") and assessing whether the model can successfully extract it.

Following the experimental setup of [60], we conduct evaluations using the `Llama-3.1-8B-Instruct` model. To analyze performance across different input sequence lengths, we vary the document size from 4k to 104k tokens. The primary metric used for evaluation is the recall score, which measures how accurately the model retrieves the hidden sentence.

For comparison, we benchmark our approach against several state-of-the-art memory-efficient methods, including PolarQuant [27], SnapKV [41], PyramidKV [61], and KIVI [25]. Each method is tested under a memory compression ratio of 0.25, meaning that only 25% of the full KV cache is utilized.

The results, illustrated in Figure 4, reveal that quantization methods with theoretical guarantees, such as PolarQuant and TurboQuant, outperform token-level compression techniques like SnapKV and PyramidKV, as well as scalar quantization approaches like KIVI, which lack formal theoretical guarantees. Notably, TurboQuant achieves identical performance to the full-precision model, even at $4 \times$ compression, making it a robust solution for long-context processing.

4.3 End-to-end Generation on LongBench

We experiment with various KV cache compression algorithms on the LongBench dataset [62], which encompasses a broad range of long-text scenarios, including single- and multi-document question-answering, summarization, few-shot learning, synthetic tasks, and code completion. To ensure a balanced evaluation across different context lengths, we employ LongBench-E, a subset designed with a more uniform length distribution. This enables a fair assessment of each model’s performance across varying context sizes, making it a more reliable benchmark for evaluating compression techniques.

We compare TurboQuant against the leading baseline methods introduced in Section 4.2, using both `Llama-3.1-8B-Instruct` and `Ministral-7B-Instruct`. Unlike existing approaches such as KIVI and PolarQuant, which leave generated tokens unquantized, our method applies quantization even during the streaming generation process.

As shown in Table 1, our approach outperforms other methods for both `Llama-3.1-8B-Instruct` and `Ministral-7B-Instruct`, achieving significantly higher average scores. We evaluate our method using 2.5-bit and 3.5-bit quantization during text generation. These non-integer bit precisions result from our strategy of splitting channels into outlier and non-outlier sets and applying two independent instances of TurboQuant to each, allocating higher bit precision to outliers. This outlier treatment strategy is consistent with prior work [63, 64]. For example, in our 2.5-bit setup, 32 outlier channels are quantized at 4 bits, while the remaining 96 channels use 2 bits, leading to an effective bit precision of $(32 \times 4 + 96 \times 2) / 128 = 2.5$. For 3.5-bit quantization, a different ratio of outlier and regular channels leads to a higher effective bit precision. Despite using fewer bits than competing techniques, TurboQuant maintains performance comparable to unquantized models. Remarkably, we achieve this while compressing quantized vectors by at least a factor of $4.5 \times$.


Table 1: LongBench-V1 [62] results of various KV cache compression methods on $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$.


4.4 Nearest Neighbor Search Experiments

In this section, we establish the strength of our proposed method in the context of nearest-neighbor search. We conduct our experiments using the DBpedia Entities dataset [65], which has been encoded into 1536-dimensional and 3072-dimensional spaces using OpenAI embeddings. Additionally, we evaluate performance on a lower-dimensional dataset, utilizing the standard GloVe [66] embeddings. To construct our experimental setup, we randomly sample 100,000 data points from each dataset, denoted the training set, which serves as our primary training and evaluation set. For datasets that do not explicitly provide a query set, we extract 1,000 distinct entries, denoted the query set, to use as query points. For the GloVe dataset, we use a pre-existing query set consisting of 10,000 points.

We compare our method, TurboQuant, against two baseline quantization approaches: Product Quantization (PQ) and RabitQ [54]. To ensure a fair comparison, we quantize the training set using all three methods and evaluate their performance based on the recall ratio at top-$k$, denoted 1@$k$. Specifically, this metric measures how often the true top inner-product result is captured within the top-$k$ approximate results returned by each algorithm.
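The 1@$k$ metric can be computed with a brute-force sketch like the one below. This is an illustrative evaluation helper, not code from any of the compared systems; `recall_1_at_k` is a hypothetical name.

```python
import numpy as np

def recall_1_at_k(queries: np.ndarray, data: np.ndarray,
                  data_quantized: np.ndarray, k: int) -> float:
    """Fraction of queries whose exact top-1 inner-product neighbor appears
    among the top-k neighbors retrieved from the quantized dataset."""
    # Exact top-1 neighbor ids under inner-product similarity.
    true_top1 = np.argmax(queries @ data.T, axis=1)
    # Top-k ids under the quantized (approximate) representation.
    approx_topk = np.argsort(-(queries @ data_quantized.T), axis=1)[:, :k]
    hits = [t in row for t, row in zip(true_top1, approx_topk)]
    return float(np.mean(hits))
```

As a sanity check, evaluating a dataset against its own unquantized copy yields a recall of 1.0 for any $k \geq 1$; distortion introduced by quantization can only lower this number.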

:Table 2: Quantization time (in seconds) for different approaches across various dimensions using 4-bit quantization.

| Approach             | d=200  | d=1536  | d=3072  |
|----------------------|--------|---------|---------|
| Product Quantization | 37.04  | 239.75  | 494.42  |
| RabitQ               | 597.25 | 2267.59 | 3957.19 |
| TurboQuant           | 0.0007 | 0.0013  | 0.0021  |

Product Quantization (PQ) relies on the k-means algorithm to construct codebooks, which must be stored separately. As the number of bits increases, the codebook size grows exponentially, leading to additional storage overhead. In our experiments, we carefully tuned the parameters to match the bit allocation of the other methods. The most efficient implementation, designed for rapid querying, employs AVX2 in-register lookup tables (LUTs); specifically, it uses LUT16 with $\ell = 16$ codewords. However, we observed substantial quality degradation with this configuration. To balance speed and accuracy, we opted for a version of PQ that uses LUT256, which contains 256 codewords. For 2-bit quantization, it groups 4 coordinates per lookup, while for 4-bit quantization, it groups 2 coordinates per lookup. Notably, since we use the same dataset for both training and evaluation, PQ benefits from an inherent advantage in this setup.
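The LUT256 bit accounting above follows directly from the codebook size: a 256-codeword codebook costs $\log_2 256 = 8$ bits per group, amortized over the coordinates in that group. A minimal sketch of this arithmetic (the helper name is ours, not from any PQ library):

```python
import math

def pq_bits_per_coord(codewords: int, coords_per_group: int) -> float:
    """Bits per coordinate for a PQ codebook of `codewords` entries
    shared across `coords_per_group` consecutive coordinates."""
    return math.log2(codewords) / coords_per_group

# LUT256 configurations from the text:
print(pq_bits_per_coord(256, 4))  # 4 coords per lookup → 2.0 bits/coord
print(pq_bits_per_coord(256, 2))  # 2 coords per lookup → 4.0 bits/coord
```

This also makes the exponential codebook growth concrete: holding bits-per-coordinate fixed while widening each group multiplies the number of codewords (and hence codebook storage) by $2^{\text{bits}}$ per added coordinate.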

RabitQ. Unlike PQ, RabitQ lacks a fully vectorized implementation, making it impossible to leverage GPU acceleration; as a result, it runs significantly slower on CPU. Additionally, the method incurs extra computational overheads that we do not explicitly account for in the bit-rate comparisons: while RabitQ claims a certain bit rate, in practice it uses more bits than reported due to these inefficiencies.

Despite the advantages granted to the baseline methods, TurboQuant consistently outperforms both Product Quantization and RabitQ in terms of recall ratio across all experiments. This demonstrates the robustness and efficiency of our approach, making it a compelling alternative for high-dimensional quantization-based search tasks.

**Figure 5:** Recall comparison on different datasets with different embedding dimensions.

References


[1] Shannon, C. E. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.

[2] Shannon, C. E. et al. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec, 4(142-163):1, 1959.

[3] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[4] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[5] Anthropic. Claude, 2024. https://www.anthropic.com/news/claude-3-family.

[6] Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[7] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. NeurIPS, 2017.

[9] Elastic vector search, 2025. https://www.elastic.co/enterprise-search/vector-search.

[10] Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, pp. 3887–3896. PMLR, 2020.

[11] Pinecone vector database, 2025. https://www.pinecone.io/.

[12] Qdrant vector search, 2025. https://qdrant.tech/.

[13] pgvector search, 2025. https://github.com/pgvector/pgvector/.

[14] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2, 2023.

[15] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., and Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

[16] Khattab, O. and Zaharia, M. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pp. 39–48, 2020.

[17] Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488, 2021.

[18] Lloyd, S. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.

[19] Max, J. Quantizing for minimum distortion. IRE Transactions on Information Theory, 6(1):7–12, 1960.

[20] Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead, 2024a. URL https://arxiv.org/abs/2406.03482.

[21] Zador, P. L. Development and evaluation of procedures for quantizing multivariate distributions. Stanford University, 1964.

[22] Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on information theory, 25(4):373–380, 1979.

[23] Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

[24] Ashkboos, S., Mohtashami, A., Croci, M. L., Li, B., Cameron, P., Jaggi, M., Alistarh, D., Hoefler, T., and Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456, 2024.

[25] Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024b.

[26] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024.

[27] Han, I., Kacham, P., Karbasi, A., Mirrokni, V., and Zandieh, A. Polarquant: Quantizing kv caches with polar transformation. arXiv preprint arXiv:2502.02617, 2025a.

[28] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.

[29] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

[30] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024.

[31] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023a.

[32] Chee, J., Cai, Y., Kuleshov, V., and De Sa, C. M. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36:4396–4429, 2023.

[33] Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

[34] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895–4901, 2023.

[35] Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.

[36] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

[37] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024b.

[38] Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024a.

[39] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023b.

[40] Zandieh, A., Han, I., Mirrokni, V., and Karbasi, A. Subgen: Token generation in sublinear time and memory. arXiv preprint arXiv:2402.06082, 2024c.

[41] Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.

[42] Han, I., Kapralov, M., Kochetkova, E., Sheth, K., and Zandieh, A. Balancekv: Kv cache compression through discrepancy theory. arXiv preprint arXiv:2502.07861, 2025b.

[43] Yue, Y., Yuan, Z., Duanmu, H., Zhou, S., Wu, J., and Nie, L. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065, 2024.

[44] Yang, J. Y., Kim, B., Bae, J., Kwon, B., Park, G., Yang, E., Kwon, S. J., and Lee, D. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024.

[45] Dong, S., Cheng, W., Qin, J., and Wang, W. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643, 2024.

[46] Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024.

[47] Zhang, T., Yi, J., Xu, Z., and Shrivastava, A. Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization. arXiv preprint arXiv:2405.03917, 2024a.

[48] Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024.

[49] Kim, J., Park, J., Cho, J., and Papailiopoulos, D. Lexico: Extreme kv cache compression via sparse coding over universal dictionaries. arXiv preprint arXiv:2412.08890, 2024.

[50] Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.

[51] Babenko, A. and Lempitsky, V. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931–938, 2014.

[52] Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2946–2953, 2013.

[53] Wang, J., Zhang, T., Sebe, N., Shen, H. T., et al. A survey on learning to hash. IEEE transactions on pattern analysis and machine intelligence, 40(4):769–790, 2017.

[54] Gao, J., Gou, Y., Xu, Y., Yang, Y., Long, C., and Wong, R. C.-W. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. arXiv preprint arXiv:2409.09913, 2024.

[55] Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.

[56] Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

[57] Panter, P. and Dite, W. Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1):44–48, 1951.

[58] Gersho, A. On the structure of vector quantizers. IEEE Transactions on Information Theory, 28(2):157–166, 1982.

[59] Kamradt, G. Needle in a haystack - pressure testing llms., 2023. https://github.com/gkamradt/LLMTest_NeedleInAHaystack.

[60] Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171, 2024. URL https://github.com/FranxYao/Long-Context-Data-Engineering.

[61] Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.

[62] Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.

[63] Zandieh, A., Daliri, M., and Han, I. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. arXiv preprint arXiv:2406.03482, 2024b.

[64] Su, Z., Chen, Z., Shen, W., Wei, H., Li, L., Yu, H., and Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations, 2025. URL https://arxiv.org/abs/2501.16383.

[65] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=wCu6T5xFjeJ.

[66] Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi:10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162/.