Results 1 to 10 of about 249,885
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models [PDF]
Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements.
Wenqi Shao+9 more
semanticscholar +1 more source
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models [PDF]
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time.
Guangxuan Xiao+4 more
semanticscholar +1 more source
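As background for the snippet's claim that quantization reduces memory and accelerates inference, here is a minimal sketch of symmetric per-tensor int8 weight quantization. It is a generic illustration, not SmoothQuant's activation-smoothing scheme, and the function names are placeholders.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q stored in int8."""
    scale = np.abs(w).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int8(w)
print("memory ratio (int8 / fp32):", q.nbytes / w.nbytes)        # 0.25
print("max abs error:", float(np.abs(w - dequantize(q, s)).max()))
```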
QuIP: 2-Bit Quantization of Large Language Models With Guarantees [PDF]
This work studies post-training parameter quantization in large language models (LLMs). We introduce quantization with incoherence processing (QuIP), a new method based on the insight that quantization benefits from incoherent weight and Hessian matrices ...
Jerry Chee+3 more
semanticscholar +1 more source
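A heavily simplified sketch of the incoherence-processing idea the snippet describes: rotate the weight matrix into a random orthogonal basis, quantize there, and rotate back. QuIP itself pairs this with structured rotations and Hessian-aware adaptive rounding, which are omitted here; the round-to-nearest quantizer and matrix sizes below are illustrative.

```python
import numpy as np

def random_orthogonal(n: int, rng) -> np.ndarray:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def round_to_nearest(w: np.ndarray, bits: int = 2) -> np.ndarray:
    # Plain uniform round-to-nearest with 2**bits levels over the tensor's range.
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + step * np.round((w - lo) / step)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, V = random_orthogonal(64, rng), random_orthogonal(64, rng)

# Quantize in the rotated ("incoherent") basis, then undo the rotation.
W_hat = U @ round_to_nearest(U.T @ W @ V) @ V.T
print("reconstruction error:", float(np.linalg.norm(W - W_hat)))
```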
SqueezeLLM: Dense-and-Sparse Quantization [PDF]
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements.
Sehoon Kim+7 more
semanticscholar +1 more source
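The snippet above is only motivational; the "dense-and-sparse" in the title refers to splitting a weight matrix into a tiny set of large-magnitude outliers kept in full precision and a dense remainder quantized to low bits. A hedged sketch of that decomposition follows; the threshold, bit width, and uniform quantizer are illustrative choices, not the paper's sensitivity-based, non-uniform scheme.

```python
import numpy as np

def dense_and_sparse_split(w: np.ndarray, outlier_frac: float = 0.005):
    """Keep the largest ~0.5% of weights as a sparse full-precision matrix; the rest is quantized."""
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    mask = np.abs(w) > thresh
    return np.where(mask, 0.0, w), np.where(mask, w, 0.0)   # dense part, sparse outliers

def quantize_uniform(w: np.ndarray, bits: int = 3) -> np.ndarray:
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + step * np.round((w - lo) / step)

w = np.random.standard_cauchy((256, 256)).astype(np.float32)  # heavy-tailed, outlier-prone weights
dense, sparse = dense_and_sparse_split(w)
w_hat = quantize_uniform(dense) + sparse
print("mean abs error with outlier split:", float(np.abs(w - w_hat).mean()))
print("mean abs error without split:     ", float(np.abs(w - quantize_uniform(w)).mean()))
```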
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [PDF]
Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization aware training ...
Zechun Liu+8 more
semanticscholar +1 more source
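For the quantization-aware-training part of the snippet, a minimal sketch of the usual mechanism: fake-quantize the weights in the forward pass and let gradients flow through the rounding via a straight-through estimator. The toy example trains a linear model; LLM-QAT's data-free distillation setup is not reproduced here, and all sizes and hyperparameters are illustrative.

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Simulate low-bit weights in the forward pass (values stay float32)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return scale * np.clip(np.round(w / scale), -qmax, qmax)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 16))
y = X @ rng.standard_normal(16)

w = np.zeros(16)                        # full-precision latent weights
for _ in range(500):
    w_q = fake_quant(w)                 # quantized weights used in the forward pass
    grad = X.T @ (X @ w_q - y) / len(X) # gradient w.r.t. w_q ...
    w -= 0.05 * grad                    # ... applied to w (straight-through estimator)
print("train MSE with 4-bit weights:", float(np.mean((X @ fake_quant(w) - y) ** 2)))
```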
Finite Scalar Quantization: VQ-VAE Made Simple [PDF]
We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension ...
Fabian Mentzer+3 more
semanticscholar +1 more source
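Because the snippet describes the quantizer itself, a minimal sketch is easy to give: bound each of the few latent dimensions, then round it to a small fixed grid, so the implicit codebook is the product of the per-dimension level counts. The sketch covers only the forward pass and odd level counts; in training the rounding is made differentiable with a straight-through estimator, and the level counts below are illustrative.

```python
import numpy as np

def fsq(z: np.ndarray, levels=(7, 5, 5, 5)) -> np.ndarray:
    """Finite scalar quantization (forward pass, odd level counts only).
    Implicit codebook size = prod(levels) = 875 here."""
    half = (np.array(levels, dtype=np.float32) - 1) / 2.0
    z_bounded = np.tanh(z) * half        # dimension i now lies in (-half[i], half[i])
    return np.round(z_bounded) / half    # grid points rescaled to [-1, 1]

z = np.random.randn(4, 4).astype(np.float32)   # 4 latents, each projected down to 4 dims
print(fsq(z))
```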
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models [PDF]
Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained ...
Yixiao Li+6 more
semanticscholar +1 more source
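A hedged sketch of the joint quantization-plus-LoRA initialization idea: alternate between quantizing the weight minus the current low-rank part and refitting the low-rank part (via truncated SVD) to the quantization residual, so the frozen quantized weight plus the LoRA factors start close to the original weight. The uniform quantizer is a stand-in, and the rank, bit width, and iteration count are illustrative.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int = 4) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return scale * np.clip(np.round(w / scale), -qmax, qmax)

def loftq_style_init(W: np.ndarray, rank: int = 8, iters: int = 5):
    """Alternating minimization of ||W - Q - A @ B||_F over Q (quantized) and (A, B)."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        Q = quantize_uniform(W - A @ B)
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A, B = U[:, :rank] * s[:rank], Vt[:rank]
    return Q, A, B

W = np.random.randn(128, 64)
Q, A, B = loftq_style_init(W)
print("residual with low-rank correction:", float(np.linalg.norm(W - Q - A @ B)))
print("residual with quantization alone: ", float(np.linalg.norm(W - quantize_uniform(W))))
```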
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [PDF]
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements.
Z. Yao+5 more
semanticscholar +1 more source
Autoregressive Image Generation using Residual Quantization [PDF]
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range ...
Doyup Lee+4 more
semanticscholar +1 more source
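The snippet explains why short code sequences matter; the residual quantization in the title shortens them by stacking several codes per spatial position: quantize the feature vector, then quantize the remaining residual, and so on, so the reconstruction is a sum of a few codebook entries. A minimal sketch with a random (untrained) shared codebook; sizes and depth are illustrative.

```python
import numpy as np

def residual_quantize(z: np.ndarray, codebook: np.ndarray, depth: int = 4):
    """Greedy residual quantization: at each depth pick the nearest code for the
    current residual, accumulate it, and keep its index (depth codes per vector)."""
    codes, recon = [], np.zeros_like(z)
    for _ in range(depth):
        residual = z - recon
        d2 = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)          # nearest codebook entry per residual vector
        codes.append(idx)
        recon = recon + codebook[idx]
    return np.stack(codes, axis=1), recon

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 32))   # shared codebook: 256 entries of dimension 32
z = rng.standard_normal((16, 32))           # 16 feature vectors from an encoder
codes, recon = residual_quantize(z, codebook)
print(codes.shape)                           # (16, 4): four stacked codes per position
print("relative error:", float(np.linalg.norm(z - recon) / np.linalg.norm(z)))
```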