Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent.
Jiaming Tang et al.
Inference Scaling for Long-Context Retrieval Augmented Generation
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, ...
Zhenrui Yue et al.
ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads
With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities in LLMs. Such long-context ability is accompanied by difficulties in deployment, especially due to the increased consumption of KV ...
Zhuorui Liu, Chen Zhang, Dawei Song
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference speed and high GPU memory ...
Di Liu et al.
PQCache: Product Quantization-based KVCache for Long Context LLM Inference
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory ...
Hailin Zhang et al.
SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall ...
Lingkun Long et al.
MILLION: MasterIng Long-Context LLM Inference Via Outlier-Immunized KV Product QuaNtization
Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens.
Zongwu Wang et al.
Efficient Long-Context LLM Inference via KV Cache Clustering
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges.
Jie Hu et al.
ViT-Stain: Vision transformer-driven virtual staining for skin histopathology via global contextual learning.
Current virtual staining approaches for histopathology slides use convolutional neural networks (CNNs) and generative adversarial networks (GANs). These approaches rely on local receptive fields and struggle to capture global context and long-range tissue ...
Muhammad Altaf Hussain et al.
Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance.
WeiZhi Fei et al.

