Results 171 to 180 of about 1,257,085
Some of the following articles may not be open access.

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

Annual Meeting of the Association for Computational Linguistics
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate ...
Yuxiang Huang   +9 more
semanticscholar   +1 more source

MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference

Annual Meeting of the Association for Computational Linguistics
This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference.
Kunxi Li   +7 more
semanticscholar   +1 more source

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

arXiv.org
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models.
Benjamin Warner   +13 more
semanticscholar   +2 more sources

TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization

Annual Meeting of the Association for Computational Linguistics
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to
Dingyu Yao   +6 more
semanticscholar   +1 more source

S3-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference

arXiv.org
Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient.
Qingsen Ma   +9 more
semanticscholar   +1 more source

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

arXiv.org
We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity.
Junyoung Park   +5 more
semanticscholar   +1 more source

MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts

Annual Meeting of the Association for Computational Linguistics
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache.
Wei Tao   +6 more
semanticscholar   +1 more source

Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading

Global Communications Conference
Large Language Models (LLMs) can process data to perform zero-shot learning on unseen tasks and few-shot learning on complex reasoning tasks for devices.
Minrui Xu   +2 more
semanticscholar   +1 more source

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Neural Information Processing Systems
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30
Huiqiang Jiang   +11 more
semanticscholar   +1 more source

Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

arXiv.org
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, performing long-context LLM inference locally on a commodity GPU (a PC) with privacy concerns ...
He Sun   +3 more
semanticscholar   +1 more source
