Results 171 to 180 of about 1,257,085 (204)
Some of the following articles may not be open access.
Annual Meeting of the Association for Computational Linguistics
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate ...
Yuxiang Huang +9 more
semanticscholar +1 more source
Annual Meeting of the Association for Computational Linguistics
This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference.
Kunxi Li +7 more
semanticscholar +1 more source
arXiv.org
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models.
Benjamin Warner +13 more
semanticscholar +2 more sources
TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
Annual Meeting of the Association for Computational Linguistics
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to ...
Dingyu Yao +6 more
semanticscholar +1 more source
S3-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
arXiv.org
Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient.
Qingsen Ma +9 more
semanticscholar +1 more source
arXiv.org
We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity.
Junyoung Park +5 more
semanticscholar +1 more source
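As a rough illustration of the key-similarity idea summarized in the KeyDiff snippet above, the following minimal Python sketch retains only the most geometrically distinctive cached keys. It is a toy under stated assumptions, not the published KeyDiff algorithm: the redundancy score, the function name, and the per-head tensor layout are all invented for illustration.

import torch

def evict_by_key_similarity(keys: torch.Tensor, values: torch.Tensor, keep: int):
    # Hypothetical sketch (not the published KeyDiff method): keep the `keep`
    # most geometrically distinctive keys, i.e. those least similar on average
    # to the rest of the cache, and evict the remainder.
    # keys, values: [seq_len, head_dim] cached tensors for one attention head.
    normed = torch.nn.functional.normalize(keys, dim=-1)        # unit-norm each key
    sim = normed @ normed.T                                      # pairwise cosine similarity
    redundancy = (sim.sum(dim=-1) - 1.0) / (keys.shape[0] - 1)   # mean similarity to the other keys
    keep_idx = torch.topk(-redundancy, k=keep).indices.sort().values
    return keys[keep_idx], values[keep_idx]

Because each key is scored only against other keys, the rule is query-independent and training-free; whether this matches the paper's exact criterion cannot be determined from the snippet alone.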
Annual Meeting of the Association for Computational Linguistics
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache.
Wei Tao +6 more
semanticscholar +1 more source
Global Communications Conference
Large Language Models (LLMs) can process data to perform zero-shot learning on unseen tasks and few-shot learning on complex reasoning tasks for devices.
Minrui Xu +2 more
semanticscholar +1 more source
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Neural Information Processing Systems
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 ...
Huiqiang Jiang +11 more
semanticscholar +1 more source
arXiv.org
Advanced Large Language Models (LLMs) have achieved impressive performance across a wide range of complex and long-context natural language tasks. However, performing long-context LLM inference locally on a commodity GPU (a PC) with privacy concerns ...
He Sun +3 more
semanticscholar +1 more source

