Results 21 to 30 of about 1,257,085

Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs

open access: yes (arXiv.org)
There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers at long ...
Ryan Synk   +8 more
semanticscholar   +3 more sources
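The entry above concerns exploiting attention sparsity for long-context inference. As a hedged illustration of the general idea only (not the paper's specific method), a top-k sparse attention step attends to just the highest-scoring keys per query instead of all tokens:

```python
import numpy as np

# Illustrative top-k sparse attention for a single query vector.
# Purely a sketch of the general technique; not the algorithm from
# the paper listed above.
def topk_sparse_attention(q, K, V, k):
    scores = K @ q / np.sqrt(q.shape[-1])    # (seq_len,) attention logits
    keep = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                             # softmax over the kept keys only
    return w @ V[keep]                       # weighted sum of kept values
```

With k equal to the sequence length this reduces to dense attention; smaller k trades accuracy for compute and memory proportional to k rather than the full context.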

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

open access: yes (arXiv.org)
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints.
Yaoqi Chen   +17 more
semanticscholar   +3 more sources

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

open access: yes (arXiv.org)
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key-value (KV) cache consuming up to 70% of total memory during inference.
Xiang Liu   +6 more
semanticscholar   +3 more sources

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

open access: yes (International Conference on Machine Learning)
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and ...
Payman Behnam   +5 more
semanticscholar   +3 more sources
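The snippet above notes that KV cache size grows proportionally with input length. A minimal back-of-the-envelope sketch of that arithmetic, assuming a hypothetical 7B-class model configuration (the numbers below are illustrative, not taken from the paper):

```python
# Rough KV cache size estimate: 2x (keys and values), one entry per
# layer, per KV head, per head dimension, per token, at dtype width.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed config: 32 layers, 32 KV heads, head_dim 128, fp16,
# 128k-token context -> 64 GiB of KV cache alone.
gib = kv_cache_bytes(32, 32, 128, 128 * 1024) / 2**30
```

The linear dependence on `seq_len` is what motivates the compression and eviction schemes in the entries above and below.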

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

open access: yes (International Conference on Learning Representations)
Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory.
Guangxuan Xiao   +7 more
semanticscholar   +3 more sources

AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference

open access: yes (Companion of the 2025 International Conference on Management of Data)
AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB AI.
Yangshen Deng   +15 more
semanticscholar   +3 more sources

AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design

open access: yes (IEEE Transactions on Very Large Scale Integration (VLSI) Systems)
Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices.
Yanbiao Liang   +3 more
semanticscholar   +3 more sources

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

open access: yes (International Conference on Machine Learning)
With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference.
Hanshi Sun   +8 more
semanticscholar   +3 more sources

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

open access: yes (arXiv.org)
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens.
Qichen Fu   +5 more
semanticscholar   +3 more sources
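The snippet above describes the two sequential stages of transformer inference. A toy sketch of that control flow, where `model_step` is a hypothetical stand-in for one forward pass (an assumption for illustration, not an API from the paper):

```python
# Toy prefill/decode loop. model_step(tokens, kv_cache) is assumed to
# return (updated_kv_cache, next_token).
def generate(prompt_tokens, model_step, max_new_tokens):
    # Prefill: process the whole prompt at once, building the KV cache
    # and producing the first output token.
    kv_cache, next_token = model_step(prompt_tokens, kv_cache=None)
    output = [next_token]
    # Decode: generate tokens one at a time, reusing and extending the cache.
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model_step([next_token], kv_cache=kv_cache)
        output.append(next_token)
    return output
```

Token-pruning approaches like the one listed above target the prefill stage, since its cost grows with the full prompt length.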

CEG: A joint model for causal commonsense events enhanced story ending generation

open access: yes (PLoS ONE, 2023)
With the success of pre-trained language models, the performance of story ending generation has been dramatically improved while remaining challenging due to the lack of commonsense reasoning ability. Most previous works mainly focus on using commonsense ...
Yushi Zhang   +5 more
doaj   +2 more sources
