Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
There is growing demand for inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale requires significant computational resources, hindering the application of transformers at long ...
Ryan Synk +8 more
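To make the scale of the problem concrete, here is a back-of-the-envelope KV-cache calculation for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16; these configuration values are illustrative assumptions, not figures from the paper):

```python
# Back-of-the-envelope KV-cache footprint for a hypothetical 7B-class model.
# Configuration values are illustrative assumptions, not taken from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Total bytes for the K and V caches at a given context length."""
    # 2 tensors (K and V), each of shape [n_layers, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

for n in (8_192, 128_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:7.1f} GiB of fp16 KV cache")
```

At a million tokens the full-precision cache alone approaches half a terabyte, far beyond any commodity GPU; that gap is what sparsity-exploiting methods aim to close.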
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints.
Yaoqi Chen +17 more
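As a toy illustration of the vector-storage framing (a generic top-k retrieval sketch, not RetroInfer's actual index structure), the snippet below scores all cached keys against the current query, retrieves the k best, and runs softmax attention over that subset only:

```python
import numpy as np

# Toy sketch of "KV cache as vector storage": instead of attending over all
# cached keys, retrieve only the top-k keys most similar to the current query.
# Generic illustration; shapes and exact top-k scoring are assumptions.

def topk_attention(q, K, V, k=64):
    """q: (d,), K/V: (n, d). Attend over the k highest-scoring cached entries."""
    scores = K @ q / np.sqrt(q.shape[0])      # similarity of query to all keys
    idx = np.argpartition(scores, -k)[-k:]    # indices of the top-k keys
    s = scores[idx]
    w = np.exp(s - s.max())
    w /= w.sum()                              # softmax over the retrieved subset
    return w @ V[idx]                         # weighted sum of retrieved values

rng = np.random.default_rng(0)
K = rng.standard_normal((100_000, 128)).astype(np.float32)
V = rng.standard_normal((100_000, 128)).astype(np.float32)
out = topk_attention(rng.standard_normal(128).astype(np.float32), K, V)
```

In a real system the exhaustive `K @ q` scan would itself be replaced by an approximate nearest-neighbor index, so that scoring no longer touches the whole cache.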
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key-value (KV) cache consuming up to 70% of total memory during inference.
Xiang Liu +6 more
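A minimal sketch of chunk-level cache compression under assumed scoring: the cache is split into contiguous chunks, each chunk is scored by its mean attention mass from recent queries, and only the top fraction of chunks survives. The scoring rule and hyperparameters here are placeholders, not the paper's semantic-preserving criterion:

```python
import numpy as np

# Sketch of chunk-level KV compression: score contiguous chunks of the cache
# and keep only the highest-scoring ones. The scoring rule (mean recent
# attention mass per chunk) is an assumption for illustration.

def compress_kv_by_chunks(K, V, attn, chunk_size=32, keep_ratio=0.3):
    """K/V: (n, d); attn: (n,) attention weights from recent queries."""
    n = (K.shape[0] // chunk_size) * chunk_size   # drop a ragged tail for simplicity
    chunk_scores = attn[:n].reshape(-1, chunk_size).mean(axis=1)
    n_keep = max(1, int(len(chunk_scores) * keep_ratio))
    keep = np.sort(np.argsort(chunk_scores)[-n_keep:])  # surviving chunks, in order
    idx = (keep[:, None] * chunk_size + np.arange(chunk_size)).ravel()
    return K[idx], V[idx]
```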
RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and ...
Payman Behnam +5 more
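A sketch of the two-stage shape of such schemes, with assumed scoring rules and budgets: stage one permanently evicts low-importance tokens once after prefill, and stage two selects a small dynamic top-k among the survivors at every decode step:

```python
import numpy as np

# Two-stage KV compression sketch: (1) a permanent coarse eviction after
# prefill, then (2) a dynamic top-k selection at each decode step over the
# survivors. Stage boundaries, budgets, and scoring rules are illustrative.

def stage1_evict(K, V, prefill_attn, budget):
    """Permanently keep the `budget` tokens with the highest prefill attention."""
    keep = np.sort(np.argsort(prefill_attn)[-budget:])
    return K[keep], V[keep]

def stage2_attend(q, K, V, k):
    """At decode time, attend only over the top-k of the retained entries."""
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(scores, -k)[-k:]
    w = np.exp(scores[idx] - scores[idx].max())
    return (w / w.sum()) @ V[idx]
```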
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory.
Guangxuan Xiao +7 more
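A sketch of a per-head cache policy in this spirit: heads classified as retrieval heads keep the full KV cache, while streaming heads retain only a few initial "sink" tokens plus a sliding window of recent tokens. The classification itself and the sink/window sizes below are placeholder assumptions:

```python
# Per-head KV policy sketch: retrieval heads keep everything; streaming heads
# keep attention sinks plus a recent window. Which heads fall in which class
# would be determined offline; here it is a boolean placeholder.

N_SINK, WINDOW = 4, 256

def prune_head_cache(keys, values, is_retrieval_head):
    """keys/values: lists of per-token tensors for one attention head."""
    if is_retrieval_head or len(keys) <= N_SINK + WINDOW:
        return keys, values              # full cache: no savings, full recall
    # streaming head: attention sinks + sliding window only
    ks = keys[:N_SINK] + keys[-WINDOW:]
    vs = values[:N_SINK] + values[-WINDOW:]
    return ks, vs
```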
AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference
AlayaDB is a vector database system, developed at AlayaDB AI, architected natively for efficient and effective long-context inference with Large Language Models (LLMs).
Yangshen Deng +15 more
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Large language models (LLMs) have recently achieved remarkable success in natural language processing (NLP), driving growing demand to extend their deployment from the cloud to edge devices.
Yanbiao Liang +3 more
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference.
Hanshi Sun +8 more
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
The inference of transformer-based large language models consists of two sequential stages: 1) a prefilling stage to compute the KV cache of prompts and generate the first token, and 2) a decoding stage to generate subsequent tokens.
Qichen Fu +5 more
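The two-stage pipeline the abstract describes, as a minimal sketch around a hypothetical `model.prefill`/`model.decode` API (the method's dynamic token pruning would intervene inside these stages by deferring KV computation for tokens that are not yet needed):

```python
# Minimal sketch of the two-stage inference pipeline the abstract describes,
# written against a hypothetical model API. Greedy sampling for brevity.

def generate(model, prompt_ids, max_new_tokens):
    # Stage 1: prefill -- run the whole prompt once, building the KV cache
    # and producing the first generated token.
    logits, kv_cache = model.prefill(prompt_ids)          # hypothetical API
    token = logits.argmax()
    out = [token]
    # Stage 2: decode -- one token at a time, reusing and extending the cache.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.decode(token, kv_cache)  # hypothetical API
        token = logits.argmax()
        out.append(token)
    return out
```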
CEG: A joint model for causal commonsense events enhanced story ending generation
With the success of pre-trained language models, the performance of story ending generation has improved dramatically, yet the task remains challenging due to the lack of commonsense reasoning ability. Most previous works mainly focus on using commonsense ...
Yushi Zhang +5 more