
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

open access: yes · International Conference on Machine Learning
As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent.
Jiaming Tang   +5 more
semanticscholar   +3 more sources
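
The title points at page-level, query-dependent sparsity in the KV cache. Below is a minimal numpy sketch of one way such selection can work: score each page of cached keys with an upper bound built from element-wise key minima and maxima, then attend only to the top-scoring pages. The page size, page budget, and bound heuristic here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def page_criticality(q, k_min, k_max):
    # Upper-bound each page's attention logit: per channel take the larger
    # of q*min and q*max over the page's keys, then sum the channels.
    return np.maximum(q * k_min, q * k_max).sum(axis=-1)

def query_aware_topk_pages(q, keys, page_size=16, n_pages_keep=2):
    # Reshape the cached keys into pages and keep per-page min/max metadata.
    n_pages = len(keys) // page_size
    pages = keys[: n_pages * page_size].reshape(n_pages, page_size, -1)
    k_min, k_max = pages.min(axis=1), pages.max(axis=1)
    scores = page_criticality(q, k_min, k_max)
    return np.argsort(scores)[-n_pages_keep:]  # pages to actually attend to

# Toy usage: 64 cached keys with head dimension 8.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))
q = rng.normal(size=(8,))
print(query_aware_topk_pages(q, keys))
```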

Inference Scaling for Long-Context Retrieval Augmented Generation

open access: yes · International Conference on Learning Representations
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, ...
Zhenrui Yue   +9 more
semanticscholar   +3 more sources

ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads

open access: yes · arXiv.org
With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities of LLMs. Such long-context ability is accompanied by difficulties in deployment, especially due to the increased consumption of KV ...
Zhuorui Liu, Chen Zhang, Dawei Song
semanticscholar   +3 more sources
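
The title suggests assigning each attention head exclusively to one of two roles: retrieval heads that keep the full KV cache, and streaming heads that keep only a few initial "sink" tokens plus a recent window. A hedged sketch of that split, assuming StreamingLLM-style sink-plus-window caching for the streaming heads (n_sink and window are illustrative parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_attention(q, K, V, streaming=False, n_sink=2, window=4):
    # One decode step for a single head. A "streaming" head keeps only a few
    # sink tokens plus a recent window of KV entries; a "retrieval" head keeps
    # the full cache. The split shrinks KV memory on streaming heads while
    # preserving long-range recall on the retrieval heads.
    if streaming:
        K = np.concatenate([K[:n_sink], K[-window:]])
        V = np.concatenate([V[:n_sink], V[-window:]])
    scores = q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
K, V = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
q = rng.normal(size=(8,))
out_stream = head_attention(q, K, V, streaming=True)   # tiny KV footprint
out_retrv  = head_attention(q, K, V, streaming=False)  # full-context head
```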

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

open access: yes · arXiv.org
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference speed and high GPU memory ...
Di Liu   +13 more
semanticscholar   +3 more sources
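
Because softmax attention is dominated by a few high-scoring keys, it can be approximated by retrieving only the top-k keys for each query, e.g. from a vector index, and running exact attention over that subset. A minimal sketch under that assumption; brute-force search stands in for the approximate nearest-neighbor index the title implies:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topk_retrieval_attention(q, K, V, k=8):
    # Approximate attention for one query: retrieve the k keys with the
    # largest dot product (an ANN index would replace this exact scan),
    # then run exact softmax attention over that small subset only.
    scores = K @ q / np.sqrt(K.shape[-1])
    idx = np.argpartition(scores, -k)[-k:]   # top-k candidate tokens
    return softmax(scores[idx]) @ V[idx]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
q = rng.normal(size=(64,))
approx = topk_retrieval_attention(q, K, V)
```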

PQCache: Product Quantization-based KVCache for Long Context LLM Inference

open access: yes · Proceedings of the ACM on Management of Data
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), the intermediate representations of tokens within LLM inference, has now become the primary memory ...
Hailin Zhang   +7 more
semanticscholar   +3 more sources
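
Product quantization compresses each key vector by splitting it into sub-vectors and replacing every sub-vector with the id of its nearest per-subspace centroid. The sketch below shows generic PQ over cached keys, not PQCache's exact pipeline; n_sub, n_cent, and the small k-means training loop are illustrative choices.

```python
import numpy as np

def train_pq(X, n_sub=4, n_cent=16, iters=10, seed=0):
    # Split each vector into n_sub sub-vectors and run k-means per subspace,
    # so a d-dim float vector compresses to n_sub one-byte centroid ids.
    rng = np.random.default_rng(seed)
    books = []
    for S in np.split(X, n_sub, axis=1):
        C = S[rng.choice(len(S), n_cent, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((S[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(n_cent):
                if (assign == c).any():
                    C[c] = S[assign == c].mean(axis=0)
        books.append(C)
    return books

def pq_encode(X, books):
    # Store each sub-vector as the index of its nearest centroid.
    subs = np.split(X, len(books), axis=1)
    return np.stack([np.argmin(((S[:, None] - C[None]) ** 2).sum(-1), axis=1)
                     for S, C in zip(subs, books)], axis=1).astype(np.uint8)

rng = np.random.default_rng(1)
keys = rng.normal(size=(256, 32)).astype(np.float32)
books = train_pq(keys)
codes = pq_encode(keys, books)   # (256, 4) uint8 vs. (256, 32) float32
```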

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

open access: yes · AAAI Conference on Artificial Intelligence
Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall ...
Lingkun Long   +5 more
semanticscholar   +3 more sources
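
Dynamic token pruning drops low-importance hidden states between layers so that deeper layers process fewer tokens. A minimal sketch under assumed choices: importance is a caller-supplied score (here a stand-in for received attention mass) and the most recent tokens are always protected; neither detail is claimed to match SlimInfer's actual criterion.

```python
import numpy as np

def prune_hidden_states(h, importance, keep_ratio=0.5, n_protect=4):
    # Rank tokens by an importance score, always keep the last n_protect
    # tokens so generation stays coherent, and forward only the survivors
    # (in their original order) to the next layer.
    n = len(h)
    k = max(int(n * keep_ratio), n_protect)
    importance = importance.copy()
    importance[-n_protect:] = np.inf             # recent tokens never pruned
    keep = np.sort(np.argsort(importance)[-k:])
    return h[keep], keep

rng = np.random.default_rng(0)
h = rng.normal(size=(128, 64))    # token hidden states entering a layer
attn_mass = rng.random(128)       # proxy importance signal
h_small, kept_idx = prune_hidden_states(h, attn_mass)
```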

MILLION: MasterIng Long-Context LLM Inference Via Outlier-Immunized KV Product QuaNtization

open access: yes · 2025 62nd ACM/IEEE Design Automation Conference (DAC)
Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens.
Zongwu Wang   +9 more
semanticscholar   +3 more sources
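
The "outlier-immunized" idea: a few large-magnitude channels can blow up a quantizer's dynamic range, so they are kept in full precision while the remaining channels are quantized. The sketch below illustrates that separation with plain int8 quantization rather than the product quantization the title names; n_outlier and the channel-magnitude criterion are assumptions.

```python
import numpy as np

def split_outliers_and_quantize(K, n_outlier=4):
    # Keep the channels with the largest magnitudes in full precision
    # (outliers wreck a uniform quantizer's range) and int8-quantize the rest.
    mags = np.abs(K).max(axis=0)
    out_ch = np.argsort(mags)[-n_outlier:]
    reg_ch = np.setdiff1d(np.arange(K.shape[1]), out_ch)
    reg = K[:, reg_ch]
    scale = np.abs(reg).max() / 127.0
    q = np.round(reg / scale).astype(np.int8)
    return q, scale, K[:, out_ch], out_ch, reg_ch

def dequantize(q, scale, outliers, out_ch, reg_ch, d):
    # Reassemble: dequantized bulk channels plus full-precision outliers.
    K = np.empty((len(q), d), dtype=np.float32)
    K[:, reg_ch] = q.astype(np.float32) * scale
    K[:, out_ch] = outliers
    return K

rng = np.random.default_rng(0)
K = rng.normal(size=(64, 32)).astype(np.float32)
K[:, 3] *= 50                      # plant an outlier channel
parts = split_outliers_and_quantize(K)
K_hat = dequantize(*parts, d=32)
print(np.abs(K - K_hat).max())     # small reconstruction error
```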

Efficient Long-Context LLM Inference via KV Cache Clustering

open access: yes · arXiv.org
Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges.
Jie Hu   +10 more
semanticscholar   +3 more sources
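
One natural reading of KV cache clustering: group similar keys with k-means and replace each group by a single representative key/value pair, shrinking the cache from n tokens to n_clusters entries. A naive numpy sketch of that idea; the merge rule (mean of values per cluster) is an illustrative assumption, not the paper's method.

```python
import numpy as np

def cluster_kv_cache(K, V, n_clusters=16, iters=10, seed=0):
    # k-means the keys, then replace each cluster by its centroid key and
    # the mean of the corresponding values.
    rng = np.random.default_rng(seed)
    C = K[rng.choice(len(K), n_clusters, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((K[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                C[c] = K[assign == c].mean(axis=0)
    V_merged = np.stack([V[assign == c].mean(axis=0) if (assign == c).any()
                         else np.zeros(V.shape[1]) for c in range(n_clusters)])
    return C, V_merged

rng = np.random.default_rng(1)
K, V = rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
K_small, V_small = cluster_kv_cache(K, V)   # 512 -> 16 KV entries
```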

ViT-Stain: Vision transformer-driven virtual staining for skin histopathology via global contextual learning.

open access: yes · PLoS ONE
Current virtual staining approaches for histopathology slides use convolutional neural networks (CNNs) and generative adversarial networks (GANs). These approaches rely on local receptive fields and struggle to capture global context and long-range tissue ...
Muhammad Altaf Hussain   +7 more
doaj   +2 more sources

Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

open access: yes · arXiv.org
Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance.
WeiZhi Fei   +5 more
semanticscholar   +3 more sources
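
A plausible mechanism behind "evaluator heads": run a cheap shallow pass, read off the attention matrix of a designated head, score each prompt token by the attention mass it receives, and keep only the top-scoring tokens in their original order. The sketch below assumes exactly that scoring rule; it is illustrative, not the paper's procedure.

```python
import numpy as np

def compress_prompt(tokens, attn_weights, keep_ratio=0.3):
    # attn_weights is the (n, n) attention matrix of one "evaluator" head
    # from a shallow pass; column sums give the mass each token receives.
    scores = attn_weights.sum(axis=0)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top tokens, original order
    return [tokens[i] for i in keep]

rng = np.random.default_rng(0)
tokens = [f"tok{i}" for i in range(20)]
A = rng.random((20, 20)); A /= A.sum(axis=1, keepdims=True)  # fake head
print(compress_prompt(tokens, A))
```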
