Results 1 to 10 of about 1,257,065
A Cloud-Aware Scalable Architecture for Distributed Edge-Enabled BCI Biosensor System
BCI biosensors enable continuous monitoring of neural activity, but existing systems face challenges in scalability, latency, and reliable integration with cloud infrastructure.
Sayantan Ghosh +7 more
doaj +2 more sources
Characterizing Prompt Compression Methods for Long Context Inference
Long-context inference presents system-level challenges from increased compute and memory requirements, as well as accuracy challenges in reasoning over long contexts.
Siddharth Jha +4 more
semanticscholar +3 more sources
Long-Context Inference with Retrieval-Augmented Speculative Decoding
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents.
Guanzheng Chen +4 more
semanticscholar +3 more sources
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources, as their multimodal Key-Value (KV) caches grow with increasing input lengths, challenging inference efficiency.
Zhongwei Wan +5 more
semanticscholar +3 more sources
Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn ...
Siyuan Yan +6 more
semanticscholar +3 more sources
SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs
As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments ...
J. Vo
semanticscholar +3 more sources
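A note on the quadratic scaling this abstract refers to: the attention score matrix QKᵀ for a sequence of n tokens has n² entries, each a dot product over the head dimension, so computing the scores alone costs roughly n² · d multiply-adds per head per layer. The sketch below makes that concrete; the model shape used (32 layers, 32 heads, head dimension 128) is an illustrative assumption, not a configuration from the paper.

```python
# Rough illustration of the quadratic cost of full attention as context grows.
# The model shape (32 layers, 32 heads, head_dim 128) is an assumed example,
# not a configuration taken from any paper listed here.

def attention_score_flops(n_tokens: int, n_layers: int = 32,
                          n_heads: int = 32, head_dim: int = 128) -> float:
    """Multiply-adds for the QK^T score matrices across all layers and heads."""
    # Each head forms an (n_tokens x n_tokens) score matrix; every entry is a
    # dot product of two head_dim-long vectors, i.e. ~n_tokens^2 * head_dim ops.
    return n_layers * n_heads * (n_tokens ** 2) * head_dim

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {attention_score_flops(n) / 1e12:9.1f} TFLOPs (scores only)")
```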
Long-context inference optimization for large language models: a survey
With the rapid development of large language model (LLM) technology, the demand for processing long-text inputs has been increasing. However, long-text inference faces challenges such as high memory consumption and latency.
TAO Wei +3 more
doaj +5 more sources
BaKlaVa - Budgeted Allocation of KV cache for Long-context Inference
In Large Language Model (LLM) inference, Key-Value (KV) caches are essential for reducing time complexity. However, they result in a linear increase in GPU memory usage as the context length grows.
A. B. Gulhan +4 more
semanticscholar +3 more sources
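The linear memory growth noted here is straightforward to quantify: every ingested or generated token adds one key and one value vector per layer and per KV head, so the cache occupies 2 × layers × kv_heads × head_dim × bytes_per_element bytes per token. A minimal back-of-the-envelope sketch follows; the 7B-class model shape and fp16 storage are assumptions for illustration, not figures from the BaKlaVa paper.

```python
# Back-of-the-envelope KV-cache size, showing the linear growth with context
# length described in the abstract. The model shape (32 layers, 32 KV heads,
# head_dim 128) and fp16 storage (2 bytes/element) are illustrative assumptions.

def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Total cache size: one key and one value vector per token, layer, KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 2**30:5.1f} GiB of KV cache")
```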
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
Long-context large language model (LLM) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations.
Zhuomin He +6 more
semanticscholar +3 more sources
Squeezed Attention: Accelerating Long Context Length LLM Inference
Emerging Large Language Model (LLM) applications require long input context in order to perform complex tasks like document analysis and code generation.
Coleman Hooper +7 more
semanticscholar +3 more sources

