Results 101 to 110 of about 1,257,085

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

open access: yes
Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically ...
Gu, Yuzhe   +3 more
openaire   +2 more sources
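The memory claim in the OBCache snippet above (KV cache size scaling linearly with sequence length and batch size) is easy to make concrete with a back-of-the-envelope calculation. The sketch below assumes a roughly Llama-2-7B-like configuration purely for illustration; the dimensions and dtype are not taken from the paper.

```python
# Back-of-the-envelope KV cache size for a decoder-only transformer.
# All model dimensions are illustrative assumptions (roughly Llama-2-7B-like),
# not values taken from the OBCache paper.

def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer (fp16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return batch_size * seq_len * per_token

if __name__ == "__main__":
    for seq_len in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(batch_size=8, seq_len=seq_len) / 2**30
        print(f"batch=8, seq_len={seq_len:>7,}: {gib:7.1f} GiB")
```

With these assumed dimensions the cache already reaches tens of GiB at modest batch sizes, which is the pressure that eviction and pruning methods such as the one above try to relieve.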

Swin-GAT Fusion Dual-Stream Hybrid Network for High-Resolution Remote Sensing Road Extraction

open access: yes (Remote Sensing)
This paper introduces a novel dual-stream collaborative architecture for remote sensing road segmentation, designed to overcome multi-scale feature conflicts, limited dynamic adaptability, and compromised topological integrity.
Hongkai Zhang   +4 more
doaj   +1 more source

Causal Intervention and Counterfactual Reasoning for Multimodal Pedestrian Trajectory Prediction

open access: yes (Journal of Imaging)
Pedestrian trajectory prediction is crucial for autonomous systems navigating human-populated environments. However, existing methods face fundamental challenges including spurious correlations induced by confounding social environments, passive ...
Xinyu Han, Huosheng Xu
doaj   +1 more source

Predicting the destination port of fishing vessels utilizing transformers

open access: yes (Maritime Transport Research)
Vast databases on historical ship traffic are currently freely available in the form of AIS (Automatic Identification System) messages dating back to as early as 2002.
Andreas Berntsen Løvland   +2 more
doaj   +1 more source

Efficient Low Rank Attention for Long-Context Inference in Large Language Models

open access: yes
As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs.
Li, Tenghui   +4 more
openaire   +2 more sources
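The abstract above contrasts quantization and pruning with a low-rank treatment of the KV cache. As a generic illustration only (not the paper's method), the sketch below compresses a synthetic per-head key matrix with a truncated SVD; the shapes, the rank, and the synthetic data are all assumptions.

```python
import numpy as np

# Illustrative low-rank compression of one head's cached key matrix K
# (seq_len x head_dim). Shapes, rank, and the plain truncated SVD are
# assumptions for this sketch, not the method of the paper above.

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 4096, 128, 32

# Synthetic keys with low intrinsic rank plus a little noise, so the
# truncated factorization has structure to recover.
K = (rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, head_dim))
     + 0.05 * rng.standard_normal((seq_len, head_dim))).astype(np.float32)

U, s, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :rank] * s[:rank]   # (seq_len, rank): per-token factors kept in the cache
B = Vt[:rank, :]             # (rank, head_dim): small shared projection

print(f"storage ratio: {(A.size + B.size) / K.size:.2f}")
print(f"relative reconstruction error: "
      f"{np.linalg.norm(K - A @ B) / np.linalg.norm(K):.3f}")
```

Storing the two factors instead of K itself cuts memory roughly in proportion to rank / head_dim, at the cost of an approximation error that depends on how close real keys are to low rank.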

KBJNet: Kinematic Bi-Joint Temporal Convolutional Network Attention for Anomaly Detection in Multivariate Time Series Data

open access: yes (Data Science Journal)
Detecting anomalies in multivariate time series data is crucial to ensure the security and stability of industrial processes. Yet, it remains challenging due to the absence of labeled anomaly data, the complexity of time series data, and the large ...
Muhammad Abdan Mulia   +5 more
doaj   +1 more source

LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts

open access: yes
As large language models (LLMs) show impressive performance on complex tasks, they still struggle with longer contextual understanding and high computational costs. To balance efficiency and quality, we introduce LLMSteer, a fine-tuning-free framework that enhances LLMs through query-independent attention steering.
Gu, Zhuohan   +3 more
openaire   +2 more sources
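"Attention steering" in the snippet above is described only at a high level, so the sketch below shows just the generic idea of biasing attention scores toward a chosen span of key positions before the softmax. The additive-bias mechanism, the bias value, and all shapes are assumptions for illustration; this is not LLMSteer's actual procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steered_attention(q, k, v, steer_mask, steer_bias=2.0):
    """Single-head attention with an additive score bias on selected positions.

    steer_mask: boolean array of shape (seq_len,), True for key positions to
    boost (e.g. a reused context span). The bias mechanism and its magnitude
    are illustrative assumptions, not LLMSteer's method.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)               # (q_len, seq_len)
    scores = scores + steer_bias * steer_mask   # push mass toward chosen keys
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(1)
seq_len, d = 16, 8
q = rng.standard_normal((1, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))
mask = np.zeros(seq_len, dtype=bool)
mask[4:8] = True            # pretend positions 4..7 hold the reused context
print(steered_attention(q, k, v, mask).shape)   # (1, 8)
```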

Global–Local Mamba-Based Dual-Modality Fusion for Hyperspectral and LiDAR Data Classification

open access: yes (Remote Sensing)
Hyperspectral image (HSI) and light detection and ranging (LiDAR) data offer complementary spectral and structural information; however, the integration of these high-dimensional, heterogeneous modalities poses significant challenges. We propose a Global– ...
Khanzada Muzammil Hussain   +3 more
doaj   +1 more source

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

open access: yes
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. Nevertheless, the increasing context length and batch size in offline LLM inference escalate the memory requirement of the key-value (KV) cache, which imposes a huge burden on the GPU VRAM, especially for resource-constrained scenarios (e.g., edge computing ...
Pan, Xiurui   +8 more
openaire   +2 more sources
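InstInfer pushes attention computation into the storage device itself; the sketch below shows only the much simpler, generic fallback of spilling a layer's KV cache to a memory-mapped file on disk and streaming it back in chunks at attention time. The file layout, shapes, dtype, and chunk size are assumptions, not InstInfer's design.

```python
import os
import tempfile
import numpy as np

# Generic sketch: keep one layer's KV cache in an on-disk .npy file and stream
# it back in chunks. This is NOT InstInfer's in-storage attention offloading;
# layout, shapes, dtype, and chunk size are illustrative assumptions.

seq_len, n_heads, head_dim = 8192, 8, 64
path = os.path.join(tempfile.mkdtemp(), "layer0_kv.npy")

# Prefill phase: write K and V for the whole prefix to disk instead of GPU memory.
kv = np.lib.format.open_memmap(path, mode="w+", dtype=np.float16,
                               shape=(2, seq_len, n_heads, head_dim))
kv[0] = np.random.standard_normal((seq_len, n_heads, head_dim)).astype(np.float16)  # keys
kv[1] = np.random.standard_normal((seq_len, n_heads, head_dim)).astype(np.float16)  # values
kv.flush()

# Decode phase: stream the cache back chunk by chunk while forming attention scores.
kv_ro = np.load(path, mmap_mode="r")
chunk = 1024
for start in range(0, seq_len, chunk):
    k_chunk = np.asarray(kv_ro[0, start:start + chunk])  # one chunk read from disk
    # ... accumulate partial attention scores against k_chunk here ...

print("on-disk cache size (MiB):", round(os.path.getsize(path) / 2**20, 1))
```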

MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

open access: yes
Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading.
Zhang, Junyang   +3 more
openaire   +2 more sources
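The abstract above says MOM partitions critical layers into smaller "mini-sequences". The sketch below shows the general chunk-along-the-sequence idea for a position-wise feed-forward layer, where peak activation memory shrinks because only one chunk's (chunk x d_ff) intermediate exists at a time. Layer sizes, the ReLU MLP, and the chunking scheme are assumptions, not MOM's implementation.

```python
import numpy as np

# Generic sketch of running a position-wise feed-forward layer in
# "mini-sequences": chunks along the sequence axis, so only one chunk's
# (chunk x d_ff) intermediate activation is live at a time.
# Sizes and the chunking scheme are illustrative assumptions, not MOM itself.

rng = np.random.default_rng(0)
d_model, d_ff, seq_len, chunk = 512, 2048, 8192, 1024

W1 = rng.standard_normal((d_model, d_ff)).astype(np.float32)
W2 = rng.standard_normal((d_ff, d_model)).astype(np.float32)
x = rng.standard_normal((seq_len, d_model)).astype(np.float32)

def ffn_full(x):
    return np.maximum(x @ W1, 0.0) @ W2                    # intermediate: (seq_len, d_ff)

def ffn_mini_sequences(x, chunk):
    out = np.empty_like(x)
    for start in range(0, x.shape[0], chunk):
        h = np.maximum(x[start:start + chunk] @ W1, 0.0)   # only (chunk, d_ff) live
        out[start:start + chunk] = h @ W2
    return out

diff = np.abs(ffn_full(x) - ffn_mini_sequences(x, chunk)).max()
print("max abs difference between full and chunked:", diff)
```

Because each position's feed-forward output depends only on that position, the chunked version is numerically equivalent up to floating-point reduction order, while the peak intermediate drops from seq_len x d_ff to chunk x d_ff.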
