XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
Recently, generative Large Language Models (LLMs) have achieved remarkable success in numerous applications. Notably, their inference generates output tokens one by one, leading to many redundant computations. The widely used KV-Cache framework trades space for time. However, the cached data grows increasingly ...
Li, Weizhuo +3 more
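The trade-off this abstract describes can be illustrated with a minimal counting sketch (not code from the paper): without a cache, each decoding step recomputes keys/values for the whole prefix, so compute grows quadratically with output length; with a cache, each step computes K/V for one token only, at the cost of storing the ever-growing cache. The function names and numbers below are purely illustrative.

```python
def decode_without_cache(prompt_len, new_tokens):
    """Count K/V computations when nothing is cached:
    every step recomputes K/V for the entire prefix."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step + 1  # whole prefix plus the new token
    return total

def decode_with_cache(prompt_len, new_tokens):
    """Count K/V computations with a KV cache:
    prefill once, then one token's K/V per step."""
    total = prompt_len   # prefill computes the prompt's K/V once
    total += new_tokens  # each decode step computes K/V for one token
    return total

print(decode_without_cache(1000, 100))  # 105050 K/V computations
print(decode_with_cache(1000, 100))     # 1100 K/V computations
```

The compute savings come at the price of keeping 1100 tokens' worth of K/V tensors resident in memory, which is exactly the growing-cache problem the abstract targets.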
LLM-Augmented Prototype Representation for Few-Shot Named-Entity Recognition
Named Entity Recognition (NER) models face challenges in adapting to data distribution shifts, especially with unseen entity types and limited data. Few-shot learning is used to address long-tailed distributions and unseen classes, but struggles with few ...
Weerayut Buaphet +4 more
How reliable are the statistics for the Stability and Growth Pact? [PDF]
The aim of this paper is to assess the reliability of the government deficit and debt figures reported to the European Commission by Member States. Reliability is one of the several dimensions of quality in statistics; it refers to the magnitudes of data ...
João Nogueira Martins +1 more
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting ...
Zhang, Hanzhi +4 more
Microbial network inference for longitudinal microbiome studies with LUPINE
Background The microbiome is a complex ecosystem of interdependent taxa that has traditionally been studied through cross-sectional studies. However, longitudinal microbiome studies are becoming increasingly popular.
Saritha Kodikara, Kim-Anh Lê Cao
Asymmetric-Convolution-Guided Multipath Fusion for Real-Time Semantic Segmentation Networks
To address the inaccurate segmentation of long objects and the information loss on small objects in real-time semantic segmentation algorithms, this paper proposes a lightweight multi-branch real-time semantic ...
Jie Liu, Bing Zhao, Ming Tian
Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference
Large Language Models (LLMs) encounter severe memory inefficiencies during long-context inference due to conventional handling of key-value (KV) caches. In this work, we introduce a novel integration of PagedAttention with PyTorch's FlexAttention, addressing internal fragmentation and inefficiencies associated with monolithic KV cache allocations ...
Joshi, Thomas +4 more
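The fragmentation problem this abstract mentions can be sketched generically (this is not the paper's or vLLM's implementation; class and constant names are hypothetical): instead of reserving one monolithic buffer sized for the maximum context, a paged KV cache allocates fixed-size pages on demand, so unused capacity is bounded by one partial page per sequence.

```python
PAGE_SIZE = 16  # token slots per page; illustrative constant

class PagedKVCache:
    """Toy paged KV cache: pages are allocated lazily as tokens arrive."""

    def __init__(self):
        self.pages = []      # each page holds up to PAGE_SIZE (key, value) slots
        self.num_tokens = 0

    def append_token(self, kv):
        if self.num_tokens % PAGE_SIZE == 0:
            self.pages.append([])  # current page full (or none yet): allocate one
        self.pages[-1].append(kv)
        self.num_tokens += 1

    def allocated_slots(self):
        return len(self.pages) * PAGE_SIZE

cache = PagedKVCache()
for t in range(40):
    cache.append_token((f"k{t}", f"v{t}"))
print(len(cache.pages), cache.allocated_slots())  # 3 pages, 48 slots for 40 tokens
```

With pages, waste is at most `PAGE_SIZE - 1` slots per sequence (here 8), whereas a monolithic buffer sized for, say, a 128K-token maximum would strand tens of thousands of slots for a 40-token sequence.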
Although long noncoding RNAs (lncRNAs) constitute the majority of the human transcriptome, the functional roles of most remain elusive. While protein-coding genes in macrophage biology have been extensively studied, the contribution of lncRNAs in this ...
Christy Montano +4 more
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage ...
Wu, Haoran +15 more
Learned adaptive properties for mitigation of weight perturbations in embedded spiking networks
Recent years have seen an increased importance of neural network inference in edge-based scenarios, which impose size and power constraints requiring novel computing devices.
Sarah Luca +6 more

