Results 231 to 240 of about 133,755 (277)
Some of the following articles may not be open access.

Annex cache: a cache assist to implement selective caching

Microprocessors and Microsystems, 1999
Efficient instruction and data caches are extremely important for achieving good performance from modern high performance processors. Conventional cache architectures exploit locality, but do so rather blindly. By forcing all references through a single structure, the cache’s effectiveness on many references is reduced.
L.K. John, T. Li, A. Subramanian
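The selective-caching idea this abstract describes, keeping low-locality references out of the main cache, can be sketched with a toy annex buffer; the class name, sizes, and promote-on-second-touch policy below are illustrative assumptions, not the paper's actual annex-cache design.

```python
class BypassCache:
    """Toy direct-mapped cache with a tiny 'annex' buffer.

    References that miss are held in the annex first and promoted into
    the main cache only on a second touch, so one-shot references never
    evict main-cache contents. A sketch of selective caching, not the
    published design.
    """

    def __init__(self, num_sets=8, annex_size=2):
        self.main = [None] * num_sets   # one tag per direct-mapped set
        self.annex = []                 # small FIFO of recently missed tags
        self.annex_size = annex_size
        self.num_sets = num_sets
        self.hits = 0

    def access(self, addr):
        s = addr % self.num_sets
        if self.main[s] == addr:        # hit in the main cache
            self.hits += 1
            return True
        if addr in self.annex:          # second touch: promote to main
            self.annex.remove(addr)
            self.main[s] = addr
            self.hits += 1
            return True
        self.annex.append(addr)         # first touch: park in the annex
        if len(self.annex) > self.annex_size:
            self.annex.pop(0)           # FIFO eviction from the annex
        return False
```

A single-use address therefore occupies only an annex slot, while a reused address costs one extra miss before it reaches the main cache.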

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

International Conference on Machine Learning
Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re ...
Zirui Liu   +7 more
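Asymmetric (zero-point) low-bit quantization, the building block named in the title, can be sketched as follows; the group size and function names are illustrative, and KIVI's actual scheme additionally chooses the grouping axis per tensor (per-channel for keys, per-token for values).

```python
import numpy as np

def quantize_2bit(x, group_size=32):
    """Asymmetric 2-bit quantization over flat groups of `group_size`.

    Each group stores a float scale and zero point; values map onto the
    4 levels 0..3. An illustrative sketch, not KIVI's implementation.
    """
    flat = x.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)          # per-group zero point
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0                       # 2 bits -> 4 levels
    scale[scale == 0] = 1.0                       # guard constant groups
    q = np.clip(np.round((flat - lo) / scale), 0, 3).astype(np.uint8)
    return q, scale, lo

def dequantize_2bit(q, scale, lo, shape):
    """Reconstruct an approximation of the original tensor."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)
```

Because the zero point tracks each group's minimum, the rounding error of any element is bounded by half of its group's scale.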

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Neural Information Processing Systems
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Coleman Hooper   +6 more

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

arXiv.org
In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Zefan Cai   +10 more
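The pyramidal idea, giving lower layers a larger KV-cache budget than higher ones, can be sketched with a linear per-layer schedule plus top-k token selection; the linear weights and `ratio` parameter here are assumptions for illustration, not the paper's actual schedule.

```python
import numpy as np

def pyramid_budgets(num_layers, total_budget, ratio=4.0):
    """Split a total KV token budget across layers in a 'pyramid'.

    Layer 0 gets roughly `ratio` times the budget of the last layer,
    decaying linearly in between. Illustrative only.
    """
    w = np.linspace(ratio, 1.0, num_layers)       # decaying weights
    b = np.floor(total_budget * w / w.sum()).astype(int)
    b[0] += total_budget - b.sum()                # rounding remainder
    return b

def select_tokens(attn_scores, budget):
    """Keep the indices of the `budget` tokens with highest attention mass."""
    keep = np.argsort(attn_scores)[::-1][:budget]
    return np.sort(keep)
```

Each layer then prunes its KV cache to its own budget by attention mass, so early layers retain broad context while late layers keep only a few salient tokens.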

Advising cache for lower cache

Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, 2017
In a virtualized environment, such as a cloud computing environment, guest and host operating systems run simultaneously. Both of the operating systems have page caches for disk accesses. In such an environment, the host operating system cache does not work effectively because of negative temporal locality of access and duplicated storing in both the ...
Yoshida Kotaro, Saneyasu Yamaguchi

dKV-Cache: The Cache for Diffusion Language Models

arXiv.org
Diffusion Language Models (DLMs) have emerged as a promising competitor to autoregressive language models, but they have long been constrained by slow inference.
Xinyin Ma   +3 more

Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model

Computer Vision and Pattern Recognition
As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps ...
Feng Liu   +8 more
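The uniform caching baseline the abstract mentions, reusing model outputs at skipped timesteps, can be sketched as a simple skip loop; `model` and the update step are hypothetical stand-ins, and the paper's contribution is replacing this uniform schedule with one driven by timestep embeddings.

```python
def cached_denoise(model, x, timesteps, cache_every=2):
    """Naive uniform output caching for an iterative denoising loop.

    Calls `model(x, t)` only on every `cache_every`-th timestep and
    reuses the previous output otherwise. `model` and the subtraction
    update are toy stand-ins, not a real sampler.
    """
    cached = None
    calls = 0
    for i, t in enumerate(timesteps):
        if cached is None or i % cache_every == 0:
            cached = model(x, t)    # fresh forward pass
            calls += 1
        x = x - cached              # toy update reusing the cached output
    return x, calls
```

With `cache_every=2` this halves the number of forward passes at the cost of stale outputs on the skipped steps, which is exactly the accuracy/speed trade-off the non-uniform schedules try to improve.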

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

USENIX Symposium on Operating Systems Design and Implementation
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the ...
Wonbeom Lee   +3 more

CEASER: Mitigating Conflict-Based Cache Attacks via Encrypted-Address and Remapping

Micro, 2018
Modern processors share the last-level cache between all the cores to efficiently utilize the cache space. Unfortunately, such sharing makes the cache vulnerable to attacks whereby an adversary can infer the access pattern of a co-running application by ...
Moinuddin K. Qureshi
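The encrypted-address idea can be sketched by routing the line address through a keyed permutation before computing the set index; the toy Feistel rounds below stand in for CEASER's actual low-latency block cipher, and periodically changing `epoch_key` models the remapping.

```python
def _feistel(block, key, rounds=4):
    """Toy keyed permutation on a 16-bit block (8-bit halves).

    A stand-in for CEASER's low-latency block cipher; the round
    function and constants here are arbitrary illustrations.
    """
    l, r = (block >> 8) & 0xFF, block & 0xFF
    for i in range(rounds):
        l, r = r, l ^ (((r * 0x9E) + key + i) & 0xFF)
    return (l << 8) | r

def ceaser_set_index(line_addr, epoch_key, num_sets=256):
    """Map a cache line address to a set via the keyed permutation.

    Changing `epoch_key` (the periodic remap) shuffles which addresses
    collide in a set, frustrating conflict-based eviction-set attacks.
    """
    return _feistel(line_addr & 0xFFFF, epoch_key) % num_sets
```

An attacker who learns which addresses conflict under one key loses that knowledge at the next remap epoch, since the address-to-set mapping changes wholesale.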

Cache capacity-aware CCN: Selective caching and cache-aware routing

2013 IEEE Global Communications Conference (GLOBECOM), 2013
Content-centric networking (CCN) is a new networking paradigm to resolve the data traffic explosion problem of the Internet caused by rapid increase in file sharing and video streaming traffic. Networks with CCN avoid delivery of the same contents on one link as many times as they are requested, as contents can be stored and transferred by the cache of ...
Sung-Won Lee   +4 more
