Results 1 to 10 of about 133,755 (277)
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [PDF]
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).
Suyu Ge +5 more
semanticscholar +1 more source
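For intuition, here is a minimal sketch of the adaptive idea described above, with illustrative names and thresholds (not the paper's implementation): each attention head is profiled on the prompt and assigned the cheapest eviction policy, keeping only special tokens, a local window, or heavy-hitter positions, that still recovers most of its attention mass.

```python
# Hypothetical sketch of adaptive (per-head) KV cache compression.
# Assumption: attn_rows[t] is one head's attention distribution over
# positions 0..t for prompt token t; all names/thresholds are illustrative.
import numpy as np

def choose_policy(attn_rows, special_pos, window=32, topk=64, thresh=0.95):
    """Pick the cheapest eviction policy that keeps >= thresh attention mass."""
    def recovered(keep_fn):
        fracs = []
        for t, row in enumerate(attn_rows):
            keep = keep_fn(t, row)
            fracs.append(row[sorted(keep)].sum())  # attention mass retained
        return np.mean(fracs)

    policies = [  # ordered from cheapest to most expensive cache footprint
        ("special_only", lambda t, row: {p for p in special_pos if p <= t}),
        ("local_window", lambda t, row: set(range(max(0, t - window), t + 1))),
        ("heavy_hitter", lambda t, row: set(np.argsort(row)[-topk:])),
        ("full_cache",   lambda t, row: set(range(t + 1))),
    ]
    for name, keep_fn in policies:
        if recovered(keep_fn) >= thresh:
            return name  # cheapest adequate policy for this head
    return "full_cache"
```

During decoding, each head then drops KV entries outside its chosen set, so the per-head policies jointly bound the cache footprint.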
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time [PDF]
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window.
Zichang Liu +7 more
semanticscholar +1 more source
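The "persistence of importance" hypothesis says that tokens receiving little attention in recent decoding steps tend to remain unimportant, so their KV entries can be evicted. A hedged sketch of one way to act on it (budget, window, and scoring are assumptions, not Scissorhands' exact procedure):

```python
# Illustrative sketch: evict KV entries for tokens that have been
# persistently unimportant over a recent window of decoding steps.
from collections import deque
import numpy as np

def compress_kv(history: deque, attn_row: np.ndarray, budget: int):
    """history is a caller-owned deque (e.g. deque(maxlen=W)) of recent
    attention rows; returns the token positions whose KV entries to keep."""
    history.append(attn_row)
    n = len(attn_row)
    if n <= budget:
        return list(range(n))
    # Score each token by its max attention over the window: tokens that
    # were ignored at *every* recent step are deemed safe to evict.
    score = np.zeros(n)
    for row in history:
        score[: len(row)] = np.maximum(score[: len(row)], row)
    keep = np.argsort(score)[-budget:]
    return sorted(keep.tolist())
```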
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [PDF]
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the ...
Yuhan Liu +13 more
semanticscholar +1 more source
Secretive Coded Caching With Shared Caches [PDF]
We consider the problem of secretive coded caching in a shared cache setup where the number of users accessing a particular helper cache is more than one, and every user can access exactly one helper cache. In secretive coded caching, the constraint of perfect secrecy must be satisfied.
Shreya Shrestha Meel, B. Sundar Rajan
openaire +2 more sources
Efficient Stack Distance Approximation Based on Workload Characteristics
The stack distance of a reference is the depth from which the reference must be extracted from a stack. It has been widely applied in a variety of applications that exploit temporal-locality information.
Sooyoung Lim, Dongchul Park
doaj +1 more source
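Concretely, stack distances can be computed with an LRU stack: a reuse at depth d means d distinct items (including the reference itself) were touched since the item's last use. A minimal sketch:

```python
# Minimal sketch: stack (reuse) distance of each reference in a trace.
# Depth 1 means the item was on top of the LRU stack (immediate reuse);
# a first-time reference has infinite distance (cold miss).
def stack_distances(trace):
    stack = []             # most recently used item at the end
    distances = []
    for ref in trace:
        if ref in stack:
            depth = len(stack) - stack.index(ref)  # 1 = top of stack
            stack.remove(ref)
            distances.append(depth)
        else:
            distances.append(float("inf"))
        stack.append(ref)  # ref becomes the most recently used item
    return distances

# stack_distances("abcba") -> [inf, inf, inf, 2, 3]
```

The naive scan costs O(n) per reference, which is why approximation schemes like the one in this entry matter for long traces.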
Coded Caching With Shared Caches and Private Caches
This work studies the coded caching problem in a setting where the users are simultaneously endowed with a private cache and a shared cache. The setting consists of a server connected to a set of users, assisted by a smaller number of helper nodes that are equipped with their own storage. In addition to the helper cache, each user possesses a dedicated ...
Elizabath Peter +2 more
openaire +2 more sources
A Write-Buffer Scheme to Protect Cache Memories Against Multiple-Bit Errors
Protecting cache memories against radiation-induced soft errors is critical in designing highly reliable processors. Dirty lines in write-back data caches are more critical, since the dirty lines have no backups in lower-level memory (LLM).
Jie Li +5 more
doaj +1 more source
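A hedged sketch of the general idea (hypothetical structure, not the paper's design): since a dirty write-back line has no clean copy in lower-level memory, a small buffer can hold redundant copies of recently written lines, letting a multi-bit error in a dirty line be repaired from the buffer rather than becoming uncorrectable.

```python
# Illustrative sketch: keep a redundant copy of dirty cache lines in a
# small write buffer so a multi-bit error in a dirty line (which has no
# backup in lower-level memory) is still recoverable. All structure and
# sizes here are assumptions for illustration only.
from collections import OrderedDict

class BufferedWriteBackCache:
    def __init__(self, buffer_entries=8):
        self.buffer_entries = buffer_entries
        self.lines = {}                    # addr -> (data, dirty flag)
        self.write_buffer = OrderedDict()  # addr -> backup of dirty data

    def write(self, addr, data):
        self.lines[addr] = (data, True)    # line becomes dirty in the cache
        self.write_buffer[addr] = data     # duplicate it into the buffer
        self.write_buffer.move_to_end(addr)
        while len(self.write_buffer) > self.buffer_entries:
            # Before dropping a backup, the real scheme would write the
            # line back so lower-level memory holds a clean copy again.
            self.write_buffer.popitem(last=False)

    def recover(self, addr):
        """Repair a dirty line whose error exceeded ECC correction power."""
        if addr in self.write_buffer:
            data = self.write_buffer[addr]
            self.lines[addr] = (data, True)
            return data
        raise RuntimeError("uncorrectable: no backup for this dirty line")
```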
High-Performance and Flexible Design Scheme with ECC Protection in the Cache
To improve the reliability of static random access memory (SRAM), error-correcting codes (ECC) are typically used to protect SRAM in the cache. While ECC improves reliability, it also requires additional circuits, including encoding and ...
Yulun Zhou +3 more
doaj +1 more source
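To make the encoding circuitry concrete, here is a software sketch of the textbook Hamming(7,4) single-error-correcting code that hardware ECC schemes build on (the paper's actual code may differ, e.g. SECDED over wider words):

```python
# Textbook Hamming(7,4) SEC code: 4 data bits -> 7-bit codeword that
# corrects any single-bit error. Real cache ECC is wider but follows
# the same encode/check structure.
def encode(d):                       # d = [d1, d2, d3, d4], each 0 or 1
    p1 = d[0] ^ d[1] ^ d[3]          # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # parity over positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def decode(c):                       # c = 7-bit codeword, maybe corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = clean, else 1-based error position
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1         # flip the erroneous bit
    return [c[2], c[4], c[5], c[6]]  # recover d1..d4

# word = encode([1, 0, 1, 1]); word[4] ^= 1; decode(word) -> [1, 0, 1, 1]
```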
Optimizing the rendering of an object-based web application with deep nesting and many dependencies
The article studies the application of optimization methods for rendering web applications that use deeply nested objects. The task of analyzing the user interface, which includes a complex data structure received from the server side, is ...
O.V., D.D.
doaj +1 more source
Economical Caching
We study the management of buffers and storages in environments with unpredictably varying prices, using competitive analysis. In the economical caching problem, there is a storage with a certain capacity. For each time step, an online algorithm is given a price from the interval [1, α], a consumption, and possibly a ...
Matthias Englert +3 more
openaire +3 more sources
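As an illustration of the setting, and explicitly not the competitive algorithm analyzed in the paper, one simple heuristic exploits the price bound: fill the storage whenever the price is at most √α, otherwise buy only what the current step consumes.

```python
# Illustrative threshold heuristic for economical caching: prices come
# from [1, alpha]; each step we must cover `demand` units, buying at the
# current price and optionally stockpiling up to `capacity`.
import math

def economical_caching(prices, demands, capacity, alpha):
    stored, cost = 0.0, 0.0
    threshold = math.sqrt(alpha)
    for price, demand in zip(prices, demands):
        use = min(stored, demand)      # serve demand from storage first
        stored -= use
        buy = demand - use             # must buy the uncovered remainder
        if price <= threshold:         # cheap step: also refill storage
            buy += capacity - stored
            stored = capacity
        cost += buy * price
    return cost

# economical_caching([1, 4, 4], [1, 1, 1], capacity=2, alpha=4) -> 3.0,
# versus a cost of 9 when buying only as needed at each step's price.
```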