Results 1 to 10 of about 133,755

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [PDF]

open access: yes · International Conference on Learning Representations, 2023
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).
Suyu Ge   +5 more
semanticscholar   +1 more source

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time [PDF]

open access: yes · Neural Information Processing Systems, 2023
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for the deployment stems from the context window.
Zichang Liu   +7 more
semanticscholar   +1 more source

CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [PDF]

open access: yes · Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 2023
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging as nothing can be generated until the whole context is processed by the ...
Yuhan Liu   +13 more
semanticscholar   +1 more source

Secretive Coded Caching With Shared Caches [PDF]

open access: yes · IEEE Communications Letters, 2021
We consider the problem of secretive coded caching in a shared cache setup where the number of users accessing a particular helper cache is more than one, and every user can access exactly one helper cache. In secretive coded caching, the constraint of perfect secrecy must be satisfied.
Shreya Shrestha Meel, B. Sundar Rajan
openaire   +2 more sources

Efficient Stack Distance Approximation Based on Workload Characteristics

open access: yes · IEEE Access, 2022
The stack distance of a reference is the depth from which the reference must be extracted from a stack. It has been widely applied in a variety of applications that exploit temporal locality information.
Sooyoung Lim, Dongchul Park
doaj   +1 more source

Coded Caching With Shared Caches and Private Caches

open access: yes · IEEE Transactions on Communications, 2023
This work studies the coded caching problem in a setting where the users are simultaneously endowed with a private cache and a shared cache. The setting consists of a server connected to a set of users, assisted by a smaller number of helper nodes that are equipped with their own storage. In addition to the helper cache, each user possesses a dedicated
Elizabath Peter   +2 more
openaire   +2 more sources

A Write-Buffer Scheme to Protect Cache Memories Against Multiple-Bit Errors

open access: yes · IEEE Access, 2022
Protecting cache memories against radiation-induced soft errors is critical in designing highly reliable processors. Dirty lines in write-back data caches are more critical, since the dirty lines have no backups in lower-level memory (LLM).
Jie Li   +5 more
doaj   +1 more source

High-Performance and Flexible Design Scheme with ECC Protection in the Cache

open access: yes · Micromachines, 2022
To improve the reliability of static random access memory (SRAM), error-correcting codes (ECC) are typically used to protect SRAM in the cache. While improving the reliability, we also need additional circuits to support ECC, including encoding and ...
Yulun Zhou   +3 more
doaj   +1 more source

Optimizing the rendering of an object-based web application with deep nesting and many dependencies

open access: yes · Технічна інженерія, 2023
The article studies the application of optimization methods for rendering web applications that use deeply nested objects. The task of analyzing the user interface, which includes a complex data structure received from the server part, is ...
O.V., D.D.
doaj   +1 more source

Economical Caching [PDF]

open access: yes · ACM Transactions on Computation Theory, 2013
We study the management of buffers and storages in environments with unpredictably varying prices in a competitive analysis. In the economical caching problem, there is a storage with a certain capacity. For each time step, an online algorithm is given a price from the interval [1, α], a consumption, and possibly a
Englert, Matthias   +3 more
openaire   +3 more sources
