Fast Query Processing by Distributing an Index over CPU Caches [PDF]
New version published at IEEE Cluster Computing ...
Xiaoqin Ma, Gene Cooperman
openaire +2 more sources
Converting an Integer to a Decimal String in Under Two Nanoseconds
Objective: Converting binary integers to variable‐length decimal strings is a fundamental operation in computing. Conventional fast approaches rely on recursive division and small lookup tables. The goal of this work is to develop a significantly faster method for this task.
Jaël Champagne Gareau, Daniel Lemire
wiley +1 more source
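For orientation, the conventional baseline named in the abstract above (division plus a small lookup table) can be sketched roughly as follows. This is not the authors' method. It is an illustrative version that uses an iterative loop rather than recursion, and the names `DIGIT_PAIRS` and `to_decimal` are hypothetical.

```python
# Illustrative sketch of the conventional approach: divide by 100 and look up
# two digits at a time. Names here are hypothetical, not from the paper.

# Lookup table holding every two-digit pair "00".."99" (200 characters).
DIGIT_PAIRS = "".join(f"{i:02d}" for i in range(100))


def to_decimal(n: int) -> str:
    """Convert a non-negative integer to its decimal string."""
    if n < 0:
        raise ValueError("expects a non-negative integer")
    chunks = []
    # Peel off the two least significant digits per division.
    while n >= 100:
        n, r = divmod(n, 100)
        chunks.append(DIGIT_PAIRS[2 * r : 2 * r + 2])
    # The leading chunk has one or two digits and no zero padding.
    if n >= 10:
        chunks.append(DIGIT_PAIRS[2 * n : 2 * n + 2])
    else:
        chunks.append(chr(ord("0") + n))
    # Chunks were produced least significant first.
    chunks.reverse()
    return "".join(chunks)
```

Handling two digits per division halves the number of divisions compared with a digit-at-a-time loop, at the cost of a 200-character table.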
Based on the vectorised and cache optimised kernel, a parallel lower upper decomposition with a novel communication avoiding pivoting scheme is developed to solve dense complex matrix equations generated by the method of moments.
Yan Chen +4 more
doaj +1 more source
Casper: Accelerating Stencil Computations Using Near-Cache Processing
Stencil computations are commonly used in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations.
Alain Denzler +6 more
doaj +1 more source
Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper [PDF]
Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copy or unified virtual memory.
Gabin Schieffer +4 more
semanticscholar +1 more source
Large language models (LLMs) have made remarkable advances in natural language processing, demonstrating great potential in modelling structured sequences. However, adapting these capabilities to machine gaming tasks such as Go remains challenging due to limitations in strategy generalisation and optimisation efficiency.
Xiali Li +5 more
wiley +1 more source
A Simple Cache Emulator for Evaluating Cache Behavior for SMP Systems
Every modern CPU uses a complex memory hierarchy, which consists of multiple cache memory levels. It is very difficult to predict the behavior of this hierarchy for a given program (for details see [1, 2]).
I. Šimeček
doaj
Cache-optimized BFS on multi-core CPUs
Breadth-First Search (BFS) performance on shared-memory systems is often limited by irregular memory access and cache inefficiencies. This work presents two optimizations for BFS graph traversal: a bitmap-based algorithm designed for small-diameter graphs and MergedCSR, a graph storage format that improves cache locality for large-scale graphs ...
Salvatore Domenico Andaloro +2 more
openaire +1 more source
In this paper, the authors present GEMM-ArchProfiler, a simulation framework for evaluating General Matrix Multiplication performance in convolutional neural networks.
Binu Ayyappan, G. Santhosh Kumar
doaj +1 more source
NuMagSANS, a GPU‐accelerated software package for the computation of nuclear and magnetic small‐angle neutron scattering cross sections and correlation functions of complex systems, is presented. We present NuMagSANS, a GPU‐accelerated software package for calculating nuclear and magnetic small‐angle neutron scattering (SANS) cross sections and ...
Michael P. Adams, Andreas Michels
wiley +1 more source

