Results 11 to 20 of about 51,355
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [PDF]
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then ...
Li Yuan +7 more
semanticscholar +1 more source
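The fixed-length patch tokenization described in the snippet above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; `image_to_tokens` and its defaults are hypothetical:

```python
import numpy as np

def image_to_tokens(image, patch_size=16):
    """Split an image (H, W, C) into a fixed-length sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of non-overlapping patches, then flatten each patch.
    grid = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)
    return patches  # shape: (num_tokens, patch_dim)

tokens = image_to_tokens(np.zeros((224, 224, 3)), patch_size=16)
# A 224x224 image with 16x16 patches yields 14 * 14 = 196 tokens,
# each of dimension 16 * 16 * 3 = 768.
```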
Token Merging: Your ViT But Faster [PDF]
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as ...
Daniel Bolya +5 more
semanticscholar +1 more source
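The bipartite matching idea summarized above (combine similar tokens with a lightweight matching step) can be sketched as follows. This is a simplified, hypothetical illustration of the general approach, not the authors' implementation:

```python
import numpy as np

def merge_tokens(x, r):
    """Sketch of bipartite token merging: alternate tokens into sets A and B,
    match each A-token to its most similar B-token by cosine similarity,
    and average the r best-matched pairs, shrinking the sequence by r."""
    a, b = x[0::2], x[1::2]
    # Cosine similarity between every A-token and every B-token.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T
    best = sim.argmax(axis=1)            # best B match for each A-token
    score = sim.max(axis=1)
    merge_idx = np.argsort(-score)[:r]   # the r most similar A-tokens
    keep_idx = np.setdiff1d(np.arange(len(a)), merge_idx)
    merged_b = b.copy()
    for i in merge_idx:                  # fold each merged A-token into B
        merged_b[best[i]] = (merged_b[best[i]] + a[i]) / 2
    return np.concatenate([a[keep_idx], merged_b])

x = np.random.randn(8, 4)
assert merge_tokens(x, r=2).shape == (6, 4)  # 8 tokens reduced to 6
```

Because the matching is a single similarity lookup rather than an iterative clustering, a step like this adds little overhead per layer.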
V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [PDF]
In this paper, we investigate the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles.
Runsheng Xu +5 more
semanticscholar +1 more source
DeiT III: Revenge of the ViT [PDF]
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of ...
Hugo Touvron, M. Cord, Hervé Jégou
semanticscholar +1 more source
RepViT: Revisiting Mobile CNN From ViT Perspective [PDF]
Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices.
Ao Wang +4 more
semanticscholar +1 more source
All are Worth Words: A ViT Backbone for Diffusion Models [PDF]
Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models.
Fan Bao +6 more
semanticscholar +1 more source
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [PDF]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully ...
Ibrahim M. Alabdulmohsin +3 more
semanticscholar +1 more source
Splicing ViT Features for Semantic Appearance Transfer [PDF]
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their ...
Narek Tumanyan +3 more
semanticscholar +1 more source
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios [PDF]
Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a ...
Jiashi Li +8 more
semanticscholar +1 more source
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [PDF]
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number.
Ting Yao +4 more
semanticscholar +1 more source
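The quadratic scaling noted in the snippet above is easy to make concrete: the self-attention matrix has one entry per pair of tokens, so its size grows with the square of the patch count. A small back-of-the-envelope sketch (the function name is illustrative, not from the paper):

```python
def attention_matrix_entries(image_size, patch_size):
    """Entries in the n x n self-attention matrix for a patchified image."""
    n = (image_size // patch_size) ** 2  # number of patch tokens
    return n * n

small = attention_matrix_entries(224, 16)  # 196 tokens -> 196^2 entries
large = attention_matrix_entries(448, 16)  # 784 tokens -> 784^2 entries
# Doubling the image side quadruples the token count and multiplies
# the attention-matrix size by 16.
assert large == 16 * small
```

This is the cost that multi-scale and wavelet-based designs like the one above aim to reduce.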