Results 11 to 20 of about 51,355

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [PDF]

open access: yes. IEEE International Conference on Computer Vision, 2021
Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then ...
Li Yuan   +7 more
semanticscholar   +1 more source
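
The snippet above breaks off where ViT turns an image into a fixed-length token sequence. Below is a minimal sketch of that standard patch-embedding step in PyTorch, assuming the common ViT-Base sizes (224x224 input, 16x16 patches, 768-dimensional tokens); it illustrates plain ViT tokenization, not the paper's Tokens-to-Token module.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed each patch as a token.

    Sizes (224x224 input, 16x16 patches, 768-dim tokens) are the usual ViT-Base
    defaults, assumed here purely for illustration.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2  # 14 * 14 = 196 tokens
        # A strided convolution is equivalent to cutting the image into patches
        # and applying one shared linear projection to each patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```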

Token Merging: Your ViT But Faster [PDF]

open access: yes. International Conference on Learning Representations, 2022
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as ...
Daniel Bolya   +5 more
semanticscholar   +1 more source
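
A minimal PyTorch sketch of the kind of similarity-based token merging the ToMe abstract describes: partition the tokens into two alternating sets, match each token in one set to its most similar partner in the other by cosine similarity, and average the r best-matched pairs. The actual method matches on attention keys and tracks merged token sizes, so treat this as an illustration rather than the paper's implementation.

```python
import torch

def merge_tokens(x, r):
    """Simplified token-merging step in the spirit of ToMe's bipartite matching.

    x: (B, N, C) token features; r: number of tokens removed per call.
    Illustrative sketch only: similarity is computed on the tokens themselves
    and merging is an unweighted average.
    """
    a, b = x[:, ::2], x[:, 1::2]                      # alternate tokens into sets A and B
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.transpose(-1, -2)              # (B, Na, Nb) cosine similarity
    best_val, best_idx = scores.max(dim=-1)           # best partner in B for each A token
    merge_src = best_val.argsort(dim=-1, descending=True)[..., :r]  # r most similar A tokens

    out = []
    for i in range(x.shape[0]):                       # per-sample loop, kept simple for clarity
        keep_mask = torch.ones(a.shape[1], dtype=torch.bool)
        keep_mask[merge_src[i]] = False               # drop the A tokens being merged away
        b_i = b[i].clone()
        dst = best_idx[i, merge_src[i]]
        # Average each merged A token into its most similar B token.
        # (If several A tokens pick the same destination, only the last write is
        # kept in this simplified version; ToMe averages all of them.)
        b_i[dst] = (b_i[dst] + a[i, merge_src[i]]) / 2
        out.append(torch.cat([a[i][keep_mask], b_i], dim=0))
    return torch.stack(out)                           # (B, N - r, C)

x = torch.randn(2, 196, 768)
print(merge_tokens(x, r=16).shape)  # torch.Size([2, 180, 768])
```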

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer [PDF]

open access: yes. European Conference on Computer Vision, 2022
In this paper, we investigate the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles.
Runsheng Xu   +5 more
semanticscholar   +1 more source

DeiT III: Revenge of the ViT [PDF]

open access: yes. European Conference on Computer Vision, 2022
A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or of ...
Hugo Touvron, M. Cord, Hervé Jégou
semanticscholar   +1 more source

RepViT: Revisiting Mobile CNN From ViT Perspective [PDF]

open access: yes. Computer Vision and Pattern Recognition, 2023
Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices.
Ao Wang   +4 more
semanticscholar   +1 more source

All are Worth Words: A ViT Backbone for Diffusion Models [PDF]

open access: yes. Computer Vision and Pattern Recognition, 2022
Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models.
Fan Bao   +6 more
semanticscholar   +1 more source

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [PDF]

open access: yes. Neural Information Processing Systems, 2023
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully ...
Ibrahim M. Alabdulmohsin   +3 more
semanticscholar   +1 more source

Splicing ViT Features for Semantic Appearance Transfer [PDF]

open access: yes. Computer Vision and Pattern Recognition, 2022
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their ...
Narek Tumanyan   +3 more
semanticscholar   +1 more source

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios [PDF]

open access: yes. arXiv.org, 2022
Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a ...
Jiashi Li   +8 more
semanticscholar   +1 more source

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [PDF]

open access: yes. European Conference on Computer Vision, 2022
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, while the self-attention computation in Transformer scales quadratically w.r.t. the input patch number.
Ting Yao   +4 more
semanticscholar   +1 more source
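
The Wave-ViT snippet points at the quadratic cost of self-attention in the token count, which is what multi-scale and wavelet-based designs try to tame. Below is a rough back-of-the-envelope FLOP count for a single self-attention layer; the constants are assumptions chosen only to show the quadratic term overtaking the linear projections as patches get smaller.

```python
def attention_flops(num_tokens, dim):
    """Rough FLOP estimate for one self-attention layer (illustrative constants):
    QKV and output projections grow linearly in the token count, while the two
    N x N matrix products grow quadratically."""
    proj = 4 * num_tokens * dim * dim         # Q, K, V and output projections
    attn = 2 * num_tokens * num_tokens * dim  # Q K^T and attention @ V
    return proj + attn

for n in (196, 784, 3136):  # 16x16, 8x8, and 4x4 patches on a 224x224 image
    print(n, f"{attention_flops(n, 768) / 1e9:.2f} GFLOPs")
```

At 16x16 patches the linear projections dominate; at 4x4 patches the quadratic attention term does, which is the scaling problem the abstract refers to.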
