Results 31 to 40 of about 170,334 (258)
Vision Transformers Are Robust Learners
Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy. What remains largely unexplored is the evaluation and attribution of their robustness.
Paul, Sayak, Chen, Pin-Yu
openaire +2 more sources
The Multiscale Surface Vision Transformer
Accepted for publication at MIDL 2024, 17 pages, 6 ...
Dahan, Simon +3 more
openaire +3 more sources
Privacy-Preserving Semantic Segmentation Using Vision Transformer
In this paper, we propose a privacy-preserving semantic segmentation method that uses encrypted images together with models based on the vision transformer (ViT), known as the segmentation transformer (SETR).
Hitoshi Kiya +3 more
doaj +1 more source
Recurrent Attentional Networks for Saliency Detection
Convolutional-deconvolutional networks can be adopted to perform end-to-end saliency detection. However, they do not handle objects at multiple scales well.
Kuen, Jason, Wang, Gang, Wang, Zhenhua
core +1 more source
Long Short-Term Memory Spatial Transformer Network
The spatial transformer network has been used in a layered form in conjunction with a convolutional network, enabling the model to transform data spatially.
Chen, Tianyue, Feng, Shiyang, Sun, Hao
core +1 more source
Sign language recognition with transformer networks [PDF]
Sign languages are complex languages. Research into them is ongoing, supported by large video corpora of which only small parts are annotated. Sign language recognition can be used to speed up the annotation process of these corpora, in order to aid ...
Dambre, Joni +2 more
core
Human vision possesses a special type of visual processing system called peripheral vision. By partitioning the entire visual field into multiple contour regions based on the distance from the center of our gaze, peripheral vision gives us the ability to perceive different visual features in different regions.
Min, Juhong +3 more
openaire +2 more sources
V-LTCS: Backbone exploration for Multimodal Misogynous Meme detection
Memes have become a fundamental part of online communication and humour, reflecting and shaping the culture of today's digital age. The amplified meme culture is inadvertently endorsing and propagating casual misogyny. This study proposes V-LTCS (Vision-
Sneha Chinivar +3 more
doaj +1 more source
GaitTriViT and GaitVViT: Transformer-based methods emphasizing spatial or temporal aspects in gait recognition [PDF]
In image recognition tasks, subjects at long distances and in low resolution remain a challenge, whereas gait recognition, which identifies subjects by their walking patterns, is considered one of the most promising biometric technologies due to its stability and ...
Hongyun Sheng
doaj +2 more sources
Modeling Image Virality with Pairwise Spatial Transformer Networks
The study of virality and information diffusion online is a topic gaining traction rapidly in the computational social sciences. Computer vision and social network analysis research have also focused on understanding the impact of content and information
Agarwal, Sumeet, Dubey, Abhimanyu
core +1 more source

