
Scaling Vision Transformers

Papers Read on AI

English - July 23, 2021 13:25 - 24 minutes - 22.7 MB - ★★★★ - 3 ratings


Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. We successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well on few-shot learning, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
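As background to the abstract above: a Vision Transformer first cuts an image into fixed-size patches, which become the token sequence fed to a standard transformer. The snippet below is a minimal illustrative sketch of that patchify step only (it omits the linear projection, position embeddings, and transformer layers), not the paper's actual implementation; the sizes chosen (224x224 image, 16x16 patches) are the common ViT defaults, used here as an assumption.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping flattened patches."""
    h, w, c = image.shape
    p = patch_size
    # Reshape into a grid of p x p patches, then flatten each patch.
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
image = np.zeros((224, 224, 3))
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

Each flattened patch is then linearly projected to the model width and processed by self-attention; scaling the paper describes means growing that width, depth, and patch count together.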

2021: Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, Lucas Beyer

https://arxiv.org/pdf/2106.04560v1.pdf