Building the HowTo100M Video Corpus
Data Skeptic
English - August 19, 2019 20:12 - 22 minutes - 25.9 MB - ★★★★★ - 477 ratings
Science Technology machinelearning datamining datascience science skepticism statistics
Video annotation is an expensive and time-consuming process. As a consequence, the available video datasets are useful but small. Machine-transcribed explainer videos offer a unique opportunity to rapidly develop a useful, if dirty, corpus of videos that are "self-annotating", as hosts explain the actions they are taking on screen.
This episode is a discussion of the HowTo100M dataset, a project that has assembled a video corpus of 136M captioned video clips covering 23k activities.
Related Links
The paper will be presented at ICCV 2019.