OpenAI Embeddings (and Controversy?!)

Yannic Kilcher Videos (Audio Only)

English - February 16, 2022 08:07 - 15 minutes - 14.8 MB - ★★★★★ - 1 rating


#mlnews #openai #embeddings

COMMENTS DIRECTLY FROM THE AUTHOR (thanks a lot for reaching out Arvind :) ):

1. The FIQA results you share also have code to reproduce the results in the paper using the API: https://twitter.com/arvind_io/status/... There's no discrepancy AFAIK.

2. We leave out 6 not 7 BEIR datasets. Results on msmarco, nq and triviaqa are in a separate table (Table 5 in the paper). NQ is part of BEIR too and we didn't want to repeat it. Finally, the 6 datasets we leave out are not readily available and it is common to leave them out in prior work too. For examples, see SPLADE v2 (https://arxiv.org/pdf/2109.10086.pdf) also evaluates on the same 12 BEIR datasets.

3. Finally, I'm now working on time travel so that I can cite papers from the future :)

END COMMENTS FROM THE AUTHOR

OpenAI has launched an embeddings endpoint in its API, providing high-dimensional vector embeddings for text similarity, text search, and code search. Embeddings are a standard tool for processing natural language, but doubts have been raised about the quality of OpenAI's: one blog post found that they are often outperformed by much smaller open-source models, with which embedding would cost a fraction of what OpenAI charges. In this video, we examine the claims and put them into perspective.
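For readers new to the topic, here is a minimal sketch of how embeddings are typically used for text search: each text is mapped to a vector, and documents are ranked by cosine similarity to the query vector. The vectors below are made-up three-dimensional toy values, not output from OpenAI's endpoint or any other model, and the function and document names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption: in practice
# these would come from an embedding model and have far more dimensions).
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The same ranking step applies whichever model produced the vectors, which is why the comparison discussed in the video comes down to retrieval quality, model size, and cost.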

OUTLINE:

0:00 - Intro

0:30 - Sponsor: Weights & Biases

2:20 - What embeddings are available?

3:55 - OpenAI shows promising results

5:25 - How good are the results really?

6:55 - Criticism: Open models might be cheaper and smaller

10:05 - Discrepancies in the results

11:00 - The author's response

11:50 - Putting things into perspective

13:35 - What about real world data?

14:40 - OpenAI's pricing strategy: Why so expensive?

Sponsor: Weights & Biases

https://wandb.me/yannic

Merch: store.ykilcher.com

ERRATA: At 13:20 I say "better", it should be "worse"

References:

https://openai.com/blog/introducing-t...

https://arxiv.org/pdf/2201.10005.pdf

https://beta.openai.com/docs/guides/e...

https://beta.openai.com/docs/api-refe...

https://twitter.com/Nils_Reimers/stat...

https://medium.com/@nils_reimers/open...

https://mobile.twitter.com/arvind_io/...

https://twitter.com/gwern/status/1487...

https://twitter.com/Nils_Reimers/stat...

https://twitter.com/gwern/status/1470...

https://www.reddit.com/r/MachineLearn...

https://mobile.twitter.com/arvind_io/...

Links:

TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick

YouTube: https://www.youtube.com/c/yannickilcher

Twitter: https://twitter.com/ykilcher

Discord: https://discord.gg/4H8xxDF

BitChute: https://www.bitchute.com/channel/yann...

LinkedIn: https://www.linkedin.com/in/ykilcher

BiliBili: https://space.bilibili.com/2017636191

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):

SubscribeStar: https://www.subscribestar.com/yannick...

Patreon: https://www.patreon.com/yannickilcher

Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq

Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2

Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
