Linear Digressions

288 episodes - English - Latest episode: almost 4 years ago - ★★★★★ - 350 ratings

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Tags: Technology, data science, machine learning, linear digressions
Homepage · Apple Podcasts · Google Podcasts · Overcast · Castro · Pocket Casts · RSS feed

Episodes

How Polls Got Brexit "Wrong"

August 08, 2016 01:37 - 15 minutes - 20.9 MB

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", but when the votes were counted, "leave" came out ahead. Everyone was shocked (SHOCKED!) but maybe the polls weren't as wrong as the pundits like to claim. Relevant links: http://www.slate.com/articles/news_and_politics/moneybox/2016/07/why_poli...

Election Forecasting

August 01, 2016 02:40 - 28 minutes - 39.8 MB

Not sure if you heard, but there's an election going on right now. Polls, surveys, and projections abound, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We'll be your trusty guides through a crash course in election forecasting. Relevant links: http://www.wired.com/2016/06/civis-election-polling-clinton-sanders-trump/ http://election.princeton.edu/ http://projects.fivethirtyeight.com/2016-election-forecast/ http:...

Machine Learning for Genomics

July 25, 2016 02:14 - 20 minutes - 28 MB

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different from everyone else. This episode touches on some of the things that make machine learning on genomics data so challenging, and the algorithms designed to do it anyway.

Climate Modeling

July 18, 2016 02:26 - 19 minutes - 27.2 MB

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like "if you're interested, this is kinda cool"--this episode is much more important than that. Understanding these models, and taking action on them where appropriate, will have huge implications in the years to come. Relevant links: https://climatesight.org/

Reinforcement Learning Gone Wrong

July 11, 2016 02:42 - 28 minutes - 51.8 MB

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking? Naturally! It’s fun to think about, and the discussion starting now will have reverberations for decades to come. https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/ http://arxiv.org/abs/1605....

Reinforcement Learning for Artificial Intelligence

July 03, 2016 18:28 - 18 minutes - 33.5 MB

There’s a ton of excitement about reinforcement learning, a branch of machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and self-driving cars, and major logistical optimization projects—and the robots that, tomorrow, will clean our houses and (hopefully) not take over the world…

Differential Privacy: how to study people without being weird and gross

June 27, 2016 01:53 - 18 minutes - 25.1 MB

Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's being collected on you--but you probably also realize that there is some benefit to be had from the improved iPhones and web browsers. Differential privacy is a set of policies that walks the line between individual privacy and better data, includin...
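The oldest and simplest mechanism in this family is randomized response, which modern differential privacy generalizes: each person sometimes answers honestly and sometimes at random, so any individual answer is deniable, yet the population rate is still recoverable. A minimal sketch (the 30% "true" rate and the coin probabilities here are illustrative, not Apple's or Google's actual mechanism):

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer half the time; otherwise answer by coin flip.

    Each individual's reported answer is plausibly deniable, yet the
    population-level rate can still be estimated.
    """
    if rng.random() < 0.5:
        return truth            # honest answer
    return rng.random() < 0.5   # coin-flip answer

def estimate_true_rate(reports):
    """Invert the noise: E[reported yes] = 0.5*p + 0.25, so p = 2*(rate - 0.25)."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

rng = random.Random(42)
true_answers = [rng.random() < 0.3 for _ in range(100_000)]  # ~30% true "yes"
reports = [randomized_response(t, rng) for t in true_answers]
print(round(estimate_true_rate(reports), 2))  # recovers roughly 0.3
```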

How the sausage gets made

June 20, 2016 02:25 - 29 minutes - 40.1 MB

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of hand-waving when it comes to trying to figure out how many of you (listeners) are out there.

SMOTE: makin' yourself some fake minority data

June 13, 2016 03:06 - 14 minutes - 20.1 MB

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help your algorithm better identify them in the wild. Relevant links: https://www.jair.org/media/953/live-953-2037-jair.pdf
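The core of SMOTE is easy to sketch: pick a minority-class point, pick one of its nearest minority-class neighbors, and drop a synthetic point somewhere on the line segment between them. A toy illustration (the 2-D points and parameters are made up; a library implementation will be faster and more careful):

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating between a real
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (excluding x itself)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        t = rng.random()  # random point along the segment from x to nb
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=10)
```

Because each synthetic point is an interpolation, it always lands inside the convex hull of the real minority data.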

Conjoint Analysis: like AB testing, but on steroids

June 06, 2016 02:13 - 18 minutes - 25.3 MB

Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, if you wanted to design an entire hotel chain completely from scratch, and to do it in a data-driven way. You'll never look at Courtyard by Marriott the same way again. Relevant link: https://marketing.wharton.upenn.edu/files/?whdmsaction=public:main.file&fileID=466

Traffic Metering Algorithms

May 30, 2016 01:57 - 17 minutes - 24 MB

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome). Rel...

Um Detector 2: The Dynamic Time Warp

May 23, 2016 02:05 - 14 minutes - 19.2 MB

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other. Besides having an amazing name, the dynamic time warp is a handy algorithm for aligning two time series sequences that are close in shape, but don't quite line up out of the box. Relevant link: http://www.aaai.org/Papers/Workshops/1994/WS-94-...
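The dynamic time warp itself fits in a few lines of dynamic programming. Here's a minimal sketch with made-up toy sequences, where the second sequence is just the first with every sample repeated:

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW: cost of the best monotonic alignment
    between two sequences, allowing one to stretch or squeeze relative to the other."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # a stretches
                                 d[i][j - 1],      # b stretches
                                 d[i - 1][j - 1])  # both advance
    return d[n][m]

s1 = [0, 1, 2, 1, 0]
s2 = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]  # s1, stretched to twice the length
print(dtw_distance(s1, s2))  # -> 0.0, DTW sees the two shapes as identical
```

A plain point-by-point distance would penalize the stretch heavily; DTW's alignment absorbs it completely.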

Inside a Data Analysis: Fraud Hunting at Enron

May 16, 2016 02:36 - 30 minutes - 41.9 MB

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history; that description makes the project sound pretty clean but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been do...

What's the biggest #bigdata?

May 09, 2016 01:28 - 25 minutes - 35 MB

Data science is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? Youtube? ... Something (or someone) else? Relevant link: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Data Contamination

May 02, 2016 02:24 - 20 minutes - 28.8 MB

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way. Relevant links: https://www.researchgate.net/profile/Claudia_Perlich/publication/221653692_Leakage_in_data_mining_Formu...

Model Interpretation (and Trust Issues)

April 25, 2016 00:45 - 16 minutes - 23.3 MB

Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical for interpreting the model, and for knowing if it's doing something reasonable (one could even say... trustworthy). We'll talk about a new algorithm called LIME that seeks to make any model more understandable and interpretable. Relevant Links: http://arxiv.org/abs/1602.04938 https://github.com/marc...

Updates! Political Science Fraud and AlphaGo

April 18, 2016 02:48 - 31 minutes - 43.6 MB

We've got updates for you about topics from past shows! First, the political science scandal of the year 2015 has a new chapter, we'll remind you about the original story and then dive into what has happened since. Then, we've got an update on AlphaGo, and his/her/its much-anticipated match against the human champion of the game Go. Relevant Links: https://soundcloud.com/linear-digressions/electoral-insights-part-2 https://soundcloud.com/linear-digressions/go-1 http://www.sciencemag.org/n...

Ecological Inference and Simpson's Paradox

April 11, 2016 02:43 - 18 minutes - 25.5 MB

Simpson's paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one case, you see a trend one way in a group, but then breaking the group into subgroups gives the exact opposite trend. Confused? Scratching your head? Welcome to the tricky world of ecological inference. Relevant links: https://gking.harvard.edu/files/gking/files/part1.pdf http://blog.revolutionana...
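The classic numeric illustration is the kidney-stone treatment data (Charig et al., 1986): treatment A has the better success rate within each severity subgroup, yet the worse rate when the subgroups are pooled.

```python
from fractions import Fraction

# (successes, total) for two treatments, split by kidney-stone size --
# the well-known numbers from Charig et al., 1986.
data = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return Fraction(successes, total)

# Within EACH subgroup, treatment A wins...
for group in ("small", "large"):
    assert rate(*data["A"][group]) > rate(*data["B"][group])

# ...but pooled over both subgroups, treatment B wins. Simpson's paradox!
a_all = rate(81 + 192, 87 + 263)   # 273/350
b_all = rate(234 + 55, 270 + 80)   # 289/350
assert b_all > a_all
print(float(a_all), float(b_all))  # ~0.78 vs ~0.83
```

The flip happens because treatment A was disproportionately assigned the harder (large-stone) cases, so the pooled comparison mixes in the severity variable.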

Discriminatory Algorithms

April 04, 2016 02:30 - 15 minutes - 21.1 MB

Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we'll talk about another, more troublesome side to discrimination: algorithms can be... racist? Sexist? Ageist? Yes to all of the above. It's an important thing to be aware of, especially when doing people-centered data science. We'll discuss how and why this happens, and what solutions are out there (or not). Relevant Links: http://www.nytimes.com/2015/...

Recommendation Engines and Privacy

March 28, 2016 02:46 - 31 minutes - 43.3 MB

This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There's still a lot of that in here. But a related topic, which is both interesting and important, is how to keep data private in the era of large-scale recommendation engines--what mistakes have been made surrounding supposedly anonymized data, how data ends up de-anonymized, and why it matters for you. Relevant links: http://www.netflixprize.com/ http://bits.blogs.nytimes.com/2010/03/...

Neural nets play cops and robbers (AKA generative adversarial networks)

March 21, 2016 02:58 - 18 minutes - 26 MB

One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than either one would have been without the competition. Relevant links: http://arxiv.org/pdf/1406.2661v1.pdf http://arxiv.org/pdf/1412.6572v3.pdf http://soumith.ch/eyescream/

A Data Scientist's View of the Fight against Cancer

March 14, 2016 03:26 - 19 minutes - 26.3 MB

In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that fight go in the coming decade, and how do we know when we're making progress? No matter how tricky you might think this problem is to solve, the fact is, once you get in there trying to solve it, it's even trickier than you thought.

Congress Bots and DeepDrumpf

March 11, 2016 04:17 - 20 minutes - 28.6 MB

Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches and Donald Trump twitticisms! Relevant links: http://arxiv.org/pdf/1601.03313v2.pdf http://qz.com/631497/mit-built-a-donald-trump-ai-twitter-bot-that-sounds-scarily-like-him/ https://twitter.com/deepdrumpf

Multi-Armed Bandits

March 07, 2016 02:44 - 11 minutes - 15.8 MB

Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and making use of your knowledge at the same time. It's what the pros (like Google Analytics) use, and it's got a great name, so... winner! Relevant link: https://support.google.com/analytics/answer/2844870?hl=en
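Epsilon-greedy is one of the simplest bandit strategies: usually pull the arm that looks best so far (exploit), but occasionally pull a random arm (explore). Production systems typically use fancier schemes, and the arm payout rates below are made up, but the sketch shows the learn-while-earning tradeoff:

```python
import random

def epsilon_greedy(true_rates, pulls=10_000, eps=0.1, seed=1):
    """Epsilon-greedy bandit: exploit the best-looking arm most of the time,
    explore a random arm with probability eps."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n
    rewards = [0.0] * n
    for _ in range(pulls):
        if rng.random() < eps or 0 in counts:
            arm = rng.randrange(n)  # explore (or fill in untried arms)
        else:
            arm = max(range(n), key=lambda i: rewards[i] / counts[i])  # exploit
        counts[arm] += 1
        rewards[arm] += 1.0 if rng.random() < true_rates[arm] else 0.0
    return counts

counts = epsilon_greedy([0.1, 0.2, 0.8])  # arm 2 pays off by far the most
print(counts)  # the vast majority of pulls end up on arm 2
```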

Experiments and Messy, Tricky Causality

March 04, 2016 03:54 - 16 minutes - 23.3 MB

"People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks." Did the healthy food cause the heart attacks? Probably not. But establishing causal links is extremely tricky, and extremely important to get right if you're trying to help students, test new medicines, or just optimize a website. In this episode, we'll unpack randomized experiments, like AB tests, and maybe you'll be smarter as a result. Will you be smarter B...

Backpropagation

February 29, 2016 03:58 - 12 minutes - 17 MB

The reason that neural nets are taking over the world right now is because they can be efficiently trained with the backpropagation algorithm. In short, backprop allows you to adjust the weights of the neural net based on how good of a job the neural net is doing at classifying training examples, thereby getting better and better at making predictions. In this episode: we talk backpropagation, and how it makes it possible to train the neural nets we know and love.
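For a single neuron, backpropagation is just the chain rule applied by hand: the loss gradient flows backward through the activation to the weights. A toy sketch, using a made-up linearly separable dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A one-neuron "network": prediction = sigmoid(w*x + b).
# Backprop is the chain rule: dL/dw = dL/dpred * dpred/dz * dz/dw.
def train(data, lr=0.5, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = sigmoid(w * x + b)
            dL_dpred = 2 * (pred - y)        # from squared-error loss (pred - y)^2
            dpred_dz = pred * (1 - pred)     # sigmoid derivative
            grad = dL_dpred * dpred_dz
            w -= lr * grad * x               # dz/dw = x
            b -= lr * grad                   # dz/db = 1
    return w, b

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # separable toy problem
w, b = train(data)
predictions = [round(sigmoid(w * x + b)) for x, _ in data]
print(predictions)  # -> [0, 0, 1, 1]
```

In a multi-layer network the same chain-rule step is repeated layer by layer, passing the gradient backward; that repetition is what "backpropagation" names.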

Text Analysis on the State Of The Union

February 26, 2016 03:51 - 22 minutes - 30.7 MB

First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk about a really cool analysis of State of the Union text, which analyzes the topics and word choices of every President from Washington to Obama. Relevant link: https://civisanalytics.com/blog/data-science/2016/01/15/data-science-on-state-of-the-union-addresses/

Paradigms in Artificial Intelligence

February 22, 2016 04:32 - 17 minutes - 23.8 MB

Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify all the approaches to AI, perhaps as a first step toward unifying these approaches and move closer to strong AI. In this episode, we'll touch on some of the most provocative work in many different subfields of artificial intelligence, and their s...

Survival Analysis

February 19, 2016 03:44 - 15 minutes - 21.1 MB

Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to understand how the characteristics of a war inform how long the war goes on. This episode talks about the special challenges associated with survival analysis, and the tools that (data) scientists use to answer all kinds of duration-related...
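A workhorse tool here is the Kaplan-Meier estimator: at each event time, multiply the running survival probability by the fraction of at-risk subjects who survived that moment. A minimal sketch with made-up durations, where `observed=False` marks censored subjects who left the study before the event:

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of S(t). Censored subjects leave the risk set
    without counting as events."""
    points = sorted(zip(durations, observed))
    at_risk = len(points)
    survival = 1.0
    curve = []
    i = 0
    while i < len(points):
        t = points[i][0]
        deaths = sum(1 for d, o in points if d == t and o)
        removed = sum(1 for d, _ in points if d == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
        i += removed
    return curve

# durations in months; True = event observed, False = censored
durations = [3, 5, 5, 8, 10, 12]
observed = [True, True, False, True, False, True]
print(kaplan_meier(durations, observed))
```

Note how the censored subject at month 5 shrinks the risk set for later event times without dragging the survival curve down itself; handling censoring correctly is exactly the "special challenge" of survival data.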

Gravitational Waves

February 15, 2016 02:46 - 20 minutes - 28.1 MB

All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational waves, how are they detected, and what does this announcement mean for future studies of the universe. Relevant links: http://www.nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html https://www.ligo.caltech.edu/...

The Turing Test

February 12, 2016 04:11 - 15 minutes - 20.9 MB

Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human conversational partner that it, the computer, was in fact a human. 60 years later, the Turing Test endures as a gold standard of artificial intelligence. It hasn't been beaten, either--yet. Relevant links: https://en.wikipedia.org/wiki/Turing_t...

Item Response Theory: how smart ARE you?

February 08, 2016 03:37 - 11 minutes - 16.2 MB

Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the test-taker is, in order to get the results you want. How to solve this problem, one equation with two unknowns? Item response theory--the data science behind tests like the GRE. Relevant links: https://en.wikipedia.org/wiki/Item_r...
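The simplest item response model is the Rasch (one-parameter logistic) model, where the chance of a correct answer depends only on the gap between the two unknowns, test-taker ability and item difficulty. A minimal sketch:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (one-parameter IRT) model: the probability of answering an item
    correctly is a logistic function of (ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A well-matched test-taker and item: a coin flip.
print(p_correct(0.0, 0.0))  # -> 0.5
# The same item is much easier for a stronger test-taker.
print(p_correct(2.0, 0.0) > p_correct(0.0, 0.0))
```

Fitting the model means estimating both parameters jointly from a grid of test-takers and items, which is how the chicken-and-egg problem gets untangled.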

Go!

February 05, 2016 04:52 - 19 minutes - 27.4 MB

As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of game-playing computer programs, and what makes Google's AlphaGo so special. Relevant link: http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html

Great Social Networks in History

February 01, 2016 04:22 - 12 minutes - 17.4 MB

The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of the network of letter-writing during the Enlightenment is helping humanities scholars track the dispersion of great ideas across the world during that time, from Voltaire to Benjamin Franklin and everyone in between. Relevant links: https:...

How Much to Pay a Spy (and a lil' more auctions)

January 29, 2016 05:36 - 16 minutes - 23.3 MB

A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Relevant links: https://tuecontheoryofnetworks.wordpress.com/2013/02/25/the-origin-of-the-dutch-auction/ http://www.nowozin.net/sebastian/blog/the-fair-price-to-pay-a-spy-an-introduction-to-the-value-of-information.html

Sold! Auctions (Part 2)

January 25, 2016 02:58 - 17 minutes - 24 MB

The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you know it must be pretty cool. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf
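The mechanics of generalized second price (GSP) are easy to sketch: slots go to the highest bidders, and each winner pays the bid just below their own rather than their own bid. This toy version ignores the quality scores that real ad auctions fold in, and the bidder names and amounts are made up:

```python
def gsp_allocate(bids, n_slots):
    """Generalized second-price auction: rank bidders by bid; each winner
    pays the NEXT-highest bid, not their own."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for slot in range(min(n_slots, len(ranked))):
        winner, _ = ranked[slot]
        # price = the bid immediately below yours (0 if nobody is below you)
        price = ranked[slot + 1][1] if slot + 1 < len(ranked) else 0.0
        results.append((slot, winner, price))
    return results

bids = {"alice": 4.00, "bob": 2.50, "carol": 1.00}
print(gsp_allocate(bids, n_slots=2))
# -> [(0, 'alice', 2.5), (1, 'bob', 1.0)]
```

Paying the next bid down, rather than your own, blunts the incentive to shade your bid below your true value, which is a big part of why this family of auctions works so well.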

Going Once, Going Twice: Auctions (Part 1)

January 22, 2016 03:40 - 12 minutes - 17.4 MB

The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory of auctions, and what makes a "good" auction. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf

Chernoff Faces and Minard Maps

January 18, 2016 03:38 - 15 minutes - 20.9 MB

A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links: http://lya.fciencias.unam.mx/rfuentes/faces-chernoff.pdf https://en.wikipedia.org/wiki/Charles_Joseph_Minard

t-SNE: Reduce Your Dimensions, Keep Your Clusters

January 15, 2016 04:05 - 16 minutes - 23.2 MB

Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality reduction when you have clustering in mind. Relevant links: https://www.youtube.com/watch?v=RJVL80Gg3lA

The [Expletive Deleted] Problem

January 11, 2016 04:23 - 9 minutes - 13.6 MB

The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: https://en.wikipedia.org/wiki/Scunthorpe_problem https://www.washingtonpost.com/news/worldviews/wp/2016/01/05/where-is-russia-actually-mordor-in-the-world-of-google-translate/

Unlabeled Supervised Learning--whaaa?

January 08, 2016 03:26 - 12 minutes - 17.3 MB

In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Hacking Neural Nets

January 05, 2016 02:56 - 15 minutes - 21.2 MB

Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf

Zipf's Law

December 31, 2015 18:08 - 11 minutes - 16.1 MB

Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we all interact with every day. Relevant links: http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/ http://arxiv.org/pdf/cond-mat/0412004.pdf https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-dist...
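Checking Zipf's law starts with a rank-frequency table: count every word, sort by frequency, and see whether the k-th most common word appears roughly 1/k as often as the most common one. A sketch on a toy corpus (far too small to show the law cleanly, but it shows the computation):

```python
from collections import Counter

text = """the quick brown fox jumps over the lazy dog the fox and the
became friends and the fox taught the dog to jump over the brown fence""".split()

# Rank-frequency table: Zipf's law says frequency * rank is roughly
# constant in a large enough corpus.
counts = Counter(text).most_common()
for rank, (word, freq) in enumerate(counts[:5], start=1):
    print(rank, word, freq, freq * rank)
```

Run the same computation on a real corpus (a novel, a Wikipedia dump) and the frequency-times-rank column flattens out remarkably well.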

Indie Announcement

December 30, 2015 15:57 - 1 minute - 1.82 MB

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show: https://twitter.com/lindigressions https://twitter.com/benjaffe https://twitter.com/multiarmbandit https://soundcloud.com/linear-digressions http://lineardigressions.com/

Portrait Beauty

December 27, 2015 13:34 - 11 minutes - 16.1 MB

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.

The Cocktail Party Problem

December 18, 2015 00:17 - 12 minutes - 22.1 MB

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

A Criminally Short Introduction to Semi-Supervised Learning

December 04, 2015 03:13 - 9 minutes - 12.7 MB

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learn from the outcomes of our actions.

Thresholdout: Down with Overfitting

November 27, 2015 17:55 - 15 minutes - 21.8 MB

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting.

The State of Data Science

November 10, 2015 04:36 - 15 minutes - 21.5 MB

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."

Twitter Mentions

@benjaffe 1 Episode
@multiarmbandit 1 Episode
@lindigressions 1 Episode
@deepdrumpf 1 Episode