Linear Digressions

Data Science for Making the World a Better Place

November 06, 2015 03:43 - 9 minutes - 13.1 MB

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new algorithm or billion-dollar company idea--there's a whole world of social data science that just wants to make the world a better place to live in.

Kalman Runners

October 29, 2015 03:10 - 14 minutes - 20.2 MB

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. By the way, we neglected to mention in the episode: Katie's marathon time was 3:54:27!
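For the curious, here is a minimal one-dimensional sketch of the predict/update loop behind a Kalman filter. It is not taken from the episode: the function name `kalman_1d`, the constant-state model, and the noise settings are all assumptions chosen to keep the example short.

```python
def kalman_1d(measurements, process_var, measurement_var,
              initial_estimate=0.0, initial_var=1.0):
    """Filter a sequence of noisy scalar measurements.

    Assumes the hidden state is roughly constant between measurements
    (a hypothetical, simplest-possible dynamics model).
    """
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement, weighted by the gain.
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= (1 - gain)
        estimates.append(estimate)
    return estimates


# Hypothetical noisy pace readings (minutes per mile) from a running watch.
readings = [9.1, 8.7, 9.4, 8.9, 9.0, 9.2, 8.8]
smoothed = kalman_1d(readings, process_var=1e-3, measurement_var=0.05,
                     initial_estimate=9.0)
```

The gain is the knob to watch: when the measurement noise is large relative to the current uncertainty, the gain is small and the filter trusts its own prediction more than the new reading.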

Neural Net Inception

October 23, 2015 02:25 - 15 minutes - 21 MB

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.

Benford's Law

October 16, 2015 03:30 - 17 minutes - 24.3 MB

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population of a country, the price of a stock... not all first digits are created equal.
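As a quick illustration (ours, not from the episode), Benford's Law says the leading digit d shows up with probability log10(1 + 1/d). The sketch below computes those probabilities and compares them against the leading digits of the first 1,000 powers of 2, a sequence known to follow the law closely. The function names are just for illustration.

```python
import math
from collections import Counter


def benford_probability(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1 through 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


def leading_digit_frequencies(numbers):
    """Observed share of each leading digit among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


powers_of_two = [2 ** n for n in range(1, 1001)]
observed = leading_digit_frequencies(powers_of_two)

for d in range(1, 10):
    print(d, round(benford_probability(d), 3), round(observed[d], 3))
```

A leading 1 turns up about 30% of the time, while a leading 9 turns up less than 5% of the time.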

Guinness

October 07, 2015 03:30 - 14 minutes - 20.2 MB

Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.

PFun with P Values

September 02, 2015 03:24 - 17 minutes - 23.5 MB

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh interesting". Also, there's a lot of physics in this episode, nerds.

Watson

August 25, 2015 02:26 - 15 minutes - 21.4 MB

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?

Bayesian Psychics

August 18, 2015 00:05 - 11 minutes - 16.1 MB

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.

Troll Detection

August 07, 2015 20:56 - 12 minutes - 17.8 MB

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.

Yiddish Translation

August 03, 2015 03:06 - 12 minutes - 16.8 MB

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.

Modeling Particles in Atomic Bombs

July 06, 2015 23:30 - 15 minutes - 21.5 MB

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the Metropolis-Hastings algorithm... mentioning Solitaire along the way.

Random Number Generation

June 19, 2015 18:49 - 10 minutes - 14.3 MB

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And if your numbers look random but actually aren't, that can have interesting consequences for the security of systems and for the accuracy of models and research. In this episode, Katie and Ben talk about randomness, its place in machine learning and computation in general, along with some random digressions of their own.

Electoral Insights (Part 2)

June 09, 2015 02:46 - 21 minutes - 29.3 MB

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair o...

Electoral Insights (Part 1)

June 05, 2015 20:38 - 9 minutes - 12.8 MB

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters. But with randomized controlled trials, small variations from person to person ...

Falsifying Data

June 01, 2015 21:04 - 17 minutes - 24.4 MB

In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? We’ll explore that in this episode.

Reporter Bot

May 20, 2015 23:16 - 11 minutes - 15.5 MB

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same thing, but one is a good input for a machine learning algorithm and the other is a good story to read over your morning coffee. Data science and machine learning are starting to bridge this gap, taking the raw data on things like baseball games, fina...

Careers in Data Science

May 16, 2015 05:43 - 16 minutes - 22.8 MB

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was something she’s been researching, and it turns out that data science itself (in particular linear regressions) has some answers. In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, ...

That's "Dr Katie" to You

May 14, 2015 17:37 - 3 minutes - 4.15 MB

Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.

Neural Nets (Part 2)

May 11, 2015 14:37 - 10 minutes - 15 MB

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English ...

Neural Nets (Part 1)

May 01, 2015 18:59 - 9 minutes - 12.4 MB

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of evolution guide the structure of our algorithms. This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks o...

Inferring Authorship (Part 2)

April 28, 2015 16:56 - 14 minutes - 19.3 MB

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith. Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) ca...

Inferring Authorship (Part 1)

April 16, 2015 17:25 - 8 minutes - 12.2 MB

This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a person’s writing style is as distinctive as their vocal inflection or their gait when they walk. By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author ...

Statistical Mistakes and the Challenger Disaster

April 06, 2015 19:36 - 13 minutes - 18.1 MB

After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings that seal off the fuel tanks from the rocket boosters became inflexible, so they did not seal properly, which led to the fuel tank explosion. NASA knew that there could be o-ring problems, but performed the analysis of their data incorrectly and ended up massively underestimating the risk associate...

Genetics and Um Detection (HMM Part 2)

March 25, 2015 17:29 - 14 minutes - 20.4 MB

In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear about how HMMs are used in genetics research.

Introducing Hidden Markov Models (HMM Part 1)

March 24, 2015 15:57 - 14 minutes - 20.5 MB

Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean? In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in Modern Genetics, and possibly Katie's "Um Dete...

Monte Carlo For Physicists

March 12, 2015 23:18 - 8 minutes - 11.3 MB

This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue! Monte Carlo (MC) is fake data that we generate for ourselves, usually following certain sets of rules (often a Markov chain; in physics we generate MC according to the l...

Random Kanye

March 04, 2015 23:04 - 8 minutes - 12 MB

Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more: there's a way to do this exact type of thing. It's called a Markov Chain, and it's probably the most powerful way to generate made-up data that you can then use for fun and profit. The idea behind a Markov Chain is that you probabilistically generate a sequence of ste...
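Here is a minimal sketch of a word-level Markov chain text generator, assuming the simplest setup where the next word depends only on the current one. The function names and the toy corpus are ours for illustration and are not from the episode.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length, seed=None):
    """Walk the chain, picking each next word at random from its followers."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word never had a successor in the corpus
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Toy corpus (hypothetical); swap in any text you like.
corpus = "the model reads the data and the model writes the story"
chain = build_chain(corpus)
print(generate(chain, start="the", length=8, seed=0))
```

Because repeated followers are stored as repeated list entries, `rng.choice` picks each next word in proportion to how often it followed the current word in the corpus.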

Lie Detectors

February 25, 2015 18:20 - 9 minutes - 12.8 MB

Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. Suppose you could use a technology like fMRI to see what regions of a person's brain are active when they ask questions. And also suppose that you could run trials where you watch their brain activity while they lie about some minor issue (say, whether the card in their hand is a spade or a club)--could you use machine learning to analyze those images, a...

The Enron Dataset

February 09, 2015 00:00 - 12 minutes - 17.1 MB

In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron emails corpus. Hundreds of thousands of emails amongst top executives were made public; there's n...

Labels and Where To Find Them

February 04, 2015 02:30 - 13 minutes - 18.2 MB

Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find. Great data is everywhere, but the corresponding labels can sometimes be really tricky. Take a few examples we've already covered, like lie detection with an MRI machine (have to take pictures of someone's brain while they try to lie, not a trivial task) or automated image captioning (so many images! so many valid labels!). In this episode, we'll dig into this topic in depth,...

Um Detector 1

January 23, 2015 20:16 - 13 minutes - 18.3 MB

So, um... what about machine learning for audio applications? In the course of starting this podcast, we've edited out a lot of "um"'s from our raw audio files. It's gotten now to the point that, when we see the waveform in soundstudio, we can almost identify an "um" by eye. Which makes it an interesting problem for machine learning--is there a way we can train an algorithm to recognize the "um" pattern, too? This has become a little side project for Katie, which is very much still a work...

Better Facial Recognition with Fisherfaces

January 07, 2015 01:33 - 11 minutes - 16.4 MB

Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about how the approach breaks down. Variations that are trivial to humans when identifying faces can really mess up computer-driven facial ID--expressions, lighting, and angle are a few. One easy failure is that an algorithm optimizes to identify one of those traits, rather than the underlying trait of whether the person is the same (for example, if the training image is me smiling, you may reje...

Facial Recognition with Eigenfaces

January 07, 2015 01:30 - 10 minutes - 13.8 MB

A true classic topic in ML: Facial recognition is very high-dimensional, meaning that each picture can have millions of pixels, each of which can be a single feature. It's computationally expensive to deal with all these features, and invites overfitting problems. PCA (principal components analysis) is a classic dimensionality reduction tool that compresses these many dimensions into the few that contain the most variation in the data, and those principal components are often then fed into a ...

Stats of World Series Streaks

December 17, 2014 00:41 - 12 minutes - 11.5 MB

Baseball is characterized by a high level of equality between teams; even the best teams might only have 55% win percentages (contrast this with college football, where teams go undefeated pretty regularly). In this regime, where 2 outcomes (Giants win/Giants lose) are approximately equally likely, we can model the win/loss chances with a binomial distribution. Using the binomial distribution, we can calculate an interesting little result: what's the chance of the World Series going to only...
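As a rough sketch of that kind of calculation (ours, not necessarily the one in the episode), the code below computes the chance of a best-of-seven series ending in each possible number of games. It assumes every game is independent with a fixed win probability `p`, and the function name is hypothetical.

```python
from math import comb


def series_length_probabilities(p=0.5, wins_needed=4):
    """Probability that a series ends after exactly n games, for each n.

    Assumes independent games where one team wins each game with
    probability p (an idealization; real games are not identical).
    """
    q = 1 - p
    probs = {}
    for n in range(wins_needed, 2 * wins_needed):
        # The series winner takes game n and exactly wins_needed - 1
        # of the previous n - 1 games; count those arrangements.
        ways = comb(n - 1, wins_needed - 1)
        losses = n - wins_needed
        first_team_wins = ways * p ** wins_needed * q ** losses
        second_team_wins = ways * q ** wins_needed * p ** losses
        probs[n] = first_team_wins + second_team_wins
    return probs


for games, prob in series_length_probabilities().items():
    print(games, prob)
```

With evenly matched teams, a sweep in 4 games happens 12.5% of the time, while 6-game and 7-game series are each about 31% likely.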

Computers Try to Tell Jokes

November 26, 2014 18:59 - 9 minutes - 12.6 MB

Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches through text and writes jokes based on what it finds. The jokes are formulaic: they're all of the form "I like my X like I like my Y: Z" where X and Y are nouns, and Z is an adjective that can describe both X and Y. For (dumb) example, "I like my men like I like my coffee: steaming hot." The joke...

How Outliers Helped Defeat Cholera

November 22, 2014 00:00 - 10 minutes - 15 MB

In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths fr...

Hunting for the Higgs

November 16, 2014 00:00 - 10 minutes - 14.1 MB

Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development. For many decades, physicists looked through their fairly large datasets using the laws of physics to guide their exploration; that tradition continues today, but as ever-larger datasets get made, machine learning becomes a more tractable way to deal with the deluge. With this in mind, ATLAS (one of the major experiments at CERN, the European Center for Nuclear Resear...
