
Linear Digressions

288 episodes - English - Latest episode: almost 4 years ago - ★★★★★ - 350 ratings

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Tags: Technology, data science, machine learning, linear digressions

Episodes

Jupyter Notebooks

August 21, 2017 01:09 - 15 minutes - 21.8 MB

This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snippets with explanations for sharing your work with others. If you're not a data person, or you are but you haven't tried out Jupyter notebooks yet, here's your nudge to go give them a try. In this episode we'll go back to the old days, before noteb...

Curing Cancer with Machine Learning is Super Hard

August 14, 2017 01:49 - 19 minutes - 26.6 MB

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turns out to be tougher than expected to cure cancer with artificial intelligence.  There are enough conflicting accounts in the media to make it tough to say exactly what went wrong, but it's a good chance to remind ourselves that even in a post-AI world, hard problems remain hard.

KL Divergence

August 07, 2017 03:07 - 25 minutes - 35.2 MB

Kullback-Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution.  It comes to us originally from information theory, but today underpins other, more machine-learning-focused algorithms like t-SNE.  And boy oh boy can it be tough to explain.  But we're trying our hardest in this episode!
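
If you want to poke at the math yourself, here's a minimal sketch of the discrete version of the formula, with made-up example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(P || Q) = sum_i p_i * log(p_i / q_i).

    Measures the information lost when Q is used to approximate P.
    Assumes p and q are probability vectors over the same support.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i == 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # small positive number; 0 only when p == q
print(kl_divergence(q, p))  # note: not symmetric, so this differs
```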

Sabermetrics

July 31, 2017 01:15 - 25 minutes - 35.4 MB

It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying scientific analysis to figuring out which baseball teams and players are the best. It can be hard to objectively measure sports greatness, but baseball has a data-rich history and plenty of nerdy fans interested in analyzing that data. In this episode we'll dissect a few of the metrics from stand...

What Data Scientists Can Learn from Software Engineers

July 24, 2017 01:52 - 23 minutes - 32.7 MB

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science.  If last week's episode was for software engineers who are interested in becoming more like data scientists, then this week's episode is for data scientists who are looking to improve their game with best practices from software engineering.

Software Engineering to Data Science

July 17, 2017 02:36 - 19 minutes - 26.2 MB

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain.  In this episode, we'll chat with a Friend of the Pod named Walt, who started out as a software engineer but works as a data scientist now.  We'll talk about that transition from software engineering to data science, and what special capabilities software engineers have that data scien...

Re-Release: Fighting Cholera with Data, 1854

July 10, 2017 00:19 - 12 minutes - 16.6 MB

This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”...

Re-Release: Data Mining Enron

July 02, 2017 17:53 - 32 minutes - 44.3 MB

This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron emails corpus. Hundreds of thousands of emails amon...

Factorization Machines

June 26, 2017 02:23 - 19 minutes - 27.3 MB

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
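
For the curious, here's a hedged sketch of the degree-2 factorization machine prediction equation in numpy (the variable names and toy numbers are ours, not from any particular library):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction for one sample.

    x:  feature vector, shape (n,)
    w0: global bias; w: linear weights, shape (n,)
    V:  latent factors, shape (n, k) -- pairwise weight is <V[i], V[j]>
    Uses the O(n*k) identity:
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f ((sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2)
    """
    linear = w0 + w @ x
    s = V.T @ x                                       # shape (k,)
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.random(n)
print(fm_predict(x, 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```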

Anscombe's Quartet

June 19, 2017 02:19 - 15 minutes - 21.5 MB

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is. Anscombe's Quartet was devised in 1973...
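
To see it for yourself, here are the first two of Anscombe's four datasets (the published 1973 values) and their eerily matching summary statistics:

```python
import numpy as np

# Anscombe's first two datasets -- same x values, very different
# shapes, nearly identical summary statistics.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"set {name}: mean(y)={y.mean():.2f} "
          f"var(y)={y.var(ddof=1):.2f} corr={r:.3f}")
# Both print mean ~7.50, variance ~4.13, correlation ~0.816 -- yet
# set I is a noisy line and set II is a clean parabola. Plot them!
```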

Traffic Metering Algorithms

June 12, 2017 03:01 - 18 minutes - 25.5 MB

Originally released June 2016. This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it...

Page Rank

June 05, 2017 01:46 - 19 minutes - 27.4 MB

The year: 1998.  The size of the web: 150 million pages.  The problem: information retrieval.  How do you find the "best" web pages to return in response to a query?  A graduate student named Larry Page had an idea for how it could be done better and created a search engine as a research project.  That search engine was called Google.
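
For a flavor of the idea, here's a toy power-iteration version of PageRank; it's a sketch of the ranking concept, not anything resembling the production system:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-9):
    """Power-iteration PageRank on a dense adjacency matrix.

    adj[i, j] = 1 if page i links to page j. A toy sketch of the
    ranking idea, not a web-scale implementation.
    """
    n = adj.shape[0]
    out_degree = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling pages jump uniformly.
    M = np.where(out_degree > 0, adj / np.maximum(out_degree, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    while True:
        new_rank = (1 - damping) / n + damping * (M.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank

# 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0: page 2 collects the most rank.
adj = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 0]], dtype=float)
print(pagerank(adj))
```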

Fractional Dimensions

May 29, 2017 02:54 - 20 minutes - 28.1 MB

We chat about fractional dimensions, and what the actual heck those are.

Things You Learn When Building Models for Big Data

May 22, 2017 01:44 - 21 minutes - 29.7 MB

As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back.  This week is a first-hand case study in using scikit-learn (a popular python machine learning library) on multi-terabyte datasets, which is something that Katie does a lot for her day job at Civis Analytics.  There are a lot of considerations for doing something like this--cloud computing, art...
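
One of those considerations, out-of-core learning, looks roughly like this sketch using scikit-learn's partial_fit interface (the chunk generator here is a stand-in for reading from disk, a database, or cloud storage):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_chunk(n=10_000, d=20):
    """Fake data source; in real life each chunk comes from storage."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # must be declared up front for partial_fit
for _ in range(50):         # 50 chunks ~ one pass over the "big" data
    X, y = make_chunk()
    clf.partial_fit(X, y, classes=classes)

X_test, y_test = make_chunk()
print("held-out accuracy:", clf.score(X_test, y_test))
```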

How to Find New Things to Learn

May 15, 2017 01:49 - 17 minutes - 24.6 MB

If you're anything like us, you a) are always curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go farther (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!

Federated Learning

May 08, 2017 01:50

As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of relate...
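
For a rough flavor of one such technique, here's a bare-bones sketch of federated averaging, the scheme usually associated with this line of work; all of the data, device sizes, and learning rates here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])

def local_update(w, X, y, lr=0.1, epochs=5):
    """Plain linear-regression gradient descent on one device's data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Simulated devices with different amounts of (imbalanced) local data.
devices = []
for n in [10, 50, 200]:
    X = rng.normal(size=(n, 2))
    devices.append((X, X @ true_w + rng.normal(scale=0.1, size=n)))

w_global = np.zeros(2)
for _ in range(20):
    # Each device trains locally; only weights travel to the server,
    # which averages them weighted by how much data each device has.
    local_ws = [local_update(w_global.copy(), X, y) for X, y in devices]
    sizes = np.array([len(y) for _, y in devices])
    w_global = np.average(local_ws, axis=0, weights=sizes)
print(w_global)  # close to true_w, and no device shared raw data
```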

Word2Vec

May 01, 2017 02:17 - 17 minutes - 24.7 MB

Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually).  And all that's before we get to the part about how Word2Vec allows you to do algebra with text.  Seriously, this stuff is cool.
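
If you want to play along at home, here's a tiny skip-gram demo with the gensim library (assuming gensim 4.x; a toy corpus this small won't produce meaningful geometry, but it shows the moving parts):

```python
from gensim.models import Word2Vec  # assumes gensim 4.x is installed

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
    ["the", "woman", "walks", "the", "dog"],
]
# sg=1 selects the skip-gram flavor; sg=0 would be bag-of-words.
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1)

print(model.wv["king"].shape)                # a 20-dimensional vector
print(model.wv.similarity("king", "queen"))  # cosine similarity
# The famous "algebra with text": king - man + woman ~ queen
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
```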

Feature Processing for Text Analytics

April 24, 2017 02:17 - 17 minutes - 24 MB

It seems like every day there are more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms.  That's why there are text vectorization algorithms, which re-format text data so it's ready for machine learning.  In this episode, we'll go over some of the most common and useful ways to preprocess text data for machine learning.
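
For instance, two of the workhorses in scikit-learn look like this (toy documents, default settings, recent scikit-learn assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "machine learning loves text data",
    "text data needs preprocessing before machine learning",
]

# Bag of words: each column is a vocabulary term, each cell a count.
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF: reweight counts so terms common to every document matter less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```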

Education Analytics

April 17, 2017 02:09 - 21 minutes - 29 MB

This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'll talk about what some of those signs are, and then dig into the meatier questions around discrimination, who owns a student's data, and correlation vs. causation. Spoiler: we have more questions than we have answers on this one. Bonus appearan...

A Technical Deep Dive on Stanley, the First Self-Driving Car

April 10, 2017 01:50

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert.  Lidar?  You betcha!  Drive-by-wire?  Of course!  Probabilistic terrain reconstruction?  Absolutely!  All this and more this week on Linear Digressions.

An Introduction to Stanley, the First Self-Driving Car

April 03, 2017 01:34

In October 2005, 23 cars lined up in the desert for a 140 mile race.  Not one of those cars had a driver.  This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car.  In this episode (part one of a two-parter), we'll revisit the DARPA grand challenge fro...

Feature Importance

March 27, 2017 01:53 - 20 minutes - 27.8 MB

Figuring out which features actually matter in a model is harder than you might first guess.  When a human makes a decision, you can just ask them--why did you do that?  But with machine learning models, not so much.  That's why we wanted to talk a bit about both regularization (again) and also other ways that you can figure out which features have the biggest impact on the predictions of your model.
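
Two common ways to peek under the hood, sketched with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# One lens: impurity-based importances from a tree ensemble.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("forest importances:", forest.feature_importances_.round(3))

# Another lens: L1 regularization zeroes out weak features entirely.
sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
sparse_model.fit(X, y)
print("L1 coefficients: ", sparse_model.coef_.round(3))
```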

Space Codes!

March 20, 2017 02:50 - 23 minutes - 32.9 MB

It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled.  How, then, can you encode messages so that errors can be detected and corrected?  How does the decoding process allow you to actually find and correct the errors?  In this episode, we'll talk abo...
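
To make the detect-and-correct idea concrete, here's a classic Hamming(7,4) code in numpy; it's a textbook sketch, far simpler than the codes actually flown on deep-space links:

```python
import numpy as np

# Hamming(7,4): 4 data bits -> 7-bit codeword; any single flipped bit
# can be located and corrected.
G = np.array([[1,0,0,0,1,1,0],   # generator matrix (data bits + parity)
              [0,1,0,0,1,0,1],
              [0,0,1,0,0,1,1],
              [0,0,0,1,1,1,1]])
H = np.array([[1,1,0,1,1,0,0],   # parity-check matrix
              [1,0,1,1,0,1,0],
              [0,1,1,1,0,0,1]])

data = np.array([1, 0, 1, 1])
codeword = data @ G % 2
received = codeword.copy()
received[2] ^= 1                 # corrupt one bit in "transit"

syndrome = H @ received % 2      # nonzero syndrome means an error
# The syndrome matches the column of H at the error's position.
error_pos = np.where((H.T == syndrome).all(axis=1))[0][0]
received[error_pos] ^= 1         # flip the offending bit back
print((received == codeword).all())  # True: error found and fixed
```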

Finding (and Studying) Wikipedia Trolls

March 13, 2017 01:44 - 15 minutes - 21.7 MB

You may be shocked to hear this, but sometimes, people on the internet can be mean.  For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem.  Fighting the problem starts with understanding it, and understanding it starts with measuring it; the thing is, for a huge website like Wikipedia, there can be millions of edits and comments where abuse might happen, so measurement isn't a simple tas...

A Sprint Through What's New in Neural Networks

March 06, 2017 03:27 - 16 minutes - 23.3 MB

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we have another installment in our "neural nets: they so smart!" series, talking about three topics.  And all the topics this week were listener suggestions, too!

Empirical Bayes

February 20, 2017 03:30 - 18 minutes - 26 MB

Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? Empirical Bayes has an elegant answer: look to your previous experience, and use past measurements as a starting point in your prior. Scratching your head about some of those terms, and why they matter? Lucky for you, you're standing in front of a podcast episode that unpacks all of this.
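
Here's one minimal version of the recipe for binomial rates, with made-up historical data and a method-of-moments prior (a simple choice among several):

```python
import numpy as np

# Empirical Bayes: fit a Beta(alpha, beta) prior to past measurements
# by matching the mean and variance of the historical success rates.
past_rates = np.array([0.26, 0.31, 0.28, 0.24, 0.30, 0.27, 0.29, 0.25])
m, v = past_rates.mean(), past_rates.var(ddof=1)
nu = m * (1 - m) / v - 1          # implied alpha + beta
alpha, beta = m * nu, (1 - m) * nu

# New, small sample: 9 successes out of 20 trials. The raw estimate
# is 0.45, but the posterior mean shrinks it toward past experience.
successes, trials = 9, 20
posterior_mean = (alpha + successes) / (alpha + beta + trials)
print(f"prior mean {m:.3f}, raw {successes/trials:.3f}, "
      f"shrunk {posterior_mean:.3f}")
```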

Endogenous Variables and Measuring Protest Effectiveness

February 13, 2017 03:31

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed. In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging. So, what to do? Well, at lea...

Calibrated Models

February 06, 2017 01:56 - 14 minutes - 20 MB

Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change... This week, we're exploring calibrated risk models, because that's a kind of model that seems like it would benefit from some nice ROC analysis, but in fact the ROC AUC can steer you wrong there.
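
To check calibration directly instead of leaning on ROC AUC, you can bucket the predictions and compare claimed probabilities against observed frequencies; a sketch with scikit-learn:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A calibrated model's probabilities should mean what they say: of the
# events scored 0.8, about 80% should actually happen.
X, y = make_classification(n_samples=4000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="isotonic")
calibrated.fit(X_tr, y_tr)
probs = calibrated.predict_proba(X_te)[:, 1]

frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```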

Ensemble Algorithms

January 23, 2017 02:31 - 13 minutes - 18 MB

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosted Trees (two types of ensemble algorithms) are some of the strongest out-of-...
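
A quick taste of that something-for-nothing feeling, comparing one tree against two classic ensembles (synthetic data, default settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# One tree vs. two ensembles built from many "ok" trees.
for name, model in [
    ("single decision tree", DecisionTreeClassifier(random_state=0)),
    ("random forest       ", RandomForestClassifier(random_state=0)),
    ("gradient boosting   ", GradientBoostingClassifier(random_state=0)),
]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name} accuracy: {score:.3f}")
```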

How to evaluate a translation: BLEU scores

January 16, 2017 01:59 - 17 minutes - 23.5 MB

As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward. When a machine is doing the translating, it's awfully easy to end up with a robotic-sounding text; as the state of the art in machine translation improves, though, a natural question to ask is: according to what measure? How do we quantify a "good" translation? Enter t...
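
If you have NLTK installed, you can compute BLEU on toy sentences like this (smoothing added because very short texts otherwise score zero on higher-order n-grams):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
fluent = ["the", "cat", "sat", "on", "the", "mat"]
clunky = ["cat", "the", "on", "sat", "mat", "the"]

# BLEU compares n-gram overlap between candidate and reference(s).
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, fluent, smoothing_function=smooth))  # 1.0
print(sentence_bleu(reference, clunky, smoothing_function=smooth))  # far lower
```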

Zero Shot Translation

January 09, 2017 03:20 - 25 minutes - 35.1 MB

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting features of Google's new neural machine translation system, namely that with minimal tweaking, it can accommodate many different languages in a single neural net, that it can do a half-decent job of translating between language pairs it's never been explicitly trained on, and that it seems to have i...

Google Neural Machine Translation

January 02, 2017 01:44 - 18 minutes - 25 MB

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use for decades are giving way to a neural net that, across the board, appears to be giving more fluent and natural-sounding translations. This episode recaps statistical phrase-based methods, digs into the RNN architecture a little bit, and reca...

Data and the Future of Medicine : Interview with Precision Medicine Initiative researcher Matt Might

December 26, 2016 01:19 - 34 minutes - 47.9 MB

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the Obama Administration winds down, we're talking with Matt about the goals and accomplishments of precision medicine (and related projects like the Cancer Moonshot) and what he foresees as the future marriage of data and medicine. Many thanks to Matt, our friends over at Partially Derivative (hi, Jon...

Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil

December 18, 2016 17:53 - 46 minutes - 63.4 MB

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil. We think DJ's message about the importance and impact of data science is worth spreading, so it's our pleasure to bring it to you today. A huge thanks to Jonathon Morgan and Partially Derivative for sharing this interview with us--enjoy! Relevan...

How to Lose at Kaggle

December 12, 2016 04:28 - 17 minutes - 23.7 MB

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific kind of overfitting, one that's especially easy to fall into on popular competitions, can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.

Attacking Discrimination in Machine Learning

December 05, 2016 03:38 - 23 minutes - 32.1 MB

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discrimination can sneak into these situations (even when everyone is acting with the best of intentions!). Now, these decisions are often made with the assistance of machine learning and statistical models, but unfortunately these algorithms pick up on the discrimination in the world (it sneaks in thro...

Recurrent Neural Nets

November 28, 2016 02:47 - 12 minutes - 17.3 MB

This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems, and the importance of forgetfulness in RNNs. Relevant links: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
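
Stripped of all the deep learning machinery, the recurrent part is just a loop that carries a hidden state forward; a bare numpy sketch (LSTMs add gates that decide what to remember and forget):

```python
import numpy as np

def rnn_forward(x_seq, h0, Wxh, Whh, b):
    """A vanilla RNN unrolled over a sequence: the hidden state h is
    the "memory" carried from step to step."""
    h = h0
    for x in x_seq:
        h = np.tanh(Wxh @ x + Whh @ h + b)  # same weights every step
    return h

rng = np.random.default_rng(0)
d_in, d_hidden, steps = 4, 8, 10
x_seq = rng.normal(size=(steps, d_in))
h_final = rnn_forward(x_seq, np.zeros(d_hidden),
                      rng.normal(size=(d_hidden, d_in)) * 0.1,
                      rng.normal(size=(d_hidden, d_hidden)) * 0.1,
                      np.zeros(d_hidden))
print(h_final.shape)  # (8,) -- a summary of the whole sequence
```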

Stealing a PIN with signal processing and machine learning

November 21, 2016 02:32 - 16 minutes - 23.2 MB

Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone transactions without ever having physical access to your phone. This episode has it all, folks--channel state information, ICMP echo requests, low-pass filtering, PCA, dynamic time warping, and the PIN for your phone.

Neural Net Cryptography

November 14, 2016 04:06 - 16 minutes - 22.3 MB

Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encryption methods that, as best we can tell, are unlike anything humans have ever seen before. Relevant links: http://arstechnica.co.uk/information-technology/2016/10/google-ai-neural-network-cryptography/ https://arxiv.org/pdf/1610.06918v1.pdf

Deep Blue

November 07, 2016 04:20 - 20 minutes - 27.6 MB

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player. It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its opponent with a weird move, might not have been so inspired after all. It might have been nothing more than a bug in the program, and it changed computer science history. Relevant links: https://www.wired.com/2012/09/deep-blue-computer-bug/

Organizing Google's Datasets

October 31, 2016 02:17 - 15 minutes - 20.6 MB

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they've built a system for that organizational challenge. This episode is all about that system, called Goods, and in particular we'll dig into some of the details of what makes this so tough. Relevant links: http://static.googleusercontent.com/media/...

Fighting Cancer with Data Science: Followup

October 24, 2016 01:58 - 25 minutes - 35.4 MB

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to cancer data policy were suggested to the Vice President. See lineardigressions.com for links to the reports discussed on this episode.

The 19-year-old determining the US election

October 17, 2016 01:01 - 12 minutes - 17.1 MB

Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has been kind of an outlier for quite a while. This week, the NY Times took a closer look at this poll, and was able to figure out the reason it's such an outlier. It all goes back to a 19-year-old African American man, living in Illinois, who really likes Donald Trump... Relevant Links: http://www.n...

How to Steal a Model

October 09, 2016 22:57 - 13 minutes - 18.7 MB

What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound far-fetched? It isn't. If that person can ask for predictions from the model, and he (or she) asks just the right questions, the model can be reverse-engineered right out from under you. Relevant links: https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_tramer.pdf
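
A toy version of the attack (the paper's techniques are cleverer, but the query-the-API-then-train-a-surrogate idea looks like this):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The "victim" model lives behind a prediction API: the attacker never
# sees its training data or parameters, only its answers to queries.
rng = np.random.default_rng(0)
X_secret = rng.normal(size=(1000, 5))
y_secret = (X_secret @ np.array([2.0, -1.0, 0.5, 0.0, 1.0]) > 0).astype(int)
victim = LogisticRegression().fit(X_secret, y_secret)

X_queries = rng.normal(size=(2000, 5))    # attacker-chosen inputs
answers = victim.predict(X_queries)       # only outputs are visible
surrogate = LogisticRegression().fit(X_queries, answers)

X_check = rng.normal(size=(1000, 5))
agreement = (surrogate.predict(X_check) == victim.predict(X_check)).mean()
print(f"surrogate agrees with victim on {agreement:.1%} of new inputs")
```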

Regularization

October 03, 2016 02:13 - 17 minutes - 24 MB

Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewhat smaller number of cases to learn from; supervised machine learning algorithms break, or learn spurious or un-interpretable patterns. What to do? Regularization can be one of your best friends here--it's a method that penalizes overly complex models, which keeps the dimensionality of your model u...
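
To watch the penalty do its job, compare plain least squares with ridge and lasso when features outnumber samples (synthetic data, arbitrary penalty strengths):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Many features, few cases: plain least squares happily fits noise,
# while a penalty on coefficient size keeps the model honest.
rng = np.random.default_rng(0)
n, d = 50, 200                        # more features than samples!
X = rng.normal(size=(n, d))
y = X[:, 0] * 3 + rng.normal(size=n)  # only feature 0 actually matters

for name, model in [("OLS  ", LinearRegression()),
                    ("Ridge", Ridge(alpha=10.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name}: {nonzero} nonzero coefficients, "
          f"coef[0] = {model.coef_[0]:.2f}")
```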

The Cold Start Problem

September 26, 2016 02:24 - 15 minutes - 21.4 MB

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the better they do at suggesting new movies, which is great until you realize that you have to start somewhere. The "cold start" problem will be our focus in this episode, both the heuristic solution...

Open Source Software for Data Science

September 19, 2016 04:27 - 20 minutes - 27.6 MB

If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, but because everyone benefits when we work together and share tools. Tim Head of scikit-optimize chats with us further about what it's like to maintain an open source library, how to get involved in open source, and why people like him need people like you to make it all work.

Scikit + Optimization = Scikit-Optimize

September 12, 2016 01:54 - 15 minutes - 21.6 MB

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python. Relevant links: https://scikit-optimize.github.io/ http://www.wildtreetech.com/
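
The basic usage pattern looks like this, assuming scikit-optimize is installed and using a cheap stand-in for a genuinely expensive objective:

```python
from skopt import gp_minimize  # assumes scikit-optimize is installed

# Bayesian optimization of a black-box function: gp_minimize fits a
# Gaussian process to the evaluations it has seen and picks the next
# point to try by balancing exploration and exploitation.
def expensive_objective(params):
    x, = params
    return (x - 2.0) ** 2 + 0.5    # pretend each call costs minutes

result = gp_minimize(expensive_objective,
                     dimensions=[(-5.0, 5.0)],  # search space for x
                     n_calls=25, random_state=0)
print(result.x, result.fun)  # should land near x = 2.0, f = 0.5
```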

Two Cultures: Machine Learning and Statistics

September 05, 2016 01:50 - 17 minutes - 24 MB

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation, and understanding, or about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breiman was one of the titans of both kinds of modeling, a statistician who helped bring machine learning into statistics and vice versa. In this episode, we unpack one of his seminal papers from 2001, when machine learning was just beginning to take ...

Optimization Solutions

August 29, 2016 02:01 - 20 minutes - 27.6 MB

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What to do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode! Relevant link: http://www.lizsander.com/programming/2015/08/04/Heuristic-Search-Algorithms.html
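
A bare-bones sketch of both ideas in one function (a plain hill climber is just the zero-temperature special case):

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, temp=1.0, cooling=0.999,
                        n_iters=20_000):
    """Minimize f by random local moves, occasionally accepting a
    worse point (with probability exp(-delta/temp)) so the search can
    escape local minima. As temp -> 0 this becomes a hill climber."""
    x, best = x0, x0
    for _ in range(n_iters):
        candidate = x + random.uniform(-step, step)
        delta = f(candidate) - f(x)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            x = candidate
            if f(x) < f(best):
                best = x
        temp *= cooling
    return best

# A bumpy function with many local minima; global minimum near x = -0.3.
f = lambda x: x ** 2 + 3 * math.sin(5 * x)
random.seed(0)
print(simulated_annealing(f, x0=4.0))
```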

Twitter Mentions

@benjaffe 1 Episode
@multiarmbandit 1 Episode
@lindigressions 1 Episode
@deepdrumpf 1 Episode