
Linear Digressions

288 episodes - English - Latest episode: almost 4 years ago - ★★★★★ - 350 ratings

Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.

Episodes

Re-Release: How To Lose At Kaggle

August 13, 2018 02:31 - 17 minutes - 8.2 MB

We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week! Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific combination of overfitting on popular competitions can take someone who is in the top few spots in the final days of a contest and bu...

Troubling Trends In Machine Learning Scholarship

August 06, 2018 01:31 - 29 minutes - 13.5 MB

There are a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's growing really quickly, but it's also an artifact of strange incentive structures in academic machine learning, and the fact that sometimes machine learning is just really hard. At the same time, high-quality academic work is critical for maintaining the reputation of the field, so in this episo...

Can Fancy Running Shoes Cause You To Run Faster?

July 29, 2018 19:12

The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are really tough to verify, because even if the data suggests that people wearing the shoe are faster, that might be because of correlation, not causation, so I loved reading this article that went through an analysis of thousands of runners' data...

Compliance Bias

July 22, 2018 16:07 - 23 minutes - 10.7 MB

When you're using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it's easy to think that everyone who was assigned to the treatment arm actually gets the treatment, everyone in the control arm doesn't, and that the two groups get their treatment instantaneously. None of these things happen in real life, and if you really care about measuring your treatment effect then that's so...
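
To make the stakes concrete, here's a small simulation (our sketch, not the episode's code): when compliance is correlated with the outcome, a naive "as-treated" comparison diverges from the intent-to-treat estimate, even with a known true effect baked in.

```python
# Minimal simulation of non-compliance in an AB test. All numbers are
# hypothetical; the true treatment effect is set to 0.5.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
assigned = rng.integers(0, 2, n)              # randomized assignment
baseline = rng.normal(0, 1, n)
# Hypothetical: compliance is imperfect and skews toward users with
# higher baseline outcomes.
complies = rng.random(n) < 0.6 + 0.1 * (baseline > 0)
treated = assigned & complies                 # who actually got the treatment
outcome = baseline + 0.5 * treated + rng.normal(0, 1, n)

itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()
as_treated = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"intent-to-treat: {itt:.3f}, naive as-treated: {as_treated:.3f}")
# ITT is diluted below 0.5; the naive comparison is biased above it.
```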

AI Winter

July 15, 2018 20:11 - 19 minutes - 8.72 MB

Artificial Intelligence has been widely lauded as a solution to almost any problem. But as we juxtapose the hype in the field against the real-world benefits we see, it raises the question: Are we coming up on an AI winter?

Rerelease: How to Find New Things to Learn

July 08, 2018 22:28 - 18 minutes - 8.49 MB

We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn. Original Episode: https://lineardigressions.com/episodes/2017/5/14/how-to-find-new-things-to-learn Original Summary: If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible).  We hope this podcast is a part of the solution fo...

Rerelease: Space Codes

July 02, 2018 04:36 - 24 minutes - 11.2 MB

We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started. Original Episode: http://lineardigressions.com/episodes/2017/3/19/space-codes Original Summary: It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to traverse millions of miles, which pr...

Rerelease: Anscombe's Quartet

June 25, 2018 01:20 - 16 minutes - 7.43 MB

We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach. Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet Original Summary: Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different.  It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to kn...
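
You can check the punchline yourself with the copy of the quartet that ships with seaborn (a quick sketch; any copy of Anscombe's data would do):

```python
# All four datasets share nearly identical summary statistics,
# but look completely different when plotted.
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    print(name,
          f"mean_y={group.y.mean():.2f}",
          f"var_y={group.y.var():.2f}",
          f"corr_xy={group.x.corr(group.y):.2f}")
# To see how different they really are:
# sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2)
```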

Rerelease: Hurricanes Produced

June 18, 2018 17:00 - 28 minutes - 12.9 MB

Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.

GDPR

June 11, 2018 02:24 - 18 minutes - 8.42 MB

By now, you have probably heard of GDPR, the EU's new data privacy law. It's the reason you've been getting so many emails about everyone's updated privacy policy. In this episode, we talk about some of the potential ramifications of GDPR in the world of data science.

Git for Data Scientists

June 03, 2018 17:52 - 22 minutes - 10.1 MB

If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--git has a strong following among software engineers but, in our anecdotal experience, data scientists are less likely to know how to use this powerful tool. Never fear: in this episode we'll talk through some of the basics, and what does (and doesn't) translate from version control for regular softwar...

Analytics Maturity

May 20, 2018 15:09 - 19 minutes - 8.95 MB

Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like when they're succeeding. That was the motivation for writing a whitepaper on the analytics maturity of an organization, and that's what we're talking about today. In particular, we break it down into five attributes of an organization that contribute (or not) to their success in analytics, and what ...

SHAP: Shapley Values in Machine Learning

May 13, 2018 14:24 - 19 minutes - 8.79 MB

Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they define a way of assigning credit for outcomes across several contributors, originally to understand how impactful different actors are in building coalitions (hence the game theory background), but now they're being repurposed for quantifying feature importance in machine learning models. This e...
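
For the hands-on version, the shap package implements these ideas for common model types. A rough sketch (API details can vary by version), assuming a scikit-learn tree ensemble:

```python
# Attribute a model's predictions to its input features with shap.
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # fast, exact for tree ensembles
shap_values = explainer.shap_values(X.iloc[:100])
# shap_values[i, j] is how much feature j pushed prediction i away from
# the model's average output; each row sums to prediction minus baseline.
shap.summary_plot(shap_values, X.iloc[:100])
```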

Game Theory for Model Interpretability: Shapley Values

May 07, 2018 02:17 - 27 minutes - 12.4 MB

As machine learning models get into the hands of more and more users, there's an increasing expectation that a black box isn't good enough: users want to understand why the model made a given prediction, not just what the prediction itself is. This is motivating a lot of work on feature importance and model interpretability tools, and one of the most exciting new ones is based on Shapley Values from game theory. In this episode, we'll explain what Shapley Values are and how they make a cool ap...
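
For the game-theory intuition, here's a small sketch that computes Shapley values exactly for a hypothetical three-player coalition game, straight from the definition: a player's credit is their marginal contribution, averaged over every possible ordering of players.

```python
from itertools import permutations

def value(coalition):
    # Hypothetical payoff function: each player has a solo payoff,
    # and players "a" and "b" get a synergy bonus together.
    solo = {"a": 10, "b": 20, "c": 5}
    base = sum(solo[p] for p in coalition)
    bonus = 15 if {"a", "b"} <= set(coalition) else 0
    return base + bonus

players = ["a", "b", "c"]
shapley = {p: 0.0 for p in players}
for order in permutations(players):
    coalition = []
    for p in order:
        before = value(coalition)
        coalition.append(p)
        shapley[p] += (value(coalition) - before) / 6   # 3! = 6 orderings
print(shapley)   # the synergy bonus is split evenly between a and b
```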

AutoML

April 30, 2018 02:50 - 15 minutes - 7.05 MB

If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If you were doing that work five years ago, the algorithms were probably already implemented in popular open-source libraries like scikit-learn, but you still might have spent a lot of time trying different algorithms and tuning hyperparameters to improve performance. If you're doing that work today, ...
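
As a point of reference, here's that "five years ago" workflow sketched with scikit-learn's grid search; AutoML tools aim to automate exactly this loop (and more):

```python
# Manual hyperparameter tuning via exhaustive grid search.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [5, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```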

CPUs, GPUs, TPUs: Hardware for Deep Learning

April 23, 2018 02:52 - 12 minutes - 5.8 MB

A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that makes it possible to do the computational heavy lifting required to build models with thousands or even millions of tunable parameters. This week we'll pretend to be electrical engineers and talk about how modern machine learning is enabled by hardware.

A Technical Introduction to Capsule Networks

April 16, 2018 01:12 - 31 minutes - 14.4 MB

Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the technical details, for those of you ready to have your mind stretched.

A Conceptual Introduction to Capsule Networks

April 09, 2018 01:59 - 14 minutes - 6.45 MB

Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule nets are a completely new type of neural net architecture designed to do image classification on far fewer training cases than convolutional nets, and they're posting results that are competitive with much more mature technologies. In this episode, we'll give a light conceptual introduction to capsul...

Convolutional Neural Nets

April 02, 2018 01:40 - 21 minutes - 10 MB

If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
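
For a feel of the architecture, here's a minimal convolutional net sketched in PyTorch (our example, not the episode's code): convolution and pooling layers learn local image features, then a small fully-connected head does the classification.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = TinyCNN()(torch.randn(8, 1, 28, 28))  # a batch of fake MNIST images
print(logits.shape)                            # torch.Size([8, 10])
```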

Google Flu Trends

March 26, 2018 01:20 - 12 minutes - 5.85 MB

It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and looking for spikes in searches for flu symptoms, doctors appointments, and other related terms. It's a cool idea, but after a few years turned into a cautionary tale of what can go wrong after Google's algorithm systematically overestimated flu ...

How to pick projects for a professional data science team

March 19, 2018 03:07 - 31 minutes - 14.3 MB

This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? And how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on. We have several goals in data science project sel...

Autoencoders

March 12, 2018 01:47 - 12 minutes - 5.81 MB

Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
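
A bare-bones sketch of the idea in PyTorch (ours, deliberately tiny): squeeze the input through a small bottleneck, then score the network on how well it reconstructs its own input.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(64, 784)                     # stand-in for flattened images
loss = nn.functional.mse_loss(model(x), x)  # the output should match the input
loss.backward()                             # no labels needed: unsupervised
```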

When Private Data Isn't Private Anymore

March 05, 2018 03:35 - 26 minutes - 12.1 MB

After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code are a little too open. In this episode, we'll talk about two interesting recent examples: a de-identified medical dataset in Australia that was re-identified so specific celebrities and athletes could be matched to their medical records, and a series of milit...

What makes a machine learning algorithm "superhuman"?

February 26, 2018 04:52 - 34 minutes - 15.9 MB

A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again this week with further developments, as the author of the original blog post pointed us toward more developments. All in all, there's a lot more clarity now around how the authors arrived at their original "better than doctors" claim, and a number...

Open Data and Open Science

February 19, 2018 01:39 - 16 minutes - 7.74 MB

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a timely counterpoint, read this article), but we seem to be moving toward a new normal where data science conclusions are expected to be shown, not just told. Relevant links: https://github.com/fivethirtyeight/data https://blog.patricktriest.com...

Defining the quality of a machine learning production system

February 12, 2018 02:00 - 20 minutes - 9.38 MB

Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system. Relevant links: https://research.google.com/pubs/pub45742.html

Auto-generating websites with deep learning

February 04, 2018 23:02 - 19 minutes - 8.88 MB

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except today we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website ...

The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters

January 29, 2018 02:15 - 20 minutes - 9.47 MB

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we'll walk through both the "classic" implementation of these data structures and how a machine learning model could create the same functionality.
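
For reference, here's a toy version of the "classic" Bloom filter in plain Python, showing the guarantee any learned replacement has to match: false positives are allowed, false negatives never.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, n_hashes=3):
        self.size, self.n_hashes = size, n_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k hash functions by salting one cryptographic hash.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hello")
print(bf.might_contain("hello"), bf.might_contain("goodbye"))  # True False*
# *False with high probability -- that's the "might" in might_contain.
```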

The Case for Learned Index Structures, Part 1: B-Trees

January 22, 2018 02:32 - 18 minutes - 8.63 MB

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called b-trees. The structural form of b-trees makes them efficient for searching, but if you squint at a b-tree and look at it a little bit sideways then the search functionality starts to...
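
The core idea shrinks to a small sketch: for sorted data, "find the key" is really "predict the key's position," which even a linear model can approximate, with a local search to fix up the model's error (the error window here is a guess; the paper learns it from the data):

```python
import numpy as np

keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, 10_000))
positions = np.arange(len(keys))
slope, intercept = np.polyfit(keys, positions, 1)   # the "learned index"

def lookup(key, err=500):
    guess = int(slope * key + intercept)            # model predicts position
    lo, hi = max(guess - err, 0), min(guess + err, len(keys))
    return lo + int(np.searchsorted(keys[lo:hi], key))  # correct locally

i = lookup(keys[4242])
print(i, keys[i] == keys[4242])   # 4242 True
```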

Challenges with Using Machine Learning to Classify Chest X-Rays

January 15, 2018 01:57 - 18 minutes - 8.24 MB

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to visually recognize various diseases in chest x-rays is called into question by a radiologist with machine learning expertise. As it seemingly always does, it comes down to the dataset that's used for training--medical records assume a lot of contex...

The Fourier Transform

January 08, 2018 02:07 - 15 minutes - 7.17 MB

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.
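
A quick demonstration with numpy's FFT: mix two sine waves together, then recover their frequencies and amplitudes from the spectrum.

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)   # 1 second sampled at 1000 Hz
signal = 3 * np.sin(2 * np.pi * 5 * t) + 1 * np.sin(2 * np.pi * 40 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(t), d=t[1] - t[0])
amplitudes = 2 * np.abs(spectrum) / len(t)    # scale back to wave amplitude

for f, a in zip(freqs[amplitudes > 0.5], amplitudes[amplitudes > 0.5]):
    print(f"component at {f:.0f} Hz, amplitude {a:.1f}")  # 5 Hz and 40 Hz
```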

Statistics of Beer

January 02, 2018 01:57 - 15 minutes - 7.02 MB

What better way to kick off a new year than with an episode on the statistics of brewing beer?

Re-Release: Random Kanye

December 24, 2017 19:07 - 9 minutes - 13.1 MB

We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a Markov chain that generates mashups of Kanye West lyrics and Bible verses? Exactly what you think.
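
The whole trick fits in a few lines; here's a tiny bigram Markov chain (with a stand-in corpus, since the original lyric/verse data isn't included here). Train it on two texts mashed together and it will wander between their styles.

```python
import random
from collections import defaultdict

corpus = "the quick brown fox jumps over the lazy dog the quick dog sleeps"
words = corpus.split()

chain = defaultdict(list)
for current, nxt in zip(words, words[1:]):
    chain[current].append(nxt)        # P(next | current), stored as raw counts

word = random.choice(words)
output = [word]
for _ in range(10):
    word = random.choice(chain[word] or words)   # fall back on dead ends
    output.append(word)
print(" ".join(output))
```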

Debiasing Word Embeddings

December 18, 2017 02:31 - 18 minutes - 8.4 MB

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bias from our society can creep into the embeddings and give results that are sexist. For example, occupational words like "doctor" and "nurse" are more highly aligned with "man" or "woman," which can create problems because these word embeddings are used in algorithms that help people find informatio...
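
The geometric heart of one debiasing approach, sketched with tiny made-up vectors (real embeddings have hundreds of dimensions): remove a word's component along the "gender direction" defined by pairs like he/she.

```python
import numpy as np

# Hypothetical 3-d embeddings, purely for illustration:
he, she = np.array([1.0, 0.2, 0.0]), np.array([-1.0, 0.2, 0.0])
doctor = np.array([0.4, 0.8, 0.3])

gender_direction = (he - she) / np.linalg.norm(he - she)
bias = doctor @ gender_direction            # component along that direction
doctor_debiased = doctor - bias * gender_direction

print(doctor @ gender_direction)            # 0.4 before
print(doctor_debiased @ gender_direction)   # 0.0 after
```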

The Kernel Trick and Support Vector Machines

December 11, 2017 01:58 - 17 minutes - 8.15 MB

Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.
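
Here's the payoff sketched in scikit-learn with a synthetic dataset: the same SVM machinery, but swapping in an RBF kernel lets the "linear" algorithm fit a circular boundary with no explicit feature mapping.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # Gaussian kernel via the kernel trick
print("linear kernel accuracy:", linear.score(X, y))  # ~chance, can't separate
print("rbf kernel accuracy:   ", rbf.score(X, y))     # ~1.0
```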

Maximal Margin Classifiers

December 04, 2017 04:03 - 14 minutes - 6.57 MB

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It's a neat way to think about statistical learning and a prerequisite for understanding support vector machines, which we'll cover next week--stay tuned!
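
A quick sketch with scikit-learn: on separable data, a linear SVM with a large penalty approximates the hard-margin classifier, and the margin width falls out of the learned weights as 2/||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=6)

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)  # the points defining the margin
```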

Re-Release: The Cocktail Party Problem

November 27, 2017 02:11 - 13 minutes - 6.28 MB

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Clustering with DBSCAN

November 20, 2017 03:08 - 16 minutes - 7.43 MB

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!
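
Those two parameters in action, sketched with scikit-learn on a synthetic dataset: eps (how close is "close") and min_samples (how many close neighbors make a region "dense").

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))  # 2 irregular "moons"
print("outliers:", (labels == -1).sum())           # label -1 marks noise points
```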

The Kaggle Survey on Data Science

November 13, 2017 02:49 - 25 minutes - 11.6 MB

Want to know what's going on in data science these days? There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle. Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions. Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the da...

Machine Learning: The High Interest Credit Card of Technical Debt

November 06, 2017 04:35 - 22 minutes - 10.2 MB

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt: the inefficiencies that crop up in the code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. ...

Improving Upon a First-Draft Data Science Analysis

October 30, 2017 01:38 - 15 minutes - 6.88 MB

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition. However, if you've ever tried to maintain a machine learning workflow (as opposed to building it from scratch), you know that taking a simple modeling scrip...

Survey Raking

October 23, 2017 02:51 - 17 minutes - 23.9 MB

It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do? Reweighting the survey data, so that things like demographic distributions look similar between the survey and general populations, is a standard technique and in this episode we'll talk about survey raking, a way to calculate survey weights whe...
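
Here's a bare-bones raking loop (iterative proportional fitting) sketched in pandas, with made-up respondents and population margins: alternate between matching each variable's weighted distribution to its known target until the weights settle down.

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({
    "gender": ["m", "m", "f", "f", "f", "m"],
    "age":    ["young", "old", "young", "old", "young", "young"],
})
# Hypothetical population margins to match:
targets = {"gender": {"m": 0.49, "f": 0.51},
           "age":    {"young": 0.40, "old": 0.60}}

weights = np.ones(len(survey))
for _ in range(50):                      # usually converges in a few passes
    for var, target in targets.items():
        for level, share in target.items():
            mask = (survey[var] == level).to_numpy()
            current = weights[mask].sum() / weights.sum()
            weights[mask] *= share / current
print(survey.assign(weight=weights))
```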

Happy Hacktoberfest

October 16, 2017 01:46 - 15 minutes - 21.5 MB

It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you can get involved. You can even get a free t-shirt! Hacktoberfest main page: https://hacktoberfest.digitalocean.com/#details

Re-Release: Kalman Runners

October 09, 2017 02:28 - 17 minutes - 24.6 MB

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which are part algorithm, part elaborate metaphor for figuring out, if you're running a race but don't have a watch, how fast you're going. Katie's Chicago race report:
miles 1-13: light ankle pain, lovely cool weather, the most fun EVAR
miles 13-17: no more ankle pain but quads start getting tight, it's a little more effort m...
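
The metaphor as code: a one-dimensional Kalman filter tracking a runner's pace (a minimal sketch, with made-up mile splits). Each step blends the prediction with the noisy observation, weighted by how much we trust each.

```python
import numpy as np

rng = np.random.default_rng(1)
true_pace = 8.0                                     # minutes per mile
observations = true_pace + rng.normal(0, 1.0, 20)   # noisy mile splits

estimate, uncertainty = 10.0, 4.0    # rough initial guess and its variance
process_var, obs_var = 0.01, 1.0     # we trust the pace to be fairly steady
for z in observations:
    uncertainty += process_var                  # predict: uncertainty grows
    gain = uncertainty / (uncertainty + obs_var)
    estimate += gain * (z - estimate)           # update: move toward the data
    uncertainty *= (1 - gain)
print(f"estimated pace: {estimate:.2f} (true: {true_pace})")
```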

Neural Net Dropout

October 02, 2017 03:32 - 18 minutes - 25.9 MB

Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly drop hidden units (and their connections) during training, a technique known as dropout.  It seems counterintuitive that undermining the structural integrity of the neural net makes it robust against overfitting, but in the world of neural nets, weirdness is just how things go sometimes. Relevant links: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
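
"Inverted" dropout fits in two lines of numpy (a sketch of the technique, not the paper's code): zero out each unit with probability p at training time and rescale the survivors so the expected activation is unchanged at test time.

```python
import numpy as np

def dropout(activations, p=0.5):
    mask = np.random.random(activations.shape) >= p  # keep each unit w.p. 1-p
    return activations * mask / (1 - p)              # rescale the survivors

hidden = np.ones((4, 8))             # stand-in for a hidden layer's output
print(dropout(hidden, p=0.5))        # about half zeros, the rest 2.0
```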

Disciplined Data Science

September 25, 2017 01:49 - 29 minutes - 40.6 MB

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes some thinking Katie's been doing herself around how to guide data science teams toward more mature, effective practices. We'll go through five key characteristics of great data science teams, which we collectively refer to as "disciplined data scien...

Hurricane Forecasting

September 18, 2017 01:37 - 27 minutes - 38.4 MB

It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those models, talking about the different types of models, the theory behind them, and how they've evolved through the years.

Finding Spy Planes with Machine Learning

September 11, 2017 02:11 - 18 minutes - 24.9 MB

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to find them.  The fun thing here, in our opinion, is the blend of intrigue (spy planes!) with tech journalism and a heavy dash of publicly available and reproducible analysis code so that you (yes, you!) can see exactly how BuzzFeed identifies the surveillance planes.

Data Provenance

September 04, 2017 01:35 - 22 minutes - 31.3 MB

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system.  For data scientists who might want to reconstruct past models, though, it's not just about keeping the modeling code.  It's also about saving a version of the data that made the model.  There are a lot of other benefits to keeping track of datasets, so in this episode we'll talk about data lineage or data provenance.
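
One low-tech version of the idea (our sketch with hypothetical file names, not a full lineage system): fingerprint the exact training data with a hash and store it alongside the model metadata, so you can later verify which dataset produced which model.

```python
import hashlib
import json

def fingerprint(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Write a tiny stand-in dataset so the example runs end to end:
with open("users.csv", "w") as f:
    f.write("user_id,churned\n1,0\n2,1\n")

metadata = {
    "model": "churn_model_v3.pkl",        # hypothetical artifact names
    "training_data": "users.csv",
    "data_sha256": fingerprint("users.csv"),
    "code_commit": "abc1234",             # pair it with the git revision too
}
print(json.dumps(metadata, indent=2))
```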

Adversarial Examples

August 28, 2017 02:25 - 16 minutes - 22.2 MB

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we have a roundup of a few successful efforts to create robust adversarial examples, including what it means for an adversarial example to be robust and what this might mean for machine learning in the future.
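
For a taste of how easy the fooling can be, here's the classic fast gradient sign method (FGSM) sketched in PyTorch; the episode's robust examples go further, but the principle is the same: nudge every pixel a small step in whichever direction most increases the model's loss.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
image = torch.rand(1, 1, 28, 28, requires_grad=True)
label = torch.tensor([3])

loss = nn.functional.cross_entropy(model(image), label)
loss.backward()                                  # gradient w.r.t. the pixels

epsilon = 0.1                                    # small, near-imperceptible step
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
# The perturbation is tiny, but chosen to be maximally confusing:
print((adversarial - image).abs().max().item())  # at most 0.1 per pixel
```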
