Linear Digressions

So long, and thanks for all the fish

July 26, 2020 23:32 - 35 minutes - 16.4 MB

All good things must come to an end, including this podcast. This is the last episode we plan to release, and it doesn’t cover data science—it’s mostly reminiscing, thanking our wonderful audience (that’s you!), and marveling at how this thing that started out as a side project grew into a huge part of our lives for over 5 years. It’s been a ride, and a real pleasure and privilege to talk to you each week. Thanks, best wishes, and good night! —Katie and Ben

A Reality Check on AI-Driven Medical Assistants

July 19, 2020 23:51 - 14 minutes - 6.41 MB

The data science and artificial intelligence community has made amazing strides in the past few years to algorithmically automate portions of the healthcare process. This episode looks at two computer vision algorithms, one that diagnoses diabetic retinopathy and another that classifies liver cancer, and asks the question—are patients now getting better care, and achieving better outcomes, with these algorithms in the mix? The answer isn’t no, exactly, but it’s not a resounding yes, because t...

A Data Science Take on Open Policing Data

July 13, 2020 02:02 - 23 minutes - 10.9 MB

A few weeks ago, we put out a call for data scientists interested in issues of race and racism, or people studying how those topics can be studied with data science methods, to get in touch and come talk to our audience about their work. This week we’re excited to bring on Todd Hendricks, Bay Area data scientist and a volunteer who reached out to tell us about his studies with the Stanford Open Policing dataset.

The Data Science Open Source Ecosystem

June 29, 2020 02:34 - 23 minutes - 10.6 MB

Open source software is ubiquitous throughout data science, and enables the work of nearly every data scientist in some way or another. Open source projects, however, are disproportionately maintained by a small number of individuals, some of whom are institutionally supported, but many of whom do this maintenance on a purely volunteer basis. The health of the data science ecosystem depends on the support of open source projects, on an individual and institutional level. https://hdsr.mitpres...

Criminology and Data Science

June 15, 2020 01:26 - 30 minutes - 14.2 MB

This episode features Zach Drake, a working data scientist and PhD candidate in the Criminology, Law and Society program at George Mason University. Zach specializes in bringing data science methods to studies of criminal behavior, and got in touch after our last episode (about racially complicated recidivism algorithms). Our conversation covers a wide range of topics—common misconceptions around race and crime statistics, how methodologically-driven criminology scholars think about building ...

Racism, the criminal justice system, and data science

June 07, 2020 23:33 - 31 minutes - 14.5 MB

As protests sweep across the United States in the wake of the killing of George Floyd by a Minneapolis police officer, we take a moment to dig into one of the ways that data science perpetuates and amplifies racism in the American criminal justice system. COMPAS is an algorithm that claims to give a prediction about the likelihood of an offender to re-offend if released, based on the attributes of the individual, and guess what: it shows disparities in the predictions for black and white offe...

An interstitial word from Ben

June 05, 2020 01:38 - 5 minutes - 8.24 MB

A message from Ben around algorithmic bias, and how our models are sometimes reflections of ourselves.

Convolutional Neural Networks

May 31, 2020 21:46 - 21 minutes - 10 MB

This is a re-release of an episode that originally aired on April 1, 2018. If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that make them so good at image tasks.
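
As a rough illustration of the core operation in a convolutional layer, here is a minimal sketch in plain Python of sliding a small kernel over a 2D image with stride 1 and no padding. The function name `conv2d_valid`, the toy image, and the edge kernel are assumptions made for illustration and are not taken from the episode; real frameworks add channels, padding, strides, and learned weights.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with stride 1 and no padding.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before it is applied.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to left-to-right increases in intensity.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, edge_kernel)
```

The same small kernel is reused at every position, so the layer needs far fewer weights than a fully connected one and responds to a pattern wherever it appears in the image.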

Protecting Individual-Level Census Data with Differential Privacy

May 18, 2020 01:49 - 21 minutes - 9.76 MB

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be ways to re-identify the records or otherwise figure out sensitive personal information. That problem has motivated the study of differential privacy, a set of techniques and definitions for keeping personal information private when datasets are released or used for study. Differential privacy is ge...
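
One standard building block of differential privacy is the Laplace mechanism, which adds calibrated random noise to a query result. The sketch below is an assumption about what a minimal version could look like, not a description of the method discussed in the episode; the names `laplace_noise` and `private_count`, the `epsilon` value, and the sample ages are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical ages, for illustration only.
ages = [23, 35, 41, 52, 67, 29, 71, 38]
noisy_over_40 = private_count(ages, lambda age: age > 40, epsilon=0.5,
                              rng=random.Random(0))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means a more accurate but less private answer.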

Causal Trees

May 11, 2020 01:34 - 15 minutes - 7.08 MB

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusions from naive data methods leads to biased answers) but economists Susan Athey and Guido Imbens are on the case. This episode explores their algorithm for recursively partitioning a dataset to find heterogeneous treatment effects, or for you ML nerds, applying decision trees to causal inference pro...

The Grammar Of Graphics

May 04, 2020 01:12 - 35 minutes - 16.3 MB

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and annotated so you can quickly extract the information in the underlying data using just visual cues. It’s a bit abstract but very profound, and these principles underlie the ggplot2 package in R that makes famously beautiful plots with minimal code. This episode covers a paper by Hadley Wickham (author ...

Gaussian Processes

April 27, 2020 01:33 - 20 minutes - 9.58 MB

It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate: linear? quadratic? sinusoidal? some combination of these, and perhaps others? Gaussian processes introduce a nonparametric option where you can fit over all the possible types of functions, using the data points in your datasets as constraints on the results that you get (the idea being that, no matter what the “true” underlying fun...

Keeping ourselves honest when we work with observational healthcare data

April 20, 2020 02:43 - 19 minutes - 8.76 MB

The abundance of data in healthcare, and the value we could capture from structuring and analyzing that data, is a huge opportunity. It also presents huge challenges. One of the biggest challenges is how, exactly, to do that structuring and analysis—data scientists working with this data have hundreds or thousands of small, and sometimes large, decisions to make in their day-to-day analysis work. What data should they include in their studies? What method should they use to analyze it? What h...

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell

April 13, 2020 01:55 - 28 minutes - 13.3 MB

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC Berkeley, has a formulation for modifications to AI that we should study and try implementing now to keep it much safer in the long run. Prof. Russell’s new book, “Human Compatible: Artificial Intelligence and the Problem of Control” gives an accessible but deeply thoughtful exploration of why he t...

Putting machine learning into a database

April 06, 2020 01:51 - 24 minutes - 11.2 MB

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or python. But if we think ahead a few years, a few visionary researchers are starting to see a world in which the ML pipelines can actually be deployed inside the database. Why? One strong advantage for databases is they have built-in features for data governance, including things like permissioning access and tracking the provenance of...

The work-from-home episode

March 29, 2020 22:23 - 29 minutes - 13.3 MB

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our family safe and slow the transmission of covid-19. But working from home is an adjustment for many of us, and can hold some challenges compared to coming in to the office every day. This episode explores this a little bit, informally, as we compare our new work-from-home setups and reflect on what’s working well and what we’re finding challenging.

Understanding Covid-19 transmission: what the data suggests about how the disease spreads

March 23, 2020 01:03 - 25 minutes - 11.6 MB

Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads and especially how much of the spread of the disease comes from carriers who are experiencing no or mild symptoms but are contagious anyway. This episode digs into the epidemiological model that was published in Science this week—this model finds that the data suggests that the majority of carriers of the coronavirus...

Network effects re-release: when the power of a public health measure lies in widespread adoption

March 15, 2020 22:43 - 26 minutes - 12.2 MB

This week’s episode is a re-release of a recent episode, which we don’t usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public health measures for infectious diseases get most of their effectiveness from their widespread adoption: most of the protection you get from a vaccine, for example, comes from all the other people who also got the vaccine. That’s why measures like social distancing are so important right now: even if you...

Causal inference when you can't experiment: difference-in-differences and synthetic controls

March 09, 2020 01:39 - 20 minutes - 9.52 MB

When you need to untangle cause and effect, but you can’t run an experiment, it’s time to get creative. This episode covers difference in differences and synthetic controls, two observational causal inference techniques that researchers have used to understand causality in complex real-world situations.
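
The arithmetic behind difference in differences is short enough to sketch. This is a minimal illustration under the usual parallel-trends assumption (the treated group would have changed by the same amount as the control group had it not been treated); the function name and the sales figures are hypothetical, and the sketch does not cover synthetic controls.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Estimate a treatment effect as the treated group's change minus the control group's change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly sales in two regions, before and after a policy change in one of them.
effect = difference_in_differences(
    treated_pre=[10, 12, 11],
    treated_post=[16, 18, 17],
    control_pre=[9, 10, 11],
    control_post=[11, 12, 13],
)
```

Here the treated region rose by 6 and the control region by 2, so the estimate attributes the remaining 4 to the policy.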

Better know a distribution: the Poisson distribution

March 02, 2020 02:55 - 31 minutes - 14.6 MB

This is a re-release of an episode that originally ran on October 21, 2018. The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things: there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite everyday applications: ...
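
For readers who want to see the distribution in code, here is a minimal sketch of the Poisson probability mass function, P(k) = rate^k * exp(-rate) / k!. The help desk scenario and the rate of 3 calls per hour are hypothetical and are not one of the applications from the episode.

```python
import math


def poisson_pmf(k, rate):
    """Probability of observing exactly k events when the expected count is `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


# Hypothetical example: a help desk that averages 3 calls per hour.
rate = 3.0
prob_no_calls = poisson_pmf(0, rate)
prob_at_most_two = sum(poisson_pmf(k, rate) for k in range(3))
```

The single parameter `rate` is both the mean and the variance of the distribution, which is part of why it is so simple to use.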

The Lottery Ticket Hypothesis

February 23, 2020 23:03 - 19 minutes - 9.04 MB

Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neural nets, at least) there are smaller subnetworks present where most of the predictive power resides. The fascinating thing is that, for some of these subnetworks (so-called “winning lottery tickets”), it’s not the training process that makes them good at their classification or regression tasks: th...

Interesting technical issues prompted by GDPR and data privacy concerns

February 17, 2020 01:50 - 20 minutes - 9.35 MB

Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR are imposing more stringent rules on who can use what data for what purposes, with an end goal of giving consumers more control and privacy around their data. This episode digs into this topic, but not from a security or legal perspective—this week, we talk about some of the interesting technical ch...

Thinking of data science initiatives as innovation initiatives

February 10, 2020 01:10 - 17 minutes - 7.99 MB

Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you’re constantly hearing that data science is revolutionizing the world and the firms that survive and thrive in the coming years are those that execute on a data strategy. What does this mean for your company? How can you best guide your established firm through a successful transition to becoming data-driven? How do you balance the momentum your firm has right now, an...

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng

February 02, 2020 23:36 - 31 minutes - 14.5 MB

As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institutions are coming to terms with what’s required to educate the next cohorts of data scientists. The heterogeneity and speed of the field makes it challenging for even the most talented and dedicated educators to know what a data science education “should” look like. This doesn’t faze Xiao-Li Meng, P...

Running experiments when there are network effects

January 27, 2020 00:13 - 24 minutes - 11.3 MB

Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that’s not a safe assumption, especially when there are network effects (like in almost any social context, for instance!). SUTVA, or the stable unit treatment value assumption, is a big phrase for this assumption, and violations of SUTVA make for some pretty interesting experiment designs. From news feeds in LinkedIn to disentangling herd immunity from ind...

Zeroing in on what makes adversarial examples possible

January 20, 2020 02:41 - 22 minutes - 10.5 MB

Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one of an infinite number of mistakes in labeling data that humans would never make but computers make with joyous abandon. What gives? A compelling new argument makes the case that it’s not the algorithms so much as the features in the datasets that hold the clue. This week’s episode goes through sev...

Unsupervised Dimensionality Reduction: UMAP vs t-SNE

January 13, 2020 00:53 - 29 minutes - 13.5 MB

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some advantages. This episode gives a quick recap of t-SNE, especially the connection it shares with information theory, then gets into how UMAP is different (many say better). Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP’s advanta...

Data scientists: beware of simple metrics

January 05, 2020 22:54 - 24 minutes - 11.4 MB

Picking a metric for a problem means defining how you’ll measure success in solving that problem. Which sounds important, because it is, but oftentimes new data scientists only get experience with a few kinds of metrics when they’re learning and those metrics have real shortcomings when you think about what they tell you, or don’t, about how well you’re really solving the underlying problem. This episode takes a step back and says, what are some metrics that are popular with data scientists, ...

Communicating data science, from academia to industry

December 30, 2019 01:53 - 26 minutes - 12 MB

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-defined. That doesn’t bother our guest today, Prof. Xiao-Li Meng of the Harvard statistics department, who is leading an effort to start an open-access Data Science Review journal in the model of the Harvard Business Review or Law Review. This episode features Xiao-Li talking about the need he sees for...

Optimizing for the short-term vs. the long-term

December 23, 2019 02:50 - 19 minutes - 8.88 MB

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might have effects that last a lot longer than a few days or a few weeks—having a big sale might increase sales this week, but doing that repeatedly will teach customers to wait until there’s a sale and never buy anything at full price, which could ultimately drive down revenue in the long term. Increasing ...

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions

December 16, 2019 03:15 - 27 minutes - 12.7 MB

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline will eventually go on to win FDA approval. This episode gets into the story behind that paper: how the approval prospects of different drugs inform the investment decisions of pharma companies, how to stitch together siloed and incomplete datasets to form a coherent picture, and how the academics bui...

Using machine learning to predict drug approvals

December 08, 2019 22:56 - 25 minutes - 11.4 MB

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in the healthcare system lay the groundwork for compelling, innovative new data initiatives. One spot that drives much of the cost of medicine is the riskiness of developing new drugs: drug trials can cost hundreds of millions of dollars to run and, especially given that numerous medicines end up faili...

Facial recognition, society, and the law

December 02, 2019 03:14 - 43 minutes - 19.8 MB

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in spec...

Lessons learned from doing data science, at scale, in industry

November 25, 2019 00:45 - 28 minutes - 12.8 MB

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built lots of models or run lots of experiments, there’s almost certainly a bunch of extra “street smarts” insights you’ve had that go beyond the “book smarts” of more academic studies. The data scientists at Booking.com, who build models and run experiments constantly, have written a paper that bridges ...

Varsity A/B Testing

November 18, 2019 02:09 - 36 minutes - 16.5 MB

When you want to understand if doing something causes something else to happen, like if a change to a website causes a dip or rise in downstream conversions, the gold standard analysis method is to use randomized controlled trials. Once you’ve properly randomized the treatment and effect, the analysis methods are well-understood and there are great tools in R and python (and other languages) to find the effects. However, when you’re operating at scale, the logistics of running all those tes...

The Care and Feeding of Data Scientists: Growing Careers

November 11, 2019 03:44 - 25 minutes - 11.6 MB

This is the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some of our topics of conversation include how to institute hack time as a way to learn new things, what career growth looks like in data science, and how to institutionalize professional growth as part of a career ladder. As with the other episodes in this series, the topics we cover today are also cover...

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists

November 04, 2019 00:21 - 20 minutes - 9.28 MB

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting, interviewing and hiring data scientists. Since data science talent is in such high demand, and data scientists are understandably choosy about where they go to work, a good recruiting and hiring program can have a big impact on the size and quality of the team. Our chat covers much a couple of sec...

The Care and Feeding of Data Scientists: Becoming a Data Science Manager

October 28, 2019 01:27 - 24 minutes - 11.3 MB

Data science management isn’t easy, and many data scientists are finding themselves learning on the job how to manage data science teams as they get promoted into more formal leadership roles. O’Reilly recently released a report, written by yours truly (Katie) and another experienced data science manager, Michelangelo D’Agostino, where we lay out the most important tasks of a data science manager and some thoughts on how to unpack those tasks and approach them in a way that makes a new manager...

Procella: YouTube's super-system for analytics data storage

October 21, 2019 01:27 - 29 minutes - 13.6 MB

If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case that are optimized for the needs and constraints of that use case. You also wouldn’t be YouTube, which found themselves with this problem (gigantic data needs and several very different use cases of what they needed to do with that data) and went a different way: they built one analytics data system to serve them all. Procella, t...

What's really so hard about feature engineering?

October 06, 2019 22:37 - 21 minutes - 9.75 MB

Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it turns out—most data science platforms at this point include explicit features (in the product sense, not the data sense) just for keeping track of and sharing features (in the data sense, not the product sense). Just like a good library needs a catalogue, a city needs a map, and a home chef needs a c...

Data storage for analytics: stars and snowflakes

September 30, 2019 11:22 - 15 minutes - 7.04 MB

If you’re a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you’ll have to make (or live with, if someone else made it) is how to lay out the data in your data warehouse. There are a couple common organizational schemes that you’ll likely encounter, and that we cover in this episode: first is the famous star schema, followed by the also-famous snowflake schema.
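
To make the star layout concrete, here is a small sketch using Python's built-in `sqlite3` module: one central fact table of measurements that points at surrounding dimension tables of descriptive attributes. The table names, columns, and rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
    -- The fact table holds measurements plus a key into each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        revenue REAL
    );
    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
    INSERT INTO dim_date VALUES (1, '2019-09-29', '2019-09'), (2, '2019-09-30', '2019-09');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 7.5);
""")

# A typical analytics query: join the fact table to one dimension and aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a snowflake schema the dimensions are normalized further, so `category` would move out of `dim_product` into its own table and the query would need one more join.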

Data storage: transactions vs. analytics

September 23, 2019 01:49 - 16 minutes - 7.39 MB

Data scientists and software engineers both work with databases, but they use them for different purposes. So if you’re a data scientist thinking about the best way to store and access data for your analytics, you’ll likely come up with a very different set of requirements than a software engineer looking to power an application. Hence the split between analytics and transactional databases—certain technologies are designed for one or the other, but no single type of database is perfect for b...

GROVER: an algorithm for making, and detecting, fake news

September 16, 2019 03:21 - 18 minutes - 8.45 MB

There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes to manipulating public opinion, for example with fake news. Second is the impressive success of generative adversarial networks, and similar algorithms. Third is making state-of-the-art natural language processing algorithms and naming them after muppets. We get all three this week: GROVER is an alg...

Data science teams as innovation initiatives

September 09, 2019 02:24 - 15 minutes - 7.03 MB

When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it’ll be somewhat at odds with the company’s current structure and processes. Which makes sense, right? If you’re a many-decades-old company trying to defend a successful and long-lived legacy and market share, you won’t have the advantage that many upstart competitors have of being able to bake data analytics and science into the core structure of the organization....

Organizational Models for Data Scientists

August 25, 2019 23:06 - 23 minutes - 10.6 MB

When data science is hard, sometimes it’s because the algorithms aren’t converging or the data is messy, and sometimes it’s because of organizational or business issues: the data scientists aren’t positioned correctly to bring value to their organization. Maybe they don’t know what problems to work on, or they build solutions to those problems but nobody uses what they build. A lot of this can be traced back to the way the team is organized, and (relatedly) how it interacts with the rest of t...

Data Shapley

August 19, 2019 02:38 - 16 minutes - 7.75 MB

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorithm for thinking about which examples in a dataset are most important. It makes a lot of intuitive sense: data that’s just repeating examples that you’ve already seen, or that’s noisy or an extreme outlier, might not be that valuable for using to train a machine learning model. But some data is very v...
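
The idea of valuing individual examples can be sketched with exact Shapley values on a toy problem. This is an illustration under loud assumptions, not the algorithm from the paper: the utility function `distinct_labels` is a hypothetical stand-in for model performance, and the brute-force loop over every ordering is only feasible for a handful of points, so real datasets need an approximation.

```python
from itertools import permutations


def exact_shapley_values(points, utility):
    """Average each point's marginal contribution to `utility` over every ordering.

    This is exact but factorial-time, so it only works for tiny datasets.
    """
    values = [0.0] * len(points)
    orderings = list(permutations(range(len(points))))
    for ordering in orderings:
        chosen = []
        previous = utility([])
        for index in ordering:
            chosen.append(points[index])
            current = utility(chosen)
            values[index] += current - previous
            previous = current
    return [value / len(orderings) for value in values]


def distinct_labels(subset):
    """Toy stand-in for model performance: how many distinct labels are covered."""
    return len(set(subset))


# Hypothetical dataset: two redundant examples and one unique example.
points = ["cat", "cat", "dog"]
point_values = exact_shapley_values(points, distinct_labels)
```

The two redundant "cat" examples split the credit for covering that label, while the unique "dog" example keeps all of its credit, which matches the intuition that repeated data is worth less.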

Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking

July 29, 2019 01:30 - 24 minutes - 11.1 MB

The modern scientific method is one of the greatest (perhaps the greatest?) systems we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found their skills in high demand in the business world, where knowing more about a market, or industry, or type of user becomes a competitive advantage. But the scientific method is built upon certain processes, and is disciplined about following them, in a way that can get swept aside in the rush to get som...

Interleaving

July 22, 2019 12:20 - 16 minutes - 7.74 MB

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonical answer for testing how users respond to software changes, but it gets tricky really fast to think about what an A/B test means in the context of an algorithm that returns a ranked list. That’s why we’re talking about interleaving this week—it’s a simple modification to A/B testing that makes it mu...
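
One way to picture interleaving is the team-draft variant, sketched below. The episode description does not say which interleaving method it covers, so treating team-draft as the example is an assumption, and the function names and document IDs are hypothetical. Two rankers take turns contributing their best remaining result to a single list shown to the user, and each click is credited to the ranker that contributed the clicked item.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings into one list, recording which ranker contributed each item."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, credit = [], {}
    total_items = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total_items:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        remaining = [item for item in rankings[team] if item not in credit]
        if not remaining:
            # This ranker has nothing new to offer, so the other one picks.
            team = "B" if team == "A" else "A"
            remaining = [item for item in rankings[team] if item not in credit]
        item = remaining[0]
        interleaved.append(item)
        credit[item] = team
        picks[team] += 1
    return interleaved, credit


def score_clicks(clicked_items, credit):
    """Count clicks per ranker; the ranker with more clicks wins this impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        wins[credit[item]] += 1
    return wins


# Hypothetical rankings from the current algorithm (A) and a candidate (B).
ranking_a = ["doc1", "doc2", "doc3"]
ranking_b = ["doc2", "doc4", "doc1"]
shown, credit = team_draft_interleave(ranking_a, ranking_b, rng=random.Random(0))
```

Because every user sees results from both rankers in the same list, each impression yields a direct comparison instead of being assigned wholly to one arm of the test.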

Deepfakes

July 01, 2019 01:25 - 15 minutes - 6.93 MB

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be a challenge to distinguish a fabricated video from a real one, which is an extraordinary challenge in an era when the truth of what you see on the news or especially on social media is worthy of skepticism. And just in case that wasn’t unsettling enough, the algorithms just keep getting better and ...

Revisiting Biased Word Embeddings

June 24, 2019 00:26 - 18 minutes - 8.31 MB

The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias around professions (as well as other forms of social bias getting reproduced in the algorithm’s embeddings). We covered the topic again a while later, covering methods for de-biasing embeddings to counteract this effect. And now we’re back, with a second pass on the original Word2Vec analogy task, but...
