A discussion with Katharine Jarmul, aka kjam, about some of the challenges of data science with respect to testing.
Some of the topics we discuss:
experimentation vs testing
testing pipelines and pipeline changes
automating data validation
property based testing
schema validation and detecting schema changes
using unit test techniques to test data pipeline stages
testing nodes and transitions in DAGs
testing expected and unexpected data
missing data and non-signals
corrupting a dataset with noise
fuzz testing for both data pipelines and web APIs
datafuzz
hypothesis
testing internal interfaces
documenting and sharing domain expertise to build good reasonableness checks
intermediary data and stages
neural networks
speaking at conferences
Special Guest: Katharine Jarmul.
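
For listeners who want a feel for "corrupting a dataset with noise" and "testing expected and unexpected data" before pressing play, here is a minimal sketch using only the Python standard library. It is not taken from the episode and does not use the datafuzz or Hypothesis APIs. The pipeline stage `clean_readings`, the helper `corrupt`, the row layout, and the list of bad values are all hypothetical names and choices made up for illustration.

```python
import math
import random


def clean_readings(rows):
    """Hypothetical pipeline stage: keep rows whose 'value' is a finite number."""
    cleaned = []
    for row in rows:
        value = row.get("value")
        # Skip missing or wrongly typed values (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # Skip NaN and infinity.
        if not math.isfinite(value):
            continue
        cleaned.append({"id": row["id"], "value": float(value)})
    return cleaned


def corrupt(rows, fraction, seed=0):
    """Return a copy of rows with roughly `fraction` of the values replaced by bad data."""
    rng = random.Random(seed)  # seeded so a failing test can be reproduced
    bad_values = [None, float("nan"), float("inf"), "n/a", ""]
    noisy = []
    for row in rows:
        row = dict(row)  # copy, so the clean input is left untouched
        if rng.random() < fraction:
            row["value"] = rng.choice(bad_values)
        noisy.append(row)
    return noisy


def test_stage_survives_noise():
    clean = [{"id": i, "value": float(i)} for i in range(100)]
    noisy = corrupt(clean, fraction=0.3)
    result = clean_readings(noisy)
    # Properties that should hold however the input was corrupted.
    assert len(result) <= len(noisy)
    assert all(math.isfinite(r["value"]) for r in result)
    # Expected data should pass through the stage unchanged.
    assert clean_readings(clean) == clean
```

The test checks properties of the output (nothing non-finite gets through, nothing is invented) rather than exact values, which is the same idea that property-based tools automate by generating the awkward inputs for you.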

Sponsored By:

Python Testing with pytest, 2nd edition: The fastest way to learn pytest and practical testing practices.
Patreon Supporters: Help support the show with as little as $1 per month and be the first to know when new episodes come out.

Links:

@kjam on Twitter: Data Magic and Computer Sorcery
Kjamistan: Data Science
datafuzz’s Python library: The goal of datafuzz is to give you the ability to test your data science code and models with BAD data.
Hypothesis Python library: Hypothesis is a Python library for finding edge cases in your code you wouldn’t have thought to look for.
