![Linear Digressions artwork](https://is3-ssl.mzstatic.com/image/thumb/Podcasts113/v4/4d/ca/f2/4dcaf27f-1f74-9477-477b-f7aaecb6d843/mza_1113917496893811473.jpg/100x100bb.jpg)
Data Contamination
Linear Digressions
English - May 02, 2016 02:24 - 20 minutes - 28.8 MB - ★★★★★ - 350 ratings
Technology, data science, machine learning, linear digressions
Supervised machine learning assumes that the features and labels used to build a classifier are isolated from each other: basically, that you can't cheat by peeking. That turns out to be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way.
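To make "peeking" concrete, here is a minimal sketch of what contamination can look like. It is not taken from the episode: the data is synthetic, the feature names `honest` and `leaky` are hypothetical, and the one-feature threshold classifier is just a stand-in for a real model. The `leaky` feature is assumed to be recorded after the outcome is known, so it carries the label with it.

```python
import random


def make_rows(n, seed):
    """Generate synthetic rows; all feature names here are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append({
            "label": label,
            # Independent of the label: carries no real signal.
            "honest": rng.gauss(0.0, 1.0),
            # Assumed to be recorded after the outcome: the label plus noise.
            "leaky": label + rng.gauss(0.0, 0.1),
        })
    return rows


def fit_threshold(rows, feature):
    """Fit a one-feature classifier: split at the midpoint of the class means."""
    means = {}
    for cls in (0, 1):
        values = [r[feature] for r in rows if r["label"] == cls]
        means[cls] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    direction = 1 if means[1] >= means[0] else -1
    return threshold, direction


def accuracy(rows, feature, model):
    """Fraction of rows whose predicted class matches the label."""
    threshold, direction = model
    correct = 0
    for r in rows:
        predicted = 1 if direction * (r[feature] - threshold) > 0 else 0
        correct += predicted == r["label"]
    return correct / len(rows)


train = make_rows(500, seed=1)
test = make_rows(500, seed=2)

honest_acc = accuracy(test, "honest", fit_threshold(train, "honest"))
leaky_acc = accuracy(test, "leaky", fit_threshold(train, "leaky"))

print(f"held-out accuracy, honest feature: {honest_acc:.2f}")
print(f"held-out accuracy, leaky feature:  {leaky_acc:.2f}")
```

The leaky feature scores almost perfectly even on held-out rows, which is why this kind of contamination is hard to spot: a train/test split does not catch it when the leak is present in both halves. The score only collapses once the model runs somewhere the outcome has not happened yet.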
Relevant links:
https://www.researchgate.net/profile/Claudia_Perlich/publication/221653692_Leakage_in_data_mining_Formulation_detection_and_avoidance/links/54418bb80cf2a6a049a5a0ca.pdf