Episode 110: Wisdom in Building Data Infrastructure, Lessons From Open-Source Development, The Missing README, and The Future of Data Engineering with Chris Riccomini

Datacast

English - March 14, 2023 16:00 - 2 hours - 116 MB - ★★★★★ - 4 ratings

Show Notes

(01:47) Chris reflected on his educational experience at Santa Clara University in the mid-2000s, where he also interned at NeoMagic and Intacct Corporation.
(07:31) Chris recalled valuable lessons from his first job as a software engineer at PayPal, researching new fraud prevention techniques.
(11:28) Chris shared the technical and operational challenges associated with his work at LinkedIn as a data scientist - scaling LinkedIn's Hadoop cluster, improving LinkedIn's "People You May Know" algorithm, and delivering the next generation of LinkedIn's "Who's Viewed My Profile" product.
(22:00) Chris provided criteria that his team relied on when choosing their big data solutions (which include Aster Data, Greenplum, and Hadoop).
(25:22) Chris gave advice to early-stage startups that want to start adopting best practices in observability and deployment.
(28:02) Chris expanded on his concept that models and microservices should be running on the same continuous delivery stack.
(30:52) Chris discussed his strategy to become a better interviewer - as he performed ~1,500 interviews at LinkedIn and WePay.
(37:39) Chris explained the motivation behind the creation of Apache Samza (LinkedIn's streaming system infrastructure built on top of Apache Kafka) and discussed its high-level design philosophy.
(46:19) Chris shared lessons learned from evangelizing Samza to the broader open-source community outside of LinkedIn.
(52:44) Chris talked about his decision to join the Data Infrastructure team at WePay as a principal software engineer after 7 years at LinkedIn.
(01:00:53) Chris shared the technical details behind the evolution of WePay's data infrastructure throughout his time there.
(01:12:40) Chris shared an insider perspective on the adoption of Apache Airflow from his experience as a Project Committee Member.
(01:20:15) Chris discussed the fundamental design principles that make Apache Kafka such a powerful technology.
(01:25:40) Chris reflected on his experience building out WePay's engineering team.
(01:27:14) Chris shared the story behind the writing journey of the "Missing README" - which he co-authored with Dmitriy Ryaboy.
(01:38:16) Chris revisited his predictions in a 2019 post called "The Future of Data Engineering" and discussed key trends such as real-time data warehouses, data mesh, and headless BI.
(01:44:27) Chris gave advice to a smart, driven engineer who wants to explore angel investing - given his experience as a strategic investor and advisor for startups in the data space since 2015.
(01:48:17) Chris shared advice on hiring engineers and navigating open-source product strategies for companies he invested in.
(01:53:57) Chris reflected on his consistency in adding value to the relationships he has formed over the years.
(01:58:00) Closing segment.

Chris's Contact Info

Website, Twitter, LinkedIn, GitHub, AngelList

Mentioned Content

Blog Posts

Joel Spolsky's Blog
Models and microservices should be running on the same continuous delivery stack (Oct 2018)
Using checksums to verify syncing 100M database records (Napkin Math, Jan 2021)
Datacast episode with Jeremiah Lowin, CEO of Prefect (March 2022)
Kafka CDC breaks database encapsulation (Nov 2018)
Kafka provides data portability and infrastructure agility (Jan 2019)
The Future of Data Engineering (July 2019)
Work For Two Companies (Nov 2021)

People

Will Larson
Maxime Beauchemin
Julia Evans
Gunnar Morling
Coda Hale

Books

Google's Site Reliability Engineering Books
"On Writing Well"
"The Missing README"
"Empire of Light: Tesla, Edison, Westinghouse, and the Race to Electrify the World"

Notes

My conversation with Chris was recorded back in May 2022. Earlier this year, Chris released Recap, a dead simple data catalog for engineers, written in Python. Recap makes it easy for engineers to build infrastructure and tools that need metadata. Check out his blog post and get started with Recap's documentation!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email [email protected].

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

Listen on Spotify, Listen on Apple Podcasts, Listen on Google Podcasts

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.