Episode 116: Distributed Databases, Open-Source Standards, and Streaming Data Lakehouse with Vinoth Chandar

Datacast

English - May 12, 2023 03:00 - 1 hour - 84.5 MB - ★★★★★ - 4 ratings


Show Notes

(01:58) Vinoth shared his college experience studying IT at the Madras Institute of Technology in Chennai, India.
(07:09) Vinoth reflected on his time at UT Austin, getting a Master's degree in Computer Science, where he did research on high-bandwidth content distribution and large-scale parallel processing with shell pipes.
(11:20) Vinoth recalled his two years as a software engineer at Oracle, working on their database replication engine, HPC, and stream processing.
(15:30) Vinoth walked through his transition to LinkedIn as a senior software engineer, working primarily on Voldemort, a key-value store that handles a big chunk of traffic on LinkedIn and serves thousands of requests per second over terabytes of data.
(24:41) Vinoth talked about his career transition to Uber in late 2014 as a founding engineer on Uber's data team and architect of Uber's data architecture.
(28:39) Vinoth reflected on the state of Uber's data infrastructure when he joined.
(34:31) Vinoth elaborated on Uber's case for incremental processing on Hadoop.
(38:53) Vinoth reviewed the initial design and implementation of Hudi across the Hadoop ecosystem at Uber in 2016.
(41:33) Vinoth shared the evolution of Hudi after it was initially open-sourced by Uber in 2017 and eventually incubated into the Apache Software Foundation in 2019.
(46:49) Vinoth explained how to keep the development of Apache Hudi vendor-neutral.
(49:36) Vinoth provided lessons learned about establishing standards for open-source data projects.
(53:45) Vinoth went over the valuable leadership lessons that he absorbed throughout his 4.5 years at Uber.
(57:17) Vinoth reflected on his 1.5 years as a principal engineer at Confluent working on ksqlDB, which makes it easy to create event streaming applications.
(01:02:16) Vinoth articulated the vision for Apache Hudi as a Streaming Data Lake platform.
(01:08:00) Vinoth highlighted the challenges with databases around indexing and concurrency control.
(01:11:37) Vinoth shared the unique challenges around prioritizing the Hudi roadmap and engaging an open-source community.
(01:16:32) Vinoth shared the founding story of Onehouse, a cloud-native, fully-managed lakehouse service built on Apache Hudi.
(01:22:02) Vinoth emphasized Onehouse's commitment to openness.
(01:24:36) Vinoth shared valuable hiring lessons to attract the right people who are excited about Onehouse's mission.
(01:26:40) Vinoth shared fundraising advice to founders who are seeking the right investors for their startups.
(01:28:24) Closing segment.

Vinoth's Contact Info

LinkedIn | Twitter

Onehouse's Resources

Website | Twitter | LinkedIn
About | Product | Blog | Careers

Apache Hudi's Resources

User Docs | Technical Wiki | Roadmap
GitHub | Twitter | Slack

Mentioned Content

Articles and Presentations

Voldemort: Prototype to Production (May 2014)
Uber's Case for Incremental Processing on Hadoop (Aug 2016)
Hoodie: An Open Source Incremental Processing Framework From Uber (2017)
The Past, Present, and Future of Efficient Data Lake Architectures (2021)
Highly Available, Fault-Tolerant Pull Queries in ksqlDB (May 2020)
Apache Hudi - The Data Lake Platform (July 2021)
Introducing Onehouse (Feb 2022)
Automagic Data Lake Infrastructure (Feb 2022)
Onehouse Commitment to Openness (Feb 2022)

People

Leslie Lamport, Jeff Dean, Michael Stonebraker

Book

Zero To One (by Peter Thiel)

Notes

My conversation with Vinoth was recorded back in August 2022. The Onehouse team has had some announcements in 2023 that I recommend looking at:

The Launch Announcement of Onetable
The $25M Series A Funding Announcement
Onehouse Availability in AWS Marketplace
Onehouse Product Demo on building a data lake for GitHub analytics at scale
Walmart's recent study on different open-source data lakehouse formats
This discussion around the Hudi 1.x vision

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email [email protected].

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

Listen on Spotify | Listen on Apple Podcasts | Listen on Google Podcasts

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
