Show Notes

(1:56) Jim went over his education at Trinity College Dublin in the late 90s/early 2000s, where he got early exposure to academic research in distributed systems.
(4:26) Jim discussed his research on dynamic software architecture, particularly the K-Component model, which enables individual components to adapt to a changing environment.
(5:37) Jim explained his research on collaborative reinforcement learning, which enables groups of reinforcement learning agents to solve online optimization problems in dynamic systems.
(9:03) Jim recalled his time as a Senior Consultant for MySQL.
(9:52) Jim shared the initiatives at the RISE Research Institute of Sweden, where he has been a researcher since 2007.
(13:16) Jim dissected his peer-to-peer systems research at RISE, including theoretical results on search algorithms and walk topologies.
(15:30) Jim went over the challenges of building peer-to-peer live streaming systems at RISE, such as gradienTv and GLive.
(18:18) Jim provided an overview of research activities at the Division of Software and Computer Systems at the School of Electrical Engineering and Computer Science at KTH Royal Institute of Technology.
(19:04) Jim has taught courses on Distributed Systems and Deep Learning on Big Data at KTH Royal Institute of Technology.
(22:20) Jim unpacked his 2017 O’Reilly article “Distributed TensorFlow,” which includes the deep learning hierarchy of scale.
(29:47) Jim discussed the development of HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces its single-node in-memory metadata service with a distributed metadata service built on a NewSQL database.
(34:17) Jim explained the rationale for commercializing HopsFS and building Hopsworks, a user-friendly data science platform for Hops.
(36:56) Jim explored the relative benefits of public research money and VC funding.
(41:48) Jim unpacked the key ideas in his post “Feature Store: The Missing Data Layer in ML Pipelines.”
(47:31) Jim dissected the critical design decisions that enable the Hopsworks feature store to refactor a monolithic end-to-end ML pipeline into separate feature engineering and model training pipelines.
(52:49) Jim explained why data warehouses are insufficient for machine learning pipelines and why a feature store is needed instead.
(57:59) Jim discussed prioritizing the product roadmap for the Hopsworks platform.
(01:00:25) Jim hinted at what’s on the 2021 roadmap for Hopsworks.
(01:03:22) Jim recalled the challenges of getting early customers for Hopsworks.
(01:04:30) Jim reflected on the differences and similarities between being a professor and being a founder.
(01:07:00) Jim discussed worrying trends in the European tech ecosystem and the role that Logical Clocks will play in the long run.
(01:13:37) Closing segment.

Jim’s Contact Info

Logical Clocks
Twitter
LinkedIn
Google Scholar
Medium
ACM Profile
GitHub

Mentioned Content

Research Papers

“The K-Component Architecture Meta-Model for Self-Adaptive Software” (2001)
“Dynamic Software Evolution and The K-Component Model” (2001)
“Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing” (2005)
“Building Autonomic Systems Using Collaborative Reinforcement Learning” (2006)
“Improving ICE Service Selection in a P2P System using the Gradient Topology” (2007)
“gradienTv: Market-Based P2P Live Media Streaming on the Gradient Overlay” (2010)
“GLive: The Gradient Overlay as a Market Maker for Mesh-Based P2P Live Streaming” (2011)
“HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases” (2016)
“Scaling HDFS to More Than 1 Million Operations Per Second with HopsFS” (2017)
“Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata” (2017)
“Implicit Provenance for Machine Learning Artifacts” (2020)
“Time Travel and Provenance for Machine Learning Pipelines” (2020)
“Maggy: Scalable Asynchronous Parallel Hyperparameter Search” (2020)

Articles

“Distributed TensorFlow” (2017)
“Reflections on AWS’s S3 Architectural Flaws” (2017)
“Meet Michelangelo: Uber’s Machine Learning Platform” (2017)
“Feature Store: The Missing Data Layer in ML Pipelines” (2018)
“What Is Wrong With European Tech Companies?” (2019)
“ROI of Feature Stores” (2020)
“MLOps With A Feature Store” (2020)
“ML Engineer Guide: Feature Store vs. Data Warehouse” (2020)
“Unifying Single-Host and Distributed Machine Learning with Maggy” (2020)
“How We Secure Your Data With Hopsworks” (2020)
“One Function Is All You Need For ML Experiments” (2020)
“Hopsworks: World’s Only Cloud-Native Feature Store, now available on AWS and Azure” (2020)
“Hopsworks 2.0: The Next Generation Platform for Data-Intensive AI with a Feature Store” (2020)
“Hopsworks Feature Store API 2.0, a new paradigm” (2020)
“Swedish startup Logical Clocks takes a crack at scaling MySQL backend for live recommendations” (2021)

Projects

Apache Hudi (by Uber)
Delta Lake (by Databricks)
Apache Iceberg (by Netflix)
MLflow (by Databricks)
Apache Flink (by the Apache Software Foundation)

People

Leslie Lamport (The Father of Distributed Computing)
Jeff Dean (Creator of MapReduce and TensorFlow, Lead of Google AI)
Richard Sutton (The Father of Reinforcement Learning, author of “The Bitter Lesson”)

Programming Books

C++ programming books (by Scott Meyers)
“Effective Java” (by Joshua Bloch)
“Programming Erlang” (by Joe Armstrong)
“Concepts, Techniques, and Models of Computer Programming” (by Peter Van Roy and Seif Haridi)

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email [email protected].

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

Listen on Spotify
Listen on Apple Podcasts
Listen on Google Podcasts

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
