Episode 111: Astrophysics, Visualization Recommendation, and Scalable Data Science with Doris Lee

Datacast

English - March 24, 2023 05:30 - 54 minutes - 50.2 MB - ★★★★★ - 4 ratings


Show Notes

(01:30) Doris walked through her time doing research in physics and astrophysics at UC Berkeley and getting involved with data science.
(04:11) Doris reflected on her decision to pursue the Ph.D. program in computer science at the University of Illinois, Urbana-Champaign.
(05:53) Doris discussed her development of no-code, interactive visualization interfaces accelerating users toward data insight discovery.
(10:37) Doris explained how the RISE Lab and I School at UC Berkeley helped shape her thinking around working with end-users and building something to serve the data science community.
(16:05) Doris unpacked the focus of her Ph.D. dissertation, which is to make data exploration and visualization easier and more accessible through automation.
(17:27) Doris shared the motivation and high-level design behind the development of Lux, a general-purpose visual exploration assistant situated within a computational notebook.
(21:25) Doris revealed the recipe for open-source community engagement and roadmap prioritization with Lux.
(26:17) Doris shared the founding story of Ponder, whose mission is to improve data science productivity by empowering users to do data science at all scales.
(31:02) Doris explained how Ponder helps solve the fragmentation challenges across the data stack.
(34:27) Doris provided a brief overview of Modin, which improves the scalability of data frames.
(38:41) Doris discussed Ponder's go-to-market strategy to drive more enterprise interest toward the product.
(41:23) Doris discussed her team's challenges in finding early design partners across various industries.
(44:16) Doris shared valuable hiring lessons to attract the right people who are excited about Ponder's mission.
(47:42) Doris shared fundraising advice to founders who are seeking the right investors for their startups.
(49:33) Doris highlighted the difference between being a researcher and a founder.
(51:06) Closing segment.

Doris' Contact Info

Website | Twitter | LinkedIn | GitHub

Ponder's Resources

Website | Twitter | LinkedIn | Slack
Modin | Lux
Events

Mentioned Content

Publications

The Case for a Visual Discovery Assistant: A Holistic Solution for Accelerating Visual Data Exploration (IEEE Data Bulletin 2018)
Understanding Sense-making in Visual Query Systems (IEEE Visual Analytics Science and Tech 2019)
Deconstructing Categorization in Visualization Recommendation: A Taxonomy and Comparative Study (IEEE Transactions on Visualization and Computer Graphics 2021)
Lux: Always-On Visualization Recommendation for Exploratory Data Science (Dec 2021)

Blog Posts

Insight Machines: The Past, Present, and Future of Visualization Recommendation (Multiple Views, Feb 2020)
Announcing Ponder (March 2022)
How we parallelized 600+ pandas functions with Modin (March 2022)
Using Lux to visualize your pandas dataframes with zero effort (March 2022)
Ph.D. Alum Doris Lee Wants to Democratize Data Science Tools (March 2022)

People

Chip Huyen
Shreya Shankar
Parul Pandey

Notes

My conversation with Doris was recorded back in May 2022. Earlier this year, Ponder developed a first-of-its-kind technology that lets anyone run pandas code directly in their data warehouse, be it Snowflake, BigQuery, or Redshift. With Ponder, you get the same pandas-native experience that you love, but with the power and scalability of cloud-native data warehouses. More details are in this blog post.

You can also run NumPy commands on your data warehouse. This means you can work with the NumPy API to build data and ML pipelines, and let Snowflake / BigQuery / Redshift take care of scaling, security, and compliance. More details are in this blog post.
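
To make the "same pandas code, different backend" idea a bit more concrete, here is a minimal sketch. It runs on plain pandas; the only thing Modin documents changing is the import line (`import modin.pandas as pd`). Ponder's own warehouse setup is not shown because these notes do not describe it. The function name and the toy data are hypothetical, made up purely for illustration.

```python
# Plain pandas is used here so the sketch runs anywhere.
# Assumption: with Modin installed, this line could be swapped for
# `import modin.pandas as pd` and the code below would stay the same.
import pandas as pd


def average_rating_by_topic(frame):
    """Return the mean rating per topic, sorted from highest to lowest."""
    return (
        frame.groupby("topic")["rating"]
        .mean()
        .sort_values(ascending=False)
    )


# Hypothetical toy data, not taken from the episode.
episodes = pd.DataFrame(
    {
        "topic": ["visualization", "dataframes", "visualization", "dataframes"],
        "rating": [5.0, 4.0, 4.0, 3.0],
    }
)

summary = average_rating_by_topic(episodes)
print(summary)
```

The point of the sketch is that the analysis logic only touches the dataframe API, so the execution backend can, in principle, change without rewriting the pipeline.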

If you are interested in trying these new capabilities out, sign up here!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email [email protected].

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

Listen on Spotify | Listen on Apple Podcasts | Listen on Google Podcasts

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.