Lars dove into data pipelines, and emerged bearing arrows and wishing for a lot fewer copies.

What is there to think about regarding data pipelines, and what makes them interesting?

Which tools are out there, and why might you want to use them?

Why all this talk about making fewer copies of data?

What does Lars' current ideal pipeline look like, and where does Elixir fit in?

Links

Matt Topol
Apache Arrow
Large language models
Vector search
BigQuery
sed
AWK
jq
Replacing Hadoop with bash - "Command-line Tools can be 235x Faster than your Hadoop Cluster"
Hadoop
MapReduce
Unix pipes
Directed acyclic graph
tee - to "materialize inbetween states"
Apache Beam
Apache Spark
Apache Flink
Apache Pulsar
Airbyte - shoves data between systems using connectors
Cronjob
Fivetran - Airbyte competitor
Apache Airflow
ETL - Extract, transform, load
Designing data-intensive applications
Stream processing
Ephemerality
Data lake
Data warehouse
The people's front of Judea
DBT - SQL-SQL batch-work-thingy
SQL with Jinja templates
Snowflake - data warehouse thing
Scala
Broadway
Oban - "robust job processing for Elixir"
Dashbit
pandas - Python data library
APL
Arrow flight
GRPC
DataFusion - query execution engine
Polars - "DataFrames in Rust"
Explorer - built on top of Polars
Voltron data
The Composable Codex
Pyarrow - Arrow bindings for Python
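
As a rough illustration of the "fewer copies" theme from the episode, here is a small Python sketch using two of the libraries linked above, pyarrow and Polars. The table contents are invented for the example; the point is only that Arrow's columnar format lets different tools share the same buffers instead of re-serializing the data.

```python
# A minimal sketch of the "fewer copies" idea with Apache Arrow.
# Assumes pyarrow and polars are installed; the data is made up.
import pyarrow as pa
import polars as pl

# Build a table in Arrow's columnar, language-agnostic memory format.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "event": ["click", "view", "click", "view"],
})

# Polars can wrap the same Arrow buffers; for most column types this
# conversion is zero-copy, so the data is not duplicated in memory.
df = pl.from_arrow(table)
print(df.filter(pl.col("event") == "click"))
```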

Quotes

I've been reading a lot about data pipelines
What's so special about data pipelines?
There's a lot of special tooling
There's a lot of bad, bad tooling
Less than optimal tooling
Converging on something bigger
He got me eventually
All of your steps in one bucket
What tools do you associate with data?
I inherited a data pipeline
BashReduce
Iterate on the L and the T
The modern data stack
And then you demand more work
No unnecessary copies
Barely a copy
Reconnecting with my Python roots
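
For a sense of what "iterate on the L and the T" can look like without heavy tooling, in the spirit of the Unix-pipes and "replacing Hadoop with bash" links above, here is a toy Python ETL sketch. The file names and record shape are invented for the example; each stage is a generator, so rows stream through much like a shell pipeline rather than being copied wholesale between steps.

```python
# Toy extract-transform-load pipeline built from generators.
# Hypothetical input "events.csv" with columns user_id,event.
import csv
import json

def extract(path):
    # E: read rows lazily from a CSV file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # T: keep only "click" events and normalize the user id.
    for row in rows:
        if row["event"] == "click":
            yield {"user_id": int(row["user_id"]), "event": row["event"]}

def load(rows, path):
    # L: write the result out as JSON lines.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    load(transform(extract("events.csv")), "clicks.jsonl")
```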