Lars dove into data pipelines, and emerged bearing arrows and wishing for a lot fewer copies.

What is there to think about regarding data pipelines, and what makes them interesting?

Which tools are out there, and why might you want to use them?

Why all this talk about making fewer copies of data?

What does Lars' current ideal pipeline look like, and where does Elixir fit in?

Links

Matt Topol
Apache Arrow
Large language models
Vector search
BigQuery
sed
AWK
jq
Replacing Hadoop with bash - "Command-line Tools can be 235x Faster than your Hadoop Cluster"
Hadoop
MapReduce
Unix pipes
Directed acyclic graph
tee - to "materialize inbetween states"
Apache Beam
Apache Spark
Apache Flink
Apache Pulsar
Airbyte - shoves data between systems using connectors
Cronjob
Fivetran - Airbyte competitor
Apache Airflow
ETL - Extract, transform, load
Designing data-intensive applications
Stream processing
Ephemerality
Data lake
Data warehouse
The people's front of Judea
DBT - SQL-SQL batch-work-thingy
SQL with Jinja templates
Snowflake - data warehouse thing
Scala
Broadway
Oban - "robust job processing for Elixir"
Dashbit
pandas - Python data library
APL
Arrow flight
GRPC
DataFusion - query execution engine
Polars - "DataFrames in Rust"
Explorer - built on top of Polars
Voltron data
The Composable Codex
Pyarrow - Arrow bindings for Python
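
As a rough illustration of the "fewer copies" theme from the episode, here is a small Python sketch using two of the libraries linked above, pyarrow and Polars. The table contents are invented for the example; the point is only that Arrow's columnar format lets different tools share the same buffers instead of re-serializing the data.

```python
# A minimal sketch of the "fewer copies" idea with Apache Arrow.
# Assumes pyarrow and polars are installed; the data is made up.
import pyarrow as pa
import polars as pl

# Build a table in Arrow's columnar, language-agnostic memory format.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "event": ["click", "view", "click", "view"],
})

# Polars can wrap the same Arrow buffers; for most column types this
# conversion is zero-copy, so the data is not duplicated in memory.
df = pl.from_arrow(table)
print(df.filter(pl.col("event") == "click"))
```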

Quotes

I've been reading a lot about data pipelines
What's so special about data pipelines?
There's a lot of special tooling
There's a lot of bad, bad tooling
Less than optimal tooling
Converging on something bigger
He got me eventually
All of your steps in one bucket
What tools do you associate with data?
I inherited a data pipeline
BashReduce
Iterate on the L and the T
The modern data stack
And then you demand more work
No unnecessary copies
Barely a copy
Reconnecting with my Python roots
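
For a sense of what "iterate on the L and the T" can look like without heavy tooling, in the spirit of the Unix-pipes and "replacing Hadoop with bash" links above, here is a toy Python ETL sketch. The file names and record shape are invented for the example; each stage is a generator, so rows stream through much like a shell pipeline rather than being copied wholesale between steps.

```python
# Toy extract-transform-load pipeline built from generators.
# Hypothetical input "events.csv" with columns user_id,event.
import csv
import json

def extract(path):
    # E: read rows lazily from a CSV file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # T: keep only "click" events and normalize the user id.
    for row in rows:
        if row["event"] == "click":
            yield {"user_id": int(row["user_id"]), "event": row["event"]}

def load(rows, path):
    # L: write the result out as JSON lines.
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    load(transform(extract("events.csv")), "clicks.jsonl")
```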