Software Daily

Architects of Intelligence with Martin Ford Holiday Repeat

June 15, 2020 09:00 - 1000 Bytes

Originally published January 31, 2019 Artificial intelligence is reshaping every aspect of our lives, from transportation to agriculture to dating. Someday, we may even create a superintelligence–a computer system that is demonstrably smarter than humans. But there is widespread disagreement on how soon we could build a superintelligence. There is not even a broad consensus on how we can define the term “intelligence”. Information technology is improving so rapidly we are losing the abilit...

Cruise Simulation with Tom Boyd

June 12, 2020 09:00 - 1000 Bytes

Cruise is an autonomous car company with a development cycle that is highly dependent on testing its cars–both in the wild and in simulation. The testing cycle typically requires cars to drive around gathering data, and that data to subsequently be integrated into a simulated system called Matrix. With COVID-19, the ability to run tests in the wild has been severely dampened. Cruise cannot put so many cars on the road, and thus has had to shift much of its testing procedures to rely more he...

Grafana with Torkel Ödegaard

June 11, 2020 09:00 - 1000 Bytes

Grafana is an open source visualization and monitoring tool that is used for creating dashboards and charting time series data. Grafana is used by thousands of companies to monitor their infrastructure. It is a popular component in monitoring stacks, and is often used together with Prometheus, ElasticSearch, MySQL, and other data sources. The engineering complexities around building Grafana involve the large number of integrations, the highly configurable ReactJS frontend, and the ability t...

Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

June 10, 2020 09:00 - 1000 Bytes

Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow’s creation, it has powered the data infrastructure at companies like Airbnb, Netflix, and Lyft. It has also been at the center of Astronomer, a st...

Human in the Loop Data Analytics with Aditya Parameswaran

June 09, 2020 09:00 - 1000 Bytes

The life cycle of data management includes data cleaning, extraction, integration, analysis and exploration, and machine learning models. It would be great if all of this data management could be handled with automation, but unfortunately that is not an option. For most applications, data management requires a human in the loop. A human in the loop might be responsible for working in a spreadsheet, or labeling data as a mechanical turk, or creating an algorithm for data labeling in Snorkel....

Tilt: Kubernetes Tooling with Dan Bentley

June 08, 2020 09:00 - 1000 Bytes

Kubernetes continues to mature as a platform for infrastructure management. At this point, many companies have well-developed workflows and deployment patterns for working with applications built on Kubernetes. The complexity of some of these deployments may be daunting, and when a new employee joins a company, that employee needs to get quickly onboarded with the custom dev environment. Environment management is not the only issue with Kubernetes development. When a service gets updated, ...

Uber’s Data Visualization Tools with Ib Green

June 05, 2020 09:00 - 1000 Bytes

Uber needs to visualize data on a range of different surfaces. A smartphone user sees cars moving around on a map as they wait for their ride to arrive. Data scientists and operations researchers within Uber study the renderings of traffic moving throughout a city. Data visualization is core to Uber, and the company has developed a stack of technologies around visualization in order to build appealing, highly functional applications. DeckGL is a library for high-performance visualizations o...

Prisma: Modern Database Tooling with Johannes Schickling

June 04, 2020 09:00 - 1000 Bytes

A frontend developer issuing a query to a backend server typically requires the developer to issue that query through an ORM or a raw database query. Prisma is an alternative to both of these data access patterns, allowing for easier database access through auto-generated, type-safe query building tailored to an existing database schema. By integrating with Prisma, the developer gets a database client that has query autocompletion, and an API server with less boilerplate code. Prisma also h...

Tecton: Machine Learning Platform from Uber with Kevin Stumpf

June 03, 2020 09:00 - 1000 Bytes

Machine learning workflows have had a problem for a long time: taking a model from the prototyping step and putting it into production is not an easy task. A data scientist who is developing a model is often working with different tools, or a smaller data set, or different hardware than the environment which that model will be deployed to. This problem existed at Uber just as it does at many other companies. Models were difficult to release, iterations were complicated, and collaboration be...

HoloClean: Data Quality Management with Theodoros Rekatsinas

June 02, 2020 09:00 - 1000 Bytes

Many data sources produce new data points at a very high rate. With so much data, the issue of data quality emerges. Low quality data can degrade the accuracy of machine learning models that are built around those data sources. Ideally, we would have completely clean data sources, but that’s not very realistic. One alternative is a data cleaning system, which can allow us to clean up the data after it has already been generated. HoloClean is a statistical inference engine that can impute, c...

Disaggregated Servers with Yiying Zhang

June 01, 2020 09:00 - 1000 Bytes

Server infrastructure traditionally consists of monolithic servers containing all of the necessary hardware to run a computer. These different hardware components are located next to each other, and do not need to communicate over a network boundary to connect the CPU and memory. LegoOS is a model for disaggregated, network-attached hardware. LegoOS disseminates the traditional operating system functionalities into loosely-coupled hardware and software components. By disaggregating data cen...

Kubernetes vs. Serverless with Matt Ward

May 29, 2020 09:00 - 1000 Bytes

Kubernetes has become a highly usable platform for deploying and managing distributed systems. The user experience for Kubernetes is great, but is still not as simple as a full-on serverless implementation–at least, that has been a long-held assumption. Why would you manage your own infrastructure, even if it is Kubernetes? Why not use autoscaling Lambda functions and other infrastructure-as-a-service products? Matt Ward is a listener of the show and an engineer at Mux, a company that mak...

Distributed Systems Research with Peter Alvaro

May 28, 2020 09:00 - 1000 Bytes

Every software company is a distributed system, and distributed systems fail in unexpected ways. This ever-present tendency for systems to fail has led to the rise of failure testing, otherwise known as chaos engineering. Chaos engineering involves the deliberate failure of subsystems within an overall system to ensure that the system itself can be resilient to these kinds of unexpected failures. Peter Alvaro is a distributed systems researcher who has published papers on a range of subje...

Brex Engineering with Cosmin Nicolaescu

May 27, 2020 09:00 - 1000 Bytes

Brex is a credit card company that provides credit to startups, mostly companies which have raised money. Brex processes millions of transactions, and uses the data from those transactions to assess creditworthiness, prevent fraud, and surface insights for the users of their cards. Brex is full of interesting engineering problems. The high volume of transactions requires data infrastructure to support all those transactions coming through the platform. As a credit card company, Brex needs t...

Edge Machine Learning with Zach Shelby

May 26, 2020 09:00 - 1000 Bytes

Devices on the edge are becoming more useful with improvements in the machine learning ecosystem. TensorFlow Lite allows machine learning models to run on microcontrollers and other devices with only kilobytes of memory. Microcontrollers are very low-cost, tiny computational devices. They are cheap, and they are everywhere. The low-energy embedded systems community and the machine learning community have come together with a collaborative effort called tinyML. tinyML represents the improvem...

RedwoodJS with Tom Preston-Werner

May 22, 2020 09:00 - 1000 Bytes

Over the last 5 years, web development has matured considerably. React has become a standard for frontend component development. GraphQL has seen massive growth in adoption as a data fetching middleware layer. The hosting platforms have expanded beyond AWS and Heroku, to newer environments like Netlify and Vercel. These changes are collectively known as the JAMStack. With the changes brought by the JAMStack, it raises the question: how should an app be built today? Can a framework offer gui...

ArcGIS: Geographic Information Software with Max Payson

May 21, 2020 09:00 - 1000 Bytes

Geospatial analytics tools are used to render visualizations for a vast array of applications. Data sources such as satellites and cellular data can gather location data, and that data can be superimposed over a map. A map-based visualization can allow the end user to make decisions based on what they see. ArcGIS is one of the most widely used geospatial analytics platforms. It is created by ESRI, the Environmental Systems Research Institute, which was started in 1969. Today, ESRI products ...

RudderStack: Open Source Customer Data Infrastructure with Soumyadeb Mitra

May 20, 2020 09:00 - 1000 Bytes

Customer data infrastructure is a type of tool for saving analytics and information about your customers. The company that is best known in this category is Segment, a very popular API company. This customer data is used for making all kinds of decisions around product roadmap, pricing, and design. RudderStack is a company built around open source customer data infrastructure. RudderStack can be self-hosted, allowing users to deploy it to their own servers and manage their data however they...

Matterport 3-D Imaging with Japjit Tulsi

May 19, 2020 09:00 - 1000 Bytes

Matterport is a company that builds 3-D imaging for the inside of buildings, construction sites, and other locations that require a “digital twin.” Generating digital images of the insides of buildings has a broad spectrum of applications, and there are considerable engineering challenges in building such a system. Matterport’s hardware stack involves a camera built in-house by the company. The camera can take 360 degree scans of a room, stitch the imagery together, and make the digital twi...

Frontend Performance with Anycart’s Rafael Sanches

May 18, 2020 09:00 - 1000 Bytes

There are many bad recipe web sites. Every time I navigate to a recipe website, it feels like my browser is filling up with spyware. The page loads slowly, everything seems broken, I can feel the 25 different JavaScript adtech tags interrupting each other. Whether I am searching for banana bread or a spaghetti sauce recipe, recipe sites usually make me lose my appetite. Anycart is a recipe platform that allows users to buy all of the ingredients for the recipe and have those ingredients del...

May 16, 2020 09:00 - 1000 Bytes

For the last five months, we have been working on a new version of Software Daily, the platform we built to host and present our content. We are creating a platform that integrates the podcast with a set of other features that make it easier to learn from the audio interviews. Software Daily includes the following features: The world of software is large, and growing bigger every day. Software Daily is a place to explore this world of software companies and projects. If the podcast is ...

AWS Virtualization with Anthony Liguori

May 15, 2020 09:00 - 1000 Bytes

Amazon’s virtual server instances have come a long way since the early days of EC2. There are now a wide variety of available configuration options for spinning up an EC2 instance, which can be chosen from based on the workload that will be scheduled onto a virtual machine. There are also Fargate containers and AWS Lambda functions, creating even more options for someone who wants to deploy virtualized infrastructure. The high demand for virtual machines has led to Amazon moving down the st...

International Consumer Credit Infrastructure with Brian Regan and Misha Esipov

May 14, 2020 09:00 - 1000 Bytes

A credit score is a rating that allows someone to qualify for a line of credit, which could be a loan such as a mortgage, or a credit card. We are assigned a credit score based on a credit history, which could be related to work history, rental payments, or loan repayments. One problem with the credit scoring system is that it is not internationalized. If I am coming from Brazil, I have a rental history of someone from Brazil. That information does not get naturally ported over to the Unit...

Grapl: Graph-Based Detection and Response with Colin O’Brien

May 13, 2020 09:00 - 1000 Bytes

A large software company such as Dropbox is at a constant risk of security breaches. These security breaches can take the form of social engineering attacks, network breaches, and other malicious adversarial behavior. This behavior can be surfaced by analyzing collections of log data. Log-based threat response is not a new technique. But how should those logs be analyzed? Grapl is a system for modeling log data as a graph, and analyzing that graph for threats based on how nodes in the graph...

Static Analysis for Infrastructure with Guy Eisenkot

May 12, 2020 09:00 - 1000 Bytes

Infrastructure-as-code tools are used to define the architecture of software systems. Common infrastructure-as-code tools include Terraform and AWS CloudFormation. When infrastructure is defined as code, we can use static analysis tools to analyze that code for configuration mistakes, just as we could analyze a programming language with traditional static analysis tools. When a developer writes a program, that developer might use static analysis to parse a program for common mistakes–memor...

Social Distancing Data with Ryan Fox Squire

May 11, 2020 09:00 - 1000 Bytes

Social distancing has been imposed across the United States. We are running an experiment unlike anything before it in history, and it is likely to have a lasting impact on human behavior. By looking at location data of how people are moving around today, we can examine the real-world impacts of social distancing. SafeGraph is a company that provides geospatial location data to be used by developers and researchers. Much of their data is aggregated from cell phone GPS pings which identify w...

Dropbox Engineering with Andrew Fong

May 08, 2020 09:00 - 1000 Bytes

Dropbox is a consumer storage product with petabytes of data. Dropbox was originally started on the cloud, backed by S3. Once there was a high enough volume of data, Dropbox created its own data centers, designing hardware for the express purpose of storing user files. Over the last 13 years, Dropbox’s infrastructure has developed hardware, software, networking, data center infrastructure, and operational procedures that make the cloud storage product best in class. Andrew Fong has been a...

Pravega: Storage for Streams with Flavio Junquiera

May 07, 2020 09:00 - 1000 Bytes

“Data stream” is a word that can be used in multiple ways. A stream can refer to data in motion or data at rest. When a stream is data in motion, an endpoint is receiving new pieces of data on a continual basis. Each new data point is sent over the wire and captured by the other end. Another way a stream can be represented is as a sequence of events that have been written to a storage medium. This is a stream at rest. Pravega is a system for storing large streams of data. Pravega can be u...

Advanced Redis with Alvin Richards

May 06, 2020 09:00 - 1000 Bytes

Redis is an in-memory object storage system that is commonly used as a cache for web applications. This core primitive of in-memory object storage has created a larger ecosystem encompassing a broad set of tools. Redis is also used for creating objects such as queues, streams, and probabilistic data structures. Machine learning systems also need access to fast, in-memory object storage. RedisAI is a newer module for supporting machine learning tasks. For serverless computing, RedisGears all...

Multicloud MySQL with Jiten Vaidya and Anthony Yeh

May 05, 2020 09:00 - 1000 Bytes

For many applications, a transactional MySQL database is the source of truth. To make a MySQL database scale, some developers deploy their database using Vitess, a sharding system built on top of Kubernetes. Jiten Vaidya and Anthony Yeh work at PlanetScale, a company that focuses on building and supporting MySQL databases sharded with Vitess. Their experience comes from working at YouTube, which has a massive, rapidly growing database for storing the information about videos on the site. S...

Isolation with Courtland Allen and Anurag Goel

May 04, 2020 09:00 - 1000 Bytes

We are all living in social isolation due to the quarantine from COVID-19. Isolation is changing our habits and our moods, ravaging the economy, and changing how we work. One positive change is that more people have been reconnecting with their friends and family over frequent calls and video chats. Isolation is not a normal way for humans to live. We are social animals, and we need social interaction. We’ve changed how we use Internet products. There has been an evolution of trends in onli...

Data Lakehouse with Michael Armbrust

May 01, 2020 09:00 - 1000 Bytes

A data warehouse is a system for performing fast queries on large amounts of data. A data lake is a system for storing high volumes of data in a format that is slow to access. A typical workflow for a data engineer is to pull data sets from this slow data lake storage into the data warehouse for faster querying. Apache Spark is a system for fast processing of data across distributed datasets. Spark is not thought of as a data warehouse technology, but it can be used to fulfill some of the r...

JAMStack Content Management with Scott Gallant, Jordan Patterson, and Nolan Phillips

April 30, 2020 09:00 - 1000 Bytes

A content management system (CMS) defines how the content on a website is arranged and presented. The most widely used CMS is WordPress, the open source tool that is written in PHP. A large percentage of the web consists of WordPress sites, and WordPress has a huge ecosystem of plugins and templates. Despite the success of WordPress, the JAMStack represents the future of web development. JAM stands for JavaScript, APIs, and Markup. In contrast to the monolithic WordPress deployments, a JAM...

Prefect Dataflow Scheduler with Jeremiah Lowin

April 29, 2020 09:00 - 1000 Bytes

A data workflow scheduler is a tool used for connecting multiple systems together in order to build pipelines for processing data. A data pipeline might include a Hadoop task for ETL, a Spark task for stream processing, and a TensorFlow task to train a machine learning model. The workflow scheduler manages the tasks in that data pipeline and the logical flow between them. Airflow is a popular data workflow scheduler that was originally created at Airbnb. Since then, the project has been ad...

CockroachDB with Peter Mattis

April 28, 2020 09:00 - 1000 Bytes

A relational database often holds critical operational data for a company, including user names and financial information. Since this data is so important, a relational database must be architected to avoid data loss. Relational databases need to be a distributed system in order to provide the fault tolerance necessary for production use cases. If a database node goes down, the database must be able to recover smoothly without data loss, and this requires having all of the data in the datab...

Dask: Scalable Python with Matthew Rocklin

April 27, 2020 09:00 - 1000 Bytes

Python is the most widely used language for data science, and there are several libraries that are commonly used by Python data scientists including Numpy, Pandas, and scikit-learn. These libraries improve the user experience of a Python data scientist by giving them access to high level APIs. Data science is often performed over huge datasets, and the data structures that are instantiated with those datasets need to be spread across multiple machines. To manage large distributed datasets, ...

Rasa: Conversational AI with Tom Bocklisch

April 24, 2020 09:00 - 1000 Bytes

Chatbots became widely popular around 2016 with the growth of chat platforms like Slack and voice interfaces such as Amazon Alexa. As chatbots came into use, so did the infrastructure that enabled chatbots. NLP APIs and complete chatbot frameworks came out to make it easier for people to build chatbots. The first suite of chatbot frameworks were largely built around rule-based state machine systems. These systems work well for a narrow set of use cases, but fall over when it comes to chatbo...

Cloudburst: Stateful Functions-as-a-Service with Vikram Sreekanti

April 23, 2020 09:00 - 1000 Bytes

Serverless computing is a way of designing applications that do not directly address or deploy application code to servers. Serverless applications are composed of stateless functions-as-a-service and stateful data storage systems such as Redis or DynamoDB. Serverless applications allow for scaling up and down the entire architecture, because each component is naturally scalable. And this pattern can be used to create a wide variety of applications. The functions-as-a-service can handle th...

NGINX API Management with Kevin Jones

April 22, 2020 09:00 - 1000 Bytes

NGINX is a web server that can be used to manage the APIs across an organization. Managing these APIs involves deciding on the routing and load balancing across the servers which host them. If the traffic of a website suddenly spikes, the website needs to spin up new replica servers and update the API gateway to route traffic to those new replicas. Some servers should not be accessible to outside traffic, and policy management is used to configure the security policies of different APIs. An...

Frontend Monitoring with Matt Arbesfeld

April 21, 2020 09:00 - 1000 Bytes

Web development has historically had more work being done on the server than on the client. The observability tooling has reflected this emphasis on the backend. Monitoring tools for log management and backend metrics have existed for decades, helping developers debug their server infrastructure. Today, web frontends have more work to do. Detailed components in frameworks such as React and Angular might respond quickly without waiting for a network request, with their mutations being proces...

Zoom Vulnerabilities with Patrick Wardle

April 20, 2020 09:00 - 1000 Bytes

Zoom video chat has become an indispensable part of our lives. In a crowded market of video conferencing apps, Zoom managed to build a product that performs better than the competition, scaling with high quality to hundreds of meeting participants, and millions of concurrent users. Zoom’s rapid growth in user adoption came from its focus on user experience and video call quality. This focus on product quality came at some cost to security quality. As our entire digital world has moved onto ...

Facebook OpenStreetMap Engineering with Saurav Mapatra and Jacob Wasserman

April 17, 2020 09:00 - 1000 Bytes

Facebook applications use maps for showing users where to go. These maps can display businesses, roads, and event locations. Understanding the geographical world is also important for performing search queries that take into account a user’s location. For all of these different purposes, Facebook needs up-to-date, reliable mapping data. OpenStreetMap is an open system for accessing mapping data. Anyone can use OpenStreetMap to add maps to their application. The data in OpenStreetMap is crow...

NGINX Service Mesh with Alan Murphy

April 16, 2020 09:00 - 1000 Bytes

NGINX is a web server that is used as a load balancer, an API gateway, a reverse proxy, and other purposes. Core application servers such as Ruby on Rails are often supported by NGINX, which handles routing the user requests between the different application server instances. This model of routing and load balancing between different application instances has matured over the last ten years due to an increase in the number of servers, and an increase in the variety of services. A pattern...

Shopify React Native with Farhan Thawar

April 15, 2020 09:00 - 1000 Bytes

Shopify is a platform for selling products and building a business. It is a large e-commerce company with hundreds of engineers and several different mobile apps. Shopify’s engineering culture is willing to adopt new technologies aggressively, trying new tools that might provide significant leverage to the organization. React Native is one of those technologies. React Native can be used to make cross-platform mobile development easier by allowing code reuse between Android and iOS. React Na...

Ceph Storage System with Sage Weil

April 14, 2020 09:00 - 1000 Bytes

Ceph is a storage system that can be used for provisioning object storage, block storage, and file storage. These storage primitives can be used as the underlying medium for databases, queueing systems, and bucket storage. Ceph is used in circumstances where the developer may not want to use public cloud resources like Amazon S3. As an example, consider telecom infrastructure. Telecom companies that have their own data centers need software layers which make it simpler for the operators and...

Collaborative SQL with Rahil Sondhi

April 13, 2020 09:00 - 1000 Bytes

Data analysts need to collaborate with each other in the same way that software engineers do. They also need a high quality development environment. These data analysts are not working with programming languages like Java and Python, so they are not using an IDE such as Eclipse. Data analysts predominantly use SQL, and the tooling for a data analyst to work with SQL is often a SQL explorer tool that lacks the kind of collaborative experience that we would expect in the age of Slack and Git...

Reserved Instances with Aran Khanna

April 10, 2020 09:00 - 1000 Bytes

When a developer spins up a virtual machine on AWS, that virtual machine could be purchased using one of several types of cost structures. These cost structures include on-demand instances, spot instances, and reserved instances. On-demand instances are often the most expensive, because the developer gets reliable VM infrastructure without committing to long-term pricing. Spot instances are cheap, spare compute capacity with lower reliability, that is available across AWS infrastructure. Re...

Snorkel: Training Dataset Management with Braden Hancock

April 09, 2020 09:00 - 1000 Bytes

Machine learning models require the use of training data, and that data needs to be labeled. Today, we have high quality data infrastructure tools such as TensorFlow, but we don’t have large high quality data sets. For many applications, the state of the art is to manually label training examples and feed them into the training process. Snorkel is a system for scaling the creation of labeled training data. In Snorkel, human subject matter experts create labeling functions, and these functio...

Cadence: Uber’s Workflow Engine with Maxim Fateev

April 08, 2020 09:00 - 1000 Bytes

A workflow is an application that involves more than just a simple request/response communication. For example, consider a session of a user taking a ride in an Uber. The user initiates the ride, and the ride might last for an hour. At the end of the ride, the user is charged for the ride and sent a transactional email. Throughout this entire ride, there are many different services and database tables being accessed across the Uber infrastructure. The transactions across this infrastructure...

kSQLDB: Kafka Streaming Interface with Michael Drogalis

April 07, 2020 09:00 - 1000 Bytes

Kafka is a distributed stream processing system that is commonly used for storing large volumes of append-only event data. Kafka has been open source for almost a decade, and as the project has matured, it has been used for new kinds of applications. Kafka’s pubsub interface for writing and reading topics is not ideal for all of these applications, which has led to the creation of ksqlDB, a database system built for streaming applications that uses Kafka as the underlying infrastructure fo...

Software Daily

Episodes

Twitter Mentions