Imagine you are the first true hire on a company’s data team. You are the only one with experience in analytics and data modeling as well as data engineering. There are other engineers on the team but they are all software engineers with little data experience.
The data infrastructure at the company is in a bad state with many different data tools, long-running “data models” (if you can call them that), and no single source of truth. You’ve been brought onto the team to help untangle all of the mess and build a reliable data pipeline.
Because you’re a team of one, your only option is the modern data stack where you can set up a suite of tools to help you instantly scale how the company uses its data. Building a solution from scratch would require resources—time, money, and people—that the company doesn’t have.
This was my exact scenario a few years ago when I began working for a startup-like company. The modern data stack helped me thrive as an analytics engineer and build something reliable for the business.
What Is The Modern Data Stack?
The modern data stack is a collection of tools and cloud technologies that allow you to collect, ingest, transform, monitor, and orchestrate data.
You can think of the data warehouse as the central location of the stack, where all of your data is stored. A data ingestion tool loads data from your different sources into the warehouse, and a transformation tool transforms that data directly in the warehouse.
Modern data stacks also require an orchestration tool to help all of the different moving pieces work together. And, once you’ve gotten to a place where you can really prioritize data quality, you can add a monitoring platform to alert you to any data quality issues in your stack.
Together, these pieces and the tools behind them help a company scale quickly and effectively without requiring tons of money, time, and talent.
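To make the orchestration piece a little more concrete, here is a minimal sketch of what an orchestrator does at its core: run each step only after the steps it depends on have finished. The step names and the `run_pipeline` helper are hypothetical and only for illustration. Real orchestrators such as Prefect add scheduling, retries, and logging on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps that must finish first.
PIPELINE = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_models": {"ingest_orders", "ingest_customers"},
    "run_quality_checks": {"transform_models"},
}


def run_pipeline(dependencies, run_step):
    """Run every step after its upstream steps and return the execution order."""
    executed = []
    for step in TopologicalSorter(dependencies).static_order():
        run_step(step)
        executed.append(step)
    return executed


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda step: print(f"running {step}"))
```

Both ingestion steps run before the transformation, and the quality checks always run last.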
Problem #1: Needing to build a solution from scratch that solves a fairly common problem.
Before the idea of the modern data stack and the different tools that make it up, companies really had no choice but to build out products from scratch. If they wanted a way to ingest data, they most likely had to code and schedule some type of script that called the APIs of external sources. Now we have ingestion tools like Airbyte, which do this for you and package it up in an easy-to-use UI!
And this wasn’t just the case with data ingestion but also with data monitoring, data observability, and data orchestration. It’s interesting to look at large corporations that still tend to go the “build everything yourself” route.
I definitely noticed that when I worked for a 10,000+ employee company. While this may make sense when you have tons of data, capital, and employees, for most companies it doesn’t. For smaller-scale startups, it makes more sense to pay a company that built a top-notch product and maintains it well than to build the product themselves.
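For a sense of what those hand-rolled scripts looked like, here is a rough sketch under a few assumptions: the source API is paginated, `fetch_page` is a hypothetical function standing in for the HTTP call, and an in-memory SQLite database stands in for the warehouse. A real version would also need authentication, rate limiting, retries, schema changes, and a scheduler, which is exactly the work an ingestion tool takes off your plate.

```python
import json
import sqlite3


def ingest(fetch_page, conn, table="raw_events"):
    """Pull every page from a source API and land the raw records in a table.

    fetch_page(page_number) is a hypothetical callable that returns a list of
    dicts, or an empty list once there is no more data.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (id TEXT PRIMARY KEY, payload TEXT)"
    )
    page, loaded = 1, 0
    while True:
        records = fetch_page(page)
        if not records:
            break
        rows = [(str(r["id"]), json.dumps(r)) for r in records]
        # INSERT OR REPLACE keeps reruns idempotent: an existing id is
        # overwritten instead of duplicated.
        conn.executemany(
            f"INSERT OR REPLACE INTO {table} (id, payload) VALUES (?, ?)", rows
        )
        loaded += len(rows)
        page += 1
    conn.commit()
    return loaded


# Stand-in for a real HTTP call so the sketch runs without network access.
FAKE_API = {
    1: [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}],
    2: [{"id": 3, "amount": 7}],
}


def fake_fetch_page(page):
    return FAKE_API.get(page, [])


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    print(ingest(fake_fetch_page, warehouse), "records loaded")
```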
Problem #2: Setting up on-prem infrastructure to support your data needs.
One of the main benefits of the modern data stack is the fact that it lives in the cloud. In order to use these tools, you really just need an internet connection! Previously, companies had to set up robust, on-premises warehouses to store their data. When I interned at a large shipping company in college, I remember taking a tour through the room that stored all of our data. It was intense and required a lot of maintenance!
Now, we have cloud data warehouses like Snowflake and Databricks that take care of all of this for you. Better yet, if you’re a company without too much data, you can save costs by sharing cloud resources with other smaller companies. These platforms automatically do this for you, taking the hassle off your plate.
Before, it made more sense for smaller companies to go without any data than to pay for on-premises data infrastructure. Now, everyone can take advantage of the insights their own data gives them.
Problem #3: Having no version control.
Version control helps you keep track of all of the changes made to your code. Believe it or not, this is something that the modern data stack unlocks. Before modern data stack tools, there was often no way to collaborate with other engineers and keep track of the changes being merged, leading to possible production issues.
Modern data stack tools easily integrate with popular version control tools like GitHub, making collaboration and data governance a breeze. There are even tools like Datafold which use CI/CD to validate changes in your data itself before merging code. Tools like this help cut down on production issues and reduce downtime when an issue does occur.
Problem #4: Needing to optimize query performance when things become slow.
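The idea of validating the data itself can be sketched in a few lines. This is only an illustration of the concept, not Datafold's API: `diff_tables` is a hypothetical helper that compares the production version of a table with the version built from your branch, keyed on an assumed `id` column.

```python
def diff_tables(prod_rows, dev_rows, key="id"):
    """Compare two versions of a table and summarize row-level differences."""
    prod = {row[key]: row for row in prod_rows}
    dev = {row[key]: row for row in dev_rows}
    return {
        "added": sorted(dev.keys() - prod.keys()),
        "removed": sorted(prod.keys() - dev.keys()),
        "changed": sorted(
            k for k in prod.keys() & dev.keys() if prod[k] != dev[k]
        ),
    }


# Made-up sample rows: production versus the table built from a code change.
prod_rows = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 250},
    {"id": 3, "revenue": 80},
]
dev_rows = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 275},
    {"id": 4, "revenue": 60},
]

if __name__ == "__main__":
    print(diff_tables(prod_rows, dev_rows))
    # {'added': [4], 'removed': [3], 'changed': [2]}
```

A CI job could run a check like this on every pull request and block the merge when the diff contains changes nobody expected.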
With on-prem data warehouses, you were kinda screwed when a query took too long to run. You had to figure out ways to optimize it yourself and always ensure you were running the most performant queries.
With cloud data warehouses like Snowflake and Databricks, you can take advantage of built-in optimizers. These platforms optimize your queries behind the scenes, so they run efficiently without manual tuning. This also makes the warehouse approachable for users of all skill levels.
Instead of the indexes used in more traditional databases like Postgres, Snowflake relies on micro-partitions and clustering to speed up its queries. Snowflake also has something called query plan rewriting, where it rewrites your query to achieve better performance, saving you time and effort.
All of the major benefits of the modern data stack come down to saving time and money, a company’s two most precious resources. Funnily enough, all of the problems I listed above can be framed in terms of these two things as well. Modern data stacks optimize your queries and speed up the building process, which saves time. They increase data quality and cut down on the company resources you need, which saves money.
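The intuition behind micro-partitions can be shown with a toy model. This is a simplified sketch, not Snowflake's actual implementation: it assumes each partition keeps the minimum and maximum value of a column as metadata, so a query with a range filter can skip any partition whose range cannot contain a match. The `Partition` class and `scan_with_pruning` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    rows: list  # values of the filtered column stored in this partition
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Metadata is computed once, when the partition is written.
        self.min_value = min(self.rows)
        self.max_value = max(self.rows)


def scan_with_pruning(partitions, low, high):
    """Return matching values and how many partitions were actually read."""
    matches, scanned = [], 0
    for part in partitions:
        # Skip the partition when its min/max range cannot overlap the filter.
        if part.max_value < low or part.min_value > high:
            continue
        scanned += 1
        matches.extend(v for v in part.rows if low <= v <= high)
    return matches, scanned


# Data loaded roughly in order (well clustered), so ranges do not overlap.
partitions = [Partition([1, 2, 3]), Partition([4, 5, 6]), Partition([7, 8, 9])]

if __name__ == "__main__":
    print(scan_with_pruning(partitions, 4, 5))
    # ([4, 5], 1)
```

Only one of the three partitions is read here. If the same values were scattered randomly across partitions, every range would overlap the filter and nothing could be skipped, which is why clustering matters.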
Luckily, using a modern data stack, I was able to build a data pipeline from scratch in just 3 months. Queries that previously took over two days to run were running in a matter of hours if not minutes. Stakeholders finally had data that they could rely on because we had a single source of truth.
If it weren’t for the modern data stack, I have no idea what would have happened with the data infrastructure that existed. Tools like Airbyte, Prefect, Snowflake, and dbt allowed us to move fast while producing high-quality data, something that had never existed at that company before.