Now let's discuss the opportunities that the MDS delivers for enterprises.
Getting started: Existing solutions let even non-data engineers start getting insights within a few hours or days.
Data privacy & legal: With open code, you have more flexibility on where to run the stack, which lowers legal risk and makes it easier to comply with changing data regulations.
Modularity & extensibility: You can choose a dedicated, specialized tool for each use case without licensing costs.
Technological advantage: You are using the latest technology and gaining benefits over competitors who remain less data-driven.
We strongly recommend having highly technical people with DevOps experience set up the Modern Data Stack, especially at scale or in an enterprise where many people use it and it runs on something like Kubernetes.
Modularity: Supporting the Data Engineering Lifecycle
Whether you choose an enterprise data platform or the open-source route with the Modern Data Stack, your selection must cover the Data Engineering Lifecycle. Each block of the lifecycle needs to be implemented in one way or another to deliver insights that are valuable to the data consumers.
When not to use an MDS Stack
The advantage of modularity is also a disadvantage: failures can pop up at every integration with another tool. Always keep in mind this quote from Data Platforms: The Past:
Integrating good tools doesn't mean you'll get a good stack and expected results. It's still a complex thing.
On the other hand, if you choose wisely, you can end up much better than a custom monolith that lost its maintainer.
Costs are another consideration, and they can be hard to predict. When you run many independent tools, it's harder to forecast and balance costs than with a single-tool solution that does it for you.
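To make the forecasting problem concrete, here is a minimal sketch that sums per-tool cost ranges. The tool categories and all dollar figures are made up for illustration; they are assumptions, not measurements or vendor prices. The only point is that each tool's uncertainty adds to the total, so the spread you have to budget for grows with every tool.

```python
# Hypothetical monthly cost estimates as (low, high) in USD per tool.
# All numbers are invented for illustration only.
estimates = {
    "ingestion": (200, 600),
    "transformation": (100, 400),
    "orchestration": (150, 500),
    "analytics": (300, 900),
}


def total_range(per_tool: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high estimates across all tools."""
    low = sum(lo for lo, _ in per_tool.values())
    high = sum(hi for _, hi in per_tool.values())
    return low, high


def spread(per_tool: dict[str, tuple[int, int]]) -> int:
    """Width of the total forecast range (high minus low)."""
    low, high = total_range(per_tool)
    return high - low


if __name__ == "__main__":
    low, high = total_range(estimates)
    print(f"Total forecast: {low} to {high} USD per month")
    print(f"Spread to budget for: {spread(estimates)} USD")
```

With a single tool, the spread is just that tool's own range; with four independent tools, the ranges stack up, which is one reason the total is harder to pin down.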
The core of an MDS tool is not always built for enterprise scale; scale is often added as an afterthought. You need to prove the solution first before you scale. I'm confident that scale can be fixed, but it is something to keep in mind when testing an MDS tool.
Adopting MDS is slower than just plugging a credit card into an enterprise data platform and getting started. With such platforms, you do not need to consider deploying, upgrading, security, or other concerns. These things are hard. Therefore, you will need a dedicated data team and DevOps people.
In the end, it's always a tradeoff between hosting all your tools yourself, fully on-premise, and having everything managed as SaaS.
The Core Open (Modern) Data Stack
There are always more tools you can add to your Modern Data Stack! You can use the data engineering lifecycle as a reference for the building blocks to add. But in general, the core four you need are data ingestion, transformation, orchestration, and analytics.
Unfortunately, the fast pace will continue for a while. Sure, there are thousands of tools, and it's impossible to keep up with all of them, but you only need a few. Take Airbyte, dbt, a visualization tool, and Dagster, and you'll be up and running within hours, with a battle-tested, open-source, high-impact toolset at your fingertips. I illustrated these tools and how to set them up in Part I: The Open Data Stack Distilled into Four Core Tools. Check it out to get started with the analytics journey today.
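To show how the four core building blocks fit together, here is a minimal sketch in plain Python. It is not the API of Airbyte, dbt, Dagster, or any visualization tool: every function name and the sample records are hypothetical stand-ins, and a real stack would replace each function with the corresponding tool.

```python
# Hypothetical in-memory stand-ins for the four core building blocks.
# None of these functions belong to Airbyte, dbt, Dagster, or a BI tool.

Row = dict[str, str]


def ingest() -> list[Row]:
    """Ingestion: pull raw records from a source (hard-coded sample data)."""
    return [
        {"customer": "acme", "amount": "120.50"},
        {"customer": "acme", "amount": "79.50"},
        {"customer": "globex", "amount": "300.00"},
    ]


def transform(raw: list[Row]) -> dict[str, float]:
    """Transformation: cast amounts to numbers and total them per customer."""
    totals: dict[str, float] = {}
    for row in raw:
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals


def analyze(totals: dict[str, float]) -> list[tuple[str, float]]:
    """Analytics: rank customers by revenue, highest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def orchestrate() -> list[tuple[str, float]]:
    """Orchestration: run the steps in dependency order."""
    raw = ingest()
    totals = transform(raw)
    return analyze(totals)


if __name__ == "__main__":
    for customer, revenue in orchestrate():
        print(f"{customer}: {revenue:.2f}")
```

Keeping each block behind its own function mirrors the modularity argument: any one step can be swapped for a different tool without touching the others, as long as the data handed between steps keeps the same shape.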
Another consideration is naming. Even though the term Modern Data Stack hasn't spread much, there is already a dislike of the word "modern," which has no tangible value. Alternative names for "Modern Data Stack" that I have seen are new generation open-source data stack (ngods), DataStack 2.0, and the Boring Data Stack. While starting an open-source project, I found Open Data Stack to be the perfect nomenclature, emphasizing the value open-source tools provide. "Open" also speaks to open standards, which are desperately needed in data engineering.
No matter the name, the essence of open source, extensible, and free-to-use will not change.
📏 GitHub Project: Check out the open-data-stack example project we just started (we’ll be adding to it as part of this series).
🔮 The Next Step of the Modern Data Stack is Open Data Stack
Most of us live in a bubble with the latest trends of big tech data-driven companies (FANG). If you follow the space closely, you see indications that people are overwhelmed with too many tools in the MDS stack, which is no surprise as our industry is growing like no other.
Another sign is the DuckDB hype; people like its simplicity, which removes from the data stack instead of adding to it. Many use cases are possible, and everyone is looking to simplify things. In the end, simplification also means fewer moving parts and fewer errors.
At the same time, the Modern Data Stack is not going away, although it will most likely be renamed to something else. It's still evolving, as many large European enterprises have learned from their past with vendor lock-in and closed source and are ready to leave that behind.
Enterprises still have legacy code that will use old programming languages for a while, so they will adopt the Open Data Stack slowly over the next decade. Again, integrating good tools from the data stack doesn't mean you'll get a good stack per se. It helps to choose wisely. Ask yourself: "Do we have the capacity and knowledge to own and manage the data stack ourselves?"
If you want to try it yourself, follow along with our open-data-stack project on GitHub, where you will see the core Open Data Stack tools in action. If you have any questions, our community will be happy to help you. And if you want to discuss more, you can chat with 10k+ other data engineers on our Slack Community or sign up for new articles in our Newsletter.
Either way, we look forward to hearing from you!
Simon is a Data Engineer and Technical Author at Airbyte. He is dedicated, empathetic, and entrepreneurial with 15+ years of experience in the data ecosystem. He enjoys maintaining awareness of new innovative and emerging open-source technologies.