dbt is a command line interface (CLI) tool that needs to be installed first. Choose your preferred installation method. To initialize an empty project, run: <span class="text-style-code">dbt init my-open-data-stack-project</span>.
Next, you can organize your SQL statements into macros and models: macros are reusable SQL snippets extended with Jinja templating, and models are the physical elements you want materialized in your destination, defined as either a table or a view (see image below; you can specify this in <span class="text-style-code">dbt_project.yml</span>).
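To make the macro-and-model split concrete, here is a hedged sketch of how a Jinja-style macro expands inside a model before the SQL runs. The macro name <span class="text-style-code">cents_to_dollars</span>, the model, and the column names are hypothetical, and the plain string substitution below only emulates what dbt's Jinja renderer does:

```python
# Hypothetical dbt-style macro: converts a cents column to dollars.
# In a real project this would live in macros/cents_to_dollars.sql as:
#   {% macro cents_to_dollars(column) %} ({{ column }} / 100.0) {% endmacro %}
def cents_to_dollars(column: str) -> str:
    """Return the SQL fragment the macro would expand to."""
    return f"({column} / 100.0)"

# Hypothetical model file models/orders.sql, with the macro call inlined.
model_template = "select order_id, {amount_expr} as amount_usd from raw_orders"

# Emulate rendering: dbt performs this expansion with Jinja at compile time,
# then materializes the result as a table or view per dbt_project.yml.
rendered_sql = model_template.format(amount_expr=cents_to_dollars("amount_cents"))
print(rendered_sql)
```

The point is that macros keep transformation logic in one place, while each model stays a plain `select` statement that dbt knows how to materialize.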
💻 You can find the above-illustrated project with different components (e.g., macros, models, profiles…) at our open-data-stack project under transformation_dbt on GitHub.
Analytics and Data Visualization (SQL) with Metabase
When data is extracted and transformed, it's time to visualize it and get value from all your hard work. Visualization is the job of analytics and business intelligence (BI) tools. The BI tool might be the most crucial tool for data engineers, as its visualizations are what everyone sees, and has an opinion on!
Analytics is the systematic computational analysis of data and statistics. It is used to discover, interpret, and communicate meaningful patterns in data. It also entails applying data patterns toward effective decision-making.
ℹ️ In this category, I like pretty much all the available tools. If you implement strong data engineering fundamentals and data modeling, you are free to choose your BI tool or notebook and build your data app on top. It's amazing how many BI tools get built almost daily, with Rill Data being an interesting one to look out for.
Why Metabase?
Out of the many choices available, I chose Metabase for its simplicity and ease of setup for non-engineers.
Metabase lets you ask questions about your data and displays answers in formats that make sense, whether a bar chart or a detailed table. You can save your questions and group questions into friendly dashboards. Metabase also simplifies sharing dashboards across teams and enables self-serving to a certain extent.
How to get started with Metabase
To start, download the metabase.jar file here. When done, simply run:
<pre><code>java -jar metabase.jar</code></pre>
Now you can start connecting your data sources and creating dashboards.
Data Orchestration (Python) with Dagster
The last core data stack tool is the orchestrator. A data orchestrator models dependencies between tasks in complex, heterogeneous cloud environments end-to-end, and it integrates with the above-mentioned open data stack tools. Orchestrators are especially effective if you have glue code that needs to run on a certain cadence or be triggered by an event, or if you run an ML model on top of your data.
Another crucial part of orchestration is applying Functional Data Engineering. The functional approach brings clarity by favoring "pure" functions and removing side effects. Pure functions can be written, tested, reasoned about, and debugged in isolation, without understanding the external context or history of events surrounding their execution. As data pipelines quickly grow in complexity and data teams grow in number, using methodologies that provide clarity isn't a luxury; it's a necessity.
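As a hedged sketch of what "pure" and idempotent mean in practice, the function below transforms one partition of data deterministically, with no database connections, no global state, and no mutation of its input; the record shape and field names are invented for illustration:

```python
def deduplicate_partition(rows: list[dict]) -> list[dict]:
    """Pure transform for one partition: same input always yields same output.

    No side effects and no external state, so the function can be tested
    and reasoned about in isolation, independent of when or where it runs.
    """
    seen = set()
    result = []
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        if row["id"] not in seen:
            seen.add(row["id"])
            result.append(dict(row))  # copy, never mutate the caller's data
    return result

# A hypothetical daily partition with a duplicate record.
partition = [
    {"id": 1, "updated_at": "2023-01-01", "value": 10},
    {"id": 1, "updated_at": "2023-01-01", "value": 10},  # duplicate
    {"id": 2, "updated_at": "2023-01-02", "value": 20},
]

# Idempotent: applying the transform to its own output changes nothing,
# so a backfill or retry of this partition is always safe.
once = deduplicate_partition(partition)
assert deduplicate_partition(once) == once
```

Because the transform is scoped to a single partition and is idempotent, re-running a failed task or backfilling a date range never corrupts downstream data.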
Why Dagster?
Dagster is a framework that forces me to write functional Python code. Like dbt, it enforces best practices such as writing declarative, abstracted, idempotent, and type-checked functions to catch errors early. Dagster also includes simple unit testing and handy features to make pipelines solid, testable, and maintainable. It also has a deep integration with Airbyte, allowing data integration as code. Read more on the latest data orchestration trends.
How to get started with Dagster
To get started easily, you can scaffold an example project, <span class="text-style-code">assets_modern_data_stack</span>, which includes a data pipeline with Airbyte, dbt, and some ML code in Python.
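To illustrate the idea behind Dagster's software-defined assets without requiring a Dagster installation, here is a hedged, stand-in sketch: the <span class="text-style-code">asset</span> decorator below is a toy re-implementation used only to show how declaring upstream assets as function parameters builds a dependency graph (the real API is <span class="text-style-code">from dagster import asset</span>, and the repo data is invented):

```python
# Toy stand-in for Dagster's @asset decorator: each asset is a function
# whose parameter names refer to its upstream assets, forming a graph.
_ASSETS = {}

def asset(fn):
    _ASSETS[fn.__name__] = fn
    return fn

def materialize(name):
    """Resolve upstream assets recursively, then compute this asset."""
    fn = _ASSETS[name]
    deps = fn.__code__.co_varnames[: fn.__code__.co_argcount]
    return fn(*[materialize(dep) for dep in deps])

@asset
def raw_repos():
    # In the example project this data would come from an Airbyte sync.
    return [{"repo": "airbyte", "stars": 12000}, {"repo": "dagster", "stars": 8000}]

@asset
def popular_repos(raw_repos):
    # Downstream asset: depends on raw_repos, much like a dbt model on a source.
    return [r for r in raw_repos if r["stars"] > 10000]

print(materialize("popular_repos"))
```

The real Dagster scheduler does far more (typing, retries, partitions, UI), but the core mental model is the same: assets declare what they depend on, and the orchestrator works out execution order.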
The tools I've mentioned so far represent what I would call the core of the open data stack if you want to work with data end to end. The beauty of the data stack is that you can now add specific use cases with other tools and frameworks. I’m adding some here for inspiration:
So far, we've reviewed the difference between the modern data stack and the open data stack. We've discussed its superpowers and why you'd want to use it. We also covered the core open-source tools that make up the available data stack.
To see these four core tools in action, read our tutorial on Configure Airbyte Connections with Python (Dagster), which scrapes GitHub repositories and integrates them with Airbyte, creates SQL views with dbt, orchestrates with Dagster, and visualizes a dashboard in Metabase.
We didn't discuss enterprise data platforms or so-called no-code solutions. The next blog post discusses The Struggle of Enterprise Adoption with the open data stack, focusing in particular on mid- and large-sized enterprises that want to adopt the new data stack.
As always, if you want to discuss more on the topic of Open Data Stack, you can chat with 10k+ other data engineers or me on our Community Slack. Follow Open Data Stack Projects open on GitHub, or sign up for new articles with our Newsletter.
Simon is a Data Engineer and Technical Author at Airbyte. He is dedicated, empathetic, and entrepreneurial, with 15+ years of experience in the data ecosystem. He enjoys keeping up with innovative and emerging open-source technologies.