This tutorial is the first part of the Series “Building an Open Data Stack”, where we explore different use cases with open-source tools and open standards. We will implement an Airbyte monitoring dashboard with dbt and Metabase on a locally deployed Airbyte instance. It will essentially allow you to get an operational view of your current running syncs and a high-level overview of what happened in the past.
Today you have several options to either monitor via Airbyte UI by clicking through the syncs and logs or set up a more sophisticated system by channeling metrics events through Prometheus. Our dashboard approach is interesting, as you can see everything at one glance, add filters and drill-downs, or correlate with other data you might have.
ℹ️ All code shown here is open on GitHub in the Open Data Stack Repo. Feel free to fork and play around with it and create issues or PRs for improvements 🙂.
📝 All Metrics are implemented with dbt views; therefore, you can put any BI tool on top. We use Metabase, but I also tried Superset and Rill Data.
🚨This tutorial is not meant to use in production as we query the internal Postgres database of Airbyte. But it could be if you replicate Airbyte tables to a read-only schema or database. You could also copy the data to BigQuery/Snowflake and run dbt on that data. Or follow the event-based approach mentioned exposing these metrics.
Step 1: Set Up Airbyte
You probably use Airbyte already, and thus the interest in monitoring. If so, please jump to step 3, where we create the needed views and dashboard. If you have yet to use Airbyte and would like to follow along, continue with the next two chapters.
Airbyte logs its status in an internal Postgres database. We will directly access this internal database to see what has been stored, with a focus on the following six tables:
- <span class="text-style-code">actor</span>: Each record represents a configured connector. e.g. A Postgres connector configured to pull data from my database.
- <span class="text-style-code">actor_definition</span>: Each record represents a connector that Airbyte supports, e.g. Postgres. This table represents all the connectors that is supported by the current running platform.
- <span class="text-style-code">attempts</span>: Each record in this table represents an attempt.
- <span class="text-style-code">connection</span>: Each record in this table configures a connection (source_id, destination_id, and relevant configuration).
- <span class="text-style-code">jobs</span>: Each record in this table represents a job.
- <span class="text-style-code">workspace</span>: Each record represents a logical workspace for an Airbyte user. In the open-source version of the product, only one workspace is allowed.
For more information or the complete list, check out the docs.
Next, let’s clone the Airbyte repo and add one small thing to make the Postgres database of Airbyte available outside of docker to access with dbt and Metabase. The container is called <span class="text-style-code">airbyte-db</span>.
Now let’s edit <span class="text-style-code">docker-compose.yaml</span> and add these two lines below:
After we added the two lines above, we started up Airbyte as we would generally with:
If everything worked fine, Airbyte should be running on localhost:8000, and you should see with <span class="text-style-code">docker ps</span> that the port <span class="text-style-code">5435</span> is forwarded like this:
Step 2: Sync Data
If you do not have any source you'd like to sync, you can quickly add some demo connectors to check the monitoring dashboard with <span class="text-style-code">source-connector=Faker</span> where you can specify the number of rows with count.
As a destination, you can use a Postgres database. If you haven’t one running locally, you can start up a brand new one with:
See more details and step-by-step guides on Replicate Data Between PostgreSQL Databases
Make sure you create a connection with Airbyte (again, check the above tutorial when in doubt), and run some syncs so you have some historical runs that will show up in the dashboard.
Step 3: Create Transforming Views with dbt
After we have set up Airbyte and started some syncs to have some data to visualize, we can now create views on top of the Postgres tables that get updated on each Airbyte sync, including configs you define in the UI. These tables are JSON blobs, and it can be challenging to unnest to a tabular table format. That's what the dbt views are doing, plus adding some metrics to visualize.
📝 Performance: Everything is based on views that allow you to see everything in real-time. The disadvantage is that it will get slower the more data you query. If your dashboards get too slow, you have two options: filtering down to reduce the amount of data queried or toggling the <span class="text-style-code">materialize=view</span> to <span class="text-style-code">table</span>. Changing to table materialization would require a <span class="text-style-code">dbt run</span> each time you want new data, but the queries would be pre-aggregated and faster.
Suppose you followed along with the port forwarding from step 1 above in <span class="text-style-code">docker-compose.yaml</span>, we should be able to connect to the Airbyte Postgres database with the following connection (default password, if not changed, is "docker"):
And see the following tables of Airbyte:
Initial Setup of dbt
Next, we want to create the views on top of it to convert the JSON fields into queryable tables. For that, we need to install the dbt dependencies and then run <span class="text-style-code">dbt run</span> to create them.
Running and configure above profile automatically up by changing to directory <span class="text-style-code">transformation_dbt</span> and simply run (more details on dbt_readme):
This will essentially create the virtual env in a folder called <span class="text-style-code">.venv</span>, activate that virtual environment, set up the dbt_profile and run dbt to create the views needed.
Once we have set up the dependencies above, we can check the views on top of the Airbyte database which should now be created in schemas <span class="text-style-code">monitoring_core</span> and <span class="text-style-code">monitoring_mart</span>:
Running dbt only
If you followed the above steps and you didn’t make any changes to the dbt models, this step is not needed.
In case you would like to change your dbt models to add columns or certain measures, you can run dbt manually with <span class="text-style-code">dbt run</span> (make sure you activated the virtual environment with <span class="text-style-code">poetry shell</span>). If everything works out, you should see something like this:
Step 4: Open Up Metabase Dashboard
The next step is to start Metabase and open the predefined dashboard. This dashboard contains measures such as <span class="text-style-code">success_rate, attempt_duration, volume_rows</span> and <span class="text-style-code">volume_mb</span>.
To start up, you must download the metabase.jar here and place it into <span class="text-style-code">visualization/metabase/metabase.jar</span>. When done, run the following commands to start Metabase:
Metabase will run on localhost:3000. As you run from the existing Metabase database you might be asked to log in. Please use the username <span class="text-style-code">email@example.com</span> and password <span class="text-style-code">airbyte123</span> if so. If you want to know more details about Metabase, check out their docs.
The opened Airbyte Monitoring Dashboard should look something like this:
If you trigger a new run and refresh the Metabase dashboard, you'll see live data with one sync with status <span class="text-style-code">running</span>.
Step 5: Explore Alternative BI Dashboards
As we created views inside the Postgres database with measures defined, we can exchange the BI tool with others or integrate it with your notebook or data apps.
In the end, it was more complex than we needed to copy data out of Postgres (which defeats the purpose of live data but is still interesting to try). If you are interested, I added some descriptions with some example data to play with on the GitHub repo.
ℹ️ About Rill Data: Rill makes creating and consuming metrics easy by combining a SQL-based data modeler, real-time database, and metrics dashboard into a single product—a simple alternative to complex BI stacks.
I hope this tutorial is helpful to you with either lot of syncs to have a historical overview or an operational dashboard of what's currently going on. This tutorial is to get you started. It's not meant to be complete by any means. There are many angles to build on top of and customize your specific reporting needs.
One example is adding more detailed error messages about why a job failed if that is important. Furthermore, check the columns on the current views for more available data you can report; some more columns are there and can be added to Metabase or your BI tool of choice.
Finally, feel free to add new dbt views with a PR or comment on what to add; happy to chat on Slack or GitHub. As part of the series, the next step might be to add a Semantic Layer to define all Metrics once in a descriptive manner outside of Metabase.