Photo by Matthias Speicher on Unsplash
This tutorial is the first part of the Series “Building an Open Data Stack”, where we explore different use cases with open-source tools and open standards. We will implement an Airbyte monitoring dashboard with dbt and Metabase on a locally deployed Airbyte instance. It will essentially allow you to get an operational view of your current running syncs and a high-level overview of what happened in the past.
Today you have several options to either monitor via Airbyte UI by clicking through the syncs and logs or set up a more sophisticated system by channeling metrics events through Prometheus. Our dashboard approach is interesting, as you can see everything at one glance, add filters and drill-downs, or correlate with other data you might have.
ℹ️ All code shown here is open on GitHub in the Open Data Stack Repo . Feel free to fork and play around with it and create issues or PRs for improvements 🙂.
📝 All Metrics are implemented with dbt views; therefore, you can put any BI tool on top. We use Metabase, but I also tried Superset and Rill Data.
🚨This tutorial is not meant to use in production as we query the internal Postgres database of Airbyte . But it could be if you replicate Airbyte tables to a read-only schema or database. You could also copy the data to BigQuery/Snowflake and run dbt on that data. Or follow the event-based approach mentioned exposing these metrics.
Step 1: Set Up Airbyte You probably use Airbyte already, and thus the interest in monitoring. If so, please jump to step 3, where we create the needed views and dashboard. If you have yet to use Airbyte and would like to follow along, continue with the next two chapters.
Airbyte logs its status in an internal Postgres database. We will directly access this internal database to see what has been stored, with a focus on the following six tables:
actor : Each record represents a configured connector. e.g. A Postgres connector configured to pull data from my database.actor_definition : Each record represents a connector that Airbyte supports, e.g. Postgres. This table represents all the connectors that is supported by the current running platform.attempts : Each record in this table represents an attempt.connection : Each record in this table configures a connection (source_id, destination_id, and relevant configuration).jobs : Each record in this table represents a job.workspace : Each record represents a logical workspace for an Airbyte user. In the open-source version of the product, only one workspace is allowed.For more information or the complete list, check out the docs .
Next, let’s clone the Airbyte repo and add one small thing to make the Postgres database of Airbyte available outside of docker to access with dbt and Metabase. The container is called airbyte-db .
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
Now let’s edit docker-compose.yaml and add these two lines below:
How to expose ports on 5435 outside of docker After we added the two lines above, we started up Airbyte as we would generally with:
If everything worked fine, Airbyte should be running on localhost:8000, and you should see with docker ps that the port 5435 is forwarded like this:
Exposed port via docker container Step 2: Sync Data If you do not have any source you'd like to sync, you can quickly add some demo connectors to check the monitoring dashboard with source-connector=Faker where you can specify the number of rows with count .
Example source connector to generate monitoring data As a destination, you can use a Postgres database. If you haven’t one running locally, you can start up a brand new one with:
docker run --rm --name airbyte-destination -e POSTGRES_PASSWORD=password -p 3000:5432 -d postgres
See more details and step-by-step guides on Replicate Data Between PostgreSQL Databases
Make sure you create a connection with Airbyte (again, check the above tutorial when in doubt), and run some syncs so you have some historical runs that will show up in the dashboard.
Step 5: Explore Alternative BI Dashboards As we created views inside the Postgres database with measures defined, we can exchange the BI tool with others or integrate it with your notebook or data apps.
Rill Data Rill Data is a new BI tool focused on local data powered by Sveltekit and DuckDB . They take a cmd-line first approach, simply making it enjoyable to up a BI dashboard with one line.
In the end, it was more complex than we needed to copy data out of Postgres (which defeats the purpose of live data but is still interesting to try). If you are interested, I added some descriptions with some example data to play with on the GitHub repo.
Exploring Rill Data with exported job data ℹ️ About Rill Data: Rill makes creating and consuming metrics easy by combining a SQL-based data modeler, real-time database, and metrics dashboard into a single product—a simple alternative to complex BI stacks.
What’s next? I hope this tutorial is helpful to you with either lot of syncs to have a historical overview or an operational dashboard of what's currently going on. This tutorial is to get you started. It's not meant to be complete by any means. There are many angles to build on top of and customize your specific reporting needs.
One example is adding more detailed error messages about why a job failed if that is important. Furthermore, check the columns on the current views for more available data you can report; some more columns are there and can be added to Metabase or your BI tool of choice.
Finally, feel free to add new dbt views with a PR or comment on what to add; happy to chat on Slack or GitHub . As part of the series, the next step might be to add a Semantic Layer to define all Metrics once in a descriptive manner outside of Metabase.