How to Build an ETL Pipeline from PostgreSQL to Tableau
Data in its natural state is scattered across different locations and formats. Consolidating this diverse data in a centralized repository is an important step toward improving data accessibility. Most organizations rely on relational databases, like PostgreSQL, to store and manage data effectively. However, to extract meaningful insights, this data often needs to be transformed and visualized using tools like Tableau.
By integrating Postgres with Tableau, you can analyze trends and identify patterns, which will help you make data-driven decisions. That said, migrating data between these platforms can be a complicated process. It might require you to transform the data to make it compatible with the Tableau environment. Even minor errors can lead to time-consuming troubleshooting.
In this article, you will explore two simple methods to build an ETL—extract, transform, and load—pipeline from PostgreSQL to Tableau.
PostgreSQL: A Brief Overview
![PostgreSQL](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f3088de8a36216b9426e_AD_4nXeu5cVSk6bI4dv_zAzk1s1x4i6J6g40uOyAFqZy22_3d8dJWemJFx8H55Xy7C_jg1VwpkVjgfuPYLZoTq60oYH4DPbhdlhFoPPk-OlY-0NqtGmeS2Zi7Dglb6iZfGxPhwtJDPHU.png)
PostgreSQL, or Postgres, is an object-relational database management system (ORDBMS) that supports relational and non-relational data querying. Unlike traditional relational databases, it allows you to store and retrieve objects. Object-oriented features like table inheritance and user-defined data types offer flexibility and robustness for performing complex operations on your data.
Another feature that makes Postgres a well-known database is its support for JSON. It bridges the gap between relational databases and document-based data storage by allowing the management of semi-structured data.
PostgreSQL Key Features
- Support for Vector Embeddings: By employing the pgvector extension, you can use PostgreSQL as a vector database. This extension allows you to store, query, and index vector embeddings along with relational data.
- ACID Compliance: Postgres complies with ACID—atomicity, consistency, isolation, and durability—properties. These properties are essential to maintain data reliability and integrity during database transactions.
- Fault Tolerance: Features like Multi-Version Concurrency Control, write-ahead logging, and point-in-time recovery make PostgreSQL highly fault-tolerant. These features provide you with data recovery capabilities during system failures.
Tableau: A Brief Overview
![Tableau](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f30808ffb0b2af141688_AD_4nXc8N47zGvsA9sutUPz4SO3KHRnSQcRBEAn05qCeCeMfWKXtMzJQ88VtEGjvWSzMr6bPUg2djNAmtBwkLH31gaBtm74llUHgWkr4BkYisNz1is-p9ACH634ikh3Fd8kUTT0JSKH8NQ.png)
Tableau is business intelligence and analytics software that helps you turn data into intuitive visuals and dashboards. By representing data graphically, you make it easier to understand across your organization, so different teams can collaborate on strategies that improve business decision-making.
Similar to other data visualization tools, Tableau enables you to create different charts, dashboards, maps, and graphs to analyze trends. It offers an easy-to-use interface, which benefits users without a technical background. By applying data visualization best practices, such as choosing the right chart type, you can turn complex data into clear, simple stories.
Tableau Key Features
- Geospatial Analytics: Tableau offers a geospatial visualization feature, which enables you to plot data on maps based on their physical location. Along with this, you can use spatial joins to represent the relationship between data with different spatial definitions.
- LOD Expression: With Level of Detail—LOD—expressions, you can execute complex queries on multi-dimensional data. Instead of integrating the data on the Tableau interface, LOD allows you to perform operations at the data source level. To accomplish this, LOD expressions allow you to perform aggregations at several levels of granularity within the same visualization.
- Hyper: An SQL engine, Hyper, lets you perform real-time analytics, interactive exploration, and ETL transformations with Tableau Prep. It provides an in-memory data solution, which enables you to analyze large volumes of data quickly.
Why Migrate Data from PostgreSQL to Tableau?
- Data Analytics: Although PostgreSQL is an effective platform for data storage, it is not built for visual analytics. Migrating data from PostgreSQL to Tableau provides a straightforward way to analyze data with trend lines and statistical modeling.
- Data Decision-Making: Based on the visuals generated in Tableau, business professionals can develop marketing campaigns that target specific customer segments. This can aid in strengthening business performance.
- Improved Collaboration: Integrating PostgreSQL data into Tableau enables the generation of dashboards that can be shared across multiple teams within your organization. This helps bring professionals with varied levels of technical understanding onto the same page.
How Airbyte Facilitates Seamless ETL Pipeline Development
![Airbyte](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f30812ef4f22fc4acb32_AD_4nXcOAs7FY51hlGytqSdbxpv-ZxZPc9KW6xLq83A8CJoBf92cAPcgJuwtA7dGOCUfgc-6Vbhw0a6mc0waLka45WZXybVov9ExMK5hky1GAriikjAbjFp5rpQznN2acn56X2tUkqBKYw.png)
Airbyte is a data integration platform that enables you to move data from multiple sources to the destination of your preference. With over 550 pre-built data connectors, it encourages you to extract structured, semi-structured, and unstructured data from various locations. If the connector you seek is unavailable, Airbyte offers a Connector Builder and a suite of Connector Development Kits (CDKs) for creating custom connections.
Here are a few key features offered by Airbyte:
- Flexible Pipeline Development: Airbyte provides you with numerous options to build data pipelines, including UI, Terraform Provider, API, and PyAirbyte. You can choose any of these to develop data integration solutions based on your requirements.
- Change Data Capture (CDC): The CDC functionality allows you to identify incremental changes made to the source data and replicate them in the destination system. In this way, you can automatically keep track of updates and maintain data consistency.
- Support for Vector Stores: Airbyte supports popular vector databases, including Qdrant, Pinecone, and Chroma. By storing vector embeddings in these data stores, you can streamline the development of powerful AI applications.
- Automated RAG Techniques: Using automated chunking and embedding operations, you can transform raw data into vector embeddings. You can generate embeddings using Airbyte’s pre-built LLM providers, including OpenAI, Cohere, and Anthropic.
- Orchestrate Data Pipelines: To support data management throughout the data lifecycle, you can integrate Airbyte with prominent orchestration tools, such as Kestra, Dagster, and Prefect.
One feature that makes Airbyte a go-to choice for developing robust pipelines is PyAirbyte, its Python library. This library lets you use Airbyte connectors in a Python development environment. Utilizing PyAirbyte, you can extract data from different locations and load it into SQL caches, including Postgres, BigQuery, and Snowflake. These caches are compatible with Python libraries like Pandas and AI frameworks such as LangChain and LlamaIndex.
Here’s how you can use Airbyte to build an ETL pipeline from PostgreSQL to Tableau:
Method 1: Using PyAirbyte and Tableau Hyper API to Move Data from PostgreSQL to Tableau
In this approach, we will utilize PyAirbyte to extract data from different sources and load it into a .hyper file format. Converting the data into this format enables you to optimize performance for analytics and visualization.
Another reason to consider migrating data into a .hyper file is that it provides better memory management techniques. With this approach, you can store huge datasets in a compressed format that is easy to move to Tableau.
Finally, you can analyze the data file using the Tableau Desktop interface.
Before getting started with the steps, ensure you satisfy the prerequisites.
Prerequisites:
- A code editor or notebook environment, either installed locally or accessible on the web. This method uses Google Colab.
- Access to Postgres database credentials.
Step 1: Set Up the Environment
Now that all the prerequisites are satisfied, let’s create a virtual environment and install all the libraries required to perform this method.
To add a virtual environment in Google Colab, execute:
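A minimal Colab cell for this step might look like the following (it assumes the default Debian-based Colab runtime with Python 3.10; adjust the package name to your runtime's Python version):

```python
# Add virtual environment support in Google Colab
!apt-get install -qq python3.10-venv
```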
Install PyAirbyte:
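PyAirbyte is published on PyPI as the `airbyte` package:

```python
# Install PyAirbyte in the notebook environment
%pip install --quiet airbyte
```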
For effectively performing the transformations on the extracted data, install the JupySQL extension to enable the execution of SQL in the code editor:
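JupySQL is available on PyPI as `jupysql`:

```python
# Install JupySQL so SQL can be run directly from notebook cells
%pip install --quiet jupysql
```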
Load the JupySQL extension to work with SQL in the Python development environment.
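In a notebook, the extension is loaded with the `%load_ext` magic:

```python
# Load the JupySQL extension, which provides the %sql and %%sql magics
%load_ext sql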
To extract the data in a .hyper file format, install the Tableau Hyper API.
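The Hyper API is published on PyPI as `tableauhyperapi`:

```python
# Install the Tableau Hyper API for writing .hyper files
%pip install --quiet tableauhyperapi
```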
Step 2: Configuring Postgres for Cache Storage (Optional)
Before initiating the data extraction process, decide where the PyAirbyte cache should be stored. If you are not using Google Colab, or you already have a Postgres instance you would like to use for the cache, you can point PyAirbyte at it and skip this step. Otherwise, install and configure a local Postgres instance by executing the code below.
To install the PostgreSQL server:
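One way to do this in a Colab cell (assuming the default Ubuntu-based runtime):

```python
# Install and start a local PostgreSQL server inside the Colab VM
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start
```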
Set the password `postgres` for the username `postgres`:
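For example:

```python
# Set the password of the default "postgres" superuser
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
```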
Create a database named `pyairbyte_demo` to be used as the cache:
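For example:

```python
# Create a fresh database to hold the PyAirbyte cache tables
!sudo -u postgres psql -U postgres -c "DROP DATABASE IF EXISTS pyairbyte_demo;"
!sudo -u postgres psql -U postgres -c "CREATE DATABASE pyairbyte_demo;"
```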
Step 3: Extracting Data from a Source to Postgres Cache
Let’s import PyAirbyte and start the data extraction step.
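The library is imported under the name `airbyte`, commonly aliased to `ab`:

```python
import airbyte as ab
```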
You can check all the connectors PyAirbyte offers by executing the following code:
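One way to do this is with `get_available_connectors()`:

```python
# List the names of all connectors PyAirbyte can install
ab.get_available_connectors()
```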
Now that you have explored all the provided connectors, install the source from which you wish to extract the data. In the code below, replace `source-faker` with the name of your preferred data source.
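A sketch using the demo `source-faker` connector:

```python
# install_if_missing downloads the connector into a local virtual environment
source = ab.get_source("source-faker", install_if_missing=True)
```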
To configure this source, provide the access credentials in the code below.
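The configuration keys depend on the connector you chose; `source-faker` only needs the number of fake records to generate, while a real database source would take host, port, database, username, and password. The values below are illustrative:

```python
source.set_config(
    config={
        "count": 50_000,  # number of fake records to generate
        "seed": 123,      # makes the generated data reproducible
    }
)
```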
To verify the credentials, run the `.check()` method:
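```python
# Validates the configuration and connectivity to the source
source.check()
```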
The above code must result in a success message.
Step 4: Storing the Data in Postgres Cache
The next step is to store the data extracted from the source in a Postgres cache. To achieve this, import the `PostgresCache` class from PyAirbyte.
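The class lives in the `airbyte.caches` module:

```python
from airbyte.caches import PostgresCache
```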
Define a Postgres Cache and pass the necessary configuration credentials.
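A sketch that points the cache at the local Postgres instance configured in Step 2 (adjust the credentials if you are using your own server):

```python
cache = PostgresCache(
    host="localhost",
    port=5432,
    username="postgres",
    password="postgres",
    database="pyairbyte_demo",
)
```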
The code uses the `pyairbyte_demo` database created in Step 2.
Source data can have multiple associated data streams. You can select all of the source streams and read the data into the Postgres cache by running:
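For example:

```python
# Select every stream exposed by the source and sync it into the Postgres cache
source.select_all_streams()
read_result = source.read(cache=cache)
```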
This code stores the data in the Postgres cache, located in the `pyairbyte_demo` database.
Step 5: Data Transformation
Since the data is stored in a Postgres database, use the JupySQL extension to query the data. You can utilize SQL statements to perform basic analysis and produce insights.
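As a sketch, assuming the cache wrote a `users` table into its default schema (the schema and table names below are examples; check your cache settings and source streams):

```python
# Point JupySQL at the cache database, then run ad-hoc SQL against a cached table
%sql postgresql://postgres:postgres@localhost:5432/pyairbyte_demo
%sql SELECT COUNT(*) FROM airbyte_raw.users;
```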
Another popular way to transform data is to convert cached streams into a Pandas DataFrame with the `to_pandas()` method. This lets you clean the data by removing null values, bias, and unwanted redundancies. You can also select the subsets of the data that matter for specific business use cases.
The following steps assume the customer table data is stored in a Pandas DataFrame held in a variable named `df`.
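A minimal sketch, assuming the cached stream is named `users` (substitute your own stream name):

```python
# Load one cached stream into a Pandas DataFrame and do some basic cleanup
df = cache["users"].to_pandas()
df = df.dropna()  # drop rows containing null values
```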
Step 6: Creating a Tableau-Compatible Hyper File
Let’s use the Tableau Hyper API library to write the Pandas DataFrame to a Hyper file. For this step, import the `Path` class from the `pathlib` module to define the filesystem path where the file will be stored.
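```python
from pathlib import Path
```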
Import the necessary classes from the `tableauhyperapi` library.
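The classes used in the rest of this step are:

```python
from tableauhyperapi import (
    HyperProcess,
    Connection,
    Telemetry,
    CreateMode,
    TableDefinition,
    TableName,
    SqlType,
    Inserter,
)
```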
Define the path to save the Hyper file. For instance, to store the data in a file named `customer.hyper`, run the code below:
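For example (the variable name `hyper_file` is just a convention reused in the snippets below):

```python
hyper_file = Path("customer.hyper")
```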
You can now define optional process parameters. To limit the number of log files to 2 and file size to less than 100 megabytes, execute:
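These limits map to the Hyper process settings `log_file_max_count` and `log_file_size_limit`:

```python
process_parameters = {
    "log_file_max_count": "2",      # keep at most two log files
    "log_file_size_limit": "100M",  # cap each log file at 100 MB
}
```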
Now that the file path is defined, create a connection to the Hyper database engine and convert the Pandas DataFrame to a .hyper extension file.
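A sketch of this step is shown below. The table schema (an `Extract.customer` table with `id`, `name`, and `age` columns) is illustrative; adapt the column names and `SqlType`s to the columns actually present in your `df`:

```python
with HyperProcess(
    telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU,
    parameters=process_parameters,
) as hyper:
    with Connection(
        endpoint=hyper.endpoint,
        database=hyper_file,
        create_mode=CreateMode.CREATE_AND_REPLACE,
    ) as connection:
        # Define the target table schema (example columns; adjust to your data)
        customer_table = TableDefinition(
            table_name=TableName("Extract", "customer"),
            columns=[
                TableDefinition.Column("id", SqlType.text()),
                TableDefinition.Column("name", SqlType.text()),
                TableDefinition.Column("age", SqlType.int()),
            ],
        )
        connection.catalog.create_schema(schema=customer_table.table_name.schema_name)
        connection.catalog.create_table(table_definition=customer_table)

        # Insert the DataFrame rows into the Hyper table
        with Inserter(connection, customer_table) as inserter:
            inserter.add_rows(rows=df[["id", "name", "age"]].values.tolist())
            inserter.execute()

        # Verify the load by counting the rows that were written
        row_count = connection.execute_scalar_query(
            query=f"SELECT COUNT(*) FROM {customer_table.table_name}"
        )
        print(f"Rows written to {hyper_file}: {row_count}")
```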
The above code starts a `HyperProcess` and opens a `Connection` to the Hyper file. Using the connection, you define the table schema, specifying how you wish to save the data. After creating the table, you insert the data from the `df` variable into the Hyper file.
Finally, the result is verified using an SQL statement that counts the rows in the new table. The data is stored in the `customer.hyper` file within the same Google Colab virtual machine environment; download the file to your local machine if you want to preserve it.
To import the `customer.hyper` file into Tableau, open Tableau Desktop and locate the file from the Connect pane.
![Customer Hyper File](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f3077368a87ef75cc20f_AD_4nXdaAYxG_GxtX8OgbxE7TOI4LxIAoUJMjTu1pXK06sAGFhOA9YuEZ1XfDQAMIZ5jjYRIxMJFpnj5MlmyMfAUQ3d9iFRlJE7fD56NpZAb9G5aiN1fs98ofbTtEIZf2eonyp3BRz5chw.png)
Method 2: Using PostgreSQL ODBC to Move Data to Tableau
Another method is to use Tableau's Other Databases (ODBC) connector. With it, you can connect your Postgres data to Tableau Prep Builder, a data preparation solution. After transforming the data into an analysis-ready format, you can move it to Tableau Desktop.
Before getting started, you must ensure that you satisfy the following prerequisites.
Prerequisites:
- Install a PostgreSQL ODBC driver.
- Ensure that you have the latest version of Tableau Prep Builder.
- You must have the credentials to your Postgres database.
Here are the steps to create an ETL pipeline from PostgreSQL to Tableau:
Step 1: Set Up Data Source Name (DSN)
- Search for the ODBC Data Source Administrator on your local machine.
- In the ODBC Data Source Administrator window, select the System DSN tab.
![ODBC Data Source Administration](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f3073dbd75858836e8fc_AD_4nXe-9Q6Y9TBJsrdBcTKZvSHY251VueqEDl6mLiqkjQQkbazqNfLaL7NqI12r_vQgkGDVoWBVa3nzXS5aHfPGv_f5kldcw1YorzVfb9pTg_qsGzG4WgQqad9KBaoxlkMnDosCUJXiIQ.png)
- Click the Add… button and select the Postgres ODBC driver from the available options on the Create New Data Source dialog box.
![Create New Data Source Page](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f308fff621a54c05f1ae_AD_4nXcdkUzH6YG_9rleBrwi_Bky2DwKbLrGzaZ54QMPg1qx0fSM_MonXTlIVJF2ZLR5kBE0dldQbz4TgPAI5DGr6Iy5znV606i-uOaCfp7-SKQUDw58seTmmnxX3TsnLTMSA349qi4QFw.png)
- Click Finish.
- Enter all the configuration details to set up the PostgreSQL ODBC driver. Click on Test to verify the connectivity.
![PostgreSQL Unicode ODBC Driver Setup](https://cdn.prod.website-files.com/6064b31ff49a2d31e0493af1/67a9f307a5704cdd47bb4a14_AD_4nXfVSQA1ugz32fbKKk4znW9uF-I993BnIzB3n7Dhcn4pMRuDyYU3w_2v6ao9NgJ2aTyivdA4Vwad5ODPklfy6XQIL6pD_76-qeirYtJyXoLRFdXggcgFkfdCnyNfZx2vGPKdiRwuBg.png)
- Click Save.
Step 2: Connecting Postgres to Tableau Prep Builder
- Launch the Tableau Prep Builder and click on the Add connection button.
- Select Other Databases (ODBC) from the available connectors.
- In the Other Databases (ODBC) dialog box, select the DSN from the dropdown and enter your Username and Password.
- Click Sign In.
- Select the database from the Connections pane.
Step 3: Data Transformation
The data is now available in Tableau Prep Builder. However, it can contain inconsistencies, such as null values. In this step, use the features provided in the Prep Builder interface to transform and prepare the data so it is ready for analysis and visualization. Depending on your business requirements, either remove these inconsistencies or replace them with suitable values.
Step 4: Loading the Data to Tableau Desktop
After transforming the data, you can publish it on the Tableau Server, from where the Tableau Desktop can access it. With this data, you can create visualizations and publish the created dashboards on the Server to enhance accessibility across your organization.
Limitations of Using ODBC Driver for Postgres-Tableau ETL Pipeline
- Performance Issues: Creating ETL pipelines from PostgreSQL to Tableau using ODBC drivers is inefficient in terms of performance. Due to ineffective data retrieval mechanisms, the pipeline can encounter performance bottlenecks for large datasets.
- Maintenance Challenges: This process might require updating the ODBC drivers frequently to ensure compatibility with PostgreSQL and Tableau. In addition, DSN settings live on each local machine, so you must manually update and keep them in sync.
- Error Handling: Troubleshooting any issues encountered during this migration can be time-consuming. The error messages generated might be generic, requiring you to manually check the logs from PostgreSQL, the ODBC driver, and Tableau.
Why Choose PyAirbyte?
To overcome the limitations that come with the manual approach, you can rely on PyAirbyte.
- It offers pre-built connectors, eliminating the need to build the sources from scratch.
- PyAirbyte supports incremental data reading. With this feature, you can track and process only the newly added data without replicating the entire dataset.
- Using PyAirbyte, you get flexible data management capabilities, where you can select the specific stream of data depending on your use case.
Key Takeaways
In this article, you explored two methods for building an ETL pipeline that extracts data from Postgres and loads it into Tableau for further analysis. Although both methods let you move data between the platforms effectively, the second has some limitations. To overcome the challenges of the manual approach, you can use PyAirbyte, which offers pre-built connectors to various platforms and empowers data migration within your developer environment.