Your organization likely collects large amounts of data in various systems such as databases, CRM systems, application servers, and so on. Accessing and analyzing data that is spread across multiple systems can be a challenge. To address this challenge, data integration can be used to create a unified view of your organization's data.
If you agree that your organization could benefit from having a unified view of all of its data, here are some questions you will need to ask yourself to come up with a data integration strategy:
To help you answer these questions, in this article I cover the key data integration concepts that you will need to understand. I start out with a definition of data integration, including an overview of various data integration methods that can be used. This is followed by a discussion of the benefits of data integration into a unified destination versus alternative types of data integration. I then compare the two most popular data integration techniques: ETL (extract, transform, load) versus ELT (extract, load, transform). Next, I review various storage technologies that can be used for centralizing data. Finally, I assess the main tools for data integration.
At a high level, data integration is the process of combining data from disparate source systems into a single unified view. This can be accomplished via manual integration, data virtualization, application integration, or by moving data from multiple sources into a unified destination. These data integration methods are discussed below.
Before implementing a systematic approach to data integration, organizations may initially make use of manual integration when trying to make sense of data that is spread across multiple systems. This involves analysts manually logging into source systems, analyzing and/or exporting data on these systems, and creating reports based on their findings.
Manual integration as a data integration strategy has several disadvantages. In addition to being time-consuming, analysts require access to multiple operational systems which creates security risks. Furthermore, analysts may run expensive analytics operations on systems that are not optimized for such workloads, which may interfere with the functioning of these systems. Finally, data in the source systems may frequently change which means that manually generated reports will quickly become outdated.
Organizations may also consider adopting a data virtualization solution to integrate their data. In this type of data integration, data from multiple sources is left in place and is accessed via a virtualization layer so that it appears as a single data store. This virtualization layer makes use of adapters that translate queries executed on the virtualization layer into a format that each connected source system can execute. The virtualization layer then combines the responses from these source systems into a single result. This data integration strategy is sometimes used when a BI tool like Tableau needs to access data from multiple data sources.
One disadvantage of data virtualization is that analytics workloads are executed on operational systems, which could interfere with their functioning. Another disadvantage is that the virtualization layer may act as a bottleneck on the performance of analytics operations.
Another alternative data integration solution is to directly link multiple applications to each other and move data directly between them. This is known as application integration, and linking can be done via point-to-point communications, via a middleware layer such as an enterprise service bus (ESB), or through an application integration tool.
Application integration may result in many copies of the same data across multiple source systems, which may increase cost, and may cause a large amount of point-to-point traffic between various systems. Furthermore, as with the previous data integration types, executing analytics workloads directly on operational systems could interfere with their functioning.
Sending data from across an enterprise into a centralized system such as a database, a data warehouse, a data lake, or a data lakehouse results in a single unified location for accessing and analyzing all the information that is flowing through an organization. At Airbyte we are advocates of this data integration methodology, and the next section of this article is dedicated to discussing its benefits in more detail.
Below is a high-level representation of data replication from multiple sources into Google BigQuery.
There are several benefits that come from data integration into a unified destination, including providing a single source of truth, leveraging technology that is designed for analyzing big data, transforming data in a single location, improving security, and reducing operational risks. These data integration benefits are discussed in more detail below.
Data that has been moved from source systems into a central location can be easily and efficiently accessed by business intelligence or data analytics applications. This data can also be used to display a single pane of glass that presents a complete picture of relevant information from across an entire organization. This helps organizations to quickly identify opportunities, make smarter decisions, identify hidden patterns, and understand customer behavior. For example, you can gather and combine all your marketing data (Google Ads, Facebook Ads, Hubspot, etc.) with your sales data (Salesforce, etc.) to create a unified customer view.
The source systems where data is generated are rarely designed for efficiently executing large data analytics operations. For example, while it may be possible to run tools such as grep, sed, and awk to analyze millions or billions of log lines on a source system, this is likely slow and inefficient. A more efficient data integration technique is to drive data into a centralized system that is purpose-built for analyzing big data, and then run your data integration analytics jobs on that central system. A data warehouse is an example of a system that is designed for data integration analytics workloads.
Data transformation may involve steps such as cleansing, normalizing, structuring, sorting, joining, or enriching data. Storing all of an organization’s data in a centralized system makes it easy to transform all of its data with a single set of tools and with a common methodology.
Unifying data into a centralized system allows analysts to request the data that they need from that central system rather than accessing data directly on operational source systems. This means that administrators do not need to grant analysts access to operational systems, which improves security.
If analytics jobs can be executed on a centralized system rather than executing on operational source systems, the risk of expensive or rogue analytics jobs interfering with the functioning of operational systems is mitigated.
ETL (extract, transform, load) and ELT (extract, load, transform) are the two most popular data integration techniques that can be used to send data into a destination such as a database, a data warehouse, a data lake, or a data lakehouse. In these data integration techniques, the transform (T) step performs operations such as cleansing, normalizing, structuring, sorting, joining, or enriching the data.
The ETL method for data integration was once preferred due to the high cost of on-premises computation and storage. However, the plummeting cost of cloud-based computation and storage has removed the requirement to perform transformation before loading data into the destination.
ETL is a data integration technique which extracts (E) data from a source system, then transforms (T) data before it loads (L) it into a destination such as a database, data warehouse, data lake, or data lakehouse. The following diagram shows an example of ETL from multiple sources into Google BigQuery.
It is worth noting that there are several disadvantages to the ETL data integration technique. Because only transformed data is stored in the destination system, analysts must know beforehand every way they are going to use the data, and every report they are going to produce. Modifications to requirements can be costly, and often require re-ingesting data from source systems. Additionally, every transformation that is performed on the data may obscure some of the underlying information, and analysts only see what was kept during the transformation phase. And finally, building an ETL-based data pipeline is often beyond the technical capabilities of analysts.
The ETL data integration technique was once preferred due to the high cost of on-premises computation and storage – for example, the transform step can remove unwanted data before loading into the destination, which reduces the workload and storage cost on the destination system. However, the plummeting cost of cloud-based computation and storage has removed the requirement to perform transformation before loading data into the destination.
The alternative ELT data integration approach is now often preferred and has several advantages when compared to ETL, as discussed in the next section.
ELT is a data integration method which extracts (E) data from a source system, and loads (L) data into a destination system before it transforms (T) the data. In other words, in the ELT data integration approach, the transformation (T) of the data is done at the destination after the data has been loaded. In the case of Airbyte, transformations are normally done by leveraging dbt (data build tool).
A core tenet of ELT philosophy is that data should be untouched as it moves through the extracting and loading stages so that the raw data is always accessible. If an unmodified version of the data exists in the destination, it can be re-transformed in the future without the need for a resync of data from source systems.
An example of ELT is given below. This diagram shows Airbyte extracting data from multiple sources and loading the raw data into Google BigQuery. In a subsequent step, dbt is used to execute SQL commands on BigQuery to transform the data.
Because data is transformed after it has been loaded, it is not necessary to know in-advance exactly how the data will be used – new transformations can be performed on the raw data as the need arises. Furthermore, analysts can always access the original raw data, as subsequent transformations will not have compromised its integrity. This gives analysts autonomy from data engineers, since it is not necessary to modify ingest pipelines in order to transform the data in a new way.
At Airbyte, we are advocates of the open-source ELT approach for data integration. More details about Airbyte’s approach to data transformation can be found in the following articles: Transformations with SQL, Transformations with dbt, Transformations with Airbyte, and Create an open-source dbt package to analyze Github data.
Data integration may be used to combine data from multiple source systems into databases, data warehouses, data lakes, or data lakehouses. One data integration challenge is deciding on which storage technology to use, as each of these data integration technologies is built to address different kinds of data storage and data processing requirements.
High-level differences between storage technologies and their use cases are briefly discussed below. However, keep in mind that the boundaries between databases, data lakes, data warehouses, and data lakehouses are fuzzy. For example, you could use a database such as Postgres as your data warehouse, or make analytical queries on top of a data lake such as Amazon S3 by using Amazon Athena.
Databases are the most familiar data integration technology. Databases are primarily used in systems that require fast, fault-tolerant transactional processing of records one at a time. Airbyte can drive data into databases including MySQL, Oracle DB, Postgres, and MongoDB.
However, databases are designed for transactional processing and are not optimized for running analytical workflows. Therefore, if the goal is to centralize your data into a system that can efficiently run data analytics, then a data warehouse or data lakehouse is probably more appropriate than a database.
A data warehouse is a unified repository that is designed for storing, querying, and analyzing large amounts of information from multiple sources, and often serves as a core business intelligence and data analytics component. The data that is stored in a warehouse is generally stored with a known purpose and is often transformed into a structure that makes data analytics fast and easy. You can use Airbyte to load data into data warehouses including Amazon Redshift, Google BigQuery, and Snowflake.
Data warehouses are generally optimized for efficiently running analytics across big data sets, and so a data warehouse is often a good candidate for use as a data integration destination technology. On the other hand, while a data warehouse is great for storing and analyzing large amounts of structured data, many enterprises also need to manage large amounts of unstructured data. This kind of storage requirement may be better addressed by a data lake, as discussed in the next section.
A data lake is a system that is used for inexpensively storing vast amounts of raw unstructured data, where the purpose of the data may not yet be defined. A data lake is not designed for efficient and quick analysis, but rather it is intended for storing vast amounts of data cheaply. Data lakes are less expensive than data warehouses, and often use low-cost commodity hardware such as object storage. Many companies build their data lakes on top of cloud storage services. Airbyte can send data to many such services including Amazon S3, Google Cloud Storage, and Azure Blob Storage.
If the goal is to use a data integration technology that can cheaply store large amounts of unstructured data, or to keep data that does not yet have a defined purpose, then data integration into a data lake may be a good choice.
A data lakehouse is a relatively new storage architecture that combines the data management capabilities of data warehouses and the storage flexibility and lower cost of data lakes. A data lakehouse can therefore serve as a single platform for data warehousing as well as serving as a data lake. Examples of data lakehouses that Airbyte can load data into are Databricks Lakehouse Platform and AWS Datalake (AWS Lake Formation).
If the data integration benefits of both data lakes and data warehouses are required, then data integration into a data lakehouse may be a good option.
Selecting the best data integration tool can be challenging. There are many factors that you should consider, including: ETL versus ELT, open-source versus closed-source, cloud-hosted versus on-premises, number of source connectors, and so on. This section starts with a brief introduction to Airbyte, followed by an overview of the main alternative data integration tools. Each alternative includes a link to a feature-by-feature comparison versus Airbyte.
Airbyte is the leading open-source data integration tool and has the goal of commoditizing data integration by addressing the long tail of connectors through a large contributor community. Airbyte provides a connector development kit (CDK) to build new connectors within just a few hours, and it supports hundreds of data connectors, a few of which are illustrated in the diagram below.
Fivetran is a closed-source, managed ELT service. Fivetran charges on monthly active rows (in other words, the number of rows that have been edited or added in a given month). For a detailed comparison see Airbyte vs. Fivetran.
Stitch is a cloud-based platform for ETL. Stitch leverages the Singer protocol to move data records from SaaS applications and databases into data warehouses and data lakes. For a detailed comparison see Airbyte vs. Stitch.
Matillion is a data integration tool built specifically for cloud database platforms. For a detailed comparison see Airbyte vs. Matillion.
Hevo is a closed-source, managed ETL service. Hevo Data offers real-time replication to several destinations. For a detailed comparison see Airbyte vs. Hevo Data.
Singer is an open-source JSON-based ETL framework. It was introduced by Stitch as a way to offer extendibility to the connectors they had pre-built. For a detailed comparison see Airbyte vs. Singer.
Meltano is an open-source orchestrator dedicated to data integration, built by Gitlab on top of Singer’s taps and targets. For a detailed comparison see Airbyte vs. Meltano.
Apache Airflow is an open-source workflow management tool. Airflow is not strictly an ETL tool, but many data teams also use Airflow operators to build ETL/ELT pipelines in Python. You can also integrate Airflow with Airbyte. For a detailed comparison see Airbyte vs Airflow.
If you have made it this far, then you likely work for an organization that creates large amounts of data across multiple systems, and probably already know that making sense of it all can be a challenge. To address this challenge, data integration can be used to create a unified view of your organization's data.
At Airbyte we believe that making use of ELT pipelines to move all of your data into a unified destination is an excellent data integration methodology. If you agree, then depending on your use case, you will need to decide if you will centralize your data into a database, a data warehouse, a data lake, or a data lakehouse.
Using Airbyte’s open-source data integration platform allows you to leverage community-built and quality-assured connectors. Furthermore, in the case that you can’t find a connector that suits your requirements, Airbyte makes it easy to build your own custom connectors by using our connector development kit.
If you haven’t yet tried out Airbyte for your data integration needs then you may wish to signup for Airbyte Cloud for a managed experience, or download Airbyte Open-Source to deploy it in your own infrastructure!
Get all your ELT data pipelines running in minutes with Airbyte.