How To Create an Adjust Python Pipeline with PyAirbyte

10 min read
April 5, 2024

Integrating data from various sources like Adjust into a cohesive analysis can be challenging for many organizations. These challenges often stem from complex API integrations, maintenance complexities, and the need for scalable solutions to handle growing data volumes.

PyAirbyte, with its straightforward setup and extensive compatibility with various data sources, presents an effective solution to these hurdles. It simplifies the data integration process, reduces the technical burden of API management, and scales efficiently as data volumes grow.

By streamlining the creation of data pipelines, PyAirbyte not only addresses the pain points of traditional methods but also empowers teams to focus more on deriving insights and less on overcoming technical barriers.

Traditional Methods for Creating Adjust Data Pipelines

In the realm of digital marketing, integrating and analyzing data from platforms like Adjust is critical for accurate performance assessment and strategic planning. Developers and data engineers typically rely on custom Python scripts to create data pipelines that extract, transform, and load Adjust data into databases or data warehouses for analysis. This conventional method, while flexible, comes with its set of challenges.

Custom Python Scripts: The Go-To Method

Custom Python scripts have been the primary tool for pulling data from APIs like Adjust's. Python, with its rich set of libraries for data manipulation and HTTP requests, enables the development of scripts tailored to the unique requirements of each Adjust data extraction task. Developers can specify what data to pull, how to transform it, and where to load it, giving them complete control over the data pipeline.
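To make the trade-offs concrete, here is a minimal sketch of what such a hand-rolled extraction script often looks like. The endpoint, parameters, and response shape below are illustrative placeholders, not Adjust's exact API:

import requests

API_TOKEN = "your_api_token_here"
REPORT_URL = "https://api.example.com/adjust/report"  # placeholder endpoint

def fetch_report(start_date: str, end_date: str) -> dict:
    # One hand-rolled request; any API change surfaces here at runtime.
    response = requests.get(
        REPORT_URL,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        params={
            "date_period": f"{start_date}:{end_date}",
            "metrics": "network_installs,network_cost",
            "dimensions": "country,os_name",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

Authentication, parameter handling, and error handling all live in the script, and anything the sketch omits, such as pagination, retries, and schema drift, must be added and maintained by hand.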

However, this method is not without its pain points.

Pain Points in Extracting Data from Adjust

  1. Complex API Integration: Adjust offers a robust API for data extraction, but mastering its nuances requires a significant investment of time and effort. This complexity can introduce errors and inefficiencies in the data pipeline.
  2. Maintenance Overhead: Adjust, like most platforms, evolves its features and API endpoints. Each adjustment necessitates updates to the custom scripts, leading to a high maintenance overhead. Developers must continuously monitor for API changes and implement modifications to avoid disruptions in data flows.
  3. Data Transformation Challenges: After extraction, data often requires cleaning, normalization, or transformation before it's ready for analysis. Implementing these transformations within Python scripts can be error-prone and time-consuming, especially when dealing with large volumes of data or complex transformations.
  4. Scalability Issues: As businesses grow and data volumes increase, custom scripts may struggle to efficiently process larger datasets. Scalability becomes a bottleneck, necessitating frequent optimizations or complete rewrites of the scripts to accommodate higher loads.

Impact on Data Pipeline Efficiency and Maintenance

The efficiency of a data pipeline is crucial for timely insights and decision-making. However, the outlined challenges with custom Python scripts for Adjust data integration can significantly impair this efficiency. Time and resources are diverted to maintain and debug scripts rather than focusing on analysis and strategic actions. Additionally, the risk of data inaccuracies or pipeline failures increases, potentially leading to misleading analysis and strategic missteps.

Moreover, the maintenance burden can be substantial. Keeping up with API changes, ensuring data quality, and adjusting to new business requirements demand continuous attention. This not only stretches the resources thin but also slows down the adaptability of the data pipeline to new analytical needs.

In summary, while custom Python scripts offer flexibility and control in creating Adjust data pipelines, they bring considerable challenges that can undermine the efficiency and maintenance of these pipelines. These challenges necessitate a reevaluation of the traditional methods and a search for more streamlined, adaptable solutions like PyAirbyte, which promises to address these pain points by simplifying the integration and maintenance process.

Implementing a Python Data Pipeline for Adjust with PyAirbyte

Setting Up PyAirbyte

First, install the PyAirbyte package. PyAirbyte is a Python library that lets you run connectors from Airbyte, the open-source data integration platform, directly in Python. It covers tasks such as configuring sources, checking connections, and reading data. The installation is simple:

pip install airbyte

Configuring the Adjust Source Connector

With PyAirbyte installed, you configure the Adjust source connector. This involves specifying your Adjust API token, the start date for data ingestion, the metrics and dimensions you're interested in, any additional metrics, and whether your data ingestion should include today's data.

import airbyte as ab

# Create and configure the source connector:
source = ab.get_source(
    "source-adjust",
    install_if_missing=True,
    config={
        "api_token": "your_api_token_here",
        "ingest_start": "2023-01-01",
        "metrics": ["network_cost", "network_installs"],
        "dimensions": ["country", "os_name"],
        "additional_metrics": ["custom_metric_1", "custom_metric_2"],
        "until_today": False
    }
)

This code block initializes Airbyte's Adjust source connector with your specific configuration using the ab.get_source function. The install_if_missing=True argument ensures that if the Adjust connector is not already installed locally, PyAirbyte installs it automatically.

Verifying Config and Credentials

Next, it's crucial to verify the configuration and credentials to ensure that the connection to Adjust can be established successfully.

source.check()

This step uses the source.check() method, which performs a connection check to see if your Adjust source can be reached with the provided API token and configuration.

Listing Available Data Streams

After the source is configured and verified, you might want to explore the available data streams. These are the specific types of data you can extract from Adjust, like campaign performance metrics or user demographics.

source.get_available_streams()

By calling source.get_available_streams(), you retrieve the list of all data streams that the Adjust connector can provide. This is useful for understanding exactly what data can be pulled.
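A quick way to inspect them is to print the returned stream names:

for stream_name in source.get_available_streams():
    print(stream_name)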

Selecting Streams and Loading Data

With the available streams known, you can select which ones to load. You may choose to load all available streams or only a subset.

source.select_all_streams()

In this case, source.select_all_streams() tells Airbyte to prepare all available data streams for extraction and loading into your desired destination or cache.
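If you only need a subset of the data, PyAirbyte also lets you select specific streams by name. The stream names below are placeholders; use the names returned by source.get_available_streams():

source.select_streams(["placeholder_stream_1", "placeholder_stream_2"])

Selecting only the streams you need keeps the pipeline lean by skipping data you won't use.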

Reading Data into a Local Cache

Now, it’s time to read the data. For simplicity, the example uses DuckDB as a local cache, but PyAirbyte supports other destinations like Postgres, Snowflake, and BigQuery.

cache = ab.get_default_cache()
result = source.read(cache=cache)

This code uses ab.get_default_cache() to get a default local cache instance (DuckDB) and then reads the selected streams into this cache with source.read(cache=cache).
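If you want the data to land somewhere more durable than a local DuckDB file, you can pass a different cache instead. Here is a sketch using a Postgres cache; the connection details are placeholders you would replace with your own:

from airbyte.caches import PostgresCache

pg_cache = PostgresCache(
    host="localhost",
    port=5432,
    username="airbyte",
    password="your_password_here",
    database="adjust_data",
)
result = source.read(cache=pg_cache)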

Extracting Data into a Pandas DataFrame

Finally, you may want to work with the cached data directly in Python for analysis or further processing. This can be done by loading a specific data stream into a pandas DataFrame.

df = cache["your_stream"].to_pandas()

Replace "your_stream" with the actual stream name you're interested in. This code pulls data from the specified stream out of the cache and loads it into a pandas DataFrame, making it ready for analysis with pandas' powerful data manipulation capabilities.
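From here, ordinary pandas operations apply. As a quick illustration, the snippet below aggregates installs and spend per country; the column names assume the metrics and dimensions configured earlier and may differ in your actual stream:

# Aggregate installs and cost per country (columns assumed from the
# metrics/dimensions configured above).
summary = (
    df.groupby("country")[["network_installs", "network_cost"]]
    .sum()
    .sort_values("network_installs", ascending=False)
)
print(summary.head(10))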

This step-by-step explanation outlines how to implement a data pipeline from Adjust using PyAirbyte, from installation and configuration through data extraction and loading into a pandas DataFrame for analysis.

To keep up with the latest PyAirbyte features, make sure to check our documentation. And if you're eager to see more code examples with PyAirbyte, check out our Quickstarts library.

Why Use PyAirbyte for Adjust Data Pipelines

Ease of Installation and Setup

PyAirbyte simplifies the initial setup process for data engineering tasks. The primary requirement is a Python environment, which is a common setup in most data processing ecosystems. Installing PyAirbyte is as simple as running a pip install airbyte command, making it accessible even to those who are new to data engineering or Python scripting. This straightforward approach removes barriers to entry, allowing more users to leverage PyAirbyte for their data integration needs.

Flexibility in Connector Configuration

The platform supports a wide array of source connectors out of the box, including those for popular platforms like Adjust. PyAirbyte not only allows easy configuration of these available connectors but also supports the integration of custom connectors. This flexibility ensures that teams can tailor their data pipelines to include exactly the sources they need, whether pre-built or bespoke, catering to specific project requirements.

Efficient Data Stream Selection

One of the key features of PyAirbyte is its ability to select specific data streams for processing. This selection capability is crucial for conserving computing resources and optimizing data pipeline efficiency. By focusing only on the data streams that are necessary for a given analysis or task, PyAirbyte eliminates unnecessary data processing overhead, making the entire data pipeline more streamlined and cost-effective.

Multiple Caching Backends Support

PyAirbyte's architecture supports various caching backends, including but not limited to DuckDB, MotherDuck, Postgres, Snowflake, and BigQuery. This wide range of supported backends provides users with the flexibility to choose a caching solution that fits their specific requirements in terms of scalability, performance, and maintenance. DuckDB serves as the default cache if no specific backend is defined, offering a convenient and efficient option for local data caching and processing.

Incremental Data Reading

Handling large datasets efficiently is a common challenge in data engineering. PyAirbyte addresses this by supporting incremental data reading, which significantly reduces the load on data sources and minimizes computing resource usage. By fetching only new or changed data since the last extraction, PyAirbyte ensures that pipelines are not only faster but also more cost-effective, particularly when dealing with large volumes of data.
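In practice this requires no extra code. PyAirbyte tracks sync state in the cache, so re-running the read from the earlier example fetches only records added or changed since the last run, for streams where the connector supports incremental sync:

# First run: full read; sync state is stored in the cache.
result = source.read(cache=cache)

# Subsequent runs reuse the stored state, fetching only new or
# changed records (for streams that support incremental sync).
result = source.read(cache=cache)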

Compatibility with Python Libraries

The compatibility of PyAirbyte with a broad spectrum of Python libraries, including Pandas and various SQL-based tools, opens up a plethora of possibilities for data transformation and analysis. This compatibility ensures that data engineers and scientists can easily integrate PyAirbyte into existing Python-based workflows, leveraging familiar tools and libraries to manipulate and analyze data. Furthermore, this integration capability extends the utility of PyAirbyte to orchestrators and AI frameworks, making it a versatile tool in the data engineer's toolkit.
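As a small sketch of the SQL side, assuming the cache's get_sql_engine() helper: the default cache is a real database, so you can hand its SQLAlchemy engine to SQL-based tools. Note that the table name for a stream may carry a prefix in your cache, so "your_stream" below is a placeholder:

import pandas as pd

# Query the cached tables directly with SQL ("your_stream" is a placeholder).
engine = cache.get_sql_engine()
df = pd.read_sql_query("SELECT * FROM your_stream LIMIT 100", engine)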

Enabling AI Applications

The ultimate goal of many data pipelines today is to feed into AI and machine learning models. PyAirbyte, with its efficient data integration capabilities, incremental loading feature, and compatibility with data analysis tools, is well-suited for such applications. By streamlining the process of data ingestion and preparation, PyAirbyte enables teams to focus more on developing and refining their AI models rather than being bogged down by data pipeline issues.

Conclusion

In navigating the intricate landscape of data pipelines, particularly for platforms like Adjust, PyAirbyte emerges as a beacon of efficiency, flexibility, and simplicity. Our exploration from setting up and configuring data sources to the detailed integration and analysis processes underscores the versatility and power of PyAirbyte in handling Adjust data pipelines.

PyAirbyte's straightforward installation, ease of configuration, and compatibility with an extensive range of data sources simplify the once-daunting task of data integration. Its ability to selectively process data streams, support incremental reads, and interface with various caching backends and Python libraries enhances the efficiency and scalability of data pipelines, making it an invaluable tool for data engineers and developers.

Do you have any questions or feedback for us? You can keep in touch by joining our Slack channel! If you want to keep up to date with new PyAirbyte features, subscribe to our newsletter.
