How to Set Up Batch ETL Jobs Between Oracle and Azure Data Lake?
Efficient data movement between enterprise systems is essential for analysis, reporting, and day-to-day business operations. When you work with voluminous datasets in an Oracle database, migrating them to a scalable cloud storage solution like Azure Data Lake can improve accessibility and performance.
A batch ETL process enables you to periodically extract data from Oracle, apply transformations, and load the results into Azure Data Lake for further processing. This integration keeps your data organized and ready for advanced analytics.
In this article, you’ll learn how to set up batch ETL jobs between Oracle and Azure Data Lake.
Oracle: An Overview

Oracle is a robust relational database management system (RDBMS) that stores data in tables while also supporting object-oriented features. Its object-oriented functionality lets you define custom data types for storing complex data, and it can hold information in many formats, including text, multimedia, and geospatial data. You can query this data efficiently using Structured Query Language (SQL).
Azure Data Lake: An Overview

Azure Data Lake from Microsoft is a cloud-based scalable storage solution designed for big data analytics. It helps you store, manage, and process structured, semi-structured, and unstructured data of any size in a centralized repository. Built on top of Azure Blob Storage, this solution offers high availability and disaster recovery capabilities to ensure data durability.
Why Migrate Data Between Oracle and Azure Data Lake?
- Enhanced Analytics: Azure Data Lake is optimized for big data analytics. It offers connections to Azure Machine Learning, an ML-as-a-service solution, to streamline AI workflows. Transferring Oracle data to Azure Data Lake enables refined ML model training and AI-driven analytics.
- Integration Capabilities: Replicating data from Oracle to Azure Data Lake provides robust integration capabilities within the Microsoft ecosystem. You can connect your data to services such as Power BI, Azure Synapse Analytics, and Azure Monitor.
- Flexibility: Unlike Oracle’s rigid schema requirements, Azure Data Lake supports schema-on-read. This enables you to store raw data in its original form and process it according to your use case.
Using Azure Data Factory to Replicate Data to and from Oracle Database

Azure Data Factory (ADF) is a cloud data integration solution offered by Microsoft Azure. It supports various activities, such as Copy, Script, and Lookup, to set up batch ETL jobs between different platforms.
The Copy activity migrates data between supported platforms. The Script activity helps you transform data using Data Manipulation Language (DML) and Data Definition Language (DDL) statements, after which the transformed data can be loaded into your chosen sink, or destination. The Lookup activity retrieves data from any data source available in ADF. To interact with ADF, you can use Azure Data Factory Studio, a web-based user interface.
Let’s get started with the steps involved in moving data between Oracle and Azure Data Lake. However, before initiating the migration, you must ensure that the necessary prerequisites are satisfied.
Prerequisites:
- Access to Azure with the permissions needed to create resources such as linked services in ADF.
- Access to the Oracle database credentials.
- An Azure Data Lake Storage Gen2 account to store the data from Oracle.
- An Azure Data Factory (ADF) instance.
- Network access from ADF to the Oracle data store. If the database is hosted on-premises, in an Azure virtual network, or in an Amazon Virtual Private Cloud, configure a self-hosted integration runtime; for managed cloud services, you can use the Azure integration runtime and allow its IP addresses.
Step 1: Set Up Azure Data Factory Environment
- Launch the Azure portal in your web browser.
- Search for Data Factories using the search bar at the top of the page.

- Select your data factory resource.

- Click on Open Azure Data Factory Studio.

- You will be redirected to the Azure Data Factory homepage.
Step 2: Create a Linked Service to Oracle
To move data between Oracle and Azure Data Lake using the Copy activity, you must first create a linked service to configure the connectors.
- Navigate to the Manage tab on the left panel.

- Select Linked services and click + New.

- On the Data store tab, search for Oracle and select the available option.

- A New linked service (Oracle) dialog box will appear. Enter the service details and credentials to set up the Oracle connector.
- Test the connection and click Create.
Step 3: Create a Linked Service to Azure Data Lake
- Select the Manage tab on the Azure Data Factory interface.

- Under Connections, click Linked services and select + New.

- On the New linked service page, search for data lake and select the Azure Data Lake Storage Gen2 option.

- Enter your credentials on the New linked service (Azure Data Lake Storage Gen2) page.
- Click Test connection to verify that the credentials are valid.
- Click Create to connect to your Azure Data Lake instance. If you prefer to script the linked services instead of configuring them in the UI, see the sketch below.
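The snippet that follows is a rough sketch of how both linked services might be created with the azure-mgmt-datafactory Python SDK instead of the Studio UI. The subscription ID, resource group, factory name, Oracle connection string, and storage account details are placeholders, and the exact properties may vary with your connector version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    OracleLinkedService,
    AzureBlobFSLinkedService,
    SecureString,
)

# Placeholder identifiers -- replace with your own subscription, resource group, and factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "<resource-group>", "<data-factory-name>"

# Oracle linked service; the connection string format depends on your Oracle deployment.
oracle_ls = LinkedServiceResource(
    properties=OracleLinkedService(
        connection_string=SecureString(
            value="host=<host>;port=1521;sid=<sid>;user id=<user>;password=<password>"
        )
    )
)
adf_client.linked_services.create_or_update(rg, factory, "OracleLinkedService", oracle_ls)

# Azure Data Lake Storage Gen2 linked service (AzureBlobFS is ADF's internal name for ADLS Gen2).
adls_ls = LinkedServiceResource(
    properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",
        account_key="<storage-account-key>",  # prefer an Azure Key Vault reference in production
    )
)
adf_client.linked_services.create_or_update(rg, factory, "ADLSGen2LinkedService", adls_ls)
```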
Step 4: Create a Dataset for Oracle and Azure Data Lake
After setting up the linked service for both the source and the destination, you can create a dataset for both platforms. In ADF, a dataset is a named view that references the original data that you wish to replicate across different locations. A dataset can identify the information stored within files, folders, documents, and tables.
- To create a new dataset, navigate to the Author tab on the Azure Data Factory Studio.

- Click on the + sign icon and choose Dataset.

- A New dataset window will appear. In the search bar, enter Oracle and select the available option.

- Select the dataset format that best represents your data.

- Under the Linked service option, select the existing Oracle service.
Follow the same steps to create a new dataset for Azure Data Lake Storage Gen2.
Step 5: Build an ETL Pipeline in Azure Data Factory
Now that the requirements are satisfied, let’s create an ETL pipeline to migrate data between Oracle and Azure Data Lake.
- Select the Author tab again.

- Click on Pipelines and choose New pipeline.

- Under the Activities section, search for Copy data.

- Select the Copy data activity. You can provide a name and a description of the activity in the General section.
- Navigate to the Source option adjacent to the General section.
- From the Source dataset drop-down menu, select the dataset you created for Oracle (or for Azure Data Lake Storage Gen2, depending on the direction of the copy).
- Go to the Sink tab and select the corresponding destination dataset, either Azure Data Lake Storage Gen2 or Oracle.
- Under the Mapping tab, you can define how source columns map to the destination; this represents the transformation component of the ETL pipeline. You can also use the Script activity with custom DML/DDL statements to reshape data into a compatible format.
After performing these steps, click Validate to check the pipeline configuration for errors, then select Debug to run it. A Succeeded status in the Output pane confirms that your data pipeline is configured correctly. A programmatic version of the same pipeline is sketched below.
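The following sketch, continuing from the linked-service snippet above and reusing its adf_client, rg, and factory variables, shows how the same Copy pipeline might be defined and run with the azure-mgmt-datafactory SDK. It assumes the Step 4 datasets are named OracleDataset and ADLSGen2Dataset, that the output format is Parquet, and that a hypothetical sales.orders table is being copied; adjust these to your setup.

```python
from azure.mgmt.datafactory.models import (
    PipelineResource,
    CopyActivity,
    DatasetReference,
    OracleSource,
    ParquetSink,
)

# Copy activity: read from the Oracle dataset and write Parquet files to ADLS Gen2.
copy_activity = CopyActivity(
    name="CopyOracleToADLS",
    inputs=[DatasetReference(type="DatasetReference", reference_name="OracleDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ADLSGen2Dataset")],
    source=OracleSource(oracle_reader_query="SELECT * FROM sales.orders"),  # hypothetical table
    sink=ParquetSink(),
)

# Publish the pipeline and trigger a one-off batch run.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg, factory, "OracleToADLSPipeline", pipeline)
run = adf_client.pipelines.create_run(rg, factory, "OracleToADLSPipeline", parameters={})
print(run.run_id)
```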
Airbyte Approach to Simplifying the Development of ETL Pipelines

Airbyte is a no-code data integration tool that allows you to replicate data between multiple platforms. It supports 550+ pre-built data connectors, enabling you to move data to your preferred destination, including Oracle DB and Azure Blob Storage. If the connector you seek is unavailable, Airbyte provides a Connector Builder and a suite of Connector Development Kits (CDKs) for building custom connectors.
Here are a few key features of Airbyte that support configuring batch ETL jobs:
- Developer-Friendly Pipelines: PyAirbyte, a Python library, enables you to use Airbyte connectors in a Python development environment. It lets you extract data from various sources into popular SQL caches, including Postgres, Google BigQuery, and Snowflake. The resulting caches work with common Python libraries such as Pandas, so you can transform the data to match the destination schema.
- Automated RAG Techniques: Airbyte supports automated chunking, embedding, and indexing operations, allowing you to transform raw data into vector embeddings and store them in vector databases. This simplifies the method of building ETL pipelines for AI workflows, as the vector databases facilitate efficient data retrieval with vector similarity searches.
- Orchestrate Data Pipelines: To effectively manage and monitor the newly created ETL pipelines, you can integrate Airbyte with popular data orchestrators, such as Apache Airflow, Dagster, and Prefect.
Let’s use PyAirbyte to create an ETL pipeline that extracts data from the demo data source source-faker. Before getting started, you need a notebook or code editor in which to run the snippets; this tutorial uses Google Colab. The snippets below follow PyAirbyte’s documented API, but treat them as a sketch and adjust the configuration values to your own source.
Open a Colab notebook and create a virtual environment to isolate dependencies.
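Colab runtimes typically lack the OS-level venv package that PyAirbyte uses to create isolated environments for its connectors, so install it first (adjust the package name to the runtime’s Python version):

```python
# Add virtual environment support so PyAirbyte can isolate connector dependencies
# (assumes a Python 3.10 Colab runtime; change the version if yours differs).
!apt-get install -qq python3.10-venv
```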
To install PyAirbyte, execute:
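```python
# Install the PyAirbyte package, published on PyPI as "airbyte"
%pip install --quiet airbyte
```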
Import the PyAirbyte library in the Colab notebook by running the following code in a cell:
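```python
# Import PyAirbyte under its conventional alias
import airbyte as ab
```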
You can now check the availability of connectors by executing the following:
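```python
# Print the list of connectors PyAirbyte can install
print(ab.get_available_connectors())
```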
Let’s install the source connector. For example, to install a fake demo source, use the code below.
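```python
# Create the demo source; install_if_missing installs the connector into its own environment
source = ab.get_source("source-faker", install_if_missing=True)
```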
Replace the source-faker placeholder with the source of your choice.
Configure this source by providing the access credentials.
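```python
# source-faker only needs record counts (values here are illustrative);
# real sources take credentials such as host, username, and password.
source.set_config(config={"count": 1000, "seed": 123})
```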
After setting up the source, verify the credentials by running the following command:
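```python
# Run the connector's "check" operation to validate the configuration
source.check()
```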
The above code should return a success message confirming that the configuration is valid.
Source data can contain multiple data streams. To read the streams into PyAirbyte’s internal DuckDB cache, run the following:
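```python
# Select every available stream and read it into the default local DuckDB cache
source.select_all_streams()
read_result = source.read()
```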
Let’s assume that the source data contains a stream named products. Convert the products stream into a Pandas DataFrame.
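```python
# Materialize the cached "products" stream as a Pandas DataFrame
products_df = read_result["products"].to_pandas()
print(products_df.head())
```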
Now, you can apply transformation logic to the products_df DataFrame to make it compatible with the destination. The transformed data can then be loaded into various analytical platforms for further analysis. With these steps, you have a working batch ETL workflow.
Key Takeaways
This article provided a comprehensive step-by-step guide to building a batch ETL pipeline for transferring data between Oracle and Azure Data Lake. Configuring batch ETL jobs between these platforms is essential for data-driven decision-making. This allows you to work with transactional data in Oracle while leveraging Azure Data Lake for analytics.
To streamline the development of data pipelines, sign up for Airbyte.