Top Azure Data Services Overview: Integration, Storage and Analytics
After reviewing Microsoft Azure’s offering for relational databases, it is now time to check out other services designed to support most companies’ data-project requirements: a data integration tool for building ETL/ELT processes, a scalable data storage solution, and a data platform that brings together all the components of a modern data solution.
Data integration services
Azure Data Factory
It is a serverless, cloud-based data integration and data transformation service. Some call it the cloud version of the well-known SQL Server Integration Services (SSIS), which, by the way, is fully compatible with Azure Data Factory: you can lift and shift your SSIS packages and run them in Azure Data Factory. It will not be cheap, but it can be done.
The service offers a web-based and code-free UI. Pipelines are developed following a drag-and-drop approach, which makes the learning process easier and faster for non-technical professionals.
Regarding integration capabilities, Azure Data Factory has more than 100 native connectors to different types of systems, such as the NoSQL databases MongoDB and Cassandra, as well as connectors to file stores like Oracle Cloud Storage and Amazon S3, among other systems. Data exposed through OData feeds, SOAP APIs, or RESTful APIs can be accessed through the generic connectors available for each method. If no native connector exists for your data source, Azure Blobs can be used as a staging area: load your data as files, then access those files from your pipelines in Azure Data Factory.
Now, what about data transformation capabilities? Well, depending on your project needs, you can either transform your data natively in Azure Data Factory or invoke external services. On the native side, we have Mapping Data Flows and a Power Query-based component. Both run on scaled-out Spark clusters that are fully managed by Azure and let you perform all sorts of data transformations graphically.
On the external services side, we have different options tied to other Azure services. For example, if you have an HDInsight cluster (the Azure service for running open-source big-data processing frameworks like Apache Hadoop), you can execute different processes on it, such as Hive queries and Spark programs, among others. Other services fully integrated with Azure Data Factory are Azure Databricks and Azure Synapse Analytics: you can execute a Spark JAR, a Python file, or a Databricks/Synapse notebook from your pipeline. If you have more specific transformation requirements, you can use the Custom activity, which runs your own code logic on an Azure Batch pool of virtual machines (an Azure service designed to create and manage pools of compute nodes).
Last but not least, Azure Data Factory has two awesome features: expressions and functions. They can be used to add dynamic behavior to your pipelines and to implement algorithms with low-to-medium-complexity logic, such as getting the current date and time and converting it to a specific time zone. Another use case for the two features is a metadata-driven pipeline that transfers data from multiple source tables to their corresponding target tables. If you have done something similar before, you know it can save you tons of work.
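As a rough sketch, the date-and-time example above could be expressed as a Set Variable activity inside a pipeline definition; the activity name, variable name, and time zone below are hypothetical, but `utcnow()` and `convertFromUtc()` are real Data Factory functions:

```json
{
  "name": "SetLocalRunDate",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "localRunDate",
    "value": "@convertFromUtc(utcnow(), 'W. Europe Standard Time', 'yyyy-MM-dd HH:mm')"
  }
}
```

The same expression language drives metadata-driven pipelines: a Lookup activity reads the table list, and expressions parameterize the source and sink of each copy operation.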
Data storage services
Although we will talk about only one service, the Azure storage platform offers several types of services aimed at specific data storage needs. The baseline, or foundation, of all of them is the storage account. Once you have one, you can access all the other services. Data in this account is secure, highly available, durable, and massively scalable. It can be accessed from anywhere in the world over HTTP or HTTPS via a REST API; client libraries are also available for many programming languages, including .NET, Java, Node.js, Python, PHP, Ruby, and Go.
Azure Blob Storage - Azure Data Lake Storage
Blob stands for Binary Large OBject, a data type used to store unstructured data like video, audio, or images. Azure Blob Storage, also known as Azure Blobs, is Microsoft’s storage solution for this type of data. It is mainly used for the following purposes:
- Web applications: serve images or documents directly to the browser
- Video/Audio streaming
- Store data for backup and restore, disaster recovery, and archiving
- Data analysis
Cost optimization is possible thanks to the three available access tiers: Hot, Cool, and Archive. Use the first one if you plan to access the data constantly. The Cool tier is designed for data that will be accessed infrequently and stored for at least 30 days. If you need to keep data for a long time for compliance or regulatory reasons and will rarely access it, the Archive tier is the right choice. Each tier has a different cost and its own conditions. For instance, the Archive tier’s cost per GB is very low, but retrieving data from it can take up to 15 hours.
Another way to reduce storage costs is the Azure Storage lifecycle management feature, which helps you manage your data’s lifecycle efficiently by letting you create rule-based policies that move data objects between tiers automatically.
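A sketch of such a policy is shown below; the rule name and `raw/` prefix filter are hypothetical, while the structure follows the lifecycle management policy schema (rules with filters and tiering/deletion actions based on days since last modification):

```json
{
  "rules": [
    {
      "name": "age-out-raw-data",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["raw/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete": { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}
```

Read as: blobs under `raw/` move to Cool after 30 days without modification, to Archive after 90, and are deleted after a year.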
In terms of data organization, Azure Blobs does not support directories. You can simulate them by adding the character “/” to blob names; Azure Blobs recognizes this and treats it as a virtual “folder”.
This limitation means that renaming or deleting a “directory” requires a number of operations proportional to the number of blobs under it. Azure Data Lake Storage, on the contrary, does support hierarchical organization, and blobs can be stored in real directories. The directories’ metadata allows rename and delete operations to be executed as single atomic operations. A hierarchical organization also keeps the data organized, which translates into better storage and retrieval performance, especially for big data projects in which data is stored in data lake services like Azure Data Lake Storage and queried/analyzed with services such as Azure Databricks and Azure Synapse Analytics.
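To make the flat-namespace limitation concrete, here is a minimal Python sketch (with made-up blob names) of what “renaming” a virtual directory implies: there is no directory object to rename, so every blob under the prefix must be rewritten individually.

```python
# Hypothetical blob names in a flat namespace; "/" only simulates folders.
blobs = [
    "raw/2024/jan.csv",
    "raw/2024/feb.csv",
    "curated/report.csv",
]

def rename_virtual_dir(names, old_prefix, new_prefix):
    """Rename a virtual 'directory' in a flat namespace.

    Every blob whose name starts with old_prefix must be copied/renamed
    one by one, so the cost grows with the number of blobs under the
    prefix. With a real hierarchical namespace this is one atomic
    metadata operation instead.
    """
    return [
        new_prefix + name[len(old_prefix):] if name.startswith(old_prefix) else name
        for name in names
    ]

renamed = rename_virtual_dir(blobs, "raw/", "archive/raw/")
```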
Azure Data Lake Storage, also known as Azure Data Lake Storage Gen2, is not a separate service from Azure Blobs. It is rather an extra set of capabilities, focused on big data analytics, that can be enabled. It was built on top of Azure Blobs and designed to handle multiple petabytes of data at hundreds of gigabits of throughput. Thanks to its Hadoop compatibility, data can be accessed and managed just as it is with the Hadoop Distributed File System (HDFS).
This is the main product Microsoft offers to support data lake projects, and it is fully integrated with other Azure services, such as Azure Data Factory and Azure HDInsight, among others.
Data analytics services
Azure Synapse Analytics
It is a unified analytics platform with data warehousing and big data processing capabilities. The services that make up the platform support tasks such as data storage and exploration, data integration, big data analytics, and machine learning.
All services are managed and contained within an Azure Synapse workspace. Once it is set up, you can configure any of the other services. In this post, we will check out the main features of only one of them: the Synapse SQL analytics runtime.
Synapse SQL is divided into two services: Serverless SQL Pool and Dedicated SQL Pool. The first one is a distributed data processing system for data exploration and analytics. It queries data stored in your Azure Data Lake instance, Azure Cosmos DB, or Dataverse using familiar T-SQL syntax. Since the service is serverless, there is no infrastructure to set up; in fact, it is enabled by default once you create your Azure Synapse workspace, and it auto-scales resources depending on the workload. Note that you pay only for the amount of data processed by your queries; having the service enabled does not incur any cost at all.
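To give a feel for this, an ad-hoc exploration query in a Serverless SQL Pool can read Parquet files in the lake directly via `OPENROWSET`; the storage account, container, and path below are hypothetical:

```sql
-- Explore Parquet files in the data lake; billed by the amount of data scanned.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```

No table has to exist beforehand; the files themselves are the data source.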
On the other hand, the Dedicated SQL Pool, a service designed for implementing data warehouses, does have a dedicated set of resources, measured in what are called Data Warehousing Units (DWU), a term Microsoft came up with to represent a combination of CPU, memory, and IO. The number of DWUs can be scaled up or down depending on your needs. Since the service is based on a Massively Parallel Processing (MPP) engine, queries are processed in parallel, delivering fast response times. However, performance is not something you can take for granted just because of the underlying architecture. It is also important to understand how data is stored, to know the distributed-tables model, and to follow the design guidelines offered by Microsoft to get the most out of the product.
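As an example of that distributed-tables model, a fact table in a Dedicated SQL Pool is typically hash-distributed on a frequently joined column so the MPP engine can spread the work evenly across compute nodes; the table and column names here are hypothetical:

```sql
-- Rows are assigned to distributions by hashing CustomerKey.
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    SaleDate    DATE           NOT NULL,
    Amount      DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

Choosing a poor distribution column (one with few distinct values or heavy skew) is one of the ways performance gets lost despite the MPP architecture.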
Microsoft Azure offers many more products and services than the ones we reviewed in this 2-part blog series; the idea was simply to give a high-level walkthrough of some of the Azure services companies commonly use in their data journey.