Have you ever wondered: What's the fundamental difference between a data warehouse and an operational database? If so, this article is written for you!
On the surface, data warehouses and operational databases provide similar functionalities. However, they are optimized for different use cases. Data warehouses are generally optimized for OLAP (online analytical processing) workloads, while operational databases are generally optimized for OLTP (online transactional processing) workloads.
In this article you will learn about OLAP and OLTP workloads, and will see the fundamental architectural difference that makes data warehouses preferred for OLAP workloads and operational databases preferred for OLTP workloads.
This article covers the following topics:
A brief review of background information that provides a basis for understanding the architectural decisions made for operational databases and data warehouses. This includes an overview of latency numbers that every programmer should know, as well as a high-level overview of how pages of data are read from disk into memory. An example of an OLTP workload, with an explanation of how row-oriented storage makes operational databases good for such workloads. An example of an OLAP workload along with an explanation of how column-oriented storage makes data warehouses good for such workloads. A review benchmark results that compare a popular OLTP database with an up-and-coming OLAP data warehouse. A summary of when an operational database should be used and when a data warehouse may be a better choice. The benefits of using both an operational database and a data warehouse. How Airbyte may be used to synchronize data from an operational database to a data warehouse as part of your overall data integration strategy. By the end of this article you will understand the fundamental difference between a data warehouse and an operational database, and you will be well positioned to determine if your requirements are best served by a data warehouse, an operational database, or a combination of both of these technologies.
Background information In order to understand the fundamental difference between a data warehouse and an operational database, it is useful to understand design decisions and trade-offs that have been made for each one. Furthermore, in order to understand the underlying reasoning for these design decisions, it is helpful to have a general understanding of latency numbers and a few basic operating system concepts.
A review latencies The latency numbers every programmer should know table, produced by Jonas Bonér in 2012 can be used as a rough guide to help understand the time required for accessing data depending on where it is located.
ℹ️ Even though current (2022) latency numbers may be different than those from 2012, the relative orders of magnitude should still hold true and help understand the trade-offs that will be discussed later in this article.
The latency numbers from the linked article have been copied into the image below:
The orders of magnitude differences between different kinds of data access are worth paying attention to. For example, loading data from an SSD is more than 3 orders of magnitude (more than 1000x) slower than referencing main memory, and a disk seek on a spinning disk is 5 orders of magnitude (100,000x) slower than referencing data that is in memory.
The above latencies should make it clear that there is a huge performance advantage to minimizing disk access .
One optimization to reduce the number of times data needs to be read from disk is to ensure that the working set of data is efficiently loaded into memory. This is covered in more detail in the next section.
How data is loaded into memory? Operational databases and data warehouses generally store much more data on disk than can possibly fit into memory. Therefore, they rely on the underlying operating system to read different portions of data from disk into memory as required. The operating system makes use of memory paging to read and write data. When new data is requested, a page is retrieved from disk and loaded into memory. Pages range in size from 4KiB to 2MiB depending on how the system is configured.
Because data is read as pages, this means that even in the case that only a tiny piece of data is required from some portion of a disk, a minimum 4KiB of data would be read – and potentially much more than that, depending on the system configuration. Therefore, without careful consideration of how data is stored on disk, memory paging may result in a high percentage of useless data being loaded into memory alongside the requested information.
In the case that the pages that are retrieved from disk contain a low density of useful data, then many more pages need to be read from disk than would otherwise be the case. Furthermore, if a large number of pages are read into memory (as would ultimately be the case if most of the data in each page is irrelevant) then the server will quickly run out of memory capacity, at which point, some of the data in memory will be paged back to disk (written to disk) to make space for alternative pages in memory. Because of the relative slowness of disk operations, this will have a negative impact on performance, and in the worst case can lead to thrashing .
Both data warehouses and operational databases are designed to ensure that the data that is returned in each page has a high density of useful information by following the principle of locality based on the expected data access patterns. Because data warehouses and operational databases are designed for different kinds of workloads with different data access patterns, they store data on disk in distinct ways.
What is an operational database? An operational database is a type of database designed to support the day-to-day transactions and data processing needs of an organization. Operational databases store and manage data related to ongoing business operations, such as sales transactions, customer information, inventory levels, and financial records.
Operational databases are designed for OLTP (Online Transactional Processing) workloads, which often involve many users executing a high number of small atomic transactions. In other words, operational databases are optimized for high-throughput transactional processing of records.
Examples of popular operational databases are Oracle , MySQL , and PostgreSQL .
An example operational database workload An example of a typical operational database workload could be the storage and tracking of inventory information for a retailer. Required columns are SKU, Name, Inventory count, Price, etc. This could look as follows:
Example data that may be stored in an operational database In a system built on top of this operational database, if a client purchases Bread, then a cashier would scan the SKU (Id1), which would retrieve the corresponding record from the operational database to get the product name (Bread) and price ($1.50), and it would then decrement the inventory count (50 to 49).
In this example several fields within a single record have been accessed to process this purchase, but notably the values located in other records were not required. Operational databases are therefore optimized to ensure that the contents of a given record can be efficiently loaded from disk and efficiently processed. For a large retailer, millions of such transactions may need to be processed each day.
Operational database storage Operational databases store their data in a row-oriented manner which means that the values associated with a single record are stored together on disk. Storing data in this manner corresponds to the most common data access and usage patterns associated with OLTP workloads, such as the inventory/purchase example just given. This minimizes the number of pages that need to be read from disk when performing an OLTP operation. As should be clear from the latency numbers discussed earlier, keeping disk activity to a minimum is key to providing good performance.
Assuming that the database stores its data on a solid-state drive (SSD), a row-oriented/record-by-record format on disk may look as follows:
How an operational database may store row-oriented data on a solid-state (SSD) drive Alternatively, if an operational database uses spinning disks, the data storage may look similar to the following:
How an operational database may store row-oriented data on a spinning hard disk drive (HDD)