Exploring Different Types of Databases: A Guide for Data Engineers
Databases come in various configurations, each designed to support different use cases, data types, and data models.
For example, relational databases are built to record transactions and support analytical queries, while NoSQL databases are typically designed for flexible schemas and real-time data processing at scale.
To help you understand the different types of database systems, this article will explain each type of database and its key features. We’ve also listed the four key considerations for choosing the right solution for your project.
Types of Databases Comparison
This guide covers six main types of databases: relational, NoSQL, time-series, NewSQL, in-memory, and distributed.
Relational Databases (RDBMS)
Relational databases are used to store structured data. They organize data into tables with columns and rows. Each row represents a unique instance of data, and each column represents a different attribute or property of that data.
A relational database is a collection of tables. Primary and foreign keys establish relationships between tables.
Data analysts use Structured Query Language (SQL) and relational database management system (RDBMS) software to query and manipulate data.
Relational database solutions normalize data and implement constraints to maintain data integrity and consistency.
RDBMS tools can be used to build operational databases for OLTP (online transaction processing) workloads, which record many short transactions in real time.
Data from relational databases can also be loaded into a data warehouse to support data integration and analytics.
Relational database management systems have three key characteristics:
- ACID properties: Relational databases comply with the ACID (Atomicity, Consistency, Isolation, Durability) properties, which ensure that database transactions are processed reliably and consistently.
- Schema-based data organization: This database type uses a fixed, predefined schema to store data in tables.
- SQL as a query language: Structured Query Language (SQL) is a standardized language for retrieving, inserting, updating, and deleting data in the tables of a relational database.
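To make these characteristics concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are assumptions made up for illustration. The sketch shows a fixed schema with primary and foreign keys, a SQL join, and a transaction that is rolled back as a whole when one statement violates a constraint.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(id),
           amount REAL NOT NULL CHECK (amount > 0)
       )"""
)

# A committed transaction: both rows are written together.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 25.0)")

# Atomicity: the second insert violates the foreign key, so the whole
# transaction is rolled back, including the valid first insert.
try:
    with conn:
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 10.0)")
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 5.0)")
except sqlite3.IntegrityError:
    pass

# Join the two tables through the foreign key and aggregate per customer.
rows = conn.execute(
    """SELECT c.name, COUNT(o.id), SUM(o.amount)
       FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
       GROUP BY c.id"""
).fetchall()
print(rows)
```

Only the order from the committed transaction remains, because the failed transaction left no partial writes behind.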
Popular relational databases
1. MySQL: MySQL is an open-source, feature-rich RDBMS that supports database transactions, ACID compliance, foreign keys, triggers, and stored procedures. It also has several tools for database management.
2. PostgreSQL: PostgreSQL is an open-source RDBMS often chosen for large-scale enterprise applications where customization and extensibility are important. It supports custom data types, functions, and extensions, and is backed by a large, active community of users and developers.
3. Microsoft SQL Server: Microsoft SQL Server is an enterprise database for managing and storing large amounts of data. It includes many advanced features for data warehousing and in-memory OLTP for real-time analytics.
4. Oracle Database: Oracle Database is a commercial RDBMS commonly used by large enterprises for mission-critical applications that require high performance, scalability, and reliability.
NoSQL Databases
NoSQL (Not only SQL) databases are non-relational databases. Instead of a fixed relational model, they use a variety of data models and can handle semi-structured and unstructured data.
A NoSQL database has a flexible schema, which makes it more adaptable to changing data structures. It provides the flexibility needed for the ever-evolving use cases of modern data teams.
These types of databases are designed to scale horizontally and handle big data workloads. They power applications that require high availability and real-time processing, such as social media, gaming, and e-commerce.
Key types of NoSQL databases
There are four main types of NoSQL databases:
Document databases
Document-oriented databases, also called document stores, store data as documents. Every document is a self-contained entity. It can have any number of key-value pairs, where the value can be a scalar, an array, or another nested document.
Document databases are schema-less, meaning the document’s structure can vary from one document to another.
This flexibility makes a document database ideal for handling evolving data structures and storing business data that is not well-defined in advance, like user-generated content, log files, and sensor data. Some document stores also have advanced querying features.
Examples of document databases include MongoDB, Couchbase, and RavenDB.
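As a rough illustration of the document model, the sketch below uses plain Python dictionaries as documents. The posts collection and the get_path and find helpers are hypothetical and are not the API of any particular document store. The point is that two documents in the same collection can have different shapes, and a query can still reach into nested fields.

```python
# Two "documents" in the same collection with different shapes.
posts = [
    {"_id": 1, "author": "ada", "tags": ["sql", "nosql"], "stats": {"likes": 3}},
    {"_id": 2, "author": "lin", "body": "Hello", "comments": [{"by": "ada", "text": "Hi"}]},
]


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'stats.likes' through nested dicts."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find(collection, path, predicate):
    """Return documents whose value at `path` satisfies `predicate`."""
    return [d for d in collection if predicate(get_path(d, path))]


# Documents that lack the field are skipped instead of raising an error.
liked = find(posts, "stats.likes", lambda v: v is not None and v >= 1)
tagged = find(posts, "tags", lambda v: v is not None and "nosql" in v)
```

Real document stores add indexes and a richer query language on top of this idea, but the data shape is the same.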
Key-value stores
This type of database stores data as key-value pairs. The key is a unique identifier used to look up a record, and the value is the data associated with that key. A simple example is the key “color” mapped to the value “blue.”
Key-value stores are designed for high performance and low latency. They are ideal for accessing data quickly and frequently. They are used for caching, session management, and real-time analytics.
Some key-value stores serve other use cases by allowing complex data structures to be stored as values, such as lists, sets, or maps.
Examples of popular key-value stores include Redis, Amazon DynamoDB, and Riak.
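The caching and session-management use cases usually rely on keys that expire. Here is a minimal in-process sketch of that idea. TTLCache is a hypothetical helper written for this article, not the client API of Redis or any other store, and the injectable clock is an assumption added so that expiry can be shown without waiting.

```python
import time


class TTLCache:
    """Toy in-process key-value store with per-key expiry (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be shown without sleeping

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value


now = [0.0]  # fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])
cache.set("session:42", {"user": "ada"}, ttl=30)
cache.set("color", "blue")           # no TTL: kept until overwritten

fresh = cache.get("session:42")      # read before expiry
now[0] = 31.0                        # advance the fake clock past the TTL
expired = cache.get("session:42")    # entry has expired and is evicted
```

Every operation is a single lookup by key, which is why this model lends itself to low-latency access.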
Column-family databases
Column-family databases, also known as wide-column stores and sometimes loosely called column-oriented databases, store each row as a set of columns grouped into column families. Different rows in the same table can hold different columns.
A column-family database organizes data by column families, or groups of related columns that are stored and read together. These systems scale horizontally and are typically optimized for high write throughput and fast lookups by row key.
Apache Cassandra, Google Bigtable, and ScyllaDB are examples of wide-column stores. They are used for real-time analytics, IoT data processing, and content management.
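One way to picture the column-family layout is as a nested map from row key to column family to column. The sketch below assumes that simplified picture. The put and get_family helpers and the sample rows are hypothetical, and real systems add partitioning, timestamps, and on-disk storage formats that are not modeled here.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one cell; rows do not need to share the same set of columns."""
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read only the columns of one family for one row."""
    return dict(table.get(row_key, {}).get(family, {}))


put("user:1", "profile", "name", "Ada")
put("user:1", "activity", "last_login", "2024-01-01")
put("user:2", "profile", "name", "Lin")
put("user:2", "profile", "city", "Oslo")  # a column that user:1 does not have

profile = get_family("user:2", "profile")
```

Reading one family for one row key touches only the related columns, which is the access pattern this model is built around.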
Graph databases
A graph database stores data as nodes (vertices) and edges (relationships). It is designed to represent and store complex relationships between data points.
Some graph databases ship with visualization tools that let users explore and understand the relationships between data points.
A graph database can deliver high performance for relationship-heavy queries because it typically follows edges from node to node instead of computing joins across tables.
Examples of graph databases include Neo4j, Amazon Neptune, and OrientDB. These databases are used in social networks, recommendation engines, and fraud detection.
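A recommendation query such as “friends of friends” is a short traversal over nodes and edges. The sketch below assumes a small, made-up undirected graph stored as an adjacency list. The within_hops and friends_of_friends functions are hypothetical helpers, not the query language of any graph database.

```python
from collections import deque

# Adjacency list: node -> set of neighbours (an undirected "knows" graph).
edges = [("ada", "lin"), ("lin", "sam"), ("sam", "kay"), ("ada", "bo")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: hop distance} up to max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def friends_of_friends(graph, person):
    """Nodes exactly two hops away: candidates for a recommendation."""
    return sorted(n for n, d in within_hops(graph, person, 2).items() if d == 2)


suggestions = friends_of_friends(graph, "ada")
```

The traversal only visits nodes reachable from the starting point, so its cost depends on the size of the neighbourhood and not on the size of the whole dataset.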
Time-Series Databases
Time-series databases (TSDBs) store and query time-stamped data. Time-series data consists of measurements that are tracked, monitored, and aggregated over time.
Sensor data, stock prices, and server logs are examples of time-series data. TSDB solutions are used for IoT sensor networks, financial analysis, and log management.
Time-series databases provide data retention policies, which let users control how long data is kept. Retention policies can be configured at various levels of granularity and customized to match specific use cases and storage requirements.
Key characteristics of time-series databases include:
1. Efficient storage and retrieval of time-series data: Time-series databases efficiently store and analyze data based on time intervals. They are optimized for fast data ingestion and retrieval, allowing for real-time analysis of data streams.
2. Time-based aggregations and computations: This type of database includes built-in support for aggregation and analytics functions, making it easy to perform complex calculations and analysis on time-series data.
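Time-based aggregation and retention can both be sketched in a few lines. The readings below are made-up (timestamp, value) pairs, and downsample_mean and apply_retention are hypothetical helpers. A real time-series database performs these operations inside the storage engine, so treat this only as an illustration of the logic.

```python
from collections import defaultdict

# (unix_timestamp_seconds, value) readings; the numbers are made up for illustration.
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 23.0), (150, 25.0)]


def downsample_mean(points, bucket_seconds):
    """Group points into fixed time buckets and return the mean per bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def apply_retention(points, now, max_age_seconds):
    """Drop points older than the retention window."""
    cutoff = now - max_age_seconds
    return [(ts, v) for ts, v in points if ts >= cutoff]


per_minute = downsample_mean(readings, 60)
recent = apply_retention(readings, now=150, max_age_seconds=60)
```

Here the one-minute averages summarize five raw points into three, and a 60-second retention window keeps only the two most recent readings.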
Popular time-series databases
1. InfluxDB: InfluxDB is an open-source time-series database. It can handle high write and query loads for large-scale time-series data. It supports use cases like monitoring, IoT, and real-time analytics, where time is a critical factor in data analysis.
2. TimescaleDB: TimescaleDB is an open-source, relational database built as an extension of PostgreSQL. It adds time-series-specific features on top of PostgreSQL’s capabilities.
3. OpenTSDB: OpenTSDB (Open Time Series Database) is a distributed database for storing high volumes of time-series data. It is built on the Hadoop Distributed File System (HDFS) and HBase, which are part of the Apache Hadoop ecosystem. It can be used as a standalone database or as part of a larger data pipeline.
NewSQL Databases
NewSQL databases are designed to leverage the benefits of both SQL (relational) and NoSQL databases. They combine the scalability and performance of non-relational databases with the familiar structure and querying capabilities of SQL databases.
These databases use distributed architectures and clustering to achieve high scalability while still providing the strong consistency guarantees and transactional capabilities of traditional relational databases.
NewSQL databases use horizontal scaling and support distributed transactions, enabling users to perform complex transactions across a distributed environment.
Key characteristics of NewSQL databases include:
1. ACID compliance: NewSQL databases maintain ACID compliance, ensuring data consistency and integrity.
2. Scalability and performance enhancements: NewSQL databases provide high performance and low latency. They are horizontally scalable and can handle massive amounts of data and traffic.
Popular NewSQL databases
1. CockroachDB: CockroachDB is a distributed SQL database that replicates data across nodes for scalability and availability. It also supports geo-partitioning, which keeps data in specific regions.
2. Google Cloud Spanner: Google Cloud Spanner is a fully managed, horizontally scalable relational database designed to provide strong consistency and transactional capabilities across a global network of data centers.
3. MemSQL (now SingleStore): MemSQL, since renamed SingleStore, is a distributed, in-memory SQL database that provides real-time analytics and transactional processing on large volumes of data.
In-Memory Databases
In-memory databases store data primarily in the main memory (RAM) of a computer instead of on disk or other secondary storage devices. This enables rapid data access and query processing.
These types of databases are used in real-time applications that require very fast query response times, such as analytics, financial trading systems, and online transaction processing (OLTP) systems.
In-memory databases often have limited capacity compared to disk-based databases. As a result, they may not be suitable for very large datasets or applications with high data ingestion rates.
Key characteristics of in-memory databases include:
1. High-speed data access: Since data is stored in memory, in-memory databases can retrieve and process data much faster than disk-based databases, resulting in lower latency.
2. Volatility considerations: Because RAM is volatile, data can be lost in system crashes, power outages, or other unforeseen events. In-memory databases mitigate this risk with techniques like data replication, snapshotting, and transaction logging.
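Transaction logging is the easiest of these techniques to sketch. In the example below, LoggedStore is a hypothetical toy class. It keeps all data in a dictionary, appends every write to a log file before applying it, and rebuilds its state from that log on startup. The file name and keys are assumptions, and production systems combine this with snapshots so the log does not grow without bound.

```python
import json
import os
import tempfile


class LoggedStore:
    """Toy in-memory store that appends every write to a log file (hypothetical)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):
            self._replay()

    def _replay(self):
        # Rebuild the in-memory state by re-applying logged writes in order.
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def set(self, key, value):
        # Log first, then apply in memory, so an acknowledged write can be recovered.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def get(self, key, default=None):
        return self.data.get(key, default)


log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(log_path)
store.set("price:ACME", 101.5)
store.set("price:ACME", 102.0)

# Simulate a restart: a new instance rebuilds its state from the log.
recovered = LoggedStore(log_path)
```

Reads never touch the disk, so they stay fast, while the log gives the store a way to recover after a crash.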
Popular in-memory databases
1. Redis: An in-memory data store that supports various data structures, like strings, hashes, lists, sets, and sorted sets, and provides advanced features such as transactions and pub/sub messaging.
2. SAP HANA: A high-performance, in-memory database optimized for business analytics and real-time data processing.
3. Memcached: Memcached is a distributed memory caching system. It is used to cache frequently accessed data such as HTML pages, images, and database query results. This speeds up web applications by alleviating database load.
Distributed Databases
A distributed database is a database that is spread across multiple nodes or locations, connected through a shared network. It is managed using a distributed database management system (DDBMS).
A distributed database stores data using two methods: replication, which keeps copies of the same data on several nodes, and fragmentation, which splits the data into parts held on different nodes.
Distributed databases are built to address the limitations of a traditional centralized database, such as limited scalability, availability, and fault tolerance.
Key characteristics of distributed databases include:
1. Horizontal scaling: Distributed databases can scale horizontally by adding more nodes, which helps them handle large data volumes and high traffic levels.
2. Partition tolerance: A distributed database must continue functioning even when individual nodes or sub-networks fail or become unavailable. It must handle network partitions and keep data consistent.
3. Fault tolerance and high availability: This type of database is highly available and resilient, even in the face of hardware or network failures. Data is typically replicated across multiple nodes to ensure that users can always access data.
4. Performance and consistency: Distributed databases deliver high performance using optimized data structures and techniques like sharding and indexing. They also ensure that all nodes have consistent data using consensus algorithms like Paxos or Raft.
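Sharding and replication can be combined in a simple placement rule. The sketch below is a simplified assumption: it hashes each key to pick a primary node and then uses the next nodes in ring order as replicas. The node names and the replicas_for helper are hypothetical. Real systems generally prefer consistent hashing or range partitioning, because plain modulo placement moves most keys whenever the number of nodes changes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() is randomized per process for str."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def replicas_for(key, nodes=NODES, replication_factor=3):
    """Pick a primary by hash, then the next nodes in ring order as replicas."""
    start = stable_hash(key) % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


placement = replicas_for("user:1")
print(placement)
```

With a replication factor of three, each key lives on three different nodes, so reads and writes can continue if one of them fails.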
Popular distributed databases
1. Apache Cassandra: Apache Cassandra is a distributed database management system that stores data across many commodity servers. It uses a wide-column data model and offers high availability, tunable consistency, linear scalability, and support for multi-data center replication.
2. Amazon DynamoDB: Amazon DynamoDB is a fully managed NoSQL database for modern applications that require consistent, single-digit millisecond response times at any scale. It is a serverless database that can automatically scale table capacity as traffic and storage needs change. It supports both document and key-value data models.
3. CockroachDB: CockroachDB is a distributed SQL database that can run at scale, spanning multiple data centers and cloud regions. It offers horizontal scaling, automatic failover, and self-healing capabilities.
Other Types of Databases
There are two other types of database systems to consider: hierarchical databases and object-oriented databases. These databases are not widely used today and serve specialized use cases.
A hierarchical database stores data using a tree-like structure. Data is stored as records, and each record is represented by a node. The top node is called the root node. All other nodes are connected to it in a hierarchical fashion using parent and child nodes. Each parent node can have one or more child nodes, but each child node has only one parent node.
Hierarchical databases are not suitable for more complex relationships between data. Changing the database schema and data elements can also be difficult in a hierarchical model.
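The single-parent rule is easy to see in a small sketch. The parent_of mapping below is a made-up organization tree, and path_from_root and children_of are hypothetical helpers. Because each record maps to exactly one parent, there is exactly one path from the root to any record.

```python
# child -> parent; the root has no parent. Record names are illustrative.
parent_of = {
    "company": None,
    "sales": "company",
    "engineering": "company",
    "backend": "engineering",
    "frontend": "engineering",
}


def path_from_root(node):
    """Walk parent links up to the root and return the path top-down."""
    path = []
    while node is not None:
        path.append(node)
        node = parent_of[node]
    return list(reversed(path))


def children_of(node):
    """All records whose parent is `node`."""
    return sorted(child for child, parent in parent_of.items() if parent == node)


backend_path = path_from_root("backend")
engineering_children = children_of("engineering")
```

A record that belongs under two parents, such as a team shared by two departments, cannot be expressed in this mapping without duplicating it, which is the limitation described above.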
A database where a child node can have more than one parent node is called a network database. The network model is tightly coupled to its structure, so altering that structure is challenging.
An object-oriented database management system (OODBMS) stores database entries in the form of objects, which are instances of classes or prototypes. Developers can use an object-oriented programming language to manage data.
Data objects and their properties stored in an object-oriented database persist even after your program terminates.
Object-oriented databases support OOP concepts and programming paradigms, including encapsulation, inheritance, abstraction, and polymorphism.
Object-oriented databases support many data types, but their highly complex data structures can impact performance. Integration with other systems, like business intelligence tools, can also be challenging.
Choosing the Right Database for Your Project
Consider these four significant factors when selecting a database for your project:
1. Data model and structure
The first consideration should be the data models and structure of each type of database system. These two factors affect key project requirements, including data types, data volumes, methods used to access data, query types, and query response times.
Relational databases are best suited for structured data with well-defined relationships between entities. They have a fixed data model and can perform complex analytical queries.
NoSQL and other distributed databases are ideal for unstructured or semi-structured data. They have flexible data models, provide high performance, and enable data engineers to make database changes quickly.
NewSQL databases are also a strong option when a project needs both SQL and horizontal scale, provided it does not depend on advanced features like stored procedures or triggers, which some NewSQL systems support only partially.
2. Scalability requirements
If scalability is a crucial requirement, distributed and NoSQL databases are generally a better option because they can scale horizontally across multiple nodes. Other types of databases, like traditional SQL databases, can be more expensive and resource-intensive to scale.
NewSQL databases are also highly scalable but may fall behind NoSQL databases for very large or complex data sets.
3. Consistency and reliability needs
Evaluate the level of consistency required for the data stored and whether eventual consistency is acceptable.
Traditional relational databases, like MySQL or PostgreSQL, work well when your project needs strong consistency and data integrity. However, if eventual consistency is acceptable, then a NoSQL database such as MongoDB or Cassandra is a good option.
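To judge whether eventual consistency is acceptable, it helps to see what a stale read looks like. The sketch below is a toy model with one primary and one replica. The Primary and Replica classes and the replicate function are hypothetical, and the replication step is triggered by hand here, whereas a real database runs it continuously in the background.

```python
from collections import deque


class Primary:
    """Accepts writes and queues them for asynchronous replication (toy model)."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # writes not yet applied on the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Replica:
    """Serves reads from whatever state it has received so far."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(primary, replica):
    """Apply all queued writes to the replica, in order."""
    while primary.pending:
        key, value = primary.pending.popleft()
        replica.data[key] = value


primary, replica = Primary(), Replica()
primary.write("stock:widget", 5)
stale = replica.read("stock:widget")      # replica has not caught up yet
replicate(primary, replica)
converged = replica.read("stock:widget")  # replica now matches the primary
```

If a stale read like the first one would cause real harm, for example overselling inventory, the project probably needs strong consistency. If it only delays a counter or a feed update, eventual consistency is usually fine.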
4. Budget and resource constraints
Consider the cost of licensing, hosting, and maintenance for each database option. A traditional RDBMS might be cheaper to set up initially, but the costs of scaling and of additional tools for modern data integration can add to expenses.
Enterprise database management systems typically offer the most features but can be expensive. Most data management software vendors offer custom pricing for enterprise database solutions.
Some databases also require expensive licensing fees and hardware requirements, which may not be feasible for smaller projects or organizations with limited resources.
Other database management systems may require specialized knowledge or expertise to operate. This leads to additional costs for hiring skilled personnel or outsourcing the task.
Cloud databases and open-source data management tools are often the most cost-effective options for today’s data teams. Cloud databases eliminate the need for expensive hardware and offer pay-as-you-go pricing models.
Balancing trade-offs and making informed decisions
Data teams must carefully assess available database management systems to make informed decisions and create an effective and dynamic data environment. Balancing out the pros and cons of each type of database is essential to finding a solution that meets project and team needs.
For example, a NewSQL database system can reduce the trade-off between the consistency guarantees of an RDBMS and the scalability of a NoSQL database. However, it is a relatively new technology with its own drawbacks.
Ultimately, the goal is to choose a database where the strengths cater to the project’s specific purpose and the weaknesses are not significant. This requires some experimentation and iteration to find the best fit.
Conclusion
The number of database management system types keeps growing as new technologies emerge.
Most developers and engineers are familiar with RDBMS, but a combination of newer database systems, like distributed and NewSQL databases, can also be used to reach their goals.
Accurate analysis of critical factors like data size and growth projections, expected query volume and complexity, and consistency requirements helps data teams pick the best type of database for their needs.
Experimenting and ongoing monitoring are also necessary to ensure that the database meets the organization’s evolving needs over time.