11 Types Of Python Data Structures For Data Analysis
Data analysis is about wrangling raw information into a usable format to discover new trends and insights. But how can you analyze this data? One answer is Python, a popular programming language that offers a powerful toolbox for this task. At the core of Python are fundamental data structures that act as containers for organizing and manipulating your data. Understanding these structures is important for building efficient and effective data analysis workflows.
This article dives into the essential Python data structures that are especially well suited to data analysis.
What are Data Structures?
Data structures are the foundation for organizing and storing data in a computer’s memory. They allow for efficient access, manipulation, and retrieval of the data. Here are some common data structures:
- Array: An array is a collection of data items of the same type stored in contiguous memory locations. Once it is created with a specific size, it usually cannot be resized later.
- Tree: A tree is a fundamental data structure that represents and organizes data in a hierarchical format, making it easier to navigate. The top node of the tree is known as the root node, and other nodes below it are called the child nodes.
- Graph: A graph is a data structure that is not linear and is composed of vertices and edges. Vertices, also known as nodes, and edges, which are lines or arcs, establish connections between any two nodes within the graph.
The sections below cover some of the most widely used data structures in more detail.
What are Data Structures in Python?
Python data structures fall into two groups: mutable and immutable. Mutable data structures can be changed after they are created. For example, you can add, remove, or reorder their elements. The built-in mutable data structures are lists, dictionaries, and sets.
In contrast, immutable data structures cannot be modified once they are created. Among Python's built-in types, tuples, strings, and frozensets are immutable. In addition, third-party packages provide their own data structures, like DataFrames and Series in Pandas or arrays in NumPy. You’ll get to know about these in the later sections.
Lists
In Python, lists are implemented as dynamic, mutable arrays containing a sequence of items. They are heterogeneous in nature. For instance, you can store integers, strings, and even functions within the same list. Unlike a fixed-size array, where the capacity has to be defined up front, a list grows and shrinks as you add or remove elements.
Here are some common methods through which you can easily manipulate your list:
- list.append() adds a single element at the end of the list.
- list.insert() is a method that inserts an element at the given index, shifting the other elements to the right.
- list.extend() appends every element of another list (or any iterable) to the end of the list. It has the same effect as the += operator.
- list.index() method searches for the given item from the start of the list and returns its index value.
- list.remove() searches for the first instance of the given element and then removes it.
Example of Lists in Python:
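A minimal sketch of the list methods above, using made-up temperature readings (the variable names and values are illustrative assumptions):

```python
# Illustrative sample data (assumed values, not from a real dataset)
temperatures = [21.5, 23.0, 19.8]

temperatures.append(22.1)          # add one element at the end
temperatures.insert(0, 20.0)       # insert at index 0, shifting the rest right
temperatures.extend([24.3, 19.8])  # append every element of another list

first_match = temperatures.index(19.8)  # index of the first 19.8
temperatures.remove(19.8)               # remove only the first 19.8

print(temperatures)  # [20.0, 21.5, 23.0, 22.1, 24.3, 19.8]
print(first_match)   # 3
```

Note that remove() deletes only the first matching element, so the second 19.8 is still in the list.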
Dictionaries
A dictionary in Python is a mutable collection of key:value pairs that preserves insertion order (guaranteed since Python 3.7). Keys are unique, hashable identifiers that give access to the associated values stored in the dictionary, and values can be of any data type. Dictionaries are written in curly brackets ‘{}’.
Some common dictionary methods are:
- clear() removes all the elements from the dictionary.
- copy() returns a shallow copy of the dictionary.
- fromkeys() creates a new dictionary from the given keys, with every key mapped to the same value.
- pop() removes the element with a particular key and returns its value.
- values() returns a view of all the values in the dictionary.
- update() adds the specified key-value pairs to the dictionary, overwriting the values of keys that already exist.
Example of Dictionaries in Python:
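A minimal sketch of the dictionary methods above. The customer record and its keys are hypothetical and serve only as an illustration:

```python
# Hypothetical record describing one customer
customer = {"name": "Ada", "city": "London", "orders": 3}

customer.update({"orders": 4, "vip": True})  # change one value, add one key
city = customer.pop("city")                  # remove a key and get its value back
values = list(customer.values())             # values() is a view; list() copies it

snapshot = customer.copy()                   # shallow copy, independent of the original
defaults = dict.fromkeys(["q1", "q2"], 0)    # new dict, every key mapped to 0
customer.clear()                             # empty the original dictionary

print(city)      # London
print(values)    # ['Ada', 4, True]
print(snapshot)  # {'name': 'Ada', 'orders': 4, 'vip': True}
print(defaults)  # {'q1': 0, 'q2': 0}
print(customer)  # {}
```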
Sets
Sets in Python are another fundamental data structure for storing and manipulating collections of items. Unlike lists, sets are unordered and contain only unique elements: if you try to add a duplicate element, it is silently ignored. Sets are written with ‘{}’, with elements separated by commas within the brackets. An empty set must be created with set(), because empty braces create a dictionary.
Here are some of the Set methods:
- add() adds a new element to the set.
- clear() removes all the elements from the set.
- discard() removes the specified item if it is present and does nothing otherwise.
- union() returns a new set containing the elements of all the given sets.
- pop() removes and returns an arbitrary element from the set.
Example of set in Python:
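A minimal sketch of the set methods above, using made-up tag names as sample data:

```python
# Illustrative tags (assumed sample data)
tags = {"python", "pandas", "numpy"}

tags.add("pandas")       # duplicate: silently ignored
tags.add("heapq")        # new element is added
tags.discard("numpy")    # removes the element
tags.discard("missing")  # not present: no error is raised

all_tags = tags.union({"scipy", "python"})  # new set, originals unchanged
removed = tags.pop()                        # removes an arbitrary element

empty = set()  # {} would create an empty dictionary instead

print(all_tags)   # element order may vary between runs
print(len(tags))  # 2
```

Because sets are unordered, the element returned by pop() is not predictable, so the sketch does not show it.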
Tuples
Tuples in Python are immutable sequences, similar to lists, but the major difference is that they cannot be modified once created. They are defined by enclosing comma-separated values within parentheses ‘()’. Tuples are often used to store corresponding pieces of information together, such as coordinates, database records, or function arguments.
The methods you can apply to a tuple are:
- count() method returns the number of times a specified value occurred in a tuple.
- index() method searches the tuple for a particular value and returns its position.
Example of tuples in Python:
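A minimal sketch of tuple creation, unpacking, and the two tuple methods. The coordinate and reading values are illustrative assumptions:

```python
# A coordinate pair stored as a tuple (illustrative values)
point = (3, 4)
x, y = point  # tuple unpacking

readings = (7, 2, 7, 9, 7)
sevens = readings.count(7)      # how many times 7 occurs
first_nine = readings.index(9)  # position of the first 9

# Tuples are immutable: item assignment raises TypeError
is_immutable = False
try:
    readings[0] = 1
except TypeError:
    is_immutable = True

print(x, y)          # 3 4
print(sevens)        # 3
print(first_nine)    # 3
print(is_immutable)  # True
```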
Some of the user-defined data structures that you can use in Python for managing data are:
Stack
A stack is an ordered data structure that follows the Last-In-First-Out (LIFO) principle. This means that the most recently added element will be the first one to be removed. The two core operations are push, which adds an element to the top of the stack, and pop, which removes the element at the top.
Implementation of Stack in Python
In Python, there are several ways to implement a stack. Let's explore a few of them in detail:
Method 1: Using a List
The most common way to implement a Stack in Python is by using a list. You can use append() to insert elements to the top of the stack, while pop() removes the element in LIFO order.
Here is an example:
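A minimal sketch of a list-based stack. The step names are made up for illustration:

```python
# The end of the list acts as the top of the stack
stack = []

stack.append("load")   # push
stack.append("clean")  # push
stack.append("plot")   # push

top = stack.pop()       # pop the most recently added element
next_top = stack.pop()  # pop the one added before it

print(top)       # plot
print(next_top)  # clean
print(stack)     # ['load']
```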
Method 2: Using collections.deque
In addition to the built-in data structures, Python offers some additional options for data collection through its built-in collections module. This module includes various data structures, one of which is deque.
The deque (pronounced "deck") is a double-ended queue that allows you to insert and delete elements at both the front and the rear. It is often preferred over a list because it supports O(1) append and pop operations at both ends, whereas a list is only fast at its end.
Let's understand with an Example:
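A minimal sketch of a stack built on collections.deque, using arbitrary integers as sample data:

```python
from collections import deque

# The right end of the deque acts as the top of the stack
stack = deque()

stack.append(1)  # push
stack.append(2)  # push
stack.append(3)  # push

top = stack.pop()  # pop the most recently added element

print(top)          # 3
print(list(stack))  # [1, 2]
```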
Linked Lists
Unlike an array, a linked list stores elements more flexibly. Instead of relying on contiguous memory locations, it connects elements using nodes that hold data and an address pointing to the next link in the chain.
Implementation of Linked List in Python
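Python has no built-in linked list, so one is usually written as a pair of small classes. The sketch below is one possible singly linked list; the class and method names (Node, LinkedList, append, prepend) are illustrative choices, not a standard API:

```python
class Node:
    """One element of the list: a value plus a reference to the next node."""

    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    """A minimal singly linked list that tracks only its head node."""

    def __init__(self):
        self.head = None

    def append(self, data):
        """Add a node at the end by walking from the head (O(n))."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def prepend(self, data):
        """Add a node at the front (O(1))."""
        new_node = Node(data)
        new_node.next = self.head
        self.head = new_node

    def __iter__(self):
        """Yield each stored value by following the next references."""
        current = self.head
        while current is not None:
            yield current.data
            current = current.next


numbers = LinkedList()
numbers.append(2)
numbers.append(3)
numbers.prepend(1)

items = list(numbers)
print(items)  # [1, 2, 3]
```

Inserting at the front only rewires one reference, which is the main advantage over an array, where the existing elements would have to be shifted.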
Queues
The queue data structure is like a list in which all additions are made at one end and all removals at the other. It works on the First-In-First-Out (FIFO) principle. This means the first element inserted into the queue will be the first one removed. Queues are used to manage tasks or data that need to be processed in a specific order.
Implementation of Queues in Python
There are different ways to implement a Queue in Python. Let’s take a look at a few of them:
Method 1: Using a List
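One simple approach to creating a queue is to use a list. You can insert elements at the rear with the append() method and remove them from the front with pop(0). Note that pop() without an argument removes the last element, which would give stack (LIFO) behavior instead.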
Here is an example:
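A minimal sketch of a list-based queue. The job names are made up for illustration:

```python
# The front of the queue is index 0; new items join at the end
queue = []

queue.append("job-1")  # enqueue
queue.append("job-2")  # enqueue
queue.append("job-3")  # enqueue

first = queue.pop(0)  # dequeue the oldest element (O(n) on a list)

print(first)  # job-1
print(queue)  # ['job-2', 'job-3']
```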
Method 2: Using collections.deque
In Python, a queue can also be implemented using the deque class from the collections module. The advantage of deque is that appending and removing elements at either end takes constant time, O(1), whereas removing an element from the front of a list takes O(n) time because all the remaining elements have to be shifted.
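A minimal sketch of a deque-based queue, again with made-up job names:

```python
from collections import deque

queue = deque(["job-1", "job-2"])

queue.append("job-3")  # enqueue at the rear

first = queue.popleft()   # dequeue from the front in O(1)
second = queue.popleft()  # next element in FIFO order

print(first)        # job-1
print(second)       # job-2
print(list(queue))  # ['job-3']
```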
Heaps
A heap is a specialized binary tree in which every parent node is ordered relative to its children, so the smallest element (in a min-heap) or the largest element (in a max-heap) is always at the root. This makes it efficient to repeatedly retrieve the smallest or largest element of a collection. Heaps are widely used to implement priority queues, where the highest-priority item is always returned first.
In Python, heaps are implemented using the heapq module. This module offers relevant functions to carry out various operations on the heap data structure. Here are a few of them:
- heapify() converts a list into a heap.
- heappush() inserts an element into a heap.
- heappop() removes and returns the smallest element from a heap.
- nlargest() returns the n largest elements from a dataset.
- nsmallest() returns the n smallest elements from a dataset.
Here is an example of using the heapq module:
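A minimal sketch of the heapq functions above. The scores are arbitrary sample values; heapq maintains a min-heap inside an ordinary list:

```python
import heapq

# Arbitrary sample values
scores = [42, 17, 68, 5, 23]

heapq.heapify(scores)       # rearrange the list in place into a min-heap
heapq.heappush(scores, 11)  # insert while keeping the heap property

smallest = heapq.heappop(scores)         # remove and return the smallest item
top_two = heapq.nlargest(2, scores)      # two largest remaining values
bottom_two = heapq.nsmallest(2, scores)  # two smallest remaining values

print(smallest)    # 5
print(top_two)     # [68, 42]
print(bottom_two)  # [11, 17]
print(scores[0])   # 11 (the root of a min-heap is always the smallest item)
```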
Now, let's understand how to perform advanced data analysis using Python libraries like NumPy and Pandas. You need to import these libraries to use them in your Python code.
NumPy Arrays
NumPy, short for Numerical Python, is a robust Python library for numerical computation. It supports large, multi-dimensional arrays and provides a wide range of mathematical functions for efficient operations on these arrays.
NumPy arrays are similar to a Python list but have some key differences. The primary distinction is that NumPy arrays are much faster and have stricter requirements on the homogeneity of the objects they contain compared to lists.
NumPy arrays hold elements of the same data type. For example, a NumPy array of strings can only contain strings and no other data types. In contrast, a Python list can include a mixture of strings, numbers, booleans, and other objects.
Here are some important NumPy functions in Python:
- np.array() creates a NumPy array from a list or another sequence.
- np.zeros() creates a new NumPy array filled with zeros.
- np.arange() creates an array with a range of values.
- np.ones() creates a new NumPy array filled with ones.
- np.linspace() creates an array with a specified number of evenly spaced values.
Example of NumPy Arrays in Python:
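A minimal sketch of the NumPy functions above, assuming NumPy is installed and imported under the conventional alias np. The values are arbitrary:

```python
import numpy as np

values = np.array([1, 2, 3, 4])  # array from a Python list
zeros = np.zeros(3)              # three zeros (floats by default)
ones = np.ones((2, 2))           # 2x2 array of ones
steps = np.arange(0, 10, 2)      # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values, endpoints included

doubled = values * 2  # arithmetic applies element-wise, no loop needed

print(doubled)  # [2 4 6 8]
print(steps)    # [0 2 4 6 8]
print(grid)     # [0.   0.25 0.5  0.75 1.  ]
```

The element-wise multiplication in the last step is what makes arrays convenient for numerical work: the same expression on a Python list would repeat the list instead.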
Pandas Series
A Pandas Series is a one-dimensional labeled array that can hold data of any type, such as integers, floats, strings, and more. You can think of a Pandas Series as a single column of data, where each value in the series has a corresponding label, and these labels are collectively referred to as the index.
Here are some commonly used attributes and methods available in the Pandas Series:
- size is an attribute (written without parentheses) that returns the total number of elements in the underlying data of the series.
- head() returns the first n rows of a Series (5 by default).
- tail() returns the last n rows of a Series (5 by default).
- unique() returns the unique values in the Series.
Example of Pandas Series in Python:
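A minimal sketch of a Series with a custom index, assuming Pandas is installed and imported as pd. The sales figures and day labels are made up:

```python
import pandas as pd

# Made-up daily sales, labeled by weekday
sales = pd.Series([250, 300, 250, 410], index=["Mon", "Tue", "Wed", "Thu"])

n = sales.size             # attribute, not a method call
first_two = sales.head(2)  # first two rows
last_one = sales.tail(1)   # last row
distinct = sales.unique()  # unique values in order of first appearance
tuesday = sales["Tue"]     # look up a value by its label

print(n)         # 4
print(distinct)  # [250 300 410]
print(tuesday)   # 300
```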
DataFrames
A DataFrame is a two-dimensional labeled data structure provided by the Pandas library. It is similar to a table, where data is organized in rows and columns. Pandas DataFrames provide a flexible way to manipulate data, from selecting or replacing specific columns and indices to reshaping the entire dataset.
Here are some commonly used methods on DataFrames in Python:
- DataFrame.pop() returns the specified column and drops it from the frame.
- DataFrame.tail() returns the last n rows.
- DataFrame.to_numpy() converts the DataFrame to a NumPy array.
- DataFrame.head() returns the first n rows.
Example of DataFrame in Python:
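A minimal sketch of the DataFrame methods above, assuming Pandas is installed and imported as pd. The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical weather table
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune"],
        "temp_c": [4.0, 19.5, 31.2],
        "humidity": [80, 75, 40],
    }
)

top = df.head(2)     # first two rows
bottom = df.tail(1)  # last row

humidity = df.pop("humidity")         # returns the column and removes it from df
as_array = df[["temp_c"]].to_numpy()  # 2D NumPy array with one column

print(list(df.columns))   # ['city', 'temp_c']
print(humidity.tolist())  # [80, 75, 40]
print(as_array.shape)     # (3, 1)
```

Since pop() modifies the frame in place, top and bottom (created earlier) still contain the humidity column while df no longer does.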
Counter
A Counter is a dictionary subclass from Python's collections module for counting hashable objects. It creates a dictionary where elements are stored as keys and their counts as values.
Here are some commonly used methods on Counter in Python:
- Counter.elements() - Returns an iterator of elements repeating as many times as their count
- Counter.subtract([iterable]) - Subtracts counts from another iterable
- Counter.update([iterable]) - Adds counts from another iterable
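A minimal sketch of the Counter methods above, counting made-up log levels. It also uses most_common(), another Counter method that is not listed above:

```python
from collections import Counter

# Made-up log levels
words = ["error", "ok", "ok", "error", "ok"]

counts = Counter(words)         # ok: 3, error: 2
counts.update(["ok", "warn"])   # add counts: ok: 4, warn: 1
counts.subtract(["error"])      # subtract counts: error: 1

expanded = sorted(counts.elements())  # each key repeated by its count
most_common = counts.most_common(1)   # highest count first

print(counts["ok"])  # 4
print(expanded)      # ['error', 'ok', 'ok', 'ok', 'ok', 'warn']
print(most_common)   # [('ok', 4)]
```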
String
A String is an immutable sequence of characters. Python strings are Unicode by default, which makes them well suited to handling text data in any language.
Here are some commonly used methods on Strings in Python:
- string.split() - Splits string into a list of substrings
- string.strip() - Removes leading and trailing whitespace
- string.replace(old, new) - Replaces all occurrences of old with new
- string.upper()/lower() - Converts to uppercase/lowercase
Example of String operations in Python:
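A minimal sketch of the string methods above, cleaning a made-up comma-separated input. Because strings are immutable, every method returns a new string and leaves the original unchanged:

```python
# Made-up raw input with stray whitespace
raw = "  Alice, Bob ,carol  "

cleaned = raw.strip()                            # trim both ends
names = [n.strip() for n in cleaned.split(",")]  # split, then trim each part
shouted = cleaned.upper()                        # uppercase copy
replaced = cleaned.replace(",", ";")             # swap every comma

print(cleaned)   # Alice, Bob ,carol
print(names)     # ['Alice', 'Bob', 'carol']
print(shouted)   # ALICE, BOB ,CAROL
print(replaced)  # Alice; Bob ;carol
```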
Matrix
A Matrix is a 2D array structure, typically implemented using NumPy arrays in Python. It's used for mathematical operations and data representation.
Here are some commonly used methods on Matrices in Python:
- matrix.transpose() - Flips matrix over its diagonal
- matrix.dot(other) - Matrix multiplication
- matrix.reshape() - Changes matrix dimensions
- matrix.sum()/mean() - Calculates sum/average of elements
Example of Matrix in Python:
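A minimal sketch of the matrix operations above on two small 2x2 NumPy arrays, assuming NumPy is installed and imported as np. The values are arbitrary:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

transposed = a.transpose()  # rows become columns
product = a.dot(b)          # matrix multiplication (same as a @ b)
flat = a.reshape(1, 4)      # same data, new shape
total = a.sum()             # sum of all elements
average = a.mean()          # mean of all elements

print(transposed)  # [[1 3]
                   #  [2 4]]
print(product)     # [[19 22]
                   #  [43 50]]
print(flat)        # [[1 2 3 4]]
print(total)       # 10
print(average)     # 2.5
```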
Unlock the Power of Your Data with Airbyte
Now you know what Python data structures are and how to perform different operations on them. However, to perform queries on this data, you first need to consolidate it on a single platform. This is where Airbyte can help you!
It is a data integration platform that allows you to consolidate all your segregated data on a single platform. Airbyte’s vast connector library offers more than 350 pre-built connectors that can connect to multiple sources and destinations. Beyond its extensive library of pre-built connectors, Airbyte's Connector Development Kit (CDK) empowers you to build custom connectors to suit unique data sources or destinations. This flexibility allows seamless integration regardless of your specific data ecosystem.
In addition, its change data capture (CDC) capabilities help keep your target system in sync with updates made at the source, so that data is migrated efficiently.
Key features of Airbyte are:
- Developer-friendly UI: If you are seeking even greater control and flexibility for designing data pipelines, Airbyte offers PyAirbyte, a Python library that provides programmatic access. With PyAirbyte, you can extract data from multiple connectors supported by Airbyte, enabling the creation of custom data pipelines tailored to your distinct needs. This empowers you to use Airbyte's capabilities within your development workflows.
- Transformation with dbt: Airbyte provides seamless integration with robust data transformation tools like dbt. This allows you to utilize transformation capabilities within your data migration pipelines, ensuring data is appropriately cleaned, normalized, and enriched as needed.
- Robust Security: Airbyte prioritizes security with certifications like SOC 2, GDPR, and ISO and offers HIPAA compliance through its Conduit solution. This contributes to a secure and compliant data migration.
Final Words!
Python offers various data structures tailored for effective data analysis. These include lists, tuples, dictionaries, sets, and stacks, each serving unique purposes and accommodating various data manipulation needs. For example, Lists and Tuples provide ordered collections, while Dictionaries offer key-value pair mappings, and Sets ensure uniqueness. By leveraging these versatile Python data structures, you can manipulate and extract insights from data sets of varying complexities, contributing to streamlined data analysis workflows.