11 Types Of Python Data Structures For Data Analysis
Data professionals often discover that their most significant bottlenecks aren't complex algorithms or advanced machine learning models, but rather the fundamental challenge of efficiently organizing and manipulating data using the right structures. Python data structures serve as the foundation for all data analysis workflows, yet many practitioners struggle with performance issues, memory inefficiencies, and integration complexities that can cripple large-scale operations.
Python's basic data structures fall into mutable types (lists, dictionaries, and sets) and immutable types (tuples). Understanding how each of them behaves is essential for efficient data manipulation and problem-solving, whether you are just getting started in data science or building production-grade analysis workflows.
This comprehensive guide explores essential Python data structures designed for data analysis, covering both fundamental concepts and advanced optimization techniques that address real-world performance challenges.
What Are Data Structures?
Data structures are the foundation for organizing and storing data efficiently in a computer's memory. They allow for efficient access, manipulation, and retrieval of data, and a solid grasp of them is fundamental to programming and software development. Here are some common data structures (a short Python sketch follows this list):
- Array: A collection of elements of the same type stored in contiguous memory. Once created with a fixed size, it typically cannot be resized.
- Tree: Represents and organizes data in a hierarchical format. The top node of the tree is the root, and nodes below it are child nodes.
- Graph: A non-linear structure composed of vertices (nodes) and edges (lines/arcs) that establish connections between nodes.
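For orientation, here is a minimal Python sketch of these three ideas, using the standard-library array module, a hypothetical TreeNode class, and a plain dictionary as an adjacency list:
from array import array
# Array: fixed-type elements in contiguous storage
measurements = array('d', [1.5, 2.5, 3.5])  # 'd' = double-precision floats
# Tree: a root node holding references to child nodes
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.children = []
root = TreeNode("root")
root.children.append(TreeNode("child"))
# Graph: each vertex mapped to the neighbors its edges reach
graph = {"A": ["B", "C"], "B": ["C"], "C": []}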
What Are Python Data Structures and How Do They Work?
Python's built-in data structures fall into two categories, mutable and immutable, as the short example after this list shows:
- Mutable: lists, dictionaries, sets (can be changed after creation).
- Immutable: tuples (cannot be modified once created).
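A quick illustration of the difference:
# Lists accept in-place changes; tuples raise an error
scores = [85, 90, 78]
scores[0] = 88            # fine - lists are mutable
point = (3, 4)
try:
    point[0] = 5          # not allowed - tuples are immutable
except TypeError as error:
    print("Cannot modify a tuple:", error)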
Third-party packages add more structures, such as DataFrames and Series in Pandas or arrays in NumPy.
Choosing the right structure for the task at hand is what makes data manipulation code both efficient and maintainable.
Lists
Lists are dynamic, mutable arrays that can contain heterogeneous elements, making them the general-purpose workhorse for data processing tasks. Recent Python releases have continued to refine list performance, but the operations below behave the same across versions.
# Define a list
demo_list = [1, 2, 3, 4, 5]
# Append an element
demo_list.append(6)
print("After append:", demo_list)
# Insert at index
demo_list.insert(2, 8)
print("After insert:", demo_list)
# Extend with another list
demo_list.extend([9, 10, 11])
print("After extend:", demo_list)
# Index of an element
index = demo_list.index(4)
print("Index of 4:", index)
# Remove an element
demo_list.remove(3)
print("After remove:", demo_list)
Dictionaries
Dictionaries store ordered, mutable key: value pairs (insertion order has been preserved since Python 3.7). Python 3.9 introduced the merge operator (|) and the update operator (|=) for more concise merging and updating.
# Define a dictionary
my_dict = {
    "name": "Siya",
    "age": 26,
    "city": "New York"
}
# Python 3.9+ merge operators
additional_info = {"profession": "Data Scientist", "experience": 5}
complete_dict = my_dict | additional_info # Merge operator
my_dict |= additional_info # Update operator
print("Name:", my_dict["name"])
print("Age:", my_dict["age"])
print("City:", my_dict["city"])
Common methods: clear(), copy(), fromkeys(), pop(), values(), update().
Special type: defaultdict (auto-creates default values for missing keys).
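A minimal defaultdict sketch, grouping values by key without pre-initializing anything:
from collections import defaultdict
# Missing keys are created automatically with an empty list
grouped = defaultdict(list)
for category, value in [("a", 1), ("b", 2), ("a", 3)]:
    grouped[category].append(value)
print(grouped)  # defaultdict(<class 'list'>, {'a': [1, 3], 'b': [2]})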
Sets
Sets store unique, unordered elements and are optimized for membership testing with O(1) average-case performance.
my_set = {1, 2, 3, 4, 5}
my_set.add(6)
print("After adding 6:", my_set)
my_set.discard(3)
print("After removing 3:", my_set)
print("Is 2 in the set?", 2 in my_set)
new_set = {4, 5, 6, 7, 8}
print("Union:", my_set.union(new_set))
print("Intersection:", my_set.intersection(new_set))
Key methods: add(), clear(), discard(), union(), pop().
Tuples
Tuples are immutable sequences that provide memory-efficient storage and can serve as dictionary keys when they contain only hashable elements.
my_tuple = (1, 2, 3, 4, 5, 6, 3, 3)
print("Count of 3:", my_tuple.count(3))
print("Index of 4:", my_tuple.index(4))
# Named tuples give tuple fields descriptive names and type hints
from typing import NamedTuple

class DataPoint(NamedTuple):
    timestamp: str
    value: float
    category: str

data = DataPoint("2024-01-01", 42.5, "sales")
print(f"Value: {data.value}, Category: {data.category}")
Methods: count(), index().
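Because tuples are hashable (as long as their elements are), they can also serve as dictionary keys, as in this small sketch:
# (latitude, longitude) pairs used as dictionary keys
city_by_coords = {
    (40.71, -74.01): "New York",
    (51.51, -0.13): "London",
}
print(city_by_coords[(40.71, -74.01)])  # New York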
What User-Defined Python Data Structures Are Available?
Stack
A stack follows the Last-In-First-Out (LIFO) principle and is essential for parsing operations and recursive algorithm implementations.
Method 1: list
stack = []
stack.append('k')
stack.append('l')
stack.append('m')
print(stack) # ['k', 'l', 'm']
print(stack.pop()) # 'm'
print(stack.pop()) # 'l'
print(stack.pop()) # 'k'
Method 2: collections.deque
from collections import deque
stack = deque()
stack.append('x')
stack.append('y')
stack.append('z')
print(stack) # deque(['x', 'y', 'z'])
print(stack.pop()) # 'z'
print(stack.pop()) # 'y'
print(stack.pop()) # 'x'
Linked Lists
Linked lists provide dynamic memory allocation and are useful when frequent insertions and deletions are required at arbitrary positions.
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    # Insert at beginning
    def insertAtBeginning(self, new_data):
        new_node = Node(new_data)
        new_node.next = self.head
        self.head = new_node

    def display(self):
        elements = []
        current = self.head
        while current:
            elements.append(current.data)
            current = current.next
        return elements
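A short usage example of the class above:
linked_list = LinkedList()
linked_list.insertAtBeginning(3)
linked_list.insertAtBeginning(2)
linked_list.insertAtBeginning(1)
print(linked_list.display())  # [1, 2, 3]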
Queues
Queues follow the First-In-First-Out (FIFO) principle and are fundamental for breadth-first search algorithms and task-scheduling systems.
Method 1: list (simple, but pop(0) is O(n) because the remaining elements shift left)
queue = []
queue.append(5)
queue.append(7)
queue.append(9)
print(queue) # [5, 7, 9]
print(queue.pop(0)) # 5
print(queue) # [7, 9]
Method 2: collections.deque (O(1) appends and pops at both ends).
from collections import deque
queue = deque()
queue.append(1)
queue.append(2)
queue.append(3)
print(queue.popleft()) # 1 - O(1) operation
Heaps (heapq)
Heaps provide efficient priority-queue operations; Python's heapq module implements a binary min-heap on top of a list and is essential for algorithms that require ordered data processing.
import heapq
heap = []
heapq.heappush(heap, 5)
heapq.heappush(heap, 3)
heapq.heappush(heap, 7)
heapq.heappush(heap, 1)
print(heapq.heappop(heap)) # 1
print(heapq.nlargest(2, heap)) # [7, 5]
print(heapq.nsmallest(2, heap)) # [3, 5]
Functions: heapify(), heappush(), heappop(), nlargest(), nsmallest().
How Do Data Analysis Libraries Enhance Python Data Structures?
NumPy Arrays
NumPy arrays provide vectorized operations and memory-efficient storage for numerical computing, offering significant performance advantages over Python lists for mathematical operations.
import numpy as np
arr1 = np.array([1, 2, 3, 4])
print("NumPy array:", arr1)
arr2 = np.zeros(5)
print("Zeros:", arr2)
arr3 = np.arange(1, 5)
print("Range:", arr3)
# Vectorized operations (much faster than loops)
arr4 = np.array([1, 2, 3, 4])
arr5 = np.array([5, 6, 7, 8])
result = arr4 * arr5 # Element-wise multiplication
print("Vectorized multiplication:", result)
Common constructors: np.array(), np.zeros(), np.arange(), np.ones(), np.linspace().
Pandas Series
Series provide labeled data with automatic alignment and missing data handling capabilities.
import pandas as pd
courses = pd.Series(["Hadoop", "Spark", "Python", "Oracle"])
print(courses[2]) # Python
# Series with custom index
sales_data = pd.Series([100, 150, 200], index=['Jan', 'Feb', 'Mar'])
print(sales_data['Feb']) # 150
Common attributes and methods: size, head(), tail(), unique(), value_counts(), fillna().
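The automatic label alignment and missing-data handling mentioned above, in a brief sketch:
q1 = pd.Series([100, 150], index=['Jan', 'Feb'])
q2 = pd.Series([120, 180], index=['Feb', 'Mar'])
total = q1.add(q2, fill_value=0)  # aligns on index labels; absent months count as 0
print(total)  # Jan 100.0, Feb 270.0, Mar 180.0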
DataFrames
DataFrames offer two-dimensional labeled data structures with integrated data analysis capabilities, making them ideal for structured data manipulation.
import pandas as pd
data = {
    "Score": [580, 250, 422],
    "id": [22, 37, 55],
    "category": ["A", "B", "A"]
}
df = pd.DataFrame(data)
print(df)
# Advanced DataFrame operations
grouped = df.groupby('category')['Score'].mean()
print("Average score by category:")
print(grouped)
Methods: pop(), tail(), to_numpy(), head(), groupby(), merge(), pivot_table().
Counter (collections.Counter)
Counter provides efficient counting capabilities and statistical analysis of categorical data.
from collections import Counter
colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
color_counts = Counter(colors)
print(color_counts) # Counter({'blue': 3, 'red': 2, 'green': 1})
print(color_counts.most_common(2)) # [('blue', 3), ('red', 2)]
# Counter arithmetic operations
more_colors = Counter(['red', 'red', 'yellow'])
combined = color_counts + more_colors
print("Combined counts:", combined)
Methods: elements(), subtract(), update(), most_common().
String
String operations are fundamental for text data preprocessing and natural language processing workflows.
text = " Data Analysis with Python "
cleaned = text.strip()
words = cleaned.split()
print(cleaned.lower()) # data analysis with python
print(cleaned.replace("Python", "R")) # Data Analysis with R
# Advanced string operations
import re
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
sample_text = "Contact us at info@company.com or sales@company.org"
emails = re.findall(email_pattern, sample_text)
print("Extracted emails:", emails)
Methods: split(), strip(), replace(), upper(), lower(), join(), startswith(), endswith().
Matrix (NumPy)
Matrix operations enable linear algebra computations essential for machine learning and statistical analysis.
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
print("Original matrix:")
print(matrix)
print("Transpose:")
print(matrix.transpose())
print("Matrix multiplication:")
print(matrix.dot(matrix))
# Statistical operations
print("Mean:", matrix.mean())
print("Standard deviation:", matrix.std())
Methods and functions: transpose(), dot(), reshape(), sum(), mean(), std(), and np.linalg.inv() for matrix inversion.
How Do Modern Python Data Structure Advancements Address Performance Challenges?
Recent Python versions (3.9-3.13) have introduced significant improvements that directly address performance bottlenecks commonly faced by data professionals. Understanding these advancements enables you to write more efficient data processing code and avoid common performance pitfalls.
Dictionary Performance Enhancements
Python 3.9's merge (|) and update (|=) operators streamline dictionary handling compared to older approaches such as {**a, **b} or chained update() calls. They provide cleaner syntax with comparable performance, which matters in data pipelines where dictionary merging is frequent.
# Efficient dictionary merging (Python 3.9+)
config_defaults = {"timeout": 30, "retries": 3}
user_config = {"timeout": 60, "debug": True}
# New approach - cleaner and efficient
final_config = config_defaults | user_config
# Bulk updates for data transformation
data_batch = [{"id": 1, "value": 100}, {"id": 2, "value": 200}]
lookup_cache = {}
for item in data_batch:
    lookup_cache |= {item["id"]: item["value"]}
Structural Pattern Matching for Data Processing
Python 3.10's pattern matching capabilities enable more efficient data structure decomposition, particularly valuable for processing heterogeneous data formats common in data engineering workflows.
# Pattern matching for complex data structures
def process_data_record(record):
    match record:
        case {"type": "user", "id": user_id, "data": user_data}:
            return process_user_data(user_id, user_data)
        case {"type": "transaction", "amount": amount, "currency": "USD"}:
            return process_usd_transaction(amount)
        case {"type": "error", "message": msg}:
            return handle_error(msg)
        case _:
            return handle_unknown_format(record)
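A brief usage sketch; the handler functions below are hypothetical stubs standing in for real processing logic:
# Hypothetical stubs so the dispatcher above can be exercised end to end
def process_user_data(user_id, user_data):
    return f"user {user_id}"

def process_usd_transaction(amount):
    return f"charged ${amount}"

def handle_error(msg):
    return f"error: {msg}"

def handle_unknown_format(record):
    return "unknown record"

print(process_data_record({"type": "transaction", "amount": 99.0, "currency": "USD"}))
# charged $99.0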
Memory Optimization Techniques
Modern Python versions include improved memory management for data structures, but additional optimization techniques can dramatically reduce memory usage in data-intensive applications.
Generator expressions and iterator chains prevent memory bottlenecks when processing large datasets:
# Memory-efficient data processing
import json

def process_large_dataset(filename):
    # Generator expression - processes one line at a time
    cleaned_lines = (line.strip().lower() for line in open(filename)
                     if line.strip() and not line.startswith('#'))
    # Chained processing without intermediate storage
    parsed_data = (json.loads(line) for line in cleaned_lines)
    valid_records = (record for record in parsed_data
                     if 'required_field' in record)
    return valid_records

# Type-specific optimizations
import array

# array.array stores homogeneous numeric data far more compactly than a list of ints
numeric_data = array.array('i', [1, 2, 3, 4, 5])  # 'i' for signed integers
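To gauge the saving on your own machine, compare the two containers directly (a rough sketch; exact numbers vary by platform, and the list additionally keeps every int as a separate object):
import sys
nums = list(range(100_000))
packed = array.array('i', nums)
print(sys.getsizeof(nums))            # size of the list's pointer array (~800 KB)
print(packed.itemsize * len(packed))  # raw integer payload in the array (~400 KB)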
Concurrent Processing Optimizations
While the Global Interpreter Lock (GIL) limits CPU-bound parallelism, modern Python provides several strategies for improving data structure operations in concurrent scenarios:
import asyncio
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

import aiohttp  # third-party HTTP client used for async I/O

# Async processing for I/O-bound data operations
# (fetch_single_url is a placeholder for your own request coroutine)
async def fetch_and_process_data(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_single_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

# Process-based parallelism for CPU-intensive data transformations
# (transform_chunk is a placeholder for your own transformation function)
def parallel_data_transform(data_chunks):
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(transform_chunk, data_chunks))
    return results
How Can You Integrate Python Data Structures in Data Engineering Workflows?
Modern data engineering requires seamless integration between Python's native data structures and specialized libraries designed for large-scale data processing. Understanding these integration patterns enables you to build efficient, scalable data pipelines that leverage the best characteristics of each structure type.
Streaming Data Processing Integration
Python data structures integrate effectively with streaming data architectures through generator-based processing and asynchronous patterns. Generators enable memory-efficient processing of unbounded data streams, while dictionaries provide fast lookup capabilities for real-time data enrichment.
import asyncio
from collections import defaultdict, deque

async def stream_processor(data_source):
    # Use deque for efficient sliding-window operations
    window = deque(maxlen=1000)
    # Dictionary for fast lookup during data enrichment
    lookup_cache = {}
    # Set for fast membership testing of already-seen record ids
    processed_ids = set()
    # Process streaming data as it arrives
    async for record in data_source:
        if record['id'] not in processed_ids:
            processed_ids.add(record['id'])
            enriched_record = enrich_data(record, lookup_cache)
            window.append(enriched_record)
            yield enriched_record

def enrich_data(record, cache):
    # Efficient dictionary-based enrichment
    return {**record, "enriched_field": cache.get(record['key'], 'default')}
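A minimal driver for the generator above, using a fake async data source purely for illustration:
async def fake_source():
    for i in range(3):
        yield {"id": i, "key": "k"}

async def main():
    async for item in stream_processor(fake_source()):
        print(item)

asyncio.run(main())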
Batch Processing and ETL Pipeline Integration
Python data structures serve as intermediate storage and transformation layers in ETL pipelines, particularly when integrated with frameworks like Pandas, Dask, and PySpark. Lists and dictionaries handle metadata and configuration, while DataFrames manage the bulk data transformations.
import pandas as pd
from typing import Dict, List

def batch_etl_pipeline(source_config: Dict, transformations: List):
    # Use dictionaries for configuration management
    connection_params = {
        'host': source_config['database']['host'],
        'credentials': source_config['database']['credentials']
    }
    # Lists for transformation pipeline management
    applied_transformations = []
    # Load data into a DataFrame for vectorized operations
    # (pd.read_sql expects a live DB connection or SQLAlchemy engine;
    #  create_connection is a hypothetical helper that builds one from
    #  connection_params)
    df = pd.read_sql(source_config['query'], create_connection(connection_params))
    # Apply transformations using list iteration
    for transform_func in transformations:
        df = transform_func(df)
        applied_transformations.append(transform_func.__name__)
    # Dictionary for pipeline metadata
    pipeline_metadata = {
        'processed_records': len(df),
        'transformations_applied': applied_transformations,
        'processing_timestamp': pd.Timestamp.now()
    }
    return df, pipeline_metadata
Data Validation and Schema Management
Structured data validation leverages Python data structures for schema definition and validation logic, ensuring data quality throughout the pipeline. Dictionaries define schemas while sets enable efficient validation of allowed values.
from collections import defaultdict
from typing import Set, Dict, Any
import pandas as pd

class DataValidator:
    def __init__(self, schema_config: Dict):
        self.required_fields: Set[str] = set(schema_config['required'])
        self.field_types: Dict[str, type] = schema_config['types']
        self.allowed_values: Dict[str, Set] = {
            field: set(values)
            for field, values in schema_config.get('enums', {}).items()
        }

    def validate_batch(self, data: pd.DataFrame) -> Dict[str, Any]:
        validation_results = {
            'valid_records': 0,
            'errors': defaultdict(list),
            'field_stats': {}
        }
        # Check required fields using set operations
        missing_fields = self.required_fields - set(data.columns)
        if missing_fields:
            validation_results['errors']['missing_fields'] = list(missing_fields)
        # Validate allowed values using set membership
        for field, allowed in self.allowed_values.items():
            if field in data.columns:
                invalid_values = set(data[field].dropna()) - allowed
                if invalid_values:
                    validation_results['errors'][f'invalid_{field}'] = list(invalid_values)
        return validation_results
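A short usage sketch with a made-up schema and a tiny DataFrame:
schema = {
    'required': ['id', 'status'],
    'types': {'id': int, 'status': str},
    'enums': {'status': ['active', 'inactive']}
}
validator = DataValidator(schema)
df = pd.DataFrame({'id': [1, 2], 'status': ['active', 'archived']})
print(validator.validate_batch(df)['errors'])  # flags 'archived' as an invalid status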
High-Performance Data Structure Selection
Different data processing scenarios require optimal data structure selection for performance. Understanding when to use each structure type prevents bottlenecks in production workflows.
import numpy as np
from collections import Counter, defaultdict, deque

def optimize_data_structures(data_characteristics):
    """Select optimal data structures based on data characteristics."""
    recommendations = {}
    # For frequent lookups: dictionary vs set vs list
    if data_characteristics['lookup_heavy']:
        if data_characteristics['key_value_pairs']:
            recommendations['primary'] = 'dictionary'
        else:
            recommendations['primary'] = 'set'
    # For numerical operations: NumPy arrays
    if data_characteristics['numerical_computation']:
        recommendations['numerical'] = 'numpy_array'
    # For counting operations: Counter
    if data_characteristics['frequency_analysis']:
        recommendations['counting'] = 'collections.Counter'
    # For queue operations: deque vs list
    if data_characteristics['queue_operations']:
        recommendations['queue'] = 'collections.deque'
    return recommendations

# Example usage for different scenarios
def process_by_scenario(data, scenario_type):
    if scenario_type == 'real_time_analytics':
        # Use deque for sliding windows, dict for fast lookups
        window = deque(maxlen=10000)
        lookup_index = {}
    elif scenario_type == 'batch_aggregation':
        # Use Counter for frequency analysis, defaultdict for grouping
        frequency_counter = Counter()
        grouped_data = defaultdict(list)
    elif scenario_type == 'numerical_analysis':
        # Use NumPy arrays for vectorized operations
        numeric_array = np.array(data)
        return np.mean(numeric_array), np.std(numeric_array)
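For example, profiling a lookup-heavy workload that also needs frequency analysis:
profile = {
    'lookup_heavy': True,
    'key_value_pairs': True,
    'numerical_computation': False,
    'frequency_analysis': True,
    'queue_operations': False,
}
print(optimize_data_structures(profile))
# {'primary': 'dictionary', 'counting': 'collections.Counter'}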
Integration with Modern Data Engineering Tools
Python data structures integrate seamlessly with modern data engineering tools and platforms, enabling efficient data exchange and processing across different components of the data stack.
# Integration with Apache Arrow for efficient data exchange
import pyarrow as pa
import pandas as pd

def arrow_integration_example(pandas_df):
    # Convert the DataFrame to an Arrow Table for efficient serialization
    arrow_table = pa.Table.from_pandas(pandas_df)
    # Zero-copy conversion back to pandas when needed
    converted_df = arrow_table.to_pandas()
    # Columnar access ('column_name' stands in for a real column in your data)
    column_data = arrow_table['column_name'].to_pylist()
    return column_data

# Integration with distributed processing frameworks (PySpark)
def spark_integration_pattern(spark_session, data):
    # Convert Python data structures to a Spark DataFrame
    # (the schema is given as a DDL string, which createDataFrame accepts)
    schema = 'id INT, value DOUBLE, category STRING'
    spark_df = spark_session.createDataFrame(data, schema=schema)
    # Process using Spark's distributed capabilities
    result = spark_df.groupBy('category').agg({'value': 'avg'})
    # Convert back to Python data structures for further processing
    python_result = [row.asDict() for row in result.collect()]
    return python_result
Mastering these Python data structures, from built-ins like lists and dictionaries to library-based structures such as NumPy arrays and Pandas DataFrames, is the foundation of efficient, scalable data analysis. Combined with the performance optimizations and integration patterns covered above, they equip you to handle everything from real-time streaming data to large-scale batch processing.