Python Pandas DataFrame: Examples & Use Cases For ETL

Jim Kutz
August 29, 2025
20 min read

A Python DataFrame, part of the Pandas library, is a powerful and flexible data structure for working with structured data. Whether you're handling small or large datasets, a DataFrame enables you to efficiently organize, manipulate, and analyze your data. With its tabular design, you can manage data in rows and columns, simplifying tasks like filtering, transforming, and aggregating.

This article explores the full potential of Python DataFrames, helping you effectively use them for data integration, processing, and analysis tasks.

What Is a Python DataFrame and How Does It Work?

A Python DataFrame is a two-dimensional data structure with labeled axes (rows and columns). It is offered by the Pandas library, an open-source Python library that enables you to perform data manipulation and analysis.

Within a DataFrame, you can perform operations such as column addition and deletion, row addition and deletion, column renaming, and row slicing.

Let's look at a simple illustration of a DataFrame to understand its key components:

Pandas DataFrame

In this Pandas DataFrame: 

  • Column Label/Header: These are the names of the columns, such as Name, Age, Marks, Grade, and Hobby, in the DataFrame.
  • Index/Row Label: Labels, such as S1, S2, S3, S4, and S5, are assigned to rows, helping identify each row in the DataFrame.
  • Column Index: Refers to the position (0, 1, 2, …, N−1) of each column in the DataFrame.
  • Row Index: Refers to the position (0, 1, 2, …, N−1) of each row.
  • Row: Represents a single data record.
  • Column: A collection of data points corresponding to a specific feature or attribute.
  • Element/Value/Entry: A single data value located at the intersection of a row and column.
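The components above can be sketched in code; the values below are illustrative stand-ins, since the figure itself isn't reproduced here:

```python
import pandas as pd

# A small DataFrame mirroring the components described above.
# The column labels and S1-S5 row labels match the illustration;
# the cell values are made up for the example.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Cara", "Dev", "Ella"],
        "Age": [14, 15, 14, 16, 15],
        "Marks": [88, 92, 79, 85, 90],
        "Grade": ["A", "A", "B", "A", "A"],
        "Hobby": ["Chess", "Art", "Music", "Chess", "Dance"],
    },
    index=["S1", "S2", "S3", "S4", "S5"],  # Index/Row labels
)

print(df)
print(df.loc["S2", "Marks"])  # a single element at a row/column intersection
```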

How Can You Create a DataFrame in Python Pandas?

Creating a DataFrame in Python Pandas is simple and can be accomplished through various methods. Let's start with generating an empty DataFrame and then explore different creation techniques.

Prerequisites for getting started include:

  • Install the latest version of Python and Anaconda Navigator.
  • Install Pandas using the pip install pandas command in the Anaconda prompt (Skip this if Pandas is already available).

Generating an empty DataFrame

  1. Open Jupyter Notebook in your browser or development environment.
  2. Create a new notebook by selecting File, then New, then Notebook, and choose the Python kernel.
  3. Import the Pandas library using the following code:
import pandas as pd1
  4. Create an empty DataFrame with this simple command:
df1 = pd1.DataFrame()
  5. You can now print the DataFrame df1 to the console:
print(df1)

After executing the above code, you will get the following output:

Empty DataFrame

Creating DataFrames from Python Lists

A list in Python is an ordered collection of items, which can be of any data type. Lists are mutable and are indexed starting from 0. Let’s see an example of creating a DataFrame from a list:

import pandas as pd1
# A Python list
list1 = [1,'XYZ',30,40000.00,'Engineer']
df1 = pd1.DataFrame(list1)
print(df1)

Output:

DataFrame from a List

You can also use a list of lists to create a DataFrame by passing it as data, with each inner list representing a row:

import pandas as pd1
# A list of lists
list1 = [[1, 'Alex', 29], [2, 'Jacob', 31], [3, 'Sam', 25]]
df1 = pd1.DataFrame(list1, columns=['ID', 'Name', 'Age'])
print(df1)

Output:

DataFrame from a Nested List

Building DataFrames from Dictionary Structures

A dictionary in Python is a collection of key-value pairs; since Python 3.7, dictionaries preserve insertion order. The keys are unique, and each key is associated with a value. To create a DataFrame from a dictionary of lists, the keys serve as column labels, and each associated list becomes the column data:

import pandas as pd1
# A dictionary of lists
my_dict = {'Name': ['Alex','Jacob'], 'Age': [30, 31]}
df1 = pd1.DataFrame(my_dict)
print(df1)

Output:

DataFrame from a Dictionary

Constructing a DataFrame from Pandas Series Objects

A Pandas Series is a one-dimensional array-like object that can hold various data types. You can combine Series objects in a dictionary to create a DataFrame, with each key becoming a column label:

import pandas as pd1
data1 = {'ID': pd1.Series([1, 2]), 'Name': pd1.Series(['Alex', 'Jacob'])}
df1 = pd1.DataFrame(data1)
print(df1)

Output:

DataFrame from a Dictionary of Pandas Series

For more control over your DataFrame structure, you can customize row and column index labels:

import pandas as pd1
data1 = {'ID': pd1.Series([1, 2], index=['R1', 'R2']), 'Name': pd1.Series(['Alex', 'Jacob'], index=['N1', 'N2'])}
df1 = pd1.DataFrame(data1)
print(df1)

Output:

Customize Index Labels

Constructing a DataFrame from a NumPy Array

A NumPy array is an N-dimensional array object used for numerical computation in Python. You can create a DataFrame by passing a NumPy array along with column labels:

# import numpy library
import numpy as np
import pandas as pd
# Creating a 2D numpy array of temperature and humidity data
data = np.array([[25, 55], [18, 45], [42, 22]])
df = pd.DataFrame(data, columns=['Temperature', 'Humidity'])
print(df)

Output:

DataFrame from a NumPy Array

Creating DataFrames from Existing DataFrames

You can easily create a new DataFrame by filtering an existing DataFrame. This allows you to reuse the structure and data of another DataFrame.

import pandas as pd
# Creating the first DataFrame with a single column 'Quantity' and custom index labels
df1 = pd.DataFrame([10, 20, 30], index=["Apple", "Banana", "Cherry"], columns=["Quantity"])
# Now, create the second DataFrame by selecting specific index labels (Apple and Cherry)
df2 = pd.DataFrame(df1, index=["Apple", "Cherry"])
# Displaying the second DataFrame
print(df2)

Output:

DataFrame from Another DataFrame

How Can You Load Different Data Formats Into Python Pandas DataFrames?

Pandas provides simple and efficient functions to load data from multiple file formats directly into DataFrames. These functions handle the complexities of parsing different file structures and provide consistent DataFrame outputs.

Loading CSV Files into DataFrames

CSV is a widely used format that helps you store tabular data. You can load CSV files into a Pandas DataFrame using the read_csv() function.

Let’s create a simple example in Jupyter Notebook by considering a sample CSV file (demo.csv) that contains the following data:

order_date,"price","item","brand"
2024-11-01,1200.20,"Dress","Hollister"
2024-11-02,2600.00,"Jeans","Levi's"

Follow the steps below:

  1. Open Jupyter Notebook and click File > New > Text File.
  2. Save the above CSV data in the file as demo.csv.
  3. In your notebook, click File > New > Notebook and run the following code to load the CSV file into a DataFrame:
import pandas as pd
# Read data from file demo.csv
csv_df = pd.read_csv("demo.csv")
  4. To see the loaded data in the DataFrame, use the following syntax:
print(csv_df)

Output:

Reading CSV
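read_csv() also accepts options that help at load time. Here is a small sketch using the same demo.csv content supplied in-memory (via io.StringIO) so it runs anywhere; parse_dates and dtype are standard read_csv parameters:

```python
import io
import pandas as pd

# The demo.csv content, supplied in-memory so the example is self-contained.
csv_text = """order_date,price,item,brand
2024-11-01,1200.20,Dress,Hollister
2024-11-02,2600.00,Jeans,Levi's
"""

# parse_dates converts order_date to datetime64; dtype pins price to float64.
csv_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["order_date"],
    dtype={"price": "float64"},
)
print(csv_df.dtypes)
```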

Processing JSON Data Files

JSON is a popular format for data exchange and storing data in key-value pairs. You can load JSON data into a DataFrame using the read_json() function:

Let us consider a JSON file named demo.json:

[
    {
        "Name": "Alex",
        "Age": 30,
        "Department": "HR"
    },
    {
        "Name": "Jacob",
        "Age": 25,
        "Department": "Finance"
    },
    {
        "Name": "Sam",
        "Age": 35,
        "Department": "Engineering"
    }
]

Run the following Python code to load this JSON file into a DataFrame:

import pandas as pd
# Load data from a JSON file
json_df = pd.read_json('demo.json')
print(json_df)

Output:

Reading JSON
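When JSON records are nested, pd.json_normalize() can flatten them into columns before you work with them as a DataFrame. A small sketch with made-up records:

```python
import pandas as pd

# Nested records like these are common in API responses; read_json alone
# would leave the "address" field as a dict column, so json_normalize
# flattens it into dotted column names.
records = [
    {"Name": "Alex", "address": {"city": "Boston", "zip": "02101"}},
    {"Name": "Jacob", "address": {"city": "Denver", "zip": "80201"}},
]

flat_df = pd.json_normalize(records)
print(flat_df)  # columns: Name, address.city, address.zip
```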

Working with Excel Files

Pandas also supports loading data from Excel files (.xls and .xlsx) using the read_excel() function. This function allows you to specify individual sheets within an Excel workbook to load into a DataFrame.

Let us take the example of loading an Excel file named employees.xlsx into Pandas DataFrame in Python using Jupyter Notebook. The file consists of two worksheets.

  1. First, upload the Excel file into the Jupyter Notebook. To do this, go to View > File Browser > Upload and browse your Excel file and click Open.
  2. Next, run the following Python code to load data from the specific sheet:
import pandas as pd
# Load data from Sheet2 in the Excel file
excel_df = pd.read_excel('employees.xlsx', sheet_name='Sheet2')
print(excel_df)

Output:

Reading Excel

Loading Parquet Files

Parquet is a columnar storage file format that enables you to store large datasets efficiently. Pandas lets you read Parquet files using the read_parquet() function, which requires the pyarrow or fastparquet package.

To load a Parquet file userdata1.parquet into a DataFrame, you must upload it to the Jupyter Notebook using the same steps as in the previous section. Once you add the Parquet file to the directory, you can run the following code:

import pandas as pd
# Load data from a Parquet file
parquet_df = pd.read_parquet('userdata1.parquet')

Let’s preview only the first two rows of the large Parquet file using head(2) on the DataFrame.

print(parquet_df.head(2))

Output:

Reading Parquet

Processing XML Data

XML (Extensible Markup Language) is commonly used to store and exchange data. Pandas lets you read XML files using the read_xml() function.

Let’s first create the following XML file named demo.xml in Jupyter Notebook:

<?xml version="1.0" encoding="UTF-8"?>
<Employees>
    <Employee>
        <Name>Alex</Name>
        <Age>30</Age>
        <Department>HR</Department>
    </Employee>
    <Employee>
        <Name>Jacob</Name>
        <Age>25</Age>
        <Department>Finance</Department>
    </Employee>
</Employees>

Now, run the following Python code to load the XML file into a DataFrame:

import pandas as pd
# Load data from an XML file
xml_df = pd.read_xml('demo.xml')
print(xml_df)

Output:

Reading XML

What Are the Essential Python Pandas Example Operations for DataFrame Manipulation?

DataFrame manipulation forms the core of data analysis workflows. These operations allow you to explore, clean, and transform your data to extract meaningful insights.

Viewing and Exploring Your Data

Understanding your data structure is the first step in any analysis. Pandas provides several methods to examine DataFrame contents and characteristics:

  • print(): Prints the entire DataFrame.
  • head(n): Displays the first n rows of the DataFrame; n defaults to 5.
  • tail(n): Shows the last n rows; n also defaults to 5.
  • info(): Provides concise information about the DataFrame, including the number of entries, column names, data types, and non-null counts per column.
  • describe(): Displays summary statistics, such as count, mean, standard deviation, min, max, and the 25th, 50th (median), and 75th percentiles, for numerical columns.

Here is a simple example to understand how to view data in a DataFrame:

import pandas as pd
data = {
    "Name": ["Alex", "Jacob", "Sam", "David", "George"],
    "Age": [25, 30, 35, 40, 22],
    "Salary": [50000, 60000, 55000, 65000, 48000],
    "Department": ["HR", "Finance", "IT", "Marketing", "HR"]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Print the entire DataFrame
print("\nEntire DataFrame:")
print(df)

# Display the first 3 rows
print("Head (first 3 rows):")
print(df.head(3))

# Display the last 2 rows
print("\nTail (last 2 rows):")
print(df.tail(2))

# Display concise information about the DataFrame
print("\nInfo:")
df.info()

# Display summary statistics for numerical columns
print("\nDescribe:")
print(df.describe())

Output:

View Data

Selecting Specific Columns and Rows

You can access a column in a DataFrame using bracket notation (df['column_name']) or dot notation (df.column_name).

Here’s an example of selecting a column from a DataFrame:

import pandas as pd

data = {
    "Name": ["Alex", "Jacob", "Sam"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 55000]
}

df = pd.DataFrame(data)

# Access column using bracket notation
print("Using Bracket Notation:")
print(df['Name'])

# Access column using dot notation
print("\nUsing Dot Notation:")
print(df.Name)

Output:

Select Data

You can also select rows and columns in a DataFrame using two methods:

  • Integer-Location Based Indexing (.iloc): Helps you select rows and columns by their index positions starting from 0.
  • Label-Based Indexing (.loc): Allows you to select rows and columns based on their explicit labels or boolean conditions.

Update the above example by adding the following code for row selection:

# Select the first row using iloc
print("\nSelect the first row using iloc:")
print(df.iloc[0])

# Select the row with index label 1 (if set) using loc
print("\nSelect the row with index label 1 using loc:")
print(df.loc[1])

# Select second column "Age" using iloc
print("Column selected using iloc:")
print(df.iloc[:, 1])  

# Select column "Age" using loc
print("\nColumn selected using loc:")
print(df.loc[:, 'Age'])

Output:

Select Data Using Different Methods

Adding and Removing DataFrame Elements

Dynamic data modification capabilities make DataFrames flexible for evolving analysis needs. You can add a new column by assigning a sequence with one value per existing row:

df['new_column_name'] = [value1, value2, ...]

Example:

import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
}
df = pd.DataFrame(data)
print("Before adding a column:")
print(df)

# Adding a new column "Salary"
df['Salary'] = [50000, 60000, 55000]
print("After adding a column:")
print(df)

Output:

Modify DataFrame

To remove a column, you can use the drop() function with axis=1 as follows:

# Dropping the column Salary
df = df.drop('Salary', axis=1)  
print("\nAfter removing the Salary column:")
print(df)

Output:

Dropping a Column from a DataFrame

You can add a row at the end of the DataFrame using loc[].

df.loc[len(df)] = ["David", 40]  
print("\nAfter adding a new row:")
print(df)

Output:

Adding a New Row in a DataFrame

To drop a row from the DataFrame by index, use the drop() function with axis=0.

# Removing a row at index 1
df = df.drop(1, axis=0)  
print("\nAfter removing the row with index 1:")
print(df)

Output:

Removing a Row from a DataFrame

Renaming Columns for Clarity

To rename columns in a DataFrame, use the rename() method. For the above example, run the following code:

df = df.rename(columns={'Name': 'Employee Name'})
print(df)

Output:

Rename Column

Filtering Data Based on Conditions

To filter a DataFrame, use boolean indexing, which retrieves subsets of rows based on the actual values in the DataFrame. This method lets you build filters from conditions applied to columns.

Run the following code to check the result:

filtered_df = df[df["Age"] > 35]
print(filtered_df)

Output:

Filter Data

The output filters only the information about David, whose age is greater than 35.
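Conditions can also be combined with & (and) and | (or), with each condition wrapped in parentheses. A self-contained sketch with sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Jacob", "Sam", "David"],
    "Age": [25, 30, 35, 40],
    "Department": ["HR", "Finance", "IT", "Marketing"],
})

# Combine conditions with & (and) / | (or); each condition needs parentheses
# because & and | bind more tightly than comparison operators.
subset = df[(df["Age"] >= 30) & (df["Department"] != "IT")]
print(subset)  # Jacob and David
```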

Grouping Data for Aggregation

The groupby() function enables you to group data by one or more columns. You can then apply an aggregate function such as sum, mean, min, or max.

Here’s an example of grouping data in a DataFrame:

import pandas as pd
data = {
    "City": ["New York", "Boston", "New York", "Boston"],
    "Temperature": [85, 60, 75, 55],
    "Humidity": [70, 80, 65, 85]
}
df = pd.DataFrame(data)
# Group by City and calculate the average temperature and humidity
grouped = df.groupby('City').mean()
print("Grouped by City:")
print(grouped)

Output:

Group Data
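To compute several aggregates at once, you can pass a list of function names to .agg(). A sketch using the same temperature data:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["New York", "Boston", "New York", "Boston"],
    "Temperature": [85, 60, 75, 55],
})

# .agg() applies several aggregate functions to each group in one pass.
stats = df.groupby("City")["Temperature"].agg(["mean", "min", "max"])
print(stats)
```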

Combining DataFrames Through Merging

You can merge or join DataFrames using the merge() function. This works similarly to SQL JOIN operations. Here is an example to demonstrate how to merge two DataFrames:

df1 = pd.DataFrame({'ID': [101, 102], 'Name': ['Steeven', 'Aiden']})
df2 = pd.DataFrame({'ID': [101, 102], 'Age': [15, 20]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

Output:

Merge DataFrames
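By default, merge() performs an inner join; the how parameter switches to left, right, or outer joins, just as in SQL. A sketch with a deliberately unmatched row:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [101, 102, 103], "Name": ["Steeven", "Aiden", "Maya"]})
df2 = pd.DataFrame({"ID": [101, 102], "Age": [15, 20]})

# An inner join (the default) keeps only IDs present in both frames;
# a left join keeps every row from df1 and fills missing ages with NaN.
inner = pd.merge(df1, df2, on="ID")
left = pd.merge(df1, df2, on="ID", how="left")
print(left)
```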

Sorting Data for Analysis

To sort a DataFrame by one or more columns, you can apply the sort_values() method. Let’s sort the above merged DataFrame (merged_df) by the ‘Age’ column in descending order:

sorted_df = merged_df.sort_values(by='Age', ascending=False)
print("Sorted DataFrame by Age:")
print(sorted_df)

Here, the ascending parameter is set to False to sort the data in descending order. With True (the default), the DataFrame is sorted in ascending order.

Output:

Sort Data

Managing Missing Values Effectively

Pandas enables you to handle missing data in a DataFrame using the fillna() and dropna() functions. Here’s an example of a DataFrame with missing values:

df_missing_values = pd.DataFrame({
    "Name": ["George", "Ken", None],
    "Age": [45, None, 25]
})

Now, update the program to fill missing values with a default value utilizing the fillna() function:

filled_df = df_missing_values.fillna({'Name': 'Unknown', 'Age': 30})
print("Filled missing values:")
print(filled_df)

You can also drop rows with missing values as follows:

dropped_df = df_missing_values.dropna()
print("\nDropped rows with missing values:")
print(dropped_df)

Output:

Handle Missing Values

In this output, you can see that the last two rows of the DataFrame have missing values. fillna() replaces them with the default values you specify, while dropna() removes the rows that contain them.
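Before choosing between fillna() and dropna(), it often helps to count the missing values per column with isnull().sum():

```python
import pandas as pd

df_missing_values = pd.DataFrame({
    "Name": ["George", "Ken", None],
    "Age": [45, None, 25],
})

# isnull() marks missing cells as True; summing the boolean mask
# counts the missing values in each column.
missing_counts = df_missing_values.isnull().sum()
print(missing_counts)
```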

Precise Data Location and Access

Indexing options such as .loc[] and .iloc[] help you locate specific data by label or position.

Example:

# Locate by label
loc_value = merged_df.loc[1, 'Name']  
print("Located using .loc[]:", loc_value)

# Locate by position
iloc_value = merged_df.iloc[1, 1]
print("Located using .iloc[]:", iloc_value)

Output:

Locate Data

How Do DataFrames Support ETL Operations?

DataFrames are essential in ETL (Extract, Transform, Load) operations, providing an efficient way for data handling. Here’s how DataFrames are useful in each ETL phase:

Extract

Pandas DataFrames facilitate the extraction of large datasets from various sources such as text files, CSV, Excel sheets, JSON, SQL databases, and APIs. With functions like read_csv(), read_sql(), or read_json(), you can load data into a DataFrame, simplifying the extraction step.
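As a minimal sketch of the SQL path, read_sql() pulls a query result straight into a DataFrame. This example uses a throwaway in-memory SQLite database with a hypothetical orders table standing in for a production source:

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for a production source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 120.5), (2, 99.0)])
conn.commit()

# read_sql extracts the query result directly into a DataFrame.
orders_df = pd.read_sql("SELECT * FROM orders", conn)
print(orders_df)
conn.close()
```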

Transform

You can apply numerous transformations, such as filtering, grouping, aggregating, and reshaping data, using Python DataFrames. They also help you handle missing values, convert data types, and perform complex string manipulations.

Functions such as drop_duplicates() and fillna(), along with techniques like boolean indexing, let you prepare data for analysis or further processing.
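A typical transform pass chains several of these steps together. A sketch on made-up order data:

```python
import pandas as pd

raw = pd.DataFrame({
    "item": ["dress", "jeans", "jeans", "shirt"],
    "price": [1200.2, 2600.0, 2600.0, None],
})

# Drop exact duplicate rows, fill missing prices with a default,
# and normalize the item names to title case.
clean = (
    raw.drop_duplicates()
       .fillna({"price": 0.0})
       .assign(item=lambda d: d["item"].str.title())
)
print(clean)
```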

Load

After transforming the data, DataFrames make it easy to load the cleaned data into various destinations, including databases, CSV files, or Parquet files. DataFrames provide methods like to_sql(), to_csv(), and to_parquet() for ingesting processed data into the appropriate storage or analysis platforms.
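A minimal sketch of the load step, writing to an in-memory buffer instead of a real file; to_sql() and to_parquet() follow the same pattern against a database connection or file path:

```python
import io
import pandas as pd

df = pd.DataFrame({"item": ["Dress", "Jeans"], "price": [1200.2, 2600.0]})

# to_csv writes the cleaned data out; index=False drops the row labels,
# which you usually don't want in an exported file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```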

How Can You Extend Python DataFrames with Airbyte for Enhanced Data Integration?

While Python DataFrames are powerful for data manipulation and analysis, their connectivity to diverse data sources can be limited. Airbyte provides the open-source PyAirbyte library to bridge this gap and create more comprehensive data integration workflows.

Airbyte

With PyAirbyte, you can extend the capability of DataFrames by extracting data from hundreds of sources using Airbyte connectors. Following this, the data is loaded into various SQL caches, including Snowflake, BigQuery, DuckDB, or PostgreSQL. You can then convert the data from these caches into a Pandas DataFrame. This enables you to manipulate and analyze the data within a Python environment efficiently.

For detailed insights, you can refer to the PyAirbyte demo. Apart from PyAirbyte, Airbyte offers several other features as follows:

  • Extensive Connector Catalog: Airbyte provides 600+ pre-built connectors to help you transfer data between various platforms and destinations.
  • Custom Connector Development: Using Airbyte’s no-code Connector Builder and its AI Assistant, you can create custom connectors to fit your requirements. The AI Assistant prefills configuration fields, speeding up the process.
  • Change Data Capture (CDC): Airbyte supports CDC, which enables you to track source data changes and replicate them to the destination system. This ensures data consistency across platforms.
  • Multi-Vector Database Support: You can load all your unstructured data directly into Airbyte-supported vector databases, including Pinecone, Milvus, Weaviate, or Qdrant.

Conclusion

In this comprehensive guide, you have learned how Python DataFrames can help you manage datasets within an easy-to-use tabular format. DataFrames are also beneficial in simplifying ETL operations. You can extend the power of DataFrames using Airbyte’s PyAirbyte for enhanced ETL workflows.

Frequently Asked Questions

What is the difference between a DataFrame and a regular Python list?

A DataFrame is a two-dimensional labeled data structure that can hold different data types in columns, while a Python list is a one-dimensional collection of items. DataFrames provide built-in methods for data analysis, filtering, and manipulation that are not available with regular lists, making them much more suitable for data analysis tasks.

How do I handle missing values in a DataFrame?

You can handle missing values using several DataFrame methods, including fillna() to replace missing values with specified values, dropna() to remove rows or columns with missing values, and isnull() to identify where missing values occur. The choice of method depends on your specific data analysis requirements and the nature of the missing data.

Can I convert a DataFrame back to other data formats?

Yes, DataFrames offer multiple export methods, including to_csv() for CSV files, to_json() for JSON format, to_excel() for Excel files, to_sql() for database tables, and to_parquet() for Parquet format. These methods offer flexibility in sharing and storing your processed data in various formats, making them suitable for different applications.

What are the performance considerations when working with large DataFrames?

Large DataFrames require careful memory management and may benefit from chunking operations, using efficient data types like category for repeated strings, and leveraging vectorized operations instead of loops. Consider using tools like Dask for datasets that exceed memory capacity, and optimize data loading by specifying column types and using efficient file formats like Parquet.
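As a quick illustration of the category tip, converting a column of repeated strings usually shrinks its memory footprint substantially:

```python
import pandas as pd

# Repeated strings compress well as the category dtype;
# memory_usage(deep=True) reports the real footprint of a column.
s = pd.Series(["HR", "Finance", "HR", "IT"] * 10_000)
as_category = s.astype("category")

print(s.memory_usage(deep=True), as_category.memory_usage(deep=True))
```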

How does PyAirbyte enhance DataFrame functionality?

PyAirbyte extends DataFrame capabilities by providing access to 600+ pre-built connectors for extracting data from diverse sources, including databases, APIs, and SaaS platforms. It enables you to materialize data directly as DataFrames while supporting advanced features like Change Data Capture (CDC) and integration with modern data warehouses and vector databases.
