Getting Started with Data Analysis in PostgreSQL: Basic Features
Before analyzing any data, it is useful to be aware of the dataset's basic features. Properties like the maximum, minimum, sum, and average values of a column, and percentile cutoffs, give an initial understanding of the data. It is also helpful to be able to divide the data into subgroups and compute these values for each subgroup.
This article presents an introduction to preliminary data analysis using PostgreSQL and illustrates how to study the aforementioned properties of the data.
Prerequisites
Background
It is assumed you have prior practical exposure to the PostgreSQL database. It is helpful to have a working knowledge of basic data analysis.
Systems
The commands in this article are tested on PostgreSQL 14.5. SQL queries should work on all recent versions of PostgreSQL. If you choose to use the included Docker image, your computer needs to have Docker as well as a PostgreSQL client (such as Postico for the Mac or Beekeeper Studio as a cross-platform solution).
Note: in the examples below, $ denotes the operating system prompt as a non-root user. Code snippets without a prompt are SQL statements.
Example Dataset
PostgreSQL has several built-in functions that are useful in analyzing data. It is helpful to see these functions in action in the form of SQL queries on an actual dataset.
This article uses a dataset about cancer statistics. This dataset is based on (U.S.) government data and is available publicly (behind a free login wall) on data.world (the author has no affiliation with the website). Kaggle also has many datasets freely available.
The examples in this article are presented as hypothetical analyses performed on this dataset. It is strongly recommended to try out the examples while reading through the article. You have three ways to get the data:
- Download a database dump file with the preprocessed data. Choose this option if you are a regular PostgreSQL user who is comfortable (or wants to practice) importing database dumps.
- Use a Docker image with PostgreSQL preloaded with the preprocessed data. This is the fastest way for Docker users. Just run the Docker image and you are good to go.
- Start from the raw CSV file and process it yourself. Choose this option to spend some time and gain hands-on experience working on a raw dataset and making it analysis-ready. This is a useful exercise if you are expecting to do analytical work yourself.
Option 1
Download the database dump file. The dump is essentially a series of SQL commands to recreate the database. Move the dump file to a location the postgres user has access to.
Before importing the dump file, create a new database:
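The database name cancer_db below is a placeholder; any name works:

```shell
$ sudo -u postgres createdb cancer_db
```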
As the postgres user, import the dump into the newly created database:
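Assuming the downloaded dump file is named cancer_db.sql (adjust the name and path to match your download):

```shell
$ sudo -u postgres psql -d cancer_db -f cancer_db.sql
```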
Log in as the postgres user and connect to the new database:
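```shell
$ sudo -u postgres psql -d cancer_db
```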
You can now jump to the Data Analysis section.
Option 2
Alternatively, you can use the Docker image with the data preloaded. If your computer does not have PostgreSQL installed, you can install just the client (to connect to the database running on Docker). On Debian/Ubuntu based systems, the PostgreSQL client can be installed as:
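```shell
$ sudo apt install postgresql-client
```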
Pull the Docker image:
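The image name below is a placeholder; substitute the actual name of the image accompanying this article:

```shell
$ docker pull example/cancer-data-postgres
```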
Check that the new image is now part of your system:
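```shell
$ docker images
```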
Run the new image:
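A sketch, assuming the image serves PostgreSQL on the default port 5432 and is preconfigured with user and password postgres:

```shell
$ docker run -d --name cancer-data -p 5432:5432 example/cancer-data-postgres
```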
Connect to the running database instance with username and password postgres:
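Assuming the preloaded database is named cancer_db (substitute the name used by the image):

```shell
$ psql -h localhost -p 5432 -U postgres -d cancer_db
```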
You can now jump to the Data Analysis section to continue further.
Option 3
Instead of using either the Docker image or the database dump, you can also prepare the dataset manually, starting from the raw CSV file. The next two subsections show how to import the dataset (as a CSV file) into PostgreSQL and how to preprocess the data before analyzing it.
Import the Data File into the Database
Download the CSV file containing the data from the webpage of the dataset (the page has download buttons). For this particular dataset, the data file is cancer_reg.csv.
It is advisable to create and use a new database to test the examples. As the postgres user, open a psql prompt on the terminal. You can also use a graphical SQL editor (e.g. Postico, Beekeeper, and many others). The examples below assume the use of a psql terminal.
Create a new database:
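```sql
-- cancer_db is a placeholder name; any name works
CREATE DATABASE cancer_db;
```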
Connect to the newly created database:
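From within psql, use the \c meta-command:

```
\c cancer_db
```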
Create a new table cancer_data:
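A sketch of the table definition, abridged to the columns this article actually uses. The names below follow the dataset's data dictionary, but verify them against your copy of the CSV; the real table must list every CSV field, in order, for the import to work:

```sql
CREATE TABLE cancer_data (
    avganncount          NUMERIC, -- mean annual reported cancer cases
    avgdeathsperyear     NUMERIC, -- mean annual reported cancer deaths
    medincome            NUMERIC, -- median income
    popest2015           NUMERIC, -- county population estimate
    povertypercent       NUMERIC, -- percent of residents in poverty
    medianage            NUMERIC, -- median age of residents
    geography            TEXT,    -- "county, state" string
    pctemployed16_over   NUMERIC, -- percent employed, ages 16 and over
    pctunemployed16_over NUMERIC  -- percent unemployed, ages 16 and over
    -- ...one column for each remaining CSV field...
);
```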
This table contains columns for each of the fields in the CSV file. The data types of the columns are based on the data dictionary. It is important to correctly match the column names and data types between the CSV file and the data table. Not doing this leads to errors in importing the data.
Copy the data from the CSV file to the table:
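From the psql prompt, the \copy meta-command reads a client-side file into the table (HEADER assumes the CSV's first row contains the column names):

```
\copy cancer_data FROM '/path/to/cancer_reg.csv' WITH (FORMAT csv, HEADER true)
```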
In the above command, the file path should be based on the location of the cancer_reg.csv file on your system.
Check a couple of rows to get an idea of the data itself:
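```sql
SELECT * FROM cancer_data LIMIT 2;
```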
Data Preprocessing
For most analytical exercises, the original data needs to be modified to make it more amenable to analysis. In the cancer_data table, the column geography contains string values of the form county, state. There is no column for individual states. Writing SQL queries that filter on a portion of a string value is awkward and error-prone. It is easier to have dedicated columns for county and state names.
Add two new columns, county and state, to the table cancer_data:
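```sql
ALTER TABLE cancer_data
    ADD COLUMN county TEXT,
    ADD COLUMN state  TEXT;
```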
Update these columns based on the value of the county and state in the column geography:
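One way to do this is with the split_part function:

```sql
UPDATE cancer_data
SET county = split_part(geography, ', ', 1),
    state  = split_part(geography, ', ', 2);
```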
Notice that the separator is a comma followed by a space. Check that the above command did what was expected:
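```sql
SELECT geography, county, state FROM cancer_data LIMIT 5;
```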
To enhance readability and usability, create a materialized view based on the original data. In this example, the materialized view is customized as follows:
- It contains only those columns that are used in the analysis
- Many of the columns are renamed for better readability
- It adds per capita versions of some columns, such as avg_annual_cases and avg_annual_deaths
- It includes the new county and state columns (this change was made to the table itself)
The following query creates the new materialized view, mv_cancer_data, which will be used for subsequent analyses:
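A sketch of the view definition, assuming the abridged column names from the CREATE TABLE sketch above:

```sql
CREATE MATERIALIZED VIEW mv_cancer_data AS
SELECT
    avganncount                   AS avg_annual_cases,
    avganncount / popest2015      AS percapita_annual_cases,
    avgdeathsperyear              AS avg_annual_deaths,
    avgdeathsperyear / popest2015 AS percapita_annual_deaths,
    medincome                     AS median_income,
    popest2015                    AS population,
    povertypercent                AS pc_poverty,
    medianage                     AS median_age,
    county,
    state,
    pctemployed16_over            AS pc_employed,
    pctunemployed16_over          AS pc_unemployed
FROM cancer_data;
```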
Data Analysis
This section shows ways to do an exploratory analysis of the data, which helps in getting a "feel" for the data before digging deeper. Note that the queries in this section are based on the materialized view, mv_cancer_data, which contains the preprocessed dataset.
Data Description
The materialized view mv_cancer_data will be used in all further examples. Check the columns and data types in the materialized view:
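From the psql prompt:

```
\d mv_cancer_data
```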
Below is a description of the columns in mv_cancer_data.
- avg_annual_cases - Mean number of reported cases of cancer diagnosed annually in the county
- percapita_annual_cases - Number of reported cases in the county divided by county population
- avg_annual_deaths - Mean number of reported mortalities due to cancer in the county
- percapita_annual_deaths - Number of cancer deaths in the county divided by county population
- median_income - Median income per county
- population - Population of the county
- pc_poverty - Percent of county populace in poverty
- median_age - Median age of county residents
- county - County name
- state - State name
- pc_employed - Percent of county residents ages 16 and over who are employed
- pc_unemployed - Percent of county residents ages 16 and over who are unemployed
Note that percentages of unemployed and employed people do not add up to 100%. The difference is attributed to people who have quit looking for work, are not looking for work, are unable to work, or otherwise are not in the labor force.
Take a look at the form of the data:
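```sql
SELECT * FROM mv_cancer_data LIMIT 3;
```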
Aggregate Functions
Functions used in this section, such as sum or avg (average), compute their output based on the values of multiple rows. Such functions are called aggregate functions. Aggregate functions do their computation over the rows that match the conditions in the WHERE clause. In the absence of a WHERE clause, the function is applied over all the rows.
Count the Number of Rows
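The count function returns the number of input rows:

```sql
SELECT count(*) FROM mv_cancer_data;
```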
The above query returns the total number of rows in mv_cancer_data. To get the number of entries for Alabama, count the number of rows where the state column has the value Alabama:
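```sql
SELECT count(*) FROM mv_cancer_data WHERE state = 'Alabama';
```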
Maximum and Minimum Values of a Column
To get the maximum and minimum values in a column, use the max and min functions respectively.
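```sql
SELECT min(avg_annual_deaths) FROM mv_cancer_data;
```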
The above query returns the minimum number of annual deaths in any county. Get the maximum number of annual (per county) deaths in Alabama counties:
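```sql
SELECT max(avg_annual_deaths) FROM mv_cancer_data WHERE state = 'Alabama';
```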
Using Aggregate Functions in the WHERE clause
In the previous queries, suppose you also want to get all the other details (columns) of the county which had the maximum number of annual deaths.
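A first attempt might look like this:

```sql
-- Incorrect: an aggregate function cannot appear in the WHERE clause
SELECT * FROM mv_cancer_data
WHERE avg_annual_deaths = max(avg_annual_deaths);
```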
The above query is wrong: aggregate functions are not allowed in the WHERE clause of a SQL query. The aggregate function max needs to be used within a subquery whose output can then be used as part of the main query. To get all the columns of the row with the maximum number of annual deaths:
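```sql
SELECT * FROM mv_cancer_data
WHERE avg_annual_deaths = (
    SELECT max(avg_annual_deaths) FROM mv_cancer_data
);
```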
Sum of the Values of a Column
sum adds up the values of a column. To get the total number of annual cancer deaths throughout the country (across all counties), apply the sum function over all the values in the column avg_annual_deaths:
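```sql
SELECT sum(avg_annual_deaths) FROM mv_cancer_data;
```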
To get the total number of annual deaths for a specific state, California:
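```sql
SELECT sum(avg_annual_deaths) FROM mv_cancer_data WHERE state = 'California';
```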
Average Value of a Column
The avg function averages the (non-null) values of a column. The following query computes the average number of annual cases (per capita) of cancer across all US counties:
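```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data;
```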
To compute the average number of annual cases (per capita) of cancer in counties with a median income higher than $50,000:
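```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data
WHERE median_income > 50000;
```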
To compute the average number of annual cases (per capita) of cancer in counties with a median income lower than $50,000:
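```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data
WHERE median_income < 50000;
```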
Observe, based on the outputs of the above two queries, that the average number of diagnosed cases per capita is higher in higher-income counties.
Exercise
Modify the above two queries to compute the average number of deaths per capita and observe that there are fewer deaths in counties with higher income. The higher number of diagnoses combined with the lower number of deaths can, at least superficially, be attributed to improved testing as well as treatment facilities in higher-income regions.
Percentiles
The percentile_cont function computes a continuous percentile: percentile_cont(0.95), for example, computes the value of the 95th percentile. The following query gives the 95th percentile of the per capita number of annual deaths across all counties:
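```sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY percapita_annual_deaths)
FROM mv_cancer_data;
```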
To get the top 1% of values of an ordered set, you need to find values that are above the 99th percentile. The next query shows the top 1% of counties based on per capita annual deaths:
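```sql
SELECT county, state, percapita_annual_deaths
FROM mv_cancer_data
WHERE percapita_annual_deaths > (
    SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY percapita_annual_deaths)
    FROM mv_cancer_data
)
ORDER BY percapita_annual_deaths DESC;
```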
The bottom 1% consists of values that are below the 1st percentile:
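```sql
SELECT county, state, percapita_annual_deaths
FROM mv_cancer_data
WHERE percapita_annual_deaths < (
    SELECT percentile_cont(0.01) WITHIN GROUP (ORDER BY percapita_annual_deaths)
    FROM mv_cancer_data
)
ORDER BY percapita_annual_deaths;
```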
Grouping and Partitioning
Grouping
One of the goals of analytical exercises is to group the data into different classes to study and compare specific properties of each class. The grouping operations in PostgreSQL are often used to achieve this.
To get the number of rows for each state:
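```sql
SELECT state, count(*) FROM mv_cancer_data GROUP BY state;
```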
When a query uses a GROUP BY clause together with an aggregate function, the aggregate function is applied separately to each group.
To find the total number of deaths for each state:
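```sql
SELECT state, sum(avg_annual_deaths) FROM mv_cancer_data GROUP BY state;
```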
Add a sorting (ordering) clause in the above query:
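```sql
SELECT state, sum(avg_annual_deaths) AS total_deaths
FROM mv_cancer_data
GROUP BY state
ORDER BY total_deaths DESC;
```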
Partitioning and Window Functions
Now consider that for each county, you need to show the total number of cancer cases for the state, and the percentage of state-wide cancer cases accounted for by that county. This is not achievable with the GROUP BY queries shown earlier. You need to use partitioning to do this:
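A sketch using sum as a window function, partitioned by state:

```sql
SELECT
    county,
    state,
    avg_annual_cases,
    sum(avg_annual_cases) OVER (PARTITION BY state) AS state_total_cases,
    100 * avg_annual_cases
        / sum(avg_annual_cases) OVER (PARTITION BY state) AS pc_of_state_cases
FROM mv_cancer_data;
```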
GROUP BY reduces the number of rows in the output by collapsing each group of rows into a single row. In contrast, window functions do the aggregate computation without collapsing rows: every input row appears in the output, alongside the aggregate computed over its group. Adding a PARTITION BY clause to a window function specifies the set of rows over which the aggregate is computed.
Conclusion
This article showed how to do preliminary data analysis using PostgreSQL. Many of the built-in functions used for data analysis are listed under Aggregate Functions in the official PostgreSQL documentation, which also covers functions beyond those discussed here. The official tutorials on aggregate functions and window functions are helpful for a better understanding of these functions.
Learn More
The next step after preliminary analysis is understanding the statistical properties of the data: mean, standard deviation, correlations, and so on. It is also helpful to use a regression equation to predict the values of one variable based on changes in another. The second part of this article covers these topics, with relevant examples based on the same dataset used in this article.