Getting Started with Data Analysis in PostgreSQL: Basic Features
Before analyzing any data, it is useful to be aware of the dataset's basic features. Properties like the maximum, minimum, sum, and average values of a column, and percentile cutoffs, give an initial understanding of the data. It is also helpful to be able to divide the data into subgroups and compute these values for each subgroup.
This article presents an introduction to preliminary data analysis using PostgreSQL and illustrates how to study the aforementioned properties of the data.
Prerequisites
Background
It is assumed you have prior practical exposure to the PostgreSQL database. It is helpful to have a working knowledge of basic data analysis.
Systems
The commands in this article are tested on PostgreSQL 14.5. SQL queries should work on all recent versions of PostgreSQL. If you choose to use the included Docker image, your computer needs to have Docker as well as a PostgreSQL client (such as Postico for the Mac or Beekeeper Studio as a cross-platform solution).
Note: in the examples below, $ denotes the operating system prompt as a non-root user. Code snippets without a prompt are SQL statements.
Example Dataset
PostgreSQL has several built-in functions that are useful in analyzing data. It is helpful to see these functions in action in the form of SQL queries on an actual dataset.
This article uses a dataset about cancer statistics. This dataset is based on (U.S.) government data and is available publicly (behind a free login wall) on data.world (the author has no affiliation with the website). Kaggle also has many datasets freely available.
The examples in this article are presented as hypothetical analyses performed on this dataset. It is strongly recommended to try out the examples while reading through the article. You have three ways to get the data:
- Download a database dump file with the preprocessed data. Choose this option if you are a regular PostgreSQL user who is comfortable (or wants to practice) importing database dumps.
- Use a Docker image with PostgreSQL preloaded with the preprocessed data. This is the fastest way for Docker users. Just run the Docker image and you are good to go.
- Start from the raw CSV file and process it yourself. Choose this option to spend some time and gain hands-on experience working on a raw dataset and making it analysis-ready. This is a useful exercise if you are expecting to do analytical work yourself.
Option 1
Download the database dump file. The dump is essentially a series of SQL commands to recreate the database. Move the dump file to a location the postgres user has access to.
Before importing the dump file, create a new database:
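The database name cancer_db below is a placeholder; any name works:

```shell
$ sudo -u postgres createdb cancer_db
```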
As the postgres user, import the dump into the newly created database:
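Assuming the downloaded dump file is named cancer_db.sql (adjust the name and path to match your download):

```shell
$ sudo -u postgres psql -d cancer_db -f cancer_db.sql
```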
Log in as the postgres user and connect to the new database:
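```shell
$ sudo -u postgres psql -d cancer_db
```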
You can now jump to the Data Analysis section.
Option 2
Alternatively, you can use the Docker image with the data preloaded. If your computer does not have PostgreSQL installed, you can install just the client (to connect to the database running on Docker). On Debian/Ubuntu based systems, the PostgreSQL client can be installed as:
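```shell
$ sudo apt install postgresql-client
```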
Pull the Docker image:
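The image name below is a placeholder; substitute the actual name of the image accompanying this article:

```shell
$ docker pull example/cancer-data-postgres
```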
Check that the new image is now part of your system:
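```shell
$ docker images
```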
Run the new image:
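A sketch, assuming the image serves PostgreSQL on the default port 5432 and is preconfigured with user and password postgres:

```shell
$ docker run -d --name cancer-data -p 5432:5432 example/cancer-data-postgres
```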
Connect to the running database instance with username and password postgres:
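Assuming the preloaded database is named cancer_db (substitute the name used by the image):

```shell
$ psql -h localhost -p 5432 -U postgres -d cancer_db
```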
You can now jump to the Data Analysis section to continue further.
Option 3
Instead of using either the Docker image or the database dump, you can also prepare the dataset manually, starting from the raw CSV file. The next two subsections show how to import the dataset (as a CSV file) into PostgreSQL and how to preprocess the data before analyzing it.
Import the Data File into the Database
Download the CSV file containing the data from the webpage of the dataset (the page has download buttons). For this particular dataset, the data file is cancer_reg.csv.
It is advisable to create and use a new database to test the examples. As the postgres user, open a psql prompt on the terminal. You can also use a graphical SQL editor (e.g. Postico, Beekeeper, and many others). The examples below assume the use of a psql terminal.
Create a new database:
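```sql
-- cancer_db is a placeholder name; any name works
CREATE DATABASE cancer_db;
```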
Connect to the newly created database:
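From within psql, use the \c meta-command:

```
\c cancer_db
```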
Create a new table cancer_data:
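A sketch of the table definition, abridged to the columns this article actually uses. The names below follow the dataset's data dictionary, but verify them against your copy of the CSV; the real table must list every CSV field, in order, for the import to work:

```sql
CREATE TABLE cancer_data (
    avganncount          NUMERIC, -- mean annual reported cancer cases
    avgdeathsperyear     NUMERIC, -- mean annual reported cancer deaths
    medincome            NUMERIC, -- median income
    popest2015           NUMERIC, -- county population estimate
    povertypercent       NUMERIC, -- percent of residents in poverty
    medianage            NUMERIC, -- median age of residents
    geography            TEXT,    -- "county, state" string
    pctemployed16_over   NUMERIC, -- percent employed, ages 16 and over
    pctunemployed16_over NUMERIC  -- percent unemployed, ages 16 and over
    -- ...one column for each remaining CSV field...
);
```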
This table contains columns for each of the fields in the CSV file. The data types of the columns are based on the data dictionary. It is important to correctly match the column names and data types between the CSV file and the data table. Not doing this leads to errors in importing the data.
Copy the data from the CSV file to the table:
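From the psql prompt, the \copy meta-command reads a client-side file into the table (HEADER assumes the CSV's first row contains the column names):

```
\copy cancer_data FROM '/path/to/cancer_reg.csv' WITH (FORMAT csv, HEADER true)
```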
In the above command, the file path should be based on the location of the cancer_reg.csv file on your system.
Check a couple of rows to get an idea of the data itself:
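```sql
SELECT * FROM cancer_data LIMIT 2;
```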
Data Preprocessing
For most analytical exercises, the original data needs to be modified to make it more amenable to analysis. In the cancer_data table, the column geography contains string values of the form county, state. There is no column for individual states. Writing SQL queries that filter on a portion of a string value is awkward and error-prone. It is easier to have dedicated columns for county and state names.
Add two new columns, county and state, to the table cancer_data:
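```sql
ALTER TABLE cancer_data
    ADD COLUMN county TEXT,
    ADD COLUMN state  TEXT;
```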
Update these columns based on the value of the county and state in the column geography:
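One way to do this is with the split_part function:

```sql
UPDATE cancer_data
SET county = split_part(geography, ', ', 1),
    state  = split_part(geography, ', ', 2);
```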
Notice that the separator is a comma followed by a space. Check that the above command did what was expected:
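```sql
SELECT geography, county, state FROM cancer_data LIMIT 5;
```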
To enhance readability and usability, create a materialized view based on the original data. In this example, the materialized view is customized as follows:
- It contains only those columns that are used in the analysis
- Many of the columns are renamed for better readability
- It adds per capita versions of some columns, such as avg_annual_cases and avg_annual_deaths
- It includes the new county and state columns (this change was made to the table itself)
The following query creates the new materialized view, mv_cancer_data, which will be used for subsequent analyses:
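A sketch of the view definition, assuming the abridged column names from the CREATE TABLE sketch above:

```sql
CREATE MATERIALIZED VIEW mv_cancer_data AS
SELECT
    avganncount                   AS avg_annual_cases,
    avganncount / popest2015      AS percapita_annual_cases,
    avgdeathsperyear              AS avg_annual_deaths,
    avgdeathsperyear / popest2015 AS percapita_annual_deaths,
    medincome                     AS median_income,
    popest2015                    AS population,
    povertypercent                AS pc_poverty,
    medianage                     AS median_age,
    county,
    state,
    pctemployed16_over            AS pc_employed,
    pctunemployed16_over          AS pc_unemployed
FROM cancer_data;
```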
Data Analysis
This section shows ways to do an exploratory analysis of the data, which helps in getting a "feel" for the data before digging deeper. Note that the queries in this section are based on the materialized view, mv_cancer_data, which contains the preprocessed dataset.
Data Description
The materialized view mv_cancer_data will be used in all further examples. Check the columns and data types in the materialized view:
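From the psql prompt:

```
\d mv_cancer_data
```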
Below is a description of the columns in mv_cancer_data.
- avg_annual_cases - Mean number of reported cases of cancer diagnosed annually in the county
- percapita_annual_cases - Number of reported cases in the county divided by county population
- avg_annual_deaths - Mean number of reported mortalities due to cancer in the county
- percapita_annual_deaths - Number of cancer deaths in the county divided by county population
- median_income - Median income per county
- population - Population of the county
- pc_poverty - Percent of county populace in poverty
- median_age - Median age of county residents
- county - County name
- state - State name
- pc_employed - Percent of county residents ages 16 and over who are employed
- pc_unemployed - Percent of county residents ages 16 and over who are unemployed
Note that percentages of unemployed and employed people do not add up to 100%. The difference is attributed to people who have quit looking for work, are not looking for work, are unable to work, or otherwise are not in the labor force.
Take a look at the form of the data:
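```sql
SELECT * FROM mv_cancer_data LIMIT 3;
```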
Aggregate Functions
Functions used in this section, such as sum or avg (average), compute their output based on the values of multiple rows. Such functions are called aggregate functions. Aggregate functions do their computation over the rows that match the conditions in the WHERE clause. In the absence of a WHERE clause, the function is applied over all the rows.
Count the Number of Rows
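The count function returns the number of input rows:

```sql
SELECT count(*) FROM mv_cancer_data;
```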
The above query returns the total number of rows in mv_cancer_data. To get the number of entries for Alabama, count the number of rows where the state column has the value Alabama:
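```sql
SELECT count(*) FROM mv_cancer_data WHERE state = 'Alabama';
```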
Maximum and Minimum Values of a Column
To get the maximum and minimum values in a column, use the max and min functions respectively.
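```sql
SELECT min(avg_annual_deaths) FROM mv_cancer_data;
```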
The above query returns the minimum number of annual deaths in any county. Get the maximum number of annual (per county) deaths in Alabama counties:
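```sql
SELECT max(avg_annual_deaths) FROM mv_cancer_data WHERE state = 'Alabama';
```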
Using Aggregate Functions in the WHERE clause
In the previous queries, suppose you also want to get all the other details (columns) of the county which had the maximum number of annual deaths.
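A first attempt might look like this:

```sql
-- Incorrect: an aggregate function cannot appear in the WHERE clause
SELECT * FROM mv_cancer_data
WHERE avg_annual_deaths = max(avg_annual_deaths);
```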
The above query is wrong: aggregate functions are not allowed in the WHERE clause of a SQL query. The aggregate function max needs to be used within a subquery whose output can then be used as part of the main query. To get all the columns of the row with the maximum number of annual deaths:
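```sql
SELECT * FROM mv_cancer_data
WHERE avg_annual_deaths = (
    SELECT max(avg_annual_deaths) FROM mv_cancer_data
);
```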
Sum of the Values of a Column
sum adds up the values of a column. To get the total number of annual cancer deaths throughout the country (across all counties), apply the sum function over all the values in the column avg_annual_deaths:
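```sql
SELECT sum(avg_annual_deaths) FROM mv_cancer_data;
```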
To get the total number of annual deaths for a specific state, California:
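```sql
SELECT sum(avg_annual_deaths) FROM mv_cancer_data WHERE state = 'California';
```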
Average Value of a Column
The avg function averages the (non-null) values of a column. The following query computes the average number of annual cases (per capita) of cancer across all US counties:
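```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data;
```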
To compute the average number of annual cases (per capita) of cancer in counties with a median income higher than $50,000:
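```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data
WHERE median_income > 50000;
```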
To compute the average number of annual cases (per capita) of cancer in counties with a median income lower than $50,000:
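```sql
SELECT avg(percapita_annual_cases) FROM mv_cancer_data
WHERE median_income < 50000;
```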
Observe, based on the outputs of the above two queries, that the average number of diagnosed cases per capita is higher in higher-income counties.
Exercise
Modify the above two queries to compute the average number of deaths per capita and observe that there are fewer deaths in counties with higher income. The higher number of diagnoses combined with the lower number of deaths can, at least superficially, be attributed to improved testing as well as treatment facilities in higher-income regions.
Percentiles
The percentile_cont function computes a continuous percentile: percentile_cont(0.95), for example, computes the value of the 95th percentile. The following query gives the 95th percentile of the per capita number of annual deaths across all counties:
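```sql
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY percapita_annual_deaths)
FROM mv_cancer_data;
```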
To get the top 1% of values of an ordered set, you need to find values that are above the 99th percentile. The next query shows the top 1% of counties based on per capita annual deaths:
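```sql
SELECT county, state, percapita_annual_deaths
FROM mv_cancer_data
WHERE percapita_annual_deaths > (
    SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY percapita_annual_deaths)
    FROM mv_cancer_data
)
ORDER BY percapita_annual_deaths DESC;
```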
The bottom 1% consists of values that are below the 1st percentile:
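```sql
SELECT county, state, percapita_annual_deaths
FROM mv_cancer_data
WHERE percapita_annual_deaths < (
    SELECT percentile_cont(0.01) WITHIN GROUP (ORDER BY percapita_annual_deaths)
    FROM mv_cancer_data
)
ORDER BY percapita_annual_deaths;
```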
Grouping and Partitioning
Grouping
One of the goals of analytical exercises is to group the data into different classes to study and compare specific properties of each class. The grouping operations in PostgreSQL are often used to achieve this.
To get the number of rows for each state:
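```sql
SELECT state, count(*) FROM mv_cancer_data GROUP BY state;
```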
When a query uses a GROUP BY clause together with an aggregate function, the aggregate function is applied separately to each group.
To find the total number of deaths for each state:
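```sql
SELECT state, sum(avg_annual_deaths) FROM mv_cancer_data GROUP BY state;
```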
Add a sorting (ordering) clause in the above query:
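```sql
SELECT state, sum(avg_annual_deaths) AS total_deaths
FROM mv_cancer_data
GROUP BY state
ORDER BY total_deaths DESC;
```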
Partitioning and Window Functions
Now consider that for each county, you need to show the total number of cancer cases for the state, and the percentage of state-wide cancer cases accounted for by that county. This is not achievable with the GROUP BY queries shown earlier. You need to use partitioning to do this:
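A sketch using sum as a window function, partitioned by state:

```sql
SELECT
    county,
    state,
    avg_annual_cases,
    sum(avg_annual_cases) OVER (PARTITION BY state) AS state_total_cases,
    100 * avg_annual_cases
        / sum(avg_annual_cases) OVER (PARTITION BY state) AS pc_of_state_cases
FROM mv_cancer_data;
```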
GROUP BY reduces the number of rows in the output by collapsing each group of rows into a single row. In contrast, window functions do the aggregate computation without collapsing rows: every input row appears in the output, alongside the aggregate computed over its group. Adding a PARTITION BY clause to a window function specifies the set of rows over which the aggregate is computed.
Conclusion
This article showed how to do preliminary data analysis using PostgreSQL. Many of the built-in functions used for data analysis are listed under Aggregate Functions in the official PostgreSQL documentation, which also covers functions beyond those discussed here. The official tutorials on aggregate functions and window functions are helpful for a better understanding of these functions.
Learn More
The next step after preliminary analysis is understanding the statistical properties of the data: mean, standard deviation, correlations, and so on. It is also helpful to use a regression equation to predict the values of one variable based on changes in another. The second part of this article covers these topics, with relevant examples based on the same dataset used in this article.