In a perfect world, our data would be clean from the very start. There would be no missing values. The columns would be cast to the right types. We’d be able to perform aggregations and use the data in models with no work at all.
Unfortunately, raw data can be very messy. Consider yourself lucky if you’ve never had to apply a bunch of transformations to your data before using it in your models. Every analytics engineer knows that providing clean data for our analysts is a key part of our role.
Messy data in means messy data out. Your data models are only as effective and concise as the data you’re loading into them.
In this article, I will briefly show how you can extract raw data from business APIs using Airbyte and ingest it into a data warehouse. Then I will walk through some common data quality issues that arise when extracting raw data from Google Sheets. Finally, I’ll introduce some popular SQL string functions to clean this data and make it ready for your analysts.
Before transforming our raw data, we need to ingest it using one of the 100 connectors Airbyte has to offer. Make sure you follow the instructions to set up Airbyte locally and create your first data connector.
You’ll also need to set up the destination, or the warehouse that you want your data to be loaded into. Airbyte has detailed instructions on how to do this here.
When loading data to your destination, you’ll notice the option to choose “basic normalization”. Selecting it allows Airbyte to create a schema and tables that fit your data and its destination. Without normalization, your data is uploaded as a single column containing a JSON blob, which would just require further transformation on our end. Let’s take the help and let Airbyte make our job easier!
It’s important to note that this schema and its tables will always exist as raw data directly ingested by Airbyte. We will not be applying our transformations directly to these tables but writing them in SQL files to be run with dbt. We will detail how to set up your dbt source files and base models in an upcoming article.
For now, know that you always want to have a database just for raw data. This way, if you ever need to change your transformations, you can do so without having to reingest the data from your Airbyte source connectors.
As written in Airbyte’s documentation,
“A core tenet of ELT philosophy is that data should be untouched as it moves through the E and L stages so that the raw data is always accessible.”
Keep this in mind when setting up your data warehouse. You should have two separate databases, one for raw data and one for your transformed data.
For this tutorial, I ingested data from a Google Sheet to Snowflake. You can find more information about setting up Airbyte data connectors on the Google Sheets source documentation and the Snowflake destination documentation.
While Google Sheets isn’t the most robust way to automate and track data, it is used very often. At a smaller company, a spreadsheet is sometimes the only tool you have to track information like marketing budget, influencer contacts, or product costs.
That being said, data from Google Sheets is some of the hardest data to work with. Unlike applications like Shopify and Facebook, there are no validation checks in place, so Google Sheets data is prone to human error. I’ve seen many cases where someone entered a value in the wrong format and broke our entire data pipeline.
Here’s an example of a messy Google Sheets data set loaded to Snowflake that I’ll be using for this tutorial:
Let’s look at some common scenarios we see in our raw data and how we can transform them into data we can use in our models. Built-in SQL string functions help you clean strings coming from your raw data to query them on your data warehouse.
The first thing you want to do when cleaning any data is change the column names to the names that make the most sense for your analysis. Here, date is a common keyword used across tables, so you will want to change it to be more descriptive.
In addition, we will want to change the column name Campaign Link to a name without spaces or capitalization, which makes our analysis smoother. Spaces in column names can cause issues in future queries. I recommend snake case: lowercase words separated by underscores. You can easily rename table columns with the <span class='text-style-code'>AS</span> keyword.
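As a sketch, the renaming might look like this. The table name raw_sheets.campaign_data and the new column name campaign_date are assumptions for illustration, not from the original data set:

```sql
-- Rename raw columns to descriptive, snake_case names.
-- Table and column names here are illustrative.
select
    "Date" as campaign_date,
    "Campaign Link" as campaign_link,
    customer_name,
    lifetime_value,
    phone,
    email
from raw_sheets.campaign_data;
```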
Awesome, now we can easily use these columns in our transformations!
Let’s start by looking at the customer_name column. As you can see, it contains both the first and last names. We want to use this column to create two separate columns, first_name and last_name.
This is where the <span class='text-style-code'>split_part()</span> string function comes in handy. We specify the column we want to split, the delimiter we want to split it on, and the index of the part we want to keep.
1 will give us the string to the left of the space and -1 will give us the string to the right of the space.
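Putting that together, a sketch of the name split might look like this (the table name is an illustrative assumption):

```sql
-- Part 1 is everything left of the first space;
-- part -1 is everything right of the last space.
select
    split_part(customer_name, ' ', 1)  as first_name,
    split_part(customer_name, ' ', -1) as last_name
from raw_sheets.campaign_data;
```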
We now have two separate columns, one for the customer’s first name and one for the customer’s last name.
Now, notice there are still some capitalization issues. For the sake of analysis, it makes sense for all strings to be lowercase. This way you won’t have to worry about the case of the string being an issue when filtering or comparing values.
We can do this by using the <span class='text-style-code'>lower()</span> string function.
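Wrapping the split expressions from the previous step, a sketch might look like this (table name assumed for illustration):

```sql
-- Lowercase the split name parts so case never matters
-- when filtering or comparing values.
select
    lower(split_part(customer_name, ' ', 1))  as first_name,
    lower(split_part(customer_name, ' ', -1)) as last_name
from raw_sheets.campaign_data;
```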
This query results in the following:
Now our first_name and last_name columns are all lowercase.
Let’s take a look at the campaign_link column. This column follows the structure of a typical link generated from Google or Bing Ads. It’s seen a lot in marketing and can be a bit of a pain to extract.
The key with extracting strings from URLs like this is to recognize the pattern. In some cases there will be a few different patterns in the same column. Luckily, here, it is uniform.
We can see that utm_source is always after a ? and before an &. utm_medium follows this & and is before another &. Lastly, utm_campaign follows the last &. We need to write some SQL code using the <span class='text-style-code'>split_part()</span> function that uses this pattern to extract each value.
When we use the same <span class='text-style-code'>split_part()</span> string function mentioned earlier, we can extract the source, medium, and campaign strings from between the characters that surround them. We use the inner <span class='text-style-code'>split_part()</span> function to extract the values to the right of each = and then another to extract the values to the left of the next character.
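A sketch of the nested extraction, assuming every link follows the uniform ?utm_source=...&utm_medium=...&utm_campaign=... pattern with no other = characters in the URL:

```sql
-- The inner split_part() grabs everything to the right of the nth '=',
-- and the outer one trims the result at the following '&'.
select
    split_part(split_part(campaign_link, '=', 2), '&', 1) as utm_source,
    split_part(split_part(campaign_link, '=', 3), '&', 1) as utm_medium,
    split_part(split_part(campaign_link, '=', 4), '&', 1) as utm_campaign
from raw_sheets.campaign_data;
```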
This results in three new columns:
Now there are separate columns for each part of the original campaign link.
Let’s look at the lifetime_value column. We know this column must contain numeric values and no text. However, one value has a typo where random letters were entered. Before we can clean this column further, we need to remove this string value.
Snowflake SQL dialect has a handy string function called TRY_TO_NUMBER() which casts values to a number if they are numeric and returns a NULL value if not.
Notice that this SQL function, with its default scale of zero, drops the decimal places from our original values. Because of this, we’ll use it only inside a case statement to check for NULL values, rather than to cast the column directly.
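A sketch of that case statement, keeping the original value (decimals intact) whenever it parses as a number (table name assumed for illustration):

```sql
-- try_to_number() returns NULL for non-numeric values,
-- so we use it only as a test and keep the raw value otherwise.
select
    case
        when try_to_number(lifetime_value) is null then null
        else lifetime_value
    end as lifetime_value
from raw_sheets.campaign_data;
```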
We can also see that the decimals aren’t uniform in their formatting and some are negative. For the sake of the data cleaning, we don’t want any negative values in this column. Instead, we want to replace negatives with a 0.
We can use the <span class='text-style-code'>iff()</span> function for testing conditions within a column. First, you specify the condition. If the condition is true, the function returns the first value specified; if it’s false, it returns the second.
Here, I am testing to see whether the lifetime_value is less than 0 or not. If it is, the function will return 0. If it isn’t, I am then rounding the value to two decimal places using the <span class='text-style-code'>round()</span> function. This function is great for rounding currency values or any decimals. You specify the column you want to round and then the number of decimal places.
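A sketch of this step (table name assumed for illustration):

```sql
-- Replace negative values with 0; otherwise round to two decimal places.
select
    iff(lifetime_value < 0, 0, round(lifetime_value, 2)) as lifetime_value
from raw_sheets.campaign_data;
```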
Let’s look at the phone column. We can see that some of these phone numbers aren’t actually phone numbers at all. We want to minimize the number of inaccurate pieces of data we have, so let’s see if we can add a condition to this column.
We know phone numbers can only be 11 digits at most, possibly 10 if the country code isn’t included. In order to limit the amount of fake data in our table, let’s remove any phone numbers that don’t have 10 or 11 digits. Better yet, let’s just make the column value NULL where this condition isn’t met.
Here I am using an <span class='text-style-code'>IFF()</span> function again to check if the values in the column meet a certain condition. I am using the <span class='text-style-code'>len()</span> function to check the characters in the column value. If the phone number isn’t 10 or 11 characters, we are replacing it with a null value.
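A sketch of the phone check (table name assumed for illustration):

```sql
-- NULL out any phone number that isn't 10 or 11 characters long.
select
    iff(len(phone) in (10, 11), phone, null) as phone
from raw_sheets.campaign_data;
```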
Now, we only have one phone number, but at least we know it is accurate! Clean data is better than dirty data, even if there is less of it. Quality over quantity.
Like with phone numbers, we want to make sure the email column contains accurate email addresses. There are a few checks we can put in place to ensure these are accurate. One of them is to make sure the values contain an @, a character all emails have.
We can do this by using the <span class='text-style-code'>charindex()</span> string function. This function returns the number position of a specified character in the string you provide it. If the character is not present in the string, the function will return 0.
Here, we are searching for the @ symbol in the email column. We are using a case statement to check whether the result of this function is a number other than 0. If it’s not 0, it must contain an @. If it is 0, it’s not a proper email address and we replace that value with null.
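A sketch of the email check (table name assumed for illustration):

```sql
-- charindex() returns 0 when '@' is absent, so a non-zero position
-- means the value at least looks like an email address.
select
    case
        when charindex('@', email) != 0 then email
        else null
    end as email
from raw_sheets.campaign_data;
```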
The new emails will look like this:
Similar to the phone column, the email column now has NULL values where the previous values didn’t meet the conditions specified.
In addition to an @ sign, every email address in this data set should end with “.com”. We can check that the values in the email column all end with this string. If not, we can replace the value with NULL.
Here, we will be using the <span class='text-style-code'>LIKE</span> statement to ensure it contains this substring.
The % sign means that any string can be present before the “.com”. But, since we don’t have a % at the end of “.com”, the value must end after it. So a value like madison@.comgmail would not be accurate and will be replaced by NULL.
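A sketch of this pattern match (table name assumed for illustration):

```sql
-- '%.com' matches any string that ends in '.com';
-- anything else becomes NULL.
select
    case
        when email like '%.com' then email
        else null
    end as email
from raw_sheets.campaign_data;
```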
Since all of our emails fit this condition, our resulting table looks the same.
There are so many different ways that dates can be entered into a spreadsheet. I’ve come across this issue frequently when using Google Sheets as a data source. Each time a new person entered data into the sheet, the formatting of the column would change! As you can see here, dates were entered differently, some with slashes and others with hyphens.
Luckily, there is a SQL function <span class='text-style-code'>to_date()</span> that converts strings into dates. When the data is first ingested, the columns with dates will be of data type varchar. Casting them using this function will turn them to dates with the data type of date, making the column ready for analysis.
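Since our dates mix slashes and hyphens, a defensive sketch might try each expected format explicitly with Snowflake’s try_to_date() variant. The table name, column name, and format strings below are assumptions about how the dates were entered:

```sql
-- try_to_date() returns NULL instead of erroring on a format mismatch,
-- so coalesce() picks whichever format successfully parses.
select
    coalesce(
        try_to_date(campaign_date, 'MM/DD/YYYY'),
        try_to_date(campaign_date, 'MM-DD-YYYY')
    ) as campaign_date
from raw_sheets.campaign_data;
```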
This will result in a column where the dates look like this:
Now our raw data has been cleaned into data our data analysts can actually use! These SQL cleaning functions are vital in creating usable data models. You will use them over and over again, so it is important that you understand how they work.
Here’s an overview of the SQL string functions we learned today:
- <span class='text-style-code'>split_part()</span> to split a string on a delimiter
- <span class='text-style-code'>lower()</span> to lowercase a string
- <span class='text-style-code'>try_to_number()</span> to test whether a value is numeric
- <span class='text-style-code'>iff()</span> and <span class='text-style-code'>round()</span> to conditionally replace and round values
- <span class='text-style-code'>len()</span> to count the characters in a string
- <span class='text-style-code'>charindex()</span> to find a character’s position in a string
- <span class='text-style-code'>LIKE</span> to match a pattern
- <span class='text-style-code'>to_date()</span> to cast strings to dates
The next step is to write your code in a SQL file to be used and automated by dbt. This code in conjunction with dbt will create an automated system where your data is always clean and available when your analysts need them. And this right here is what differentiates companies that are data-driven from those that are data-informed. But we will get into this in another article. Happy data cleaning!
Madison Schott is an Analytics Engineer, health & wellness enthusiast. She blogs about the modern data stack on Medium.