With the responsibility of tracking and storing data comes the need to keep it secure. Oftentimes businesses are storing valuable information about their customers, their behaviors, and how they interact with their products. Customers expect businesses to keep this information secure, only for the company to see and use. Security is a key piece of data governance, defining how are protecting your data from getting into the hands of others. When I think of security in relation to data governance, three things come to mind:
- Data masking: Protecting customer privacy by hiding PII data
- Access control: Limiting who can see and change data
- Classifying and tagging: Organizing data to help quickly identify sensitive information
Luckily, Snowflake makes it easy to ensure you are following all of these best practices when it comes to security. Snowflake as a cloud data product puts security above all else, ensuring you have tight control over who accesses and views certain pieces of data.
Snowflake Data Masking
If you aren’t familiar with data masking, this refers to the hiding of column values at loading. So, the values still exist in the data, however, they are masked depending on who is viewing them. This is helpful in case you need to store these values and use them in compliance processes or even a downstream model. However, you can control which users can see them otherwise.
Data masking is a great feature to use with a data ingestion tool like Airbyte that does not automatically mask PII data, but you can use the low-code connector development kit to transform data before it’s loaded to destination. Using the dynamic data masking available within Snowflake we can hide PII data ingested by Airbyte at run-time.
Let’s go over how to do just this.
First, whoever is apply the masking policies should have the MASKING_ADMIN role. This role should be granted privileges to create, apply, and apply on a masking policy. You can create this role and assign it privileges by running the following commands:
Remember that you also have to grant this role to the user who will be applying the masking policies. Typically this will be an analytics engineer or data engineer on the team.
Now, you can create a masking policy! This is essentially a case when statement in SQL that dictates the conditions for masking a column value and what the masked value will look like.
Let’s say we have a column with someone’s credit card number. Analysts, engineers, and whoever else uses your cloud platform for analysis should not be able to see this value. The only person who should be able to see this is a customer service member. We will want to limit access to only those with the role CUSTOMER_SERVICE. Everyone else should not have access. We would be able to write a masking policy that looks like this:
This `hide_value` policy now says that only users using the CUSTOMER_SERVICE role can see the value of the column this masking policy is being applied to.
Next, you need to apply this policy to a column in a table or view. You can apply it to one when you first create it or alter it after it’s already created to include the policy.
Now you would only be able to see the credit card number value when using the role CUSTOMER_SERVICE. When using any other role you would see '*********' for every value in the `credit_card_number` column.
Another option for masking data in Snowflake is to use a basic case when statement in one of your base models that reads from your raw data. Because most users don’t have access to the raw data in your cloud data platform but do so the views, this is a great place to utilize data masking. Because dynamic data masking is only available with an enterprise account in Snowflake, this makes for a great alternative when that’s not an option.
Let’s say you have a base model called `user_payments`, just like the table above. This exists as a view, reading from the` user_payments` table in your RAW database. When you write the SQL file that creates your base model, simply add in a case when statement similar to the one you included in your masking policy.
Just make sure your raw data access is restricted to those not allowed to view PII data!
Snowflake Access Control
When utilizing a cloud data platform like Snowflake, limiting access depending on users and roles is key to maintaining a secure environment. Users should always have the least amount of access needed to do their jobs. This means that an analyst who uses base models in their analysis and doesn’t write them themself doesn’t need access to raw data, for example. Limiting access based on what’s needed to do their job will help protect your data from being compromised in any way.
Because we extensively discussed Snowflake users and roles and the types of control they should have here, we won’t go into it too much in this section. But, here are some of the key points to keep in mind:
- You can always assign a role READ permissions if they need to see the data but you don’t want them to have the power to change it.
- Only the analytics engineer or the person owning ingestion and transformation should have access to raw data.
- Create a role for each new set of permissions. These can then be assigned to users.
- Create a user for each new tool you connect to Snowflake. This will help you better manage which tools are touching what.
Tagging and Classifying in Snowflake
Within Snowflake, a tag is a schema-level object that can be assigned to another Snowflake object. Tags always contain string values. They are great to use to keep track of a certain group of Snowflake objects as part of a larger data governance strategy. You can use tags to help you with compliance, data discovery, tracking sensitive information, or tracking resource usage. There are tons of possibilities aside from helping with security.
Just like with masking policies, Snowflake recommends that you create a role specifically for assigning tags. You can do this like so:
Then be sure to assign this role to the user who will be creating tags.
Now you are ready to create a tag! Let’s say we want to tag schemas within our databases with a tag called `sensitive_data`. The purpose of this tag is to let us know whether a schema contains PII data or not. We need to create the tag and specify the values that it is allowed to contain- either TRUE or FALSE.
Now, when you assign a schema a value for this specific tag, you will only be allowed to choose TRUE or FALSE. This helps to protect your tags from having values that don’t make sense for your use-case.
Similar to masking policies, you can assign a tag to an already existing object or to a new one that you create. To add a tag to a column in an already-existing view, we would do this:
You have set the tag `has_sensitive_data` to be true for the `credit_card_number` column! Now, what if we want to track these tags and find all of the columns that have the `has_sensitive_data` tag? We can do this by running a few different queries.
In order to see all of the tags created in your account, run the following:
The following will show you all the tags and their values listed on the objects within a table or view:
This query will show you the tags assigned to a certain column in your schema:
When using these tags to group different types of schemas, tables, or columns, this is referred to as classification by Snowflake. Classification can be helpful to identify all of the objects with a specific tag and then assign a masking policy to them as previously discussed.
For example, if we wanted to apply the masking policy we created earlier on columns that have sensitive data, we could first identify these columns by running the following query:
Then, for each column that results from the previous query, we can run our command to apply the masking policy to it.
This is a great way to use classification and data masking together to ensure you have a data environment that is secure in protecting customer information.
Keeping your data secure is one of the most important things to do when implementing a data governance strategy. While data governance refers to the internal standards set within your company, one of those standards is security. How secure your data is bleeds into other major areas of data governance such as availability and usability. When security isn’t prioritized first, these areas tend to suffer as a result.
Snowflake allows you to keep your data secure by implementing masking policies, tight access control, and a detailed classification system. Using a tool that puts security first gives you confidence in your data moving forward to other initiatives like transformation and observability.