Despite the “free spirit, anything goes” narrative we’ve heard about data, there are many rules that address the security and complexity of moving data between destinations. There’s not a single law, but a variety of them, and in some cases, the repercussions for straying from them can be severe. Navigating which ones apply to a given dataset can be the biggest challenge to following them, and the best starting point is also probably the easiest: recognizing what aspects of your intended data transfer might trigger regulatory requirements.
To stay on top of data security, here are four key questions to consider before you move data.
Data protection regulations are tied to categories of data, which means the first step is understanding the type of data you intend to move. This usually means evaluating the dataset at the field level rather than as a whole, since different fields can carry different obligations. There are multiple ways to categorize data, and data can be more than one type, each one potentially triggering its own set of requirements.
To help categorize the data, consider the following:
Does the dataset contain personal data?
Personal data is a big umbrella term for any information that can be used (directly or indirectly) to identify an individual, so it catches a lot. If the data can directly identify a person, then it is considered “personally identifiable information” (PII). Examples include name, address (home or email), phone number, and Social Security number. Almost half of US states have data security laws outlining safeguard requirements for this information, including protocols around its transfer.
Some personal data, like IP addresses, geolocation, cookie IDs, birth dates, or mobile advertising identifiers, is device-based or not otherwise directly linkable to an individual; it is still considered “personal data” because it can be used indirectly, or in combination with other information, to identify a specific individual. Five states (and more soon to come) have data privacy laws imposing obligations for the protection of this general personal information, including for its transfer, sharing, and portability.
Note that information can represent an individual and not be personal data. This category of data might be called de-identified or “pseudonymous”. Data can be transformed through encryption, hashing, or other transformations to be de-identified or pseudonymized, and you can think of this transformation as a basic security measure that lowers the risk of harm to individuals if the data is mishandled. Because there’s lower risk, de-identified data is typically subject to less stringent requirements. Some data breach laws create a “safe harbor” mechanism that absolves companies of certain obligations in the event of a security incident if the data that was breached was de-identified. Other data security laws may even exempt de-identified data altogether as long as certain criteria about that first security step are met, such as the application of an irreversible transformation or the separation of the encryption key from the data itself in storage. In any case, de-identification of personal information at rest and in transit has become such a staple of standard security practices that the absence of encryption or the failure to use secure transfer protocols leaves companies open to class action lawsuits over “unreasonable” security protections.
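As a rough illustration of pseudonymization, here is a minimal sketch in Python using only the standard library. The dataset, field names, and key are hypothetical; the point is that a keyed, irreversible hash replaces the direct identifier, and keeping the key stored separately from the data mirrors the key-separation criterion some laws impose.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice it would live somewhere separate
# from the data (e.g. a secrets manager), satisfying the key-separation
# criterion some data security laws require.
PSEUDONYMIZATION_KEY = b"store-me-somewhere-else"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, irreversible token."""
    return hmac.new(
        PSEUDONYMIZATION_KEY, value.encode("utf-8"), hashlib.sha256
    ).hexdigest()

record = {"email": "jane@example.com", "purchase_total": 42.50}
safe_record = {**record, "email": pseudonymize(record["email"])}
# The token is stable (same input -> same token), so joins across
# datasets still work, but the original email cannot be recovered
# without the key.
```

Note the trade-off: a keyed hash (HMAC) rather than a plain hash means the transformation cannot be reversed by simply hashing candidate values, which is what makes the result defensible as pseudonymized rather than merely obscured.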
Most data protection regulations apply only to personal data. Data that is anonymous, such as aggregate statistics or device-level information that cannot be tied back to a person, does not pose a risk to individuals and is generally not covered by data protection regulations at all. So if you don’t have personal data, you’re probably in the clear.
Is any of the personal data sensitive?
If you are working with personal data, then another consideration is whether it is “sensitive”. Sensitive data reveals things like an individual’s sexual orientation, interests, medical history, biometric and genetic information, race or ethnicity, religious beliefs, or political opinions – information a person might reasonably feel is private.
Determining whether your dataset meets the threshold for sensitive personal data is important because many regulations impose stricter requirements for sensitive personal data as part of their provisions. And some personal data is sensitive enough to have requirements specific to its protection. For example, medical information, if originating from a healthcare or health insurance provider, is protected by the Health Insurance Portability and Accountability Act (HIPAA); many states have special biometric data security laws; and some industry standards for special categories of data, such as the Payment Card Industry Data Security Standard (PCI DSS) for credit card information, are widely adopted enough to have become enforceable as “reasonable” data security practices by the Federal Trade Commission (FTC).
If you determine that you’re working with personal data, the next consideration is where it originated. Origin can be many things. It can be the state or country in which the individual resides or was located at the time the data was collected. It can also be the type of company that collected the data, as in the HIPAA example of a healthcare provider. It could also be the entity from whom the data was purchased.
Every state has a security breach law that effectively mandates companies have “reasonable” data security protections in place for the collection, storage, and disclosure of their citizens’ personal data. If the personal data is about an individual from one of the five US states that have a privacy law on the books (CA, VA, CO, UT, and CT), then the movement of the data – whether you originally collected it or you purchased it from someone else – could trigger more specific obligations. And if the personal data was collected from an individual outside of the US, then a slew of obligations likely apply if the data will be processed in or transferred to the US. Countries in the European Union, for example, protect their citizens’ data privacy through the General Data Protection Regulation (familiarly, the GDPR), and despite several legal cases over the last several years, the US and the EU have agreed at various points on the conditions that must be met for legal cross-border transfers.
Data protection laws are framed in terms of data controllership, so it’s important to think about where you’re sending your data and why, particularly if the dataset contains personal information. Is the data going to a service provider for marketing purposes? To a customer for their internal use? Or from a sales platform to a database for analytics? Even data going to or between your own database(s) for analytics purposes can trigger requirements if the data originated outside the US or is being transferred to another country.
You may not have complete answers to these questions, and it can be hard to guess. But know that, at the very least, many regulations require you to keep a record of where personal data goes, and most want the record to indicate the purpose. So record what you can.
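A record of this kind need not be elaborate. As a sketch (the field names here are illustrative assumptions, not a schema mandated by any regulation), it could be as simple as a structured log entry per transfer:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TransferRecord:
    """Illustrative log entry for a personal-data transfer.
    Field names are assumptions, not a regulatory schema."""
    dataset: str
    destination: str
    purpose: str            # regulations generally want the "why" recorded
    data_categories: list   # e.g. ["email", "geolocation"]
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

transfer_log = [
    TransferRecord(
        dataset="crm_contacts",
        destination="analytics warehouse (US)",
        purpose="internal analytics",
        data_categories=["email", "geolocation"],
    )
]
```

Even a lightweight record like this captures the two things most regimes care about: where the data went and why it went there.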
Lastly, you should consider how your data is getting from point A to point B. Data can be transferred in many ways, physically and digitally, and data protection laws have been around long enough to cover both. For the most part, the security requirements are the same whether you’re transferring data using your own infrastructure or a third-party tool, though some data protection laws require that you record all the tools used to process personal data, and a transfer tool would be one (e.g. GDPR Article 30, Records of Processing Activities).
If you’re using a service provider or third-party software to move the data, consider whether they will have access to the data you’re sending by virtue of their service design and what protections they have in place to prevent access or protect your data while in their systems. Even if they don’t have access, consider whether the transfer itself has security controls in place to protect the data from unauthorized access while in transit.
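One concrete in-transit control, when your own code is doing the moving, is to insist on modern TLS with certificate verification. A minimal sketch using only Python’s standard library (how the context is then used, e.g. with `urllib` or `smtplib`, depends on your stack):

```python
import ssl

# Build a client-side context that refuses anything weaker than TLS 1.2
# and verifies the server's certificate against the system trust store.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2

# create_default_context() already enables verification, but asserting
# the settings makes the security posture explicit and testable.
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# The context is then passed to the transfer layer, for example:
#   urllib.request.urlopen(url, context=context)
```

Controls like this protect the data from unauthorized access while it moves; they do not, of course, address what the recipient can see once it arrives, which is the separate question covered above.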
Data is among your company’s most valuable assets, which means data security is a top priority. Protecting data is the responsibility not just of security, legal, and privacy teams, but also of everyone who handles it. In fact, the people who work with data are a company’s first line of defense for data protection, and just knowing what to watch for can be all it takes to make sure nothing is missed. So whether you’re responsible for complying with the rules or you have a legal team to assist you, answering these questions first will be the best way to make sure your data transfer doesn’t trigger any alarm bells.
Interested in trying Airbyte to future-proof your data pipelines? Get a free trial of our fully managed solution, Airbyte Cloud. You're also welcome to join our community Slack channel to share thoughts and questions with thousands of data engineers.
Get all your ELT data pipelines running in minutes with Airbyte.