Enterprises often retain massive amounts of unstructured data in cost-effective storage so that at some point in the future it may be analyzed or repurposed for competitive advantage. In-fact, a Google search for “what percentage of data is unstructured, “what percentage of data is semi-structured”, and “what percentage of data is structured”, returns the following numbers:
- Approximately 70% to 90% of the data generated by enterprises is unstructured data.
- Approximately 5% to 10% of the data generated by enterprises is semi-structured data.
- Approximately 20% of the data generated by enterprises is structured data.
Even though the majority of data that is collected is unstructured, many of the tools used by large enterprises for data analysis and business intelligence are designed for handling structured data (not unstructured data). This disconnect frequently leads to scenarios where companies collect massive amounts of data, but they don’t have the tools or the skills to analyze or make use of it!
If your company is collecting large amounts of unstructured data, then additional engineering efforts will likely be required before Data Analysts can effectively analyze your data. In this blog you will learn about the difference between structured data, semi-structured data, and unstructured data. You will also see why unstructured data must go through additional processing before it can be used for analysis. Finally, I will briefly touch on some of the tools that can be used to structure your unstructured data.
The disconnect between Managers and Data Analysts
One of the main problems with storing unstructured data is that management often doesn’t realize the amount of data wrangling and data cleansing that needs to be done before a Data Analyst can make use of the data. This is encapsulated in a tweet by Seth Rosen:
In other words, you can’t efficiently run SQL, analyze, or visualize your data until it has been structured. The lack of understanding about what can and cannot be done with unstructured data leads to conversations such as the following between Managers, Data Analysts, and Data Engineers in enterprises across the world:
Manager: Hey Data Analyst! We have a huge amount of data in our data lake, and I would like you to use it to provide us with some analysis and visualizations. Could you get that to me next week?
Data Analyst: Not really. The data that we have in the data lake is unstructured, and I can't do much with it until it's structured.
Manager: What do you need in order to structure it?
Data Analyst: Our data engineers have the skills to do that.
Manager: Oh, I see – OK, I’ll talk to the Data Engineer. Hey, Data Engineer! Can you structure this data by next week?
Data Engineer: Next week sounds a bit unrealistic. I’ll need to talk to the teams that have created the data in order to be sure that I understand how it is organized and what it means. After that, I can write some custom code or use a pre-existing tool to extract information from the unstructured data, and then store it in a structured table. Let’s set up a project and allocate time and resources to do this, it could take some time!
Manager: You mean that we have all this data, but we can’t use it unless we spend a bunch of engineering resources to process it first?
Data Engineer: That is correct.
Manager: My boss is not going to be happy … Can you help me to get a better understanding of the difference between structured and unstructured data? I’m going to need some good justification for this!
Data Engineer: Sure, I’ve written a summary of the difference between structured, semi-structured, and unstructured data for you. Keep reading!
What is unstructured data
Unstructured data is data that does not conform to a data model and that has no easily identifiable structure. Unstructured data cannot be easily queried and is difficult to analyze. Examples of unstructured data could be the contents of an email, contents of a word document, data from social media, photos, videos, survey results, etc.
One example of unstructured data is free-form text that contains interesting information, but that has not been formatted or stored with a well defined schema. You could imagine a scenario where a team has been asked to gather the name, age and phone number of people without specific instructions about how the data should be gathered. This could look as follows:
When data is stored in the above format, it is difficult and computationally expensive to query the data. For example, if a data set contained billions of rows with such data, then finding all the records where the age is 29 would be slow and inefficient.
What is structured data
Structured data refers to data that has been formatted into a well-defined schema. An example would be data that is stored with precisely defined columns in a relational database. Examples of structured fields could be age, name, phone number, etc. Storing data in a structured format allows it to be easily understood and queried with SQL and used by analytics tools.
If your enterprise has gathered large amounts of unstructured data, then it is likely that you will need to transform it into structured data before you can efficiently make use of it. Extracting structured data from unstructured data is often done during the data transformation stage in either an ETL or an ELT process.
In order to efficiently make use of the unstructured data given in the previous example, it could be transformed it into a structured table the looks like the following:
Storing the data in a structured manner makes it much more efficient to query the data. Finding the users where their age is 29 becomes a trivial operation for a database or data warehouse, and could be achieved with the following SQL statement:
ℹ️ Data analysts typically work with structured data stored in relational databases or data warehouses to solve business problems using languages such as SQL, R or Python and visualization tools such as Tableau, Grafana, or Looker Studio.
What is semi-structured data
Semi-structured data is data that lacks a rigid structure and that does not conform directly to a data model, but that has tags, metadata, or elements that describe the data. Examples of semi-structured data are JSON, CSV, or XML files.
Semi-structured data often contains enough information that it can be easily converted into structured data. For example, a text file containing a JSON representation of the same data as presented above could look as follows:
If you are working with a tabular/relational database, then you will likely want to transform the semi-structured data into a table with columns corresponding to the field names, and rows corresponding to each JSON object.
ℹ️ Some NoSQL data stores (eg. MongoDB and Elasticsearch) work directly with JSON objects, and some data stores such as Snowflake support querying semi-structured data.
How can you transform unstructured data into structured data
If you are using an ELT pipeline to load your data into a target system, then you will often want to transform the raw unstructured data into a structured format after it has been stored in the destination. Common methods for transforming data after it has been loaded into the destination include dbt, Spark, or custom code. It is worth noting that Airbyte has excellent support for transforming your data with dbt – however, a detailed discussion on how to transform data is beyond the scope of this article.
On the other hand, if you are using the ETL (legacy) approach for data integration, then data is transformed inside the ETL data pipeline before it is loaded into the destination. In this case, the method for transforming the data would be tool-specific and is also outside the scope of this article.
Collecting data is an important step of your data integration strategy, and this step is well served by Airbyte. Given that up to 80% of the data that is collected by enterprises is unstructured, another commonly required step is the conversion of your data into a structured format that can be efficiently used by SQL and analytics tools.
If you haven’t yet tried out Airbyte to fulfill your data integration needs then you may wish to try Airbyte Cloud for a managed experience, or download Airbyte Open-Source to deploy it in your own infrastructure!