There's a buzz of excitement around data engineering right now, and for good reason. The field has not slowed down since its inception, and new technologies and concepts have been appearing particularly fast lately. As we near the end of 2022, it is a good moment to take a step back and evaluate the current state of data engineering.
What might today's data engineer role look like in the future? Will it even exist? In this blog post, I look at the past and the present of the data engineering role and examine emerging trends to offer you some predictions about the future.
The Past: From Business Intelligence to Big Data
To grasp the present and future of data engineering, it is essential to understand its historical development. So let’s start by recalling some of the most relevant events and technologies that emerged in the data landscape and gave rise to today's data engineering role.
Data warehousing was one of the earliest modern attempts to make sense of our data, and it can be traced back to the 1980s, when the first business data warehouses began to take shape. The term "data warehouse" was not formally used until the late 1980s, by Bill Inmon, who is considered the father of data warehousing. SQL became a standard database language in the 1980s, and we still use it today!
While Inmon established a solid theoretical foundation for data warehousing principles, Ralph Kimball's The Data Warehouse Toolkit from 1996 set the foundations of dimensional modeling.
With the introduction of massively parallel processing (MPP) databases, data warehousing initiated the era of scalable analytics. This made it possible to handle previously unimaginable data volumes. Jobs like business intelligence engineer were created to manage data warehouses.
In the early 2000s, after the dot-com bubble burst, a small group of companies remained – including Yahoo, Google, and Amazon – that would eventually become tech giants. The unprecedented growth of the tech giants inspired engineers to find more sophisticated solutions to address the more demanding data needs. The reason? The monolithic databases and data warehouses available then were not enough to handle the new workloads.
In response, Google published its famous Google File System paper, describing a “scalable file system for large distributed data-intensive applications,” in 2003, and the MapReduce paper, describing a paradigm for simplified data processing on large clusters, in 2004. Then, Yahoo's developers released the Hadoop distributed file system in 2006. At the same time, hardware like servers and RAM became inexpensive and commonplace.
The innovations of the 2000s, along with businesses of all sizes amassing data in the terabytes and even petabytes range, gave rise to what we know as Big Data, and the era of the big data engineer began.
At the time, big data engineers extensively used Apache Hadoop, an open-source framework. The Hadoop project is made up of four main modules:
- Hadoop Common: The standard utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
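To make the MapReduce programming model named in the last module more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped independently (in parallel on a cluster).
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each key could likewise be reduced independently.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The point of the model is that the map and reduce functions are independent per document and per key, which is what lets a framework distribute them across a cluster.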
To make the most of the Hadoop framework, data engineers needed a focused skill set that combined software development expertise with low-level infrastructure hacking. Later, they needed to acquire experience in other Hadoop-related Apache technologies, such as Hive and, more recently, Spark.
Around the time Hadoop appeared, Amazon decided to launch the first public cloud by making its internal technology available through Amazon Web Services (AWS). Other public cloud providers like Google Cloud and Microsoft Azure emerged soon after. The public cloud revolutionized how software and data applications are built and delivered, making it one of the most important technologies of the era.
One of the main advantages of the cloud is that companies save huge fixed up-front costs compared to buying their own hardware.
Because purchase cycles are long, companies that run their own hardware must accurately predict what their workloads will look like to ensure they have sufficient resources available. Since it is difficult to predict the future accurately, this inherently results in over-provisioning. With the cloud, on the other hand, companies pay only incremental costs for the resources they use, and these resources can easily be scaled up or down as requirements change.
Fast-forward to the early 2010s: several data-centric technologies emerged and rose to prominence, such as Amazon Redshift (often described as the first cloud-native massively parallel processing database), BigQuery and, more recently, Snowflake.
The early to mid-2010s was a remarkable moment in the world of data. At that point, any company could use the same state-of-the-art data tools as the most prominent IT corporations.
Along with cloud data warehouses, we got new tools to manage data workflows, some of the most popular being:
- Orchestration: Airflow
- Transformation: dbt
- BI: Looker, Metabase, Periscope, Mode, etc.
Justin Chau recently interviewed Alex Gronemeyer, senior data engineer at Airbyte, to talk through all the changes she has experienced firsthand as she transitioned through different stages in her career in the world of data.
“Coming from a [big] data engineering background where I was writing pipelines in Spark and Scala for Hadoop, we hadn't really moved into the cloud; it was all on-premise, [we did] a lot of Hive queries and called APIs directly to ingest data. There was a lot of heavy lift to get new data into our database, so I have a huge appreciation for [modern data stack] tools,” says Alex.
With the emergence of data products and the modern data stack, the big data engineer title became somewhat obsolete (in the end, all data is big these days, right?), and a broader, simpler term arose: data engineer.
In 2017, Maxime Beauchemin – the creator of Airflow – wrote a famous article describing the transition from being a business intelligence engineer to becoming a data engineer. At that time, the world became aware of the rise of the data engineer. This was further catalyzed when he published the functional data engineering article in 2018.
The Present: The Contemporary Data Engineer
Data professionals have seen a dramatic shift in their careers thanks to modern data tools, with previously menial tasks giving way to increasingly strategic ones. Since the specifics of Big Data frameworks have been abstracted away, the contemporary data engineer can focus on the bigger picture, increasingly taking care of tasks farther up the value chain, such as data modeling, quality, security, management, architecture, and orchestration.
At the same time, data engineers have been increasingly adopting software engineering best practices. Even though software and data engineering are different disciplines, they also share some essential similarities: both require solving a problem by writing, deploying, and maintaining code. Therefore, the contemporary data engineer is familiar with agile development, code testing, and version control practices – to name a few.
Apart from best practices, data engineering has also adopted concepts from software engineering. An excellent example is functional data engineering, which has its roots in functional programming. The main idea behind it is that a task – such as moving data from system A to system B – should be idempotent, which means it should deliver the same result every time it is executed. When tasks fail or the logic needs to be changed, we need to know that re-running the task is safe and will not result in duplicate data or any other type of incorrect state. Therefore, idempotency is essential for data pipeline operability.
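As a rough illustration of idempotency, the sketch below loads one day's worth of data by replacing that day's partition instead of appending to it, so re-running the task leaves the table in the same state. It uses Python's built-in `sqlite3` module as a stand-in for a real warehouse; the `events` table, its columns, and the `load_partition` function are hypothetical names chosen for this example.

```python
import sqlite3


def load_partition(conn, day, rows):
    """Idempotently (re)load one day's partition: replace it rather than append to it."""
    with conn:  # single transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM events WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO events (day, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL)")

rows = [(1, 9.5), (2, 20.0)]
load_partition(conn, "2022-11-01", rows)
load_partition(conn, "2022-11-01", rows)  # re-run, e.g. after a failure or a logic change

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)  # the second run did not duplicate the partition
```

A plain `INSERT` without the preceding `DELETE` would double the rows on the second run, which is exactly the kind of incorrect state idempotent tasks are meant to avoid.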
Another adopted concept is declarative programming, a level of abstraction on top of imperative programming that focuses on the what rather than the how. A declarative data pipeline would say, “move data from system A to system B,” without specifying the exact data flow. Declarative pipelines are relevant in data engineering because they’re a good foundation for observability, data quality monitoring, and data lineage.
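The difference can be sketched in a few lines of Python. In this toy example, the pipeline is declared as plain data (`Sync`), and a separate "engine" (`plan`) decides how to execute it. All names here are hypothetical and do not correspond to any particular tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sync:
    """Declarative description of a pipeline: what to move, not how to move it."""
    source: str
    destination: str
    streams: tuple


def plan(sync):
    """A toy engine that derives the imperative steps from the declaration."""
    steps = []
    for stream in sync.streams:
        steps.append(f"extract {sync.source}.{stream}")
        steps.append(f"load {sync.destination}.{stream}")
    return steps


def lineage(sync):
    """Because the declaration is just data, lineage can be read straight from it."""
    return {f"{sync.destination}.{s}": f"{sync.source}.{s}" for s in sync.streams}


sync = Sync(source="system_a", destination="system_b", streams=("users", "orders"))
print(plan(sync))
print(lineage(sync))
```

Since the declaration never mentions execution details, the engine is free to change how it runs the sync, while tooling for lineage or monitoring can inspect the same declaration without running anything.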
The declarative concept is highly tied to the trend of moving away from data pipelines and embracing data products – which is made possible by the abstractions provided by modern data orchestrators. Data engineers now think more about the product the pipeline is intended to deliver, such as a particular dashboard or view, and then build the pipeline based on that. The idea of data as a product is a core value of the data mesh architecture.
Python's rise has also reached data engineering, as the language has matured into a robust, general-purpose tool over the years. The data community has widely adopted Python thanks to its rich ecosystem of libraries for processing and displaying data. One prominent example is pandas, a library designed for data manipulation and analysis. Other important Python-based tools have appeared, such as PySpark, an interface for Apache Spark that allows interactively analyzing data in a distributed environment.
It’s safe to say that Python and SQL are the languages that any data engineer today must know.
The way organizations structure data teams has changed over the years. Now we see a shift towards decentralized data teams, self-serve data platforms, and ways to store data beyond the data warehouse – such as the data lake, data lakehouse, or the previously mentioned data mesh – to better serve the needs of each data consumer.
Even though organizational changes are controversial and experts have different opinions on what works best, we see a tendency to make domain experts owners of the data they use instead of having one central team in charge of “the source of truth.” As a result, many data engineers now belong to a central platform team responsible for optimizing different aspects of the data stack instead of owning data.
The main challenge with the above architectural and organizational switch is maintaining a common understanding of the data. That’s why we see the adoption of concepts like the semantic layer, which maps complex data into familiar business terms to offer a unified, consolidated view of data across systems.
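To give a feel for the idea, the sketch below models a very small semantic layer as a Python dictionary that maps business terms to SQL expressions, plus a function that compiles a request phrased in business terms into a query. The table name, column names, and `compile_query` function are hypothetical; real semantic layers are considerably richer.

```python
# Hypothetical mapping from business vocabulary to physical columns and expressions.
SEMANTIC_LAYER = {
    "table": "analytics.orders",
    "dimensions": {"country": "shipping_country", "month": "order_month"},
    "metrics": {
        "revenue": "SUM(amount_usd)",
        "order_count": "COUNT(DISTINCT order_id)",
    },
}


def compile_query(metric, by):
    """Translate a business question ("revenue by country") into SQL text."""
    dimension_sql = SEMANTIC_LAYER["dimensions"][by]
    metric_sql = SEMANTIC_LAYER["metrics"][metric]
    return (
        f"SELECT {dimension_sql} AS {by}, {metric_sql} AS {metric} "
        f"FROM {SEMANTIC_LAYER['table']} GROUP BY {dimension_sql}"
    )


query = compile_query("revenue", by="country")
print(query)
```

Because every team asks for "revenue" through the same definition, decentralized teams can own their data while still sharing a single meaning for each business term.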
So, how could the data engineering role of today be defined? Let’s take the definition from the Fundamentals of Data Engineering, as it’s one of the most recent and complete: “Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering intersects security, data management, DataOps, data architecture, orchestration, and software engineering.”
A data engineer today oversees the whole data engineering process, from collecting data from various sources to making it available for downstream processes. The role requires familiarity with the multiple stages of the data engineering lifecycle and an aptitude for evaluating data tools for optimal performance across several dimensions, including price, speed, flexibility, scalability, simplicity, reusability, and interoperability.
The Future: Where Is the Data Engineer Going?
It seems that most, if not all, of the data engineering trends point to increased abstraction, simplification, and maturity of the field by adopting software engineering best practices. What does this mean for the future? I see four main tendencies that can be summarized as follows:
- Data tools will decrease in complexity while adding more functionality and features.
- Specialization will increase, giving rise to new roles within data engineering.
- The gap between data producers and consumers will narrow.
- Data management will improve thanks to the adoption of DataOps.
Let’s go deeper into the above tendencies.
If we take a step back, we can see that one of the data engineers’ primary focuses has continually been establishing and maintaining connections between data sources and destinations via elaborate pipelines. A noteworthy development on this end that perfectly exemplifies the tendency toward easy-to-use tools and simplification is managed data connectors.
As Alex Gronemeyer mentions when asked about working with data connectors, “It was really interesting bringing in data from business systems, it just takes a couple minutes to set up a new connector and get data flowing in, and that was something I'd never experienced before. Then, when I wanted to start modeling a new dataset to be used in a report downstream, most of my work focused on the data modeling, cleansing, and joining things together. I didn't have to bake in another week of time just to get the data and see what it looked like; that was already taken care of.”
Airbyte is an open-source tool that provides hundreds of off-the-shelf data connectors. For example, you could create a data pipeline from Postgres to Snowflake without writing code. This new generation of data tools is exciting and appealing even for highly technical professionals because removing the need to write yet another ELT script frees up time and bandwidth for initiatives that matter more to their company. This trend shows no sign of slowing down.
The immediate result of simplified tools that allow any data practitioner – such as data analysts and data scientists – to set up a data pipeline in minutes is that data engineers are no longer bottlenecks. Self-serve analytics will likely continue to empower downstream data consumers in the future.
As mentioned before, new roles in the data world might emerge, just like the analytics engineer role that appeared in the early 2020s. An analytics engineer is a professional who most likely started their career as a business or data analyst and is therefore well versed in SQL and building dashboards. Self-serve platforms and transformation tools like dbt allow them greater autonomy from data engineers. We might see more of these specialized roles appear in the future.
Companies are ingesting more data than ever before, thanks to the expanded capabilities given by improved data tools. As more stakeholders interact with data throughout its lifecycle and make decisions based on it, being able to trust the data has become critical. Because of that, data quality will remain at the top of a data team's priority list.
Increased focus on data quality has recently led to the emergence of a new role that, I believe, will continue to grow: data reliability engineer, a specialization of the data engineering role that focuses on data quality and availability. Data reliability engineers apply DevOps best practices to data systems, such as CI/CD, service-level agreements (SLAs), monitoring, and observability.
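As an illustration of what such monitoring can look like, the sketch below runs two common checks against a dataset: a freshness check against a service-level target and a null-rate check on a column. The function names, thresholds, and sample rows are all assumptions made for this example, not taken from any specific tool.

```python
from datetime import datetime, timedelta


def check_freshness(last_loaded_at, now, max_age):
    """Pass if the most recent load is no older than the agreed maximum age."""
    return now - last_loaded_at <= max_age


def check_null_rate(rows, column, max_null_rate):
    """Pass if the share of missing values in `column` stays within the threshold."""
    if not rows:
        return False  # this sketch treats an empty dataset as a failure
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) <= max_null_rate


now = datetime(2022, 11, 1, 12, 0)
rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}, {"user_id": 4}]

results = {
    "fresh_within_6h": check_freshness(datetime(2022, 11, 1, 8, 0), now, timedelta(hours=6)),
    "user_id_nulls_under_10pct": check_null_rate(rows, "user_id", 0.10),
}
failed = [name for name, passed in results.items() if not passed]
print(failed)  # names of the checks that should trigger an alert
```

In practice, checks like these would run on a schedule or as part of CI/CD, and a non-empty `failed` list would notify the owning team.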
Titles and responsibilities may also shift on the other side of the spectrum, where software engineering and data engineering meet. The shift may be propelled by data apps that combine software and analytics. It’s possible that in the future, software engineers will need to be well-versed in data engineering. With the advent of streaming and event-driven architectures, the separation between upstream backend systems and downstream analytics will fade.
The trend of data producers becoming more conscious of analytics and data science use cases will continue to grow. There’s already an increasing adoption of data contracts, which are agreements between the owner of a source system and the team responsible for ingesting its data into a data pipeline. This suggests an even tighter coupling between producers and consumers in the future.
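A data contract can be as simple as an agreed schema that both sides check records against. The sketch below is a minimal, hypothetical version: the `CONTRACT` fields and the `violations` function are invented for illustration, and real contracts typically also cover semantics, ownership, and service levels.

```python
# Hypothetical contract: field names and the Python types the consumer expects.
CONTRACT = {
    "order_id": int,
    "customer_email": str,
    "amount_usd": float,
}


def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected_type in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


good = {"order_id": 1, "customer_email": "a@example.com", "amount_usd": 9.99}
bad = {"order_id": "1", "amount_usd": 9.99}

print(violations(good))
print(violations(bad))
```

Running a check like this in the producer's test suite lets a breaking schema change surface before it reaches the downstream pipeline.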
If we look at the big picture – beyond technology or tools – the data ecosystem is moving towards increased collaboration between stakeholders. This has led to the development of new mindsets, such as DataOps. As defined by Gartner: “DataOps is a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers across an organization.”
At the end of the day, the explosion of new data tools and practices converges on solving a persistent set of problems: managing data, working better together, and providing value. I expect this area to improve dramatically in the coming years.
Some wonder whether all of the circumstances mentioned above will lead to the disappearance of data engineers in the future. I don’t believe that will be the case. More sophisticated tools, the fading gap between producers and consumers, and the implementation of DataOps mean that data engineers will focus on more strategic tasks, acting less as intermediaries and more as advisors and enablers of automation.
Titles and responsibilities will also morph, potentially rendering the “data engineer” term obsolete in favor of more specialized and specific titles. But data engineering will always be necessary, as companies increasingly rely on data and require the development of new data-driven infrastructure and processes.
The future data engineer will be responsible for designing flexible data architectures that adapt to changing needs. That includes making decisions about tools and processes that provide the most value to the business.