The end of the year is a wonderful time to reflect not only on a personal level but also on what happened in the data ecosystem. Since data engineering is a rapidly developing discipline, it can be challenging to keep up with developments and gain a thorough understanding of the state of the industry at any given time.
Whether you are interested in becoming a data engineer or are already a data engineer looking to improve your skills, you've come to the right place to get your dose of what’s happening in the field.
In this blog post, written in collaboration with Simon Späti, we walk you through the 12 most important trends in data engineering (plus a bonus!) and how they'll shape the industry in 2023.
Language: SQL and Python Continue to Dominate, With Rust Looking Promising
Any data engineer must be familiar with Structured Query Language (SQL) – the universal language of databases and data warehouses. Data engineers may have lower complexity SQL requirements than data analysts, but SQL remains essential for anyone in data and will likely be for the foreseeable future.
For some years now, Python has been a must-have in any data engineer's toolbox, even surpassing Java's popularity in the field. The wide Python adoption is partly due to third-party libraries for processing and transforming data, such as pandas. Another popular tool is PySpark, a Python interface for Apache Spark that enables interactive data analysis in a distributed setting.
Spark (whether written in Python or Scala) remains widely used, but it's typically reserved for intensive data processing or for working with distributed semi-structured files (like CSV or Parquet) on Amazon S3.
Recently, though, a new contender has emerged: Rust. As this article explains, Rust improves some of the problematic aspects of Python for data engineering.
For example, Rust is an ahead-of-time compiled language, while Python is an interpreted, dynamic language. As a result, when programming in Rust, bugs can be discovered early in the development process. With Python, you'll only find some bugs at run-time, sometimes when the code is already in production.
Additionally, Python is dynamically typed, whereas Rust is statically typed. Data pipelines, in particular, benefit from a statically typed language because they ingest data from various sources over which they have no control. Enforcing pre-defined type expectations at compile time helps avoid data compatibility problems downstream.
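To make the difference concrete, here is a minimal sketch (with hypothetical record data) of how a schema change only surfaces at run-time in Python, where a statically typed language like Rust would reject the mismatch at compile time:

```python
# Hypothetical ingestion step: upstream sends records as dicts, and nothing
# checks their types until the code actually runs.

def total_revenue(records):
    # Fails at run-time if any "amount" arrives as a string like "5.00"
    return sum(r["amount"] for r in records)

good = [{"amount": 19.99}, {"amount": 5.00}]
bad = [{"amount": 19.99}, {"amount": "5.00"}]  # a source changed its schema

print(round(total_revenue(good), 2))  # 24.99

try:
    total_revenue(bad)
except TypeError as err:
    # A statically typed, compiled language like Rust would reject this
    # mismatch at compile time instead of surfacing it here.
    print(f"run-time failure: {err}")
```

In a pipeline, that `TypeError` might only appear weeks later, when the first malformed record arrives in production.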
Besides the above, Rust is also fast and platform-independent, but it has a few disadvantages over Python. For example, it’s not as easy to learn – a higher degree of coding knowledge is required to use it efficiently, and it doesn’t have nearly as many libraries and frameworks.
Another general downside of Rust is that it’s not compatible with frameworks that use the JVM.
The above doesn’t mean you need to stop improving your Python or Scala skills in favor of Rust, but it’s something to keep on your radar.
Abstraction: Custom ETL Is Replaced by Data Connectors
Building and maintaining ETL or ELT pipelines to move data between various systems has been a major part of the job of data engineers. While that is still part of the work, things are changing fast.
As we've highlighted in a previous article, the field is moving toward greater abstraction and simplification. Out-of-the-box data connectors are part of that trend as they remove the need to write yet another ETL script.
All data-driven companies around the world are trying to solve their data integration needs by writing very similar scripts. Why not abstract the details and use a product instead of reinventing the wheel?
That’s what data connectors do: abstracting implementation details so you can create an ELT pipeline from MySQL to Snowflake, for example, without programming or maintenance. You usually just need to select a source, a destination, and a synchronization mode (full or incremental load) and frequency.
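As an illustration, a connector-based pipeline boils down to a small declarative configuration rather than code. The keys and values below are hypothetical, not any specific product's API:

```python
# Illustrative only: the keys and values below are hypothetical, not any
# specific product's API. A connector-based pipeline is declared, not coded.
pipeline = {
    "source": {"type": "mysql", "host": "db.internal", "database": "shop"},
    "destination": {"type": "snowflake", "warehouse": "ANALYTICS"},
    "sync_mode": "incremental",  # or "full_refresh"
    "schedule": "0 * * * *",     # hourly, as a cron expression
}

# The connector handles extraction, typing, and loading; you only declare
# *what* should move, not *how*.
print(pipeline["sync_mode"])
```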
This new generation of data tools, like data connectors, is appealing even to highly technical professionals who can focus on more strategic endeavors instead of creating new solutions from scratch. In addition, these tools allow any data practitioner – such as data analysts and data scientists – to set a data pipeline easily without data engineers being a bottleneck.
In short, your job as a data engineer will likely be spent on more strategic tasks rather than programming the nth ELT.
Latency: Better Tools Available in the Streaming Domain
Streaming may become more popular than batch processing because it allows data to be analyzed in real time as it is generated, rather than waiting for a batch of data to be collected before processing it.
Streaming use cases are still rare as analytics focuses on the past instead of data generated at the moment. Still, it can be helpful in various situations, such as monitoring and detecting fraud, financial services, and IoT.
Additionally, streaming data is often more scalable and efficient than batch processing, making it a better choice for dealing with large amounts of data. With batch processing, data must be collected and stored before processing, which can introduce significant delays. In contrast, streaming data can be analyzed as it is generated, allowing faster and more timely decision-making.
Zach Wilson, Staff Data Engineer at Airbnb, predicts that in 2023, streaming data engineering jobs will account for 15-20% of all data engineering jobs and pay the most.
While both streaming and batch processing have their challenges, streaming data is generally more difficult to work with. Because streaming data is continuous and unbounded, it can arrive at a high rate and in potentially unlimited volume. That makes it difficult to store and process in real-time, requiring specialized infrastructure and algorithms designed to handle high-velocity data. It's also harder to recover from failures, as:
- Streaming data is often processed in real-time, meaning any delays or outages can have immediate and potentially significant impacts on the system.
- The continuous nature of streaming data means that it is difficult to "rewind" the stream and reprocess data that may have been lost or corrupted during a failure—making it harder to ensure that the results of a streaming data pipeline are consistent and accurate.
In contrast, batch data is typically stored in a fixed location and is processed in smaller, discrete chunks, which makes it easier to manage. However, this also means that batch processing is less flexible and cannot handle real-time data analysis.
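The difference can be sketched in a few lines of Python: a batch job computes one answer after all the data has arrived, while a streaming job (here, a generator over hypothetical event values) emits an updated answer per event and never assumes the input ends:

```python
import statistics
from typing import Iterable, Iterator

def batch_mean(events: list) -> float:
    # Batch: wait until all events are collected, then process them once.
    return statistics.mean(events)

def streaming_mean(events: Iterable) -> Iterator[float]:
    # Streaming: update the result as each event arrives; the input may be
    # unbounded, so we never assume we have seen the end of it.
    total, count = 0.0, 0
    for value in events:
        total += value
        count += 1
        yield total / count  # a fresh answer after every event

events = [10.0, 20.0, 30.0]
print(batch_mean(events))            # one answer at the end: 20.0
print(list(streaming_mean(events)))  # an answer per event: [10.0, 15.0, 20.0]
```

Real streaming engines add the hard parts this sketch ignores: windowing, out-of-order events, checkpointing, and recovery.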
As time passes, tooling for data engineers to recover from failures and avoid stream inconsistencies keeps improving; examples include Apache Flink, Spark, Kafka, Beam, and NiFi.
Architecture: The Lakehouse and the Semantic Layer Combine Strengths
With the growth of data and the people working with it, it’s even more critical to make it accessible to everyone. Instead of locking into proprietary tools and formats, people put the data into open Data Lakes, Warehouses, and Lakehouses for everyone to access.
But not everyone is technical enough to use dbt to make sense of the raw data. That's where the Semantic Layer approach comes in. Given how inefficient it is to copy data not only during integration but also during transformation, it makes sense to leave the data where it is and use advanced in-memory techniques to query it on the fly.
The lakehouse and the semantic layer share this goal: leave the data in the lake or warehouse and apply analytics and machine learning on top of it. That approach might also be cheaper because the data isn't duplicated; an initially interesting business KPI that turns out to be no longer valid after a couple of months doesn't keep costing you anything.
This architecture also improves Data Governance as standard features like data modeling, access control, optional caching, and API interfaces are handled at the top level once for all data storage. It usually also reduces the tendency of creating data silos or ingesting into another layer, such as an OLAP Cube, as data can be shared easily across the organization.
These will be more important in 2023 as the data engineering stack grows, increasing the need for standard ways to fetch data and handle access permissions.
Trust: Data Quality and Observability Become Essential
Improved data tools and technologies have allowed businesses to ingest, store, and analyze record amounts of data. As more companies become data-driven and more stakeholders make decisions based on data, being able to trust such data has become crucial.
But still, it happens to every data engineer: you deploy a shiny new data pipeline, only to be told by a data analyst that something doesn’t look right. Now you need to debug, fix the root cause of the issue, and backfill the data.
Data downtime – periods when data is incomplete, missing, or wrong – is at the top of the list of difficulties that a data engineer must tackle. As Ari Bajo discusses, there are various reasons why minimizing downtime and maintaining data quality is a challenge for data engineers.
The problem is that, instead of developing a comprehensive strategy to handle data outages, teams frequently address data quality issues on an ad hoc basis. However, this is changing.
Observability, a relatively recent addition to the engineering lexicon, refers to the monitoring, tracking, and triaging of events to reduce downtime and speaks to the need to be proactive rather than reactive. As DevOps adds observability to software, data teams increasingly apply it to data as a more holistic approach to data quality. Other best practices include documenting and testing data as you go.
New data observability tools, fortunately, employ automated monitoring, root cause analysis, data lineage, and data health insights to assist us in detecting, resolving, and preventing data downtime.
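As a minimal sketch of the kind of checks such monitoring automates, here is a hand-rolled quality gate over a hypothetical batch of rows; the column names and thresholds are illustrative assumptions, not any tool's API:

```python
# A minimal sketch of automated data quality checks: volume, completeness,
# and freshness. Column names and thresholds are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def check_batch(rows, min_rows=1, max_null_rate=0.05,
                max_age=timedelta(hours=24)):
    issues = []
    if len(rows) < min_rows:
        issues.append("volume: batch is unexpectedly small")
    null_rate = sum(1 for r in rows if r.get("user_id") is None) / max(len(rows), 1)
    if null_rate > max_null_rate:
        issues.append(f"completeness: {null_rate:.0%} of rows missing user_id")
    newest = max((r["updated_at"] for r in rows), default=None)
    if newest is None or datetime.now(timezone.utc) - newest > max_age:
        issues.append("freshness: data is stale")
    return issues  # an empty list means the batch passed all checks

now = datetime.now(timezone.utc)
rows = [{"user_id": 1, "updated_at": now}, {"user_id": None, "updated_at": now}]
print(check_batch(rows))
```

Observability platforms run checks like these continuously and add root cause analysis and lineage on top, rather than leaving each team to hand-roll them per pipeline.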
As a data engineer, there’s a high chance that you will implement data quality and observability practices in your work.
Usability: The New Focus Is on Data Products and Data Contracts
With the Data Mesh architecture increasing in popularity, Data Products and Data Contracts dominated the data space in 2022. The key takeaway is that data should be treated as a product. We data engineers take great pride in the pipelines we develop, but data consumers are only interested in the end result.
To incorporate product thinking into data pipelines, a shift in how we handle data and its dependencies is happening. We are shifting from an imperative approach to a declarative one, which allows us to define the data (product) as a standalone object, even before it exists. It also makes integration into external tools such as Airbyte, dbt, and Python code for machine learning easier. Instead of stitching together the DAGs in the orchestration part, each tool focuses only on the data product.
With an increasing number of tools in the data stack, it's crucial to integrate, orchestrate, and automate them as much as possible. Orchestration is an essential part of the Data Engineering Lifecycle. Up until now, Apache Airflow was a good start, but newer orchestration tools such as Prefect, Temporal, and Dagster make more sense if you are starting greenfield or have complex orchestration needs.
Dagster effectively creates a Data Contract between tools with its Software-Defined Assets (SDAs). Airbyte integrates and extracts the data in a specific schema, location, and format. A downstream tool, such as dbt, then acts in an event-driven way based on the assumptions defined in the SDA. Each step is decoupled from the DAG, and sub-jobs know when an upstream job has changed, so there is no need to run the complete DAG anymore.
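The declarative, asset-first idea can be sketched in plain Python. This is a toy illustration of the concept, not Dagster's actual API: each asset declares its dependencies, and a small resolver materializes only what is needed, in dependency order:

```python
# Toy sketch of the declarative, asset-first idea (NOT Dagster's real API):
# each asset declares what it depends on, and a tiny resolver materializes
# only what is needed, in dependency order.
ASSETS = {}

def asset(deps=()):
    def register(fn):
        ASSETS[fn.__name__] = (deps, fn)
        return fn
    return register

def materialize(name, cache=None):
    cache = {} if cache is None else cache
    if name not in cache:
        deps, fn = ASSETS[name]
        # Recursively build upstream assets first, reusing cached results.
        cache[name] = fn(*[materialize(d, cache) for d in deps])
    return cache[name]

@asset()
def raw_orders():
    return [{"amount": 10}, {"amount": 32}]

@asset(deps=("raw_orders",))
def daily_revenue(raw_orders):
    return sum(o["amount"] for o in raw_orders)

print(materialize("daily_revenue"))  # 42
```

The point of the design is that `daily_revenue` is defined as a standalone object, before it exists, and the orchestrator derives the execution order from the declarations instead of a hand-stitched DAG.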
Much of this ties into Functional Data Engineering, which is vital for data engineering. We can reduce side effects by slicing a DAG into functional tasks, as each function has a defined input and output. Each task can be written, tested, reasoned about, and debugged in isolation without understanding the external context or history of events surrounding their execution.
Writing functional data pipelines is a welcome side effect of a data-aware and declarative approach. Above all, you can finally unit test your data pipelines, which is otherwise hardly possible. Sure, you can do all of this with plain Python and more lambda and map functions, but nothing would enforce the approach or guarantee the benefits explained above.
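For example, a functional task with a defined input and output can be unit tested in isolation; the transformation and test data below are illustrative:

```python
# A pure, functional pipeline task: defined input, defined output, no hidden
# state, so it can be unit tested without any external context.
def clean_emails(rows):
    # Same input always yields the same output; the input is never mutated.
    return [{**r, "email": r["email"].strip().lower()} for r in rows]

def test_clean_emails():
    rows = [{"id": 1, "email": "  Ada@Example.COM "}]
    assert clean_emails(rows) == [{"id": 1, "email": "ada@example.com"}]
    assert rows[0]["email"] == "  Ada@Example.COM "  # no side effects

test_clean_emails()
print("ok")
```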
Above all tools, what matters most is putting data consumers first and working toward the outcomes they need.
Openness: When It Comes to Data Tools, Open Source Is the Answer
Given that a single organization can't create and maintain every data connector, framework, or component along the data stack, open source is a method to tackle the challenges in a new way together.
Let’s take some software companies as an example. Red Hat, Elastic, and Cloudera have implemented successful strategies around open source. Airbnb has more than 30 open-source projects, and Google has more than 200. In data, the creators of Apache Spark (who later founded Databricks) open-sourced it, so now everyone has access to its internals.
The beauty of open source is that the more you share, the more feedback, contributions, and, most importantly, trust a product receives. Usually, open-sourcing improves a product, as a small team of in-house engineers cannot match the contributions of an entire community.
Even though it may seem counterproductive to share code publicly, as competitors may try to replicate it, doing so may be the most effective method of collaboration and creativity. The pressure to turn a profit is ever-present. Still, there is ample opportunity to build AI-powered, business-driven innovations on top of a data stack that strengthens with each new open-source contribution.
A16z says it well: “As software eats the world, open source eats software.” The Awesome Data Engineering list provides a curated collection of such open-source data engineering tools.
Standards: Open Database Standards Simplified by DuckDB
With the explosion of tools in the open-source data ecosystem, there is a need to couple them, combine them into a single data stack, and simplify along the way. To achieve this, we need open standards that everyone implements.
DuckDB is a database that got lots of love for its simplicity in analytical workloads. It is an in-process SQL OLAP database management system for analytical queries, similar to what SQLite is for transactional databases. Thanks to its columnar-vectorized query execution engine, queries are still interpreted, but a large batch of values (a "vector") is processed in one operation. Each database is a single file on disk and supports a vast amount of data.
It also makes it easy to set up an OLAP system within seconds, compared to heavier systems such as Apache Druid, ClickHouse, and Pinot. Its single-file approach enables use cases such as bringing the data to the client with no network latency and zero copies (on top of Parquet files in S3), which weren't feasible before.
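The in-process, single-file model is the same one the standard library's sqlite3 module uses, and DuckDB's Python API follows the same DB-API-style pattern. A sketch using sqlite3 (for analytical workloads you would install duckdb and call duckdb.connect the same way):

```python
import sqlite3

# The in-process, single-file model: no server to run, the database is just
# a file (or memory). DuckDB's Python API follows the same pattern; swap in
# duckdb.connect for columnar, analytical execution.
con = sqlite3.connect(":memory:")  # or a path like "analytics.db"
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 120.0), ("EU", 80.0), ("US", 50.0)])

rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 200.0), ('US', 50.0)]
```

The setup cost is a single `connect` call, which is exactly what makes this model attractive compared to running a Druid or ClickHouse cluster for small-to-medium analytical workloads.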
Besides DuckDB, there are already standards such as S3 for the storage interface (many other storage providers implement the S3 API as a default) and Apache Parquet as a file format in data lakes. There are also standards such as table formats, which bundle distributed files into one database-like table: an abstraction layer between your physical data files and how they are structured to form a table.
Open standards are also vital for the Modern Data Stack's success in integrating the various tools into a powerful data stack. As more existing tools mature, more of them will transform themselves into open data standards. Time will tell which ones those will be.
Roles: Data Practitioners Expand Their Knowledge into Data Engineering
Around a decade ago, the world went crazy hiring data scientists without realizing that a robust underlying data platform and quality data need to be in place for data science to deliver value.
The world has only started understanding the need to hire data engineers instead of data scientists. That’s because solid data engineering is the basis of all analysis, machine learning, and data products. Data science and analytics only work if quality data is cleaned and correct, and some estimate that data scientists spend 80% of their time on data cleansing and only 20% on data science.
Recently, we have seen professionals in adjacent fields turn to data engineering to expand their knowledge, creating more specialized roles within the discipline.
The above is not only limited to data scientists. Another example is the emerging “analytics engineer” role that debuted in the early 2020s. An analytics engineer is a professional who most likely began their career as a business or data analyst; therefore, they are familiar with SQL and the creation of dashboards. Transformation tools such as dbt allow them to be a hybrid between a data analyst and a data engineer.
The rise of highly abstracted and easy-to-use data tools from the Modern Data Stack increasingly allows professionals without a data engineering background to take over the tasks of traditional data engineering, effectively expanding their knowledge base and increasing specialization in the field.
Along these same lines, infrastructure roles such as DevOps or data reliability engineering are emerging as data quality and observability gain importance and attention. These subfields ensure data quality and availability by implementing continuous integration and continuous deployment (CI/CD), service level agreements (SLAs), monitoring, and observability in data systems.
We may also see the emergence of specialized roles at the opposite end of the spectrum, where software engineering and data engineering intersect. In the future, software engineers may need to be well-versed in data engineering with the advent of streaming and event-driven architectures.
Collaboration: DataOps Reduces Data Silos and Improves Teamwork
The data engineer role has historically been challenging because we sit in the middle of two worlds: data producers and consumers. Or, in other words, the operational and analytics world.
Producers and consumers usually are not in contact with each other or speak different languages. In this sense, the data engineer acts as the translator. But since we don’t control the narrative of the producers and don’t know the consumers' needs, some things may “get lost in translation.”
With increasing amounts of data and the diversity of data consumers, this problem becomes more painful and disrupts the efficient flow of data.
As a result, novel approaches like DataOps have arisen. DataOps is a method of working that facilitates communication and collaboration between data engineers, data scientists, and other data professionals to avoid silos.
Even though its name and concept derive from DevOps – a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT – DataOps isn’t just DevOps applied to data pipelines.
Prukalpa describes it best: DataOps combines the best parts of Lean, Product Thinking, Agile, and DevOps and applies them to data management.
- Product Thinking: Seeing data as a product provides the best value to customers.
- Lean: Identifying the value of a data product helps eliminate waste and be more efficient.
- Agile: Embracing iterative product development and Minimum Viable Products (MVP) enables quick feedback from stakeholders.
- DevOps: Applying software engineering practices focusing on CI/CD, monitoring, observability, etc.
Data quality and observability, which we discussed in the previous section, are part of DataOps. As a data engineer, you will most likely work with some DataOps best practices and technology.
Adoption: Data Engineering Is Key Regardless of Industry or Business Size
Every company that wants to be data-driven needs to have data analytics in place, which, in most cases, requires data engineers to enable it. This condition applies regardless of the industry and size of the company.
Of course, things may look different if you work in a bank than if you work in an early-stage startup, but something is for sure: there’s no shortage of data engineering job opportunities in every sector.
That’s confirmed by the answers provided by more than 800 survey respondents to Airbyte’s State of Data Engineering Survey, which closed a few weeks ago. As you can see in the graphs below, the company and team sizes are almost evenly distributed. The majority of respondents' titles are “data engineer.”
Foundations: The Data Engineering Lifecycle is Evergreen
If you have read this far into the blog post, you may have already realized it: a lot is going on in data engineering. Every day brings a new tool, product, or groundbreaking concept. However, it's important to remember that the fundamentals have stayed the same and will stay the same for the foreseeable future.
But what are those fundamentals? The best description is illustrated by The Data Engineering Lifecycle, a concept that comes from the book Fundamentals of Data Engineering.
In the book, Joe Reis and Matt Housley argue that “the concepts of data generation, ingestion, orchestration, transformation, and storage are critical in any data environment regardless of the underlying technology.” These concepts have remained the same over time, which is good news because it means you don’t have to be dragged along by every upcoming buzzword.
Since data engineering is still a relatively new field, there is currently no established curriculum at most universities that covers it. Although there is no shortage of online courses and boot camps, many are tailored to help students become experts in a specific technology.
As a result, people entering the field (and veterans alike) sometimes fixate on such technologies, neglecting data management aspects. Resources like Fundamentals of Data Engineering aim to provide a foundation for the knowledge and skills needed in data engineering. We believe it’s a great starting point for anyone, independent of your expertise in the field.
The best data engineers are the ones who can see their responsibilities through the lenses of both the business and the technology. Our best advice is to avoid fixating on specific, passing trends and focus on the big picture.
(Bonus) Community: Data Creators and Practitioners Interact on Several Platforms
We know that data can be challenging, but one thing is certain: you will not be alone in your journey to becoming a better data engineer. Fortunately, great data content creators constantly share their knowledge and experiences through different media, like newsletters and podcasts, and there is also a strong presence on social media and in forums.
We compiled the most popular in each category based on our State Of Data Engineering Survey results. The list is not exhaustive, but it offers a great starting point.
Newsletters and Blogs
- Towards Data Science
- Data Engineering Weekly
- Analytics Engineering Roundup
- Seattle Data Guy
- Benn Stancil Substack
Podcasts
- Data Engineering Podcast
- Analytics Engineering Podcast
- Data Stack Show
- Analytics Power Hour
Forums and Social Media
The data engineering field holds a lot of promise. As this article showed you, there has been a growing awareness among organizations and professionals of the value of a solid data foundation. Without data engineering, there would be no analysis, machine learning models, or data products, and thankfully, the discipline is maturing and evolving to be up to the challenge.
We hope this article provided a good overview of where data engineering is and where it may be going in 2023!