
12 Things You Need to Know to Become a Better Data Engineer in 2023

Thalia Barrera
Simon Späti
December 9, 2022
15 min read

Openness: When It Comes to Data Tools, Open Source Is the Answer

Given that no single organization can create and maintain every data connector, framework, or component along the data stack, open source is a way to tackle these challenges together.

Let’s take some software companies as an example. Red Hat, Elastic, and Cloudera have implemented successful strategies around open source. Airbnb has more than 30 open-source projects, and Google has more than 200. In data, Databricks was founded by the creators of the open-source Apache Spark, so everyone has access to its internals.

The beauty of open source is that the more you share, the more feedback, contributions, and, most importantly, trust a product receives. Usually, open-sourcing improves a product, as a small team of in-house engineers cannot match the breadth of contributions from the community.

Airbyte's number of connectors has trended upward thanks to community contributions, especially during Hacktoberfest 2022.

Even though it may seem counterintuitive to share code publicly, since competitors may try to replicate it, doing so may be the most effective way to collaborate and innovate. The pressure to turn a profit is ever-present. Still, there is ample opportunity to build AI-powered, business-driven innovations on top of a data stack that strengthens with each new open-source contribution.

A16z says it well: “As software eats the world, open source eats software.” The Awesome Data Engineering list offers a curated collection of such open-source data engineering tools.

Standards: Database Open Standard Simplified by DuckDB

With the explosion of tools in the open-source data ecosystem comes the need to connect them, combine them into a single data stack, and simplify along the way. To achieve this, we need open standards that everyone implements.

DuckDB is a database that has received lots of love for its simplicity in analytical work. DuckDB is an in-process SQL OLAP database management system for analytical queries, much like what SQLite is for transactional workloads. Thanks to its columnar-vectorized query execution engine, queries are still interpreted, but a large batch of values (a "vector") is processed in one operation. Each database is a single file on disk and can hold a vast amount of data.

It also makes it possible to set up an OLAP system within seconds, compared to heavier systems such as Apache Druid, ClickHouse, and Pinot. Its single-file approach enables use cases such as bringing the data to the client with no network latency and zero copies (on top of Parquet files in S3), which weren't feasible before.
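
To make this concrete, here's a minimal sketch of that single-file, in-process model using DuckDB's Python API. The database file, table, and the commented S3 path are illustrative choices of ours, not anything prescribed by DuckDB:

```python
# pip install duckdb
import duckdb

# In-process: no server to run; each database is a single file on disk.
con = duckdb.connect("analytics.duckdb")

con.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, amount DOUBLE)")
con.execute("INSERT INTO events VALUES (1, 9.99), (1, 4.50), (2, 20.00)")

# The columnar-vectorized engine processes a batch of values (a "vector")
# per operation, which is what makes analytical queries like this fast.
rows = con.execute(
    "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 14.49), (2, 20.0)]

# DuckDB can also query Parquet files directly, including over S3 once the
# httpfs extension is loaded (bucket and path below are hypothetical):
# con.execute("INSTALL httpfs; LOAD httpfs;")
# con.execute("SELECT COUNT(*) FROM read_parquet('s3://my-bucket/data.parquet')")
```

Note there is no server to provision: deleting the .duckdb file removes the whole database.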

Besides DuckDB, there are already standards such as the S3 API for the storage interface (many other storage providers implement it as a default) and Apache Parquet as the file format in data lakes. There are also table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, which bundle distributed files into one database-like table: an abstraction layer between your physical data files and how they are structured to form a table.
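
To illustrate how the S3 API works as a de facto standard, the sketch below uses boto3 against a hypothetical S3-compatible endpoint; swap the endpoint URL and the same code runs against AWS S3, MinIO, or any other implementation. The endpoint, credentials, and bucket names are placeholders:

```python
# pip install boto3
import boto3

# Any S3-compatible storage can be addressed with the same client simply by
# pointing it at a different endpoint. Endpoint, credentials, and bucket
# below are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a Parquet file -- the de facto open file format for data lakes --
# and list what is stored under the "lake/" prefix.
s3.upload_file("local.parquet", "my-bucket", "lake/data.parquet")
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="lake/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```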

Open standards are also vital to the Modern Data Stack's success in integrating the various tools into a powerful data stack. As more existing tools mature, more of them will turn into open data standards. Time will tell which ones.

Roles: Data Practitioners Expand Their Knowledge into Data Engineering

Around a decade ago, the world went crazy hiring data scientists without realizing that a robust underlying data platform and quality data need to be in place for data science to deliver value.

The world has only recently started to understand the need to hire data engineers before data scientists. That’s because solid data engineering is the basis of all analysis, machine learning, and data products. Data science and analytics only work if the underlying data is clean and correct; some estimate that data scientists spend 80% of their time on data cleansing and only 20% on data science.

Recently, we have seen professionals in adjacent fields turn to data engineering to expand their knowledge, creating more specialized roles within data engineering.

The above is not limited to data scientists. Another example is the emerging “analytics engineer” role that debuted in the early 2020s. An analytics engineer most likely began their career as a business or data analyst and is therefore familiar with SQL and building dashboards. Transformation tools such as dbt allow them to work as a hybrid between a data analyst and a data engineer.

The rise of highly abstracted, easy-to-use tools from the Modern Data Stack increasingly allows professionals without a data engineering background to take over tasks of traditional data engineering, effectively expanding their knowledge base and increasing specialization in the field.

Along these same lines, infrastructure roles such as DevOps or data reliability engineering are emerging as data quality and observability gain importance and attention. These subfields ensure data quality and availability by implementing continuous integration and continuous deployment (CI/CD), service level agreements (SLAs), monitoring, and observability in data systems.
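
As a hedged sketch of what such practices can look like, here is a minimal data quality gate that a CI job could run after a pipeline step. It reuses the hypothetical DuckDB file from the earlier example; the table and rules are illustrative, not a standard:

```python
# A toy data quality gate for a CI pipeline: it queries the warehouse (here
# a DuckDB file, as in the earlier example) and fails the job if any rule is
# violated. File, table, and rules are hypothetical.
import sys
import duckdb

con = duckdb.connect("analytics.duckdb", read_only=True)

checks = {
    "no NULL user_ids": "SELECT COUNT(*) FROM events WHERE user_id IS NULL",
    "no negative amounts": "SELECT COUNT(*) FROM events WHERE amount < 0",
}

failed = False
for name, query in checks.items():
    violations = con.execute(query).fetchone()[0]
    status = "FAILED" if violations else "passed"
    print(f"{status}: {name} ({violations} violating rows)")
    failed = failed or bool(violations)

# A non-zero exit code fails the CI job, blocking a bad deployment.
sys.exit(1 if failed else 0)
```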

We may also see the emergence of specialized roles at the opposite end of the spectrum, where software engineering and data engineering intersect. In the future, software engineers may need to be well-versed in data engineering with the advent of streaming and event-driven architectures.

Collaboration: DataOps Reduces Data Silos and Improves Teamwork

The data engineer role has historically been challenging because we sit in the middle of two worlds: data producers and consumers. Or, in other words, the operational and analytics world.

Producers and consumers usually are not in contact with each other or speak different languages. In this sense, the data engineer acts as the translator. But since we don’t control the narrative of the producers and don’t know the consumers' needs, some things may “get lost in translation.” 

With increasing amounts of data and the diversity of data consumers, this problem becomes more painful and disrupts the efficient flow of data.

As a result, novel approaches like DataOps have arisen. DataOps is a method of working that facilitates communication and collaboration between data engineers, data scientists, and other data professionals to avoid silos.

Even though its name and concept derive from DevOps – a set of practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT – DataOps isn’t just DevOps applied to data pipelines.

Prukalpa describes it best: DataOps combines the best parts of Lean, Product Thinking, Agile, and DevOps and applies them to data management.

  • Product Thinking: Treating data as a product to provide the best value to customers.
  • Lean: Identifying the value of a data product helps eliminate waste and be more efficient.
  • Agile: Embracing iterative product development and Minimum Viable Products (MVP) enables quick feedback from stakeholders.
  • DevOps: Applying software engineering practices focusing on CI/CD, monitoring, observability, etc.
Graphic explanation of the disciplines involved in DataOps. Inspired by Prukalpa's article.

Data quality and observability, which we discussed in the previous section, are part of DataOps. As a data engineer, you will most likely work with some DataOps best practices and technology.

Adoption: Data Engineering Is Key Regardless of Industry or Business Size

Every company that wants to be data-driven needs to have data analytics in place, which, in most cases, requires data engineers to enable it. This condition applies regardless of the industry and size of the company.

Of course, things may look different if you work in a bank than if you work in an early-stage startup, but something is for sure: there’s no shortage of data engineering job opportunities in every sector.

That’s confirmed by more than 800 respondents to Airbyte’s State of Data Engineering Survey, which closed a few weeks ago. As you can see in the graphs below, company and team sizes are almost evenly distributed, and the most common respondent title is “data engineer.”

Results of Airbyte's State of Data Engineering survey

Foundations: The Data Engineering Lifecycle is Evergreen

If you have read this far into the blog post, you may have already realized it: a lot is going on in data engineering. Every day brings a new tool, product, or groundbreaking concept. However, it's important to remember that the fundamentals have stayed the same and will stay the same for the foreseeable future.

But what are those fundamentals? The best description is illustrated by The Data Engineering Lifecycle, a concept that comes from the book Fundamentals of Data Engineering.

In the book, Joe Reis and Matt Housley argue that “the concepts of data generation, ingestion, orchestration, transformation, and storage are critical in any data environment regardless of the underlying technology.” These concepts have remained the same over time, which is good news because it means you don’t have to be dragged along by every upcoming buzzword.

Since data engineering is still a relatively new field, there is currently no established curriculum at most universities that covers it. Although there is no shortage of online courses and boot camps, many are tailored to help students become experts in a specific technology.

As a result, people entering the field (and veterans alike) sometimes fixate on such technologies, neglecting data management aspects. Resources like Fundamentals of Data Engineering aim to provide a foundation for the knowledge and skills needed in data engineering. We believe it’s a great starting point for anyone, independent of your expertise in the field.

The best data engineers are the ones who can see their responsibilities through the lenses of both the business and the technology. Our best advice is to avoid fixating on specific, passing trends and focus on the big picture.

(Bonus) Community: Data Creators and Practitioners Interact on Several Platforms

We know that data can be challenging, but one thing is certain: you will not be alone on your journey to becoming a better data engineer. Fortunately, great data content creators constantly share their knowledge and experience through different media, like newsletters and podcasts, and are also active on social media and forums.

We compiled the most popular in each category based on our State Of Data Engineering Survey results. The list is not exhaustive, but it offers a great starting point.

Newsletters

Podcasts

YouTube Channels

Forums and Social Media

Wrapping Up

The data engineering field holds a lot of promise. As this article showed you, there has been a growing awareness among organizations and professionals of the value of a solid data foundation. Without data engineering, there would be no analysis, machine learning models, or data products, and thankfully, the discipline is maturing and evolving to be up to the challenge. 

We hope this article provided a good overview of where data engineering is and where it may be going in 2023!
