If there's any field changing rapidly right now, it’s data engineering. The field has always moved quickly, but now, with advancements in AI and a shifting landscape of tools and technologies, it's undergoing a transformation like never before. As a data professional, navigating this ever-evolving landscape can be daunting. What should you focus on? How can you ensure you're not just keeping up, but staying ahead of the curve?
While we don’t have all the answers, we're definitely on the hunt for them. That's why, in addition to analyzing tools, technologies, and data communities, we've tapped into the wisdom of those at the forefront of the field. We've consulted with experts and founders of the hottest data tools, enriching our guide with their insights. So, here's our take on the 5 big trends that everyone in data engineering needs to prepare for. And, more than just trends, we're offering practical tips on how you can use these insights to your advantage.
Our goal is simple: to equip you with the information you need to navigate the data engineering trends of 2024 confidently. And yes, you guessed it – we're kicking things off with AI.
Generative AI and LLMs: Putting Data to Work
The late 2022 introduction of OpenAI’s ChatGPT marked a significant moment in the tech world, sparking a surge in both interest and investment in Generative AI (GenAI).
Large Language Models (LLMs) represent a branch of AI in which models are trained on extensive amounts of text data to learn how to predict responses to human inputs. OpenAI’s GPT (Generative Pre-trained Transformer) models are a prime example of LLMs. These models have demonstrated remarkable capabilities in generating coherent and contextually relevant text, opening doors to more advanced applications of GenAI.
With each iteration, GPT models have become increasingly powerful, managing more complex tasks and processing data more efficiently.
ChatGPT, in particular, has achieved explosive global popularity, marking AI’s first major inflection point in public adoption.
In terms of data tools, one of the most exciting developments enabled by GenAI is the conversational interface. This innovation has the potential to profoundly impact customer engagement, sales, marketing, and IT support.
In analytics, these advances are driving demand for natural language question answering and conversational search, exemplified by “chat with your data” applications where users interact with databases without needing SQL.
While some argue AI hasn't radically altered the role of data professionals yet, it has introduced new capabilities and will continue to do so in 2024 and beyond. For example, we now have GitHub Copilot to help us with our code, and we can use GenAI for tasks such as generating SQL from natural language, translating between SQL dialects, creating boilerplate data pipeline DAGs, developing relational schemas for complex data types, and so on.
As GenAI models become more advanced, their potential to handle increasingly complex data engineering tasks will grow, further transforming the role of data engineers from routine task managers to strategic problem-solvers and innovators.
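To make the “generating SQL from natural language” use case concrete, here's a minimal sketch of the prompt-construction step such a tool might perform. The function name, schema, and prompt wording are all illustrative assumptions, not any specific product's API; a real system would send this prompt to an LLM and validate the returned SQL before running it.

```python
# Hypothetical helper: embed table schemas in a prompt so an LLM can
# ground the SQL it generates. Names and wording are illustrative only.

def build_sql_prompt(question: str, schema: dict[str, list[str]]) -> str:
    """Build an NL-to-SQL prompt that includes the relevant table schemas."""
    schema_lines = [
        f"TABLE {table} ({', '.join(columns)})"
        for table, columns in schema.items()
    ]
    return (
        "You are a SQL assistant. Given this schema:\n"
        + "\n".join(schema_lines)
        + f"\n\nWrite a single SQL query answering: {question}\n"
        + "Return only the SQL, with no explanation."
    )

prompt = build_sql_prompt(
    "How many orders did each customer place last month?",
    {"customers": ["id", "name"], "orders": ["id", "customer_id", "created_at"]},
)
```

The key design point is grounding: without the schema in the prompt, the model is far more likely to hallucinate table or column names.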
But data engineers are not just users of AI; we are also enablers.
Addressing an organization’s AI needs requires adapting our skills and knowledge beyond the realm of traditional analytics pipelines. This involves becoming familiar with new architectures, like
vector databases, which are essential for managing the sophisticated data structures inherent in AI and ML. Equally important is the understanding of unstructured data and exploring new types of data sources that extend beyond the conventional databases or APIs.
That’s why at Airbyte, we've made it a priority to support unstructured data sources, including extracting text from documents, and vector database destinations.
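For readers new to vector databases, here's a toy sketch of the core operation they perform: nearest-neighbour search over embeddings by cosine similarity. Real systems such as Qdrant add indexing (e.g. HNSW), filtering, and persistence; the two-dimensional vectors below are stand-ins for real embeddings, and the document ids are made up.

```python
# Toy nearest-neighbour search over embeddings, the core primitive
# behind vector databases. Vectors and ids here are illustrative only.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query: list[float], store: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda doc_id: cosine(query, store[doc_id]), reverse=True)
    return ranked[:k]

store = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
print(search([0.9, 0.1], store, k=2))  # doc_a is closest, doc_c next
```

Production systems replace the brute-force sort with approximate indexes, since scanning every vector does not scale past a few million embeddings.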
For data engineers in the field: take advantage of AI-powered tools to improve your productivity, but make sure to step back and understand exactly what the AI you’re using is abstracting away. Without that technical knowledge, you will be less effective at wielding AI tools in the future.
“Using techniques such as LLMs and vector embeddings is bringing about a new era in data engineering. It is enabling us to turn challenges into opportunities, and raw data into insightful narratives. With each stage of the data lifecycle being enhanced by Generative AI, the potential for innovation is limitless. Vector Databases are not replacing traditional database solutions. Instead, they represent an entirely new category of database. They allow solutions to be built on top of unstructured and semi-structured data of any kind, such as text, images, radio, video, and their combinations, in the form of vector embeddings. This is a paradigm shift for the entire information retrieval and data ecosystem.” - Andre Zayarni, Co-founder at Qdrant
Software Meets Data: A Renewed Focus on Fundamentals
The worlds of software engineering and data engineering are increasingly overlapping.
This trend isn't new, but it continues to evolve and grow stronger. Practices such as CI/CD, testing, version control, monitoring, and observability, once exclusive to software engineering, are now becoming integral to the data engineering lifecycle.
Why is this happening now? Two main drivers are influencing this shift. First, there's the increasing complexity and importance of data products. Data teams are now viewed as critical product teams, reflecting the strategic value of data. Second, there’s a response to the growing demand for higher quality in data, especially important in the age of Generative AI where poor data quality can lead to flawed models.
The “shift left” approach, popularized by DevOps, is being adopted in DataOps, ensuring that activities like testing and security are integrated early in the development process, not as afterthoughts.
Moving forward, we expect a deeper integration of these software engineering practices into data engineering tools. Concepts like Infrastructure-as-Code (IaC) for managing data infrastructure and declarative data pipelines are making their way into data engineering. This is exemplified by Airbyte’s Terraform provider, or modern orchestration tools like Dagster and Kestra.
As a data engineer, how can you adapt to and benefit from this trend? Start by familiarizing yourself with some of the core components of DevOps, namely: CI/CD, automated testing, and version control. Our other recommendation is to embrace the fundamentals, but let your tools do the heavy lifting. Many data tools are introducing features that make implementing DevOps best practices straightforward.
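To illustrate what “shift-left” data testing can look like in practice, here's a minimal sketch of assertion-style quality checks that could run in CI before a pipeline deploys, in the spirit of tools like dbt tests or Great Expectations. The column names, rules, and sample rows are invented for illustration.

```python
# Sketch of automated data quality checks runnable in a CI pipeline.
# Rules and sample data are hypothetical, for illustration only.

def check_not_null(rows: list[dict], column: str) -> list[int]:
    """Return the indices of rows where the column is missing or null."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]

def check_unique(rows: list[dict], column: str) -> bool:
    """True when every row carries a distinct value for the column."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))

rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
]
assert check_not_null(rows, "amount") == [1]   # the second row fails the not-null rule
assert check_unique(rows, "order_id")          # the primary key column is unique
```

Wrapping checks like these in your test suite means a broken upstream source fails the build instead of silently corrupting downstream dashboards.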
As Joe Reis mentions, why delve into the nitty-gritty of low-level infrastructure and code when you can leverage advanced tools that handle these aspects for you? This approach will free you up to focus on high-impact, strategic initiatives, like enhancing data quality, refining data models, and implementing robust security measures. These are the areas where you can truly add value, driving the success of your data initiatives.
“We are returning to the core principles and patterns that have been the cornerstone of data engineering over the years. This includes a renewed focus on aspects such as data modeling and other elements of the data engineering lifecycle, from security and data governance – right down to the evolving realm of data contracts. The integration of software engineering practices into data engineering is not just continuing; it's intensifying.” - Simon Späti, author of Data Engineering Design Patterns
Modern Data Stack: From Boom to Maturity
The modern data stack (MDS) has been a buzzword in the data world for a few years. But what exactly is it? In simple terms, it's a collection of tools and technologies specifically designed to handle various aspects of data processing, from collection and storage to analysis and visualization.
While some may argue that the allure of the MDS is declining, given its potential downsides, the reality is that it's maturing. The MDS is evolving from a vast pool of potential options into a smaller set of well-established and reliable tools from which you can build scalable, performant data systems.
On the supply side of the equation, excessive investments in the data space have led to a proliferation of data companies that will need to start looking for funding soon. Over the next 12 months, expect a large number of companies to exit or close shop.
On the demand side, data engineers are overwhelmed by the number of tools at their disposal. There is a growing realization that a smaller set of flexible tools is better than a large number of hyper-specialized tools. This lets you get the benefits of all the innovation in the data space, without spending a fortune on maintaining an intricate web of data tools.
In 2024 and beyond, only the most useful and valuable tools in each category will remain. Survival of the fittest, indeed.
Looking ahead, we can expect a few other key developments in this area. First, there will be a focus on the resilience of open-source software within MDS. The tools that survive will likely be those that are not only robust but also have strong community support. Second, the emphasis will be on vetting and choosing the right tool for the job, rather than defaulting to building bespoke solutions. And last, existing MDS tools will continue to adapt to the increasing demands of AI and machine learning.
Tools that can support these modern AI use cases will be well positioned to support the growing data needs of organizations keen on leveraging AI.
As a data engineer navigating this maturing landscape, your focus should be on selecting the right tools and technologies for your needs. Prioritize tools backed by well-funded teams, offering strong support and open-source options. Pay attention to tools that are cost-conscious with favorable pricing models, especially as managing costs in your data stack becomes more relevant. And most importantly, stay attuned to how these tools are evolving to meet the demands of AI and machine learning – this is where the future of data engineering is headed.
“When facing the proliferation of tools that came with MDS, it is easy to lose focus on what is important and succumb to the pursuit of the next shiny object. In order to not lose track of what matters, always go back to first principles. You’re building an infrastructure and to do so, you need: movement, storage, processing and a bit of observability. Focus on getting those fundamentals right. Only then you need to figure out the tools to interact with the data and extract insights from it. That’s when you need to think about the persona working with that data, and that’s what should drive your choice, not the fancy marketing of yet another solution.” Michel Tricot, Co-founder & CEO at Airbyte. Data Lakehouse Architecture: Blurring Lines in Data Storage
In the beginning, there were relational transactional databases, designed to capture and reflect the state of the world. These Online Transactional Processing (OLTP) databases were best suited for shallow-and-wide queries (like inserting or updating many attributes of a few entities). While handling transactions for a given application is great, there was a need to bring data from different databases under one roof for processing and analysis. Online Analytical Processing systems emerged to fill this gap.
In short: first, we had OLTP databases, designed to capture and reflect the state of the world. Then came OLAP systems (often categorized as Data Warehouses), built to efficiently perform aggregations and analysis on our data at larger and larger scales.
For a long while, this is how the storage and processing of structured data was thought about. But with the emergence of the web came an unprecedented need to store large amounts of semi-structured or unstructured data. So, in the early 2000s, a handful of tools were invented to efficiently store and analyze this new wave of “big data”. They did so by distributing storage and compute, and by adopting cheaper, more flexible ways of storing data, like object or document storage.
These vast pools of unstructured data were aptly named “Data Lakes”. Unfortunately, performing analytics on top of data lakes required a considerable amount of operational complexity.
This brings us to the present, where a new set of technologies has blurred the lines between warehouse, lake, and transactional systems. These technologies, once assembled, are known as a Data Lakehouse.
Data Lakehouses are typically composed of an underlying object store (for structured, semi-structured and unstructured data), a data layer (composed of the file and table formats and a metastore), and a processing layer. They can be built to support ACID transactions, schema enforcement and governance, diverse data types, and multiple kinds of workloads. Quite a few companies and products have come to market seeking to make Data Lakehouses simple to operate and reasonable to maintain, including industry heavyweights like Snowflake and Databricks.
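To give a feel for what the “data layer” does, here's a toy, in-memory model of how a table format tracks which data files belong to each snapshot of a table, so engines can always read a consistent version and even “time travel” to older ones. Real formats like Iceberg, Hudi, and Delta persist this metadata in the object store itself with far more machinery (manifests, partition stats, concurrency control); the class and file paths below are purely illustrative.

```python
# Toy model of a lakehouse table format: each commit produces a new
# immutable snapshot listing the table's data files. Illustrative only;
# real formats (Iceberg, Hudi, Delta) store this metadata durably.

class ToyTableFormat:
    def __init__(self):
        self.snapshots: list[list[str]] = [[]]  # snapshot 0 is the empty table

    def commit(self, added_files: list[str]) -> int:
        """Append files by writing a new snapshot; returns the snapshot id."""
        self.snapshots.append(self.snapshots[-1] + added_files)
        return len(self.snapshots) - 1

    def read(self, snapshot_id=None) -> list[str]:
        """List the data files of a snapshot (latest by default) — time travel."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

table = ToyTableFormat()
v1 = table.commit(["s3://lake/orders/part-000.parquet"])
v2 = table.commit(["s3://lake/orders/part-001.parquet"])
assert table.read(v1) == ["s3://lake/orders/part-000.parquet"]
assert len(table.read()) == 2  # the latest snapshot sees both files
```

Because snapshots are immutable, readers never see a half-finished write, which is how lakehouses layer ACID-style guarantees on top of plain object storage.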
In 2023, the yearly MAD Landscape by Matt Turck rebranded the Data Lakes category to “Data Lakes / Lakehouses” to reflect the Data Lakehouse trend. In 2024, we expect to see significant movement in the Data Lakehouse space, including progress on interoperability between the major open table formats (Iceberg, Hudi, and Delta). For a deeper understanding, it's worth reading a comparison of Data Lakes and Lakehouses.
We also expect to see increased adoption of the Data Lakehouse architecture. This is in line with the industry trend of seeking to lower the number of tools used and integrations maintained by data organizations, in an effort to reduce cloud costs and operational complexity.
Data Engineers who are unfamiliar with open table formats, distributed processing engines, or metastores should consider investing their time in learning how these technologies work and how to use them in the wild. This will ensure that you’re ready to thrive amidst this architectural shift in data storage and processing.
"Openness and decentralization are key themes in modern data trends, addressing the challenges of ever-expanding data scales. Data lakehouses provide access to a single copy of your data across a broad ecosystem of tools. Data virtualization unifies your data across a decentralized web of data sources, while data mesh decentralizes the labor division required to curate the data. At Dremio, we focus on creating a platform that leverages and enhances these trends, empowered by technologies like Apache Parquet, Apache Arrow, and Apache Iceberg." - Alex Merced, Developer Advocate at Dremio
Data Mesh: Reshaping Data Management and Collaboration
Data Mesh is a concept that is often misunderstood, perhaps because of our tendency to focus on architecture and technology when it comes to data. In reality, Data Mesh is more about managing human resources and setting standards within data teams. It's an aspirational concept, aiming to reshape how we handle data across different departments.
The essence of Data Mesh lies in its approach to federated data governance. It's about establishing common ground for data quality, shared infrastructure, and best practices. Each data team within an organization must adhere to certain standards, particularly when dealing with datasets that span multiple business domains.
The concept of Data Mesh became hot in 2021. Now that the dust has settled, we see a clear divide in the data community regarding Data Mesh. Some have implemented it with significant success, while others dismiss it as mere marketing jargon.
One major hurdle with Data Mesh is its feasibility in terms of resources. Not every company has the luxury of multiple data teams to implement and maintain a Data Mesh effectively. Typically, smaller companies will rely on a central data team. However, as the demand for data products grows in 2024 and beyond, Data Mesh adoption will continue to expand, especially in larger organizations.
For data engineers looking to stay ahead, understanding the concept of Data Mesh remains important. Familiarize yourself with its advantages and disadvantages, and consider where it fits best. Data Mesh tends to work well in large organizations where teams are already aligned around business domains and services. If you encounter barriers preventing teams from accessing and using data efficiently, moving towards a Data Mesh approach could be beneficial.
“Whatever we call it - data mesh or otherwise - doing federated/decentralized data will continue to expand, it's the only way to maintain speed and flexibility at scale for most companies. My only hope is that it is done well and not with a technology-only, or tech first approach. There is always a marketing-driven hype cycle around topics and terms but if you talk to people that are actually trying to change their organization through data - rather than just doing exciting data work that doesn't tie to business value - data mesh is VERY alive and well.” - Scott Hirleman, Founder of Data Mesh Understanding
Conclusion
And there you have it – a deep dive into the top 5 trends shaping data engineering in 2024. It's clear that our field is in the midst of an exciting, transformative phase, largely driven by the relentless pace of AI advancements.
But to truly understand the direction we're headed, we need your insights. That's why we invite you to participate in our State of Data & AI survey. Your responses will help us keep the community informed, ensuring that the insights we share are grounded in the real-world experiences of data professionals like you. By participating, you're not just offering your perspective – you're helping us all stay honest and informed about the state of data engineering.