The data world is constantly changing, with new trends emerging every month. It’s never been more important to weed out short-lived trends from those that are here to stay. Data practitioners must find tools that adapt to these long-lasting trends, making it possible to solve the most challenging problems with innovative, scalable solutions.
AI is a great example of one of these trends and one we will cover in this article. AI started as a trendy buzzword but has quickly transformed into a concept that is changing the data industry as we know it. Companies that choose to embrace it and lean into the changes that it brings will be the ones that win in the data space.
As a data person, it is easy to fall behind on the next big thing if you and your company aren’t working with tools that can grow and scale with your changing needs. Keeping an eye on the trends in the data engineering space, and the companies that choose to embrace them, can help you find the right tooling for your data stack.
Here are the top data engineering trends in 2024, and how Airbyte aligns with them.
1. AI Integration: Transforming the Data World

Everyone is wondering how they can leverage AI to their advantage, especially in the data world. I’ve seen tool after tool launching AI-enabled features to keep up with the demand. However, many of these companies are missing the mark. Instead of adding AI capabilities throughout their tool, they should be encouraging the production of high-quality data. Top-notch data hygiene practices are a prerequisite to AI.
Insights produced by AI are only as good as the quality of your data. By focusing on AI before data quality, you are setting yourself up for failure. Measures such as freshness, accuracy, and completeness need to be evaluated before AI is let loose on your data. After evaluating these, you must add testing and observability so that you don’t act on low-quality data and end up with a black box of issues caused by AI enablement.
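As a rough illustration of what evaluating freshness and completeness can look like, here is a minimal Python sketch. The row layout, the field names (`id`, `email`, `updated_at`), and the thresholds are assumptions made up for this example; they do not come from any particular tool.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, required_fields):
    """Return the fraction of rows in which every required field is non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(rows)


def is_fresh(rows, timestamp_field, max_age, now=None):
    """Return True if the newest row is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    timestamps = [row[timestamp_field] for row in rows if row.get(timestamp_field)]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age


def quality_gate(rows, required_fields, timestamp_field, max_age,
                 min_completeness=0.95, now=None):
    """Collect human-readable failures; an empty list means the data passed."""
    failures = []
    score = completeness(rows, required_fields)
    if score < min_completeness:
        failures.append(f"completeness {score:.2f} below {min_completeness:.2f}")
    if not is_fresh(rows, timestamp_field, max_age, now=now):
        failures.append(f"no rows newer than {max_age}")
    return failures
```

A gate like this would run before any AI-enabled tool reads the table, and a non-empty list of failures would block that step. Accuracy is left out because it usually requires comparing against a trusted reference, which is specific to each dataset.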
ETL tools that prioritize high-quality data are a must-have for any data stack before AI-enabled tools can be considered. Airbyte is a great example of a data ingestion tool that follows data quality best practices, ensuring you only have the best data available in your data warehouse. It checks that data is complete and arrives on time using per-row error handling (a bad row won’t fail your entire sync) and heartbeats. Heartbeats essentially trigger Airbyte to restart jobs when an issue is detected between the source and destination.
While AI is a trend that’s here to stay, many people are going about it in the wrong order. Focus on data quality now by replacing sub-par tools with tools that follow best practices like deduplication, normalization, and version control. Doing this will allow you to later implement AI in a way that provides lasting value to your organization.
2. Enhancing Security and Governance

With AI and the need for better quality data also comes a need for better access control over data. We do not want AI to have access to PII and any data not approved for use by consumers in this way. This again comes back to the idea that we need to properly evaluate the quality and security of our data before even considering adding AI-based tools to our tech stack.
It is more important now than ever to choose tools wisely in your data stack and ensure they are handling your data securely. This is usually done through creating users and roles within your data warehouse, assigning them specifically to the users and tools that touch your data. Without the ability for data stack tools to connect to your warehouse via users, you are giving full access to all of your data to the companies that own these tools.
Creating a role and user for each of your tools now, and eventually your AI-enabled tools, will limit access to sensitive data like your raw data, and allow tooling companies to access only the pre-transformed datasets created for use by AI. This, along with other data governance practices, will ensure AI tools do not exceed their scope of use.
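To make the one-role-per-tool idea concrete, here is a small Python sketch that renders read-only grants for each tool. The role names and schema names are hypothetical, and the SQL follows Snowflake-style syntax as an assumption; other warehouses spell these statements differently.

```python
# Hypothetical mapping of tool roles to the schemas they may read.
# Note that no role is given access to a raw schema.
TOOL_ACCESS = {
    "bi_tool_role": ["analytics.marts"],
    "ai_tool_role": ["analytics.ai_ready"],
}


def grant_statements(role, readable_schemas):
    """Render Snowflake-style read-only grants for one tool's role.

    The SQL dialect is an assumption; adjust it for your warehouse.
    """
    statements = [f"CREATE ROLE IF NOT EXISTS {role};"]
    for schema in readable_schemas:
        statements.append(f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role};")
        statements.append(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role};"
        )
    return statements


def all_grants(tool_access):
    """Render the grants for every tool in the mapping."""
    statements = []
    for role, schemas in tool_access.items():
        statements.extend(grant_statements(role, schemas))
    return statements
```

Keeping the mapping in code means the scope of each tool can be reviewed and versioned like any other change, rather than being granted ad hoc in the warehouse console.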
Other governance best practices to keep in mind when using AI include:
- Encryption of data
- Storage limitation
- Data minimization
- Masking PII data

Airbyte keeps your data secure by deleting your data as soon as it is moved from point A to B, not keeping it longer than it needs to. Airbyte Cloud, which is the hosted version of Airbyte’s open source tool, is deployed on isolated pods, ensuring different customers’ data is kept separate. It also encrypts all data in transit using TLS, a protocol used to keep data shared over the internet safe and secure.
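Two of the governance practices listed above, data minimization and PII masking, can be sketched in a few lines of Python. The field names, the allowlist, and the salted-hash approach are all assumptions chosen for illustration; this is not how Airbyte or any specific tool implements masking.

```python
import hashlib

# Hypothetical field lists for the example.
PII_FIELDS = {"email", "phone"}
ALLOWED_FIELDS = {"user_id", "email", "plan"}


def mask_value(value, salt):
    """Replace a value with a truncated salted SHA-256 digest."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def minimize_and_mask(record, salt):
    """Drop fields that are not allowlisted and mask the PII that remains."""
    cleaned = {}
    for field, value in record.items():
        if field not in ALLOWED_FIELDS:
            continue  # data minimization: never pass along unneeded fields
        if field in PII_FIELDS and value is not None:
            cleaned[field] = mask_value(value, salt)
        else:
            cleaned[field] = value
    return cleaned
```

Because the hash is deterministic for a given salt, masked values can still be used as join keys. Note that salted hashing is pseudonymization rather than full anonymization, so the salt itself needs to be kept secret.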
Lastly, Airbyte supports user management, meaning you have the ability to add and remove users from your workspace. This ensures users with limited knowledge of your syncs aren’t able to access sync settings, preventing something from being incorrectly changed within your ingestion environment.
All of the tools within your modern data stack should have features similar to this in order to maintain high-quality, secure data throughout your entire ETL pipeline.
3. Streamlining Collaboration with Data Contracts

The disconnect experienced between engineering and data teams is an age-old issue that data contracts aim to solve. This disconnect is often seen in engineering making database changes such as schema or data type changes without letting the data team know, causing downstream failures that the data team then has to clean up. Data contracts are meant to add more transparency into this process and ensure data changes aren’t made without proper approval.
In the last year, we’ve seen various solutions pop up to help solve this problem: dbt contracts, ownership tags, and schema change tests are just a few of these. While these are helpful for lessening the occurrences of this problem, they still don’t get to the root of it.
dbt contracts fail your data pipeline when a field doesn’t exist or its data type has changed, delaying data to stakeholders. This often means the data team needs to remove the contract quickly (which defeats the whole purpose in the first place), or work with engineering to understand why this happened. Working with engineering to get to the root of why a change occurred can take hours or days, time that stakeholders often don’t have to wait on key metrics.
Ownership tags help to alert the responsible engineer or stakeholder to a test failure, but still don’t prevent that failure from happening in the first place. Schema change tests also stop downstream data models from running, making the entire cycle repeat itself. And, once again, we are brought to the same process of either removing the tests and tags or investigating them further, both of which take precious time away from engineering and data teams.
While progress is heading in the right direction, we have yet to land on a sustainable, long-term solution. Breaking changes shouldn’t make it to the data modeling layer; they should be identified before the tests have a chance to fail. This would save the data team time in investigating and fixing them.
Airbyte helps to establish data contracts by allowing you to decide how you want to detect and propagate schema changes. This means that if your data is changed at the source, Airbyte gives you the ability to ignore these changes or automatically handle them. This gives you control over whether changes will break your downstream data models or not.
Airbyte detects 5 different types of changes in your source data:
- New column
- Removal of column
- New stream
- Removal of stream
- Column data type changes

It also allows you to choose how to handle them, offering four different settings for you to configure in each of your syncs:
- Propagate all
- Propagate column changes only
- Detect changes and manually approve
- Detect changes and pause connection

Being able to choose how your data ingestion tool deals with changes in your source data gives you full control over your data environment. You can propagate the changes and let tests detect them and fail in your modeling layer, or you can pause the connection when new changes are made, giving you time to investigate further without worrying about failures or downstream data quality issues. Either way, the decision is yours.
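To show how the five change types and four settings fit together, here is a minimal Python sketch that diffs two schema snapshots and applies a policy. This is not Airbyte’s implementation or API: the snapshot shape (`{stream: {column: type}}`), the `Policy` enum, and the exact behavior of each setting are assumptions made for illustration.

```python
from enum import Enum


class Policy(Enum):
    PROPAGATE_ALL = "propagate all"
    PROPAGATE_COLUMNS = "propagate column changes only"
    MANUAL_APPROVE = "detect changes and manually approve"
    PAUSE = "detect changes and pause connection"


def diff_catalogs(old, new):
    """Compare two {stream: {column: type}} snapshots and list the changes.

    Each change is a (kind, stream, column) tuple; column is None for
    stream-level changes.
    """
    changes = []
    for stream in sorted(new.keys() - old.keys()):
        changes.append(("new stream", stream, None))
    for stream in sorted(old.keys() - new.keys()):
        changes.append(("removed stream", stream, None))
    for stream in sorted(old.keys() & new.keys()):
        old_cols, new_cols = old[stream], new[stream]
        for col in sorted(new_cols.keys() - old_cols.keys()):
            changes.append(("new column", stream, col))
        for col in sorted(old_cols.keys() - new_cols.keys()):
            changes.append(("removed column", stream, col))
        for col in sorted(old_cols.keys() & new_cols.keys()):
            if old_cols[col] != new_cols[col]:
                changes.append(("column type changed", stream, col))
    return changes


def decide(changes, policy):
    """Return (changes to apply automatically, whether to keep syncing)."""
    if not changes:
        return [], True
    if policy is Policy.PROPAGATE_ALL:
        return changes, True
    if policy is Policy.PROPAGATE_COLUMNS:
        return [c for c in changes if c[2] is not None], True
    if policy is Policy.MANUAL_APPROVE:
        # Assumption: syncing continues on the old schema while a human reviews.
        return [], True
    return [], False  # Policy.PAUSE: stop until someone investigates
```

For example, with `Policy.PAUSE` any detected change halts the sync, which keeps a renamed or retyped column from ever reaching the modeling layer.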
4. The New Wave of Orchestrators: Moving Beyond Airflow

We are finally seeing alternatives to Airflow! A new wave of data orchestration tools has emerged, making it easier than ever for data teams to build reliable, sustainable data pipelines. Many of these alternatives are Python-based, hosted by the vendor, and contain built-in monitoring features.
Inputs and outputs of tasks in tools like Dagster, Prefect, and Kestra are easily defined, whereas Airflow must store and parse data between related tasks. With Airflow, running tasks outside of a schedule causes many issues, and many DAGs need a schedule in order to run at all. Dagster, Prefect, and Kestra offer faster runs and more flexibility, making life easier for data engineers.
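The idea of tasks with explicitly declared inputs and outputs can be sketched in plain Python. This is a library-free stand-in, not the API of Dagster, Prefect, or Kestra: the `run_pipeline` helper and the three toy tasks are hypothetical, and the ordering relies only on the standard library’s `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks):
    """Run tasks in dependency order.

    `tasks` maps a task name to (function, list of upstream task names).
    Each function receives the outputs of its upstream tasks as arguments.
    """
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        results[name] = func(*(results[dep] for dep in upstream))
    return results


def extract():
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return f"loaded {len(rows)} rows"


# Declared out of order on purpose: the runner works out the sequence.
PIPELINE = {
    "load": (load, ["transform"]),
    "transform": (transform, ["extract"]),
    "extract": (extract, []),
}
```

Calling `run_pipeline(PIPELINE)` runs `extract`, then `transform`, then `load`, passing each output straight into the next task. Because every task is an ordinary function, it can be called and unit tested on its own without a scheduler.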
Dagster’s main advantages over Airflow are its testing and debugging features. Dagster facilitates local development, unit testing, code reviews, and continuous integration, making it easier to follow data engineering best practices and reduce the risk of errors. Airbyte offers an integration directly with Dagster, making it easy to streamline the development and maintenance of your pipeline.
Prefect allows for event-based triggers whereas Airflow does not. It also automatically handles dependencies between tasks, reducing the chances of gaps in your data. Not to mention, Airbyte also offers a task for running data ingestion syncs within your Prefect pipeline. This allows you to sync your Airbyte connectors directly within your data pipeline.
Kestra is a bit different from these other two solutions in that it uses a decoupled microservice-oriented architecture along with Postgres, Java, Kafka, and Elasticsearch. This allows it to focus on scaling and processing large amounts of data. Airbyte allows you to take advantage of Kestra’s declarative nature to simplify your pipelines and rapidly iterate.
Check out Airbyte’s blog post for more on these three orchestration tools and how you can integrate them with Airbyte.
Conclusion: Building a Future-Proof Data Stack

When choosing any modern data stack tool, make sure you take these data engineering trends into consideration. It’s unlikely that they will be going anywhere over the next 5 years, so it’s imperative to build your data stack around them. Doing so will allow you to scale with your stack, rather than having to build a new one from scratch every few years.
Airbyte is just one example of a modern data stack tool that adapts to the long-term trends in the data engineering space. Reading through a tool’s documentation, checking out its blog, and keeping up to date with its founders on LinkedIn are all great ways to judge how well a tool listens to its customers and adapts its product as needed.