Why does data engineering have to be so complicated?
At one point in time, large organizations needed another type of CEO – the Chief Electricity Officer. This was before there was an accessible, reliable grid to plug into, so organizations that required complex electricity setups employed a CEO of their own. The role went extinct over 100 years ago thanks to the standardization of the consumption layer and the increased availability and reliability of electricity from the main grid.
Enter the role of the Chief Digital Officer (CDO), sometimes now called the Chief Digital and Artificial Intelligence Officer (CDAIO). Similar to the CEO in my example, this role exists to create accessible and reliable technology solutions.
We’re at a turning point in human history where multiple life-changing technologies are coming of age all at once, the most prominent being Generative AI, which became mainstream only recently. The dirty little secret of Generative AI, and of broader AI applications, is that the outcome is heavily influenced by the quality and understanding of the context given to the model, and by the underlying data.
In my experience, the same is true of something as simple as producing a dashboard. In the end, technology, no matter how complex it is under the hood, breaks down when there is a misunderstanding about what the output means.
Having been a frustrated business executive looking at numbers on dashboards that don’t make sense, and having also helped customers build data analytics stacks from scratch, I can say that the gap is a chasm between knowledge of the business and the semantic details of how the information gets processed.
In large corporate environments, a notion as simple as a "customer" can literally have 3 different meanings depending on who you talk to, and come from 6 different systems, all with different naming conventions that need to be transformed. This is because the goal of software systems, and of the engineers who build them, is to create a working, usable transactional system that solves a unique need for a department or business (an HR system like Workday, an ERP like SAP, etc.). Connecting information in a meaningful way across these systems is where the challenges lie.
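To make this concrete, here is a minimal Python sketch of what reconciling "customer" records into one canonical shape can look like. The system names, field names, and canonical schema are illustrative assumptions, not a description of any particular stack.

```python
# Hypothetical example: unifying "customer" records from source systems
# that each use their own field names and identifiers.
from dataclasses import dataclass

@dataclass
class Customer:
    """The one canonical "customer" definition agreed on with the business."""
    customer_id: str
    name: str
    source_system: str

# Per-system field mappings; the systems and field names are made up
# (the ERP entries mimic SAP-style customer master fields).
FIELD_MAP = {
    "crm": {"id": "AccountId", "name": "AccountName"},
    "erp": {"id": "KUNNR", "name": "NAME1"},
    "hr":  {"id": "worker_id", "name": "legal_name"},
}

def to_canonical(record: dict, system: str) -> Customer:
    """Translate a raw record from one system into the canonical shape."""
    fields = FIELD_MAP[system]
    return Customer(
        customer_id=f"{system}:{record[fields['id']]}",
        name=record[fields["name"]].strip().title(),
        source_system=system,
    )

print(to_canonical({"KUNNR": "0042", "NAME1": "ACME GMBH"}, "erp"))
# Customer(customer_id='erp:0042', name='Acme Gmbh', source_system='erp')
```

The hard part in practice is rarely the code; it is getting every stakeholder to agree on the canonical definition in the first place.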
With that in mind, here are the top 3 challenges to achieving data engineering traction and justifying ROI, followed by strategies to overcome them.
Top 3 Most Common Data Engineering Challenges

Challenge 1: Resource Efficiency in Data Management

Your most expensive resources still spend a significant amount of time simply moving data around and managing it. While necessary to achieve the end result, this is a non-value-added task that can and should be streamlined as much as possible with the many tools available in the modern data stack. One of the major reasons the E (Extract) in ETL/ELT consumes so much effort is source variety and source complexity. As the chart below shows, the most expensive resources spend roughly 45% of their time just loading and cleaning data. In my experience, these resources perform the Extract piece manually, in a way that is not reusable.
How data scientists spend their time. Image courtesy: Anaconda, “2020 State of Data Science: Moving From Hype toward Maturity”.

Challenge 2: The Complexity of Data Engineering Tooling

The big data engineering tooling vendors that large corporate entities typically favor are not incentivized to help you solve your data engineering issues. Though the tools offered by large vendors and hyperscalers solve parts of the data engineering problem, these vendors are incentivized to have you and your company use their solution at scale. This means there are cost pressures not only from expensive internal resources, but also from the various SaaS vendors at play.
Some vendor decisions are also not reversible: if you later favor open source or other tools, the switching cost is extremely high. Compounding this challenge, your team needs to be trained on the specific toolset, and oftentimes there is no active community to help solve issues you may have with the tooling; you must rely on the vendor, who often charges for these services.
BCG, “A New Architecture to Manage Data Costs and Complexity”

Challenge 3: Infrastructure and Cost Management

The third most common challenge relates to infrastructure setup and maintenance. Depending on the architecture of your data stack and your internal team’s capacity to manage infrastructure, you may end up favoring a mix of SaaS solutions instead of running OSS solutions on your own infrastructure.
Managing infrastructure is likewise a non-value-added but necessary activity, given the complexity of the toolset available in data engineering today. As a result, the total cost of ownership of the data engineering stack is often unknown to the business from a CAPEX vs. OPEX perspective, and the costs creep up over time. Once data engineering projects reach a certain scale, even a fully functional data engineering team will have to justify its ROI and usage to the business users funding the work.
In fact, a recent industry study by Dremio concluded that 56% of companies dealing with big data expect they could achieve overall savings of over 50% on their total cost of ownership if they rethought their data stack!
BCG, “A New Architecture to Manage Data Costs and Complexity”

Strategies to Overcome Data Engineering Challenges

Given all these complexities and the gaps between technology and business stakeholders, how can you successfully tackle the common challenges in data engineering and become a major value driver for your organization? Here are 3 general categories that can help frame a solution.
Strategy 1: Optimizing Resource Allocation

Find out where your most expensive resources spend most of their time. You can do this with a simple time study of daily activities. If you’re worried about a "big-brother" vibe with your best technical resources, explain that the purpose of the study is to find ways to remove barriers and free their time for value-added work. Most technical people I’ve met welcome ways for work to be easier and more enjoyable.
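If a dedicated tracking tool feels heavy, the study can be a shared log plus a few lines of analysis. Below is a minimal sketch; the CSV layout and category names are assumptions you would adapt.

```python
# Minimal time-study summary over a self-reported activity log.
# The column layout and category names are illustrative assumptions.
import csv
import io
from collections import defaultdict

def summarize_time_log(log_file) -> dict[str, float]:
    """Return each activity category's share of total logged hours."""
    hours = defaultdict(float)
    for row in csv.DictReader(log_file):  # columns: engineer,category,hours
        hours[row["category"]] += float(row["hours"])
    total = sum(hours.values())
    return {category: h / total for category, h in sorted(hours.items())}

SAMPLE_LOG = """engineer,category,hours
ana,data_loading,12
ana,modeling,6
ben,data_cleaning,10
ben,data_loading,8
"""

print(summarize_time_log(io.StringIO(SAMPLE_LOG)))
# {'data_cleaning': 0.28, 'data_loading': 0.56, 'modeling': 0.17} (approx.)
```

If loading and cleaning dominate the shares, that is exactly the barrier worth removing with tooling.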
If it turns out that source variety causes a lot of time to be devoted simply to extraction (the E in ETL/ELT), consider open-source, readily available tools like Airbyte, which is built to make extraction easy and quick to set up and maintain. With this burden reduced, the ROI of your entire data team will go up as people spend more time delivering value to the business.
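For illustration, here is roughly what a reusable, configuration-driven extraction looks like with PyAirbyte, Airbyte’s Python library. The connector name and config are placeholders, and method names may vary by version, so treat this as a sketch and check the current PyAirbyte docs.

```python
# Sketch of a config-driven extraction with PyAirbyte (pip install airbyte).
# The connector and its config are placeholder values for illustration.
import airbyte as ab

source = ab.get_source(
    "source-faker",              # demo connector; swap in your real source
    config={"count": 1000},
    install_if_missing=True,
)
source.check()                   # validate credentials and connectivity
source.select_all_streams()      # or pick specific streams
result = source.read()          # read records into the local cache

for stream_name, dataset in result.streams.items():
    print(stream_name, len(list(dataset)))
```

Compare this with a hand-rolled script per source: the connector, not your team, owns pagination, schema discovery, and incremental state.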
Strategy 2: Simplifying the Data Stack

Analyze all the tools in your data stack from both a fit and a purpose perspective. In almost every case I’ve been personally involved in, the data stack was too complex, built up over years of pet projects and one-off system implementations.
Pro-tip: you don’t need to map out the entire enterprise data model to get started. Simply take the top 5 most used or highest-value data pipelines and follow the data from source all the way to consumption. Map out all the tools, transformations, and manual efforts required to get data moving across the stack, and think of it in terms of ETL/ELT.
If it turns out you have multiple tools and methods for extraction, multiple tools for transformation, and multiple tools for loading and storage, you may need to consolidate and simplify your tech stack down to an industry-leading toolset.
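One lightweight way to run this audit is to catalog each pipeline’s tooling per stage and flag sprawl. A small sketch, with made-up pipeline and tool names:

```python
# Catalog the top pipelines by ETL/ELT stage and flag tool sprawl.
# All pipeline and tool names below are made up for illustration.
PIPELINES = {
    "sales_dashboard": {"extract": "custom scripts", "transform": "stored procs", "load": "warehouse A"},
    "finance_reports": {"extract": "Airbyte", "transform": "dbt", "load": "warehouse A"},
    "churn_model": {"extract": "cron + Python", "transform": "pandas jobs", "load": "object storage"},
}

for stage in ("extract", "transform", "load"):
    tools = {pipeline[stage] for pipeline in PIPELINES.values()}
    status = "consolidation candidate" if len(tools) > 1 else "standardized"
    print(f"{stage}: {len(tools)} tool(s) {sorted(tools)} -> {status}")
```

Any stage with more than one or two tools across your top pipelines is usually the first place to consolidate.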
For extraction, Airbyte provides the largest set of pre-built connectors in the industry and should cover most, if not all, of your use cases.
Strategy 3: Calculating Total Cost of Ownership

Calculate the total cost of ownership (TCO) of your current data stack, including all tools, equipment, resources, and so on. From a business perspective, you are weighing the benefits and value of having a data engineering team against the cost of building and maintaining it.
If you are a technical manager or leader, this matters because it is how the executives funding your initiatives think about the business. The tricky part is defining the benefits of running a proper data engineering team. Oftentimes, with a proper analytics setup, companies can offer new products or services, avoid costly mistakes, and more. Do your best to find value in past use cases where having the data team proved beneficial.
The cost side of the equation should be more straightforward, though it is sometimes difficult to split across CAPEX and OPEX. If your internal ROI calculation doesn’t look positive, that is a sign you need to adopt new strategies and focus your efforts on the highest-impact initiatives.
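The arithmetic itself can start as a few lines of code or a spreadsheet. A simplified sketch follows; every figure and category below is a placeholder assumption to replace with your own numbers.

```python
# Back-of-the-envelope TCO vs. benefits for a data engineering stack.
# All figures are placeholder assumptions; substitute your own.
annual_costs = {
    "saas_licenses": 180_000,   # OPEX: vendor subscriptions
    "cloud_infra": 120_000,     # OPEX: compute and storage
    "team": 600_000,            # OPEX: salaries and contractors
    "hardware": 50_000,         # CAPEX: amortized on-prem equipment
}
annual_benefits = {
    "analyst_hours_saved": 300_000,
    "avoided_bad_decisions": 400_000,  # hardest to estimate; be conservative
    "new_data_products": 300_000,
}

tco = sum(annual_costs.values())
value = sum(annual_benefits.values())
roi = (value - tco) / tco
print(f"TCO: ${tco:,}  Value: ${value:,}  ROI: {roi:.0%}")
# TCO: $950,000  Value: $1,000,000  ROI: 5%
```

Keeping CAPEX and OPEX as separate labeled entries, as above, makes it easier to answer the questions the funding executives will actually ask.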
All data engineering work starts with extraction and ends with a curated dataset that is used for various purposes. Moving toward industry-leading architectures like the lakehouse or the medallion architecture helps separate concerns, and costs, between compute and storage, and really leverages the scalability of the cloud.
Tools like Airbyte plug in perfectly with this type of architecture and, very importantly, are vendor-agnostic: they are not locked into a specific cloud vendor, and you can even run workloads on-premises as required.
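As a sketch of how the medallion layers separate concerns: raw data lands in bronze (for example, via an Airbyte sync), gets cleaned into silver, and is aggregated into gold for consumption. The data, column names, and transforms below are illustrative assumptions, using pandas for brevity; in practice each layer typically lives as Parquet, Iceberg, or Delta tables on object storage.

```python
# Illustrative medallion flow: bronze (raw) -> silver (clean) -> gold (curated).
import pandas as pd

# Bronze: raw records exactly as extracted, duplicates and all
# (e.g. landed by an Airbyte sync).
bronze = pd.DataFrame([
    {"AccountId": "A1", "amount": "100.5", "country": "us"},
    {"AccountId": "A1", "amount": "100.5", "country": "us"},  # duplicate
    {"AccountId": "A2", "amount": None, "country": "de"},
])

# Silver: deduplicated, typed, and standardized.
silver = (
    bronze.drop_duplicates()
          .dropna(subset=["amount"])
          .assign(
              amount=lambda df: df["amount"].astype(float),
              country=lambda df: df["country"].str.upper(),
          )
)

# Gold: business-level aggregate, ready for dashboards or models.
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```

Because each layer is just files in object storage, compute can scale up or shut off independently of where the data sits, which is exactly the cost separation the architecture is after.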
Conclusion

Hopefully this article has been helpful for thinking about how to narrow down and focus on your data engineering challenges. Thanks to tools like Airbyte, you can make data extraction as seamless as turning on the lights, freeing up valuable time and effort for your advanced analytics resources to develop models that take advantage of the recent wave of Artificial Intelligence.