The data industry is more fragmented than ever, with a proliferation of tools, services, and new technologies. There are thousands of data companies competing in hundreds of niches, such as data integration, analytics, observability, and cataloging, to name a few.
Just take a look at the first chart in this State of Data Engineering 2024 article and you'll see what I mean. This is exciting for data practitioners, because there is so much innovation happening, but it is also overwhelming. Data leaders must not only choose a tool among many overlapping options, they must also choose between open source and closed source alternatives, as well as self-hosted or managed options, and there are quite a few tradeoffs to consider.
In this article, I will zoom in on the choice between open source and closed source Extract, Load, Transform (ELT) tools, and the implications of that choice, namely the potential buildup of technical debt due to an overreliance on closed source tooling.
Open Source vs Closed Source Software

The distinction between open source and closed source software is an important foundational concept to make clear. Many people have written about this before, and have strong opinions about the technical distinctions between the two, but I'll try to do it justice with a quick overview.
Open source software (OSS) is software that is freely available to the general public to use, modify, and distribute. This means the source code itself is public (usually in code sharing repositories like GitHub), and anyone can see it or contribute to it.
Closed source software (CSS), on the other hand, is available only to those who are licensed to use it, often for a fee, and the source code is owned and controlled by the creators. Only the creators can see and modify the source code.
There are many tradeoffs between these two models, chief among them cost, features, customization, support, and security. Notably, OSS has been growing rapidly in popularity for many reasons, some of which will be discussed in this article.
Defining Technical Debt

In software development, technical debt can be defined as "the implied future cost of reworking required when choosing an easy but limited solution instead of a better approach that could take more time." It is analogous to monetary debt, because it can accumulate "interest" over time if not addressed, making it even harder to change later.

Technical debt is not necessarily a bad thing, but rather a tradeoff between speed, quality, and future-proofing. One might choose to take on some technical debt in exchange for a faster solution. However, if this approach is always taken, that debt eventually starts to compound and create significant issues.
There are many types of technical debt, such as code, design, testing, documentation, performance, and cost debt.
To offer a simple example, code debt might be the choice to copy-paste a snippet of code many times with slight modifications to achieve some functionality across a range of inputs, rather than generalizing the logic to adhere to DRY principles. This approach is probably the fastest way to get the behavior you want, but when the time comes to change your logic, you have to change it in multiple places. The effort to refactor such code into a generalized, reusable function is an example of "paying down" the tech debt you accumulated.
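As a minimal, hypothetical sketch of this kind of code debt, imagine a pipeline that cleans several text fields with nearly identical copy-pasted functions, versus one generalized helper (the function and field names here are made up for illustration):

```python
# Hypothetical example of code debt: the same cleaning logic is copy-pasted
# for each field, so any change to the rules must be made in three places.
def clean_customer_name(value: str) -> str:
    return value.strip().lower().replace("  ", " ")

def clean_vendor_name(value: str) -> str:
    return value.strip().lower().replace("  ", " ")

def clean_product_name(value: str) -> str:
    return value.strip().lower().replace("  ", " ")

# "Paying down" the debt: one generalized, reusable function (DRY).
def clean_name(value: str) -> str:
    """Trim, lowercase, and collapse double spaces in any name field."""
    return value.strip().lower().replace("  ", " ")

print(clean_name("  Acme  Corp "))  # -> "acme corp"
```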
Why Too Much Technical Debt is Bad For Teams

The main costs of excessive technical debt for software teams are reduced development velocity and developer frustration due to increased maintenance overhead, product defects, and lack of agility.
When too much tech debt accumulates, it becomes nearly impossible to change things about your software without breaking it or spending an excessive amount of time validating the changes. Complexity in the software increases along with maintenance costs. Troubleshooting becomes harder and developers’ ability to respond to incidents effectively is degraded. This all slows down developers, limits their ability to produce new features, and creates frustration.
The impacts of excessive tech debt are felt across many dimensions, including finances, system security and stability, and team satisfaction and morale.
Self-Hosted vs Managed Software

Another important distinction is self-hosted tooling vs managed cloud services.
Self-hosting a piece of software means you manage all the underlying infrastructure required for that software to run in production. You are responsible for installation, configuration, deployment, uptime, maintenance, and monitoring. Managed services, on the other hand, handle all of this for you via a cloud hosting provider.
Much open source software is offered both self-hosted and managed; Airbyte is one example. Self-hosting can be cheaper in terms of dollars, but it also comes with much more responsibility for developers and infrastructure teams, and minimal or no support beyond what's available from the community.
ELT Tools: Open or Closed Source? Self-Hosted or Managed?

A comparison of various popular ELT tools and their offerings can be found below. Deep dive comparisons between Airbyte and each of these tools can be found here.
What does technical debt have to do with the choice between open and closed source ELT software? Your choice of OSS or CSS can result in the buildup of technical debt, primarily due to the tradeoffs between the two models in features, customization, cost, and support.
How Closed Source ELT Tooling Can Be a Source of Technical Debt

Now that we've set the context, let's explore the implications of using closed source ELT tooling and its effect on the accumulation of technical debt.
There is a time and place for investing in closed source ELT tools, which is discussed in the next section. However, overreliance on these tools, especially as your organization scales, can lead to tech debt buildup, due mainly to a lack of customization and transparency.
Lack of Customization

As a user, you don't have much control over implementation details in closed source tools, beyond the configuration interfaces that are exposed to you.
Suppose, for example, you're using Fivetran to extract a large volume of data from a third party API and load it into Snowflake. Imagine that the destination connector is implemented such that JSON data types get cast to VARIANT on the fly during loading, using Snowflake's PARSE_JSON function. This can be an expensive transformation, and it doesn't really make sense to do it during the load step, especially if you won't need every record downstream. Instead, it would be better to incrementally transform this data as needed in a downstream transformation layer.
What if the volume of data is too large and the load-time transform starts causing queued queries in your Snowflake warehouse? You would have no way to fix this performance bottleneck yourself, short of opening a support ticket with the vendor.
Given the anticipated friction of the ticket resolution process, you might instead choose to work around the issue by increasing the cluster count on the Snowflake warehouse, or otherwise setting up dedicated Snowflake compute resources for the Fivetran connection. This is a perfect example of tech debt that leads to elevated costs.
This exact situation actually happened to me, except I was using an open source data integration tool, so it was easy to modify the source code and force all JSON types to be replicated as TEXT, which could then be cast to VARIANT as needed in a downstream dbt project.
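To make the workaround concrete, here is a rough sketch of what that downstream cast can look like. The table and column names are hypothetical, and the SQL is shown as a query run through the Snowflake Python connector purely for illustration, rather than as the dbt model I actually used:

```python
# Sketch: the loader lands the JSON payload as raw TEXT, and the expensive
# PARSE_JSON cast happens downstream, incrementally, only for new rows.
# Table and column names are hypothetical.
import snowflake.connector  # pip install snowflake-connector-python

DOWNSTREAM_CAST = """
    INSERT INTO staging.events_parsed (id, loaded_at, payload)
    SELECT id,
           loaded_at,
           PARSE_JSON(raw_payload)  -- TEXT -> VARIANT here, not in the loader
    FROM raw.events
    WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), '1970-01-01'::TIMESTAMP)
                       FROM staging.events_parsed)
"""

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="TRANSFORM_WH", database="ANALYTICS",
)
cur = conn.cursor()
cur.execute(DOWNSTREAM_CAST)
cur.close()
conn.close()
```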
Personally, I've never used Fivetran, and I'd be surprised if this type of behavior were defined in a Snowflake destination connector. Nonetheless, the point still stands that you have limited control over such details with closed source tools.
Importantly, with small data volumes, this behavior would be a non-issue and would probably never even be noticed. The tech debt only emerges once your data replication starts to scale up, at which point the customization offered by open source tools becomes more valuable.
Lack of Transparency

If you don't know what's going on under the hood, you might make assumptions. These assumptions can become hard to maintain, and you may start trying to catch edge cases that may never occur.
Suppose you've configured an ELT tool to replicate a database table into your data warehouse. It replicates successfully, and you build some dimensional data models on top of it, incrementally transformed on an hourly schedule. These incremental transformations depend on the existence of a unique surrogate key and a loaded_at timestamp generated by the replication tool.
Now suppose there's a schema change: an existing column gets renamed (in effect, the old column is deleted and a new one is created in its place). If the tool's exact behavior for handling schema changes isn't clear, this raises a lot of questions for engineers.
Are the already-replicated values from the old column backfilled into the new one? Is the whole table fully refreshed? What about metadata columns? Are values in the loaded_at column updated? What about the unique surrogate keys/hashes for the rows? Do these change for existing rows after a schema change?
These behaviors are common sources of bugs in incremental transformation pipelines. An engineer who has been bitten by such bugs before might decide to abandon incremental transformations altogether, and instead opt for full-refresh transforms or on-the-fly transforms using views. While this approach protects against defects caused by bad incremental logic, it is also much slower and more costly.
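To illustrate why the answers to those questions matter, here is a sketch of a typical watermark-based incremental transform. The table names are hypothetical, and the pattern quietly assumes that the tool only advances loaded_at for rows that actually changed and never rewrites surrogate keys for existing rows:

```python
# Sketch of a watermark-based incremental MERGE (hypothetical table names).
# If a schema change causes the replication tool to rewrite the table with
# fresh loaded_at values or new surrogate keys, this "incremental" run will
# quietly reprocess far more data than intended, or create duplicate rows.
import snowflake.connector

INCREMENTAL_MERGE = """
    MERGE INTO analytics.dim_orders AS target
    USING (
        SELECT surrogate_key, customer_id, order_total, loaded_at
        FROM raw.orders
        WHERE loaded_at > (SELECT COALESCE(MAX(loaded_at), '1970-01-01'::TIMESTAMP)
                           FROM analytics.dim_orders)
    ) AS source
    ON target.surrogate_key = source.surrogate_key
    WHEN MATCHED THEN UPDATE SET
        customer_id = source.customer_id,
        order_total = source.order_total,
        loaded_at   = source.loaded_at
    WHEN NOT MATCHED THEN INSERT (surrogate_key, customer_id, order_total, loaded_at)
        VALUES (source.surrogate_key, source.customer_id, source.order_total, source.loaded_at)
"""

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
conn.cursor().execute(INCREMENTAL_MERGE)
conn.close()
```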
All of these behaviors should arguably be documented in any good tool. However, there is always the chance that the documentation has gaps, has gone out of date, or is unclear and hard to interpret. These are all scenarios where an engineer can greatly benefit from being able to read the source code. Having full transparency into the implementation details can ensure proper modeling of the data generated by the tool and enable engineers to avoid tech debt buildup.
Elevated Costs

Each of the above examples demonstrates how closed source tooling can lead to technical debt that indirectly elevates costs.
Costs can, of course, also be impacted directly by the choice of tooling. One example is Stitch, a data replication tool similar to Fivetran and Airbyte. Stitch is a managed service that uses an open source framework, Singer, under the hood, so it is not technically fully "closed source", but the example is equally relevant for true closed source tooling.
Stitch uses a pay-per-row pricing model for data replications. Users must also pay for the compute resources their data warehouse consumes to actually execute the loading queries that move the data in. In this way, users pay "double" for data replications. Such hidden costs are common with closed source tools, and they consume financial resources that could be better spent elsewhere.
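A quick back-of-the-envelope calculation shows the shape of this double billing. All of the numbers below are hypothetical, chosen only to illustrate the structure, not actual Stitch or Snowflake pricing:

```python
# Illustration of the "paying twice" structure described above.
# Every figure here is hypothetical, for illustration only.
rows_replicated_per_month = 50_000_000
price_per_million_rows = 100.0      # hypothetical replication tool rate, USD
warehouse_hours_for_loads = 60      # hypothetical compute time spent running load queries
cost_per_warehouse_hour = 8.0       # hypothetical warehouse cost per hour, USD

replication_bill = (rows_replicated_per_month / 1_000_000) * price_per_million_rows
warehouse_bill = warehouse_hours_for_loads * cost_per_warehouse_hour

print(f"Replication tool bill:   ${replication_bill:,.0f}")                     # $5,000
print(f"Warehouse compute bill:  ${warehouse_bill:,.0f}")                       # $480
print(f"Total to move the rows:  ${replication_bill + warehouse_bill:,.0f}")    # $5,480
```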
When Closed Source ELT Tooling is Appropriate

Closed source ELT tooling is appropriate for small-scale, less mature data pipelines. When you have a few small datasets, your data platform is just getting off the ground, and you need a quick way to move data, closed source tools can be a very effective choice.
The upfront setup and configuration work will usually be minimal, so you can focus on moving fast and creating value for the business. As your business and data platform start to scale, consider migrating toward open source tools.
Striking the Right Balance of Closed and Open Source ELT Tools for Your Data Platform

In general, one can think about the optimal proportion of open vs closed source ELT tooling in relation to the maturity of the data platform in which those tools are deployed.
Immature data platforms will generally rely more on paid, proprietary ELT tools. On the other hand, mature data platforms will rely more on open source tools. At either end of the spectrum, one can still expect to find some open source or closed source tools.
[Figure: proportion of open source tools vs data platform maturity]

One exception to the "rule of thumb" shared above is to avoid closed source tooling for critical pipeline components, even in the nascent stages of your data platform. These components are bound to be the most costly to maintain in the long term, and to require the most customization and flexibility as your data architecture evolves. So it makes sense to ensure you have full control over them from the beginning.
Conclusion

There are no hard and fast rules for choosing ELT software for your organization. It depends on a great number of factors, such as features, customization needs, costs, and security.
With that said, it is prudent to give strong consideration to open source tools in the long-term vision of your data platform. Overreliance on proprietary, closed source ELT tools leads to an accumulation of technical debt that will hinder your organization's ability to scale.